Connecting and Discovering Nodes

The Ruby driver will connect to localhost (127.0.0.1) if no :hosts are given to Cassandra.cluster. It will automatically discover all peers and add them to the cluster metadata.

require 'cassandra'

cluster = Cassandra.cluster

cluster.each_host do |host|
  puts "Host #{host.ip}: id=#{host.id} datacenter=#{host.datacenter} rack=#{host.rack}"
end

You can also specify a list of seed nodes to connect to. The set of IP addresses we pass to Cassandra.cluster is simply an initial set of contact points. After the driver connects to one of these nodes, it will automatically discover the rest of the nodes in the cluster, so you don’t need to list every node.

Read more in the api docs

Executing Queries

You run CQL statements by passing them to Session#execute.

keyspace = 'system'
session  = cluster.connect(keyspace)

session.execute('SELECT keyspace_name, columnfamily_name FROM schema_columnfamilies').each do |row|
  puts "The keyspace #{row['keyspace_name']} has a table called #{row['columnfamily_name']}"
end

For queries that will be run repeatedly, you should use prepared statements.

Read more in the api docs

Parameterized queries

If you’re using Cassandra 2.0 or later you no longer have to build CQL strings when you want to insert a value in a query; there’s a new feature that lets you bind values with regular statements:

session.execute("UPDATE users SET age = ? WHERE user_name = ?", arguments: [41, 'Sam'])

If you find yourself doing this often, it’s better to use prepared statements. As a rule of thumb, if your application is sending a request more than once, a prepared statement is almost always the right choice.

When you use bound values with regular statements the type of the values has to be guessed. Cassandra supports multiple different numeric types, but there’s no reliable way of guessing whether or not a Ruby Fixnum should be encoded as a BIGINT or INT, or whether a Ruby Float is a DOUBLE or FLOAT. When there are multiple choices the encoder will pick the larger type (e.g. BIGINT over INT). For Ruby strings it will always guess VARCHAR, never BLOB. Check out this types mapping table for additional details.
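
The guessing rules above can be restated as a toy sketch. This is an illustration only, not the driver’s actual encoder, and guess_cql_type is a hypothetical helper name:

```ruby
# Toy illustration of the type-guessing rules described above
# (hypothetical helper, not part of the driver):
def guess_cql_type(value)
  case value
  when Integer then 'BIGINT'  # the larger type is picked over INT
  when Float   then 'DOUBLE'  # the larger type is picked over FLOAT
  when String  then 'VARCHAR' # always VARCHAR, never BLOB
  else raise ArgumentError, "no guess for #{value.class}"
  end
end

guess_cql_type(41)    # => "BIGINT"
guess_cql_type(1.5)   # => "DOUBLE"
guess_cql_type('Sam') # => "VARCHAR"
```

With prepared statements no guessing is needed, because Cassandra tells the driver the exact column types at prepare time.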

Also note that some parameterized queries will not work unless prepared, for example queries with a LIMIT ? clause.

Executing Statements in Parallel

With a fully asynchronous API, it is very easy to run queries in parallel:

data = [
  [41, 'Sam'],
  [35, 'Bob']
]

# execute all statements in the background
futures = data.map do |(age, username)|
  session.execute_async("UPDATE users SET age = ? WHERE user_name = ?", arguments: [age, username])
end

# block until both statements have executed
futures.each(&:get)

Read more about futures

Prepared Statements

The driver supports prepared statements. Use #prepare to create a statement object, and then call #execute on that object to run a statement. You must supply values for all bound parameters when you call #execute.

statement = session.prepare('INSERT INTO users (username, email) VALUES (?, ?)')

session.execute(statement, arguments: ['avalanche123', ''])

You should prepare a statement for a given query only once and then reuse it by calling #execute. Re-preparing the same statement will have a negative impact on the performance and should be avoided.

A prepared statement can be run many times, but the CQL parsing will only be done once on each node. Use prepared statements for queries you run over and over again.

INSERT, UPDATE, DELETE and SELECT statements can be prepared; other statements may raise QueryError.

For each query, statements are prepared lazily - each call to #execute selects a host to try (according to a load balancing policy) and a statement is prepared if needed.

Changing keyspaces

You can specify a keyspace to use immediately after connection by passing the :keyspace option to Cassandra::Cluster#connect. You can also use Session#execute to change the keyspace of an existing session:

session.execute('USE measurements')

Creating keyspaces and tables

There is no special facility for creating keyspaces and tables; they are created by executing CQL:

keyspace_definition = <<-KEYSPACE_CQL
  CREATE KEYSPACE measurements
  WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
  }
KEYSPACE_CQL

table_definition = <<-TABLE_CQL
  CREATE TABLE events (
    id INT,
    date DATE,
    comment VARCHAR,
    PRIMARY KEY (id)
  )
TABLE_CQL

session.execute(keyspace_definition)
session.execute('USE measurements')
session.execute(table_definition)

You can also ALTER keyspaces and tables, and you can read more about that in the CQL3 syntax documentation.


Batch statements

If you’re using Cassandra 2.0 or later you can build batch requests from either simple or prepared statements. Batches must not contain any SELECT statements; only INSERT, UPDATE and DELETE statements are allowed.

There are a few different ways to work with batches. One is to build up a batch with a block:

batch = session.batch do |batch|
  batch.add("UPDATE users SET name = 'Sue' WHERE user_id = 'unicorn31'")
  batch.add("UPDATE users SET name = 'Kim' WHERE user_id = 'dudezor13'")
  batch.add("UPDATE users SET name = 'Jim' WHERE user_id = 'kittenz98'")
end

session.execute(batch)

Another is to create a batch and build it up later:

batch = session.batch

batch.add("UPDATE users SET name = 'Sue' WHERE user_id = 'unicorn31'")
batch.add("UPDATE users SET name = 'Kim' WHERE user_id = 'dudezor13'")
batch.add("UPDATE users SET name = 'Jim' WHERE user_id = 'kittenz98'")

session.execute(batch)


You can mix any combination of statements in a batch:

prepared_statement = session.prepare("UPDATE users SET name = ? WHERE user_id = ?")

batch = session.batch do |batch|
  batch.add(prepared_statement, arguments: ['Sue', 'unicorn31'])
  batch.add("UPDATE users SET age = 19 WHERE user_id = 'unicorn31'")
  batch.add("INSERT INTO activity (user_id, what, when) VALUES (?, 'login', NOW())", arguments: ['unicorn31'])
end

session.execute(batch)


Batches can have one of three different types: logged, unlogged or counter, where logged is the default. Their exact semantics are defined in the Cassandra documentation; this is how you specify which one you want:

counter_statement = session.prepare("UPDATE my_counter_table SET my_counter = my_counter + ? WHERE id = ?")

batch = session.counter_batch do |batch|
  batch.add(counter_statement, arguments: [3, 'some_counter'])
  batch.add(counter_statement, arguments: [2, 'another_counter'])
end

session.execute(batch)


Read more about Session#batch

Cassandra 1.2 also supported batching, but only as a CQL feature: you had to build the batch as a string, and it didn’t really play well with prepared statements.


Paging

If you’re using Cassandra 2.0 or later you can page your query results by adding the :page_size option to a query:

result = session.execute("SELECT * FROM large_table WHERE id = 'partition_with_lots_of_data'", page_size: 100)

while result
  result.each do |row|
    p row
  end
  result = result.next_page
end
Read more about paging
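
To make the loop’s termination condition concrete, here is a self-contained sketch with a hypothetical PageStub class standing in for the driver’s paged result (the real object behaves analogously: next_page returns nil after the last page, which ends the while loop):

```ruby
# PageStub is a hypothetical stand-in for a paged result, used here only to
# illustrate the while/next_page loop shown above.
class PageStub
  def initialize(pages)
    @pages = pages
  end

  # yields the rows of the current page only
  def each(&block)
    @pages.first.each(&block)
  end

  # returns the next page, or nil when this was the last one
  def next_page
    rest = @pages[1..-1]
    rest.empty? ? nil : PageStub.new(rest)
  end
end

result = PageStub.new([[1, 2], [3]])
rows   = []

while result
  result.each { |row| rows << row }
  result = result.next_page
end

rows # => [1, 2, 3]
```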


Consistency

You can specify the default consistency to use when you create a new Cluster:

cluster = Cassandra.cluster(consistency: :all)

Read more about default consistency

Consistency can also be passed to Session#execute and Session#execute_async:

session.execute('SELECT * FROM users', consistency: :local_quorum)

statement = session.prepare('SELECT * FROM users')
session.execute(statement, consistency: :one)

batch = session.batch
batch.add("UPDATE users SET email = '' WHERE id = 'sue'")
batch.add("UPDATE users SET email = '' WHERE id = 'tom'")
session.execute(batch, consistency: :all)

Read more about Session#execute

Read more about possible consistencies

The default consistency level, unless you’ve set it yourself, is :local_one.

Consistency is ignored for USE, TRUNCATE, CREATE and ALTER statements, and some (like :any) aren’t allowed in all situations.


Compression

The CQL protocol supports frame compression, which can give you a performance boost if your requests or responses are big. To enable it, specify the compression to use in Cassandra.cluster.

Cassandra currently supports two compression algorithms: Snappy and LZ4. The Ruby driver supports both, but in order to use them you will have to install the snappy or lz4-ruby gems separately. Once the gem is installed you can enable compression like this:

cluster = Cassandra.cluster(compression: :snappy)


or

cluster = Cassandra.cluster(compression: :lz4)

Which one should you choose? On paper the LZ4 algorithm is more efficient and the one Cassandra defaults to for SSTable compression. They both achieve roughly the same compression ratio, but LZ4 does it quicker.


Logging

You can pass a standard Ruby logger to the client to get more information about what is going on:

require 'logger'

cluster = Cassandra.cluster(logger: Logger.new($stderr))

Most of the logging will be when the driver connects and discovers new nodes, when connections fail and so on. The logging is designed to not cause much overhead and only relatively rare events are logged (e.g. normal requests are not logged).


Architecture

The diagram below represents the high-level architecture of the driver. Each arrow represents the direction of ownership, where the owner is pointed to by its children. For example, a single Cassandra::Cluster instance can manage multiple Cassandra::Session instances, etc.

Text Diagram

Thread safety

Except for result objects, everything in the driver is thread safe. You only need a single cluster object in your application, and usually a single session.

Result objects are wrappers around an array of rows and their primary use case is iteration, something that makes little sense to do concurrently. Because of this they’ve been designed to not be thread safe to avoid the unnecessary cost of locking.
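
As a sketch of the safe pattern, assuming the result is enumerable (so something like result.to_a is available): materialize the rows into a plain Array on one thread, then share that array for concurrent reads:

```ruby
# rows stands in for what result.to_a would give you with the real driver;
# a plain, frozen Array is safe to read from many threads at once.
rows = (1..100).map { |n| { 'id' => n } }.freeze

counts = rows.each_slice(25).map do |slice|
  Thread.new { slice.count } # each thread reads only its own slice
end.map(&:value)

total = counts.sum # => 100
```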


Cluster

A Cluster instance lets you configure many important aspects of how connections and queries are handled. At this level you can configure everything from the contact points (addresses of the nodes to contact initially, before the driver performs node discovery) to the request routing policy, retry and reconnection policies, and so forth. Generally such settings are set once at the application level.

require 'cassandra'

cluster = Cassandra.cluster(
            :hosts => ['', '', ''],
            :load_balancing_policy => Cassandra::LoadBalancing::Policies::DCAwareRoundRobin.new("US_EAST"))


Session

Sessions are used for query execution. Internally, a Session manages connection pools and tracks the current keyspace. A session should be reused as much as possible; however, it is OK to create several independent sessions for interacting with different keyspaces in the same application.


CQL3

The Ruby driver doesn’t parse or understand CQL3; it uses the Cassandra native protocol to send requests to Cassandra and translates the responses.

Read more about CQL3 in the CQL3 syntax documentation and the Cassandra query documentation.


Troubleshooting

Cassandra.cluster is taking too long

Upon initial connection, the Ruby driver inspects schema metadata and reconstructs token ranges. This is necessary for token-aware load balancing, but it might take a very long time for some schemas and clusters. If you cannot wait that long on startup, you can disable schema metadata altogether by passing synchronize_schema: false to Cassandra.cluster.
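
For example (a config sketch; it needs the cassandra-driver gem and a reachable cluster to actually run):

```ruby
require 'cassandra'

# Skip schema metadata synchronization for faster startup; token-aware
# load balancing will not have schema information available.
cluster = Cassandra.cluster(synchronize_schema: false)
```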

I get “connection refused” errors

Make sure that the native transport protocol is enabled. If you’re running Cassandra 1.2.5 or later the native transport protocol is enabled by default, if you’re running an earlier version (but later than 1.2) you must enable it by editing cassandra.yaml and setting start_native_transport to true.

To verify that the native transport protocol is enabled, search your logs for the message “Starting listening for CQL clients” and look at which IP and port it is binding to.

I get “Deadlock detected” errors

This means that the driver’s IO reactor has crashed hard. Most of the time it means that you’re using a framework, server or runtime that forks, and you call Cassandra.cluster in the parent process. Check the documentation to see if there’s a way to register a piece of code to run in the child process just after a fork, and connect there.

This is how you do it in Resque:

Resque.after_fork = proc do
  # connect to Cassandra here
end

and this is how you do it in Passenger:

PhusionPassenger.on_event(:starting_worker_process) do |forked|
  if forked
    # connect to Cassandra here
  end
end

in Unicorn you do it in the config file:

after_fork do |server, worker|
  # connect to Cassandra here
end

Since prepared statements are tied to a particular connection, you’ll need to recreate those after forking as well.

If your process does not fork and you still encounter deadlock errors, it might also be a bug. All IO is done in a dedicated thread, and if something happens that makes that thread shut down, Ruby will detect that the locks the client code is waiting on can’t be unlocked.

I get QueryError

All errors that originate on the server side are raised as QueryError. If you get one of these, the error is in your CQL or on the server side.

I’m not getting all elements back from my list/set/map

There’s a known issue with collections that get too big. The protocol uses an unsigned short for the size of collections, but there is no way for Cassandra to stop you from creating a collection with more than 65536 elements; when you do, the size field overflows, with strange results. The data is there, you just can’t get it back.
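
A quick illustration of the wrap-around: the size field is an unsigned 16-bit short, so reported sizes wrap modulo 2**16:

```ruby
# An unsigned 16-bit short wraps modulo 65536, so a collection with
# 70,000 elements reports a size of only 4,464 to the client:
actual_size  = 70_000
encoded_size = actual_size % 65_536 # => 4464
```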

Authentication doesn’t work

If you’re using Cassandra 2.0 or DataStax Enterprise 3.1 or higher with the built-in password authenticator, your setup is supported.

DSE before 3.1 uses a non-standard protocol and is not currently supported.

I get “end of file reached” / I’m connecting to port 9160 and it doesn’t work

Port 9160 is the old Thrift interface; the binary protocol runs on 9042. This is also the default port for the Ruby driver, so unless you’ve changed the port in cassandra.yaml, don’t override the port.

I get namespace conflicts with another gem

Use require 'datastax/cassandra' and DataStax::Cassandra to get a namespaced version of the gem and prevent conflicts with other gems that use the top-level Cassandra namespace.

Something else is not working

Open an issue and someone will try to help you out. Please include the gem version, Cassandra version and Ruby version, and explain as much about what you’re doing as you can, preferably with the smallest piece of code that reliably triggers the problem. The more information you give, the better the chances you will get help.

Performance tips

Use asynchronous APIs

To get maximum performance you can’t wait for a request to complete before sending the next. Use the *_async methods (e.g. Session#execute_async) to run multiple requests in parallel.

Use prepared statements

When you use prepared statements you don’t have to smash strings together to create a chunk of CQL to send to the server. Avoiding the creation of many large strings in Ruby can be a performance gain in itself. Sending only the actual data rather than the full query each time also decreases network traffic, and it decreases the time it takes the server to handle the request since it doesn’t have to parse CQL. Prepared statements are also very convenient, so there is really no reason not to use them.

Use JRuby

If you want to be serious about Ruby performance you have to use JRuby. The Ruby driver is completely thread safe, and the CQL protocol is pipelined by design, so you can spin up as many threads as you like and your requests per second will scale more or less linearly (up to what your cores, network and Cassandra cluster can deliver, obviously).

Applications using the Ruby driver and JRuby can do over 10,000 write requests per second from a single EC2 m1.large if tuned correctly.

Try batching

Batching in Cassandra isn’t always as good as in other (non-distributed) databases. Since rows are distributed across the cluster, the coordinator node must still send the individual pieces of a batch to other nodes, and you could have done that yourself instead.

For Cassandra 1.2 it is often best not to use batching at all: you’ll have to smash strings together to create the batch statements, which wastes time on the client side, takes longer to push over the network, and takes longer to parse and process on the server side. Prepared statements are almost always a better choice.

Cassandra 2.0 introduced a new form of batches where you can send a batch of prepared statement executions as one request (you can send non-prepared statements too, but we’re talking performance here). These bring the best of both worlds and can be beneficial for some use cases. Some of the same caveats still apply though and you should test it for your use case.

Whenever you use batching, try compression too.

Try compression

If your requests or responses are big, compression can help decrease the amount of traffic over the network, which is often a good thing. If your requests and responses are small, compression often doesn’t do anything. You should benchmark and see what works for you. The Snappy compressor that comes with the Ruby driver uses very little CPU, so most of the time it doesn’t hurt to leave it on.

In read-heavy applications requests are often small and need no compression, but responses can be big. In these situations you can modify the compressor to turn off compression for requests completely. The Snappy compressor that comes with the Ruby driver will not compress frames smaller than 64 bytes, for example, and you can change this threshold when you create the compressor.

Compression works best for large requests, so if you use batching you should benchmark if compression gives you a speed boost.