package rdd
Contains the com.datastax.spark.connector.rdd.CassandraTableScanRDD class, the main entry point for analyzing Cassandra data from Spark.
Type Members
- class CassandraCoGroupedRDD[T] extends RDD[Seq[Seq[T]]]
An RDD which pulls from several separate CassandraTableScanRDDs that share partition key types and keyspaces. These tables are joined on read using a merge iterator. As long as we join on the token of the partition key, the iterators should be read in order. Note: this implementation does not require the partition keys to have the same names, but they must have the same types.
- class CassandraJoinRDD[L, R] extends CassandraRDD[(L, R)] with CassandraTableRowReaderProvider[R] with AbstractCassandraJoin[L, R]
An RDD that performs a selecting join between a left RDD and the specified Cassandra table. It issues individual selects to retrieve the rows from Cassandra and takes advantage of RDDs that have been partitioned with the com.datastax.spark.connector.rdd.partitioner.ReplicaPartitioner. A usage sketch follows the parameter list.
- L: item type on the left side of the join (any RDD)
- R: item type on the right side of the join (fetched from Cassandra)
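Instances of this class are normally obtained through the implicit joinWithCassandraTable method rather than constructed directly. A minimal sketch, assuming an active SparkContext sc and a hypothetical keyspace test with a table words whose partition key is a single text column word:

```scala
import com.datastax.spark.connector._

// Left side: one Tuple1 per partition key to look up (names are assumptions).
val keys = sc.parallelize(Seq(Tuple1("cat"), Tuple1("fox")))

// Optionally co-locate lookups with their replicas first, so the individual
// selects issued by CassandraJoinRDD hit a local coordinator.
val byReplica = keys.repartitionByCassandraReplica("test", "words")

// Yields a CassandraJoinRDD pairing each key with its matching CassandraRow.
val joined = byReplica.joinWithCassandraTable("test", "words")
joined.collect().foreach { case (key, row) => println(s"$key -> $row") }
```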
- class CassandraLeftJoinRDD[L, R] extends CassandraRDD[(L, Option[R])] with CassandraTableRowReaderProvider[R] with AbstractCassandraJoin[L, Option[R]]
An RDD that performs a selecting left join between a left RDD and the specified Cassandra table. It issues individual selects to retrieve the rows from Cassandra and takes advantage of RDDs that have been partitioned with the com.datastax.spark.connector.rdd.partitioner.ReplicaPartitioner. A sketch follows the parameter list.
- L: item type on the left side of the join (any RDD)
- R: item type on the right side of the join (fetched from Cassandra)
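Obtained through the implicit leftJoinWithCassandraTable method; the right side is an Option, so left elements without a matching row are preserved. A sketch under the same hypothetical test.words schema as above:

```scala
import com.datastax.spark.connector._

val keys = sc.parallelize(Seq(Tuple1("cat"), Tuple1("nosuchword")))

// Yields (key, Some(row)) for matches and (key, None) for misses.
val leftJoined = keys.leftJoinWithCassandraTable("test", "words")
leftJoined.collect().foreach {
  case (key, Some(row)) => println(s"$key -> $row")
  case (key, None)      => println(s"$key -> no match")
}
```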
- sealed trait CassandraLimit extends AnyRef
- class CassandraMergeJoinRDD[L, R] extends RDD[(Seq[L], Seq[R])]
An RDD which pulls from two separate CassandraTableScanRDDs that share partition keys and keyspaces. These tables are joined on read using a merge iterator. As long as we join on the token of the partition key, the two iterators should be read in order.
- case class CassandraPartitionLimit(rowsNumber: Long) extends CassandraLimit with Product with Serializable
- abstract class CassandraRDD[R] extends RDD[R]
- trait CassandraTableRowReaderProvider[R] extends AnyRef
Used to get a RowReader of type [R] for transforming the rows of a particular Cassandra table into Scala objects. Performs the necessary checks of the schema and the output class to make sure they are compatible.
- class CassandraTableScanRDD[R] extends CassandraRDD[R] with CassandraTableRowReaderProvider[R]
RDD representing a table scan of a Cassandra table.
This class is the main entry point for analyzing data in a Cassandra database with Spark. Obtain objects of this class by calling com.datastax.spark.connector.SparkContextFunctions.cassandraTable.
Configuration properties should be passed in the SparkConf configuration of SparkContext. CassandraRDD needs to open a connection to Cassandra, therefore it requires appropriate connection property values to be present in SparkConf. For the list of required and available properties, see CassandraConnector.
CassandraRDD divides the data set into smaller partitions, processed locally on every cluster node. A data partition consists of one or more contiguous token ranges. To reduce the number of roundtrips to Cassandra, every partition is fetched in batches. The following properties control the number of partitions and the fetch size:
- spark.cassandra.input.split.sizeInMB: approximate amount of data to be fetched into a single Spark partition, default 512 MB
- spark.cassandra.input.fetch.sizeInRows: number of CQL rows fetched per roundtrip, default 1000
A CassandraRDD object gets serialized and sent to every Spark Executor, which then calls the compute method to fetch the data on every node. The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
By default, reads are performed at ConsistencyLevel.LOCAL_ONE in order to leverage data-locality and minimize network traffic. This read consistency level is controlled by the spark.cassandra.input.consistency.level property.
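A minimal usage sketch; the contact point, keyspace, and table names are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("scan-example")
  .setMaster("local[2]")                                 // local master, for the sketch only
  .set("spark.cassandra.connection.host", "127.0.0.1")   // connection property
  .set("spark.cassandra.input.split.sizeInMB", "64")     // smaller Spark partitions

val sc = new SparkContext(conf)

// cassandraTable (from SparkContextFunctions) returns a
// CassandraTableScanRDD[CassandraRow]; rows are fetched in batches of
// spark.cassandra.input.fetch.sizeInRows per roundtrip.
val rdd = sc.cassandraTable("test", "words")
println(rdd.count())
```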
- sealed trait ClusteringOrder extends Serializable
- case class CqlWhereClause(predicates: Seq[String], values: Seq[Any]) extends Product with Serializable
Represents a logical conjunction of CQL predicates. Each predicate can have placeholders denoted by '?' which get substituted by values from the values array. The number of placeholders must match the size of the values array.
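Where clauses are normally built through CassandraRDD's where method rather than constructed directly. A sketch, assuming an active SparkContext sc and a hypothetical test.events table with a clustering column day; chained where calls are combined into a single conjunction, and each '?' is bound to the corresponding value:

```scala
import com.datastax.spark.connector._

// Both predicates end up in one CqlWhereClause, ANDed together.
val january = sc.cassandraTable("test", "events")
  .where("day >= ?", "2020-01-01")
  .where("day < ?", "2020-02-01")
```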
- class DseGraphPartitioner[V, T <: Token[V]] extends Partitioner with Logging
A custom partitioner specifically for RDDs made for DseGraph.
The general idea is, for a vertex (sketched below):
- Determine the ~label property
- Determine which RDD represents the data in that label
- Determine the C* token of the vertex given its label
- Determine which partition in the found RDD would contain that token
- Determine the offset of that RDD's partitions in the UnionRDD
- Return the partition index added to the offset
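A hypothetical sketch of the last two steps only: a UnionRDD concatenates the partitions of its member RDDs, so a label's local partition index must be shifted by the total partition count of the RDDs that precede it. All names here are illustrative, not the class's actual internals:

```scala
// Partitions per member RDD, in the order the RDDs were unioned.
val partitionCounts = Seq(4, 8, 2)

// Cumulative offsets: Seq(0, 4, 12, 14). Member RDD i owns union
// partitions [offsets(i), offsets(i) + partitionCounts(i)).
val offsets = partitionCounts.scanLeft(0)(_ + _)

// Union-wide partition index for a vertex whose label maps to member RDD
// `rddIndex` and whose token falls into local partition `localPartition`.
def unionPartition(rddIndex: Int, localPartition: Int): Int =
  offsets(rddIndex) + localPartition

assert(unionPartition(1, 3) == 7) // shifted past the 4 partitions of RDD 0
```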
- class DseGraphUnionedRDD[R] extends UnionRDD[R]
An extension of UnionRDD which automatically assigns a partitioner based on the way DseGraph stores and partitions vertex information. The graphLabels should map 1 to 1 with the RDDs provided in sequence. This ordering is used to develop a mapping between labels and the RDDs which represent the data for that label.
- class EmptyCassandraRDD[R] extends CassandraRDD[R]
Represents a CassandraRDD with no rows. This RDD does not load any data from Cassandra and doesn't require the table to exist.
- class MapRowWriter extends RowWriter[Map[String, AnyRef]]
- case class ReadConf(splitCount: Option[Int] = None, splitSizeInMB: Int = ReadConf.SplitSizeInMBParam.default, fetchSizeInRows: Int = ..., consistencyLevel: ConsistencyLevel = ..., taskMetricsEnabled: Boolean = ReadConf.TaskMetricParam.default, throughputMiBPS: Option[Double] = None, readsPerSec: Option[Int] = ReadConf.ReadsPerSecParam.default, parallelismLevel: Int = ..., executeAs: Option[String] = None) extends Product with Serializable
Read settings for RDD. A usage sketch follows the parameter list.
- splitCount: number of partitions to divide the data into; unset by default
- splitSizeInMB: size of Cassandra data to be read in a single Spark task; determines the number of partitions, but ignored if splitCount is set
- fetchSizeInRows: number of CQL rows to fetch in a single round-trip to Cassandra
- consistencyLevel: consistency level for reads, default LOCAL_ONE; a higher consistency level will disable data-locality
- taskMetricsEnabled: whether or not to enable task metrics updates (requires Spark 1.2+)
- readsPerSec: maximum read throughput allowed per single core in requests/s while joining an RDD with a C* table (joinWithCassandraTable operation); also used by enterprise integrations
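A ReadConf can be attached to a single RDD with withReadConf, overriding the defaults derived from SparkConf for that RDD only. A sketch, assuming an active SparkContext sc and hypothetical table names:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

val tuned = sc.cassandraTable("test", "words")
  .withReadConf(ReadConf(
    splitSizeInMB = 64,   // smaller partitions -> more parallelism
    fetchSizeInRows = 500 // fewer CQL rows per roundtrip
  ))
```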
- case class SparkPartitionLimit(rowsNumber: Long) extends CassandraLimit with Product with Serializable
- trait ValidRDDType[T] extends AnyRef
Annotations: @implicitNotFound( ... )
Value Members
- object CassandraCoGroupedRDD extends Serializable
- object CassandraLimit
- object CassandraRDD extends Serializable
- object CassandraTableScanRDD extends Serializable
- object ClusteringOrder extends Serializable
- object CqlWhereClause extends Serializable
- object DseGraphUnionedRDD extends Serializable
A Java-friendly API for DseGraphUnionedRDD to make it easier to call from VertexInputRDD
- object ReadConf extends Logging with Serializable
- object ValidRDDType