Re: OLAP, Hadoop, Spark and Cassandra
Mladen Marović <mladen...@...>
A slight correction and clarification of my previous post: the total number of partitions/splits is exactly equal to total_number_of_tokens + 1. In a 3-node Cassandra cluster where each node has 256 tokens (the default), this results in a total of 768 + 1 = 769 partitions; in a single-node cluster it would be 257, and so on. There is no "1 task that collects results, or something similar".
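As a rough sketch of the counting argument (hypothetical Python, not actual Cassandra or Spark connector code): treating the 64-bit token ring as a linear range, placing n distinct tokens in it always produces n + 1 contiguous (start, end] subranges, which matches the split count above.

```python
# Sketch: splitting the 64-bit token range at n token values yields
# n + 1 contiguous subranges. Hypothetical illustration only; the real
# input format also wraps the last range around the ring, but the
# total count is the same.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_splits(tokens):
    """Return the subranges of [MIN_TOKEN, MAX_TOKEN] induced by `tokens`."""
    bounds = [MIN_TOKEN] + sorted(tokens) + [MAX_TOKEN]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

print(len(token_splits([-5, 0, 7])))        # 3 tokens -> 4 subranges
print(len(token_splits(list(range(16)))))   # 16 tokens -> 17 subranges
```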
This makes sense when you consider that Cassandra partitions data using 64-bit row key hashes, that the total range of 64-bit integer hash values is [-2^63, 2^63 - 1], and that tokens are simply 64-bit integer values used to determine which data partitions a node gets. Splitting that range at n distinct tokens always gives n + 1 subranges. A log excerpt from a single-node Cassandra cluster with 16 tokens confirms this (17 input splits, tasks 0 through 16):

18720 [Executor task launch worker for task 0] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-4815577940669380240, '-2942172956248108515] @[master])
18720 [Executor task launch worker for task 1] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((7326109958794842850, '7391123213565411179] @[master])
18721 [Executor task launch worker for task 3] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-2942172956248108515, '-2847854446434006096] @[master])
18740 [Executor task launch worker for task 2] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-9223372036854775808, '-8839354777455528291] @[master])
28369 [Executor task launch worker for task 4] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((4104296217363716109, '7326109958794842850] @[master])
28651 [Executor task launch worker for task 5] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((8156279557766590813, '-9223372036854775808] @[master])
34467 [Executor task launch worker for task 6] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-6978843450179888845, '-5467974851507832526] @[master])
54235 [Executor task launch worker for task 7] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((2164465249293820494, '3738744141825711063] @[master])
56122 [Executor task launch worker for task 8] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-2847854446434006096, '180444324727144184] @[master])
60564 [Executor task launch worker for task 9] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((180444324727144184, '720824306927062455] @[master])
74783 [Executor task launch worker for task 10] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-8839354777455528291, '-7732322859452179159] @[master])
78171 [Executor task launch worker for task 11] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-7732322859452179159, '-6978843450179888845] @[master])
79362 [Executor task launch worker for task 12] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((3738744141825711063, '4104296217363716109] @[master])
91036 [Executor task launch worker for task 13] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-5467974851507832526, '-4815577940669380240] @[master])
92250 [Executor task launch worker for task 14] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((1437322944493769078, '2164465249293820494] @[master])
92363 [Executor task launch worker for task 15] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((720824306927062455, '1437322944493769078] @[master])
94339 [Executor task launch worker for task 16] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((7391123213565411179, '8156279557766590813] @[master])

Best regards,

Mladen

On Tuesday, December 1, 2020 at 8:05:19 AM UTC+1 HadoopMarc wrote: