Spark and Cassandra via Python

Question


I have a huge amount of data stored in Cassandra and I want to process it with Spark through Python. I just want to know how to connect Spark and Cassandra from Python. I've seen people using sc.cassandraTable, but it doesn't work for me, and fetching all the data from Cassandra at once and then feeding it to Spark doesn't make sense. Any suggestions?


python cassandra apache-spark pyspark

Rakesh 09 Apr 17 at 18:51


2 answers

RussS · Answer 1 · 2017-04-09T19:03:24+0000

Have you tried the examples in the documentation?

Python documentation from the Spark Cassandra Connector:

 spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="kv", keyspace="test")\
    .load().show()
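To address the worry about fetching everything at once: a DataFrame built this way is lazy, so nothing is read from Cassandra until an action such as show() runs. Below is a minimal sketch that wraps the same read and optionally narrows it to a few columns and a condition before any action is triggered. The helper name read_cassandra_table and its parameters are hypothetical, and it assumes `spark` is an existing SparkSession started with the Spark Cassandra Connector on its classpath.

```python
def read_cassandra_table(spark, keyspace, table, columns=None, condition=None):
    """Build a lazy DataFrame over one Cassandra table.

    `spark` is assumed to be an existing SparkSession with the
    Spark Cassandra Connector available. No rows are read here;
    reading starts only when an action runs on the returned DataFrame.
    """
    df = (
        spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table=table, keyspace=keyspace)
        .load()
    )
    if columns:
        # Ask only for the columns that are actually needed.
        df = df.select(*columns)
    if condition is not None:
        # Accepts a SQL expression string or a Column, like DataFrame.filter.
        df = df.filter(condition)
    return df
```

For example, read_cassandra_table(spark, "test", "kv", columns=["key", "value"], condition="value > 10").show() would display matching rows from the test.kv table used above; the column names here are assumptions about that table.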

Marko Švaljek · Answer 2 · 2017-04-09T19:59:19+0000

I'll just give my "short" 2 cents. The documentation is perfectly fine for you to get started. You might want to specify why this isn't working: are you running out of memory (maybe you just need to increase the "driver" memory), or is there some specific error that is causing your example to fail? It would also be nice if you provided that example.

Here are some of my opinions and experiences. Usually (not always, but most of the time) you have multiple columns in your partitions. You don't always have to load all the data in a table, and more or less you can keep the processing (most of the time) within a single partition. Since the data is sorted within a partition, this usually goes pretty fast, and it hasn't presented any significant problems.
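As a rough illustration of keeping the processing within a single partition, the sketch below restricts a Cassandra-backed DataFrame to one partition key value before any further work. The function name is hypothetical, `df` is assumed to be a DataFrame loaded through the connector as in the first answer, and whether the filter is actually pushed down to Cassandra depends on the connector and the table's primary key definition.

```python
def read_single_partition(df, partition_key, key_value):
    """Restrict a Cassandra-backed DataFrame to one partition.

    `partition_key` is the name of the table's partition key column
    (an assumption about the schema); an equality filter on it lets
    the connector read that one partition instead of scanning the
    whole table, where pushdown applies.
    """
    return df.filter(df[partition_key] == key_value)
```

With the test.kv table from the first answer, read_single_partition(df, "key", "some-key") would yield a DataFrame limited to rows whose key column equals "some-key"; the value is only a placeholder.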

If you don't want to fetch the whole store from Cassandra into Spark for every processing cycle, there are really a lot of solutions out there. That is mostly material for a broader discussion. Some of the more common ones are:

Do the processing in your application right away - some kind of inter-node communication infrastructure such as Hazelcast, or even better Akka Cluster, might be required; this is a really broad topic.
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading, which may be Cassandra.
Apache Flink - use a proper streaming solution and periodically flush the processing state to, e.g., Cassandra.
Store the data in Cassandra the way it is supposed to be read - this is the most recommended approach (it's just hard to tell with the information you provided).
The list could go on and on ... user-defined functions in Cassandra, or aggregate functions if your task is something simpler.
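For the last item, a simple aggregate can be computed by Cassandra itself, so no rows travel to Spark at all. A minimal sketch, assuming `session` is a connected Session from the DataStax Python driver (cassandra-driver) and that test.kv has a partition key column named key and a numeric column named value; the function name and the schema are assumptions, not something stated in the question.

```python
def average_value_for_key(session, key):
    """Let Cassandra compute the average of `value` inside one partition.

    `session` is assumed to be a connected cassandra-driver Session;
    the table and column names are hypothetical.
    """
    row = session.execute(
        # avg() is one of CQL's built-in aggregate functions.
        "SELECT avg(value) AS avg_value FROM test.kv WHERE key = %s",
        (key,),
    ).one()
    return row.avg_value
```

Restricting the query to one partition key keeps the aggregation cheap; aggregating over a whole large table this way is usually not a good idea.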

It may also be a good idea to provide some details about your use case. What I've said here is rather general and vague, but then again, putting all of this in a comment just wouldn't make sense.
