Posted to user@spark.apache.org by guillaume farcy <gu...@imt-atlantique.net> on 2022/03/21 15:33:51 UTC

[Spark SQL] Can Structured Streaming in Python connect to Cassandra?

Hello,

I am a student and I am currently doing a big data project.
Here is my code:
https://gist.github.com/Balykoo/262d94a7073d5a7e16dfb0d0a576b9c3

My project retrieves messages from a Twitch chat and sends them to
Kafka; Spark then reads the Kafka topic and performs the processing
shown in the gist.

I then want to write these messages to Cassandra.

I tested a first solution (line 72 of the gist) that works, but Spark
crashes when there are too many messages, probably because my function
connects to Cassandra each time it is called.

I tried an object-oriented approach to share the connection object, but
without success:
_pickle.PicklingError: Could not serialize object: TypeError: cannot
pickle '_thread.RLock' object

Can you please tell me how to do this?
Or at least give me some advice?

Sincerely,
FARCY Guillaume.



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: [Spark SQL] Can Structured Streaming in Python connect to Cassandra?

Posted by Sean Owen <sr...@gmail.com>.
It looks like you are trying to apply this class/function across Spark, but
it contains a non-serializable object: the connection. The connection has to
be initialized on use, in the worker process; otherwise you are trying to
send it from the driver, and that can't work.
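A minimal sketch of the pattern Sean describes: create the connection lazily inside the process that uses it, so nothing unpicklable travels with the closure. The names `get_session` and `write_batch` are hypothetical, and a plain `object()` stands in for the real Cassandra session so the sketch runs without a cluster; the commented-out lines show where the real `cassandra-driver` calls would go.

```python
# Cache the session at module level: it is created at most once per
# Python worker process, never on the driver, so it is never pickled.
_session = None


def get_session():
    """Open the Cassandra session on first use in this process."""
    global _session
    if _session is None:
        # In the real job this would be something like:
        #   from cassandra.cluster import Cluster
        #   _session = Cluster(["cassandra-host"]).connect("my_keyspace")
        _session = object()  # stand-in so the sketch runs without Cassandra
    return _session


def write_batch(batch_df, batch_id):
    """Passed to foreachBatch; reuses one session instead of reconnecting."""
    session = get_session()
    for row in batch_df.collect():
        # session.execute(insert_statement, row) in the real job
        pass
```

The key point is that `get_session()` is called inside the sink function, so the connection is built where it is used; only the (picklable) function itself is shipped by Spark.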

On Mon, Mar 21, 2022 at 11:51 AM guillaume farcy <
guillaume.farcy@imt-atlantique.net> wrote:

> [...]

Re: [Spark SQL] Can Structured Streaming in Python connect to Cassandra?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Dear student,


Check this article of mine on LinkedIn:


Processing Change Data Capture with Spark Structured Streaming
<https://www.linkedin.com/pulse/processing-change-data-capture-spark-structured-talebzadeh-ph-d-/>


There is a link to GitHub
<https://github.com/michTalebzadeh/SparkStructuredStreaming>  as well.


That example writes to a Google BigQuery table. If I am correct, you can
write to Cassandra via a JDBC connection.



HTH


   View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 21 Mar 2022 at 16:51, guillaume farcy <
guillaume.farcy@imt-atlantique.net> wrote:

> [...]

Re: [Spark SQL] Can Structured Streaming in Python connect to Cassandra?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Completely agree with Alex. Also, if you are just writing to Cassandra,
then what is the purpose of the Kafka broker?

People often act as if adding more components to their architecture is
great, but sadly it is not. Remove the Kafka broker in case you are not
broadcasting your messages to a wider set of consumers. Also, Spark is
overkill for the way you are using it.

There are fantastic solutions available in the market, like Presto SQL,
BigQuery, Redshift, Athena, Snowflake, etc., and Spark is just one of the
tools, and often a difficult one to configure and run.

Regards,
Gourav Sengupta

On Fri, Mar 25, 2022 at 1:19 PM Alex Ott <al...@gmail.com> wrote:

> [...]

Re: [Spark SQL] Can Structured Streaming in Python connect to Cassandra?

Posted by Alex Ott <al...@gmail.com>.
You don't need to use foreachBatch to write to Cassandra. You just need to
use Spark Cassandra Connector version 2.5.0 or higher: it supports writing
streaming data natively into Cassandra.

Here is the announcement: https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all
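A configuration sketch of the native streaming write Alex describes, for PySpark with the Spark Cassandra Connector (2.5.0+). The connector package version, contact point, topic, keyspace, table, and checkpoint path below are all placeholders for your own values, not values from the thread; the parsing of the Kafka `value` column into table columns is elided because it depends on your schema.

```python
from pyspark.sql import SparkSession

# Assumed connector coordinates and Cassandra host; adjust to your setup.
spark = (SparkSession.builder
         .appName("twitch-to-cassandra")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Read the stream from Kafka, as in the gist.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "twitch_messages")   # hypothetical topic
      .load())

# ... parse df's value column into the columns of your Cassandra table ...

# Write the stream directly to Cassandra; no foreachBatch needed.
query = (df.writeStream
         .format("org.apache.spark.sql.cassandra")
         .option("checkpointLocation", "/tmp/twitch_checkpoint")
         .option("keyspace", "twitch")    # hypothetical keyspace
         .option("table", "messages")     # hypothetical table
         .outputMode("append")
         .start())
query.awaitTermination()
```

Because the connector manages its own sessions on the executors, this also sidesteps the pickling problem entirely.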

guillaume farcy  at "Mon, 21 Mar 2022 16:33:51 +0100" wrote:
 gf> [...]

-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
