Posted to user@spark.apache.org by Gary Malouf <ma...@gmail.com> on 2013/10/26 02:15:15 UTC

Spark integration with HDFS and Cassandra simultaneously

We have a use case in which much of our raw data is stored in HDFS today.
 We'd like to write our Spark jobs such that they read/aggregate data from
HDFS and can output to our Cassandra cluster.

Is there any way of doing this in spark 0.7.3?

Re: Spark integration with HDFS and Cassandra simultaneously

Posted by Patrick Wendell <pw...@gmail.com>.
Err - "Hi Gary"!


On Sat, Oct 26, 2013 at 10:14 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Rohit,
> [...]

Re: Spark integration with HDFS and Cassandra simultaneously

Posted by Rohit Rai <ro...@tuplejump.com>.
Hello Thunder,

We don't use the hive branch underneath the current Calliope release, as it
focuses on Spark and Cassandra integration. In the next EA release, coming
later this month, we plan to bring in the cas-handler to support Shark on
Cassandra.

Regards,
Rohit


On Mon, Oct 28, 2013 at 9:53 PM, Thunder Stumpges <thunder.stumpges@gmail.com> wrote:

> This is great. I've been following this thread quietly, very interested!
> [...]



-- 

____________________________
www.tuplejump.com
*The Data Engineering Platform*

Re: Spark integration with HDFS and Cassandra simultaneously

Posted by Thunder Stumpges <th...@gmail.com>.
This is great. I've been following this thread quietly, very interested!

We are using Cassandra with CQL3 and composite primary keys (v2.0.1)
with good success from our application servers. We also have
Hadoop/Hive, but with how busy we have been, we haven't been able to
get Spark into production yet.

Just Friday I found https://github.com/milliondreams/hive.git, which
looks like a current connector for C* with Hadoop. Rohit, it looks like
you're active on that project as well. Does Calliope use this library
underneath?

Thanks, great group here. Very excited to use Spark and Spark
Streaming in the very near future!

-Thunder



On Sun, Oct 27, 2013 at 11:53 PM, Rohit Rai <ro...@tuplejump.com> wrote:

> Gary,
> [...]

Re: Spark integration with HDFS and Cassandra simultaneously

Posted by Rohit Rai <ro...@tuplejump.com>.
Gary,

As Patrick suggests, you can read from HDFS to create an RDD, and output
that RDD to C*.

On writing to C*, look at the Cassandra example here -
https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala

Of interest are lines 104 to 127, which show how to transform an RDD into
C* mutations.
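
Roughly, the pattern in those lines is the following sketch (wordCounts is an
assumed RDD[(String, Long)]; this paraphrases, rather than quotes, the linked
code):

import java.nio.ByteBuffer
import java.util.{Arrays => JArrays}
import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
import org.apache.cassandra.utils.ByteBufferUtil

// Each output record is (row key, list of mutations) - the pair shape
// that ColumnFamilyOutputFormat expects.
val casRdd = wordCounts.map { case (word, count) =>
  // One Thrift column named "count" holding the value.
  val col = new Column()
  col.setName(ByteBufferUtil.bytes("count"))
  col.setValue(ByteBufferUtil.bytes(count))
  col.setTimestamp(System.currentTimeMillis)

  // Wrap the column in a Mutation, as the output format requires.
  val mutation = new Mutation()
  mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn())
  mutation.column_or_supercolumn.setColumn(col)

  (ByteBufferUtil.bytes(word), JArrays.asList(mutation))
}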

<shameless_plug>
If you would like your analytics team to be able to do the transforms without
having to understand mutations and the related plumbing, I'll again suggest
taking a look at Calliope, in which you can provide the transforms as
implicits in the shell so they don't even need to know about them.

You can additionally provide the Cassandra config as predefined variables, so
all the analytics team needs to know is that they are writing to C*.

Of course, you can already do all that without Calliope too; it just makes
your work easier. ;)

If you want to use Calliope, you can read about writing with it here -
http://tuplejump.github.io/calliope/show-me-the-code.html

And if you really don't want to sign up for the early-access release, you can
get the G.A. release, along with source and instructions for getting the
binaries, from here -
https://github.com/tuplejump/calliope-release

</shameless_plug>
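
To illustrate the "transforms as implicits" idea generically - this is not
Calliope's actual API; saveToCassandra and CassandraWriteSupport below are
hypothetical names, and package names follow the 0.8-era layout of the linked
example (0.7.x used spark.RDD) - an admin could predefine something like this
in the shell init so analysts only ever call rdd.saveToCassandra:

import java.nio.ByteBuffer
import java.util.{List => JList}
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat
import org.apache.cassandra.thrift.Mutation
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: hides the Hadoop output-format plumbing behind one call.
class CassandraWriteSupport(rdd: RDD[(ByteBuffer, JList[Mutation])]) {
  def saveToCassandra(keyspace: String, conf: Configuration): Unit =
    rdd.saveAsNewAPIHadoopFile(keyspace, classOf[ByteBuffer],
      classOf[JList[Mutation]], classOf[ColumnFamilyOutputFormat], conf)
}

// The implicit that makes .saveToCassandra appear on suitably shaped RDDs.
implicit def rddToCassandraWriteSupport(
    rdd: RDD[(ByteBuffer, JList[Mutation])]): CassandraWriteSupport =
  new CassandraWriteSupport(rdd)

With that (plus a predefined Configuration, say casConf) loaded at shell
startup, an analyst's code reduces to casRdd.saveToCassandra("analytics",
casConf).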

Regards,
Rohit
founder @ tuplejump




On Sun, Oct 27, 2013 at 10:44 AM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Rohit,
> [...]


-- 

____________________________
www.tuplejump.com
*The Data Engineering Platform*

Re: Spark integration with HDFS and Cassandra simultaneously

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Rohit,

A single SparkContext can be used to read and write data across different
storage systems and formats, including HDFS and Cassandra. For instance, you
could do this:

val rdd1 = sc.textFile(XXX)  // some text file in HDFS
rdd1.saveAsHadoopFile(.., classOf[ColumnFamilyOutputFormat], ...)  // save to a Cassandra column family (see the Cassandra example)

This is a common pattern when using Spark for ETL between different storage
systems.

- Patrick
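
Filling in the placeholders above, a minimal end-to-end sketch might look like
this (host, keyspace, column family, and paths are illustrative; the
ConfigHelper calls and the save signature follow the CassandraTest example
linked elsewhere in this thread):

import java.nio.ByteBuffer
import java.util.{Arrays => JArrays, List => JList}
import org.apache.cassandra.hadoop.{ColumnFamilyOutputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._

// Point the Hadoop output side at the Cassandra cluster.
val job = new Job()
ConfigHelper.setOutputInitialAddress(job.getConfiguration, "cassandra-host")
ConfigHelper.setOutputRpcPort(job.getConfiguration, "9160")
ConfigHelper.setOutputColumnFamily(job.getConfiguration, "analytics", "word_counts")
ConfigHelper.setOutputPartitioner(job.getConfiguration, "Murmur3Partitioner")

// Read and aggregate from HDFS.
val counts = sc.textFile("hdfs:///data/raw/part-*")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// Shape each (word, count) as (row key, mutations); one column per row.
val casRdd = counts.map { case (word, count) =>
  val col = new Column()
  col.setName(ByteBufferUtil.bytes("count"))
  col.setValue(ByteBufferUtil.bytes(count))
  col.setTimestamp(System.currentTimeMillis)
  val m = new Mutation()
  m.setColumn_or_supercolumn(new ColumnOrSuperColumn())
  m.column_or_supercolumn.setColumn(col)
  (ByteBufferUtil.bytes(word), JArrays.asList(m))
}

casRdd.saveAsNewAPIHadoopFile("analytics", classOf[ByteBuffer],
  classOf[JList[Mutation]], classOf[ColumnFamilyOutputFormat],
  job.getConfiguration)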


On Sat, Oct 26, 2013 at 7:31 PM, Gary Malouf <ma...@gmail.com> wrote:

> Hi Rohit,
> [...]

Re: Spark integration with HDFS and Cassandra simultaneously

Posted by Gary Malouf <ma...@gmail.com>.
Hi Rohit,

We are big users of the Spark shell - our analytics team uses it for the same
purposes Hive used to serve. I had assumed the SparkContext provided at
startup would have to point at either HDFS or Cassandra - I take it we would
then manually create a second context?

Thanks,

Gary


On Sat, Oct 26, 2013 at 1:07 PM, Rohit Rai <ro...@tuplejump.com> wrote:

> Hello Gary,
> [...]

Re: Spark integration with HDFS and Cassandra simultaneously

Posted by Rohit Rai <ro...@tuplejump.com>.
Hello Gary,

This is very easy to do. You can read your data from HDFS using a
FileInputFormat, transform it into the desired rows, and write them to
Cassandra using ColumnFamilyOutputFormat.
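
On the read side, sc.textFile is shorthand for Hadoop's TextInputFormat; any
other FileInputFormat can be passed explicitly. A small sketch with
illustrative paths:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{KeyValueTextInputFormat, TextInputFormat}

// Plain lines via TextInputFormat - equivalent to sc.textFile(path).
val lines = sc.hadoopFile("hdfs:///data/raw/events",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, line) => line.toString }  // copy out of Hadoop's reused Text

// Tab-separated key/value records via another FileInputFormat.
val pairs = sc.hadoopFile("hdfs:///data/raw/kv",
    classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }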

Our library Calliope (Apache-licensed), http://tuplejump.github.io/calliope/,
can make the task of writing to C* easier.


In case you don't want to convert the data to rows, and would rather keep it
as files in Cassandra, our lightweight Cassandra-backed, HDFS-compatible
filesystem SnackFS can help you. SnackFS will be part of the next Calliope
release later this month, but we can provide you access if you would like to
try it out.

Feel free to mail me directly in case you need any assistance.


Regards,
Rohit
founder @ tuplejump




On Sat, Oct 26, 2013 at 5:45 AM, Gary Malouf <ma...@gmail.com> wrote:

> We have a use case in which much of our raw data is stored in HDFS today.
> [...]



-- 

____________________________
www.tuplejump.com
*The Data Engineering Platform*