Posted to user@spark.apache.org by maheshtwc <ma...@twc-contractor.com> on 2014/06/17 21:52:32 UTC

Spark streaming RDDs to Parquet records

Hello,

Is there an easy way to convert RDDs within a DStream into Parquet records?
Here is some incomplete pseudo code:

// Imports this snippet needs
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Create streaming context
val ssc = new StreamingContext(...)

// Obtain a DStream of events
val ds = KafkaUtils.createStream(...)

// Get Spark context to get to the SQL context
val sc = ds.context.sparkContext

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// For each RDD
ds.foreachRDD((rdd: RDD[Array[Byte]]) => {

    // What do I do next?
})

Thanks,
Mahesh

Re: Spark streaming RDDs to Parquet records

Posted by Anita Tailor <ta...@gmail.com>.
Thanks Mahesh,

I came across this example; it looks like it might give us some direction.

https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example
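
For reference, a rough sketch of how that example's Group API might be driven
from Spark; the schema, field names, and paths below are illustrative
assumptions, not something from this thread:

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD functions (Spark 1.0 era)
import parquet.example.data.Group
import parquet.example.data.simple.SimpleGroupFactory
import parquet.hadoop.example.{ExampleOutputFormat, GroupWriteSupport}
import parquet.schema.MessageTypeParser

val sc = new SparkContext("local[2]", "parquet-group-sketch")
val schemaText =
  "message event { required binary name (UTF8); required int64 ts; }"

// ExampleOutputFormat writes using the schema that GroupWriteSupport puts
// into the job configuration.
val job = new Job(sc.hadoopConfiguration)
GroupWriteSupport.setSchema(
  MessageTypeParser.parseMessageType(schemaText), job.getConfiguration)

// Build Group records per partition; the factory is re-created on the
// executors so nothing non-serializable is captured in the closure.
val groups = sc.parallelize(Seq(("a", 1L), ("b", 2L))).mapPartitions { rows =>
  val factory = new SimpleGroupFactory(
    MessageTypeParser.parseMessageType(schemaText))
  rows.map { case (name, ts) =>
    (null.asInstanceOf[Void],
      factory.newGroup().append("name", name).append("ts", ts))
  }
}
groups.saveAsNewAPIHadoopFile("/tmp/events.parquet",
  classOf[Void], classOf[Group], classOf[ExampleOutputFormat],
  job.getConfiguration)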

Thanks
Anita

Re: Spark streaming RDDs to Parquet records

Posted by maheshtwc <ma...@twc-contractor.com>.
Unfortunately, I couldn’t figure it out without involving Avro.

Here is something that may be useful since it uses Avro generic records (so no case classes needed) and transforms to Parquet.

http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
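
In case it helps others, a rough sketch of the Avro-generic-record route (not
the blog's exact code; the schema, field names, and paths are illustrative
assumptions, and sc is the Spark context from the snippet at the top of the
thread):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._ // pair-RDD functions (Spark 1.0 era)
import parquet.avro.AvroParquetOutputFormat

val schemaJson =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "name", "type": "string"},
    |  {"name": "ts",   "type": "long"}
    |]}""".stripMargin

val job = new Job(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(schemaJson))

// Build GenericRecords per partition (no case classes needed); the schema is
// re-parsed on the executors in case Avro's Schema is not serializable in
// this version.
val records = sc.parallelize(Seq(("a", 1L), ("b", 2L))).mapPartitions { rows =>
  val schema = new Schema.Parser().parse(schemaJson)
  rows.map { case (name, ts) =>
    val rec: GenericRecord = new GenericData.Record(schema)
    rec.put("name", name)
    rec.put("ts", ts)
    (null.asInstanceOf[Void], rec)
  }
}
records.saveAsNewAPIHadoopFile("/tmp/events-avro.parquet",
  classOf[Void], classOf[GenericRecord], classOf[AvroParquetOutputFormat],
  job.getConfiguration)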

HTH,
Mahesh

From: "Anita Tailor [via Apache Spark User List]" <ml...@n3.nabble.com>>
Date: Thursday, June 19, 2014 at 12:53 PM
To: Mahesh Padmanabhan <ma...@twc-contractor.com>>
Subject: Re: Spark streaming RDDs to Parquet records

I have similar case where I have RDD [List[Any], List[Long] ] and wants to save it as Parquet file.
My understanding is that only RDD of case classes can be converted to SchemaRDD. So is there any way I can save this RDD as Parquet file without using Avro?

Thanks in advance
Anita


On 18 June 2014 05:03, Michael Armbrust <[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=0>> wrote:
If you convert the data to a SchemaRDD you can save it as Parquet: http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet


On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) <[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=1>> wrote:
Thanks Krishna. Seems like you have to use Avro and then convert that to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into this some more.

Thanks,
Mahesh

From: Krishna Sankar <[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=2>>
Reply-To: "[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=3>" <[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=4>>
Date: Tuesday, June 17, 2014 at 2:41 PM
To: "[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=5>" <[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=6>>
Subject: Re: Spark streaming RDDs to Parquet records

Mahesh,

 *   One direction could be : create a parquet schema, convert & save the records to hdfs.
 *   This might help https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

Cheers
<k/>


On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc <[hidden email]</user/SendEmail.jtp?type=node&node=7939&i=7>> wrote:
Hello,

Is there an easy way to convert RDDs within a DStream into Parquet records?
Here is some incomplete pseudo code:

// Create streaming context
val ssc = new StreamingContext(...)

// Obtain a DStream of events
val ds = KafkaUtils.createStream(...)

// Get Spark context to get to the SQL context
val sc = ds.context.sparkContext

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// For each RDD
ds.foreachRDD((rdd: RDD[Array[Byte]]) => {

    // What do I do next?
})

Thanks,
Mahesh



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


________________________________
This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.




________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762p7939.html
To unsubscribe from Spark streaming RDDs to Parquet records, click here<http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=7762&code=bWFoZXNoLnBhZG1hbmFiaGFuQHR3Yy1jb250cmFjdG9yLmNvbXw3NzYyfDE3Mjg5ODI4OTI=>.
NAML<http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762p7971.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark streaming RDDs to Parquet records

Posted by Anita Tailor <ta...@gmail.com>.
I have a similar case, where I have an RDD[List[Any], List[Long]] and want to
save it as a Parquet file.
My understanding is that only an RDD of case classes can be converted to a
SchemaRDD. So is there any way I can save this RDD as a Parquet file without
using Avro?

Thanks in advance
Anita

Re: Spark streaming RDDs to Parquet records

Posted by Michael Armbrust <mi...@databricks.com>.
If you convert the data to a SchemaRDD you can save it as Parquet:
http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
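
A minimal sketch of that route against the Spark 1.0-era API, continuing the
pseudo code from the original post; the Event case class, the byte decoding,
and the output path are illustrative assumptions:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

case class Event(name: String, ts: Long)

// Hypothetical decoder: assumes the Kafka payload is a UTF-8 "name,ts" pair.
def parse(bytes: Array[Byte]): Event = {
  val Array(name, ts) = new String(bytes, "UTF-8").split(",")
  Event(name, ts.toLong)
}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD // implicitly lifts RDD[Event] to a SchemaRDD

ds.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => {
  // One directory per batch: the Parquet output path must not already exist.
  rdd.map(parse).saveAsParquetFile("hdfs:///events/batch-" + time.milliseconds)
})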

Re: Spark streaming RDDs to Parquet records

Posted by "Padmanabhan, Mahesh (contractor)" <ma...@twc-contractor.com>.
Thanks Krishna. Seems like you have to use Avro and then convert that to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into this some more.

Thanks,
Mahesh

Re: Spark streaming RDDs to Parquet records

Posted by Krishna Sankar <ks...@gmail.com>.
Mahesh,

   - One direction could be: create a Parquet schema, then convert and save
   the records to HDFS (a short sketch of the schema step follows below).
   - This might help:
   https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

Cheers
<k/>
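
A short sketch of the schema step from the first bullet, using parquet-mr's
text schema format; the field names are illustrative assumptions, and the
write side is sketched earlier in the thread:

import parquet.schema.{MessageType, MessageTypeParser}

// Parse a hand-written Parquet message type from its text representation.
val schema: MessageType = MessageTypeParser.parseMessageType(
  """message event {
    |  required binary name (UTF8);
    |  required int64 ts;
    |}""".stripMargin)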
