Posted to user@spark.apache.org by Mahebub Sayyed <ma...@gmail.com> on 2014/07/17 11:19:17 UTC

Apache Kafka + Spark + Parquet

Hi All,

Currently we are reading (multiple) topics from Apache Kafka and storing
them in HBase (multiple tables) using Twitter Storm (one tuple is stored in
4 different tables), but we are facing some performance issues with HBase.
So we are replacing *HBase* with *Parquet* files and *Storm* with *Apache
Spark*.

Difficulties:
1. How do we read multiple topics from Kafka using Spark?
2. One tuple belongs to multiple tables. How do we write one topic to
multiple Parquet files with proper partitioning using Spark?

Please help me.
Thanks in advance.

-- 
*Regards,*

*Mahebub *

Re: Apache Kafka + Spark + Parquet

Posted by buntu <bu...@gmail.com>.
> Now we are storing data directly from Kafka to Parquet.

We are currently using Camus and wanted to know how you went about storing
to Parquet.




Re: Apache Kafka + Spark + Parquet

Posted by Michael Armbrust <mi...@databricks.com>.
We don't have support for partitioned Parquet yet. There is a JIRA here:
https://issues.apache.org/jira/browse/SPARK-2406
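
Until that lands, one workaround (a minimal, hypothetical sketch against the
Spark 1.x SQLContext/SchemaRDD API; the Hashtag case class, table name and
paths below are placeholders, not anything from this thread) is to lay the
files out in Hive/Impala-style key=value directories yourself and register
each directory as a partition on the Impala side. For example, in the Spark
shell (where sc already exists):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // lets an RDD of case classes be saved as Parquet

// Placeholder data standing in for one day's worth of records.
case class Hashtag(key: String, hashtagText: String)
val records = sc.parallelize(Seq(Hashtag("k1", "#spark"), Hashtag("k2", "#parquet")))

// Write the day's records under an Impala-style partition directory.
records.saveAsParquetFile(
  "hdfs:///warehouse/parquet_hashtags/year=2014/month=07/day=17")

// Then, on the Impala/Hive side, something like:
//   ALTER TABLE parquet_hashtags ADD PARTITION (year='2014', month='07', day='17')
//     LOCATION '/warehouse/parquet_hashtags/year=2014/month=07/day=17';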


On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das <ta...@gmail.com>
wrote:

> val kafkaStream = KafkaUtils.createStream(... ) // see the example in my
> previous post
>
> val transformedStream = kafkaStream.map ...   // whatever transformation
> you want to do
>
> transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
>      // save the RDD to a Parquet file, using the time as the file name; see
>      // the other link I sent on how to do it
>      // every batch of data will create a new Parquet file
> })
>
>
> Maybe Michael (cc'ed) will be able to give more insights about the Parquet
> stuff.
>
> TD
>
>
>
>
> On Thu, Jul 17, 2014 at 3:59 AM, Mahebub Sayyed <ma...@gmail.com>
> wrote:
>
>> Hi,
>>
>> To migrate data from *HBase* to *Parquet* we used the following query
>> through *Impala*:
>>
>> INSERT INTO table PARQUET_HASHTAGS(
>>
>> key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time, hashtag_time,
>> tweet_id, user_id, user_name,
>> hashtag_year
>> ) *partition(year, month, day)* SELECT key, city_name, country_name,
>> hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time,
>> hashtag_time, tweet_id, user_id, user_name,hashtag_year, cast(hashtag_year
>> as int),cast(hashtag_month as int), cast(hashtag_date as int) from HASHTAGS
>> where hashtag_year='2014' and hashtag_month='04' and hashtag_date='01'
>> ORDER BY key,hashtag_year,hashtag_month,hashtag_date LIMIT 10000000 offset
>> 0;
>>
>> Using the above query we have successfully migrated from HBase to Parquet
>> files with proper partitions.
>>
>> Now we are storing data directly from *Kafka* to *Parquet*.
>>
>> *How is it possible to create partitions while storing data directly from
>> Kafka to Parquet files?*
>> *(like the partitions created in the above query)*
>>
>>
>> On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <
>> tathagata.das1565@gmail.com> wrote:
>>
>>> 1. You can put multiple Kafka topics into the same Kafka input stream.
>>> See the example KafkaWordCount
>>> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala> .
>>> However, they will all be read through a single receiver (though with
>>> multiple threads, one per topic). To parallelize the read (for increased
>>> throughput), you can create multiple Kafka input streams and split the
>>> topics appropriately between them.
>>>
>>> 2. You can easily read and write Parquet files in Spark. Any RDD
>>> (generated through DStreams in Spark Streaming, or otherwise) can be
>>> converted to a SchemaRDD and then saved in the Parquet format with
>>> rdd.saveAsParquetFile. See the Spark SQL guide
>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files> for
>>> more details. So if you want to write the same dataset (as RDDs) to two
>>> different Parquet files, you just have to call saveAsParquetFile twice (on
>>> the same or transformed versions of the RDD), as shown in the guide.
>>>
>>> Hope this helps!
>>>
>>> TD
>>>
>>>
>>> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Currently we are reading (multiple) topics from Apache Kafka and
>>>> storing them in HBase (multiple tables) using Twitter Storm (one tuple
>>>> is stored in 4 different tables), but we are facing some performance
>>>> issues with HBase. So we are replacing *HBase* with *Parquet* files and
>>>> *Storm* with *Apache Spark*.
>>>>
>>>> Difficulties:
>>>> 1. How do we read multiple topics from Kafka using Spark?
>>>> 2. One tuple belongs to multiple tables. How do we write one topic to
>>>> multiple Parquet files with proper partitioning using Spark?
>>>>
>>>> Please help me.
>>>> Thanks in advance.
>>>>
>>>> --
>>>> *Regards,*
>>>>
>>>> *Mahebub *
>>>>
>>>
>>>
>>
>>
>> --
>> *Regards,*
>> *Mahebub Sayyed*
>>
>
>

Re: Apache Kafka + Spark + Parquet

Posted by Tathagata Das <ta...@gmail.com>.
val kafkaStream = KafkaUtils.createStream(... ) // see the example in my
previous post

val transformedStream = kafkaStream.map ...   // whatever transformation
you want to do

transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
     // save the RDD to a Parquet file, using the time as the file name; see
     // the other link I sent on how to do it
     // every batch of data will create a new Parquet file
})
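
For reference, a more concrete (still hypothetical) version of that snippet,
assuming the Spark 1.x receiver-based Kafka API and the SQLContext/SchemaRDD
Parquet support; the topic name, ZooKeeper address, Tweet case class and
stand-in parser are all made-up placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical record type for whatever the Kafka messages contain.
case class Tweet(key: String, hashtag: String, postedTime: String)

// Stand-in parser; replace with the real message format.
def parseTweet(msg: String): Tweet = {
  val f = msg.split(",", -1)
  Tweet(f(0), f(1), f(2))
}

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToParquet"), Seconds(10))
val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.createSchemaRDD   // implicit: RDD of case classes -> SchemaRDD

// One receiver reading the "tweets" topic with two consumer threads.
val kafkaStream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "parquet-writer", Map("tweets" -> 2))
val transformedStream = kafkaStream.map { case (_, msg) => parseTweet(msg) }

transformedStream.foreachRDD((rdd: RDD[Tweet], time: Time) => {
  // Every batch creates a new Parquet directory, named by the batch time.
  rdd.saveAsParquetFile(s"hdfs:///parquet/tweets/batch-${time.milliseconds}")
})

ssc.start()
ssc.awaitTermination()

Note that a long-running job written this way produces many small per-batch
directories; compacting them (or registering them downstream) is a separate
concern.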


Maybe Michael (cc'ed) will be able to give more insights about the Parquet
stuff.

TD




On Thu, Jul 17, 2014 at 3:59 AM, Mahebub Sayyed <ma...@gmail.com>
wrote:

> Hi,
>
> To migrate data from *HBase* to *Parquet* we used the following query through
> *Impala*:
>
> INSERT INTO table PARQUET_HASHTAGS(
>
> key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time, hashtag_time,
> tweet_id, user_id, user_name,
> hashtag_year
> ) *partition(year, month, day)* SELECT key, city_name, country_name,
> hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time,
> hashtag_time, tweet_id, user_id, user_name,hashtag_year, cast(hashtag_year
> as int),cast(hashtag_month as int), cast(hashtag_date as int) from HASHTAGS
> where hashtag_year='2014' and hashtag_month='04' and hashtag_date='01'
> ORDER BY key,hashtag_year,hashtag_month,hashtag_date LIMIT 10000000 offset
> 0;
>
> Using the above query we have successfully migrated from HBase to Parquet
> files with proper partitions.
>
> Now we are storing data directly from *Kafka* to *Parquet*.
>
> *How is it possible to create partitions while storing data directly from
> Kafka to Parquet files?*
> *(like the partitions created in the above query)*
>
>
> On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <
> tathagata.das1565@gmail.com> wrote:
>
>> 1. You can put multiple Kafka topics into the same Kafka input stream.
>> See the example KafkaWordCount
>> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala> .
>> However, they will all be read through a single receiver (though with
>> multiple threads, one per topic). To parallelize the read (for increased
>> throughput), you can create multiple Kafka input streams and split the
>> topics appropriately between them.
>>
>> 2. You can easily read and write Parquet files in Spark. Any RDD
>> (generated through DStreams in Spark Streaming, or otherwise) can be
>> converted to a SchemaRDD and then saved in the Parquet format with
>> rdd.saveAsParquetFile. See the Spark SQL guide
>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files> for
>> more details. So if you want to write the same dataset (as RDDs) to two
>> different Parquet files, you just have to call saveAsParquetFile twice (on
>> the same or transformed versions of the RDD), as shown in the guide.
>>
>> Hope this helps!
>>
>> TD
>>
>>
>> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <ma...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Currently we are reading (multiple) topics from Apache Kafka and storing
>>> them in HBase (multiple tables) using Twitter Storm (one tuple is stored
>>> in 4 different tables), but we are facing some performance issues with
>>> HBase. So we are replacing *HBase* with *Parquet* files and *Storm* with
>>> *Apache Spark*.
>>>
>>> Difficulties:
>>> 1. How do we read multiple topics from Kafka using Spark?
>>> 2. One tuple belongs to multiple tables. How do we write one topic to
>>> multiple Parquet files with proper partitioning using Spark?
>>>
>>> Please help me.
>>> Thanks in advance.
>>>
>>> --
>>> *Regards,*
>>>
>>> *Mahebub *
>>>
>>
>>
>
>
> --
> *Regards,*
> *Mahebub Sayyed*
>

Re: Apache Kafka + Spark + Parquet

Posted by Mahebub Sayyed <ma...@gmail.com>.
Hi,

To migrate data from *HBase* to *Parquet* we used the following query through
*Impala*:

INSERT INTO table PARQUET_HASHTAGS(
key, city_name, country_name, hashtag_date, hashtag_text,
hashtag_source, hashtag_month, posted_time, hashtag_time,
tweet_id, user_id, user_name,
hashtag_year
) *partition(year, month, day)* SELECT key, city_name, country_name,
hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time,
hashtag_time, tweet_id, user_id, user_name,hashtag_year, cast(hashtag_year
as int),cast(hashtag_month as int), cast(hashtag_date as int) from HASHTAGS
where hashtag_year='2014' and hashtag_month='04' and hashtag_date='01'
ORDER BY key,hashtag_year,hashtag_month,hashtag_date LIMIT 10000000 offset
0;

Using the above query we have successfully migrated from HBase to Parquet
files with proper partitions.

Now we are storing data directly from *Kafka* to *Parquet*.

*How is it possible to create partitions while storing data directly from
Kafka to Parquet files?*
*(like the partitions created in the above query)*
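
Since Spark does not yet write partitioned Parquet by itself (see the
SPARK-2406 note earlier on this page), one way to approximate the layout the
Impala query produces is to split each streaming batch by its date fields and
write every group under a Hive/Impala-style partition directory. A hedged,
self-contained sketch follows; the Hashtag case class, the stand-in parser,
topic name, ZooKeeper address and paths are all hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical record type; the field names mirror the Impala columns above.
case class Hashtag(key: String, hashtagText: String, hashtagYear: String,
                   hashtagMonth: String, hashtagDate: String)

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaHashtagsToParquet"), Seconds(10))
val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.createSchemaRDD   // lets an RDD of case classes be saved as Parquet

val hashtagStream = KafkaUtils
  .createStream(ssc, "zk-host:2181", "parquet-writer", Map("hashtags" -> 1))
  .map { case (_, msg) =>
    val f = msg.split(",", -1)                        // stand-in parser
    Hashtag(f(0), f(1), f(2), f(3), f(4))
  }

hashtagStream.foreachRDD((rdd: RDD[Hashtag], time: Time) => {
  rdd.cache()   // the batch is reused once per date group below
  val days = rdd.map(h => (h.hashtagYear, h.hashtagMonth, h.hashtagDate)).distinct().collect()
  for ((y, m, d) <- days) {
    // One Impala/Hive-style partition directory per (year, month, day).
    rdd.filter(h => h.hashtagYear == y && h.hashtagMonth == m && h.hashtagDate == d)
       .saveAsParquetFile(
         s"hdfs:///warehouse/parquet_hashtags/year=$y/month=$m/day=$d/batch-${time.milliseconds}")
  }
  rdd.unpersist()
  // saveAsParquetFile cannot append into an existing directory, so each batch
  // gets its own sub-directory; adding/refreshing the partitions on the Impala
  // side and compacting small files remain separate steps.
})

ssc.start()
ssc.awaitTermination()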


On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <tathagata.das1565@gmail.com
> wrote:

> 1. You can put multiple Kafka topics into the same Kafka input stream.
> See the example KafkaWordCount
> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala> .
> However, they will all be read through a single receiver (though with
> multiple threads, one per topic). To parallelize the read (for increased
> throughput), you can create multiple Kafka input streams and split the
> topics appropriately between them.
>
> 2. You can easily read and write Parquet files in Spark. Any RDD
> (generated through DStreams in Spark Streaming, or otherwise) can be
> converted to a SchemaRDD and then saved in the Parquet format with
> rdd.saveAsParquetFile. See the Spark SQL guide
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files> for
> more details. So if you want to write the same dataset (as RDDs) to two
> different Parquet files, you just have to call saveAsParquetFile twice (on
> the same or transformed versions of the RDD), as shown in the guide.
>
> Hope this helps!
>
> TD
>
>
> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <ma...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> Currently we are reading (multiple) topics from Apache Kafka and storing
>> them in HBase (multiple tables) using Twitter Storm (one tuple is stored
>> in 4 different tables), but we are facing some performance issues with
>> HBase. So we are replacing *HBase* with *Parquet* files and *Storm* with
>> *Apache Spark*.
>>
>> Difficulties:
>> 1. How do we read multiple topics from Kafka using Spark?
>> 2. One tuple belongs to multiple tables. How do we write one topic to
>> multiple Parquet files with proper partitioning using Spark?
>>
>> Please help me.
>> Thanks in advance.
>>
>> --
>> *Regards,*
>>
>> *Mahebub *
>>
>
>


-- 
*Regards,*
*Mahebub Sayyed*

Re: Apache Kafka + Spark + Parquet

Posted by Tathagata Das <ta...@gmail.com>.
1. You can put multiple Kafka topics into the same Kafka input stream. See
the example KafkaWordCount
<https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>.
However, they will all be read through a single receiver (though with
multiple threads, one per topic). To parallelize the read (for increased
throughput), you can create multiple Kafka input streams and split the
topics appropriately between them (see the sketch after point 2 below).

2. You can easily read and write Parquet files in Spark. Any RDD
(generated through DStreams in Spark Streaming, or otherwise) can be
converted to a SchemaRDD and then saved in the Parquet format with
rdd.saveAsParquetFile. See the Spark SQL guide
<http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>
for more details. So if you want to write the same dataset (as RDDs) to two
different Parquet files, you just have to call saveAsParquetFile twice (on
the same or transformed versions of the RDD), as shown in the guide.
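
For point 1, a minimal sketch of both options (the ZooKeeper address, group
id, topic names and thread counts are made-up placeholders; this uses the
receiver-based KafkaUtils.createStream API of Spark 1.x):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaToParquet")
val ssc = new StreamingContext(conf, Seconds(10))

// Option A: one input stream (one receiver) reading several topics,
// with one consumer thread per topic.
val singleStream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "consumer-group", Map("tweets" -> 1, "hashtags" -> 1))

// Option B: one input stream per topic (one receiver each) for more receive
// parallelism, unioned back into a single DStream for further processing.
val streams = Seq("tweets", "hashtags").map { topic =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "consumer-group", Map(topic -> 1))
}
val unifiedStream = ssc.union(streams)

For point 2, the foreachRDD / saveAsParquetFile pattern is sketched earlier
on this page, after TD's code snippet; calling saveAsParquetFile a second
time on the same (or a transformed) RDD inside that loop is all that is
needed to feed a second Parquet output.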

Hope this helps!

TD


On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <ma...@gmail.com>
wrote:

> Hi All,
>
> Currently we are reading (multiple) topics from Apache Kafka and storing
> them in HBase (multiple tables) using Twitter Storm (one tuple is stored
> in 4 different tables), but we are facing some performance issues with
> HBase. So we are replacing *HBase* with *Parquet* files and *Storm* with
> *Apache Spark*.
>
> Difficulties:
> 1. How do we read multiple topics from Kafka using Spark?
> 2. One tuple belongs to multiple tables. How do we write one topic to
> multiple Parquet files with proper partitioning using Spark?
>
> Please help me.
> Thanks in advance.
>
> --
> *Regards,*
>
> *Mahebub *
>