
Posted to user@spark.apache.org by Kevin Mellott <ke...@gmail.com> on 2016/10/06 21:22:07 UTC

Spark Streaming Advice

I'm attempting to implement a Spark Streaming application that will consume
application log messages from a message broker and store the information in
HDFS. During the data ingestion, we apply a custom schema to the logs,
partition by application name and log date, and then save the information
as parquet files.

All of this works great, except we end up having a large number of parquet
files created. It's my understanding that Spark Streaming is unable to
control the number of files that get generated in each partition; can
anybody confirm that is true?
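
For readers with the same question: the number of files written into each partition directory generally follows the number of tasks that hold rows for that directory, so it can be influenced before the write. Below is a minimal sketch, not the poster's actual code, assuming Spark 2.x, a hypothetical `LogEntry` case class standing in for the custom schema, and a hypothetical `outputPath`. Repartitioning by the same columns passed to `partitionBy` should leave each application/date directory with at most one new file per batch.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type; the real schema is not shown in the thread.
case class LogEntry(appName: String, logDate: String, level: String, message: String)

object LogIngest {
  // Appends each non-empty batch as Parquet, partitioned by appName and logDate.
  def saveLogs(logs: DStream[LogEntry], outputPath: String): Unit =
    logs.foreachRDD { (rdd: RDD[LogEntry]) =>
      if (!rdd.isEmpty()) {
        // Reuse (or lazily create) a session built from the streaming job's configuration.
        val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
        import spark.implicits._

        rdd.toDF()
          // Route all rows for one (appName, logDate) pair to a single task,
          // so that pair's directory receives one file from this batch.
          .repartition(col("appName"), col("logDate"))
          .write
          .mode(SaveMode.Append)
          .partitionBy("appName", "logDate")
          .parquet(outputPath)
      }
    }
}
```

This bounds the files per batch but not the total over time; a short batch interval still produces many small files, so a periodic compaction job or a longer interval may also be needed.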

Also, has anybody else run into a similar situation regarding data
ingestion with Spark Streaming and do you have any tips to share? Our end
goal is to store the information in a way that makes it efficient to query,
using a tool like Hive or Impala.

Thanks,
Kevin

Re: Spark Streaming Advice

Posted by Kevin Mellott <ke...@gmail.com>.

The batch interval was set to 30 seconds; however, after getting the
parquet files to save faster I lowered the interval to 10 seconds. The
number of log messages contained in each batch varied from just a few up to
around 3500, with the number of partitions ranging from 1 to around 15.

I will have to check out HBase as well; I've heard good things!

Thanks,
Kevin


Re: Spark Streaming Advice

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi Kevin,

What is the streaming interval (batch interval) above?

I do analytics on streaming trade data, but after manipulating individual
messages I store the selected ones in HBase. Very fast.

HTH

Dr Mich Talebzadeh







Re: Spark Streaming Advice

Posted by Jörn Franke <jo...@gmail.com>.

Your file size is too small, and this has a significant impact on the NameNode. Use HBase or maybe HAWQ to store small writes.
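
To put rough numbers on that concern, here is a back-of-the-envelope sketch using figures reported elsewhere in this thread (10-second batches, up to about 15 files per batch, about 8KB per file). It assumes the stream runs continuously and every batch hits the upper bound, so it is a worst-case estimate rather than a measurement.

```scala
object SmallFileEstimate {
  // Figures taken from the thread; treat them as approximate.
  val batchIntervalSeconds = 10L
  val maxFilesPerBatch = 15L
  val approxFileSizeKb = 8L

  // Number of batches in 24 hours at the given interval.
  val batchesPerDay: Long = 24L * 60L * 60L / batchIntervalSeconds

  // Worst-case number of Parquet files created per day.
  val maxFilesPerDay: Long = batchesPerDay * maxFilesPerBatch

  // Total data volume those files would hold, in MB.
  val maxDataPerDayMb: Double = maxFilesPerDay * approxFileSizeKb / 1024.0

  def main(args: Array[String]): Unit = {
    println(s"batches per day:   $batchesPerDay")
    println(s"max files per day: $maxFilesPerDay")
    println(f"max data per day:  $maxDataPerDayMb%.1f MB")
  }
}
```

Under these assumptions that is roughly 130,000 files a day holding about 1 GB in total. The NameNode keeps metadata for every file in memory, so the file count, not the data volume, is what hurts.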


Re: Spark Streaming Advice

Posted by Kevin Mellott <ke...@gmail.com>.

Whilst working on this application, I found a setting that drastically
improved the performance of my particular Spark Streaming application. I'm
sharing the details in hopes that it may help somebody in a similar
situation.

As my program ingested information into HDFS (as parquet files), I noticed
that the time to process each batch was significantly greater than I
anticipated. Whether I was writing a single parquet file (around 8KB) or
around 10-15 files (8KB each), that step of the processing was taking
around 30 seconds. Once I set the configuration below, this operation
dropped from 30 seconds to around 1 second.

// ssc = instance of StreamingContext
ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
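
The same property can also be supplied when the context is built, because Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configuration. A minimal sketch follows; the application name is a placeholder, the 10-second interval is the one mentioned in this thread, and the master URL is assumed to come from spark-submit.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSetup {
  // Builds a StreamingContext with Parquet summary-metadata files disabled.
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("log-ingest") // hypothetical application name
      // The "spark.hadoop." prefix is stripped and the remainder is set
      // on the Hadoop configuration used by the job.
      .set("spark.hadoop.parquet.enable.summary-metadata", "false")

    new StreamingContext(conf, Seconds(10))
  }
}
```

Setting it this way keeps the option alongside the rest of the job's configuration, and it can equally be passed on the command line with `--conf`.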

I've also verified that the parquet files being generated are usable by
both Hive and Impala.

Hope that helps!
Kevin
