Posted to user@spark.apache.org by Mohit Anchlia <mo...@gmail.com> on 2015/08/12 01:06:33 UTC

Partitioning in spark streaming

How does partitioning in Spark work when it comes to streaming? What's the
best way to partition time series data grouped by a certain tag, such as
product categories like video, music, etc.?

Re: Partitioning in spark streaming

Posted by Tathagata Das <td...@databricks.com>.
Yes.

On Wed, Aug 12, 2015 at 12:12 PM, Mohit Anchlia <mo...@gmail.com>
wrote:

> Thanks! To write to HDFS, do I need to use a saveAs method?
>
> On Wed, Aug 12, 2015 at 12:01 PM, Tathagata Das <td...@databricks.com>
> wrote:
>
>> This is how Spark does it: each task writes its output to a uniquely named
>> temporary file and then, after the task completes successfully, atomically
>> renames the temp file to the expected file name <file>/<partition-XXX>.
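A minimal sketch of the write-then-rename commit described above, in plain
Python on a local filesystem. This is not Spark's or Hadoop's code: the
function name commit_partition and the part-NNNNN file naming are assumptions
made for illustration, and rename semantics on HDFS may differ from
os.replace on a local disk.

```python
import os
import tempfile


def commit_partition(output_dir: str, partition_id: int, lines) -> str:
    """Write one partition to a temp file, then rename it into place.

    Hypothetical helper: the name and the part-NNNNN layout are illustrative.
    """
    os.makedirs(output_dir, exist_ok=True)
    final_path = os.path.join(output_dir, f"part-{partition_id:05d}")

    # mkstemp picks a unique name, so concurrent writers never share a temp file.
    fd, tmp_path = tempfile.mkstemp(
        prefix=f".tmp-part-{partition_id:05d}-", dir=output_dir
    )
    try:
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
        # Publish only after the write finished; the rename is a single step
        # when source and destination are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed task leaves no partial output under the final name.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


with tempfile.TemporaryDirectory() as out_dir:
    path = commit_partition(out_dir, 0, ["video,12", "music,7"])
    print(os.path.basename(path))  # prints part-00000
```

Because readers only ever see the final name, a task that dies halfway leaves
nothing behind that could be mistaken for complete output.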
>>
>>
>> On Tue, Aug 11, 2015 at 9:53 PM, Mohit Anchlia <mo...@gmail.com>
>> wrote:
>>
>>> Thanks for the info. When data is written to HDFS, how does Spark keep
>>> the filenames written by multiple executors unique?

Re: Partitioning in spark streaming

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks for the info. When data is written to HDFS, how does Spark keep the
filenames written by multiple executors unique?

On Tue, Aug 11, 2015 at 9:35 PM, Hemant Bhanawat <he...@gmail.com>
wrote:

> Posting a comment from my previous mail post:
>
> When data is received from a stream source, the receiver creates blocks of
> data. A new block of data is generated every blockInterval milliseconds, so N
> blocks of data are created during each batchInterval, where N =
> batchInterval/blockInterval. An RDD is created on the driver for the blocks
> created during the batchInterval. The blocks generated during the
> batchInterval are the partitions of that RDD.
>
> Now if you want to repartition based on a key, a shuffle is needed.

Re: Partitioning in spark streaming

Posted by Hemant Bhanawat <he...@gmail.com>.
Posting a comment from my previous mail post:

When data is received from a stream source, the receiver creates blocks of
data. A new block of data is generated every blockInterval milliseconds, so N
blocks of data are created during each batchInterval, where N =
batchInterval/blockInterval. An RDD is created on the driver for the blocks
created during the batchInterval. The blocks generated during the
batchInterval are the partitions of that RDD.

Now if you want to repartition based on a key, a shuffle is needed.
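
As a worked example of N = batchInterval/blockInterval, here is a small
plain-Python sketch. The function name and the interval values are
assumptions chosen for illustration; in Spark the block interval is set with
the spark.streaming.blockInterval property. The count assumes one receiver
and data arriving in every block interval.

```python
def partitions_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Blocks (and therefore RDD partitions) one receiver produces per batch."""
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    # Whole blocks only: integer division drops any partial trailing interval.
    return batch_interval_ms // block_interval_ms


# Assumed figures: a 2-second batch with a 200 ms block interval.
print(partitions_per_batch(2000, 200))  # prints 10
```

So under these assumed settings each batch RDD starts with 10 partitions,
regardless of which keys the records carry. Grouping by key is a separate
step and is what triggers the shuffle.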


Re: Partitioning in spark streaming

Posted by Mohit Anchlia <mo...@gmail.com>.
I am also trying to understand how files are named when writing to Hadoop.
For example, how does the "saveAs" method ensure that each executor generates
unique files?

On Tue, Aug 11, 2015 at 4:21 PM, ayan guha <gu...@gmail.com> wrote:

> Partitioning, by itself, is a property of an RDD, so it is essentially no
> different in streaming, where each batch is one RDD. You can use
> partitionBy on the RDD and pass your custom partitioner function to it.
>
> One thing you should consider is how balanced your partitions are, i.e. your
> partitioning scheme should not skew too much data into one partition.
>
> Best
> Ayan

Re: Partitioning in spark streaming

Posted by ayan guha <gu...@gmail.com>.
Partitioning, by itself, is a property of an RDD, so it is essentially no
different in streaming, where each batch is one RDD. You can use
partitionBy on the RDD and pass your custom partitioner function to it.

One thing you should consider is how balanced your partitions are, i.e. your
partitioning scheme should not skew too much data into one partition.
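
To make the skew concern concrete, here is a rough plain-Python sketch of
key-based partitioning. It does not use Spark: partition_for stands in for
the custom partitioner function you would pass to partitionBy on a key-value
RDD, and the helper names and record counts are made up for illustration.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (hypothetical custom partitioner)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(records, num_partitions: int):
    """Count how many (key, value) records land in each partition."""
    counts = Counter(partition_for(key, num_partitions) for key, _ in records)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes) -> float:
    """Largest partition divided by the mean size; 1.0 is perfectly balanced."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Assumed workload: one dominant category and two small ones.
records = (
    [("video", i) for i in range(80)]
    + [("music", i) for i in range(15)]
    + [("books", i) for i in range(5)]
)
sizes = partition_sizes(records, 4)
print(sizes, round(skew_ratio(sizes), 2))
```

With only a few distinct tags, every record for a tag lands in the same
partition, so the dominant category alone pushes the ratio well above 1.0.
Partitioning on a composite key (for example tag plus a time bucket) is one
way to spread such a category out.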

Best
Ayan


-- 
Best Regards,
Ayan Guha