Posted to user@beam.apache.org by Josh <jo...@gmail.com> on 2018/02/22 19:38:58 UTC

Partitioning a stream randomly and writing to files with TextIO

Hi all,

I want to read a large dataset using BigQueryIO and then randomly
partition the rows into three chunks, where one partition has 80% of the
data and the other two have 10% each. I then want to write the three
partitions to three files in GCS.

I have a couple of quick questions:
(1) What would be the best way to do this random partitioning with Beam? I
think I could just use a PartitionFn that uses Math.random to determine
which of the three partitions an element should go to (sketched below),
but I'm not sure if there is a better approach.

(2) I would then take the resulting PCollectionList and use TextIO to write
each partition to a GCS file. For this, would all the data for the largest
partition need to fit into the memory of a single worker?
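
To make this concrete, here is a rough sketch of the pipeline I have in
mind (Beam Java SDK; the project, table, and bucket names are just
placeholders):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RandomSplitPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read from BigQuery and format each row as a line of text.
    PCollection<String> rows =
        p.apply(BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"))
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((TableRow row) -> row.toString()));

    // Route each element to one of three partitions: 80% / 10% / 10%.
    // Math.random makes this non-deterministic across pipeline runs.
    Partition.PartitionFn<String> randomSplit = (row, numPartitions) -> {
      double r = Math.random();
      return r < 0.8 ? 0 : (r < 0.9 ? 1 : 2);
    };
    PCollectionList<String> parts = rows.apply(Partition.of(3, randomSplit));

    // Write each partition to its own set of files on GCS.
    parts.get(0).apply(TextIO.write().to("gs://my-bucket/split-80/part"));
    parts.get(1).apply(TextIO.write().to("gs://my-bucket/split-10a/part"));
    parts.get(2).apply(TextIO.write().to("gs://my-bucket/split-10b/part"));

    p.run();
  }
}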

Thanks for any advice,

Josh

Re: Partitioning a stream randomly and writing to files with TextIO

Posted by Lukasz Cwik <lc...@google.com>.
There shouldn't be any swapping or memory concerns if you're using Dataflow
(unless each individual element is large, on the order of GiBs). Dataflow
processes small segments of the files in parallel and writes those results
out before processing more, so the entire PCollection never needs to be in
memory at any given time.

Re: Partitioning a stream randomly and writing to files with TextIO

Posted by Carlos Alonso <ca...@mrcalonso.com>.
Hi Lukasz, could you please elaborate a bit more on the 2nd part? What's
important to know, from the developer's perspective, about Dataflow's
memory management? How big can partitions grow, and what are the
performance considerations? It sounds as if the workers will "swap" to
disk when partitions get very big, right?

Thanks!

Re: Partitioning a stream randomly and writing to files with TextIO

Posted by Josh <jo...@gmail.com>.
I see, thanks Lukasz - I will try setting that up. Good shout on using
hashCode to keep the pipeline deterministic!

Re: Partitioning a stream randomly and writing to files with TextIO

Posted by Lukasz Cwik <lc...@google.com>.
1) Creating a PartitionFn is the right way to go. I would suggest using
something that gives you stable output so you can replay your pipeline;
that is useful for tests as well. Using something like the object's
hashCode and dividing the hash space into 80%/10%/10% segments could work;
just make sure that if you go with hashCode, the hash function distributes
elements well.
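
As a rough sketch of that (assuming String elements here; the element type
and the exact hash are yours to adapt):

// Deterministic 80%/10%/10% split keyed on the element's hash code.
// Math.floorMod keeps the bucket non-negative even for negative hashes.
Partition.PartitionFn<String> stableSplit = (element, numPartitions) -> {
  int bucket = Math.floorMod(element.hashCode(), 100); // 0..99
  if (bucket < 80) return 0;  // 80% partition
  if (bucket < 90) return 1;  // first 10% partition
  return 2;                   // second 10% partition
};
PCollectionList<String> parts = rows.apply(Partition.of(3, stableSplit));

Because the split depends only on the element itself, re-running the
pipeline over the same input produces the same partitions.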

2) This is runner dependent, but most runners don't require storing
everything in memory. For example, if you were using Dataflow, you would
only need to store a couple of elements in memory, not the entire
PCollection.
