Posted to user@beam.apache.org by Shivam Singhal <sh...@gmail.com> on 2022/08/10 19:25:04 UTC

[JAVA] Handling repeated elements when merging two pcollections

I have two PCollections, CollectionA & CollectionB of type KV<String,
Byte[]>.


I would like to merge them into one PCollection but CollectionA &
CollectionB might have some elements with the same key. In those repeated
cases, I would like to keep the element from CollectionA & drop the
repeated element from CollectionB.

Does anyone know a simple method to do this?

Thanks,
Shivam Singhal

Re: [JAVA] Handling repeated elements when merging two pcollections

Posted by Luke Cwik via user <us...@beam.apache.org>.
Sorry, I should have said that you should Flatten and then do a GroupByKey,
not a CoGroupByKey, making the pipeline look like:
PCollectionA -> Flatten -> GroupByKey -> ParDo(EmitOnlyFirstElementPerKey)
PCollectionB -/

The CoGroupByKey will have one iterable per PCollection containing zero or
more elements depending on how many elements each PCollection had for that
key. So yes, you could solve it with CoGroupByKey, but Flatten+GroupByKey is
much simpler.
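
To make the CoGroupByKey shape described above concrete, here is a minimal
plain-Java sketch (no Beam dependency) of the per-key logic: each key carries
one list per input, and the value from CollectionA wins whenever it exists.
The names PreferAOverB, Grouped, coGroup, merge and kv are hypothetical and
exist only for illustration; in a real pipeline Beam would do the grouping and
only the choice made inside merge would live in a ParDo. Values are byte[]
here for brevity.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only; no Beam classes are used here.
final class PreferAOverB {

    /** Hypothetical stand-in for one joined result: a separate list per input. */
    static final class Grouped {
        final List<byte[]> fromA = new ArrayList<>();
        final List<byte[]> fromB = new ArrayList<>();
    }

    /** Convenience helper that builds a key/value pair. */
    static Map.Entry<String, byte[]> kv(String key, byte[] value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /** Models the join: every key maps to the values each input had for it. */
    static Map<String, Grouped> coGroup(
            List<Map.Entry<String, byte[]>> a, List<Map.Entry<String, byte[]>> b) {
        Map<String, Grouped> out = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : a) {
            out.computeIfAbsent(e.getKey(), k -> new Grouped()).fromA.add(e.getValue());
        }
        for (Map.Entry<String, byte[]> e : b) {
            out.computeIfAbsent(e.getKey(), k -> new Grouped()).fromB.add(e.getValue());
        }
        return out;
    }

    /** Per-key choice: take a value from A when A has one, otherwise fall back to B. */
    static Map<String, byte[]> merge(
            List<Map.Entry<String, byte[]>> a, List<Map.Entry<String, byte[]>> b) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, Grouped> e : coGroup(a, b).entrySet()) {
            Grouped g = e.getValue();
            // At least one of the two lists is non-empty for every key that was seen.
            List<byte[]> preferred = g.fromA.isEmpty() ? g.fromB : g.fromA;
            result.put(e.getKey(), preferred.get(0));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, byte[]>> a =
                Arrays.asList(kv("k1", new byte[] {1}), kv("k2", new byte[] {2}));
        List<Map.Entry<String, byte[]>> b =
                Arrays.asList(kv("k1", new byte[] {9}), kv("k3", new byte[] {3}));
        // k1 exists in both inputs, so the value from a is the one that is kept.
        merge(a, b).forEach((k, v) -> System.out.println(k + " -> " + Arrays.toString(v)));
    }
}
```

With Flatten+GroupByKey, all values for a key land in a single iterable
instead, so the origin of each value is only visible if it is carried in the
element itself.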


Re: [JAVA] Handling repeated elements when merging two pcollections

Posted by Shivam Singhal <sh...@gmail.com>.
Think this should solve my problem.

Thanks Evan and Luke!


Re: [JAVA] Handling repeated elements when merging two pcollections

Posted by Luke Cwik via user <us...@beam.apache.org>.
Use CoGroupByKey to join the two PCollections and emit only the first value
of each iterable with the key.

Duplicates will appear as iterables with more than one value while keys
without duplicates will have iterables containing exactly one value.


Re: [JAVA] Handling repeated elements when merging two pcollections

Posted by Evan Galpin <eg...@apache.org>.
Hi Shivam,

When you say "merge the PCollections" do you mean Flatten, or somehow join?
CoGroupByKey[1] would be a good choice if you need to join based on key.
You would then be able to implement application logic to keep one of the two
records, provided you can tell an element of CollectionA from an element of
CollectionB by examining the elements alone.

If there isn't a natural way of determining which element to keep by only
examining the elements themselves, you could further nest the data in a KV.
For example, if CollectionA holds data like KV<k1, v1> and CollectionB holds
KV<k1, v2>, you could transform these into something like
KV<k1, KV<"COLLECTION_A", v1>> and KV<k1, KV<"COLLECTION_B", v2>>. Then when
you CoGroupByKey, these elements would be grouped because both have key k1,
and the source PCollection could be identified from the key of the inner KV.
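
A rough plain-Java model of that nesting idea, assuming each value is wrapped
with its origin before the two inputs are combined and grouped by the outer
key. Tagged, Origin, groupByKey, pick and merge are hypothetical names for
illustration, not Beam APIs. Because the origin travels with the value, the
choice does not depend on the order in which values show up within a group,
which, as far as I know, Beam does not guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only; no Beam classes are used here.
final class TaggedMerge {

    enum Origin { COLLECTION_A, COLLECTION_B }

    /** Hypothetical element: the outer key plus an inner (origin, value) pair. */
    static final class Tagged {
        final String key;
        final Origin origin;
        final byte[] value;

        Tagged(String key, Origin origin, byte[] value) {
            this.key = key;
            this.origin = origin;
            this.value = value;
        }
    }

    /** Models grouping by the outer key after both tagged inputs were combined. */
    static Map<String, List<Tagged>> groupByKey(List<Tagged> combined) {
        Map<String, List<Tagged>> groups = new LinkedHashMap<>();
        for (Tagged t : combined) {
            groups.computeIfAbsent(t.key, k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    /** Returns a value tagged COLLECTION_A if the group has one, else the first value seen. */
    static byte[] pick(List<Tagged> group) {
        byte[] fallback = null;
        for (Tagged t : group) {
            if (t.origin == Origin.COLLECTION_A) {
                return t.value; // A wins regardless of its position in the group
            }
            if (fallback == null) {
                fallback = t.value;
            }
        }
        return fallback;
    }

    /** Applies pick to every group, yielding one value per key. */
    static Map<String, byte[]> merge(List<Tagged> combined) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        groupByKey(combined).forEach((key, group) -> result.put(key, pick(group)));
        return result;
    }

    public static void main(String[] args) {
        // The B element for k1 deliberately comes first to show order does not matter.
        List<Tagged> combined = Arrays.asList(
                new Tagged("k1", Origin.COLLECTION_B, new byte[] {9}),
                new Tagged("k1", Origin.COLLECTION_A, new byte[] {1}),
                new Tagged("k3", Origin.COLLECTION_B, new byte[] {3}),
                new Tagged("k2", Origin.COLLECTION_A, new byte[] {2}));
        merge(combined).forEach((k, v) -> System.out.println(k + " -> " + Arrays.toString(v)));
    }
}
```

In this model the same pick logic works whether the tagged elements were
grouped via a join or via a single combined collection, since it only looks
at the tag.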

Thanks,
Evan

[1]
https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/
