Posted to user@beam.apache.org by bits horoscope <bi...@gmail.com> on 2020/12/18 23:45:18 UTC

GroupByKey is generating fake duplicates on Dataflow

Hi Apache Beam community, I have been dealing with a bug in a GroupByKey
step.

I'm reading an XML file containing many listings, something like this:
<?xml version="1.0"?>
<ListingsSet>
<ListingInfo>
<code>ABC-37717</code>
<description>First Listing</description>
</ListingInfo>
<ListingInfo>
<code>ABC-37718</code>
<description>Second listing</description>
</ListingInfo>
<ListingInfo>
<code>ABC-37719</code>
<description>Third listing</description>
</ListingInfo>
</ListingsSet>

I want to keep only the listings with a unique code and discard the
duplicates. I have checked the input and all the codes are different;
however, my Dataflow pipeline flags many listings as duplicates
(which is wrong, because all of them are distinct). I have read about
sharding and the other work Dataflow does in the cloud, so maybe the
windowing sees the same element more than once and then marks it as a
duplicate. But I don't know how to correct it. What would you recommend?

This is the code of the pipeline:

final TupleTag<SerListingInfo> tagCodeUnique = new TupleTag<SerListingInfo>() {};

final TupleTag<SerListingInfo> tagCodeDup = new TupleTag<SerListingInfo>() {};

PCollectionTuple tupleCode = tuplePhones.get(tagOutListings)
    .apply(new RemoveDuplicates<SerListingInfo, String>(
        "RemoveDuplicatesCode",
        tagCodeUnique,
        tagCodeDup,
        new KeyMapperCode(),
        opts.getDuplicateCodeComparator()));


The transform:


@Override
public PCollectionTuple expand(PCollection<T> input) {
    return input
        .apply(WithKeys.of(this.mapper))
        .apply(GroupByKey.create())
        .apply("Pick" + this.getName(), ParDo
            .of(new FnPickDuplicate<K, T>(this.tagDuplicates, this.duplicateComparator, this.getName()))
            .withOutputTags(this.tagUnique, TupleTagList.of(this.tagDuplicates)));
}
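
To make the expected behaviour concrete, here is a minimal plain-Java sketch (no Beam, runs locally) of what the transform above is meant to do: key each listing by its code, group by that key, send one element per group to the "unique" output and the rest to the "duplicates" output. The names DedupSketch, Listing, Split and splitByCode are hypothetical and only for illustration; Listing stands in for SerListingInfo, and keeping the first element seen in each group is an assumption that stands in for the comparator, whose logic is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    // Hypothetical stand-in for SerListingInfo: only the two fields from the sample XML.
    static final class Listing {
        final String code;
        final String description;

        Listing(String code, String description) {
            this.code = code;
            this.description = description;
        }
    }

    // Holds the two outputs, mirroring tagCodeUnique and tagCodeDup.
    static final class Split {
        final List<Listing> unique = new ArrayList<>();
        final List<Listing> duplicates = new ArrayList<>();
    }

    static Split splitByCode(List<Listing> input) {
        // Group listings by code, playing the role of WithKeys + GroupByKey.
        Map<String, List<Listing>> groups = new LinkedHashMap<>();
        for (Listing listing : input) {
            groups.computeIfAbsent(listing.code, k -> new ArrayList<>()).add(listing);
        }

        // Assumption: the first element seen in a group is kept as unique,
        // every other element of that group is reported as a duplicate.
        Split result = new Split();
        for (List<Listing> group : groups.values()) {
            result.unique.add(group.get(0));
            result.duplicates.addAll(group.subList(1, group.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The three listings from the sample XML above.
        List<Listing> input = Arrays.asList(
            new Listing("ABC-37717", "First Listing"),
            new Listing("ABC-37718", "Second listing"),
            new Listing("ABC-37719", "Third listing"));

        Split result = splitByCode(input);
        System.out.println("unique=" + result.unique.size()
            + " duplicates=" + result.duplicates.size());
    }
}
```

With the three sample listings, every code forms its own group, so all of them should land in the unique output and none in the duplicates output. A duplicate can only appear here if two input elements really map to the same key, which is the behaviour I expected from the pipeline.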
Andres Bravo
<https://twitter.com/SirAndyBrave> @SirAndyBrave