Posted to user@beam.apache.org by bits horoscope <bi...@gmail.com> on 2020/12/18 23:45:18 UTC
GroupByKey is generating fake duplicates on Dataflow
Hi Apache Beam community, I have been dealing with a bug in a GroupByKey
step.
I'm reading an XML file containing many listings, something like this:
<?xml version="1.0"?>
<ListingsSet>
  <ListingInfo>
    <code>ABC-37717</code>
    <description>First Listing</description>
  </ListingInfo>
  <ListingInfo>
    <code>ABC-37718</code>
    <description>Second listing</description>
  </ListingInfo>
  <ListingInfo>
    <code>ABC-37719</code>
    <description>Third listing</description>
  </ListingInfo>
</ListingsSet>
I want to keep only the listings with a unique code and discard the
duplicate ones. I have checked the input and all the codes are different;
however, my Dataflow pipeline is flagging many listings as duplicates,
which cannot be right, because they are all distinct. I have read about
sharding and the other things Dataflow does in the cloud, so maybe the
windowing is seeing the same element more than once and then marking it
as a duplicate. But I don't know how to fix it. What would you recommend?
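As a sanity check, code uniqueness can be verified outside the pipeline entirely. A minimal sketch using only the JDK's built-in DOM parser (class and method names here are hypothetical, not part of the pipeline):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class CodeUniquenessCheck {

    // Returns every <code> value that appears more than once in the XML.
    // An empty result means all codes are unique in the input itself.
    static Set<String> findDuplicateCodes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList codes = doc.getElementsByTagName("code");
        Set<String> seen = new HashSet<>();
        Set<String> dups = new HashSet<>();
        for (int i = 0; i < codes.getLength(); i++) {
            String code = codes.item(i).getTextContent().trim();
            if (!seen.add(code)) {   // add() is false when already present
                dups.add(code);
            }
        }
        return dups;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><ListingsSet>"
                + "<ListingInfo><code>ABC-37717</code><description>First Listing</description></ListingInfo>"
                + "<ListingInfo><code>ABC-37718</code><description>Second listing</description></ListingInfo>"
                + "<ListingInfo><code>ABC-37719</code><description>Third listing</description></ListingInfo>"
                + "</ListingsSet>";
        System.out.println("duplicated codes: " + findDuplicateCodes(xml));
    }
}
```

If this prints an empty set for the real file, the duplicates really are being introduced inside the pipeline rather than by the data.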
This is the code of the pipeline:

final TupleTag<SerListingInfo> tagCodeUnique = new TupleTag<SerListingInfo>() {};
final TupleTag<SerListingInfo> tagCodeDup = new TupleTag<SerListingInfo>() {};

PCollectionTuple tupleCode = tuplePhones.get(tagOutListings)
    .apply(new RemoveDuplicates<SerListingInfo, String>(
        "RemoveDuplicatesCode",
        tagCodeUnique,
        tagCodeDup,
        new KeyMapperCode(),
        opts.getDuplicateCodeComparator()));
The transform:

@Override
public PCollectionTuple expand(PCollection<T> input) {
  return input
      .apply(WithKeys.of(this.mapper))
      .apply(GroupByKey.create())
      .apply("Pick" + this.getName(), ParDo
          .of(new FnPickDuplicate<K, T>(this.tagDuplicates,
              this.duplicateComparator, this.getName()))
          .withOutputTags(this.tagUnique, TupleTagList.of(this.tagDuplicates)));
}
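Since FnPickDuplicate isn't shown, here is a minimal plain-Java sketch (no Beam) of what I understand the group-then-pick step to do: group elements by key, keep one element per group as "unique", and route the rest to "duplicates". The Listing record, the assumption that the comparator orders each group, and that the first element wins are all my own stand-ins, not the real implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PickPerKey {

    // Minimal stand-in for SerListingInfo: just a code and a description.
    record Listing(String code, String description) {}

    // Groups listings by code; the comparator decides the winner of each
    // group, which goes to "unique", while the rest go to "duplicates".
    // Same shape as WithKeys -> GroupByKey -> pick-one ParDo above.
    static Map<String, List<Listing>> split(List<Listing> input,
                                            Comparator<Listing> pick) {
        Map<String, List<Listing>> byCode = input.stream()
                .collect(Collectors.groupingBy(Listing::code));

        List<Listing> unique = new ArrayList<>();
        List<Listing> duplicates = new ArrayList<>();
        for (List<Listing> group : byCode.values()) {
            group.sort(pick);                              // order the group
            unique.add(group.get(0));                      // keep the first
            duplicates.addAll(group.subList(1, group.size()));
        }

        Map<String, List<Listing>> out = new HashMap<>();
        out.put("unique", unique);
        out.put("duplicates", duplicates);
        return out;
    }

    public static void main(String[] args) {
        List<Listing> in = new ArrayList<>(List.of(
                new Listing("ABC-37717", "First Listing"),
                new Listing("ABC-37718", "Second listing"),
                new Listing("ABC-37719", "Third listing")));
        Map<String, List<Listing>> out =
                split(in, Comparator.comparing(Listing::description));
        System.out.println("unique: " + out.get("unique").size()
                + ", duplicates: " + out.get("duplicates").size());
    }
}
```

In this model, distinct codes can never land in the same group, so any "duplicates" for distinct codes would have to come from the same element reaching the pick step more than once.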
Andres Bravo
@SirAndyBrave <https://twitter.com/SirAndyBrave>