Posted to dev@apex.apache.org by "Francis Fernandes (JIRA)" <ji...@apache.org> on 2016/10/21 13:19:58 UTC

[jira] [Commented] (APEXMALHAR-2309) TimeBasedDedupOperator marks new tuples as duplicates if expired tuples exist

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595046#comment-15595046 ] 

Francis Fernandes commented on APEXMALHAR-2309:
-----------------------------------------------

This happens during the asynchronous handling of tuples. Every tuple being processed in the processAuxiliary method of AbstractDeduper is tracked in asyncEvents, which is then used to check later tuples for duplicates. Currently only the keys are compared, so a tuple that arrives with the same key but a time beyond the expiry window is still marked as a duplicate. The fix is to also compare the time when an existing key is found: if the new tuple's time is greater than the existing tuple's time, emit it as unique.
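
As a rough sketch of the idea (not the actual Malhar code), the check could look something like the following. The class PendingKeyCheck, its map of pending keys and the isDuplicate method are hypothetical names made up for illustration. The sketch also assumes that "greater than the existing tuple time" means "later by more than the expiry window", and that expireBefore is given in seconds while tuple times are in milliseconds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the tuples tracked while async processing is pending.
// This is NOT the AbstractDeduper API; it only illustrates the key + time comparison.
class PendingKeyCheck {
    // key -> event time (ms) of the pending tuple seen for that key
    private final Map<String, Long> pending = new HashMap<>();
    // Assumed expiry window in milliseconds (e.g. expireBefore = 10 s -> 10_000 ms)
    private final long expiryMillis;

    PendingKeyCheck(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Returns true if the tuple should be treated as a duplicate of a pending tuple. */
    boolean isDuplicate(String key, long eventTimeMillis) {
        Long existing = pending.get(key);
        if (existing == null) {
            // No pending tuple with this key: unique.
            pending.put(key, eventTimeMillis);
            return false;
        }
        // A key-only comparison would report a duplicate at this point.
        // Comparing times as well lets a tuple beyond the expiry window through.
        if (eventTimeMillis - existing > expiryMillis) {
            pending.put(key, eventTimeMillis);
            return false;
        }
        return true;
    }
}
```

With a 10 second window, the three tuples from the issue description (keys "10", "11", "10" at 1474614305000, 1474614315000 and 1474614325000) would all come out as unique under this sketch, since the two tuples with key "10" are 20 seconds apart.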

> TimeBasedDedupOperator marks new tuples as duplicates if expired tuples exist
> -----------------------------------------------------------------------------
>
>                 Key: APEXMALHAR-2309
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2309
>             Project: Apache Apex Malhar
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Francis Fernandes
>            Assignee: Francis Fernandes
>
> The deduper marks valid tuples outside the expiry window as duplicates. 
> Consider the following configuration (number of buckets = 1 )
> {code}
>   <property>
>     <name>dt.application.DedupTestApp.operator.Deduper.prop.expireBefore</name>
>     <value>10</value>
>   </property>
>   <property>
>     <name>dt.application.DedupTestApp.operator.Deduper.prop.bucketSpan</name>
>     <value>10</value>
>   </property>
> {code}
> The data piped in is:
> {code}
> "10",1474614305000,"Test"
> "11",1474614315000,"Test"
> "10",1474614325000,"Test"
> {code}
> The 3rd tuple is valid since it is outside the expiry window. But it is marked as a duplicate because the first tuple, although expired, is still present in Bucket.flash.
> The issue happens when the expiry duration is less than the checkpointing duration.


