Posted to dev@chukwa.apache.org by "Ari Rabkin (JIRA)" <ji...@apache.org> on 2009/06/30 10:30:47 UTC

[jira] Created: (CHUKWA-338) duplicate suppression in archiver

duplicate suppression in archiver
---------------------------------

                 Key: CHUKWA-338
                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
             Project: Hadoop Chukwa
          Issue Type: New Feature
          Components: Data Processors
            Reporter: Ari Rabkin


Right now, Archiver uses an identity reducer.

It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CHUKWA-338) duplicate suppression in archiver

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730501#action_12730501 ] 

Jerome Boulon commented on CHUKWA-338:
--------------------------------------

The ChukwaArchiveKey contains a TimePartition, so you may not remove all duplicates (any that fall outside the 1-hour TimePartition window will be missed), and this will also be true in case of backfilling.
Is that a problem for you?

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Updated: (CHUKWA-338) duplicate suppression in archiver

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated CHUKWA-338:
---------------------------------

    Component/s:     (was: Data Processors)
                 MR Data Processors

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: MR Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.4.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Commented: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730523#action_12730523 ] 

Ari Rabkin commented on CHUKWA-338:
-----------------------------------

Agreed.  That's what I meant by "use content to resolve dups".

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Updated: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated CHUKWA-338:
------------------------------

    Fix Version/s:     (was: 0.3.0)
                   0.4.0

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.4.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Commented: (CHUKWA-338) duplicate suppression in archiver

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730510#action_12730510 ] 

Jerome Boulon commented on CHUKWA-338:
--------------------------------------

Ari,
Yes, a secondary sort (grouping comparator) will solve the issue, but I'm not sure if all current adaptors are in line with the concept of virtual offset, so that would be the first thing to validate.
Also, if you have more than one value for the same key, you may want to double check that they actually have the same size/content to make sure it's a real duplicate and not an issue with the virtual offset, especially after rotation.

Since, in my mind, the archiver is a background process, it should not be too bad to always check for real duplicates vs. false duplicates (same SequenceId but not the same content).
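
The real-versus-false duplicate check described above can be sketched as follows. This is only an illustration in Python, not the code from archiveDupSuppress.patch: the `Chunk` class, its field names, and `split_duplicates` are hypothetical, and the input is assumed to be the list of chunks that arrived at the reducer under one key (same stream, same SequenceId).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a Chukwa chunk; field names are assumptions."""
    stream: str   # identifies the source stream
    seq_id: int   # SequenceId shared by all chunks passed in together
    data: bytes   # chunk payload


def split_duplicates(chunks: Iterable[Chunk]) -> Tuple[List[Chunk], int]:
    """Return (chunks to keep, number of real duplicates dropped).

    A chunk is a real duplicate only if an already-kept chunk has the
    same size and the same content. Chunks that share the SequenceId
    but differ in content are false duplicates and are all kept.
    """
    unique: List[Chunk] = []
    dropped = 0
    for chunk in chunks:
        # Compare sizes first so unequal chunks are rejected cheaply.
        if any(len(chunk.data) == len(u.data) and chunk.data == u.data
               for u in unique):
            dropped += 1
        else:
            unique.append(chunk)
    return unique, dropped
```

If more than one chunk survives for a single key, that points to a virtual-offset problem (for example after rotation) rather than a retransmission, so a real implementation would probably want to log or count that case.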





> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Assigned: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin reassigned CHUKWA-338:
---------------------------------

    Assignee: Ari Rabkin

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Commented: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725526#action_12725526 ] 

Ari Rabkin commented on CHUKWA-338:
-----------------------------------

There's one subtle case to think about.  The sequence ID in a chunk is based on the LAST byte in the chunk.  So what if you have two different chunks that end at the same place, one longer than another?

Answer:  Keep track of how much you've already written for that stream, and act accordingly.  Slightly tricky code, but not monstrous, since the reduce gets chunks in sorted order.
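
A minimal sketch of the "keep track of how much you've already written" idea, in Python for brevity. It rests on several assumptions that are not confirmed by this thread: each chunk is modelled as a `(seq_id, data)` pair for a single stream, the start offset is taken to be `seq_id - len(data)`, and the chunks are ordered by start offset (longest first on ties) rather than by the reducer's key order. The function name `suppress_overlap` is hypothetical. Note that this sketch also trims partially overlapping chunks, which goes beyond plain duplicate suppression.

```python
from typing import List, Tuple


def suppress_overlap(chunks: List[Tuple[int, bytes]]) -> List[Tuple[int, bytes]]:
    """Emit each byte of a stream at most once.

    Each input item is (seq_id, data), where seq_id is the offset just
    past the LAST byte of the chunk. Returns (start_offset, data) pieces
    that do not overlap one another.
    """
    emitted: List[Tuple[int, bytes]] = []
    written = 0  # offset just past the last byte emitted so far

    # Order by start offset; on ties, the longer chunk goes first so a
    # shorter chunk with the same start is recognised as already covered.
    ordered = sorted(chunks, key=lambda c: (c[0] - len(c[1]), -len(c[1])))

    for seq_id, data in ordered:
        start = seq_id - len(data)
        if seq_id <= written:
            continue  # every byte of this chunk has been written already
        skip = max(0, written - start)  # bytes at the front already written
        emitted.append((start + skip, data[skip:]))
        written = seq_id
    return emitted
```

For example, given one chunk covering offsets 0-10, a shorter chunk covering 5-10, and a third covering 8-15, the shorter chunk is dropped entirely and only the last five bytes of the third chunk are emitted.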

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Updated: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated CHUKWA-338:
------------------------------

    Fix Version/s: 0.5.0
                       (was: 0.4.0)

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: MR Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.5.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Updated: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated CHUKWA-338:
------------------------------

    Fix Version/s: 0.3.0
           Status: Patch Available  (was: Open)

Suppresses duplicates, but doesn't merge overlapped chunks.

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Commented: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730502#action_12730502 ] 

Ari Rabkin commented on CHUKWA-338:
-----------------------------------

Hrm.  It actually is a problem, but not currently a decisive one.  Good catch.  Reopening issue.

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Reopened: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin reopened CHUKWA-338:
-------------------------------


As per Jerome's observation, this doesn't catch all duplicates.

Proposed fix: Change sort keys, ignore time partition for purposes of assigning chunks to a reducer. And be more aggressive in using content to resolve dups.
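
One way to picture the proposed key change, sketched in Python rather than as Hadoop code. The `ArchiveKey` fields below are an assumption about what the archive key carries (a time partition, a data type, a stream name, and a sequence ID), and `partition` and `sort_key` are hypothetical helpers, not Chukwa or Hadoop APIs. The point is only that the time partition is left out of both the reducer assignment and the ordering, so copies of the same chunk that landed in different hourly partitions still meet in the same reducer.

```python
import zlib
from typing import NamedTuple, Tuple


class ArchiveKey(NamedTuple):
    """Hypothetical model of the archive key; field names are assumptions."""
    time_partition: int  # start of the hourly window, in ms since the epoch
    data_type: str
    stream_name: str
    seq_id: int


def partition(key: ArchiveKey, num_reducers: int) -> int:
    """Choose a reducer from the stream identity only, ignoring time_partition."""
    ident = f"{key.data_type}\0{key.stream_name}".encode("utf-8")
    # crc32 is used because it is stable across processes, unlike hash() on str.
    return zlib.crc32(ident) % num_reducers


def sort_key(key: ArchiveKey) -> Tuple[str, str, int]:
    """Order chunks within a reducer by stream, then by sequence ID."""
    return (key.data_type, key.stream_name, key.seq_id)
```

With this layout, two keys that differ only in `time_partition` go to the same reducer and sort next to each other, which is what lets a content comparison resolve them.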

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Updated: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated CHUKWA-338:
------------------------------

    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

Resolved by CHUKWA-346.

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.

[jira] Updated: (CHUKWA-338) duplicate suppression in archiver

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated CHUKWA-338:
------------------------------

    Attachment: archiveDupSuppress.patch

> duplicate suppression in archiver
> ---------------------------------
>
>                 Key: CHUKWA-338
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-338
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>            Reporter: Ari Rabkin
>         Attachments: archiveDupSuppress.patch
>
>
> Right now, Archiver uses an identity reducer.
> It should be straightforward to write a custom reducer that does duplicate detection and suppression if we get multiple chunks with the same key.
