Posted to dev@chukwa.apache.org by "Guille -bisho- (JIRA)" <ji...@apache.org> on 2010/03/10 16:37:27 UTC

[jira] Created: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Store the cluster in the key for performance and easier customization on mappers
--------------------------------------------------------------------------------

                 Key: CHUKWA-462
                 URL: https://issues.apache.org/jira/browse/CHUKWA-462
             Project: Hadoop Chukwa
          Issue Type: Improvement
          Components: Data Processors
            Reporter: Guille -bisho-


Right now the Chukwa framework stores the destination cluster as a tag in the Chunk. The tags are then copied to each ChukwaRecord, and before a record is stored, the cluster is parsed out of its tags with a regular expression.

- Applying a regular expression to every record is slow.
- It's harder to change the destination cluster from the mapper: you have to tweak the tags field.
- Storing the cluster on every record takes unneeded space.

The proposed patch:

- Extracts the cluster from the chunk tags just once per chunk, which is much faster.
- Stores the cluster in the key, so it's easy to recover.
- Makes it easy to tweak from the mapper: just alter it with key.setClusterName(String clusterName).
- Strips the cluster from the tags field of the resulting Chukwa records. If the tags field is then empty, it skips setting the tags field in the record entirely.
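
The steps above can be sketched roughly as follows. This is only an illustration and not the code from the attached diff: the Key and Record classes are hypothetical stand-ins for ChukwaRecordKey and ChukwaRecord, the tag format (space-separated name="value" pairs such as cluster="demo") is an assumption, and the "undefined" fallback is a placeholder chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterInKeySketch {
    // Assumed tag format: space-separated name="value" pairs, e.g. cluster="demo" rack="r1".
    private static final Pattern CLUSTER_TAG =
        Pattern.compile("(?:^|\\s)cluster=\"([^\"]*)\"");

    /** Hypothetical stand-in for the record key; only the cluster field matters here. */
    static final class Key {
        private String clusterName;
        void setClusterName(String clusterName) { this.clusterName = clusterName; }
        String getClusterName() { return clusterName; }
    }

    /** Hypothetical stand-in for a record; tags is null when nothing is left to store. */
    static final class Record {
        final String body;
        final String tags;
        Record(String body, String tags) { this.body = body; this.tags = tags; }
    }

    /** Returns the cluster named in the tags, or a placeholder if none is present. */
    static String extractCluster(String chunkTags) {
        Matcher m = CLUSTER_TAG.matcher(chunkTags);
        return m.find() ? m.group(1) : "undefined"; // placeholder default (assumption)
    }

    /** Returns the tags with the cluster tag removed. */
    static String stripCluster(String chunkTags) {
        return CLUSTER_TAG.matcher(chunkTags).replaceAll("").trim();
    }

    /** Runs the regular expression once per chunk instead of once per record. */
    static List<Record> processChunk(String chunkTags, List<String> lines, Key key) {
        String cluster = extractCluster(chunkTags);
        String remainingTags = stripCluster(chunkTags);
        key.setClusterName(cluster); // a mapper can still override this later

        List<Record> records = new ArrayList<>();
        for (String line : lines) {
            // Skip the tags field entirely when only the cluster tag was present.
            records.add(new Record(line, remainingTags.isEmpty() ? null : remainingTags));
        }
        return records;
    }
}
```

A mapper that wants to send the data elsewhere would then call key.setClusterName("other") after processChunk() instead of rewriting the tags string.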


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Guille -bisho- (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844112#action_12844112 ] 

Guille -bisho- commented on CHUKWA-462:
---------------------------------------

The problem is that some mappers don't call buildGenericRecord() on the parent class. I have fixed all of them, and the tests now run fine.

[jira] Updated: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Guille -bisho- (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guille -bisho- updated CHUKWA-462:
----------------------------------

    Attachment: cluster_in_ChukwaRecordKey.v4.diff

Fixes mappers that don't use the parent class for building the key, plus minor fixes.

[jira] Updated: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Guille -bisho- (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guille -bisho- updated CHUKWA-462:
----------------------------------

    Attachment:     (was: cluster_in_ChukwaRecordKey.v4.diff)

[jira] Commented: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Guille -bisho- (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844127#action_12844127 ] 

Guille -bisho- commented on CHUKWA-462:
---------------------------------------

There is still one test failing:
Testcase: testFSMBuilder_JobHistory020(org.apache.hadoop.chukwa.analysis.salsa.fsm.TestFSMBuilder):	FAILED
Error running FSMBuilder: java.io.IOException: Job failed!
junit.framework.AssertionFailedError: Error running FSMBuilder: java.io.IOException: Job failed!
at org.apache.hadoop.chukwa.analysis.salsa.fsm.TestFSMBuilder.testFSMBuilder_JobHistory020(TestFSMBuilder.java:354)

I don't know why, because the cluster is extracted correctly. I will continue with this on Tuesday; I'm traveling. If anyone knows what could be happening here, please tell me.

[jira] Updated: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Guille -bisho- (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guille -bisho- updated CHUKWA-462:
----------------------------------

    Attachment: cluster_in_ChukwaRecordKey.v3.diff

Proposed patch

[jira] Commented: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Guille -bisho- (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843945#action_12843945 ] 

Guille -bisho- commented on CHUKWA-462:
---------------------------------------

Sure, I'll take this on.

[jira] Commented: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Eric Yang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843849#action_12843849 ] 

Eric Yang commented on CHUKWA-462:
----------------------------------

+1 Looks good, and it'll speed up demux. The original record design aimed for generalization rather than speed. In real use cases, it's better to have the concept of grouping data by cluster, and that cluster concept is already set in stone in Chukwa. Hence, this performance improvement is a reasonable trade-off for "clusterName" becoming a reserved keyword in Chukwa.

[jira] Commented: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843722#action_12843722 ] 

Ari Rabkin commented on CHUKWA-462:
-----------------------------------

I am okay with this, but don't know that part of the code as well as Eric and Jerome.  I don't feel comfortable committing it without giving them a chance to comment.   Eric, can you confirm that this should go in?

[jira] Updated: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Guille -bisho- (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guille -bisho- updated CHUKWA-462:
----------------------------------

    Attachment: cluster_in_ChukwaRecordKey.v4.diff

Adds setClusterName() to mappers that don't use the AbstractMapper helper, plus a fix in the regexp.

[jira] Commented: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Eric Yang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843861#action_12843861 ] 

Eric Yang commented on CHUKWA-462:
----------------------------------

Test cases failed after applying this patch:

{noformat}
[junit] Running org.apache.hadoop.chukwa.analysis.salsa.fsm.TestFSMBuilder
[junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 123.183 sec
[junit] Running org.apache.hadoop.chukwa.tools.backfilling.TestBackfillingLoader
[junit] Tests run: 4, Failures: 3, Errors: 0, Time elapsed: 73.1 sec
[junit] Test org.apache.hadoop.chukwa.tools.backfilling.TestBackfillingLoader FAILED
[junit] Running org.apache.hadoop.chukwa.util.TestCreateRecordFile
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.792 sec
[junit] Test org.apache.hadoop.chukwa.util.TestCreateRecordFile FAILED
[junit] Running org.apache.hadoop.chukwa.util.TestFilter
[junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 0.131 sec
{noformat}


[jira] Commented: (CHUKWA-462) Store the cluster in the key for performance and easier customization on mappers

Posted by "Eric Yang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843863#action_12843863 ] 

Eric Yang commented on CHUKWA-462:
----------------------------------

org.apache.hadoop.chukwa.util.TestFilter failed for another reason. I will fix that one. Please review the rest. Thanks.
