Posted to dev@chukwa.apache.org by "Eric Yang (JIRA)" <ji...@apache.org> on 2010/04/28 00:03:35 UTC

[jira] Created: (CHUKWA-481) Improve demux reducer partitioning algorithm

Improve demux reducer partitioning algorithm
--------------------------------------------

Key: CHUKWA-481
URL: https://issues.apache.org/jira/browse/CHUKWA-481
Project: Hadoop Chukwa
Issue Type: Improvement
Components: MR Data Processors
Environment: Redhat EL 5.1, Java 6
Reporter: Eric Yang
Assignee: Eric Yang

Reducer partitioning for demux could be redefined to optimize for two different use cases:

Case #1: demux is responsible for crunching large volumes per data type, across about a dozen types. It probably makes more sense to partition reducers by time grouping + data type (extending TotalOrderPartitioner), i.e. a user gets an evenly distributed workload across reducers based on time interval. A distributed hash table such as HBase/Voldemort could be the downstream system that stores/caches the data for serving. This model is great for collecting fixed-time-interval logs such as Hadoop metrics, and for ExecAdaptor, which generates repetitive time series summaries.
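
A minimal sketch of the case #1 idea, for illustration only. It assumes a simple hash of (data type, time bucket) rather than the sampled key ranges that TotalOrderPartitioner works from, so it is a simpler stand-in for that approach and not an implementation of it. The class name, the method signature and the one-minute bucket width are all hypothetical; none of them come from Chukwa.

```java
// Hypothetical sketch: plain Java, no Hadoop or Chukwa classes.
final class TimeTypePartitionSketch {
    // Assumed bucket width (one minute); the issue does not specify one.
    static final long BUCKET_MILLIS = 60_000L;

    // Mirrors the shape of a Hadoop partitioner call: returns a reducer index
    // in [0, numPartitions) for a record's data type and timestamp.
    static int getPartition(String dataType, long timestampMillis, int numPartitions) {
        long bucket = Math.floorDiv(timestampMillis, BUCKET_MILLIS);
        int hash = 31 * dataType.hashCode() + Long.hashCode(bucket);
        // floorMod keeps the result non-negative even when hash is negative.
        return Math.floorMod(hash, numPartitions);
    }
}
```

Records of one data type that fall in the same time bucket land on the same reducer, while consecutive buckets of that type move to different reducers, so a single high-volume type is spread across reducer slots over time.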

Case #2: demux is responsible for crunching hundreds of different data types, with a small volume for each. The current demux implementation uses this model, where a single data type is reduced by one reducer slot (ChukwaRecordPartitioner). One drawback of this model is that every data type must have a similar volume; otherwise, the largest data type becomes the long tail of the MapReduce job. Materialized reports are easy to generate with this model because the single reducer per data type sees all data of the given demux run. This model works well for many different applications and for all logging through the Chukwa Log4j appender, e.g. web crawls or log file indexing/viewing.
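
To make the long-tail drawback concrete, here is a rough sketch that assumes the partition is chosen from the data type alone. It is not the actual ChukwaRecordPartitioner source; the class, the method names and the record counts used with it are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: plain Java, not the actual ChukwaRecordPartitioner source.
final class TypeOnlyPartitionSketch {
    // Every record of a data type maps to the same reducer index.
    static int getPartition(String dataType, int numPartitions) {
        // Masking the sign bit keeps the index non-negative.
        return (dataType.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sums record counts per reducer so that skew between types is visible.
    static long[] loadPerReducer(Map<String, Long> recordsPerType, int numPartitions) {
        long[] load = new long[numPartitions];
        for (Map.Entry<String, Long> entry : recordsPerType.entrySet()) {
            load[getPartition(entry.getKey(), numPartitions)] += entry.getValue();
        }
        return load;
    }
}
```

With made-up counts of 1,000,000 records for one type and a few dozen for the others, the reducer that receives the large type processes at least 1,000,000 records however many reducers are configured, and the job cannot finish before that reducer does.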

I am thinking of changing the default Chukwa demux implementation to case #1 and restructuring the current demux as the Archive Organizer. Any suggestions or objections?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CHUKWA-481) Improve demux reducer partitioning algorithm

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867624#action_12867624 ] 

Bill Graham commented on CHUKWA-481:
------------------------------------

I agree that being able to configure the default partitioner, as we currently do with the default mapper/reducer, would be best. That way, whatever is chosen as the hard-coded 'reasonable default' can be overridden in configs. Being able to configure a partitioner per dataType isn't a use case for us. If we choose not to support it now, we should at least leave the configuration model open to support it in the future.

[jira] Commented: (CHUKWA-481) Improve demux reducer partitioning algorithm

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861617#action_12861617 ] 

Jerome Boulon commented on CHUKWA-481:
--------------------------------------

Partitioning is the key to M/R, so reducing the partitioning function to 2 implementations will not make sense for everybody.
I understand that you are interested in case #1, but case #1 is only good when you can predict what kind of data you're going to have and define your grouping function accordingly; it will not be useful for Hive output, for example. The ideal would be to support a partitioning function at the dataType level, so that everyone can define the partitioning function that is right for a specific dataType. That is the ideal case; the minimum would be to have the partitioning class defined in chukwa-demux-conf.xml. That way, anybody is free to implement/configure the system to match their requirements.
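
One way the configurable lookup could work, sketched with java.util.Properties standing in for the values read from chukwa-demux-conf.xml. The property name, the interface and the two stand-in partitioners are hypothetical; they only illustrate a per-dataType override that falls back to a configured default and then to a hard-coded one.

```java
import java.util.Properties;

// Hypothetical interface standing in for Hadoop's Partitioner contract.
interface DemuxPartitionerSketch {
    int getPartition(String dataType, long timestampMillis, int numPartitions);
}

// Stand-in for the current behavior: one reducer per data type.
final class ByTypeSketch implements DemuxPartitionerSketch {
    public int getPartition(String dataType, long timestampMillis, int numPartitions) {
        return Math.floorMod(dataType.hashCode(), numPartitions);
    }
}

// Stand-in for case #1: data type plus a one-minute time bucket (assumed width).
final class ByTypeAndTimeSketch implements DemuxPartitionerSketch {
    public int getPartition(String dataType, long timestampMillis, int numPartitions) {
        long bucket = Math.floorDiv(timestampMillis, 60_000L);
        return Math.floorMod(31 * dataType.hashCode() + Long.hashCode(bucket), numPartitions);
    }
}

final class PartitionerConfigSketch {
    // Hypothetical property name; not an existing Chukwa setting.
    static final String KEY = "chukwa.demux.partitioner.class";

    // Looks up "<KEY>.<dataType>" first, then KEY, then a built-in default.
    static DemuxPartitionerSketch forDataType(Properties conf, String dataType) {
        String fallback = conf.getProperty(KEY, ByTypeSketch.class.getName());
        String className = conf.getProperty(KEY + "." + dataType, fallback);
        try {
            return Class.forName(className)
                    .asSubclass(DemuxPartitionerSketch.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load partitioner " + className, e);
        }
    }
}
```

Supporting only the default key would cover the minimum described above; the per-dataType suffix shows how the same configuration model could stay open for the ideal case later.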


