Posted to common-dev@hadoop.apache.org by "Klaas Bosteels (JIRA)" <ji...@apache.org> on 2009/06/05 10:24:08 UTC

[jira] Created: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Streaming partitioner should allow command, not just Java class
---------------------------------------------------------------

                 Key: HADOOP-5979
                 URL: https://issues.apache.org/jira/browse/HADOOP-5979
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/streaming
            Reporter: Klaas Bosteels


Since HADOOP-4842 got committed, Streaming allows both commands and Java classes to be specified as mapper, reducer, and combiner, but the {{-partitioner}} option is still limited to Java classes only. Allowing commands to be specified as partitioner as well would greatly improve the flexibility of Streaming programs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717989#action_12717989 ] 

Klaas Bosteels commented on HADOOP-5979:
----------------------------------------

Ok, so we should let the partitioner command determine the partition number directly then.

Further comments:
* Can we be sure that the number of partitions is always equal to the number of reducers under all circumstances?
* The format of the data that gets written to and read from the partitioner command should, of course, depend on the {{InputWriter}} and {{OutputReader}} classes in use, but Milind's suggestion sounds good for the text-based case.



[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717622#action_12717622 ] 

Klaas Bosteels commented on HADOOP-5979:
----------------------------------------

bq. But still, the command needs to have an idea of how many partitions there are, isn't it? Or maybe, you are saying that it's up to the command developer to assume a certain partition count and implement the command... I agree that it's simple but am not sure whether all use cases would be covered with this model..

Maybe it doesn't cover every possible use case, but I think it should cover the most common ones, and in the case of Streaming it might be more important to implement something that's very simple and easy to use than to make things as general as possible. Personally, I don't think I have ever implemented a partitioner that couldn't be replaced by a command that outputs keys, which then get hashed to determine the partition number.

bq. What did you mean by "we wouldn't need any additional reading/writing logic"? There is at least that much reading/writing as your code outlined, isn't it?

I meant that {{org.apache.hadoop.streaming.io.InputWriter}} and {{org.apache.hadoop.streaming.io.OutputReader}} wouldn't have to be extended in any way.

Having said that, extending {{InputWriter}} and {{OutputReader}} is perfectly feasible, so if you think it's better to work with partition numbers directly we could also implement something like:
{code}
public int getPartition(K2 key, V2 value, int numPartitions) {
  if (!ignoreKey) {
    inWriter_.writeKey(key);
  }
  inWriter_.writeValue(value);
  // writeNumber and readNumber do not exist yet; they are the proposed extensions.
  inWriter_.writeNumber(numPartitions);
  return outReader_.readNumber();
}
{code}
This would definitely be more flexible and might also be more efficient in certain cases, so maybe it is indeed preferable. I guess that a partitioner command would also be a rather advanced feature anyway, so maybe it's fine to expect a bit more effort from the people who use it and let it determine the partition number directly.



[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716599#action_12716599 ] 

Klaas Bosteels commented on HADOOP-5979:
----------------------------------------

I haven't thought much about the details yet, but yes, the easiest way to implement it might be to add a {{PipePartitioner}} that extends {{PipeMapper}}, much like {{PipeCombiner}} is an extension of {{PipeReducer}}. The {{PipePartitioner}} would also have to implement {{Partitioner}}, however, so it would need an {{int getPartition(Object key, Object value, int numPartitions)}} method, which could work somewhat like the {{void map(...)}} method. The way I see it, this method would use {{inWriter_}} to write the key and value to the standard input of the partitioner command, rely on {{outReader_}} to read the key and value returned for this pair, and then pass them to the {{int getPartition(...)}} method of a wrapped partitioner. Simplified, it could look something like:

{code}
public int getPartition(K2 key, V2 value, int numPartitions) {
  // Feed the pair to the partitioner command's standard input.
  if (!ignoreKey) {
    inWriter_.writeKey(key);
  }
  inWriter_.writeValue(value);
  // Expect exactly one transformed pair back for every pair written.
  if (!outReader_.readKeyValue()) {
    throw new RuntimeException("partitioner must output one key/val pair for each input pair");
  }
  Object newKey = outReader_.getCurrentKey();
  Object newValue = outReader_.getCurrentValue();
  // Delegate to the wrapped partitioner using the transformed pair.
  return realPartitioner.getPartition(newKey, newValue, numPartitions);
}
{code}

Streaming users could then easily define partitioners by specifying a partitioner command that transforms key/value pairs in such a way that the wrapped partitioner shows the desired behavior. The default wrapped partitioner should probably be {{HashPartitioner}}. 
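
For illustration only, the delegation idea can be sketched without any Hadoop classes. Everything named here ({{SimplePartitioner}}, {{DelegatingPartitioner}}) is hypothetical, and a plain function on the key stands in for the external command, which in the proposal would see both key and value:

```java
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for the wrapped partitioner (HashPartitioner by default in the proposal). */
interface SimplePartitioner {
  int getPartition(String key, String value, int numPartitions);
}

/** Hypothetical sketch: derive a routing key, then delegate to the wrapped partitioner. */
final class DelegatingPartitioner implements SimplePartitioner {
  // Assumption: this function stands in for the external partitioner command.
  private final UnaryOperator<String> command;
  private final SimplePartitioner wrapped;

  public DelegatingPartitioner(UnaryOperator<String> command, SimplePartitioner wrapped) {
    this.command = command;
    this.wrapped = wrapped;
  }

  @Override
  public int getPartition(String key, String value, int numPartitions) {
    // The derived key is used for routing only; the original key and value are not modified.
    String routingKey = command.apply(key);
    return wrapped.getPartition(routingKey, value, numPartitions);
  }
}
```

With a command that keeps only the first character of the key, all keys sharing that character are routed to the same partition, whatever the wrapped partitioner does with it.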

Does this make sense to you, Devaraj?



[jira] Issue Comment Edited: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716573#action_12716573 ] 

Devaraj Das edited comment on HADOOP-5979 at 6/5/09 3:42 AM:
-------------------------------------------------------------

Klaas, could you please shed some light on how you are thinking of implementing this? I guess everything you are describing happens in the framework's PipeMapper, is that right?

      was (Author: devaraj):
    Klaas, could you please shed some light on how you are thinking of implementing this? 
  


[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Milind Bhandarkar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717809#action_12717809 ] 

Milind Bhandarkar commented on HADOOP-5979:
-------------------------------------------

It would be valuable to have a non-java streaming partitioner. It should be executed once per map task, and should take as input (through stdin), the text-encoded key value pairs (one per line, separated by field separator), and output on stdout a number (again, text-encoded) for each key value pair.

The number of partitions, i.e. the number of reducers, should already be available to this streaming partitioner as the environment variable mapred_reduce_tasks, so there is no need to pass it on each line.

Partitioner need not be an "advanced" feature. Think about a parallel bucketing operation, where number of buckets is predetermined, so the mapper makes a decision where each value should go. In this case, the key is a partition ID, and value is the record to be bucketed. HashBased partitioner of course does not work in this case. But a streaming partitioner, such as 'cut -f1' is what is needed.
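
A minimal sketch of such a partitioner command, written in Java purely for illustration (in practice a shell one-liner would do). It assumes a tab as the field separator, a key that is already an integer bucket ID, and the reducer count in the mapred_reduce_tasks environment variable as described above. The class and method names are hypothetical and not part of Streaming:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;

/** Hypothetical partitioner command: one text record in, one partition number out. */
final class BucketPartitionerCommand {

  /** Maps one "key<TAB>value" line to a partition in [0, numPartitions). */
  public static int partitionFor(String line, int numPartitions) {
    int tab = line.indexOf('\t');
    String key = (tab < 0) ? line : line.substring(0, tab);
    int bucket = Integer.parseInt(key.trim());
    // floorMod keeps the result in range even for a negative bucket ID.
    return Math.floorMod(bucket, numPartitions);
  }

  public static void main(String[] args) throws IOException {
    // Assumption: the reducer count is exported under this name, as described above.
    String env = System.getenv("mapred_reduce_tasks");
    int numPartitions = (env == null) ? 1 : Math.max(1, Integer.parseInt(env.trim()));
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    PrintStream out = System.out;
    String line;
    while ((line = in.readLine()) != null) {
      out.println(partitionFor(line, numPartitions));
      // Flush per record so a caller that blocks on each answer is not left waiting.
      out.flush();
    }
  }
}
```

The per-record flush matters if the Java side waits for each answer before sending the next record; without it, buffered output could stall both processes.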



[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717546#action_12717546 ] 

Devaraj Das commented on HADOOP-5979:
-------------------------------------

I am not able to clearly see how this whole thing would fit in the MR model in the implementation we have in Hadoop. So the way it works is that the outputcollector thread in PipeMapper collects the key/vals from the streaming mapper and emits them to the framework. The framework part of the data-path, MapTask.MapOutputBuffer.collect, then invokes getPartition on the key/value and dumps it in the key/val buffer (which at a later point is sorted and spilled to disk).
In the approach you outlined, the Partitioner would update the key/value. What would be collected by MapTask? We'd like to keep the original key/value intact, right? Where would the getPartition get called?
Another approach for implementing this feature is to have a special Java implementation of the partitioner that, in its getPartition method, writes to the command and reads back the partition number. This model would be similar to the PipeMapper/Reducer models. The main difference would be that getPartition would be a blocking call (as opposed to map or reduce, where writing to and reading from the process is asynchronous).
Thoughts?
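
To make the blocking round trip concrete, here is a rough sketch that assumes a line-oriented text protocol (one "key<TAB>value" line out, one partition number back). {{BlockingPartitionClient}} is a hypothetical name; it is not an existing Streaming class, and it works on plain strings rather than the framework's key/value types:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical helper: one blocking write/read round trip per record. */
final class BlockingPartitionClient {
  private final BufferedWriter toCommand;
  private final BufferedReader fromCommand;

  public BlockingPartitionClient(Writer toCommand, Reader fromCommand) {
    this.toCommand = new BufferedWriter(toCommand);
    this.fromCommand = new BufferedReader(fromCommand);
  }

  /** Sends one record and blocks until the command answers with a partition number. */
  public int getPartition(String key, String value, int numPartitions) {
    try {
      toCommand.write(key + "\t" + value);
      toCommand.newLine();
      // Flush so the command sees the record before we block on its reply.
      toCommand.flush();
      String reply = fromCommand.readLine();
      if (reply == null) {
        throw new IllegalStateException("partitioner command ended before answering");
      }
      int partition = Integer.parseInt(reply.trim());
      if (partition < 0 || partition >= numPartitions) {
        throw new IllegalStateException("partition " + partition + " is out of range");
      }
      return partition;
    } catch (IOException e) {
      // Unchecked, so the method can keep a signature without checked exceptions.
      throw new UncheckedIOException(e);
    }
  }
}
```

In a real setting the writer and reader would wrap the output and input streams of a process started with {{ProcessBuilder}}. The original key and value are only copied to the command, so what the map task collects stays intact.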



[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717602#action_12717602 ] 

Devaraj Das commented on HADOOP-5979:
-------------------------------------

bq. it's simpler (the partitioner command doesn't need to know how many partitions there are),
But still, the command needs to have an idea of how many partitions there are, isn't it? Or maybe, you are saying that it's up to the command developer to assume a certain partition count and implement the command... I agree that it's simple but am not sure whether all use cases would be covered with this model..

bq. we could reuse more code that's already there (if we let the partitioner command output both a key and a value and pass that on to a wrapped partitioner, like in the code sample I gave above, we even wouldn't need any additional reading/writing logic).
What did you mean by "we wouldn't need any additional reading/writing logic"? There is at least that much reading/writing as your code outlined, isn't it?




[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717591#action_12717591 ] 

Klaas Bosteels commented on HADOOP-5979:
----------------------------------------

Yeah, I was actually suggesting such a special Java implementation that writes to and reads from a command, but instead of letting the command generate the partition number directly, I thought it might make sense to let it output a key or even a key/value pair (which are completely separate from the other MapReduce keys and values) and determine the partition from that. So instead of generating the same number for pairs that need to go to the same reducer, the partitioner command would just have to generate the same key for those pairs. The benefits of such an approach would be that
# it's simpler (the partitioner command doesn't need to know how many partitions there are),
# it might be easier to define a suitable partitioner command (when using shell tools it might be easier to output a string instead of a specific number for example),
# we could reuse more code that's already there (if we let the partitioner command output both a key and a value and pass that on to a wrapped partitioner, like in the code sample I gave above, we even wouldn't need any additional reading/writing logic).



[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716573#action_12716573 ] 

Devaraj Das commented on HADOOP-5979:
-------------------------------------

Klaas, could you please shed some light on how you are thinking of implementing this? 



[jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716535#action_12716535 ] 

Klaas Bosteels commented on HADOOP-5979:
----------------------------------------

I think the best approach would be to let the partitioner command take key/value pairs as input (just like the mapper command) and make it output a (new) key for each pair, which then gets hashed to determine the partition number for the pair. Any other thoughts or comments on this?
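
As a small illustration of the hashing step, the sketch below applies the usual mask-and-modulo formula (the one Hadoop's {{HashPartitioner}} applies to the key's hash code) to the key a partitioner command would output. Treating the command's output as a plain {{String}} is a simplification, and {{KeyHashPartition}} is a hypothetical name:

```java
/** Hypothetical helper mirroring the usual hash-then-modulo scheme. */
final class KeyHashPartition {
  /** Masks off the sign bit so the result is never negative, then takes the remainder. */
  public static int partitionOf(String commandOutputKey, int numPartitions) {
    return (commandOutputKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Two pairs end up on the same reducer whenever the command emits the same key for both, so the command never needs to know how many partitions there are.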
