Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2007/04/06 18:02:32 UTC

[jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner

Streaming should allow to specify a partitioner
-----------------------------------------------

                 Key: HADOOP-1215
                 URL: https://issues.apache.org/jira/browse/HADOOP-1215
             Project: Hadoop
          Issue Type: Improvement
            Reporter: Runping Qi




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

RE: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner

Posted by Runping Qi <ru...@yahoo-inc.com>.

Responses are inline below.


> -----Original Message-----
> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> Sent: Wednesday, April 11, 2007 5:09 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> specify a partitioner
> 
> This looks very good.
> 
> Probably my "splitter for reduce" is what you call "partitioner"  -- a
> function that given a "key" K and the number of "partitions" N produces
> the I in (0..N-1).
> 
[Runping Qi] 
Yes, that is exactly what the partitioner does.
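
To make the definition concrete, here is a minimal sketch in plain Python.
It is not Hadoop's Partitioner interface; the function name and the choice
of CRC32 as the hash are assumptions made only for illustration.

```python
import zlib


def partition(key: str, num_partitions: int) -> int:
    """Map a key K to a partition index I in 0..N-1."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is stable across runs, unlike Python's built-in hash() for str,
    # so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Any deterministic function with this shape would do; the only requirement
is that equal keys always map to the same index.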

> "sorter for reduce" -- the function that defines how records with the
> same key are ordered when presented to reduce.
> 
[Runping Qi] 
This can be achieved through the partitioner class for streaming.
Suppose your inputs are a list of records, each with multiple fields.
Logically you want to group them by fields 1, 2 and 3, but you also want the
records sorted by fields 4 and 5 within each group. What you can do is
have your mapper compose keys using fields 1, 2, 3, 4 and 5, and have your
partitioner partition by fields 1, 2 and 3 only. All records of a group then
go to the same reducer, and because the keys are sorted on all five fields,
they arrive ordered by fields 4 and 5 within the group.
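
The composite-key approach can be simulated in plain Python as below. This
is only a sketch of the idea, not Hadoop code: the function names, the
sample records and the CRC32 hash are all assumptions for illustration.

```python
import zlib
from itertools import groupby

GROUP_FIELDS = 3  # fields 1-3 define the logical group
KEY_FIELDS = 5    # fields 1-5 form the composite key


def map_record(fields):
    """Mapper: emit (composite key, value) for one input record."""
    return tuple(fields[:KEY_FIELDS]), "\t".join(fields[KEY_FIELDS:])


def partition(key, num_partitions):
    """Partitioner: hash only the grouping prefix of the key."""
    prefix = "\t".join(key[:GROUP_FIELDS]).encode("utf-8")
    return zlib.crc32(prefix) % num_partitions


def shuffle(records, num_partitions):
    """Route mapper output to partitions, then sort each by the full key."""
    parts = [[] for _ in range(num_partitions)]
    for fields in records:
        key, value = map_record(fields)
        parts[partition(key, num_partitions)].append((key, value))
    for part in parts:
        part.sort()
    return parts


def groups(part):
    """Reducer side: split one sorted partition into groups on fields 1-3."""
    for prefix, items in groupby(part, key=lambda kv: kv[0][:GROUP_FIELDS]):
        yield prefix, [value for _, value in items]


# Hypothetical sample records: five key fields followed by one value field.
records = [
    ["us", "web", "2007", "b", "2", "x"],
    ["us", "web", "2007", "a", "9", "y"],
    ["de", "web", "2007", "a", "1", "z"],
    ["us", "web", "2007", "a", "3", "w"],
]

for part in shuffle(records, 2):
    for prefix, values in groups(part):
        print(prefix, values)
```

Note that the keys here are compared as strings, so numeric fields would
need zero-padding (or a numeric comparison) to sort in numeric order.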


> It may be useful to be able to specify how the Map input is
> partitioned.  It is part of InputFormat.  However, from a naive user
> perspective specifying how you read records and find record boundaries
> is very different from specifying how to partition the input.   (I
> agree this is not a high priority issue -- as long as I can specify the
> number of map tasks I'd like to have).
> 
> Should the list of specifiable parameters also include Combiner class?
> 
[Runping Qi] 
A good point. Streaming already allows that.


> And once again, it would be great if Abacus classes were available in
> the reworked Streaming through exactly the same mechanism, without
> additional conventions.
> E.g. I'd like to have tab separated <key, value> as input,
> IdentityMapper, and the Abacus class that gives me the sum, the count,
> and std of values for each key.
>   (It is https://issues.apache.org/jira/browse/HADOOP-1247)
> 
[Runping Qi] 
That will be the work of https://issues.apache.org/jira/browse/HADOOP-1247
It is coming soon.
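
For reference, the aggregation Arkady describes (sum, count and std of the
values for each key, from tab-separated input) can be sketched in plain
Python as follows. This does not use or imitate the Abacus classes; the
function name is hypothetical, and the population standard deviation is an
assumption, since the thread does not say which variant is meant.

```python
import math
from collections import defaultdict


def aggregate(lines):
    """Return {key: (sum, count, std)} for 'key<TAB>value' lines."""
    # Per key: running sum, count, and sum of squares.
    stats = defaultdict(lambda: [0.0, 0, 0.0])
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        x = float(value)
        entry = stats[key]
        entry[0] += x
        entry[1] += 1
        entry[2] += x * x
    result = {}
    for key, (total, count, sum_sq) in stats.items():
        mean = total / count
        # Population variance; clamp tiny negative rounding errors to zero.
        variance = max(sum_sq / count - mean * mean, 0.0)
        result[key] = (total, count, math.sqrt(variance))
    return result
```

Because only a sum, a count and a sum of squares are kept per key, the same
partial results could also be merged in a combiner before the reduce step.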



> -- ab

Re: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.

This looks very good.

Probably my "splitter for reduce" is what you call "partitioner"  -- a 
function that given a "key" K and the number of "partitions" N produces 
the I in (0..N-1).

"sorter for reduce" -- the function that defines how records with the 
same key are ordered when presented to reduce.

It may be useful to be able to specify how the Map input is 
partitioned.  It is part of InputFormat.  However, from a naive user 
perspective specifying how you read records and find record boundaries 
is very different from specifying how to partition the input.   (I 
agree this is not a high priority issue -- as long as I can specify the 
number of map tasks I'd like to have).

Should the list of specifiable parameters also include Combiner class?

And once again, it would be great if Abacus classes were available in 
the reworked Streaming through exactly the same mechanism, without 
additional conventions.
E.g. I'd like to have tab separated <key, value> as input, 
IdentityMapper, and the Abacus class that gives me the sum, the count, 
and std of values for each key.
  (It is https://issues.apache.org/jira/browse/HADOOP-1247)

-- ab

On Apr 10, 2007, at 1:58 PM, Runping Qi wrote:

>
> Hi Arkady,
>
> With my changes, which should be available soon, the user can specify
> the following:
>
> 1. Mapper (a Java mapper class or an executable)
> 2. Reducer (a Java reducer class or an executable). Reduce NONE will be
> introduced as per HADOOP-1216.
> 3. InputFormat class
> 4. OutputFormat class
> 5. Partitioner
>
> I don't understand what you mean by (input partitioner, splitter for
> reduce, sorter for reduce). Can you explain?
>
> Hadoop has a collection of built-in classes:
>
> IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper,
> LongSumReducer
>
> TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
> SequenceFileOutputFormat, NullOutputFormat
>
> Some more coming soon:
>
> SequenceFileToLineInputFormat, KeyValueTextInputFormat.
>
> We can add IdentityMapper/IdentityReducer/
> KeyValueTextInputFormat/TextOutputFormat as the defaults for Hadoop
> Streaming.
>
>
> Runping

RE: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner

Posted by Runping Qi <ru...@yahoo-inc.com>.

Hi Arkady,

With my changes, which should be available soon, the user can specify the
following:

1. Mapper (a Java mapper class or an executable)
2. Reducer (a Java reducer class or an executable). Reduce NONE will be
introduced as per HADOOP-1216.
3. InputFormat class
4. OutputFormat class
5. Partitioner

I don't understand what you mean by (input partitioner, splitter for
reduce, sorter for reduce). Can you explain?

Hadoop has a collection of built-in classes:

IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper,
LongSumReducer

TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
SequenceFileOutputFormat, NullOutputFormat

Some more coming soon:

SequenceFileToLineInputFormat, KeyValueTextInputFormat.

We can add IdentityMapper/IdentityReducer/
KeyValueTextInputFormat/TextOutputFormat as the defaults for Hadoop
Streaming.
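
As a rough sketch of how such defaults and short class names could be
resolved (plain Python, not the Streaming implementation): the setting
names, the function names and the fully qualified package names below are
assumptions for illustration and should be checked against the Hadoop
source.

```python
# Assumed short-name to fully-qualified-name table (illustrative only).
KNOWN_CLASSES = {
    "IdentityMapper": "org.apache.hadoop.mapred.lib.IdentityMapper",
    "IdentityReducer": "org.apache.hadoop.mapred.lib.IdentityReducer",
    "KeyValueTextInputFormat": "org.apache.hadoop.mapred.KeyValueTextInputFormat",
    "TextOutputFormat": "org.apache.hadoop.mapred.TextOutputFormat",
}

# The defaults proposed above for Hadoop Streaming.
DEFAULTS = {
    "mapper": "IdentityMapper",
    "reducer": "IdentityReducer",
    "inputformat": "KeyValueTextInputFormat",
    "outputformat": "TextOutputFormat",
}


def resolve(name):
    """Expand a known short class name; leave anything else untouched."""
    return KNOWN_CLASSES.get(name, name)


def job_settings(**overrides):
    """Merge user overrides into the defaults and expand short names."""
    settings = dict(DEFAULTS)
    settings.update(overrides)
    return {option: resolve(value) for option, value in settings.items()}


# An executable mapper is passed through as-is; the rest fall back to defaults.
print(job_settings(mapper="/bin/cat"))
```

Leaving unknown names untouched lets the same setting carry either a class
name or an executable command.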


Runping




> -----Original Message-----
> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> Sent: Tuesday, April 10, 2007 1:24 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> specify a partitioner
> 
> To extend this,
> I'd suggest that Hadoop Streaming is interfaced in the following way:
> 
> A map-reduce process is parameterized by several algorithms.
> These include at least
> 1. mapper
> 2. reducer  (including special case of NONE)
> 3. input format
> 4. input partitioner
> 5. splitter for reduce
> 6. sorter for reduce
> 
> The current Hadoop Streaming allows specifying only 1 and 2 (and
> gives limited control over 3).
> Nicely, 1 (the mapper) can be specified either as a command to stream the
> data through or as a Java class to use.
> 
> It would make a lot of sense to
> (a) allow specifying a Java class that implements each of these
> (b) provide meaningful defaults, so that the user of Hadoop Streaming
> does not need to worry about details irrelevant to her specific task.
> (c) provide a set of useful classes so that the user can pick the
> necessary ones rather than re-implementing the same things again and again.
> (c.1) make sure that there is a convenient shorthand to specify these
> predefined classes (e.g. without the long package prefix)
> 
> In particular, it would be good to have predefined Identity mapper and
> reducer (the mapper actually is available now), reducers that provide
> simple aggregation (like in Abacus), input formats for commonly used
> formats (including CSV, flat XML, etc), sorter different from splitter,
> etc.
> 
> Then "Streaming should allow to specify a partitioner" would be
> automatically resolved as a special case.
> It might be better to implement the whole consistent approach rather
> than do special cases one by one.
> 
> -- ab

Re: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.

To extend this,
I'd suggest that Hadoop Streaming is interfaced in the following way:

A map-reduce process is parameterized by several algorithms.
These include at least
1. mapper
2. reducer  (including special case of NONE)
3. input format
4. input partitioner
5. splitter for reduce
6. sorter for reduce

The current Hadoop Streaming allows specifying only 1 and 2 (and 
gives limited control over 3).
Nicely, 1 (the mapper) can be specified either as a command to stream the 
data through or as a Java class to use.

It would make a lot of sense to
(a) allow specifying a Java class that implements each of these
(b) provide meaningful defaults, so that the user of Hadoop Streaming 
does not need to worry about details irrelevant to her specific task.
(c) provide a set of useful classes so that the user can pick the 
necessary ones rather than re-implementing the same things again and again.
(c.1) make sure that there is a convenient shorthand to specify these 
predefined classes (e.g. without the long package prefix)

In particular, it would be good to have predefined Identity mapper and 
reducer (the mapper actually is available now), reducers that provide 
simple aggregation (like in Abacus), input formats for commonly used 
formats (including CSV, flat XML, etc), sorter different from splitter, 
etc.

Then "Streaming should allow to specify a partitioner" would be 
automatically resolved as a special case.
It might be better to implement the whole consistent approach rather 
than do special cases one by one.

-- ab

On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:

> Streaming should allow to specify a partitioner
> -----------------------------------------------
>
>                  Key: HADOOP-1215
>                  URL: https://issues.apache.org/jira/browse/HADOOP-1215
>              Project: Hadoop
>           Issue Type: Improvement
>             Reporter: Runping Qi
>
>
>
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

[jira] Resolved: (HADOOP-1215) Streaming should allow to specify a partitioner

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi resolved HADOOP-1215.
--------------------------------

    Resolution: Fixed


Fixed as a part of HADOOP-1214

> Streaming should allow to specify a partitioner
> -----------------------------------------------
>
>                 Key: HADOOP-1215
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1215
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>



[jira] Assigned: (HADOOP-1215) Streaming should allow to specify a partitioner

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi reassigned HADOOP-1215:
----------------------------------

    Assignee: Runping Qi

> Streaming should allow to specify a partitioner
> -----------------------------------------------
>
>                 Key: HADOOP-1215
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1215
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>

