Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2007/04/21 02:33:15 UTC

[jira] Created: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

clean up the protocol between stream mapper/reducer and the framework
---------------------------------------------------------------------

Key: HADOOP-1284
URL: https://issues.apache.org/jira/browse/HADOOP-1284
Project: Hadoop
Issue Type: Improvement
Reporter: Runping Qi

Right now, the protocol between stream mapper/reducer and the framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework picks it up line by line and splits
each line into a key/value pair. By default, the substring up to the first tab char is the key, and the
substring after the first tab char is the value.

However, in many cases, the application wants some control over how the pair is split.
Here, I'd like to introduce the following configuration variables for that:

1. "streaming.output.field.separator": the value will be the tab character by default, but the user can specify a different one (e.g. '|' or ' ').
A map output line can be considered as a list of fields separated by the separator.

2. "streaming.num.fields.for.mapout.key": the number of leading fields that will be used as the map output key (and for sorting on the reduce side).
The default value is 1.
The rest of the fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.

3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to achieve the "primary/secondary" composite
key effect proposed in HADOOP-485. The default value is 1. For example, I can set "streaming.num.fields.for.partitioning" to 3
and "streaming.num.fields.for.mapout.key" to 5. This effectively amounts to saying that fields 4 and 5 are my secondary key.

With the above default values, the proposal is compatible with the current behavior while introducing a desirable new feature in a clean way.
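
As a rough illustration of the proposed semantics (not the patch itself), the sketch below splits one output line into a key and a value using a separator and a key-field count, then derives a partition from a shorter prefix of the key. The function names, the use of Python, the crc32 hash, and the handling of lines with too few separators are assumptions made for this example; only the three configuration names come from the proposal.

```python
import zlib


def split_key_value(line, separator="\t", num_key_fields=1):
    """Split an output line into (key, value) after num_key_fields fields."""
    # Hypothetical helper mirroring "streaming.output.field.separator" and
    # "streaming.num.fields.for.mapout.key" from the proposal.
    parts = line.split(separator, num_key_fields)
    if len(parts) <= num_key_fields:
        # Assumption: with too few separators, the whole line is the key.
        return line, ""
    key = separator.join(parts[:num_key_fields])
    return key, parts[num_key_fields]


def partition_for(key, num_partitions, separator="\t", num_partition_fields=1):
    """Pick a partition from the first num_partition_fields fields of the key."""
    # Hypothetical helper mirroring "streaming.num.fields.for.partitioning".
    prefix = separator.join(key.split(separator)[:num_partition_fields])
    # crc32 is an arbitrary stable hash chosen for this sketch.
    return zlib.crc32(prefix.encode("utf-8")) % num_partitions
```

With a '|' separator, 5 key fields, and 3 partitioning fields, two records that agree on their first three fields land in the same partition while still being sorted on all five, which is the "secondary key" effect described above.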

Thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi updated HADOOP-1284:
-------------------------------

    Status: Patch Available  (was: Open)

> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>         Attachments: patch-1284.txt

[jira] Commented: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492248 ] 

Hadoop QA commented on HADOOP-1284:
-----------------------------------

Integrated in Hadoop-Nightly #71 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/71/)

> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>             Fix For: 0.13.0
>
>         Attachments: patch-1284.txt

[jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi updated HADOOP-1284:
-------------------------------

    Attachment: patch-1284.txt

> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>         Attachments: patch-1284.txt

[jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1284:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Runping!

> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>             Fix For: 0.13.0
>
>         Attachments: patch-1284.txt

[jira] Commented: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491725 ] 

Hadoop QA commented on HADOOP-1284:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12356261/patch-1284.txt applied and successfully tested against trunk revision r532083.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/78/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/78/console

> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>         Attachments: patch-1284.txt

[jira] Assigned: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi reassigned HADOOP-1284:
----------------------------------

    Assignee: Runping Qi

> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi

Re: [jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.

Wonderful!

On Apr 25, 2007, at 12:30 PM, Runping Qi wrote:

> Arkady,
>
> The FieldSelectionMapReduce and KeyFieldBasedPartitioner classes allow
> you to do exactly what you want (namely, select fields 6, 3, 8 and 5 as
> your sorting keys).
>
> Runping

RE: [jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by Runping Qi <ru...@yahoo-inc.com>.

Arkady,

The FieldSelectionMapReduce and KeyFieldBasedPartitioner classes allow
you to do exactly what you want (namely, select fields 6, 3, 8 and 5 as
your sorting keys).

Runping
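
To make the idea of selecting arbitrary fields as sorting keys concrete, here is a small sketch of cut-style field selection in Python. It is not the FieldSelectionMapReduce API; the function name, the 1-based field numbering, and the sample record are assumptions for illustration only.

```python
def select_fields(line, key_fields, separator="\t"):
    """Build a sort key from arbitrary 1-based field positions of a record."""
    fields = line.split(separator)
    # Missing positions become empty strings rather than raising an error.
    picked = [fields[i - 1] if 0 < i <= len(fields) else "" for i in key_fields]
    return separator.join(picked)


# Hypothetical nine-field record: f1 .. f9, tab separated.
record = "\t".join(f"f{n}" for n in range(1, 10))
sort_key = select_fields(record, [6, 3, 8, 5])
print(sort_key.split("\t"))  # ['f6', 'f3', 'f8', 'f5']
```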


> -----Original Message-----
> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> Sent: Wednesday, April 25, 2007 12:17 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Updated: (HADOOP-1284) clean up the protocol between
> stream mapper/reducer and the framework
> 
> Runping,
> 
> as we discussed yesterday, it may be better to implement more complete
> functionality that would allow specifying any combination of fields to
> be used for partitioning and for sorting.
> This can be easily implemented on top of the functionality this specific
> patch provides (by prepending the actual keys in the "streaming
> mapper" class, and stripping them in the "streaming reducer" class before
> feeding the records to the streaming reducer command provided by the user).
> 
> However, at the user level, I'd suggest you export the "complete"
> functionality, rather than limiting it by requiring the keys to be at
> the beginning of the record.
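
The prepend-and-strip idea described above can be sketched as follows. This is an illustration only, not code from the patch: the function names, the 1-based field positions, and the tab separator are assumptions.

```python
def prepend_keys(line, key_fields, separator="\t"):
    """Mapper side: copy the chosen fields to the front of the record."""
    fields = line.split(separator)
    # Raises IndexError if a requested field is missing from the record.
    keys = [fields[i - 1] for i in key_fields]
    return separator.join(keys + fields)


def strip_keys(line, num_keys, separator="\t"):
    """Reducer side: drop the prepended keys to recover the original record."""
    return line.split(separator, num_keys)[num_keys]
```

Once the chosen keys sit in the leading positions, settings that only count leading fields (how many form the key, how many are used for partitioning) are enough to sort and partition on any combination of the original fields.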
> 
> On Apr 25, 2007, at 11:13 AM, Runping Qi (JIRA) wrote:
> 
> >
> >      [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Runping Qi updated HADOOP-1284:
> > -------------------------------
> >
> >     Description:
> > Right now, the protocol between stream mapper/reducer and the
> > framework is very inflexible.
> > The mapper/reducer generates line-oriented output. The framework picks
> > it up line by line and splits
> > each line into a key/value pair. By default, the substring up to the
> > first tab char is the key, and the
> > substring after the first tab char is the value.
> >
> > However, in many cases, the application wants some control over how
> > the pair is split.
> > Here, I'd like to introduce the following configuration variables for
> > that:
> >
> > 1. "streaming.output.field.separator": the value will be the tab
> > character by default,
> > but the user can specify a different one (e.g. ':' or ', ').
> > A map output line can be considered as a list of fields separated by
> > the separator.
> >
> > 2. "streaming.num.fields.for.mapout.key": the number of leading
> > fields that will be used as the map output key
> > (and for sorting on the reduce side).
> > The default value is 1.
> > The rest of the fields will be used as the value. For example, I can
> > specify the first 5 fields as my mapout key.
> >
> > 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
> > fewer fields for partitioning to
> > achieve the "primary/secondary" composite
> > key effect proposed in HADOOP-485. The default value is 1.
> > For example, I can set "streaming.num.fields.for.partitioning" to 3
> > and "streaming.num.fields.for.mapout.key" to 5.
> > This effectively amounts to saying that fields 4 and 5 are my
> > secondary key.
> >
> > With the above default values, the proposal is compatible with the current
> > behavior
> > while introducing a desirable new feature in a clean way.
> >
> > Thoughts?
> >
> > This patch implemented the proposed protocol.
> >
> > With this patch, the streaming user can specify a field separator for
> > the mapper's output and/or a field separator
> > for the reducer's output. The default will be the tab char.
> >
> > The user can also specify how many fields in the output constitute the
> > keys. The default is 1.
> > The rest part of a line will be the value.
> >
> > A partitioner class, KeyFieldBasedPartitioner in mapred.lib, is also
> > implemented.
> > The user can specify the number of the fields in the map output keys
> > will be used for partitioning.
> >
> > Also a utility class, FieldSelectionMapReduce in mapred.lib, is added.
> > This class allows the
> > user to create map/reduce jobs that manipulate text data like the Unix
> > cut utility.
> > The user can specify field separator (delimiter for cut) and specify
> > which fields to select, and
> > by which fields to partition/sort.
> >
> > Two unit tests are introduced.
> > All the unit tests passed.
> >
> >
> >
> >> clean up the protocol between stream mapper/reducer and the framework
> >> ---------------------------------------------------------------------
> >>
> >>                 Key: HADOOP-1284
> >>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
> >>             Project: Hadoop
> >>          Issue Type: Improvement
> >>            Reporter: Runping Qi
> >>         Assigned To: Runping Qi
> >>         Attachments: patch-1284.txt
> >>
> >>
> >> Right now, the protocol between stream mapper/reducer and the
> >> framework is very inflexible.
> >> The mapper/reducer generates line oriented output. The framework
> >> picks up line by line, and split
> >> each line into a key/value pair. By default, the substring up to the
> >> first tab char is the key, and the
> >> substring after the first tab char is the value.
> >> However, in many cases, the application wants some control over how
> >> the pair is split.
> >> Here, I'd like to introduce the following configuration variables for
> >> that:
> >> 1. "streaming.output.field.separator": the value will be the tab key,
> >> by default.
> >> But the user can specify a different one (e.g. ':', or ', ', etc.)
> >> A map output line can be considered as a list of fields separated by
> >> the separator.
> >> 2. "streaming.num.fields.for.mapout.key":  the number of the first
> >> fields will be used the map output key
> >> (and for sorting in the reduce side).
> >> The default value is 1.
> >> The rest of the fields will be used as the value.  For example, I can
> >> specify the first 5 fields as my mapout key.
> >> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
> >> fewer fields for partitioning to
> >> achieve "primary/secondary" composite
> >> key effect as proposed in HADOOP485. The default value is 1.
> >> For example, I can set "streaming.num.fields.for.partitioning" to 3
> >> and "streaming.num.fields.for.mapout.key" to 5.
> >> This effectively amounts to saying that fields 4 and 5 are my
> >> secondary key.
> >> With the above default values, it is compatible with the current
> >> behavior
> >> while introducing a new desirable feature in a clean way.
> >> Thoughts?
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >

Re: [jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.

Runping,

as we discussed yesterday, it may be better to implement more complete
functionality that would allow specifying any combination of fields to
be used for partitioning and for sorting.
This can be easily implemented on top of the functionality this specific
patch provides (by having the "streaming mapper" class prepend the actual
keys, and having the "streaming reducer" class strip them before
feeding the records to the streaming reducer command provided by the user).

However, at the user level, I'd suggest you export the "complete"
functionality, rather than limiting it by requiring the keys to be at
the beginning of the record.
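
A minimal sketch of the prepend-and-strip idea above, written in Python for brevity. The function names, the 1-based field numbering, and the tab separator are assumptions made for illustration; nothing here is taken from the patch itself.

```python
SEP = "\t"  # assumed field separator


def prepend_keys(record, key_fields, sep=SEP):
    """Mapper side: copy the chosen fields (1-based) to the front of the record."""
    fields = record.split(sep)
    keys = [fields[i - 1] for i in key_fields]
    return sep.join(keys + fields)


def strip_keys(record, num_keys, sep=SEP):
    """Reducer side: drop the prepended key fields to recover the original record."""
    return sep.join(record.split(sep)[num_keys:])
```

Because the selected fields are copied to the front, a framework that only understands "the first N fields are the key" can still partition and sort on any combination of fields, and the user's reducer command sees the record unchanged.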

On Apr 25, 2007, at 11:13 AM, Runping Qi (JIRA) wrote:

>
>      [  
> https://issues.apache.org/jira/browse/HADOOP-1284? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Runping Qi updated HADOOP-1284:
> -------------------------------
>
>     Description:
> Right now, the protocol between stream mapper/reducer and the  
> framework is very inflexible.
> The mapper/reducer generates line oriented output. The framework picks  
> up line by line, and split
> each line into a key/value pair. By default, the substring up to the  
> first tab char is the key, and the
> substring after the first tab char is the value.
>
> However, in many cases, the application wants some control over how  
> the pair is split.
> Here, I'd like to introduce the following configuration variables for  
> that:
>
> 1. "streaming.output.field.separator": the value will be the tab key,  
> by default.
> But the user can specify a different one (e.g. ':', or ', ', etc.)
> A map output line can be considered as a list of fields separated by  
> the separator.
>
> 2. "streaming.num.fields.for.mapout.key":  the number of the first  
> fields will be used the map output key
> (and for sorting in the reduce side).
> The default value is 1.
> The rest of the fields will be used as the value.  For example, I can  
> specify the first 5 fields as my mapout key.
>
> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use  
> fewer fields for partitioning to
> achieve "primary/secondary" composite
> key effect as proposed in HADOOP485. The default value is 1.
> For example, I can set "streaming.num.fields.for.partitioning" to 3
> and "streaming.num.fields.for.mapout.key" to 5.
> This effectively amounts to saying that fields 4 and 5 are my  
> secondary key.
>
> With the above default values, it is compatible with the current  
> behavior
> while introducing a new desirable feature in a clean way.
>
> Thoughts?
>
>
>
>
>   was:
>
> Right now, the protocol between stream mapper/reducer and the  
> framework is very inflexible.
> The mapper/reducer generates line oriented output. The framework picks  
> up line by line, and split
> each line into a key/value pair. By default, the substring up to the  
> first tab char is the key, and the
> substring after the first tab char is the value.
>
> However, in many cases, the application wants some control over how  
> the pair is split.
> Here, I'd like to introduce the following configuration variables for  
> that:
>
> 1. "streaming.output.field.separator": the value will be the tab key,  
> by default. But the user can specify a different one (e.g. '|', or '  
> ', etc.)
> A map output line can be considered as a list of fields separated by  
> the separator.
>
> 2. "streaming.num.fields.for.mapout.key":  the number of the first  
> fields will be used the map output key  (and for sorting in the reduce  
> side).
> The default value is 1.
> The rest of the fields will be used as the value.  For example, I can  
> specify the first 5 fields as my mapout key.
>
> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use  
> fewer fields for partitioning to achieve "primary/secondary" composite
> key effect as proposed in HADOOP485. The default value is 1. For  
> example, I can set "streaming.num.fields.for.partitioning" to 3
> and "streaming.num.fields.for.mapout.key" to 5. This effectively  
> amounts to saying that fields 4 and 5 are my secondary key.
>
> With the above default values, it is compatible with the current  
> behavior while introducing a new desirable feature in a clean way.
>
> Thoughts?
>
>
>
>
>
> This patch implemented the proposed protocol.
>
> With this patch, the streaming user can specify a field separator for
> the mapper's output and/or a field separator
> for the reducer's output. The default will be the tab char.
>
> The user can also specify how many fields in the output constitute the
> keys. The default is 1.
> The rest part of a line will be the value.
>
> A partitioner class, KeyFieldBasedPartitioner in mapred.lib, is also  
> implemented.
> The user can specify the number of the fields in the map output keys
> will be used for partitioning.
>
> Also a utility class, FieldSelectionMapReduce in mapred.lib, is added.
> This class allows the
> user to create map/reduce jobs that manipulate text data like the Unix
> cut utility.
> The user can specify field separator (delimiter for cut) and specify  
> which fields to select, and
> by which fields to partition/sort.
>
> Two unit tests are introduced.
> All the unit tests passed.
>
>
>
>> clean up the protocol between stream mapper/reducer and the framework
>> ---------------------------------------------------------------------
>>
>>                 Key: HADOOP-1284
>>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>>             Project: Hadoop
>>          Issue Type: Improvement
>>            Reporter: Runping Qi
>>         Assigned To: Runping Qi
>>         Attachments: patch-1284.txt
>>
>>
>> Right now, the protocol between stream mapper/reducer and the  
>> framework is very inflexible.
>> The mapper/reducer generates line oriented output. The framework  
>> picks up line by line, and split
>> each line into a key/value pair. By default, the substring up to the  
>> first tab char is the key, and the
>> substring after the first tab char is the value.
>> However, in many cases, the application wants some control over how  
>> the pair is split.
>> Here, I'd like to introduce the following configuration variables for  
>> that:
>> 1. "streaming.output.field.separator": the value will be the tab key,  
>> by default.
>> But the user can specify a different one (e.g. ':', or ', ', etc.)
>> A map output line can be considered as a list of fields separated by  
>> the separator.
>> 2. "streaming.num.fields.for.mapout.key":  the number of the first  
>> fields will be used the map output key
>> (and for sorting in the reduce side).
>> The default value is 1.
>> The rest of the fields will be used as the value.  For example, I can  
>> specify the first 5 fields as my mapout key.
>> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use  
>> fewer fields for partitioning to
>> achieve "primary/secondary" composite
>> key effect as proposed in HADOOP485. The default value is 1.
>> For example, I can set "streaming.num.fields.for.partitioning" to 3
>> and "streaming.num.fields.for.mapout.key" to 5.
>> This effectively amounts to saying that fields 4 and 5 are my  
>> secondary key.
>> With the above default values, it is compatible with the current  
>> behavior
>> while introducing a new desirable feature in a clean way.
>> Thoughts?
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

[jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi updated HADOOP-1284:
-------------------------------

    Description: 
Right now, the protocol between stream mapper/reducer and the framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework picks it up line by line and splits
each line into a key/value pair. By default, the substring up to the first tab char is the key, and the
substring after the first tab char is the value.

However, in many cases, the application wants some control over how the pair is split. 
Here, I'd like to introduce the following configuration variables for that:

1. "streaming.output.field.separator": the default value is the tab character,
but the user can specify a different one (e.g. ':' or ', ').
A map output line can be considered as a list of fields separated by the separator.

2. "streaming.num.fields.for.mapout.key": the number of leading fields that will be used as the map output key
(and for sorting on the reduce side).
The default value is 1.
The rest of the fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.

3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to
achieve the "primary/secondary" composite
key effect as proposed in HADOOP-485. The default value is 1.
For example, I can set "streaming.num.fields.for.partitioning" to 3
and "streaming.num.fields.for.mapout.key" to 5.
This effectively amounts to saying that fields 4 and 5 are my secondary key.

With the above default values, the proposal is compatible with the current behavior
while introducing a desirable new feature in a clean way.
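
To make the splitting rule concrete, here is a minimal sketch in Python (not the framework's code). The function and parameter names are hypothetical stand-ins for the two configuration variables, and the handling of a line with too few fields (whole line becomes the key, empty value) is an assumption.

```python
def split_map_output(line, sep="\t", num_key_fields=1):
    """Split one map output line into a (key, value) pair.

    The key is the first num_key_fields fields, still joined by sep;
    the value is everything after them. If the line has too few fields,
    the whole line is returned as the key with an empty value (assumed).
    """
    parts = line.split(sep, num_key_fields)
    if len(parts) <= num_key_fields:
        return line, ""
    return sep.join(parts[:num_key_fields]), parts[num_key_fields]
```

With the defaults this gives the current behavior: `split_map_output("k\tv1\tv2")` yields the key `"k"` and the value `"v1\tv2"`. With `sep="|"` and `num_key_fields=2`, the line `"a|b|c|d"` yields the key `"a|b"` and the value `"c|d"`.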

Thoughts?




  was:

Right now, the protocol between stream mapper/reducer and the framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework picks it up line by line and splits
each line into a key/value pair. By default, the substring up to the first tab char is the key, and the
substring after the first tab char is the value.

However, in many cases, the application wants some control over how the pair is split. 
Here, I'd like to introduce the following configuration variables for that:

1. "streaming.output.field.separator": the default value is the tab character, but the user can specify a different one (e.g. '|' or ' ').
A map output line can be considered as a list of fields separated by the separator.

2. "streaming.num.fields.for.mapout.key": the number of leading fields that will be used as the map output key (and for sorting on the reduce side).
The default value is 1.
The rest of the fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.

3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to achieve the "primary/secondary" composite
key effect as proposed in HADOOP-485. The default value is 1. For example, I can set "streaming.num.fields.for.partitioning" to 3
and "streaming.num.fields.for.mapout.key" to 5. This effectively amounts to saying that fields 4 and 5 are my secondary key.

With the above default values, the proposal is compatible with the current behavior while introducing a desirable new feature in a clean way.

Thoughts?





This patch implements the proposed protocol.

With this patch, the streaming user can specify a field separator for the mapper's output and/or a field separator
for the reducer's output. The default is the tab char.

The user can also specify how many fields in the output constitute the key. The default is 1.
The rest of the line will be the value.

A partitioner class, KeyFieldBasedPartitioner in mapred.lib, is also implemented.
The user can specify how many of the fields in the map output key
will be used for partitioning.
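
The partitioning idea can be sketched as follows, in Python. This is not the code of KeyFieldBasedPartitioner: the function name, its parameters, and the use of CRC32 as the hash are all assumptions chosen only to show that records sharing the leading key fields land on the same reducer.

```python
import zlib


def partition_for(key, num_partition_fields, num_reducers, sep="\t"):
    """Pick a reducer from the first num_partition_fields fields of the key."""
    prefix = sep.join(key.split(sep)[:num_partition_fields])
    # CRC32 is a stand-in hash; any deterministic hash of the prefix would do.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers
```

With a 5-field key and 3 partitioning fields, two keys that differ only in fields 4 and 5 get the same partition number, so fields 4 and 5 act as a secondary sort key within one reducer.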

Also, a utility class, FieldSelectionMapReduce in mapred.lib, is added. This class allows the
user to create map/reduce jobs that manipulate text data like the Unix cut utility.
The user can specify the field separator (the delimiter for cut), which fields to select, and
by which fields to partition/sort.
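
As a rough illustration of cut-style field selection, here is a small Python sketch. It does not reflect how FieldSelectionMapReduce is configured; the function name and the 1-based field list are hypothetical. Unlike cut, it emits fields in the order requested rather than in input order.

```python
def select_fields(line, field_numbers, sep="\t"):
    """Return the requested 1-based fields of line, joined by sep.

    Field numbers outside the range of the line are silently skipped.
    """
    fields = line.split(sep)
    return sep.join(fields[n - 1] for n in field_numbers if 1 <= n <= len(fields))
```

For example, `select_fields("a:b:c:d", [1, 3], sep=":")` returns `"a:c"`, much like `cut -d: -f1,3`.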

Two unit tests are introduced.
All the unit tests passed.



> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>         Attachments: patch-1284.txt
>
>
> Right now, the protocol between stream mapper/reducer and the framework is very inflexible.
> The mapper/reducer generates line-oriented output. The framework picks it up line by line and splits
> each line into a key/value pair. By default, the substring up to the first tab char is the key, and the
> substring after the first tab char is the value.
> However, in many cases, the application wants some control over how the pair is split. 
> Here, I'd like to introduce the following configuration variables for that:
> 1. "streaming.output.field.separator": the default value is the tab character,
> but the user can specify a different one (e.g. ':' or ', ').
> A map output line can be considered as a list of fields separated by the separator.
> 2. "streaming.num.fields.for.mapout.key": the number of leading fields that will be used as the map output key
> (and for sorting on the reduce side).
> The default value is 1.
> The rest of the fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.
> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to
> achieve the "primary/secondary" composite
> key effect as proposed in HADOOP-485. The default value is 1.
> For example, I can set "streaming.num.fields.for.partitioning" to 3
> and "streaming.num.fields.for.mapout.key" to 5.
> This effectively amounts to saying that fields 4 and 5 are my secondary key.
> With the above default values, the proposal is compatible with the current behavior
> while introducing a desirable new feature in a clean way.
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.