Posted to mapreduce-issues@hadoop.apache.org by "Keith Jackson (JIRA)" <ji...@apache.org> on 2009/10/19 22:36:59 UTC

[jira] Created: (MAPREDUCE-1122) streaming with custom input format does not support the new API

streaming with custom input format does not support the new API
---------------------------------------------------------------

                 Key: MAPREDUCE-1122
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1122
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.20.1
         Environment: any OS
            Reporter: Keith Jackson


When trying to implement a custom input format for use with streaming, I have found that streaming does not support the new API, org.apache.hadoop.mapreduce.InputFormat, but requires the old API, org.apache.hadoop.mapred.InputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919601#action_12919601 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1122:
----------------------------------------------------

Will upload a new patch soon.


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878515#action_12878515 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1122:
----------------------------------------------------

Users can specify the Mapper/Reducer to be a Java Mapper/Reducer or a command. They can also specify the input format, output format and partitioner for the streaming job. The tables below summarize which mapper or reducer is in use when streaming supports both the old and new APIs.

Note: In the tables below, NS stands for 'Not specified'.

*Table 1* Mapper-in-use for given spec, when num reducers = 0:
||Mapper || InputFormat || OutputFormat || Valid conf?|| Mapper-in-use ||
|Command|NS|NS|Yes|New|
|Command|Old|NS|Yes|Old|
|Command|Old|Old|Yes|Old|
|Command|Old|New|{color:red}No{color}|
|Command|New|NS|Yes|New|
|Command|New|Old|{color:red}No{color}|
|Command|New|New|Yes|New|
|Old|NS|NS|Yes|Old|
|Old|NS|Old|Yes|Old|
|Old|Old|NS|Yes|Old|
|Old|Old|Old|Yes|Old|
|Old|-|New|{color:red}No{color}|
|Old|New|-|{color:red}No{color}|
|New|NS|NS|Yes|New|
|New|NS|New|Yes|New|
|New|New|NS|Yes|New|
|New|New|New|Yes|New|
|New|-|Old|{color:red}No{color}|
|New|Old|-|{color:red}No{color}|

*Table 2* Mapper-in-use for given spec, when num reducers != 0:
||Mapper || InputFormat || Partitioner|| Valid conf?|| Mapper-in-use ||
|Command|NS|NS|Yes|New|
|Command|Old|NS|Yes|Old|
|Command|Old|Old|Yes|Old|
|Command|Old|New|{color:red}No{color}|
|Command|New|NS|Yes|New|
|Command|New|Old|{color:red}No{color}|
|Command|New|New|Yes|New|
|Old|NS|NS|Yes|Old|
|Old|NS|Old|Yes|Old|
|Old|Old|NS|Yes|Old|
|Old|Old|Old|Yes|Old|
|Old|New|-|{color:red}No{color}|
|Old|-|New|{color:red}No{color}|
|New|NS|NS|Yes|New|
|New|NS|New|Yes|New|
|New|New|NS|Yes|New|
|New|New|New|Yes|New|
|New|Old|-|{color:red}No{color}|
|New|-|Old|{color:red}No{color}|

*Table 3* Reducer-in-use for a given spec:
||Reducer || OutputFormat || Valid conf?|| Reducer-in-use ||
|Command|NS|Yes|New|
|Command|Old|Yes|Old|
|Command|New|Yes|New|
|Old|NS|Yes|Old|
|New|NS|Yes|New|
|Old|Old|Yes|Old|
|New|New|Yes|New|
|Old|New|{color:red}No{color}|
|New|Old|{color:red}No{color}|
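
Read together, the three tables appear to follow one rule: a command or an unspecified component is API-neutral, any old API class forces the old API, mixing old and new API classes is invalid, and the new API is the default. A minimal Java sketch of that rule follows. The class, enum and method names are hypothetical and are not taken from the patch.

```java
import java.util.EnumSet;
import java.util.Optional;

/** Hypothetical sketch of the selection rule implied by Tables 1-3. */
final class StreamingApiResolver {

    /** How one component (mapper, reducer, input format, ...) was specified. */
    enum Spec { NOT_SPECIFIED, COMMAND, OLD, NEW }

    /**
     * Returns OLD or NEW for a valid specification, or an empty Optional
     * when the specification mixes old and new API classes.
     */
    static Optional<Spec> resolve(Spec... components) {
        EnumSet<Spec> apis = EnumSet.noneOf(Spec.class);
        for (Spec s : components) {
            // COMMAND and NOT_SPECIFIED do not tie the job to either API.
            if (s == Spec.OLD || s == Spec.NEW) {
                apis.add(s);
            }
        }
        if (apis.size() > 1) {
            return Optional.empty(); // old and new API classes mixed: invalid
        }
        // The new API is the default when nothing forces the old one.
        return Optional.of(apis.contains(Spec.OLD) ? Spec.OLD : Spec.NEW);
    }

    public static void main(String[] args) {
        // Table 1, row "Command | Old | NS"
        System.out.println(resolve(Spec.COMMAND, Spec.OLD, Spec.NOT_SPECIFIED));
        // Table 1, row "Command | New | Old"
        System.out.println(resolve(Spec.COMMAND, Spec.NEW, Spec.OLD));
    }
}
```

For the mapper, the arguments would be the mapper, the input format, and either the output format (no reducers) or the partitioner (with reducers); for the reducer, they would be the reducer and the output format.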


[jira] Updated: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1122:
-----------------------------------------------

           Status: Patch Available  (was: Open)
     Hadoop Flags: [Incompatible change]
    Fix Version/s: 0.22.0

Patch is ready for review.


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Jaideep (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865708#action_12865708 ] 

Jaideep commented on MAPREDUCE-1122:
------------------------------------

Some changes are needed in order to support this:
* o.a.h.mapred.JobConf is used everywhere in StreamJob. To allow new input and output formats, the new o.a.h.mapreduce.Job object should be used instead. Alternatively, we can create and set the configuration without relying on JobConf or Job methods, and only create a JobConf or Job object depending on whether the old or new API is being used.

* PipeMapper and PipeReducer are also based on the old API. We will have to create new Mappers and Reducers based on the new API in order to support newer input and output formats. PipeMapRed also uses JobConf in a number of places. Almost all of these calls could be replaced by calls to the Configuration object.

* StreamInputFormat extends o.a.h.mapred.KeyValueTextInputFormat. It should extend o.a.h.mapreduce.lib.input.KeyValueTextInputFormat.

* StreamBaseRecordReader extends o.a.h.mapred.RecordReader. A new class conforming to the new API is needed.

* Some static methods in StreamUtil.java use the old API:
     getCurrentSplit - uses o.a.h.mapred.FileSplit and JobConf. This method is not used anywhere else in the code.
     isLocalJobTracker - uses JobConf.
     getTaskInfo - uses JobConf to get the type of a task and its taskid. Used in PipeMapRed.setStreamJobDetails to set the taskid.
     addJobConfToEnvironment - takes a JobConf as argument. Should also take a Job.
    There is a static TaskID class in StreamUtil.java as well. If it's not needed, can it be removed?
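
One way to decouple a helper such as addJobConfToEnvironment from JobConf is to accept any iterable of key/value entries, since Hadoop's Configuration is itself iterable over its entries. The sketch below assumes that approach and mirrors the documented streaming behaviour of turning dots in property names into underscores. The class name, method names and the sample task id are hypothetical, and a plain Map stands in for the configuration so that the example runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, Hadoop-free sketch of exporting configuration to a child environment. */
final class ConfToEnvironment {

    /** Replaces every character outside [A-Za-z0-9] with '_'. */
    static String safeEnvVarName(String name) {
        StringBuilder safe = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean alphanumeric = (c >= '0' && c <= '9')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z');
            safe.append(alphanumeric ? c : '_');
        }
        return safe.toString();
    }

    /** Copies every entry into env under an environment-safe name. */
    static void addConfToEnvironment(Iterable<Map.Entry<String, String>> conf,
                                     Map<String, String> env) {
        for (Map.Entry<String, String> entry : conf) {
            env.put(safeEnvVarName(entry.getKey()), entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapred.task.id", "attempt_0001"); // illustrative value only
        Map<String, String> env = new LinkedHashMap<>();
        addConfToEnvironment(conf.entrySet(), env);
        System.out.println(env);
    }
}
```

Because the helper depends only on an iterable of entries, the same code path could serve both the old and the new API job objects.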


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919340#action_12919340 ] 

Jeremy Hanna commented on MAPREDUCE-1122:
-----------------------------------------

Is there any update on this?  It's kind of a pain to have to support the old and new API in a custom InputFormat/RecordReader in order to enable streaming.


[jira] Updated: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1122:
-----------------------------------------------

    Status: Open  (was: Patch Available)


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885482#action_12885482 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1122:
----------------------------------------------------

Supporting the new API in streaming involves two major tasks:
# Setting the job configuration for the streaming job: set the appropriate mapper and reducer depending on the arguments passed. Summarizing the requirements tables above:
 ** The old API mapper, PipeMapper, is used as the mapper for the job only if the mapper is a command and
    a) an old API input format is passed, or
    b) #reduces = 0 and an old API output format is passed, or
    c) #reduces != 0 and an old API partitioner is passed.
 ** Similarly, the old API reducer, PipeReducer, is used as the reducer for the job only if the reducer is a command and an old API output format is passed.
# Implementing the new API streaming mapper, reducer, etc.


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890205#action_12890205 ] 

Vinod K V commented on MAPREDUCE-1122:
--------------------------------------

I started looking at this patch. It's BIG, and streaming isn't exactly my 'home-ground' :) I had to spend quite some time reviewing it. Please bear with me; we will need to go through some iterations, at least one more big one, to get this close.

 - First up, the patch needs some merging to be done to accommodate recent commits in streaming.

 - It'd be really good if we can separate the new classes into new packages, library classes into a lib package and implementation classes to an impl package?

 - There are two ways of handling the skipping of bad records in the new API: (1) put the code in place and document that it isn't supported yet, so that whenever MAPREDUCE-1932 moves in, skipping automatically works in streaming, or (2) remove the code altogether and create a child issue of MAPREDUCE-1932 for streaming. It looks like you intended to do (2), but I do see some (dead) code related to skipping in the new API classes, e.g. StreamingMapper. We should either choose (1) completely or (2) completely.

Otherwise the overall functionality looks good to me, correctness included. Just some minor comments.

StreamingMapper.java
 - This log statement is new and we are doing it for every key. Too aggressive?
    LOG.info("input " + key + " "+ value);
 - Difference in logging compared to the old PipeMapRed class when exceptions happen in map.
 - Missing @Override annotation for overridden methods.
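
One common way to address the per-record logging concern is to log at a debug level behind a guard, so that the message string is built only when that level is enabled. The sketch below uses java.util.logging as a stand-in for the logging library used by streaming; the class and method names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch: per-record logging kept behind a debug-level guard. */
final class GuardedRecordLogging {
    private static final Logger LOG =
            Logger.getLogger(GuardedRecordLogging.class.getName());

    private long records = 0;

    /** Stand-in for a map() call that is invoked once per input record. */
    void map(String key, String value) {
        // The concatenation below runs only when FINE logging is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("input " + key + " " + value);
        }
        records++;
    }

    long recordsSeen() {
        return records;
    }

    public static void main(String[] args) {
        GuardedRecordLogging mapper = new GuardedRecordLogging();
        mapper.map("k1", "v1");
        mapper.map("k2", "v2");
        System.out.println("records seen: " + mapper.recordsSeen());
    }
}
```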

StreamingReducer.java
 - Not logging exit code when exceptions happen in reduce. Used to be the case in old code.
 - Missing @Override annotation for methods overridden.

How about passing the configuration to InputWriter.initialize() and letting TextInputWriter/TextOutputReader maintain the key/value separators and related information themselves, instead of polluting StreamingMapper and StreamingReducer?

StreamingCombiner
 - Missing @Override annotation for method overridden.

Autoinputformat2
 - No configure method like in AutoInputFormat?
 - Name? Once we move the lib classes to a new package, this class's name can stay the same old AutoInputFormat.

StreamingXmlRecordReader.java
 - Log.info statement in init() bears the wrong (parent) class name.
 - nextKeyValue() should be synchronized? In old api it was.

StreamingBaseRecordReader.java
 - getStatus() has changed w.r.t printing 'pos' also when compared to the older StreamBaseRecordReader.java

StreamJob.java
 - bq. Removes deprecated StreamJob(String[] argv, boolean mayExit);
   Just checking. Is the compatibility left in one release?
 - Same for StreamJob.go()?
 - boolean isOldIF argument to setOutputFormat is not used at all.
 - Cluster and hence StreamJob never close the client connection themselves at all! (Maybe another ticket)

TestStreamingStatus: 
 - +    //testStreamJob(false);// nonempty input
Commented out intentionally, in testReporting()?
 - Comments from +262 to +265 are no longer valid, right?

TrApp.java
 - Some expect() and expectDefined() calls are dropped. I could understand why the ones related to output format are dropped to accommodate testing both new and old apis. But removing of the checks related to input file and file length didn't make sense to me.

TaskInputOutputContextImpl
 - The changes here were a surprise to me. They should be related to MAPREDUCE-1905. Are you incorporating that here, or did you just keep them in the patch so that it runs? If it's the latter, please provide a patch without these changes. If it is the former, we will need to include the testcase from there too.

Miscellaneous comments:
 - It's the right time for us to mark all the touched classes/interfaces according to the classification taxonomy.
 - Should we make the initialize methods in InputWriter and OutputReader abstract now?
 - TestStreamingAPICompatibility class needs some javadoc.
 - TODO: In the end we need to be sure tests pass with LinuxTaskController as well. Please do this with your next patch if you haven't already.


[jira] Updated: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1122:
-----------------------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887308#action_12887308 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1122:
----------------------------------------------------

Forgot to mention that the skipping-bad-records functionality is not added for the new API classes, because the framework itself does not yet support it for the new API (MAPREDUCE-1932).


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885573#action_12885573 ] 

Hadoop QA commented on MAPREDUCE-1122:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448755/patch-1122.txt
  against trunk revision 960808.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 92 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/287/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/287/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/287/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/287/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885819#action_12885819 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1122:
----------------------------------------------------

bq. -1 contrib tests.
The failure is because of MAPREDUCE-1834.


[jira] Updated: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1122:
-----------------------------------------------

    Attachment: patch-1122-1.txt

Patch is updated to trunk with most of the review comments incorporated. Patch should be applied on top of MAPREDUCE-1905 to pass all tests.

bq. It'd be really good if we can separate the new classes into new packages, library classes into a lib package and implementation classes to an impl package?
Done

bq. There are two ways of handling the skipping of bad records in the new api ...........
Removed the dead code related to skipping in new api classes. Will add a subtask to MAPREDUCE-1932 to add support for streaming.

StreamingReducer.java
bq. Not logging exit code when exceptions happen in reduce. Used to be the case in old code.
Exit code is already logged in StreamingProcessManager. Even in old code, it was getting logged twice.

bq. How about passing the configuration to InputWriter.initialize() and letting TextInputWriter/TextOutputReader maintain the key/value separators and related information themselves, instead of polluting StreamingMapper and StreamingReducer?
Did not do this. It makes the code more complicated, because the mapper and reducer have different configuration parameter names.
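
For context, streaming reads separate properties for the map side and the reduce side, for example stream.map.input.field.separator and stream.reduce.input.field.separator, so a writer that resolved its own separator would need to know which role it is running in. The sketch below illustrates that lookup under stated assumptions: java.util.Properties stands in for Hadoop's Configuration, and the class, enum and method names are hypothetical.

```java
import java.util.Properties;

/** Hypothetical sketch of a role-dependent separator lookup. */
final class SeparatorLookup {

    /** Whether the writer feeds a map-side or a reduce-side command. */
    enum Role { MAP, REDUCE }

    /** Returns the input field separator for the given role, defaulting to a tab. */
    static String inputSeparator(Properties conf, Role role) {
        String name = (role == Role.MAP)
                ? "stream.map.input.field.separator"
                : "stream.reduce.input.field.separator";
        return conf.getProperty(name, "\t");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("stream.map.input.field.separator", ",");
        System.out.println("map separator: [" + inputSeparator(conf, Role.MAP) + "]");
        System.out.println("reduce separator: [" + inputSeparator(conf, Role.REDUCE) + "]");
    }
}
```

Keeping this lookup in the mapper and reducer, as the reply prefers, means the writer only ever receives an already resolved separator.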

Autoinputformat2
bq. No configure method like in AutoInputFormat?
New api does not have configure for inputformat.

StreamJob.java
bq. Is the compatibility left in one release?
Yes. All the removed methods have been deprecated since release 0.19.


TrApp.java
bq. Some expect() and expectDefined() calls are dropped. I could understand why the ones related to output format are dropped to accommodate testing both new and old apis. But removing of the checks related to input file and file length didn't make sense to me.
New api does not have the configuration parameters for input file and length (HADOOP-5973).

bq. Should we make the initialize methods in InputWriter and OutputReader abstract now?
Did not do this. I don't think it is required.

Patch incorporates all other comments.


[jira] Updated: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated MAPREDUCE-1122:
-----------------------------------------------

    Attachment: patch-1122.txt

Attaching a patch which does the following:
* Deprecates all the library classes in streaming, such as AutoInputFormat, StreamInputFormat, StreamXmlRecordReader etc., and adds new classes which use the new API.
* Changes the tools DumpTypedBytes and LoadTypedBytes to use the new API classes.
* Adds StreamJobConfig, holding all the configuration properties used in streaming.
* Adds classes StreamingMapper, StreamingReducer and StreamingCombiner, which extend the new API Mapper and Reducer classes.
  ** Adds a class StreamingProcess, which starts the streaming process and the MR output/error threads, waits for the threads, etc. This functionality is in PipeMapRed.java for the old API mapper/reducer; PipeMapper and PipeReducer extend PipeMapRed and implement the old Mapper/Reducer interfaces. We cannot make StreamingMapper/StreamingReducer extend StreamingProcess, because in the new API Mapper and Reducer are classes, not interfaces. So this functionality is moved into a separate class, which StreamingMapper/StreamingReducer compose.
  ** InputWriter and OutputReader, added in HADOOP-1722, take a PipeMapRed instance as a constructor parameter. That no longer makes sense, because process handling is done by a separate class, StreamingProcess, for the new API mapper/reducer. So, made the following incompatible change (looks clean now):
  *** Changes the OutputReader constructor to take a DataInput as parameter, instead of PipeMapRed
  *** Changes the InputWriter constructor to take a DataOutput as parameter, instead of PipeMapRed
* Moves some utility methods in PipeMapRed to StreamUtil.
* Removes the deprecated StreamJob(String[] argv, boolean mayExit); deprecates static public JobConf createJob(String[] argv); and adds static public Job createStreamingJob(String[] argv).
* Refactors setJobConf() into multiple setters to set the appropriate mapper/reducer in use.
* Adds unit tests for all the use cases described [above|https://issues.apache.org/jira/browse/MAPREDUCE-1122?focusedCommentId=12878515&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12878515]
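
The composition described in the list above can be illustrated with a minimal, Hadoop-free sketch. Because the mapper base type is a class, the mapper extends it and holds the process handler as a field, and the writer is constructed from a DataOutput rather than from a PipeMapRed instance. Every type below is a hypothetical stand-in written for this illustration, not the code from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of "extend the mapper class, compose the process handler". */
final class StreamingCompositionSketch {

    /** Stand-in for the new API Mapper, which is a class, not an interface. */
    abstract static class SketchMapper<K, V> {
        abstract void map(K key, V value);
    }

    /** Stand-in process handler; here it only owns the stream to the child command. */
    static final class SketchStreamingProcess {
        private final DataOutput clientOut;

        SketchStreamingProcess(DataOutput clientOut) {
            this.clientOut = clientOut;
        }

        DataOutput getClientOut() {
            return clientOut;
        }
    }

    /** Writer constructed from a DataOutput; emits key, tab, value, newline. */
    static final class SketchTextInputWriter {
        private final DataOutput out;

        SketchTextInputWriter(DataOutput out) {
            this.out = out;
        }

        void writeKeyValue(String key, String value) {
            try {
                out.write(key.getBytes(StandardCharsets.UTF_8));
                out.write('\t');
                out.write(value.getBytes(StandardCharsets.UTF_8));
                out.write('\n');
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Uses its single superclass slot for the mapper and composes the process. */
    static final class SketchStreamingMapper extends SketchMapper<String, String> {
        private final SketchTextInputWriter writer;

        SketchStreamingMapper(SketchStreamingProcess process) {
            this.writer = new SketchTextInputWriter(process.getClientOut());
        }

        @Override
        void map(String key, String value) {
            writer.writeKeyValue(key, value);
        }
    }

    /** Feeds one record through the mapper and returns what the child would receive. */
    static String run() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        SketchStreamingProcess process =
                new SketchStreamingProcess(new DataOutputStream(buffer));
        new SketchStreamingMapper(process).map("k", "v");
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(run());
    }
}
```

Taking a DataOutput also makes the writer easy to exercise in isolation, as run() does with an in-memory buffer.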


[jira] Assigned: (MAPREDUCE-1122) streaming with custom input format does not support the new API

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu reassigned MAPREDUCE-1122:
--------------------------------------------------

    Assignee: Amareshwari Sriramadasu
