Posted to dev@pig.apache.org by Adam Warrington <aw...@gmail.com> on 2011/04/04 21:33:32 UTC

Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/
-----------------------------------------------------------

Review request for pig.


Summary
-------

This is a patch for PIG-1702, which describes an issue where the task output logs for Pig streaming jobs contain null input-split information. The ability to query the input-split information through the JobConf went away with the new MR API. We must now obtain a reference to the underlying FileSplit and query that reference for the information.


Diffs
-----

  trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java 1088692 

Diff: https://reviews.apache.org/r/547/diff


Testing
-------


Thanks,

Adam


Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

Posted by Adam Warrington <aw...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review383
-----------------------------------------------------------



trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
<https://reviews.apache.org/r/547/#comment725>

    Referencing PigMapReduce.sJobContext may cause a race condition in local Pig jobs, similar to what is described in PIG-1831. Should a similar fix be applied where the context in PigMapReduce is in thread local storage?
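
    For illustration, the thread-local idea could look roughly like the sketch below. This is not the PIG-1831 patch: ContextHolder and its methods are hypothetical names, and a plain String stands in for the Hadoop job context. It only shows that each thread reads back the value it stored itself, so two local jobs running in the same JVM would not overwrite each other's context.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical holder; a String stands in for the real job context object.
class ContextHolder {
    // Each thread sees only the value it stored itself.
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    static void set(String ctx) { CONTEXT.set(ctx); }
    static String get() { return CONTEXT.get(); }
    static void clear() { CONTEXT.remove(); }
}

class ThreadLocalContextDemo {
    // Runs two simulated local jobs on separate threads and returns
    // the context each thread read back after both had stored theirs.
    static String[] runTwoJobs() {
        String[] seen = new String[2];
        CountDownLatch bothSet = new CountDownLatch(2);
        Thread[] workers = new Thread[2];
        for (int i = 0; i < 2; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                ContextHolder.set("job-" + id);
                bothSet.countDown();
                try {
                    // Wait until the other thread has also stored its context.
                    bothSet.await();
                    seen[id] = ContextHolder.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    ContextHolder.clear();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] seen = runTwoJobs();
        System.out.println(seen[0] + " " + seen[1]);
    }
}
```

    With a single shared static field instead of the ThreadLocal, both threads could read back whichever context was written last.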


- Adam




Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

Posted by Adam Warrington <aw...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/
-----------------------------------------------------------

(Updated 2011-05-19 16:27:22.583249)


Review request for pig.


Changes
-------

Sigh...I edited this a while back, but didn't publish what I wrote.


Summary
-------

This is a patch for PIG-1702, which describes an issue where the task output logs for Pig streaming jobs contain null input-split information. The ability to query the input-split information through the JobConf went away with the new MR API. We must now obtain a reference to the underlying FileSplit and query that reference for the information.


Diffs
-----

  trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java 1088692 

Diff: https://reviews.apache.org/r/547/diff


Testing (updated)
-------

To test this, I wrote a very simple Python script and streamed data through it using Pig. After the task completed, I checked its task logs: the stderr log now contains valid input split information. Below are the scripts and test data used.

### PIG commands run ###
DEFINE testpy `test.py` SHIP ('test.py');
raw_records = LOAD '/test.txt2'; 
T1 = STREAM raw_records THROUGH testpy;
dump T1;

### test.py ###
#!/usr/bin/python
import sys

cnt = 0
for line in sys.stdin:
    print line.strip() + " " + str(cnt)
    cnt += 1

### contents of /test.txt on hdfs ###
one line
two line
three line
four line


Thanks,

Adam


Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

Posted by Adam Warrington <aw...@gmail.com>.

> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java, line 205
> > <https://reviews.apache.org/r/547/diff/1/?file=14980#file14980line205>
> >
> >     please clean up whitespace :)

Oops, sorry. I'll clean that up.


> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java, line 202
> > <https://reviews.apache.org/r/547/diff/1/?file=14980#file14980line202>
> >
> >     Do we care about the specifics of how this output is written?
> >     
> >     Seems like it would be less code, and potentially better in the long run (if we are dealing with other kinds of splits) to just call toString() on the InputSplit. FileSplit already defines its own toString() which prints out the path, the start offset, and the length.
> 
> Ashutosh Chauhan wrote:
>     I agree with Dmitriy. If possible, we should avoid special-casing a particular type of InputSplit. Further, InputSplit provides the getLocations() and getLength() APIs, which should be used instead of FileSplit-specific APIs.

So it seems the options are to either:

1. Use the input split's toString() method.
2. Use just getLocations() and getLength(), which are part of the InputSplit API.

I'm leaning towards toString(), because for the common case of FileSplit it will contain useful information that getLocations() and getLength() won't provide, namely the file name and the file offset.

If this is the consensus, I'll submit a patch with that update. Let me know.
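
A rough sketch of what each option would log. The classes below are made-up stand-ins rather than the real Hadoop types, and the path, offset, length, and host values are invented; the stand-in toString() follows the path, start offset, and length described in the review comment, in a path:start+length layout assumed for this sketch.

```java
import java.util.Arrays;

// Stand-in for Hadoop's abstract InputSplit. The real getLength() and
// getLocations() also declare checked exceptions, omitted here for brevity.
abstract class SketchInputSplit {
    abstract long getLength();
    abstract String[] getLocations();
}

// Stand-in for FileSplit: adds a path and start offset, and overrides toString().
class SketchFileSplit extends SketchInputSplit {
    private final String path;
    private final long start;
    private final long length;
    private final String[] hosts;

    SketchFileSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    @Override long getLength() { return length; }
    @Override String[] getLocations() { return hosts; }

    // Path, start offset, and length in a single string.
    @Override public String toString() { return path + ":" + start + "+" + length; }
}

class SplitLogSketch {
    // Option 1: rely on whatever the concrete split prints about itself.
    static String viaToString(SketchInputSplit split) {
        return "input split: " + split;
    }

    // Option 2: use only the accessors every split type has.
    static String viaGenericApi(SketchInputSplit split) {
        return "input split: length=" + split.getLength()
                + " locations=" + Arrays.toString(split.getLocations());
    }

    public static void main(String[] args) {
        SketchInputSplit split =
                new SketchFileSplit("/data/input.txt", 0L, 1024L, new String[] {"host1"});
        System.out.println(viaToString(split));   // input split: /data/input.txt:0+1024
        System.out.println(viaGenericApi(split)); // input split: length=1024 locations=[host1]
    }
}
```

Neither option needs a cast to a specific split type; option 1 simply keeps the file name and offset whenever the split happens to be file-based.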


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review452
-----------------------------------------------------------




Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

Posted by Ashutosh Chauhan <ha...@apache.org>.

> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java, line 202
> > <https://reviews.apache.org/r/547/diff/1/?file=14980#file14980line202>
> >
> >     Do we care about the specifics of how this output is written?
> >     
> >     Seems like it would be less code, and potentially better in the long run (if we are dealing with other kinds of splits) to just call toString() on the InputSplit. FileSplit already defines its own toString() which prints out the path, the start offset, and the length.

I agree with Dmitriy. If possible, we should avoid special-casing a particular type of InputSplit. Further, InputSplit provides the getLocations() and getLength() APIs, which should be used instead of FileSplit-specific APIs.


- Ashutosh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review452
-----------------------------------------------------------




Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review452
-----------------------------------------------------------



trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
<https://reviews.apache.org/r/547/#comment876>

    Do we care about the specifics of how this output is written?
    
    Seems like it would be less code, and potentially better in the long run (if we are dealing with other kinds of splits) to just call toString() on the InputSplit. FileSplit already defines its own toString() which prints out the path, the start offset, and the length.



trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
<https://reviews.apache.org/r/547/#comment875>

    please clean up whitespace :)


- Dmitriy

