Posted to dev@pig.apache.org by "Alex Newman (JIRA)" <ji...@apache.org> on 2008/12/18 14:33:44 UTC

[jira] Created: (PIG-570) Large BZip files seem to lose data in Pig

Large BZip files seem to lose data in Pig
-------------------------------------------

                 Key: PIG-570
                 URL: https://issues.apache.org/jira/browse/PIG-570
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.0.0, types_branch, 0.1.0, site
         Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
            Reporter: Alex Newman
             Fix For: types_branch, 0.1.0, site, 0.0.0


So I don't believe bzip2 input to Pig is working, at least not with large files. It seems as though the map input files are getting cut off: the maps complete way too quickly, and the actual row of data that Pig tries to process often gets cut at a random point and becomes incomplete. Here are my symptoms:

- Maps seem to be completing at an unbelievably fast rate

With uncompressed data
Status: Succeeded
Started at: Wed Dec 17 21:31:10 EST 2008
Finished at: Wed Dec 17 22:42:09 EST 2008
Finished in: 1hrs, 10mins, 59sec
Kind	% Complete	Num Tasks	Pending	Running	Complete	Killed	Failed/Killed Task Attempts
map	100.00%	4670	0	0	4670	0	0 / 21
reduce	57.72%	13	0	0	13	0	0 / 4


With bzip compressed data

Started at: Wed Dec 17 21:17:28 EST 2008
Failed at: Wed Dec 17 21:17:52 EST 2008
Failed in: 24sec
Black-listed TaskTrackers: 2
Kind	% Complete	Num Tasks	Pending	Running	Complete	Killed	Failed/Killed Task Attempts
map	100.00%	183	0	0	15	168	54 / 22
reduce	100.00%	13	0	0	0	13	0 / 0

The errors we get:
java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec	A, 0HAW, CHIX, )
	at org.apache.pig.data.Tuple.getField(Tuple.java:176)
	at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
	at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
	at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
	at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
	at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
	at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
Last 4KB
attempt_200812161759_0045_m_000007_0	task_200812161759_0045_m_000007	tsdhb06.factset.com	FAILED	
java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec	A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
	at org.apache.pig.data.Tuple.getField(Tuple.java:176)
	at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
	at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
	at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
	at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
	at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
	at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
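
The exception is consistent with a truncated row: a record cut off mid-line parses into fewer fields than the script projects, so asking for field 11 falls off the end. A minimal sketch (made-up delimiter and row for illustration, not Pig's Tuple class):

public class TruncatedRecordSketch {
    public static void main(String[] args) {
        // Hypothetical tab-delimited row that was cut off mid-record.
        String truncated = "rec\tA\t0HAW\tCHIX";
        String[] fields = truncated.split("\t"); // only 4 fields survive
        // Pig's Tuple.getField raises the IndexOutOfBoundsException shown above;
        // a plain Java array throws the ArrayIndexOutOfBoundsException subclass.
        System.out.println(fields[11]);
    }
}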


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by Mridul <mr...@yahoo-inc.com>.
A similar thing existed with PigStorage, IIRC (at least the last time I 
checked, a while back, unless I missed something):
if a record boundary aligned itself with an HDFS block boundary, the 
subsequent record would get dropped by Pig.

To illustrate:
map1 would read until the end of its block or the last record boundary, 
whichever comes last.
map2 would assume a partial read by map1 and proceed to find the record 
delimiter for its block, reading from there on.
Hence, if map1's last record boundary and the end of the HDFS block coincide, 
map2 ends up skipping the first record of its block.

Not sure if similar thing is happening here.
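
A minimal sketch of that split convention (hypothetical code, assuming newline-delimited records), showing how a record that starts exactly at a split boundary gets dropped:

import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the (buggy) convention described above: every split
// except the first unconditionally skips to the first newline, so a record
// that starts exactly at a split boundary is thrown away.
public class SplitSkipSketch {
    static void readSplit(byte[] data, int start, int end, boolean firstSplit) {
        int pos = start;
        if (!firstSplit) {
            // Assume the previous mapper finished the record straddling the
            // boundary, so skip past the next newline before emitting records.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        // Emit whole records that start at or before the split end.
        while (pos <= end && pos < data.length) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') eol++;
            System.out.println("record: " + new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
    }

    public static void main(String[] args) {
        // "r2\n" ends exactly at the first split's end (byte 5), so the second
        // split's skip step discards "r3".
        byte[] data = "r1\nr2\nr3\nr4\n".getBytes(StandardCharsets.UTF_8);
        readSplit(data, 0, 5, true);   // prints r1, r2
        readSplit(data, 6, 11, false); // prints only r4 -- r3 is lost
    }
}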

Regards,
Mridul



[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment: PIG-570.patch

I believe the problem is due to bad position tracking. In the current version of the code, we chop up the input into blocks, but unfortunately, when using bzip, there are bzip block boundaries, HDFS block boundaries, and record boundaries. If these boundaries line up too closely, a record could get skipped or possibly corrupted.

I was able to reproduce a problem in the attached test case; hopefully it is the same as yours.

The root cause turns out to be improper tracking of "position". If we blindly use the position of the underlying stream, and a bzip block and an HDFS block line up, we may think that we have read the first record of the next slice when in fact we have only read the bzip block header.

The attached patch fixes the problem by defining the position of the stream as the position of the start of the current block header in the underlying stream.
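
A minimal sketch of that position convention (a hypothetical class, not the attached patch): the position reported for slice-boundary checks is the offset at which the current bzip block header starts, not how far the decompressor has actually read in the underlying stream.

import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only: report getPos() as the offset where the
// current compressed block header began, so a slice-boundary comparison sees
// the block start rather than the read-ahead position of the raw stream.
class BlockPositionTrackingStream extends InputStream {
    private final InputStream in;   // underlying HDFS stream
    private long rawPos = 0;        // bytes consumed from the underlying stream
    private long blockStartPos = 0; // offset where the current block header began

    BlockPositionTrackingStream(InputStream in) { this.in = in; }

    // Called by the decompressor whenever it begins a new compressed block.
    void markBlockStart() { blockStartPos = rawPos; }

    // Position used to decide whether this slice has read past its end.
    long getPos() { return blockStartPos; }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) rawPos++;
        return b;
    }
}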




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment: PIG-570.patch

Tightened up the test case and increased the number of bits used for the signature to the full 48 bits. (Since I now use the start of the block boundary as the offset, we can use the whole thing.)
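
For reference, a minimal sketch (hypothetical code, not the patch) of matching the full 48-bit bzip2 block-header signature, the value 0x314159265359; the signature is not byte-aligned in the compressed stream, so it has to be matched bit by bit:

import java.io.IOException;
import java.io.InputStream;

class BlockSignatureScanner {
    private static final long BLOCK_MAGIC = 0x314159265359L; // bzip2 block header
    private static final long MASK_48 = (1L << 48) - 1;

    // Returns the bit offset at which the next block signature starts, or -1.
    static long findNextBlock(InputStream in) throws IOException {
        long window = 0;
        long bitsRead = 0;
        int b;
        while ((b = in.read()) != -1) {
            for (int i = 7; i >= 0; i--) {
                window = ((window << 1) | ((b >> i) & 1)) & MASK_48;
                bitsRead++;
                if (bitsRead >= 48 && window == BLOCK_MAGIC) {
                    return bitsRead - 48; // bit offset of the signature start
                }
            }
        }
        return -1;
    }
}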




[jira] Commented: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660838#action_12660838 ] 

Olga Natkovich commented on PIG-570:
------------------------------------

Ok, let's keep the size and just make the patch against the types branch, thanks.




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment:     (was: PIG-570.patch)




[jira] Assigned: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-570:
------------------------------

    Assignee: Benjamin Reed




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment: PIG-570.patch
                bzipTest.bz2

Fixed the bzip file for the test cases so that it includes carefully crafted bad corner cases.




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment: PIG-570.patch

Regenerated patch against types branch




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Status: Patch Available  (was: Open)




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment:     (was: PIG-570.patch)




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment: bzipTest.bz2

This is the test data for the bzip unit test. It should go under test/org/apache/pig/test/data/bzipTest.bz2.




[jira] Commented: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659994#action_12659994 ] 

Olga Natkovich commented on PIG-570:
------------------------------------

Ben, thanks. This is great!

I tried to apply your patch, but it failed. Can you make sure that your patch is relative to the latest code in the types branch? Also, is it possible to have smaller test data, or is this the smallest data that shows the problem?




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment:     (was: bzipTest.bz2)




[jira] Commented: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660518#action_12660518 ] 

Benjamin Reed commented on PIG-570:
-----------------------------------

Ah, sorry, I did the patch with respect to trunk; I'll regenerate it. It is probably possible to create a smaller test case, but it would take a while. I did simple brute-force trial and error to get this one. (I thought it was pretty good that I was able to keep it under 2M. :) Are you concerned about the size or the run time?




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-570:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

patch committed; thanks, Ben!




[jira] Updated: (PIG-570) Large BZip files seem to lose data in Pig

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment:     (was: PIG-570.patch)

