Posted to user@pig.apache.org by da...@ya.ru on 2009/02/19 00:03:13 UTC

4,522,292 records lost while sorting

Hi,

I passed 3,344,109,862 records to ORDER and got 3,339,587,570 in the
output with no noticeable errors.

There were three jobs.
First got 3,344,109,862 records (map input) and produced the same
number (map output).
Second got 248,820 (map input) and produced 1 (reduce output).
Third got 3,339,587,570 (map input) and produced the same number
(reduce output).
So I guess something was wrong in the second job.
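
A minimal script of the shape that produces this three-job plan would
look like the sketch below (file and field names are placeholders, not
the real ones):

  data   = LOAD 'input' AS (key, val);
  sorted = ORDER data BY key;
  STORE sorted INTO 'sorted_output';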

I used pig from trunk at revision 743989 and hadoop from branch-0.19
at revision 745383.

I'd be happy to use pig with no data loss and am ready to provide
additional details or tests if that helps.
Thanks.

Re: 4,522,292 records lost while sorting

Posted by da...@ya.ru.
I counted the number of records in the uncompressed input and in the
sorted output:

3,344,109,862 input
3,339,587,570 output

So the difference is 4,522,292 records.
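
A count like this can be obtained with a trivial pig job over each
file, along these lines (alias and path are placeholders):

  data = LOAD 'some_file';
  grp  = GROUP data ALL;
  cnt  = FOREACH grp GENERATE COUNT(data);
  DUMP cnt;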


Re: 4,522,292 records lost while sorting

Posted by da...@ya.ru.
2009/2/19 Alan Gates <ga...@yahoo-inc.com>:
> The second job in an order by is a sampling job, so the fact that it wrote
> only one record is expected.  The one record is a quantiles tuple that
> describes how the next job should set up its partitioner.
>
> The third job should read exactly the same number of records as the first
> job wrote, as its input is the output of the first job.
>
> Are you getting these numbers from hadoop's UI?

Yes.  When I first started using pig I checked actual record counts
against hadoop's UI, and they matched.

> If so, are you using compression?  That sometimes messes up the
> reporting of the hadoop UI.

Yes, the input data is bzip2-compressed.  I will check without compression.
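
An easy way to get an uncompressed copy to count against is to let pig
rewrite the data, along these lines (paths are placeholders):

  raw = LOAD 'input.bz2';        -- bzip2 input is decompressed on read
  STORE raw INTO 'input_plain';  -- PigStorage writes plain text by default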

>  Also, did you run a separate job to count the number of records on your
> input and output files?

No.  I will count the records separately.

> We haven't tested current pig with hadoop 19.  In fact, I didn't think it
> ran with it at all without applying a patch.  I don't know if that could
> contribute to this or not.

I'm using PIG-573.patch from
https://issues.apache.org/jira/browse/PIG-573 to run current pig on
hadoop branch-0.19.

>
> Alan.

Re: 4,522,292 records lost while sorting

Posted by Alan Gates <ga...@yahoo-inc.com>.
The second job in an order by is a sampling job, so the fact that it  
wrote only one record is expected.  The one record is a quantiles  
tuple that describes how the next job should set up its partitioner.

The third job should read exactly the same number of records as the  
first job wrote, as its input is the output of the first job.
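
You can see that plan for yourself by running EXPLAIN on the sorted
alias in grunt; it should print the map-reduce plan, including the
sampling job whose quantiles feed the partitioner.  A sketch (names
are placeholders):

  grunt> data = LOAD 'input' AS (key, val);
  grunt> sorted = ORDER data BY key;
  grunt> EXPLAIN sorted;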

Are you getting these numbers from hadoop's UI?  If so, are you using
compression?  That sometimes messes up the reporting of the hadoop UI.
Also, did you run a separate job to count the number of records on
your input and output files?

We haven't tested current pig with hadoop 19.  In fact, I didn't think  
it ran with it at all without applying a patch.  I don't know if that  
could contribute to this or not.

Alan.
