Posted to user@pig.apache.org by da...@ya.ru on 2009/03/03 22:22:48 UTC
Re: 4,522,292 records lost while sorting

I counted the records in the uncompressed input and in the sorted output:
3,344,109,862 input
3,339,587,570 output

So, the difference is 4,522,292 records.
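
For anyone who wants to repeat the comparison, here is a minimal Python sketch that walks the record counts quoted in this thread and reports where they stop matching. The stage labels and the `find_losses` helper are hypothetical names used only for illustration; the numbers are the ones reported above and in the original message.

```python
# Record counts quoted in this thread. The labels are illustrative only.
stages = [
    ("job 1 map input", 3_344_109_862),
    ("job 1 map output", 3_344_109_862),
    ("job 3 map input", 3_339_587_570),
    ("job 3 reduce output", 3_339_587_570),
]


def find_losses(stages):
    """Return (from_stage, to_stage, lost) for each adjacent pair whose counts differ."""
    losses = []
    for (prev_name, prev_count), (next_name, next_count) in zip(stages, stages[1:]):
        if prev_count != next_count:
            losses.append((prev_name, next_name, prev_count - next_count))
    return losses


losses = find_losses(stages)
for prev_name, next_name, lost in losses:
    print(f"{lost:,} records missing between {prev_name} and {next_name}")
```

With these numbers the only mismatch is between what the first job wrote and what the third job read, which is the pair that should be identical according to the explanation quoted below.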

2009/2/19  <da...@ya.ru>:
> 2009/2/19 Alan Gates <ga...@yahoo-inc.com>:
>> The second job in an order by is a sampling job, so the fact that it wrote
>> only one record is expected.  The one record is a quantiles tuple that
>> describes how the next job should set up its partitioner.
>>
>> The third job should read exactly the same number of records as the first
>> job wrote, as its input is the output of the first job.
>>
>> Are you getting these numbers from hadoop's UI?
>
> Yes.  When I first started using pig I checked actual record counts
> against hadoop's UI, and they did not differ.
>
>> If so, are you using
>> compression, as that sometimes messes up the reporting of the hadoop UI.
>
> Yes, input data is bzip2 compressed.  I will check without compression.
>
>>  Also, did you run a separate job to count the number of records on your
>> input and output files?
>
> No.  I will check numbers separately.
>
>> We haven't tested current pig with hadoop 19.  In fact, I didn't think it
>> ran with it at all without applying a patch.  I don't know if that could
>> contribute to this or not.
>
> I'm using PIG-573.patch from
> https://issues.apache.org/jira/browse/PIG-573 to run current pig on
> hadoop branch-0.19.
>
>>
>> Alan.
>>
>> On Feb 18, 2009, at 3:03 PM, <da...@ya.ru> wrote:
>>
>>> Hi,
>>>
>>> I passed 3,344,109,862 records to ORDER and got 3,339,587,570 in the
>>> output with no noticeable errors.
>>>
>>> There were three jobs.
>>> First got 3,344,109,862 records (map input) and produced the same
>>> number (map output).
>>> Second got 248,820 (map input) and produced 1 (reduce output).
>>> Third got 3,339,587,570 (map input) and produced the same number
>>> (reduce output).
>>> So I guess something was wrong in the second job.
>>>
>>> I used pig from trunk at revision 743989 and hadoop from branch-0.19
>>> at revision 745383.
>>>
>>> I'd be happy to use pig without losing data, and I'm ready to provide
>>> additional details or run tests if it helps.
>>> Thanks.
>>
>>
>