Posted to dev@pig.apache.org by Prasanth J <bu...@gmail.com> on 2012/07/27 03:04:29 UTC

Total count of RandomSampleLoader is unpredictable

Hello everyone

I am using RandomSampleLoader to load 1000 tuples per mapper. I have 11 map jobs in a small dataset and 109 map jobs in a large dataset. 

I am expecting 11000 tuples from the small dataset and 109000 tuples from the large dataset. But the actual number of tuples that I get is always more than I expected. In the small dataset case I get 15000 tuples, whereas in the large dataset case I get 145000 (sometimes 150000) tuples.

Is this a bug, or is it expected behavior? If reservoir sampling is used by all mappers, then why is the total number of samples higher than (number of mappers) x 1000?
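
To make the expectation concrete, here is a minimal sketch of per-mapper reservoir sampling. It is a generic textbook sampler (Algorithm R), not Pig's actual RandomSampleLoader code; the function name reservoir_sample and the split size of 5000 records are assumptions chosen only for illustration.

```python
import random

def reservoir_sample(records, k, rng=random):
    """Keep a uniform random sample of at most k items from an iterable.

    Generic Algorithm R; hypothetical helper, not Pig's implementation.
    """
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = record
    return reservoir

# Hypothetical input: 11 splits (one per mapper), 5000 records each.
splits = [range(5000) for _ in range(11)]
total = sum(len(reservoir_sample(split, 1000)) for split in splits)
print(total)  # 11000
```

With a sampler like this, each mapper can emit at most 1000 tuples, so 11 mappers give at most 11000 and 109 mappers give at most 109000. That is the upper bound I expected to see.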

Thanks
-- Prasanth


Re: Total count of RandomSampleLoader is unpredictable

Posted by Jie Li <ji...@cs.duke.edu>.
Not sure if it's the same issue, but in some cases I also see that the
Map input records counter is greater than the actual number of input
records.

Jie
