Posted to dev@pig.apache.org by Alan Gates <ga...@hortonworks.com> on 2012/08/24 00:20:06 UTC

Re: Number of mappers in MRCompiler

Sorry for the very slow response, but here it is, hopefully better late than never.

On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:

> Thanks Alan.
> The requirement for me is to load N samples, based on the input file size, and perform a naive cube computation to determine the large groups that will not fit in the reducer's memory. I need to know the exact number of samples in order to calculate the partition factor for the large groups.
> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I cannot find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that reports the number of samples loaded.
> Is there any other way to know the exact number of samples loaded?
Not that I know of.

> 
> By analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then load samples from that temp file. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
Partly, but also because it allows any operators that need to run before the sample (like project or filter) to be placed in the pipeline.

Alan.
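For readers following the thread: the per-mapper sampling that RandomSampleLoader performs is in the spirit of reservoir sampling, which keeps exactly k tuples per task without knowing the input size up front. A minimal, self-contained sketch of that technique (illustrative only, not Pig's actual RandomSampleLoader code; the class and method names are invented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling sketch: keep the first k records, then replace a
// kept record with probability k/i for the i-th record seen. Each "mapper"
// ends up with exactly k samples (assuming the input has at least k records).
public class ReservoirSample {
    public static List<Integer> sample(List<Integer> input, int k, long seed) {
        Random rnd = new Random(seed);
        List<Integer> reservoir = new ArrayList<>();
        for (int i = 0; i < input.size(); i++) {
            if (i < k) {
                reservoir.add(input.get(i));
            } else {
                int j = rnd.nextInt(i + 1);          // uniform in [0, i]
                if (j < k) reservoir.set(j, input.get(i));
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) input.add(i);
        List<Integer> s = sample(input, 1000, 42L);
        System.out.println(s.size()); // prints 1000: exactly k samples per task
    }
}
```

Note this only fixes the count per task; as the question above points out, the total across the job still depends on the number of map tasks.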


Re: Number of mappers in MRCompiler

Posted by Prasanth J <bu...@gmail.com>.
Oh yeah.. this question is not related to the cube sampling work we discussed; I just wanted to know the reason behind it, out of curiosity :)


Thanks
-- Prasanth

On Aug 23, 2012, at 11:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I think we decided to instead stub in a special loader that reads a
> few records from each underlying split, in a single mapper (by using a
> single wrapping split), right?


Re: Number of mappers in MRCompiler

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think we decided to instead stub in a special loader that reads a
few records from each underlying split, in a single mapper (by using a
single wrapping split), right?
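The "single wrapping split" idea can be sketched roughly as follows (a plain-Java simulation, not actual Pig or Hadoop InputSplit code; all names here are invented for illustration). One task is handed every underlying split and reads only the first few records of each, so the total sample count is known exactly up front:

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of a "wrapping split": a single task iterates over all
// underlying splits and takes at most perSplit records from each, so the
// sample size is deterministic and independent of the number of mappers.
public class WrappingSplitSample {
    public static List<String> sampleAcrossSplits(List<List<String>> splits,
                                                  int perSplit) {
        List<String> out = new ArrayList<>();
        for (List<String> split : splits) {
            int n = Math.min(perSplit, split.size());
            out.addAll(split.subList(0, n));   // first n records of this split
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
            List.of("a1", "a2", "a3"),
            List.of("b1"),
            List.of("c1", "c2", "c3", "c4"));
        // 2 + 1 + 2 = 5 records, all read by this single "mapper".
        System.out.println(sampleAcrossSplits(splits, 2).size()); // prints 5
    }
}
```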

On Thu, Aug 23, 2012 at 7:55 PM, Prasanth J <bu...@gmail.com> wrote:
> I see. Thanks, Alan, for your reply.
> Also, one more question that I posted earlier:
>
> I used RandomSampleLoader and specified a sample size of 100. The number of map tasks executed is 110, so I expect the total number of samples received on the reducer to be 110*100 = 11000, but it is always more than that: the actual count is between 14000 and 15000. I am not sure if this is a bug or if I am missing something. Is this expected behavior?
>
> Thanks
> -- Prasanth

Re: Number of mappers in MRCompiler

Posted by Prasanth J <bu...@gmail.com>.
I see. Thanks, Alan, for your reply.
Also, one more question that I posted earlier:

I used RandomSampleLoader and specified a sample size of 100. The number of map tasks executed is 110, so I expect the total number of samples received on the reducer to be 110*100 = 11000, but it is always more than that: the actual count is between 14000 and 15000. I am not sure if this is a bug or if I am missing something. Is this expected behavior?

Thanks
-- Prasanth
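A quick check of the arithmetic in the question (all figures, 110 map tasks, sample size 100, observed 14000-15000 tuples, come from the thread itself; the class name below is made up). It shows the observed totals imply an effective per-mapper contribution of roughly 127-136 tuples rather than the requested 100:

```java
// Sanity check of the expected vs. observed sample counts from the question.
public class SampleCountCheck {
    public static void main(String[] args) {
        int mapTasks = 110;
        int requestedPerMapper = 100;
        int expectedTotal = mapTasks * requestedPerMapper;
        System.out.println(expectedTotal);        // prints 11000

        // Effective per-mapper counts implied by the observed range:
        System.out.println(14000 / mapTasks);     // prints 127
        System.out.println(15000 / mapTasks);     // prints 136
    }
}
```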
