Posted to dev@pig.apache.org by Prasanth J <bu...@gmail.com> on 2012/07/26 00:47:21 UTC

Number of mappers in MRCompiler

Hello everyone

I would like to know if there is a way to determine the number of mappers while compiling the physical plan into the MR plan.

Thanks
-- Prasanth


Re: Number of mappers in MRCompiler

Posted by Prasanth J <bu...@gmail.com>.
Oh yeah.. this question is not related to the cube sampling stuff that we discussed. I just wanted to know the reason behind that, out of curiosity :)


Thanks
-- Prasanth

On Aug 23, 2012, at 11:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I think we decided to instead stub in a special loader that reads a
> few records from each underlying split, in a single mapper (by using a
> single wrapping split), right?
> 
> On Thu, Aug 23, 2012 at 7:55 PM, Prasanth J <bu...@gmail.com> wrote:
>> I see. Thanks Alan for your reply.
>> Also, one more question that I posted earlier:
>>
>> I used RandomSampleLoader and specified a sample size of 100. The number of map tasks executed is 110, so I am expecting the total number of samples received on the reducer to be 110*100 = 11000, but it is always more than that. The actual number of tuples received is between 14000 and 15000. I am not sure if it's a bug or if I am missing something. Is this expected behavior?
>> 
>> Thanks
>> -- Prasanth
>> 
>> On Aug 23, 2012, at 6:20 PM, Alan Gates <ga...@hortonworks.com> wrote:
>> 
>>> Sorry for the very slow response, but here it is, hopefully better late than never.
>>> 
>>> On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:
>>> 
>>>> Thanks Alan.
>>>> The requirement for me is that I want to load N samples, based on the input file size, and perform naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples in order to calculate the partition factor for the large groups.
>>>> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I will not be able to find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that tells how many samples were loaded.
>>>> Is there any other way to know the exact number of samples loaded?
>>> Not that I know of.
>>> 
>>>> 
>>>> By analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then use that temp file to load samples. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
>>> Partly, but also because it allows any operators that need to run before the sample (like project or filter) to be placed in the pipeline.
>>> 
>>> Alan.
>>> 
>> 


Re: Number of mappers in MRCompiler

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think we decided to instead stub in a special loader that reads a
few records from each underlying split, in a single mapper (by using a
single wrapping split), right?
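
Roughly, the record reader behind that single wrapping split would just walk the underlying splits and pull a few records from each. Something like this untested sketch against the Hadoop 2.x mapreduce API (class and method names are made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class PerSplitSampler {

    // Take up to samplesPerSplit records from every underlying split, all in one
    // process -- the same loop the wrapping split's record reader would run.
    public static List<String> sample(Configuration conf, Path input, int samplesPerSplit)
            throws IOException, InterruptedException {
        Job job = Job.getInstance(conf);
        FileInputFormat.addInputPath(job, input);
        TextInputFormat format = new TextInputFormat();

        List<String> samples = new ArrayList<String>();
        for (InputSplit split : format.getSplits(job)) {
            TaskAttemptContext ctx =
                new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
            RecordReader<?, ?> reader = format.createRecordReader(split, ctx);
            reader.initialize(split, ctx);
            int taken = 0;
            while (taken < samplesPerSplit && reader.nextKeyValue()) {
                samples.add(reader.getCurrentValue().toString());
                taken++;
            }
            reader.close();
        }
        return samples;
    }
}

A real wrapping split would also have to serialize the underlying splits so they reach the mapper (roughly what Pig's PigSplit does); the sketch above skips that part.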

On Thu, Aug 23, 2012 at 7:55 PM, Prasanth J <bu...@gmail.com> wrote:
> I see. Thanks Alan for your reply.
> Also, one more question that I posted earlier:
>
> I used RandomSampleLoader and specified a sample size of 100. The number of map tasks executed is 110, so I am expecting the total number of samples received on the reducer to be 110*100 = 11000, but it is always more than that. The actual number of tuples received is between 14000 and 15000. I am not sure if it's a bug or if I am missing something. Is this expected behavior?
>
> Thanks
> -- Prasanth
>
> On Aug 23, 2012, at 6:20 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
>> Sorry for the very slow response, but here it is, hopefully better late than never.
>>
>> On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:
>>
>>> Thanks Alan.
>>> The requirement for me is that I want to load N samples, based on the input file size, and perform naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples in order to calculate the partition factor for the large groups.
>>> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I will not be able to find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that tells how many samples were loaded.
>>> Is there any other way to know the exact number of samples loaded?
>> Not that I know of.
>>
>>>
>>> By analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then use that temp file to load samples. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
>> Partly, but also because it allows any operators that need to run before the sample (like project or filter) to be placed in the pipeline.
>>
>> Alan.
>>
>

Re: Number of mappers in MRCompiler

Posted by Prasanth J <bu...@gmail.com>.
I see. Thanks Alan for your reply. 
Also, one more question that I posted earlier:

I used RandomSampleLoader and specified a sample size of 100. The number of map tasks executed is 110, so I am expecting the total number of samples received on the reducer to be 110*100 = 11000, but it is always more than that. The actual number of tuples received is between 14000 and 15000. I am not sure if it's a bug or if I am missing something. Is this expected behavior?

Thanks
-- Prasanth

On Aug 23, 2012, at 6:20 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Sorry for the very slow response, but here it is, hopefully better late than never.
> 
> On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:
> 
>> Thanks Alan.
>> The requirement for me is that I want to load N samples, based on the input file size, and perform naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples in order to calculate the partition factor for the large groups.
>> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I will not be able to find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that tells how many samples were loaded.
>> Is there any other way to know the exact number of samples loaded?
> Not that I know of.
> 
>> 
>> By analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then use that temp file to load samples. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
> Partly, but also because it allows any operators that need to run before the sample (like project or filter) to be placed in the pipeline.
> 
> Alan.
> 


Re: Number of mappers in MRCompiler

Posted by Alan Gates <ga...@hortonworks.com>.
Sorry for the very slow response, but here it is, hopefully better late than never.

On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:

> Thanks Alan.
> The requirement for me is that I want to load N samples, based on the input file size, and perform naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples in order to calculate the partition factor for the large groups.
> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I will not be able to find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that tells how many samples were loaded.
> Is there any other way to know the exact number of samples loaded?
Not that I know of.

> 
> By analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then use that temp file to load samples. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
Partly, but also because it allows any operators that need to run before the sample (like project or filter) to be placed in the pipeline.

Alan.


Re: Number of mappers in MRCompiler

Posted by Prasanth J <bu...@gmail.com>.
Thanks Alan.
The requirement for me is that I want to load N samples, based on the input file size, and perform naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples in order to calculate the partition factor for the large groups.
Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I will not be able to find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that tells how many samples were loaded.
Is there any other way to know the exact number of samples loaded?
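
If there isn't, one thing I am considering is subclassing RandomSampleLoader so that each map task appends a single marker tuple carrying the count of samples it emitted, in the spirit of PoissonSampleLoader's marker. An untested sketch (the class name and marker value are made up):

import java.io.IOException;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.builtin.RandomSampleLoader;

public class CountingSampleLoader extends RandomSampleLoader {

    // Sentinel for the marker tuple's first field (made-up value).
    public static final String SAMPLE_COUNT_MARKER = "sample.count.marker";

    private long emitted = 0;
    private boolean markerSent = false;

    public CountingSampleLoader(String funcSpec, String numSamples) {
        super(funcSpec, numSamples);
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple t = super.getNext();
        if (t != null) {
            emitted++;
            return t;
        }
        if (!markerSent) {
            markerSent = true;
            // One extra tuple per map task: (marker, number of samples emitted).
            Tuple marker = TupleFactory.getInstance().newTuple(2);
            marker.set(0, SAMPLE_COUNT_MARKER);
            marker.set(1, emitted);
            return marker;
        }
        return null;
    }
}

The reducer could then sum the second field of the marker tuples to get the exact number of samples instead of guessing it from the number of mappers.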

By analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then use that temp file to load samples. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?

Thanks
-- Prasanth

On Jul 25, 2012, at 6:49 PM, Alan Gates <ga...@hortonworks.com> wrote:

> No.  The number of mappers is determined by the InputFormat used by your load function (TextInputFormat if you're using the default PigStorage loader) when the Hadoop job is submitted.  Pig doesn't have access to that information until it has handed the jobs off to MapReduce.
> 
> Alan.
> 
> On Jul 25, 2012, at 3:47 PM, Prasanth J wrote:
> 
>> Hello everyone
>> 
>> I would like to know if there is a way to determine the number of mappers while compiling the physical plan into the MR plan.
>> 
>> Thanks
>> -- Prasanth
>> 
> 


Re: Number of mappers in MRCompiler

Posted by Alan Gates <ga...@hortonworks.com>.
No.  The number of mappers is determined by the InputFormat used by your load function (TextInputFormat if you're using the default PigStorage loader) when the Hadoop job is submitted.  Pig doesn't have access to that information until it has handed the jobs off to MapReduce.
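
If you want to see that number ahead of time, you can compute the splits yourself the same way Hadoop will at submission. A rough sketch against the Hadoop 2.x mapreduce API, assuming the default TextInputFormat (class and method names are made up):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {

    // Hadoop launches one map task per input split, so the split count
    // computed at submission time is the eventual number of mappers.
    public static int countSplits(Configuration conf, Path input)
            throws IOException, InterruptedException {
        Job job = Job.getInstance(conf);
        FileInputFormat.addInputPath(job, input);
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        return splits.size();
    }
}

But the MRCompiler runs before any of that, so inside the compiler the count simply isn't known yet.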

Alan.

On Jul 25, 2012, at 3:47 PM, Prasanth J wrote:

> Hello everyone
> 
> I would like to know if there is a way to determine the number of mappers while compiling the physical plan into the MR plan.
> 
> Thanks
> -- Prasanth
>