Posted to common-user@hadoop.apache.org by Rahul Bhattacharjee <ra...@gmail.com> on 2013/04/16 10:49:27 UTC

Hadoop sampler related query!

Hi,

I have a question about Hadoop's input sampler, which is used to examine
the data set beforehand through random selection, sampling, etc. It is
mainly used for total sort, and in Pig's skewed join implementation as
well.

The question here is -

Mapper<K,V,OK,OV>

K and V are the input key and value of the mapper, essentially coming in
from the input format. OK and OV are the output key and value emitted by
the mapper.

Looking at the input sampler's code, it looks like it creates the
partitions based on the input key of the mapper.

I think the partitions should be created considering the output key (OK)
and the output key sort comparator should be used for sorting the samples.

If partitioning is done based on the input key and the mapper emits a
different key, then the total sort wouldn't hold good.
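
To make the concern concrete, here is a minimal plain-Java sketch (no Hadoop
classes). It assumes range partitioning over cut points, and a hypothetical
mapper that emits inputKey % 10 as its output key; the class and method names
are made up for illustration, and the whole data set stands in for the sample.
Under those assumptions the partitions stay ordered, but cut points taken from
input keys send every record to one partition.

```java
import java.util.Arrays;

final class InputKeySplitSkew {
    // Pick numPartitions - 1 evenly spaced cut points from a sorted sample.
    static int[] cutPoints(int[] sortedSample, int numPartitions) {
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = sortedSample[i * sortedSample.length / numPartitions];
        }
        return cuts;
    }

    // Range partitioning: count the cut points that are <= key.
    static int partition(int key, int[] cuts) {
        int p = 0;
        while (p < cuts.length && key >= cuts[p]) {
            p++;
        }
        return p;
    }

    // Hypothetical mapper: the emitted key lives in a different range
    // than the input key (an assumption made for this illustration).
    static int mapOutputKey(int inputKey) {
        return inputKey % 10;
    }

    // Count how many output keys land in each partition.
    static int[] partitionSizes(int[] outputKeys, int[] cuts) {
        int[] sizes = new int[cuts.length + 1];
        for (int k : outputKeys) {
            sizes[partition(k, cuts)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] inputKeys = new int[n];
        int[] outputKeys = new int[n];
        for (int i = 0; i < n; i++) {
            inputKeys[i] = i;               // already sorted
            outputKeys[i] = mapOutputKey(i);
        }
        int[] sortedOutput = outputKeys.clone();
        Arrays.sort(sortedOutput);

        // Cut points derived from input keys: [250, 500, 750]
        int[] fromInput = cutPoints(inputKeys, 4);
        // Cut points derived from output keys: [2, 5, 7]
        int[] fromOutput = cutPoints(sortedOutput, 4);

        // prints [1000, 0, 0, 0]
        System.out.println(Arrays.toString(partitionSizes(outputKeys, fromInput)));
        // prints [200, 300, 200, 300]
        System.out.println(Arrays.toString(partitionSizes(outputKeys, fromOutput)));
    }
}
```

If the output key were of a different type altogether, the cut points could
not even be compared with it, which is a separate failure from the skew shown
here.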

Is there a condition that the input sampler is only to be used with
Mapper<K,V,K,V1>?


Thanks,
Rahul

Re: Hadoop sampler related query!

Posted by Mahesh Balija <ba...@gmail.com>.
Agreed with your explanation. One downside of your approach could be that,
if we collect samples from the intermediate keys on demand, partitioning
cannot start until all the mappers complete. The major downside is that we
would need to either keep all the intermediate records in memory (which may
NOT be possible) or write them locally until all mappers complete and the
samples are collected, and only then partition the intermediate keys by
iterating through all the records one more time. That could also
drastically affect the performance of the whole program.

Best,
Mahesh Balija,
Calsoft Labs.


On Wed, Apr 24, 2013 at 12:37 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks for the response Mahesh. I thought of this , but do not know why is
> this limitation.
>
> While sampling to pick up certain records and run our logic over the key
> part.Thats why there is a limitation that mappers key must be the same as
> mappers output key.
>
> However , I think , if we can run the mapper of the collected sample and
> do our analysis over the key of the mappers then it would be great and
> there wouldn't be any limitation like what we have today.
>
> Was wondering why its not like this.
>
> Thanks,
> Rahul
>
>
> On Wed, Apr 24, 2013 at 11:23 AM, Mahesh Balija <
> balijamahesh.mca@gmail.com> wrote:
>
>> Hi Rahul,
>>
>>              The limitation to use InputSampler is, the K and OK (I mean
>> Map INKEY and OUTKEY) both should be of same type.
>>              Technically because, while collecting the samples (ie.,
>> arraylist of keys) in writePartitionFile method it uses the INKEY as the
>> key. And for writing the partition file it uses Mapper OutputKEY as the
>> KEY.
>>
>>              Logically also this is the expected behavior of sampling
>> because, while collecting the samples the only source is the input splits
>> (INKEY) from which it collects the samples and for generating partition
>> file you need to generate based on the Mapper outkey type.
>>
>> Best,
>> Mahesh Balija,
>> CalsoftLabs.
>>
>>
>> On Tue, Apr 23, 2013 at 4:12 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> + mapred dev
>>>
>>>
>>> On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a question related to Hadoop's input sampler ,which is used for
>>>> investigating the data set before hand using random selection , sampling
>>>> etc .Mainly used for total sort , used in pig's skewed join implementation
>>>> as well.
>>>>
>>>> The question here is -
>>>>
>>>> Mapper<K,V,OK,OV>
>>>>
>>>> K and V are input key and value of the mapper .Essentially coming in
>>>> from the input format. OK and OV are output key and value emitted from the
>>>> mapper.
>>>>
>>>> Looking at the input sample's code ,it looks like it is creating the
>>>> partition based on the input key of the mapper.
>>>>
>>>> I think the partitions should be created considering the output key
>>>> (OK) and the output key sort comparator should be used for sorting the
>>>> samples.
>>>>
>>>> If partitioning is done based on input key and the mapper emits a
>>>> different key then the total sort wouldn't hold any good.
>>>>
>>>> Is there is any condition that input sample is to be only used for
>>>> mapper<K,V,K,V1>?
>>>>
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>
>>
>

Re: Hadoop sampler related query!

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks for the response, Mahesh. I thought of this, but I do not know why
this limitation exists.

While sampling, we pick up certain records and run our logic over the key
part. That's why there is a limitation that the mapper's input key must be
the same as the mapper's output key.

However, I think that if we could run the mapper over the collected sample
and do our analysis over the mapper's output keys, it would be great, and
there wouldn't be any limitation like the one we have today.
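
A rough plain-Java sketch of that idea (not InputSampler's actual code, and
no Hadoop classes): the hypothetical splitPoints helper below applies a
map function only to the sampled records, sorts the emitted keys with the
output key comparator, and picks evenly spaced split points. The record
format and all names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

final class MappedSampleSplits {
    // Run the map function over the sampled records only, then derive
    // numPartitions - 1 split points from the emitted output keys.
    static <R, OK> List<OK> splitPoints(List<R> sampledRecords,
                                        Function<R, List<OK>> mapFunction,
                                        Comparator<OK> sortComparator,
                                        int numPartitions) {
        List<OK> emitted = new ArrayList<>();
        for (R record : sampledRecords) {
            // A mapper may emit zero, one, or many keys per record.
            emitted.addAll(mapFunction.apply(record));
        }
        if (emitted.isEmpty()) {
            throw new IllegalArgumentException("sample produced no output keys");
        }
        emitted.sort(sortComparator);

        List<OK> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            splits.add(emitted.get(i * emitted.size() / numPartitions));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical sampled records of the form "<id> <word>".
        List<String> sample = Arrays.asList(
                "3 cherry", "1 apple", "4 date", "2 banana", "6 fig", "5 elder");

        // Hypothetical mapper: emits the word as the output key.
        Function<String, List<String>> mapper =
                record -> Arrays.asList(record.split(" ")[1]);

        List<String> splits =
                splitPoints(sample, mapper, Comparator.<String>naturalOrder(), 3);
        System.out.println(splits); // prints [cherry, elder]
    }
}
```

Only the sampled records pass through the map function here, so nothing
beyond the sample has to be buffered before the split points are known. The
cost is that the map logic runs twice for those records.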

I was wondering why it's not like this.

Thanks,
Rahul


On Wed, Apr 24, 2013 at 11:23 AM, Mahesh Balija
<ba...@gmail.com> wrote:

> Hi Rahul,
>
>              The limitation to use InputSampler is, the K and OK (I mean
> Map INKEY and OUTKEY) both should be of same type.
>              Technically because, while collecting the samples (ie.,
> arraylist of keys) in writePartitionFile method it uses the INKEY as the
> key. And for writing the partition file it uses Mapper OutputKEY as the
> KEY.
>
>              Logically also this is the expected behavior of sampling
> because, while collecting the samples the only source is the input splits
> (INKEY) from which it collects the samples and for generating partition
> file you need to generate based on the Mapper outkey type.
>
> Best,
> Mahesh Balija,
> CalsoftLabs.
>
>
> On Tue, Apr 23, 2013 at 4:12 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> + mapred dev
>>
>>
>> On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question related to Hadoop's input sampler ,which is used for
>>> investigating the data set before hand using random selection , sampling
>>> etc .Mainly used for total sort , used in pig's skewed join implementation
>>> as well.
>>>
>>> The question here is -
>>>
>>> Mapper<K,V,OK,OV>
>>>
>>> K and V are input key and value of the mapper .Essentially coming in
>>> from the input format. OK and OV are output key and value emitted from the
>>> mapper.
>>>
>>> Looking at the input sample's code ,it looks like it is creating the
>>> partition based on the input key of the mapper.
>>>
>>> I think the partitions should be created considering the output key (OK)
>>> and the output key sort comparator should be used for sorting the samples.
>>>
>>> If partitioning is done based on input key and the mapper emits a
>>> different key then the total sort wouldn't hold any good.
>>>
>>> Is there is any condition that input sample is to be only used for
>>> mapper<K,V,K,V1>?
>>>
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>
>

Re: Hadoop sampler related query!

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Rahul,

The limitation of InputSampler is that K and OK (I mean the Map INKEY and
OUTKEY) should both be of the same type.
Technically, this is because, while collecting the samples (i.e., an
ArrayList of keys), the writePartitionFile method uses the INKEY as the
key, and for writing the partition file it uses the Mapper output key as
the KEY.

Logically, too, this is the expected behavior of sampling, because the only
source for collecting the samples is the input splits (INKEY), while the
partition file has to be generated based on the Mapper OUTKEY type.
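
The type constraint can be sketched in plain Java as follows. This is only
an illustration, not Hadoop's code: TypedWriter is a hypothetical stand-in
for a file writer that is opened with a declared key class and rejects keys
of any other class, and writeSplits is a made-up helper that writes sampled
input keys through it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartitionFileTypeCheck {
    // Hypothetical stand-in for a writer opened with a declared key class.
    static final class TypedWriter {
        private final Class<?> keyClass;
        final List<Object> written = new ArrayList<>();

        TypedWriter(Class<?> keyClass) {
            this.keyClass = keyClass;
        }

        void append(Object key) {
            if (key.getClass() != keyClass) {
                throw new IllegalArgumentException("wrong key class: "
                        + key.getClass().getName() + " is not " + keyClass.getName());
            }
            written.add(key);
        }
    }

    // Sort the sampled INPUT keys and write numPartitions - 1 of them with
    // a writer declared for the map OUTPUT key class.
    // Returns false when the two key types do not match.
    static boolean writeSplits(Object[] sampledInputKeys,
                               Class<?> mapOutputKeyClass,
                               int numPartitions) {
        Object[] sorted = sampledInputKeys.clone();
        Arrays.sort(sorted); // natural ordering; keys must be Comparable
        TypedWriter writer = new TypedWriter(mapOutputKeyClass);
        try {
            for (int i = 1; i < numPartitions; i++) {
                writer.append(sorted[i * sorted.length / numPartitions]);
            }
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Long[] sampledInputKeys = {40L, 10L, 30L, 20L};
        // INKEY and OUTKEY are both Long: prints true
        System.out.println(writeSplits(sampledInputKeys, Long.class, 2));
        // INKEY is Long but OUTKEY is String: prints false
        System.out.println(writeSplits(sampledInputKeys, String.class, 2));
    }
}
```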

Best,
Mahesh Balija,
CalsoftLabs.


On Tue, Apr 23, 2013 at 4:12 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> + mapred dev
>
>
> On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question related to Hadoop's input sampler ,which is used for
>> investigating the data set before hand using random selection , sampling
>> etc .Mainly used for total sort , used in pig's skewed join implementation
>> as well.
>>
>> The question here is -
>>
>> Mapper<K,V,OK,OV>
>>
>> K and V are input key and value of the mapper .Essentially coming in from
>> the input format. OK and OV are output key and value emitted from the
>> mapper.
>>
>> Looking at the input sample's code ,it looks like it is creating the
>> partition based on the input key of the mapper.
>>
>> I think the partitions should be created considering the output key (OK)
>> and the output key sort comparator should be used for sorting the samples.
>>
>> If partitioning is done based on input key and the mapper emits a
>> different key then the total sort wouldn't hold any good.
>>
>> Is there is any condition that input sample is to be only used for
>> mapper<K,V,K,V1>?
>>
>>
>> Thanks,
>> Rahul
>>
>>
>

Re: Hadoop sampler related query!

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
+ mapred dev


On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Hi,
>
> I have a question related to Hadoop's input sampler ,which is used for
> investigating the data set before hand using random selection , sampling
> etc .Mainly used for total sort , used in pig's skewed join implementation
> as well.
>
> The question here is -
>
> Mapper<K,V,OK,OV>
>
> K and V are input key and value of the mapper .Essentially coming in from
> the input format. OK and OV are output key and value emitted from the
> mapper.
>
> Looking at the input sample's code ,it looks like it is creating the
> partition based on the input key of the mapper.
>
> I think the partitions should be created considering the output key (OK)
> and the output key sort comparator should be used for sorting the samples.
>
> If partitioning is done based on input key and the mapper emits a
> different key then the total sort wouldn't hold any good.
>
> Is there is any condition that input sample is to be only used for
> mapper<K,V,K,V1>?
>
>
> Thanks,
> Rahul
>
>

Re: Hadoop sampler related query!

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Mighty users@hadoop,

Anyone on this?


On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Hi,
>
> I have a question related to Hadoop's input sampler ,which is used for
> investigating the data set before hand using random selection , sampling
> etc .Mainly used for total sort , used in pig's skewed join implementation
> as well.
>
> The question here is -
>
> Mapper<K,V,OK,OV>
>
> K and V are input key and value of the mapper .Essentially coming in from
> the input format. OK and OV are output key and value emitted from the
> mapper.
>
> Looking at the input sample's code ,it looks like it is creating the
> partition based on the input key of the mapper.
>
> I think the partitions should be created considering the output key (OK)
> and the output key sort comparator should be used for sorting the samples.
>
> If partitioning is done based on input key and the mapper emits a
> different key then the total sort wouldn't hold any good.
>
>  Is there is any condition that input sample is to be only used for
> mapper<K,V,K,V1>?
>
>
> Thanks,
> Rahul
>
>
