Posted to common-user@hadoop.apache.org by qiaoresearcher <qi...@gmail.com> on 2014/02/27 20:20:33 UTC

how to feed sampled data into each mapper

Assume there is one large data set of size 100G on HDFS. How can we
ensure that the data sent to each mapper is around 10% of the original
data (i.e., 10G), and that each 10% is randomly sampled from the 100G
data set? Is there any sample code that does this?


Regards,

Re: how to feed sampled data into each mapper

Posted by qiaoresearcher <qi...@gmail.com>.
Thanks. I think what you suggest is to just divide the large file into
several splits of about 10G each, but how do I make each 10G split
'randomly sampled' from the original large data set?



On Thu, Feb 27, 2014 at 7:40 PM, Hadoop User <ha...@gmail.com> wrote:

> Try changing the split size in the driver code. Have a look at the
> MapReduce split size properties.
>
> Sent from my iPhone
>
> On Feb 27, 2014, at 11:20 AM, qiaoresearcher <qi...@gmail.com>
> wrote:
>
> Assume there is one large data set of size 100G on HDFS. How can we
> ensure that the data sent to each mapper is around 10% of the original
> data (i.e., 10G), and that each 10% is randomly sampled from the 100G
> data set? Is there any sample code that does this?
>
>
> Regards,
>
>
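
One way to get the "randomly sampled" part is to do the sampling inside the
mapper itself: every record of a split is read, but each record is kept only
with probability 0.1. Below is a minimal sketch of such a Bernoulli-sampling
mapper; the class name, the 10% rate, and the map-only setup are illustrative
assumptions, not something prescribed in this thread.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that keeps each record with probability ~0.1, so every
// mapper processes a ~10% Bernoulli sample of the records in its split.
public class SamplingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final double SAMPLE_RATE = 0.10; // assumed 10% from the question
    private Random random;

    @Override
    protected void setup(Context context) {
        // Seed per task attempt so different mappers draw different samples.
        random = new Random(context.getTaskAttemptID().hashCode());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass the record through only if it falls inside the 10% sample.
        if (random.nextDouble() < SAMPLE_RATE) {
            context.write(key, value);
        }
    }
}

Note that each mapper then sees a random ~10% of its own split rather than of
the entire 100G file; a truly random 10G per mapper would need a separate
shuffling pass over the data first.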

Re: how to feed sampled data into each mapper

Posted by Hadoop User <ha...@gmail.com>.
Try changing the split size in the driver code. Have a look at the
MapReduce split size properties.

Sent from my iPhone

> On Feb 27, 2014, at 11:20 AM, qiaoresearcher <qi...@gmail.com> wrote:
> 
> Assume there is one large data set of size 100G on HDFS. How can we ensure that the data sent to each mapper is around 10% of the original data (i.e., 10G), and that each 10% is randomly sampled from the 100G data set? Is there any sample code that does this?
> 
> 
> Regards,
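
The "split size properties" mentioned above are the minimum and maximum input
split sizes, which the driver can pin so that each split (and hence each
mapper's share of the file) is about 10G. A minimal driver sketch, assuming
text input and the hypothetical SamplingMapper from earlier in the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "10G splits");
        job.setJarByClass(SplitSizeDriver.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Pin both mapreduce.input.fileinputformat.split.minsize and
        // .maxsize to 10G so each mapper is handed a ~10G split.
        long tenGig = 10L * 1024 * 1024 * 1024;
        FileInputFormat.setMinInputSplitSize(job, tenGig);
        FileInputFormat.setMaxInputSplitSize(job, tenGig);

        job.setMapperClass(SamplingMapper.class); // hypothetical mapper above
        job.setNumReduceTasks(0);                 // map-only sampling job
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Keep in mind that a 10G split is still a contiguous slice of the file, not a
random sample, which is exactly what the follow-up question points out; the
per-record sampling inside the mapper is what supplies the randomness.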
