Posted to common-user@hadoop.apache.org by qiaoresearcher <qi...@gmail.com> on 2014/02/27 20:20:33 UTC
how to feed sampled data into each mapper
Assume there is one large data set of size 100G on HDFS. How can we
control it so that the data sent to each mapper is around 10% of the
original data (i.e., 10G), and each 10% is randomly sampled from the 100G
data set? Do we have any example code doing this?
Regards,
Re: how to feed sampled data into each mapper
Posted by qiaoresearcher <qi...@gmail.com>.
Thanks. I think what you suggest is to just divide the large file into
several splits of about 10G each, but how do we make each 10G split
'randomly sampled' from the original large data set?
On Thu, Feb 27, 2014 at 7:40 PM, Hadoop User <ha...@gmail.com> wrote:
> Try changing the split size in the driver code (the MapReduce split size
> properties).
>
> Sent from my iPhone
>
> On Feb 27, 2014, at 11:20 AM, qiaoresearcher <qi...@gmail.com>
> wrote:
>
> Assume there is one large data set of size 100G on HDFS. How can we
> control it so that the data sent to each mapper is around 10% of the
> original data (i.e., 10G), and each 10% is randomly sampled from the 100G
> data set? Do we have any example code doing this?
>
>
> Regards,
>
>
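A split produced by FileInputFormat is a contiguous byte range of the file, so changing the split size alone cannot make a split a random sample. A common workaround (not spelled out in the thread) is to let every mapper read its split as usual and keep each record independently with probability 0.1; the surviving records across all mappers then form an approximately 10% uniform sample of the whole data set. A minimal sketch of that per-record Bernoulli filter in plain Java, with Hadoop classes omitted and the class and method names purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Per-record Bernoulli sampling: each record is kept independently with
// probability p, so a mapper emits roughly a fraction p of its input.
// Class and method names are illustrative, not a Hadoop API.
public class BernoulliSample {
    static List<String> sample(List<String> records, double p, long seed) {
        Random rng = new Random(seed);
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            // nextDouble() is uniform on [0.0, 1.0), so this keeps ~p of the records
            if (rng.nextDouble() < p) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

Inside a real Mapper this check would sit at the top of map(), with non-sampled records simply never written to the context.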
Re: how to feed sampled data into each mapper
Posted by Hadoop User <ha...@gmail.com>.
Try changing the split size in the driver code (the MapReduce split size
properties).
Sent from my iPhone
> On Feb 27, 2014, at 11:20 AM, qiaoresearcher <qi...@gmail.com> wrote:
>
> Assume there is one large data set of size 100G on HDFS. How can we control it so that the data sent to each mapper is around 10% of the original data (i.e., 10G), and each 10% is randomly sampled from the 100G data set? Do we have any example code doing this?
>
>
> Regards,
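For context on the suggestion above: in Hadoop 2.x the driver-side knobs are mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize, which cap and floor the size of each input split. The arithmetic for carving a file into a target number of splits can be sketched as follows (the helper name is illustrative, and note this only controls split size, not randomness):

```java
// Illustrative helper, not part of any Hadoop API.
public class SplitSizing {
    // Ceiling division: the split max-size needed so that a file of
    // totalBytes is divided into at most numSplits map tasks.
    static long splitSizeFor(long totalBytes, int numSplits) {
        return (totalBytes + numSplits - 1) / numSplits;
    }
}
```

In the driver one would then do something like conf.setLong("mapreduce.input.fileinputformat.split.maxsize", splitSizeFor(totalBytes, 10)) to get roughly ten ~10G map tasks out of a 100G file.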