Posted to user@hadoop.apache.org by Austin Chungath <au...@gmail.com> on 2013/07/16 11:09:53 UTC

spawn maps without any input data - hadoop streaming

Hi,

I am trying to generate random data using Hadoop streaming and Python. It's a
map-only job and I need to run a number of maps. There is no input to the
map, as it's just going to generate random data.

How do I specify the number of maps to run? (I am confused here because,
if I am not wrong, the number of maps spawned is related to the input data
size.)
Any ideas as to how this can be done?

Warm regards,
Austin

RE: spawn maps without any input data - hadoop streaming

Posted by Devaraj k <de...@huawei.com>.
Hi Austin,

The number of maps for a job depends on the splits returned by the InputFormat.getSplits() API. You can use an input format that decides the number of maps for a job (by returning that many splits) according to your needs.
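
As a rough illustration of that idea for a streaming job: if the chosen input format returns one split per desired map, each Python mapper can treat the content of its split as nothing more than a seed. The sketch below assumes a small seed file with one integer per line and an input format that hands each line to a separate map. Hadoop's NLineInputFormat is one class that splits by line count, though the reply does not name it. The function names, the record layout, and `records_per_map` are hypothetical.

```python
import random


def generate_records(seed, count):
    """Yield `count` tab-separated key/value lines from a seeded generator."""
    rng = random.Random(seed)
    for i in range(count):
        yield f"{seed}-{i}\t{rng.randint(0, 999999)}"


def run_mapper(lines, records_per_map=5):
    """Treat every non-empty input line as one seed and emit records for it.

    Assumption: the seed is the last tab-separated field, because streaming
    may pass a key (such as a byte offset) in front of the line itself.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seed = int(line.split("\t")[-1])
        records.extend(generate_records(seed, records_per_map))
    return records


if __name__ == "__main__":
    # Simulated input for a single map task; a real mapper would pass sys.stdin.
    for record in run_mapper(["0\t42"]):
        print(record)
```

Giving each map a different seed keeps the maps from producing identical output. In a real job the final loop would iterate over `sys.stdin` instead of the simulated list.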

If we use FileInputFormat, the number of splits depends on the input data for the job, which is why you see that the number of mappers is proportional to the job input size.
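
That proportionality can be approximated with a little arithmetic. This is a simplified model, not Hadoop's actual implementation: it assumes every file is splittable and cut into fixed-size pieces, and it ignores configured minimum and maximum split sizes and other edge cases. `estimate_file_splits` and its parameters are hypothetical names.

```python
def estimate_file_splits(file_sizes, split_size):
    """Approximate the number of splits (and so maps) for file sizes in bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    # Ceiling division per file: a split never spans two files in this model.
    return sum((size + split_size - 1) // split_size for size in file_sizes)


MB = 1024 * 1024

if __name__ == "__main__":
    # Two files, one larger than the split size and one smaller.
    print(estimate_file_splits([300 * MB, 10 * MB], 128 * MB))
    # No input data at all.
    print(estimate_file_splits([], 128 * MB))
```

With no input data the model yields zero splits, which is the situation described in the question.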

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/InputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext)

Thanks
Devaraj k