Posted to mapreduce-user@hadoop.apache.org by Kasi Subrahmanyam <ka...@gmail.com> on 2012/07/04 14:02:51 UTC

issue with map running time

Hi,

I have a job with, let us say, 10 mappers running in parallel.
Some run fast, but a few of them are taking far too long.
For example, some mappers take 5 to 10 minutes while others take
around 12 hours or more.
Can the difference in the data handled by the mappers cause such a
variation, or is it an issue with connectivity?

Note: The cluster we are using has multiple users running their jobs on it.

Thanks in advance.
Subbu

Re: issue with map running time

Posted by Karthik Kambatla <ka...@cloudera.com>.
Manoj,

By running an MR job over many small files, you incur the latency cost of
opening and reading each individual file. These costs can be mitigated to
some extent by re-using the JVM across map tasks (see
mapred.job.reuse.jvm.num.tasks).
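
For instance, a minimal sketch of a driver fragment using the old mapred
API (the JobConf here is hypothetical; -1 means no limit on tasks per JVM,
the default being one task per JVM):

  import org.apache.hadoop.mapred.JobConf;

  // Allow each task JVM to run an unlimited number of map tasks from
  // this job instead of exiting after a single task.
  JobConf conf = new JobConf();
  conf.setNumTasksToExecutePerJvm(-1); // sets mapred.job.reuse.jvm.num.tasks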

When your data is already distributed across many small files and several
machines:

   1. It is probably best to use the data as is if you will run only one
   (or very few) MR jobs over it.
   2. Otherwise, it probably makes sense to write an MR job that copies the
   many small files into a few big files (see the sketch after this list).
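
A minimal sketch of such a consolidation job (not code from this thread;
the class name and paths are made up, and it assumes plain text input).
Each reducer writes one large output file, so the reducer count controls
how many files you end up with; note that line order is not preserved
across the shuffle:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ConsolidateFiles {

    // Drop the byte-offset key and pass each line through unchanged.
    public static class LineMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable key, Text line, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(line, NullWritable.get());
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "consolidate-small-files");
      job.setJarByClass(ConsolidateFiles.class);
      job.setMapperClass(LineMapper.class);
      // The default (identity) reducer writes each line once per occurrence.
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);
      job.setNumReduceTasks(2); // two big output files instead of hundreds
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }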

Your code for the data copy looks about right, at least at first glance.
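
As an aside (assuming a Configuration object named conf is in scope, and
using the local/hdfs/inputDir/hdfsFile names from your code), Hadoop ships
a helper that performs essentially the same copy loop, so the whole merge
can also be done in one call:

  import org.apache.hadoop.fs.FileUtil;

  // Concatenate every file under inputDir on the local FS into the single
  // HDFS file hdfsFile; "false" keeps the source files, and the final null
  // means no separator string is inserted between files.
  FileUtil.copyMerge(local, inputDir, hdfs, hdfsFile, false, conf, null);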

Thanks
Karthik

Re: issue with map running time

Posted by Manoj Babu <ma...@gmail.com>.
Thanks, Karthik. But how can we overcome that? Do we need to use a
different file format?
Also, I am using the code below to merge all the files into a single file.
Is this a proper way to do it?


// Merge every file in the local input directory into one HDFS file.
FileStatus[] inputFiles = local.listStatus(inputDir);
FSDataOutputStream out = hdfs.create(hdfsFile);
byte[] buffer = new byte[4096]; // reuse one buffer rather than reallocating per file
for (int i = 0; i < inputFiles.length; i++) {
  System.out.println(inputFiles[i].getPath().getName());
  FSDataInputStream in = local.open(inputFiles[i].getPath());
  int bytesRead;
  while ((bytesRead = in.read(buffer)) > 0) {
    out.write(buffer, 0, bytesRead);
  }
  in.close();
}
out.close();

Cheers!
Manoj.



On Tue, Jul 10, 2012 at 12:32 AM, Karthik Kambatla <ka...@cloudera.com>wrote:

> Hi Manoj,
>
> It seems like a different issue.
>
> Let me understand you case better. Is your input 656 files of 11 MB each?
> In that case, MapReduce does create 656 map tasks. In general, an input
> split is the data read from a single file, but limited to the block size
> (64 MB in your case). As the files are smaller than 64 MB, each file forms
> a different split.
>
> Hope that helps.
> Karthik
>
>
> On Mon, Jul 9, 2012 at 10:57 AM, Manoj Babu <ma...@gmail.com> wrote:
>
>> Hi Bobby,
>>
>> I have faced a similar issue, In the job the block size is 64MB and the
>> no of the maps created is 656 and the no of files uploaded to HDFS is 656
>> and its each file size is 11MB. I assume that if small files exist it will
>> not able to group.
>>
>> Could kindly clarify it?
>>
>> Cheers!
>> Manoj.
>>
>>
>>
>> On Fri, Jul 6, 2012 at 10:30 PM, Robert Evans <ev...@yahoo-inc.com>wrote:
>>
>>> How long a program takes to run depends on a lot of things.  It could be
>>> a connectivity issue, or it could be that your program does a lot more
>>> processing for some input records then for others, or it could be that some
>>> of your records are a lot smaller so that more of them exist in a single
>>> input split.  Without knowing what the code is doing it is hard to say
>>> more then that.
>>>
>>> --Bobby Evans
>>>
>>> From: Kasi Subrahmanyam <ka...@gmail.com>
>>> Reply-To: "mapreduce-user@hadoop.apache.org" <
>>> mapreduce-user@hadoop.apache.org>
>>> To: "mapreduce-user@hadoop.apache.org" <mapreduce-user@hadoop.apache.org
>>> >
>>> Subject: issue with map running time
>>>
>>> Hi ,
>>>
>>> I have a job which has let us say 10 mappers running in parallel.
>>> Some are running fast but few of them are taking too long to run.
>>> For example few mappers are taking 5 to 10 mins but others are taking
>>> around 12 hours or more.
>>> Does the difference in the data handled by the mappers can cause such a
>>> variation or is it the issue with connectivity.
>>>
>>> Note:The cluster we are using have multiple users running their jobs on
>>> it.
>>>
>>> Thanks in advance.
>>> Subbu
>>>
>>
>>
>

Re: issue with map running time

Posted by Karthik Kambatla <ka...@cloudera.com>.
Hi Manoj,

It seems like a different issue.

Let me understand your case better. Is your input 656 files of 11 MB each?
In that case, MapReduce does create 656 map tasks. In general, an input
split is the data read from a single file, but limited to the block size
(64 MB in your case). As the files are smaller than 64 MB, each file forms
a different split.
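
If you want fewer, larger splits without rewriting the data, one option
(hedged: it assumes a Hadoop release that ships CombineTextInputFormat,
and the 128 MB cap is an arbitrary example) is to let the input format
pack many small files into each split:

  import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

  // Pack several small files into each input split, capped at 128 MB,
  // so the job launches far fewer than 656 map tasks.
  job.setInputFormatClass(CombineTextInputFormat.class);
  CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);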

Hope that helps.
Karthik

Re: issue with map running time

Posted by Manoj Babu <ma...@gmail.com>.
Hi Bobby,

I have faced a similar issue. In my job the block size is 64 MB, the
number of maps created is 656, and the number of files uploaded to HDFS is
656, each file being 11 MB. I assume that when the files are this small,
Hadoop is not able to group them into fewer splits.

Could you kindly clarify?

Cheers!
Manoj.

Re: issue with map running time

Posted by Phani <ph...@ovi.com>.
Other users might have consumed all the map slots, which may have caused long wait times for some of the mappers in your job. In such cases I would watch the queues closely and consider redistributing jobs to grid queues that have sufficient free map slots.
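
For example, a minimal sketch of routing a job to a specific scheduler
queue (the queue name "analytics" is hypothetical and assumes the
cluster's scheduler defines such a queue):

  import org.apache.hadoop.mapred.JobConf;

  // Submit the job to a named queue that has free map slots.
  JobConf conf = new JobConf();
  conf.setQueueName("analytics"); // equivalent to mapred.job.queue.name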

Thanks,
Phani

Re: issue with map running time

Posted by Robert Evans <ev...@yahoo-inc.com>.
How long a program takes to run depends on a lot of things. It could be a connectivity issue, or it could be that your program does a lot more processing for some input records than for others, or it could be that some of your records are a lot smaller, so that more of them fit in a single input split. Without knowing what the code is doing, it is hard to say more than that.
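
To put rough numbers on that last point (illustrative figures only, not
taken from this job):

  records per 64 MB split at ~1 KB/record: 64 * 1024 = 65,536
  records per 64 MB split at ~1 MB/record: 64

So two splits of identical byte size can differ by three orders of
magnitude in record count, and hence in per-record processing time.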

--Bobby Evans
