Posted to hdfs-user@hadoop.apache.org by David Parks <da...@yahoo.com> on 2012/12/06 07:15:03 UTC

Map tasks processing some files multiple times

I've got a job that reads in 167 files from S3, but 2 of the files are being
mapped twice and 1 of the files is mapped 3 times.

 

This is the code I use to set up the mapper:

 

       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

       for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
              log.info("Identified linkshare catalog: " + f.getPath().toString());

       if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
              MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
       }

 

I can see from the logs that it sees only 1 copy of each of these files, and
correctly identifies 167 files.

 

I also have the following confirmation that it found the 167 files
correctly:

 

2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

 

When I look through the syslogs I can see that the file in question was
opened by two different map attempts:

 

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

 

This is only happening to these 3 files, all others seem to be fine. For the
life of me I can't see a reason why these files might be processed multiple
times.

 

Notably, map attempt 173 is a higher attempt number than should be possible:
there are 167 input files (from S3, gzipped), so there should be 167 map
tasks, but I see a total of 176 map tasks.

 

Any thoughts/ideas/guesses?

 


Re: Query about Speculative Execution

Posted by Harsh J <ha...@cloudera.com>.
Given that Speculative Execution *is* the answer to such scenarios,
I'd say the answer to your question without it, is *nothing*.

If a task does not report status for over 10 minutes (default), it is
killed and retried. If it does report status changes (such as
counters, task status, etc.) but is slow due to environmental or other
reasons, then the JobTracker, without speculative execution logic
turned on, will assume it is normal.
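
For reference, a rough sketch of how these knobs can be set from a job driver. The property names are the classic MRv1 ones (mapred.map.tasks.speculative.execution, mapred.reduce.tasks.speculative.execution, and mapred.task.timeout for the 10-minute limit); the class name is just for illustration, so treat this as a sketch rather than a drop-in for every Hadoop version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Turn speculative execution off for map and reduce tasks (MRv1 property names).
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            // Milliseconds a task may go without reporting status before it is
            // killed and retried; 600000 ms is the 10-minute default mentioned above.
            conf.setLong("mapred.task.timeout", 600000L);
            Job job = new Job(conf, "speculation-config-example");
            // ... set mapper, reducer, input and output paths as usual, then submit.
        }
    }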

On Thu, Dec 6, 2012 at 8:27 PM, Ajay Srivastava
<Aj...@guavus.com> wrote:
> Hi,
>
> What is the behavior of the JobTracker if speculative execution is off and a task on a data node is running extremely slowly?
> Will the JobTracker simply wait till the slow-running task finishes, or will it try to heal the situation? Assume that heartbeats from the node running the slow task are regular.
>
>
>
> Regards,
> Ajay Srivastava



-- 
Harsh J

Re: Query about Speculative Execution

Posted by Srinivas Chamarthi <sr...@gmail.com>.
Hi,

May I know where in the sources I should look for where speculative
scheduling happens? And also, how do we discard the output coming from an
already completed mapper?

I am actually trying to do something similar: spawn map tasks
redundantly, not for speculative reasons, but for each mapper and reducer,
so that I can do an integrity check between the nodes where the tasks are
running.

any help is greatly appreciated.

thx
srinivas

On Thu, Dec 6, 2012 at 8:40 PM, Ajay Srivastava <Aj...@guavus.com>
wrote:

>  Thanks Mahesh & Harsh.
>
>
>
>  On 07-Dec-2012, at 7:42 AM, Mahesh Balija wrote:
>
> To simplify: if you turn off speculative execution, the system will
> never bother about slow-running tasks unless they fail to report status
> within the specified time (10 minutes).
> If you have set speculative execution to true, the system may spawn
> another instance of the mapper and consider the output of whichever
> instance completes first.
>
> Best,
> Mahesh Balija,
> Calsoft Labs.
>
> On Thu, Dec 6, 2012 at 8:27 PM, Ajay Srivastava <
> Ajay.Srivastava@guavus.com> wrote:
>
>> Hi,
>>
>> What is the behavior of the JobTracker if speculative execution is off and a
>> task on a data node is running extremely slowly?
>> Will the JobTracker simply wait till the slow-running task finishes, or will
>> it try to heal the situation? Assume that heartbeats from the node
>> running the slow task are regular.
>>
>>
>>
>> Regards,
>> Ajay Srivastava
>
>
>
>

Re: Query about Speculative Execution

Posted by Ajay Srivastava <Aj...@guavus.com>.
Thanks Mahesh & Harsh.



On 07-Dec-2012, at 7:42 AM, Mahesh Balija wrote:

To simplify: if you turn off speculative execution, the system will never bother about slow-running tasks unless they fail to report status within the specified time (10 minutes).
If you have set speculative execution to true, the system may spawn another instance of the mapper and consider the output of whichever instance completes first.

Best,
Mahesh Balija,
Calsoft Labs.

On Thu, Dec 6, 2012 at 8:27 PM, Ajay Srivastava <Aj...@guavus.com> wrote:
Hi,

What is the behavior of the JobTracker if speculative execution is off and a task on a data node is running extremely slowly?
Will the JobTracker simply wait till the slow-running task finishes, or will it try to heal the situation? Assume that heartbeats from the node running the slow task are regular.



Regards,
Ajay Srivastava



Re: Query about Speculative Execution

Posted by Mahesh Balija <ba...@gmail.com>.
To simplify: if you turn off speculative execution, the system will
never bother about slow-running tasks unless they fail to report status
within the specified time (10 minutes).
If you have set speculative execution to true, the system may spawn
another instance of the mapper and consider the output of whichever
instance completes first.

Best,
Mahesh Balija,
Calsoft Labs.

On Thu, Dec 6, 2012 at 8:27 PM, Ajay Srivastava
<Aj...@guavus.com> wrote:

> Hi,
>
> What is the behavior of the JobTracker if speculative execution is off and a
> task on a data node is running extremely slowly?
> Will the JobTracker simply wait till the slow-running task finishes, or will
> it try to heal the situation? Assume that heartbeats from the node
> running the slow task are regular.
>
>
>
> Regards,
> Ajay Srivastava

Query about Speculative Execution

Posted by Ajay Srivastava <Aj...@guavus.com>.
Hi,

What is the behavior of the JobTracker if speculative execution is off and a task on a data node is running extremely slowly?
Will the JobTracker simply wait till the slow-running task finishes, or will it try to heal the situation? Assume that heartbeats from the node running the slow task are regular.



Regards,
Ajay Srivastava 

RE: Map tasks processing some files multiple times

Posted by David Parks <da...@yahoo.com>.
I'm using multiple inputs because I actually have another type of input with
a different mapper, a single, unrelated file, that I omitted from this
discussion for simplicity. The basic formula is: read in a single database
of existing items, read in a bunch of catalogs of items, then merge and toss
like a salad (in a few map/reduce steps that follow).
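
Roughly, that setup looks like the sketch below. The paths are placeholders and the two nested mappers are made-up stand-ins for the real database and catalog mappers, so take it as an illustration of the MultipleInputs pattern rather than my actual driver:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeJobDriver {

        // Placeholder mapper for the existing-items database input.
        public static class ExistingItemsMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text("item"), value);   // emit under whatever join key scheme the job uses
            }
        }

        // Placeholder mapper for the catalog files input.
        public static class CatalogMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text("catalog"), value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "merge-catalogs-with-existing-items");
            job.setJarByClass(MergeJobDriver.class);
            // One path/format/mapper per kind of input; both feed the same reduce phase.
            MultipleInputs.addInputPath(job, new Path("s3n://bucket/existing-items"),
                    TextInputFormat.class, ExistingItemsMapper.class);
            MultipleInputs.addInputPath(job, new Path("s3n://bucket/catalogs"),
                    TextInputFormat.class, CatalogMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("s3n://bucket/merged"));
            job.waitForCompletion(true);
        }
    }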

 

 

From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com] 
Sent: Thursday, December 06, 2012 9:44 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

 

Glad it helps. Could you also explain the reason for using MultipleInputs ?

 

On Thu, Dec 6, 2012 at 2:59 PM, David Parks <da...@yahoo.com> wrote:

Figured it out; the problem is, as usual, in my code. I had wrapped TextInputFormat
to replace the LongWritable key with a key representing the file name. It
was a bit tricky to do because of changing the generics from <LongWritable,
Text> to <Text, Text>, and I goofed up and misdirected a call to
isSplittable, which was causing the issue.
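
For anyone hitting the same thing, here is a minimal sketch of the idea (an illustration, not the actual class from my code): a <Text, Text> input format that keys each line by its file name and, like TextInputFormat, refuses to split compressed files such as .gz by checking the codec:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Same rule TextInputFormat applies: compressed files (gzip etc.) get one split.
            return new CompressionCodecFactory(context.getConfiguration()).getCodec(file) == null;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
            // Delegate line reading to LineRecordReader, but key each record by file name.
            return new RecordReader<Text, Text>() {
                private final LineRecordReader lines = new LineRecordReader();
                private Text fileName;

                @Override
                public void initialize(InputSplit s, TaskAttemptContext ctx)
                        throws IOException, InterruptedException {
                    lines.initialize(s, ctx);
                    fileName = new Text(((FileSplit) s).getPath().getName());
                }

                @Override
                public boolean nextKeyValue() throws IOException { return lines.nextKeyValue(); }

                @Override
                public Text getCurrentKey() { return fileName; }

                @Override
                public Text getCurrentValue() { return lines.getCurrentValue(); }

                @Override
                public float getProgress() throws IOException { return lines.getProgress(); }

                @Override
                public void close() throws IOException { lines.close(); }
            };
        }
    }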

 

It now works fine. Thanks very much for the response, it gave me pause to
think enough to work out what I had done.

 

Dave

 

 

From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com] 
Sent: Thursday, December 06, 2012 3:25 PM


To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

 

David,

 

You are using FileNameTextInputFormat. This is not in Hadoop source, as far
as I can see. Can you please confirm where this is being used from ? It
seems like the isSplittable method of this input format may need checking.

 

Another thing, given you are adding the same input format for all files, do
you need MultipleInputs ?

 

Thanks

Hemanth

 

On Thu, Dec 6, 2012 at 1:06 PM, David Parks <da...@yahoo.com> wrote:

I believe I just tracked down the problem, maybe you can help confirm if
you're familiar with this.

 

I see that FileInputFormat is specifying that gzip files (.gz extension)
from s3n filesystem are being reported as splittable, and I see that it's
creating multiple input splits for these files. I'm mapping the files
directly off S3:

 

       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

       MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

 

I see in the map phase, based on my counters, that it's actually processing
the entire file (I set up a counter per file input). So the 2 files which
were processed twice had 2 splits (I now see that in some debug logs I
created), and the 1 file that was processed 3 times had 3 splits (the rest
were smaller and were only assigned one split by default anyway).
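
As an aside, the "counter per file input" trick is roughly the sketch below (the class name is made up, and it assumes the file name arrives as the map key, the way the wrapper input format described above provides it):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingCatalogMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text fileNameKey, Text line, Context context)
                throws IOException, InterruptedException {
            // One counter per input file; the job's counter page then shows
            // exactly how many records were read from each catalog file.
            context.getCounter("RecordsPerInputFile", fileNameKey.toString()).increment(1);
            // ... real per-record parsing and emitting goes here ...
        }
    }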

 

Am I wrong in expecting all files on the s3n filesystem to come through as
not-splittable? This seems to be a bug in hadoop code if I'm right.

 

David

 

 

From: Raj Vishwanathan [mailto:rajvish@yahoo.com] 
Sent: Thursday, December 06, 2012 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

 

Could it be due to spec-ex? Does it make a difference in the end?

 

Raj

 


  _____  


From: David Parks <da...@yahoo.com>
To: user@hadoop.apache.org 
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing some files multiple times

 

I've got a job that reads in 167 files from S3, but 2 of the files are being
mapped twice and 1 of the files is mapped 3 times.

 

This is the code I use to set up the mapper:

 

       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

       for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
              log.info("Identified linkshare catalog: " + f.getPath().toString());

       if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
              MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
       }

 

I can see from the logs that it sees only 1 copy of each of these files, and
correctly identifies 167 files.

 

I also have the following confirmation that it found the 167 files
correctly:

 

2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

 

When I look through the syslogs I can see that the file in question was
opened by two different map attempts:

 

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

 

This is only happening to these 3 files, all others seem to be fine. For the
life of me I can't see a reason why these files might be processed multiple
times.

 

Notably, map attempt 173 is a higher attempt number than should be possible:
there are 167 input files (from S3, gzipped), so there should be 167 map
tasks, but I see a total of 176 map tasks.

 

Any thoughts/ideas/guesses?

 

 

 

 


Re: Map tasks processing some files multiple times

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Glad it helps. Could you also explain the reason for using MultipleInputs ?


On Thu, Dec 6, 2012 at 2:59 PM, David Parks <da...@yahoo.com> wrote:

> Figured it out; the problem is, as usual, in my code. I had wrapped
> TextInputFormat to replace the LongWritable key with a key representing the
> file name. It was a bit tricky to do because of changing the generics from
> <LongWritable, Text> to <Text, Text>, and I goofed up and misdirected a
> call to isSplittable, which was causing the issue.
>
> It now works fine. Thanks very much for the response, it gave me pause to
> think enough to work out what I had done.
>
> Dave
>
> From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com]
> Sent: Thursday, December 06, 2012 3:25 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> David,
>
> You are using FileNameTextInputFormat. This is not in Hadoop source, as
> far as I can see. Can you please confirm where this is being used from ? It
> seems like the isSplittable method of this input format may need checking.
>
> Another thing, given you are adding the same input format for all files,
> do you need MultipleInputs ?
>
> Thanks
> Hemanth
>
> On Thu, Dec 6, 2012 at 1:06 PM, David Parks <da...@yahoo.com> wrote:
>
> I believe I just tracked down the problem, maybe you can help confirm if
> you're familiar with this.
>
> I see that FileInputFormat is specifying that gzip files (.gz extension)
> from s3n filesystem are being reported as splittable, and I see that
> it's creating multiple input splits for these files. I'm mapping the files
> directly off S3:
>
>        Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
> I see in the map phase, based on my counters, that it's actually
> processing the entire file (I set up a counter per file input). So the 2
> files which were processed twice had 2 splits (I now see that in some debug
> logs I created), and the 1 file that was processed 3 times had 3 splits
> (the rest were smaller and were only assigned one split by default anyway).
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not-splittable? This seems to be a bug in hadoop code if I'm right.
>
> David
>
> From: Raj Vishwanathan [mailto:rajvish@yahoo.com]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex? Does it make a difference in the end?
>
> Raj
>
> ------------------------------
>
> From: David Parks <da...@yahoo.com>
> To: user@hadoop.apache.org
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I've got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>        Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
>               log.info("Identified linkshare catalog: " + f.getPath().toString());
>
>        if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
>               MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>        }
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.
>
> I also have the following confirmation that it found the 167 files
> correctly:
>
> 2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167
>
> When I look through the syslogs I can see that the file in question was
> opened by two different map attempts:
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
>
> This is only happening to these 3 files, all others seem to be fine. For
> the life of me I can't see a reason why these files might be processed
> multiple times.
>
> Notably, map attempt 173 is a higher attempt number than should be possible:
> there are 167 input files (from S3, gzipped), so there should be 167 map
> tasks, but I see a total of 176 map tasks.
>
> Any thoughts/ideas/guesses?
>
Re: Map tasks processing some files multiple times

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Glad it helps. Could you also explain the reason for using MultipleInputs ?


On Thu, Dec 6, 2012 at 2:59 PM, David Parks <da...@yahoo.com> wrote:

> Figured it out, it is, as usual, with my code. I had wrapped
> TextInputFormat to replace the LongWritable key with a key representing the
> file name. It was a bit tricky to do because of changing the generics from
> <LongWritable, Text> to <Text, Text> and I goofed up and mis-directed a
> call to isSplittable, which was causing the issue.****
>
> ** **
>
> It now works fine. Thanks very much for the response, it gave me pause to
> think enough to work out what I had done.****
>
> ** **
>
> Dave****
>
> ** **
>
> ** **
>
> *From:* Hemanth Yamijala [mailto:yhemanth@thoughtworks.com]
> *Sent:* Thursday, December 06, 2012 3:25 PM
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Map tasks processing some files multiple times****
>
> ** **
>
> David,****
>
> ** **
>
> You are using FileNameTextInputFormat. This is not in Hadoop source, as
> far as I can see. Can you please confirm where this is being used from ? It
> seems like the isSplittable method of this input format may need checking.
> ****
>
> ** **
>
> Another thing, given you are adding the same input format for all files,
> do you need MultipleInputs ?****
>
> ** **
>
> Thanks****
>
> Hemanth****
>
> ** **
>
> On Thu, Dec 6, 2012 at 1:06 PM, David Parks <da...@yahoo.com>
> wrote:****
>
> I believe I just tracked down the problem, maybe you can help confirm if
> you’re familiar with this.****
>
>  ****
>
> I see that FileInputFormat is specifying that gzip files (.gz extension)
> from s3n filesystem are being reported as *splittable*, and I see that
> it’s creating multiple input splits for these files. I’m mapping the files
> directly off S3:****
>
>  ****
>
>        Path lsDir = *new* Path(
> "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");****
>
>        MultipleInputs.*addInputPath*(job, lsDir, FileNameTextInputFormat.*
> class*, LinkShareCatalogImportMapper.*class*);****
>
>  ****
>
> I see in the map phase, based on my counters, that it’s actually
> processing the entire file (I set up a counter per file input). So the 2
> files which were processed twice had 2 splits (I now see that in some debug
> logs I created), and the 1 file that was processed 3 times had 3 splits
> (the rest were smaller and were only assigned one split by default anyway).
> ****
>
>  ****
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not-splittable? This seems to be a bug in hadoop code if I’m right.****
>
>  ****
>
> David****
>
>  ****
>
>  ****
>
> *From:* Raj Vishwanathan [mailto:rajvish@yahoo.com]
> *Sent:* Thursday, December 06, 2012 1:45 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Map tasks processing some files multiple times****
>
>  ****
>
> Could it be due to spec-ex? Does it make a diffrerence in the end?****
>
>  ****
>
> Raj****
>
>  ****
> ------------------------------
>
> *From:* David Parks <da...@yahoo.com>
> *To:* user@hadoop.apache.org
> *Sent:* Wednesday, December 5, 2012 10:15 PM
> *Subject:* Map tasks processing some files multiple times****
>
>  ****
>
> I’ve got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.****
>
>  ****
>
> This is the code I use to set up the mapper:****
>
>  ****
>
>        Path lsDir = *new* Path(
> "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");****
>
>        *for*(FileStatus f :
> lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified
> linkshare catalog: " + f.getPath().toString());****
>
>        *if*( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0
> ){****
>
>               MultipleInputs.*addInputPath*(job, lsDir,
> FileNameTextInputFormat.*class*, LinkShareCatalogImportMapper.*class*);***
> *
>
>        }****
>
>  ****
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.****
>
>  ****
>
> I also have the following confirmation that it found the 167 files
> correctly:****
>
>  ****
>
> 2012-12-06 04:56:41,213 INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
> paths to process : 167****
>
>  ****
>
> When I look through the syslogs I can see that the file in question was
> opened by two different map attempts:****
>
>  ****
>
> ./task-attempts/job_201212060351_0001/*
> attempt_201212060351_0001_m_000005_0*/syslog:2012-12-06 03:56:05,265 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading****
>
> ./task-attempts/job_201212060351_0001/*
> attempt_201212060351_0001_m_000173_0*/syslog:2012-12-06 03:53:18,765 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading****
>
>  ****
>
> This is only happening to these 3 files, all others seem to be fine. For
> the life of me I can’t see a reason why these files might be processed
> multiple times.****
>
>  ****
>
> Notably, map attempt 173 is more map attempts than should be possible.
> There are 167 input files (from S3, gzipped), thus there should be 167 map
> attempts. But I see a total of 176 map tasks.****
>
>  ****
>
> Any thoughts/ideas/guesses?****
>
>  ****
>
>  ****
>
> ** **
>

Re: Map tasks processing some files multiple times

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Glad it helps. Could you also explain the reason for using MultipleInputs ?


On Thu, Dec 6, 2012 at 2:59 PM, David Parks <da...@yahoo.com> wrote:

> Figured it out, it is, as usual, with my code. I had wrapped
> TextInputFormat to replace the LongWritable key with a key representing the
> file name. It was a bit tricky to do because of changing the generics from
> <LongWritable, Text> to <Text, Text> and I goofed up and mis-directed a
> call to isSplittable, which was causing the issue.
>
> It now works fine. Thanks very much for the response, it gave me pause to
> think enough to work out what I had done.
>
> Dave
>
> From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com]
> Sent: Thursday, December 06, 2012 3:25 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> David,
>
> You are using FileNameTextInputFormat. This is not in Hadoop source, as
> far as I can see. Can you please confirm where this is being used from ? It
> seems like the isSplittable method of this input format may need checking.
>
> Another thing, given you are adding the same input format for all files,
> do you need MultipleInputs ?
>
> Thanks
> Hemanth
>
> On Thu, Dec 6, 2012 at 1:06 PM, David Parks <da...@yahoo.com>
> wrote:
>
> I believe I just tracked down the problem, maybe you can help confirm if
> you’re familiar with this.
>
> I see that FileInputFormat is specifying that gzip files (.gz extension)
> from s3n filesystem are being reported as splittable, and I see that
> it’s creating multiple input splits for these files. I’m mapping the files
> directly off S3:
>
>        Path lsDir = new Path(
> "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        MultipleInputs.addInputPath(job, lsDir,
> FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
> I see in the map phase, based on my counters, that it’s actually
> processing the entire file (I set up a counter per file input). So the 2
> files which were processed twice had 2 splits (I now see that in some debug
> logs I created), and the 1 file that was processed 3 times had 3 splits
> (the rest were smaller and were only assigned one split by default anyway).
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not-splittable? This seems to be a bug in hadoop code if I’m right.
>
> David
>
> From: Raj Vishwanathan [mailto:rajvish@yahoo.com]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex? Does it make a difference in the end?
>
> Raj
>
> ------------------------------
>
> From: David Parks <da...@yahoo.com>
> To: user@hadoop.apache.org
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I’ve got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>        Path lsDir = new Path(
> "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        for(FileStatus f :
> lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified
> linkshare catalog: " + f.getPath().toString());
>
>        if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ){
>
>               MultipleInputs.addInputPath(job, lsDir,
> FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
>        }
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.
>
> I also have the following confirmation that it found the 167 files
> correctly:
>
> 2012-12-06 04:56:41,213 INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
> paths to process : 167
>
> When I look through the syslogs I can see that the file in question was
> opened by two different map attempts:
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> This is only happening to these 3 files, all others seem to be fine. For
> the life of me I can’t see a reason why these files might be processed
> multiple times.
>
> Notably, map attempt 173 is more map attempts than should be possible.
> There are 167 input files (from S3, gzipped), thus there should be 167 map
> attempts. But I see a total of 176 map tasks.
>
> Any thoughts/ideas/guesses?
>

RE: Map tasks processing some files multiple times

Posted by David Parks <da...@yahoo.com>.
Figured it out, it is, as usual, with my code. I had wrapped TextInputFormat
to replace the LongWritable key with a key representing the file name. It
was a bit tricky to do because of changing the generics from <LongWritable,
Text> to <Text, Text> and I goofed up and mis-directed a call to
isSplittable, which was causing the issue.
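
For anyone hitting the same thing, a rough sketch of what such a wrapper can look like (the real FileNameTextInputFormat is the poster's own class, so the shape below is an assumption, not the actual code). The crucial detail is that isSplitable (Hadoop spells it with one "t") must be overridden on the new <Text, Text> format; if the call doesn't land there, FileInputFormat's default of "always splittable" applies, which is exactly what makes gzipped inputs show up as multiple splits that each read the whole file.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch of a <Text, Text> input format that keys every line by its file name.
public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // One split per file; safe for gzipped input regardless of file size.
        return false;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
                                                       TaskAttemptContext context) {
        return new RecordReader<Text, Text>() {
            private final LineRecordReader lines = new LineRecordReader();
            private Text fileName;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx)
                    throws IOException, InterruptedException {
                lines.initialize(s, ctx);
                fileName = new Text(((FileSplit) s).getPath().toString());
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                return lines.nextKeyValue();
            }

            @Override
            public Text getCurrentKey() {
                return fileName;                  // file name replaces the byte offset
            }

            @Override
            public Text getCurrentValue() {
                return lines.getCurrentValue();   // the line itself, unchanged
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return lines.getProgress();
            }

            @Override
            public void close() throws IOException {
                lines.close();
            }
        };
    }
}

Returning false outright is the conservative choice; delegating to the stock codec check (the way TextInputFormat itself decides splittability) also works if some inputs are uncompressed.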

 

It now works fine. Thanks very much for the response, it gave me pause to
think enough to work out what I had done.

 

Dave

 

 

From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com] 
Sent: Thursday, December 06, 2012 3:25 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

 

David,

 

You are using FileNameTextInputFormat. This is not in Hadoop source, as far
as I can see. Can you please confirm where this is being used from ? It
seems like the isSplittable method of this input format may need checking.

 

Another thing, given you are adding the same input format for all files, do
you need MultipleInputs ?

 

Thanks

Hemanth

 

On Thu, Dec 6, 2012 at 1:06 PM, David Parks <da...@yahoo.com> wrote:

I believe I just tracked down the problem, maybe you can help confirm if
you're familiar with this.

 

I see that FileInputFormat is specifying that gzip files (.gz extension)
from s3n filesystem are being reported as splittable, and I see that it's
creating multiple input splits for these files. I'm mapping the files
directly off S3:

 

       Path lsDir = new
Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

       MultipleInputs.addInputPath(job, lsDir,
FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

 

I see in the map phase, based on my counters, that it's actually processing
the entire file (I set up a counter per file input). So the 2 files which
were processed twice had 2 splits (I now see that in some debug logs I
created), and the 1 file that was processed 3 times had 3 splits (the rest
were smaller and were only assigned one split by default anyway).

 

Am I wrong in expecting all files on the s3n filesystem to come through as
not-splittable? This seems to be a bug in hadoop code if I'm right.

 

David

 

 

From: Raj Vishwanathan [mailto:rajvish@yahoo.com] 
Sent: Thursday, December 06, 2012 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

 

Could it be due to spec-ex? Does it make a difference in the end?

 

Raj

 


  _____  


From: David Parks <da...@yahoo.com>
To: user@hadoop.apache.org 
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing some files multiple times

 

I've got a job that reads in 167 files from S3, but 2 of the files are being
mapped twice and 1 of the files is mapped 3 times.

 

This is the code I use to set up the mapper:

 

       Path lsDir = new
Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

       for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
log.info("Identified linkshare catalog: " + f.getPath().toString());

       if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ){

              MultipleInputs.addInputPath(job, lsDir,
FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

       }

 

I can see from the logs that it sees only 1 copy of each of these files, and
correctly identifies 167 files.

 

I also have the following confirmation that it found the 167 files
correctly:

 

2012-12-06 04:56:41,213 INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
paths to process : 167

 

When I look through the syslogs I can see that the file in question was
opened by two different map attempts:

 

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

 

This is only happening to these 3 files, all others seem to be fine. For the
life of me I can't see a reason why these files might be processed multiple
times.

 

Notably, map attempt 173 is more map attempts than should be possible. There
are 167 input files (from S3, gzipped), thus there should be 167 map
attempts. But I see a total of 176 map tasks.

 

Any thoughts/ideas/guesses?

 

 

 


Re: Map tasks processing some files multiple times

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
David,

You are using FileNameTextInputFormat. This is not in Hadoop source, as far
as I can see. Can you please confirm where this is being used from ? It
seems like the isSplittable method of this input format may need checking.

Another thing, given you are adding the same input format for all files, do
you need MultipleInputs ?

Thanks
Hemanth


On Thu, Dec 6, 2012 at 1:06 PM, David Parks <da...@yahoo.com> wrote:

> I believe I just tracked down the problem, maybe you can help confirm if
> you’re familiar with this.
>
> I see that FileInputFormat is specifying that gzip files (.gz extension)
> from s3n filesystem are being reported as splittable, and I see that
> it’s creating multiple input splits for these files. I’m mapping the files
> directly off S3:
>
>        Path lsDir = new Path(
> "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        MultipleInputs.addInputPath(job, lsDir,
> FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
> I see in the map phase, based on my counters, that it’s actually
> processing the entire file (I set up a counter per file input). So the 2
> files which were processed twice had 2 splits (I now see that in some debug
> logs I created), and the 1 file that was processed 3 times had 3 splits
> (the rest were smaller and were only assigned one split by default anyway).
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not-splittable? This seems to be a bug in hadoop code if I’m right.
>
> David
>
> From: Raj Vishwanathan [mailto:rajvish@yahoo.com]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex? Does it make a difference in the end?
>
> Raj
>
> ------------------------------
>
> From: David Parks <da...@yahoo.com>
> To: user@hadoop.apache.org
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I’ve got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>        Path lsDir = new Path(
> "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        for(FileStatus f :
> lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified
> linkshare catalog: " + f.getPath().toString());
>
>        if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ){
>
>               MultipleInputs.addInputPath(job, lsDir,
> FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
>        }
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.
>
> I also have the following confirmation that it found the 167 files
> correctly:
>
> 2012-12-06 04:56:41,213 INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
> paths to process : 167
>
> When I look through the syslogs I can see that the file in question was
> opened by two different map attempts:
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> This is only happening to these 3 files, all others seem to be fine. For
> the life of me I can’t see a reason why these files might be processed
> multiple times.
>
> Notably, map attempt 173 is more map attempts than should be possible.
> There are 167 input files (from S3, gzipped), thus there should be 167 map
> attempts. But I see a total of 176 map tasks.
>
> Any thoughts/ideas/guesses?
>

RE: Map tasks processing some files multiple times

Posted by David Parks <da...@yahoo.com>.
I believe I just tracked down the problem, maybe you can help confirm if you’re familiar with this.

 

I see that FileInputFormat is specifying that gzip files (.gz extension) from s3n filesystem are being reported as splittable, and I see that it’s creating multiple input splits for these files. I’m mapping the files directly off S3:

 

       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

       MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

 

I see in the map phase, based on my counters, that it’s actually processing the entire file (I set up a counter per file input). So the 2 files which were processed twice had 2 splits (I now see that in some debug logs I created), and the 1 file that was processed 3 times had 3 splits (the rest were smaller and were only assigned one split by default anyway).

 

Am I wrong in expecting all files on the s3n filesystem to come through as not-splittable? This seems to be a bug in hadoop code if I’m right.
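
For context, splittability is decided by the input format rather than the filesystem: FileInputFormat's default isSplitable returns true for everything, and it is the stock TextInputFormat that overrides it to consult the compression codec. A simplified sketch of that check is below (the exact code varies between Hadoop versions, and CompressionAwareInputFormat is just an illustrative name); a custom format that neither carries this check over nor returns false inherits the "always splittable" default, even for .gz files on s3n.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative only: the codec lookup that decides whether a file may be split.
// Gzip resolves to a codec, so such files come back as a single split; files
// with no codec (plain text) remain splittable.
public abstract class CompressionAwareInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        return codec == null;
    }
}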

 

David

 

 

From: Raj Vishwanathan [mailto:rajvish@yahoo.com] 
Sent: Thursday, December 06, 2012 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

 

Could it be due to spec-ex? Does it make a difference in the end?

 

Raj

 


  _____  


From: David Parks <da...@yahoo.com>
To: user@hadoop.apache.org 
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing some files multiple times

 

I’ve got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

 

This is the code I use to set up the mapper:

 

       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

       for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified linkshare catalog: " + f.getPath().toString());

       if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ){

              MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

       }

 

I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.

 

I also have the following confirmation that it found the 167 files correctly:

 

2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

 

When I look through the syslogs I can see that the file in question was opened by two different map attempts:

 

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

 

This is only happening to these 3 files, all others seem to be fine. For the life of me I can’t see a reason why these files might be processed multiple times.

 

Notably, map attempt 173 is more map attempts than should be possible. There are 167 input files (from S3, gzipped), thus there should be 167 map attempts. But I see a total of 176 map tasks.

 

Any thoughts/ideas/guesses?

 

 


Re: Map tasks processing some files multiple times

Posted by Raj Vishwanathan <ra...@yahoo.com>.
Could it be due to spec-ex (speculative execution)? Does it make a difference in the end?

Raj
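If you want to rule it out quickly, speculative execution can be switched off per job. A rough sketch using the pre-YARN property names (Hadoop 2.x uses mapreduce.map.speculative and mapreduce.reduce.speculative instead):

       import org.apache.hadoop.mapreduce.Job;

       public final class SpecExUtil {
           // Sketch: turn speculative execution off for one job so duplicate
           // task attempts can be ruled out as the cause of the extra maps.
           static void disableSpeculativeExecution(Job job) {
               job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
               job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false);
           }
       }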



>________________________________
> From: David Parks <da...@yahoo.com>
>To: user@hadoop.apache.org 
>Sent: Wednesday, December 5, 2012 10:15 PM
>Subject: Map tasks processing some files multiple times
> 
>
>I’ve got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.
> 
>This is the code I use to set up the mapper:
> 
>       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>       for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified linkshare catalog: " + f.getPath().toString());
>       if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ){
>              MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>       }
> 
>I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.
> 
>I also have the following confirmation that it found the 167 files correctly:
> 
>2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167
> 
>When I look through the syslogs I can see that the file in question was opened by two different map attempts:
> 
>./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
>./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
> 
>This is only happening to these 3 files, all others seem to be fine. For the life of me I can’t see a reason why these files might be processed multiple times.
> 
>Notably, map attempt 173 is more map attempts than should be possible. There are 167 input files (from S3, gzipped), thus there should be 167 map attempts. But I see a total of 176 map tasks.
> 
>Any thoughts/ideas/guesses?
> 
>
>
