Posted to common-user@hadoop.apache.org by "Liu, Raymond" <ra...@intel.com> on 2012/08/10 05:22:42 UTC

How can I get the intermediate output file from mapper class?

Hi

	I am trying to access the intermediate files that MapReduce mappers save to the local filesystem.

	I found this one through Google: http://stackoverflow.com/questions/7867608/hadoop-mapreduce-intermediate-output

	I am using Hadoop 1.0.3, and I set the following property in mapred-site.xml:

<property>
  <name>keep.task.files.pattern</name>
  <value>.*_m_00000*</value>
</property>

After restarting Hadoop and running some jobs, I did see job directories under my local dir, such as:

/mnt/DP_disk1/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/

But I still cannot find any output dir there.

I have four disks mounted for the local dir, configured as follows, and I can find only the jars and work dirs on them:

<property>
<name>mapred.local.dir</name>
<value>/mnt/DP_disk1/raymond/hdfs/mapred,/mnt/DP_disk2/raymond/hdfs/mapred,/mnt/DP_disk3/raymond/hdfs/mapred,/mnt/DP_disk4/raymond/hdfs/mapred</value>
</property>

Then I searched through them:

raymond@sr173:~$ ls /mnt/DP_disk1/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
jars  job.xml
raymond@sr173:~$ ls /mnt/DP_disk2/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
raymond@sr173:~$ ls /mnt/DP_disk3/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
jobToken  work
raymond@sr173:~$ ls /mnt/DP_disk4/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/

I also searched the ttprivate dir, with no luck there:

raymond@sr173:~$ ls /mnt/DP_disk4/raymond/hdfs/mapred/ttprivate/taskTracker/raymond/jobcache/job_201208101040_0003/attempt_201208101040_0003_m_000021_0/taskjvm.sh
/mnt/DP_disk4/raymond/hdfs/mapred/ttprivate/taskTracker/raymond/jobcache/job_201208101040_0003/attempt_201208101040_0003_m_000021_0/taskjvm.sh

So, is there anything I am still missing?


Best Regards,
Raymond Liu


RE: How can I get the intermediate output file from mapper class?

Posted by "Liu, Raymond" <ra...@intel.com>.
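Alright, I finally managed to get the intermediate file.

The pattern should be ".*_m_0000.*" instead of ".*_m_0000*"... stupid me. In ".*_m_0000*" the trailing "0*" only matches further zeros, so if the whole name has to match, the pattern cannot cover the rest of a name such as attempt_201208101040_0003_m_000021_0.

If you want to keep everything, use ".*" as the pattern. ;)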
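
For anyone hitting the same problem, here is a minimal sketch that compares the patterns from this thread against sample names. It assumes the framework requires the whole name to match the regular expression, which is what Java's Pattern.matches does; I have not confirmed that this is the exact call Hadoop 1.0.3 makes. The class name KeepPatternCheck and the method keeps are made up for illustration. The task name and the attempt name are the ones quoted in this thread.

```java
import java.util.regex.Pattern;

// Illustrative helper, not part of Hadoop.
class KeepPatternCheck {

    // Assumption: the pattern must match the entire name (full-match semantics).
    static boolean keeps(String pattern, String name) {
        return Pattern.matches(pattern, name);
    }

    public static void main(String[] args) {
        String[] names = {
            // Task name as quoted in this thread (no attempt suffix).
            "task_201208101126_0004_m_000000",
            // Attempt name as seen under the ttprivate dir in this thread.
            "attempt_201208101040_0003_m_000021_0",
        };
        String[] patterns = { ".*_m_0000*", ".*_m_0000.*", ".*" };

        for (String pattern : patterns) {
            for (String name : names) {
                System.out.println(pattern + "  vs  " + name + "  ->  " + keeps(pattern, name));
            }
        }
    }
}
```

Under that full-match assumption, ".*_m_0000*" accepts the bare task name but rejects the attempt name, because nothing in the pattern can consume the "21_0" tail. That would fit what happened here if the name being tested is the attempt name rather than the task name.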


Best Regards,
Raymond Liu



RE: How can I get the intermediate output file from mapper class?

Posted by "Liu, Raymond" <ra...@intel.com>.
Hi Harsh

	Thanks for your reply, though I don't quite catch what you mean. According to the description:

<property>
  <name>keep.task.files.pattern</name>
  <value>.*_m_0000*</value>
  <description>Keep all files from tasks whose task names match the given
               regular expression. Defaults to none.</description>
</property>


	Isn't that pattern matched against the task name? The task name is something like task_201208101126_0004_m_000000. So shouldn't this pattern keep all the data from those tasks from being cleaned up?

	If this doesn't work, could you kindly show me the exact pattern I should use for the map->intermediate->reduce files (the merged partition files waiting to be shuffled to the reduce tasks)? I tried ".out*", and it doesn't work either.

Or should I modify some other property instead?


Best Regards,
Raymond Liu


Re: How can I get the intermediate output file from mapper class?

Posted by Harsh J <ha...@cloudera.com>.
Hi,

You need the "file.out" and "file.out.index" files if you want the
map->intermediate->reduce files. So try a pattern that matches these
and you should have them.

The "XXXXX" kind of files are what MR produces on HDFS as regular
outputs - these aren't intermediate.
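
Once the task files are being kept, something like the following could be used to locate those two files across several local dirs. This is only a sketch: it walks each directory given on the command line (for example the entries of mapred.local.dir) and prints every file named "file.out" or "file.out.index". The class name IntermediateFileFinder and the method find are hypothetical, and the sketch makes no assumption about where under the local dirs the files end up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Illustrative helper, not part of Hadoop.
class IntermediateFileFinder {

    // Returns all regular files named file.out or file.out.index under the given roots.
    static List<Path> find(List<Path> localDirs) throws IOException {
        List<Path> hits = new ArrayList<>();
        for (Path root : localDirs) {
            if (!Files.isDirectory(root)) {
                continue; // skip entries that do not exist on this node
            }
            // Note: the walk fails if a subdirectory is unreadable by the current user.
            try (Stream<Path> walk = Files.walk(root)) {
                walk.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString();
                        return name.equals("file.out") || name.equals("file.out.index");
                    })
                    .forEach(hits::add);
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) throws IOException {
        List<Path> roots = new ArrayList<>();
        for (String arg : args) {
            roots.add(Paths.get(arg));
        }
        for (Path hit : find(roots)) {
            System.out.println(hit);
        }
    }
}
```

It would be run with one argument per local dir, as the user who owns the task directories.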




-- 
Harsh J