Posted to common-user@hadoop.apache.org by Edward Capriolo <ed...@gmail.com> on 2010/01/06 17:47:22 UTC

NYC Event: Hadoop a Whirlwind Tour

Sorry for the short notice. The event is tonight, January 6, 2010, at 6:45.

NYC BUG (the NYC BSD Users Group) has asked me to give a presentation on Hadoop.

Presentation Information:
http://www.nycbug.org/index.php?NAV=Home;SUBM=10260
Slides:
http://www.nycbug.org/files/meeting_2010-01.pdf
Description:
This presentation gives a brief, high-level overview of Hadoop. Next,
we hit the ground running with a quick, practical example of how Hadoop
solves a "big data" problem. We also discuss how the demonstrated
Hadoop processing model scales out to terabytes of data and hundreds
or even thousands of computers.

I am excited about this because it is a chance to bring some BSD users
into the Hadoop fold. I also built a preliminary FreeBSD port of Hadoop
(http://www.jointhegrid.com/jtg_ports/) in case someone wants to dive
into Hadoop after the presentation.

Again, sorry for the short notice.

Edward

Other sources for hadoop api help

Posted by Raymond Jennings III <ra...@yahoo.com>.
I am trying to develop some Hadoop programs, and I see that most of the examples included in the distribution use deprecated classes and methods. Are there any sources for learning the API other than the Javadocs, which are not the best starting point for beginners trying to write Hadoop programs? Thanks.



      

Re: Is it possible to share a key across maps?

Posted by Raymond Jennings III <ra...@yahoo.com>.
It looks like what you are referring to is the deprecated class, which has made for some confusing conversations in the past. It seems that many users still use the older API, and most of the examples still use it. I would like to stay with the more recent API, in which the call appears to be setup() rather than configure(). I am not sure whether it is a one-to-one mapping, though.

--- On Fri, 1/8/10, Jeff Zhang <zj...@gmail.com> wrote:

> From: Jeff Zhang <zj...@gmail.com>
> Subject: Re: Is it possible to share a key across maps?
> To: common-user@hadoop.apache.org
> Date: Friday, January 8, 2010, 11:15 PM
> Actually you can treat the mapper task as a template design pattern.
> Here's the pseudocode:
> 
> Mapper.configure(JobConf)
> for each record in InputSplit:
>       do Mapper.map(key,value,outputkey,outputvalue)
> Mapper.close()
> 
> Any subclass of Mapper can override the three methods configure(),
> map(), and close() to customize its behavior.
> 
> 
> 


      

Re: Is it possible to share a key across maps?

Posted by Jeff Zhang <zj...@gmail.com>.
Actually you can treat the mapper task as a template design pattern. Here's
the pseudocode:

Mapper.configure(JobConf)
for each record in InputSplit:
      do Mapper.map(key,value,outputkey,outputvalue)
Mapper.close()

Any subclass of Mapper can override the three methods configure(),
map(), and close() to customize its behavior.
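
The same call sequence can be sketched in plain Java with no Hadoop dependency. This is only an illustration of the template pattern described above: TemplateMapper, PrefixMapper, and runTask are hypothetical names, not Hadoop classes, and the hard-coded "user42" stands in for whatever a real configure() would look up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TemplateMapperSketch {

    /** Hypothetical stand-in for a mapper with three overridable hooks. */
    static abstract class TemplateMapper {
        void configure() {}                                   // runs once, before any record
        abstract void map(String record, List<String> out);   // runs once per record
        void close() {}                                       // runs once, after the last record
    }

    /** The fixed skeleton: configure, then map for each record, then close. */
    static List<String> runTask(TemplateMapper mapper, List<String> split) {
        List<String> out = new ArrayList<>();
        mapper.configure();
        for (String record : split) {
            mapper.map(record, out);
        }
        mapper.close();
        return out;
    }

    /** Example subclass: obtains a key once, then tags every record with it. */
    static class PrefixMapper extends TemplateMapper {
        private String prefix;

        @Override
        void configure() {
            prefix = "user42"; // assumption: a real task would read this from its input
        }

        @Override
        void map(String record, List<String> out) {
            out.add(prefix + "\t" + record);
        }
    }

    public static void main(String[] args) {
        System.out.println(runTask(new PrefixMapper(), Arrays.asList("a", "b")));
    }
}
```

Only the hooks vary between subclasses; the driver loop stays fixed, which is why work placed in configure() is guaranteed to finish before the first map() call.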



2010/1/8 Gang Luo <lg...@yahoo.com.cn>

> I don't do that in the map method, but in the configure(JobConf) method, which
> runs before any map call in that map task.
> JobConf.get("map.input.file") tells you which file this map task is
> processing. Use this path to read the first line of the corresponding file.
> All of this is done in the configure method, that is, before any map method
> is called.
>
>
> -Gang
>
>
>



-- 
Best Regards

Jeff Zhang

Re: Is it possible to share a key across maps?

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
(Sorry for the spam if any, mails are bouncing back for me)

Hi,
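In setup(), use this:
// FileSplit here is org.apache.hadoop.mapreduce.lib.input.FileSplit (new API)
FileSplit split = (FileSplit) context.getInputSplit();
Path path = split.getPath();  // the file this map task is reading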
Hope this helps.

Amogh


On 1/13/10 1:25 AM, "Raymond Jennings III" <ra...@yahoo.com> wrote:

Hi Gang,
I was able to use this on an older version that uses the JobClient class to run the job, but not on the newer API with the Job class. The newer API appears to use a setup() method instead of configure(), but the "map.input.file" attribute does not appear to be available via the configuration object in setup(). Have you tried to do what you described using the newer API? Thank you.



Re: Is it possible to share a key across maps?

Posted by Raymond Jennings III <ra...@yahoo.com>.
Hi Gang, 
I was able to use this on an older version that uses the JobClient class to run the job, but not on the newer API with the Job class. The newer API appears to use a setup() method instead of configure(), but the "map.input.file" attribute does not appear to be available via the configuration object in setup(). Have you tried to do what you described using the newer API? Thank you.

--- On Fri, 1/8/10, Gang Luo <lg...@yahoo.com.cn> wrote:

> From: Gang Luo <lg...@yahoo.com.cn>
> Subject: Re: Is it possible to share a key across maps?
> To: common-user@hadoop.apache.org
> Date: Friday, January 8, 2010, 10:03 PM
> I don't do that in the map method, but in the
> configure(JobConf) method, which runs before any map
> call in that map task. JobConf.get("map.input.file")
> tells you which file this map task is processing. Use
> this path to read the first line of the corresponding file.
> All of this is done in the configure method, that is,
> before any map method is called.
> 
> 
> -Gang
> 
> 
> 

      

Re: Is it possible to share a key across maps?

Posted by Gang Luo <lg...@yahoo.com.cn>.
I don't do that in the map method, but in the configure(JobConf) method, which runs before any map call in that map task. JobConf.get("map.input.file") tells you which file this map task is processing. Use this path to read the first line of the corresponding file. All of this is done in the configure method, that is, before any map method is called.


-Gang
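
A minimal plain-Java sketch of this idea follows, assuming the input is a small local file and simulating each map task as a range of line numbers. UserIdMapper and runTask are hypothetical stand-ins for a Hadoop mapper and the framework's task loop; a real job would open the file through Hadoop's FileSystem API rather than java.nio.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SharedKeySketch {

    /** Hypothetical stand-in for one map task working on one chunk of a file. */
    static class UserIdMapper {
        private String userId;

        /** Plays the role of configure(): read the first line of the task's input file. */
        void configure(Path inputFile) throws IOException {
            try (BufferedReader reader =
                     Files.newBufferedReader(inputFile, StandardCharsets.UTF_8)) {
                userId = reader.readLine();
            }
        }

        /** Plays the role of map(): emit the user ID alongside every data line. */
        void map(long lineNumber, String line, List<String> out) {
            if (lineNumber == 0) {
                return; // the header line itself is not a data record
            }
            out.add(userId + "\t" + line);
        }
    }

    /** Runs one simulated task over lines [from, to) of the file. */
    static List<String> runTask(Path file, int from, int to) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        UserIdMapper mapper = new UserIdMapper();
        mapper.configure(file); // every task reads the header on its own
        List<String> out = new ArrayList<>();
        for (int i = from; i < to && i < lines.size(); i++) {
            mapper.map(i, lines.get(i), out);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("userfile", ".txt");
        Files.write(file, Arrays.asList("user42", "click", "view", "buy"),
                    StandardCharsets.UTF_8);
        System.out.println(runTask(file, 0, 2)); // task holding the first chunk
        System.out.println(runTask(file, 2, 4)); // task whose chunk lacks the header
        Files.delete(file);
    }
}
```

Each simulated task opens the file itself, so no task has to wait for another. The task holding the first chunk must skip the header line so that it is not emitted as a record.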



----- Original Message -----
From: Raymond Jennings III <ra...@yahoo.com>
To: common-user@hadoop.apache.org
Sent: 2010/1/8 (Fri) 7:54:30 PM
Subject: Re: Is it possible to share a key across maps?

Do you do this in the map method (open the file and read the first line)? Could you explain a little more about how you do it with configure()? Thank you.


Re: Is it possible to share a key across maps?

Posted by Raymond Jennings III <ra...@yahoo.com>.
Hi, do you do this in the map method (open the file and read the first line)? Could you explain a little more about how you do it with configure()? Thank you.

--- On Fri, 1/8/10, Gang Luo <lg...@yahoo.com.cn> wrote:

> From: Gang Luo <lg...@yahoo.com.cn>
> Subject: Re: Is it possible to share a key across maps?
> To: common-user@hadoop.apache.org
> Date: Friday, January 8, 2010, 4:46 PM
> I would do it like this: in each map task, I get the input file for
> this mapper in configure() and manually read the first line of
> that file to get the user ID. Then the map function starts running.
> 
> 
> -Gang
> 
> 


      

Re: Is it possible to share a key across maps?

Posted by Gang Luo <lg...@yahoo.com.cn>.
I would do it like this: in each map task, I get the input file for
this mapper in configure() and manually read the first line of
that file to get the user ID. Then the map function starts running.


-Gang


----- Original Message -----
From: Raymond Jennings III <ra...@yahoo.com>
To: common-user@hadoop.apache.org
Sent: 2010/1/8 (Fri) 4:23:15 PM
Subject: Is it possible to share a key across maps?

I have large files where the userid is the first line of each file. I want to use that value as the output of the map phase for each subsequent line of the file. If each map task gets a chunk of the file, only one map task will read the key value from the first line. Is there any way I can force the other map tasks to wait until this key is read and then somehow pass the value to them? Or is my reasoning incorrect? Thanks.



Is it possible to share a key across maps?

Posted by Raymond Jennings III <ra...@yahoo.com>.
I have large files where the userid is the first line of each file. I want to use that value as the output of the map phase for each subsequent line of the file. If each map task gets a chunk of the file, only one map task will read the key value from the first line. Is there any way I can force the other map tasks to wait until this key is read and then somehow pass the value to them? Or is my reasoning incorrect? Thanks.