Posted to common-user@hadoop.apache.org by Scott Whitecross <sc...@dataxu.com> on 2010/03/12 14:22:16 UTC

Efficiently Stream into Sequence Files?

Hi -

I'd like to create a job that pulls small files from a remote server (using FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking at the sequence file API, I don't see an obvious way to do this.  It looks like what I have to do is pull the remote file to disk, then read the file into memory to place in the sequence file.  Is there a better way?

Looking at the API, am I forced to use the append method?

            FileSystem hdfs = FileSystem.get(context.getConfiguration());
            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
            writer = SequenceFile.createWriter(context.getConfiguration(), outputStream, Text.class, BytesWritable.class, null, null);
		
	   // read in file to remotefilebytes            

            writer.append(filekey, remotefilebytes);


The alternative would be to have one job pull the remote files, and a secondary job write them into sequence files.    

I'm using the latest Cloudera release, which I believe is Hadoop 20.1

Thanks.





Re: Efficiently Stream into Sequence Files?

Posted by Zak Stone <zs...@gmail.com>.
Well, do consider buffering a set of files in however much memory you
have for each map task and then waiting for Hadoop to stream some into
a SequenceFile before you download more. Your tasks can work with
batches of files that are small enough to fit in memory but large
enough to avoid download latency and to allow Hadoop to be writing
constantly.

Depending on your local filesystem and how many small files you have,
it could be very inefficient to write small files to the local disk
and then open them all again later.
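
For illustration, a rough sketch of that batching idea (the 64 MB threshold, the class name, and the caller-supplied byte[] per downloaded file are assumptions, not details from the thread):

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Accumulates downloaded files in memory and flushes them to an open
    // SequenceFile.Writer once the batch reaches a size threshold.
    public class BatchAppender {
        private final SequenceFile.Writer writer;
        private final Map<String, byte[]> batch = new LinkedHashMap<String, byte[]>();
        private long batchBytes = 0;

        public BatchAppender(SequenceFile.Writer writer) {
            this.writer = writer;
        }

        public void add(String name, byte[] contents) throws IOException {
            batch.put(name, contents);
            batchBytes += contents.length;
            if (batchBytes >= 64L * 1024 * 1024) {   // arbitrary ~64 MB batch
                flush();
            }
        }

        public void flush() throws IOException {
            for (Map.Entry<String, byte[]> e : batch.entrySet()) {
                writer.append(new Text(e.getKey()), new BytesWritable(e.getValue()));
            }
            batch.clear();
            batchBytes = 0;
        }
    }

The map task would call add() after each download and flush() once before closing the writer.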

Zak


On Mon, Mar 15, 2010 at 9:44 AM, Scott Whitecross <sc...@dataxu.com> wrote:
> I could, however, the "small" files could grow beyond what I want to allocate memory for.  I could drop the files to disk, and load them as well in the job, but that seems less efficient then just saving the files and processing with a secondary job to create sequence files.
>
> Thanks.
>
> On Mar 12, 2010, at 2:20 PM, Zak Stone wrote:
>
>> Why not write a Hadoop map task that fetches the remote files into
>> memory and then emits them as key-value pairs into a SequenceFile?
>>
>> Zak
>>
>>
>> On Fri, Mar 12, 2010 at 8:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:
>>> Hi -
>>>
>>> I'd like to create a job that pulls small files from a remote server (using FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking at the sequence file APi, I don't see an obvious way to do this.  It looks like what I have to do is pull the remote file to disk, then read the file into memory to place in the sequence file.  Is there a better way?
>>>
>>> Looking at the API, am I forced to use the append method?
>>>
>>>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>>>            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>>>            writer = SequenceFile.createWriter(context.getConfiguration(), outputStream, Text.class, BytesWritable.class, null, null);
>>>
>>>           // read in file to remotefilebytes
>>>
>>>            writer.append(filekey, remotefilebytes);
>>>
>>>
>>> The alternative would be to have one job pull the remote files, and a secondary job write them into sequence files.
>>>
>>> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>
>

Re: I want to group "similar" keys in the reducer.

Posted by Patrick Angeles <pa...@cloudera.com>.
You can use a custom Partitioner to send keys to a specific reducer. Note
that your reducer will still process one key at a time.
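
For illustration, a minimal Partitioner along these lines (the "strip trailing digits" rule for deciding which keys are similar, and the Text value type, are assumptions; substitute your own pattern):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes keys that share a prefix (e.g. KEY1 and KEY2 -> KEY) to the
    // same reducer by partitioning on the normalized prefix.
    public class SimilarKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String prefix = key.toString().replaceAll("\\d+$", "");
            return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Register it with job.setPartitionerClass(SimilarKeyPartitioner.class). As noted, reduce() is still called once per distinct key unless the grouping comparator is also changed (see Jim Twensky's reply later in this thread).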

On Mon, Mar 15, 2010 at 1:26 PM, Raymond Jennings III <raymondjiii@yahoo.com
> wrote:

> Is it possible to override a method in the reducer so that similar keys
> will be grouped together?  For example I want all keys of value "KEY1" and
> "KEY2" to merged together.  (My reducer has a KEY of type TEXT.)  Thanks.
>
>
>
>

Re: I want to group "similar" keys in the reducer.

Posted by Reik Schatz <re...@bwin.org>.
I think what you do in that case is to write your own Partitioner class. 
The default partitioning is based on the hash value. See 
http://wiki.apache.org/hadoop/HadoopMapReduce

Raymond Jennings III wrote:
> Is it possible to override a method in the reducer so that similar keys will be grouped together?  For example I want all keys of value "KEY1" and "KEY2" to merged together.  (My reducer has a KEY of type TEXT.)  Thanks.
>
>
>       
>   

-- 

Reik Schatz
Technical Lead, Platform
P: +46 8 562 470 00
M: +46 76 25 29 872
F: +46 8 562 470 01
E: reik.schatz@bwin.org
bwin Games AB
Klarabergsviadukten 82,
111 64 Stockholm, Sweden



RE: WritableName can't load class in hive

Posted by Oded Rotem <od...@gmail.com>.
It's there as well, and still, no luck.

-----Original Message-----
From: Alex Kozlov [mailto:alexvk@cloudera.com] 
Sent: Tuesday, March 16, 2010 8:02 PM
To: common-user@hadoop.apache.org
Subject: Re: WritableName can't load class in hive

Hive executable will put all jars in HIVE_LIB=${HIVE_HOME}/lib into
classpath.  Try putting your custom jar into the $HIVE_HOME/lib directory
and restarting the CLI.

On Tue, Mar 16, 2010 at 6:28 AM, Oded Rotem <od...@gmail.com> wrote:

> Yes, I run the CLI from a folder containing the jar in question.
>
> -----Original Message-----
> From: Sonal Goyal [mailto:sonalgoyal4@gmail.com]
> Sent: Tuesday, March 16, 2010 1:14 PM
> To: common-user@hadoop.apache.org
> Subject: Re: WritableName can't load class in hive
>
> For some custom functions, I put the jar on the local path accessible to
> the
> CLI. Have you tried that?
>
> Thanks and Regards,
> Sonal
>
>
> On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem <od...@gmail.com>
> wrote:
>
> > We have a bunch of sequence files containing keys & values of custom
> > Writable classes that we wrote, in a HDFS directory.
> >
> > We manage to view them using Hadoop fs -text. For further ad-hoc
> analysis,
> > we tried using Hive. Managed to load them as external tables in Hive,
> > however running a simple select count() against the table fails with
> > "WritableName can't load class" in the job output log.
> >
> > Executing
> >        add jar <path>
> > does not solve it.
> >
> > Where do we need to place the jar containing the definition of the
> writable
> > classes?
> >
> >
>
>


RE: WritableName can't load class in hive

Posted by Oded Rotem <od...@gmail.com>.
No, I didn't specify any SerDe. I'll read up on that and see if it works.

Thanks.

-----Original Message-----
From: Arvind Prabhakar [mailto:arvind@cloudera.com] 
Sent: Wednesday, March 17, 2010 10:40 PM
To: common-user@hadoop.apache.org; hive-user@hadoop.apache.org
Subject: Re: WritableName can't load class in hive

[cross posting to hive-user]

Oded - how did you create the table in Hive? Did you specify any row format
SerDe for the table? If not, then that may be the cause of this problem
since the default LazySimpleSerDe is unable to deserialize the custom
Writable key value pairs that you have used in your file.

-Arvind

On Tue, Mar 16, 2010 at 2:50 PM, Oded Rotem <od...@gmail.com> wrote:

> Actually, now I moved to this error:
>
> java.lang.RuntimeException: org.apache.hadoop.hive.serde2.SerDeException:
> class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either
> BytesWritable or Text object!
>
> -----Original Message-----
> From: Alex Kozlov [mailto:alexvk@cloudera.com]
> Sent: Tuesday, March 16, 2010 8:02 PM
> To: common-user@hadoop.apache.org
> Subject: Re: WritableName can't load class in hive
>
> Hive executable will put all jars in HIVE_LIB=${HIVE_HOME}/lib into
> classpath.  Try putting your custom jar into the $HIVE_HOME/lib directory
> and restarting the CLI.
>
> On Tue, Mar 16, 2010 at 6:28 AM, Oded Rotem <od...@gmail.com>
> wrote:
>
> > Yes, I run the CLI from a folder containing the jar in question.
> >
> > -----Original Message-----
> > From: Sonal Goyal [mailto:sonalgoyal4@gmail.com]
> > Sent: Tuesday, March 16, 2010 1:14 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: WritableName can't load class in hive
> >
> > For some custom functions, I put the jar on the local path accessible to
> > the
> > CLI. Have you tried that?
> >
> > Thanks and Regards,
> > Sonal
> >
> >
> > On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem <od...@gmail.com>
> > wrote:
> >
> > > We have a bunch of sequence files containing keys & values of custom
> > > Writable classes that we wrote, in a HDFS directory.
> > >
> > > We manage to view them using Hadoop fs -text. For further ad-hoc
> > analysis,
> > > we tried using Hive. Managed to load them as external tables in Hive,
> > > however running a simple select count() against the table fails with
> > > "WritableName can't load class" in the job output log.
> > >
> > > Executing
> > >        add jar <path>
> > > does not solve it.
> > >
> > > Where do we need to place the jar containing the definition of the
> > writable
> > > classes?
> > >
> > >
> >
> >
>
>


Re: WritableName can't load class in hive

Posted by Arvind Prabhakar <ar...@cloudera.com>.
[cross posting to hive-user]

Oded - how did you create the table in Hive? Did you specify any row format
SerDe for the table? If not, then that may be the cause of this problem
since the default LazySimpleSerDe is unable to deserialize the custom
Writable key value pairs that you have used in your file.

-Arvind

On Tue, Mar 16, 2010 at 2:50 PM, Oded Rotem <od...@gmail.com> wrote:

> Actually, now I moved to this error:
>
> java.lang.RuntimeException: org.apache.hadoop.hive.serde2.SerDeException:
> class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either
> BytesWritable or Text object!
>
> -----Original Message-----
> From: Alex Kozlov [mailto:alexvk@cloudera.com]
> Sent: Tuesday, March 16, 2010 8:02 PM
> To: common-user@hadoop.apache.org
> Subject: Re: WritableName can't load class in hive
>
> Hive executable will put all jars in HIVE_LIB=${HIVE_HOME}/lib into
> classpath.  Try putting your custom jar into the $HIVE_HOME/lib directory
> and restarting the CLI.
>
> On Tue, Mar 16, 2010 at 6:28 AM, Oded Rotem <od...@gmail.com>
> wrote:
>
> > Yes, I run the CLI from a folder containing the jar in question.
> >
> > -----Original Message-----
> > From: Sonal Goyal [mailto:sonalgoyal4@gmail.com]
> > Sent: Tuesday, March 16, 2010 1:14 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: WritableName can't load class in hive
> >
> > For some custom functions, I put the jar on the local path accessible to
> > the
> > CLI. Have you tried that?
> >
> > Thanks and Regards,
> > Sonal
> >
> >
> > On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem <od...@gmail.com>
> > wrote:
> >
> > > We have a bunch of sequence files containing keys & values of custom
> > > Writable classes that we wrote, in a HDFS directory.
> > >
> > > We manage to view them using Hadoop fs -text. For further ad-hoc
> > analysis,
> > > we tried using Hive. Managed to load them as external tables in Hive,
> > > however running a simple select count() against the table fails with
> > > "WritableName can't load class" in the job output log.
> > >
> > > Executing
> > >        add jar <path>
> > > does not solve it.
> > >
> > > Where do we need to place the jar containing the definition of the
> > writable
> > > classes?
> > >
> > >
> >
> >
>
>

RE: WritableName can't load class in hive

Posted by Oded Rotem <od...@gmail.com>.
Actually, now I moved to this error:

java.lang.RuntimeException: org.apache.hadoop.hive.serde2.SerDeException:
class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either
BytesWritable or Text object!

-----Original Message-----
From: Alex Kozlov [mailto:alexvk@cloudera.com] 
Sent: Tuesday, March 16, 2010 8:02 PM
To: common-user@hadoop.apache.org
Subject: Re: WritableName can't load class in hive

Hive executable will put all jars in HIVE_LIB=${HIVE_HOME}/lib into
classpath.  Try putting your custom jar into the $HIVE_HOME/lib directory
and restarting the CLI.

On Tue, Mar 16, 2010 at 6:28 AM, Oded Rotem <od...@gmail.com> wrote:

> Yes, I run the CLI from a folder containing the jar in question.
>
> -----Original Message-----
> From: Sonal Goyal [mailto:sonalgoyal4@gmail.com]
> Sent: Tuesday, March 16, 2010 1:14 PM
> To: common-user@hadoop.apache.org
> Subject: Re: WritableName can't load class in hive
>
> For some custom functions, I put the jar on the local path accessible to
> the
> CLI. Have you tried that?
>
> Thanks and Regards,
> Sonal
>
>
> On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem <od...@gmail.com>
> wrote:
>
> > We have a bunch of sequence files containing keys & values of custom
> > Writable classes that we wrote, in a HDFS directory.
> >
> > We manage to view them using Hadoop fs -text. For further ad-hoc
> analysis,
> > we tried using Hive. Managed to load them as external tables in Hive,
> > however running a simple select count() against the table fails with
> > "WritableName can't load class" in the job output log.
> >
> > Executing
> >        add jar <path>
> > does not solve it.
> >
> > Where do we need to place the jar containing the definition of the
> writable
> > classes?
> >
> >
>
>


Re: WritableName can't load class in hive

Posted by Alex Kozlov <al...@cloudera.com>.
The Hive executable puts all jars in HIVE_LIB=${HIVE_HOME}/lib on the
classpath.  Try putting your custom jar into the $HIVE_HOME/lib directory
and restarting the CLI.

On Tue, Mar 16, 2010 at 6:28 AM, Oded Rotem <od...@gmail.com> wrote:

> Yes, I run the CLI from a folder containing the jar in question.
>
> -----Original Message-----
> From: Sonal Goyal [mailto:sonalgoyal4@gmail.com]
> Sent: Tuesday, March 16, 2010 1:14 PM
> To: common-user@hadoop.apache.org
> Subject: Re: WritableName can't load class in hive
>
> For some custom functions, I put the jar on the local path accessible to
> the
> CLI. Have you tried that?
>
> Thanks and Regards,
> Sonal
>
>
> On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem <od...@gmail.com>
> wrote:
>
> > We have a bunch of sequence files containing keys & values of custom
> > Writable classes that we wrote, in a HDFS directory.
> >
> > We manage to view them using Hadoop fs -text. For further ad-hoc
> analysis,
> > we tried using Hive. Managed to load them as external tables in Hive,
> > however running a simple select count() against the table fails with
> > "WritableName can't load class" in the job output log.
> >
> > Executing
> >        add jar <path>
> > does not solve it.
> >
> > Where do we need to place the jar containing the definition of the
> writable
> > classes?
> >
> >
>
>

RE: WritableName can't load class in hive

Posted by Oded Rotem <od...@gmail.com>.
Yes, I run the CLI from a folder containing the jar in question.

-----Original Message-----
From: Sonal Goyal [mailto:sonalgoyal4@gmail.com] 
Sent: Tuesday, March 16, 2010 1:14 PM
To: common-user@hadoop.apache.org
Subject: Re: WritableName can't load class in hive

For some custom functions, I put the jar on the local path accessible to the
CLI. Have you tried that?

Thanks and Regards,
Sonal


On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem <od...@gmail.com> wrote:

> We have a bunch of sequence files containing keys & values of custom
> Writable classes that we wrote, in a HDFS directory.
>
> We manage to view them using Hadoop fs -text. For further ad-hoc analysis,
> we tried using Hive. Managed to load them as external tables in Hive,
> however running a simple select count() against the table fails with
> "WritableName can't load class" in the job output log.
>
> Executing
>        add jar <path>
> does not solve it.
>
> Where do we need to place the jar containing the definition of the
writable
> classes?
>
>


Re: WritableName can't load class in hive

Posted by Sonal Goyal <so...@gmail.com>.
For some custom functions, I put the jar on the local path accessible to the
CLI. Have you tried that?

Thanks and Regards,
Sonal


On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem <od...@gmail.com> wrote:

> We have a bunch of sequence files containing keys & values of custom
> Writable classes that we wrote, in a HDFS directory.
>
> We manage to view them using Hadoop fs -text. For further ad-hoc analysis,
> we tried using Hive. Managed to load them as external tables in Hive,
> however running a simple select count() against the table fails with
> "WritableName can't load class" in the job output log.
>
> Executing
>        add jar <path>
> does not solve it.
>
> Where do we need to place the jar containing the definition of the writable
> classes?
>
>

WritableName can't load class in hive

Posted by Oded Rotem <od...@gmail.com>.
We have a bunch of sequence files, in an HDFS directory, containing keys & values of custom Writable classes that we wrote.

We can view them using hadoop fs -text. For further ad-hoc analysis, we tried using Hive. We managed to load them as external tables in Hive; however, running a simple select count() against the table fails with "WritableName can't load class" in the job output log.

Executing 
	add jar <path> 
does not solve it.

Where do we need to place the jar containing the definition of the writable classes?


Re: TTL of distributed cache

Posted by Gang Luo <lg...@yahoo.com.cn>.
It is smart. Thanks Amareshwari.

-Gang


----- Original Message ----
From: Amareshwari Sri Ramadasu <am...@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Sent: 2010/3/16 (Tue) 11:08:31 PM
Subject: Re: TTL of distributed cache

Hi Gang,
Answers inline.

On 3/16/10 9:58 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Hi all,
what is the life length of the distributed cache files?
Localized cache file will be removed, if the file is not used by any job and localized disk space on the machine goes higher than configured local.cache.size(by default, 10 GB).

Will hadoop redistributed the same file to the same node twice if it is being used by two jobs?
No, It will be localized only once. Both the jobs will use the same localized file. If the file gets modified on DFS, then it will be localized once again.

Thanks
Amareshwari


      

Re: TTL of distributed cache

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
Hi Gang,
Answers inline.

On 3/16/10 9:58 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Hi all,
What is the lifetime of distributed cache files?
A localized cache file will be removed if it is not used by any job and the localized disk space on the machine goes above the configured local.cache.size (10 GB by default).

Will Hadoop redistribute the same file to the same node twice if it is used by two jobs?
No, it will be localized only once. Both jobs will use the same localized file. If the file is modified on DFS, it will be localized again.
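
For reference, a sketch of the 0.20-era API in question (the class name and HDFS path are placeholders):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    // Adds one HDFS file to the distributed cache; as described above it is
    // localized on each node only once and re-localized only if it changes
    // on DFS.
    public class CacheExample {
        public static Job jobWithCacheFile(String hdfsPath) throws Exception {
            Configuration conf = new Configuration();
            DistributedCache.addCacheFile(new URI(hdfsPath), conf);
            return new Job(conf, "uses-distributed-cache");
        }
        // In a task, the localized copies are returned by
        // DistributedCache.getLocalCacheFiles(conf).
    }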

Thanks
Amareshwari





TTL of distributed cache

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
What is the lifetime of distributed cache files? Will Hadoop redistribute the same file to the same node twice if it is used by two jobs?

Thanks,
-Gang


      

Re: Is there an easy way to clear old jobs from the jobtracker webpage?

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
The mapred.jobtracker.completeuserjobs.maximum property specifies the number of completed jobs to be kept on the JobTracker page at any time. After that, they are available under the history page. Probably setting this to 0 will do the trick?

Amogh


On 3/17/10 10:09 PM, "Raymond Jennings III" <ra...@yahoo.com> wrote:

I'd like to be able to clear the contents of the jobs that have completed running on the jobtracker webpage.  Is there an easy way to do this without restarting the cluster?





Is there an easy way to clear old jobs from the jobtracker webpage?

Posted by Raymond Jennings III <ra...@yahoo.com>.
I'd like to be able to clear the contents of the jobs that have completed running on the jobtracker webpage.  Is there an easy way to do this without restarting the cluster?


      

Re: I want to group "similar" keys in the reducer.

Posted by Sonal Goyal <so...@gmail.com>.
Hi Raymond,

A custom partitioner is probably what you need.
An alternate approach is to emit keys based on your pattern. Say you are
currently emitting <KEY1, Val1> , <KEY2, Val2>, <K1, Val3>, <K4, Val4>

You can instead emit

<KEY, <Key1, Val1>> <KEY, <Key2, Val2>> <K, <K1, Val3>> <K, <K4, Val4>>
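
For illustration, a mapper along these lines (tab-separated input, the class name, and the trailing-digit normalization rule are all assumptions):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits a normalized key (KEY1, KEY2 -> KEY) and folds the original key
    // into the value so the reducer can still distinguish the inputs.
    public class NormalizingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            String originalKey = parts[0];
            String value = parts.length > 1 ? parts[1] : "";
            String normalizedKey = originalKey.replaceAll("\\d+$", "");
            context.write(new Text(normalizedKey), new Text(originalKey + "\t" + value));
        }
    }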

Thanks and Regards,
Sonal


2010/3/16 Jim Twensky <ji...@gmail.com>

> Hi Raymond,
>
> Take a look at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setGroupingComparatorClass(java.lang.Class)<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setGroupingComparatorClass%28java.lang.Class%29>
> .
> I think this is what you want. Also make sure to implement a custom
> partitioner that only takes into account the first part of the key,
> namely the KEY part. You can search for "Secondary Sort" and "Hadoop"
> to see some tutorials on this topic.
>
> Cheers,
> Jim
>
> 2010/3/15 Gang Luo <lg...@yahoo.com.cn>:
> > you need to define a pattern and implement you own partitioner so that
> all the similar keys you want to group will go the the same reducer. At
> reduce side, you possibly need to  implement secondary  sorting so that the
> keys you want to group are grouped in the sorted input to reducer. For
> reduce method process on key at one time, you also need to maintain a window
> to buffer all the keys being grouped.
> >
> > -Gang
> >
> >
> >
> > ----- Original Message ----
> > From: Raymond Jennings III <ra...@yahoo.com>
> > To: common-user@hadoop.apache.org
> > Sent: 2010/3/15 (Mon) 1:26:09 PM
> > Subject: I want to group "similar" keys in the reducer.
> >
> > Is it possible to override a method in the reducer so that similar keys
> will be grouped together?  For example I want all keys of value "KEY1" and
> "KEY2" to merged together.  (My reducer has a KEY of type TEXT.)  Thanks.
> >
> >
> >
> >
>

Re: I want to group "similar" keys in the reducer.

Posted by Jim Twensky <ji...@gmail.com>.
Hi Raymond,

Take a look at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setGroupingComparatorClass(java.lang.Class).
I think this is what you want. Also make sure to implement a custom
partitioner that only takes into account the first part of the key,
namely the KEY part. You can search for "Secondary Sort" and "Hadoop"
to see some tutorials on this topic.
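
For illustration, a grouping comparator along those lines (the prefix rule is an assumption; pair it with a matching partitioner as described above):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Treats two Text keys as equal when their prefixes match, so KEY1 and
    // KEY2 arrive in the same reduce() call.
    public class PrefixGroupingComparator extends WritableComparator {
        public PrefixGroupingComparator() {
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String pa = a.toString().replaceAll("\\d+$", "");
            String pb = b.toString().replaceAll("\\d+$", "");
            return pa.compareTo(pb);
        }
    }

Wire it in with job.setGroupingComparatorClass(PrefixGroupingComparator.class).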

Cheers,
Jim

2010/3/15 Gang Luo <lg...@yahoo.com.cn>:
> you need to define a pattern and implement you own partitioner so that all the similar keys you want to group will go the the same reducer. At reduce side, you possibly need to  implement secondary  sorting so that the keys you want to group are grouped in the sorted input to reducer. For reduce method process on key at one time, you also need to maintain a window to buffer all the keys being grouped.
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Raymond Jennings III <ra...@yahoo.com>
> To: common-user@hadoop.apache.org
> Sent: 2010/3/15 (Mon) 1:26:09 PM
> Subject: I want to group "similar" keys in the reducer.
>
> Is it possible to override a method in the reducer so that similar keys will be grouped together?  For example I want all keys of value "KEY1" and "KEY2" to merged together.  (My reducer has a KEY of type TEXT.)  Thanks.
>
>
>
>

Re: I want to group "similar" keys in the reducer.

Posted by Gang Luo <lg...@yahoo.com.cn>.
You need to define a pattern and implement your own partitioner so that all the similar keys you want to group go to the same reducer. On the reduce side, you may also need to implement secondary sorting so that the keys you want to group appear together in the sorted input to the reducer. Because the reduce method processes one key at a time, you also need to maintain a window that buffers all the keys being grouped.

-Gang



----- Original Message ----
From: Raymond Jennings III <ra...@yahoo.com>
To: common-user@hadoop.apache.org
Sent: 2010/3/15 (Mon) 1:26:09 PM
Subject: I want to group "similar" keys in the reducer.

Is it possible to override a method in the reducer so that similar keys will be grouped together?  For example I want all keys of value "KEY1" and "KEY2" to merged together.  (My reducer has a KEY of type TEXT.)  Thanks.


      

I want to group "similar" keys in the reducer.

Posted by Raymond Jennings III <ra...@yahoo.com>.
Is it possible to override a method in the reducer so that similar keys will be grouped together?  For example, I want all keys of value "KEY1" and "KEY2" to be merged together.  (My reducer has a KEY of type TEXT.)  Thanks.


      

Re: Can I pass a user value to my reducer?

Posted by Ted Yu <yu...@gmail.com>.
Raymond:
You can use the following code to pass value from main to your reducer
class:
        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            try {
                Configuration conf = context.getConfiguration();
                watermark = Long.parseLong(conf.get("watermark"));
            } catch (NumberFormatException e) {
                e.printStackTrace();
            }

        }
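
For completeness, a sketch of the driver side this pairs with; the "watermark" key matches the snippet above, while the class name and job name are assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Puts the value into the job Configuration before submission, under the
    // same "watermark" key that the reducer's setup() parses back out.
    public class WatermarkDriver {
        public static Job createJob(long watermark) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("watermark", watermark);
            Job job = new Job(conf, "watermark-job");
            // ... set mapper, reducer, input/output paths as usual ...
            return job;
        }
    }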


On Mon, Mar 15, 2010 at 7:54 AM, Nick Jones <ni...@amd.com> wrote:

> On 3/15/2010 9:51 AM, Raymond Jennings III wrote:
>
>> I need to pass a counter value to my reducer from the main program.  Can
>> this be done through the context parameter somehow?
>>
>>
>>
>>
>>
>>
> Have you tried serializing both the mapper output value and this counter as
> "value" in the key/value pair?
>
> Nick Jones
>
>
>

Re: Can I pass a user value to my reducer?

Posted by Nick Jones <ni...@amd.com>.
On 3/15/2010 9:51 AM, Raymond Jennings III wrote:
> I need to pass a counter value to my reducer from the main program.  Can this be done through the context parameter somehow?
>
>
>
>
>    
Have you tried serializing both the mapper output value and this counter 
as "value" in the key/value pair?

Nick Jones



Can I pass a user value to my reducer?

Posted by Raymond Jennings III <ra...@yahoo.com>.
I need to pass a counter value to my reducer from the main program.  Can this be done through the context parameter somehow?


      

Re: Efficiently Stream into Sequence Files?

Posted by Scott Whitecross <sc...@dataxu.com>.
I could; however, the "small" files could grow beyond what I want to allocate memory for.  I could drop the files to disk and load them in the job as well, but that seems less efficient than just saving the files and processing them with a secondary job to create sequence files.

Thanks.

On Mar 12, 2010, at 2:20 PM, Zak Stone wrote:

> Why not write a Hadoop map task that fetches the remote files into
> memory and then emits them as key-value pairs into a SequenceFile?
> 
> Zak
> 
> 
> On Fri, Mar 12, 2010 at 8:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:
>> Hi -
>> 
>> I'd like to create a job that pulls small files from a remote server (using FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking at the sequence file APi, I don't see an obvious way to do this.  It looks like what I have to do is pull the remote file to disk, then read the file into memory to place in the sequence file.  Is there a better way?
>> 
>> Looking at the API, am I forced to use the append method?
>> 
>>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>>            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>>            writer = SequenceFile.createWriter(context.getConfiguration(), outputStream, Text.class, BytesWritable.class, null, null);
>> 
>>           // read in file to remotefilebytes
>> 
>>            writer.append(filekey, remotefilebytes);
>> 
>> 
>> The alternative would be to have one job pull the remote files, and a secondary job write them into sequence files.
>> 
>> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>> 
>> Thanks.
>> 
>> 
>> 
>> 
>> 


Re: Efficiently Stream into Sequence Files?

Posted by Zak Stone <zs...@gmail.com>.
Why not write a Hadoop map task that fetches the remote files into
memory and then emits them as key-value pairs into a SequenceFile?

Zak


On Fri, Mar 12, 2010 at 8:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:
> Hi -
>
> I'd like to create a job that pulls small files from a remote server (using FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking at the sequence file APi, I don't see an obvious way to do this.  It looks like what I have to do is pull the remote file to disk, then read the file into memory to place in the sequence file.  Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>            writer = SequenceFile.createWriter(context.getConfiguration(), outputStream, Text.class, BytesWritable.class, null, null);
>
>           // read in file to remotefilebytes
>
>            writer.append(filekey, remotefilebytes);
>
>
> The alternative would be to have one job pull the remote files, and a secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>
> Thanks.
>
>
>
>
>

Re: Efficiently Stream into Sequence Files?

Posted by Patrick Angeles <pa...@cloudera.com>.
Scott,

The code you have below should work, provided that the 'outputPath' points
to an HDFS file. The trick is to get FTP/SCP access to the remote files
using a Java client and receive the contents into a byte buffer. You can then
set that byte buffer into your BytesWritable and call writer.append().
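
For illustration, a sketch of exactly that; the InputStream would come from whatever FTP/SCP client library you use, which is assumed here:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RemoteAppend {
        // Buffers one remote stream fully in memory, then appends it to the
        // SequenceFile as a single record keyed by the file name.
        public static void appendRemoteFile(SequenceFile.Writer writer,
                                            String name, InputStream in)
                throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[64 * 1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            writer.append(new Text(name), new BytesWritable(buf.toByteArray()));
        }
    }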

On Fri, Mar 12, 2010 at 9:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:

> Hi -
>
> I'd like to create a job that pulls small files from a remote server (using
> FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking
> at the sequence file APi, I don't see an obvious way to do this.  It looks
> like what I have to do is pull the remote file to disk, then read the file
> into memory to place in the sequence file.  Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>            FSDataOutputStream outputStream = hdfs.create(new
> Path(outputPath));
>            writer = SequenceFile.createWriter(context.getConfiguration(),
> outputStream, Text.class, BytesWritable.class, null, null);
>
>           // read in file to remotefilebytes
>
>            writer.append(filekey, remotefilebytes);
>
>
> The alternative would be to have one job pull the remote files, and a
> secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>
> Thanks.
>
>
>
>
>

Re: Efficiently Stream into Sequence Files?

Posted by Scott Whitecross <sc...@dataxu.com>.
I just looked at the javadocs, but it is unclear to me what the difference is between a TFile and a sequence file.  It also looks like you need to append the data in a similar way as with normal sequence files.

On Mar 12, 2010, at 2:15 PM, Hong Tang wrote:

> Have you looked at TFile?
> 
> On Mar 12, 2010, at 5:22 AM, Scott Whitecross wrote:
> 
>> Hi -
>> 
>> I'd like to create a job that pulls small files from a remote server  
>> (using FTP, SCP, etc.) and stores them directly to sequence files on  
>> HDFS.  Looking at the sequence file APi, I don't see an obvious way  
>> to do this.  It looks like what I have to do is pull the remote file  
>> to disk, then read the file into memory to place in the sequence  
>> file.  Is there a better way?
>> 
>> Looking at the API, am I forced to use the append method?
>> 
>>           FileSystem hdfs =  
>> FileSystem.get(context.getConfiguration());
>>           FSDataOutputStream outputStream = hdfs.create(new  
>> Path(outputPath));
>>           writer =  
>> SequenceFile.createWriter(context.getConfiguration(), outputStream,  
>> Text.class, BytesWritable.class, null, null);
>> 		
>> 	   // read in file to remotefilebytes
>> 
>>           writer.append(filekey, remotefilebytes);
>> 
>> 
>> The alternative would be to have one job pull the remote files, and  
>> a secondary job write them into sequence files.
>> 
>> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>> 
>> Thanks.
>> 
>> 
>> 
>> 
> 


Re: Efficiently Stream into Sequence Files?

Posted by Hong Tang <ht...@yahoo-inc.com>.
Have you looked at TFile?

On Mar 12, 2010, at 5:22 AM, Scott Whitecross wrote:

> Hi -
>
> I'd like to create a job that pulls small files from a remote server  
> (using FTP, SCP, etc.) and stores them directly to sequence files on  
> HDFS.  Looking at the sequence file APi, I don't see an obvious way  
> to do this.  It looks like what I have to do is pull the remote file  
> to disk, then read the file into memory to place in the sequence  
> file.  Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
>            FileSystem hdfs =  
> FileSystem.get(context.getConfiguration());
>            FSDataOutputStream outputStream = hdfs.create(new  
> Path(outputPath));
>            writer =  
> SequenceFile.createWriter(context.getConfiguration(), outputStream,  
> Text.class, BytesWritable.class, null, null);
> 		
> 	   // read in file to remotefilebytes
>
>            writer.append(filekey, remotefilebytes);
>
>
> The alternative would be to have one job pull the remote files, and  
> a secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>
> Thanks.
>
>
>
>