Posted to mapreduce-user@hadoop.apache.org by 丛林 <co...@gmail.com> on 2011/05/12 02:17:32 UTC

How to merge several SequenceFile into one?

Hi all,

There are lots of SequenceFiles in HDFS; how can I merge them into one
SequenceFile?

Thanks for your suggestion.

-Lin

Re: How to merge several SequenceFile into one?

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

> There are lots of SequenceFiles in HDFS; how can I merge them into one
> SequenceFile?

The simplest way to do that is to create a job (see the sketch below) with
- input format = sequence file
- map = identity mapper
- reduce = identity reducer
- output format = sequence file
and
 job.setNumReduceTasks(1)
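A minimal driver along those lines might look like the sketch below; the key
and value classes (Text and BytesWritable here) are placeholders for whatever
types your SequenceFiles actually contain, and the paths come from the
command line.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

  public class MergeSequenceFiles {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "merge-sequencefiles");
      job.setJarByClass(MergeSequenceFiles.class);

      // Read and write SequenceFiles; the default Mapper and Reducer
      // simply pass every key/value pair through unchanged.
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      job.setOutputKeyClass(Text.class);            // placeholder key type
      job.setOutputValueClass(BytesWritable.class); // placeholder value type

      // A single reducer funnels all records into one output file.
      job.setNumReduceTasks(1);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Run it with the input directory and a new output directory as arguments; the
merged file shows up as part-r-00000 under the output directory.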

However, I think this is a useless thing to do.
Sequence files are only really useful inside a Hadoop cluster, serving
as input for later jobs, and having multiple files only helps Hadoop
scale out.

So my question to you: Why do you want that?



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

RE: AW: How to merge several SequenceFile into one?

Posted by Panayotis Antonopoulos <an...@hotmail.com>.
I would like to merge some SequenceFiles as well, so any help would be great!

Although the solution with the single reducer works great, my files are small, so I don't need the distribution.
I think I will create a simple Java program that will read these files and merge them.
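In case it is useful, here is a rough sketch of what such a program could
look like, using SequenceFile.Reader and SequenceFile.Writer directly (no
MapReduce). The input paths are made up, and it assumes all inputs share the
same key/value classes.

  import java.util.Arrays;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;

  public class SimpleSequenceFileMerger {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      List<Path> inputs = Arrays.asList(new Path("/in/part-00000"),   // made-up paths
                                        new Path("/in/part-00001"));
      Path output = new Path("/out/merged.seq");

      SequenceFile.Writer writer = null;
      for (Path in : inputs) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        if (writer == null) {
          // Create the output with the key/value classes of the first input.
          writer = SequenceFile.createWriter(fs, conf, output,
                                             reader.getKeyClass(), reader.getValueClass());
        }
        // Copy every record as-is into the single output file.
        while (reader.next(key, value)) {
          writer.append(key, value);
        }
        reader.close();
      }
      if (writer != null) {
        writer.close();
      }
    }
  }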

> From: Christoph.Schmitz@1und1.de
> To: mapreduce-user@hadoop.apache.org
> Date: Thu, 12 May 2011 15:44:57 +0200
> Subject: AW: How to merge several SequenceFile into one?
> 
> Oops, sorry, I answered in the wrong thread. I intended to reply to the "How to create a SequenceFile faster" issue.
> 
> Regards,
> Christoph
> 
> -----Original Message-----
> From: 丛林 [mailto:conglin02@gmail.com]
> Sent: Thursday, 12 May 2011 14:30
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: How to merge several SequenceFile into one?
> 
> Hi Christoph,
> 
> If there is no reducer, how can these sequence files be merged?
> 
> Thanks for your advice.
> 
> Best Wishes,
> 
> -Lin
> 
> On 12 May 2011 at 19:44, Christoph Schmitz <Ch...@1und1.de> wrote:
> > Hi Lin,
> >
> > you could run a map-only job, i.e. read your data and output it from the mapper without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use job.setNumReduceTasks(0)).
> >
> > That way, you parallelize over your inputs through a number of mappers and do not have any sort/shuffle/reduce overhead.
> >
> > Regards,
> > Christoph
> >
> > -----Original Message-----
> > From: 丛林 [mailto:conglin02@gmail.com]
> > Sent: Thursday, 12 May 2011 13:16
> > To: mapreduce-user@hadoop.apache.org
> > Subject: Re: How to merge several SequenceFile into one?
> >
> > Dear Jason,
> >
> > If the order of the keys in the sequence files is not important to me
> > (in other words, the sort process is not necessary), how can I stop the
> > distributed sort to save resources?
> >
> > Thanks for your suggestion.
> >
> > Best Wishes,
> >
> > -Lin
> >
> > 2011/5/12 jason <ur...@gmail.com>:
> >> An M/R job with a single reducer would do the job. This way you can
> >> utilize the distributed sort and merge/combine/dedupe key/values as
> >> you wish.
> >>
> >> On 5/11/11, 丛林 <co...@gmail.com> wrote:
> >>> Hi all,
> >>>
> >>> There are lots of SequenceFiles in HDFS; how can I merge them into one
> >>> SequenceFile?
> >>>
> >>> Thanks for your suggestion.
> >>>
> >>> -Lin
> >>>
> >>
> >

AW: How to merge several SequenceFile into one?

Posted by Christoph Schmitz <Ch...@1und1.de>.
Oops, sorry, I answered in the wrong thread. I intended to reply to the "How to create a SequenceFile faster" issue.

Regards,
Christoph

-----Original Message-----
From: 丛林 [mailto:conglin02@gmail.com]
Sent: Thursday, 12 May 2011 14:30
To: mapreduce-user@hadoop.apache.org
Subject: Re: How to merge several SequenceFile into one?

Hi Christoph,

If there is no reducer, how can these sequence files be merged?

Thanks for your advice.

Best Wishes,

-Lin

On 12 May 2011 at 19:44, Christoph Schmitz <Ch...@1und1.de> wrote:
> Hi Lin,
>
> you could run a map-only job, i.e. read your data and output it from the mapper without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use job.setNumReduceTasks(0)).
>
> That way, you parallelize over your inputs through a number of mappers and do not have any sort/shuffle/reduce overhead.
>
> Regards,
> Christoph
>
> -----Original Message-----
> From: 丛林 [mailto:conglin02@gmail.com]
> Sent: Thursday, 12 May 2011 13:16
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: How to merge several SequenceFile into one?
>
> Dear Jason,
>
> If the order of the keys in the sequence files is not important to me
> (in other words, the sort process is not necessary), how can I stop the
> distributed sort to save resources?
>
> Thanks for your suggestion.
>
> Best Wishes,
>
> -Lin
>
> 2011/5/12 jason <ur...@gmail.com>:
>> An M/R job with a single reducer would do the job. This way you can
>> utilize the distributed sort and merge/combine/dedupe key/values as
>> you wish.
>>
>> On 5/11/11, 丛林 <co...@gmail.com> wrote:
>>> Hi all,
>>>
>>> There are lots of SequenceFiles in HDFS; how can I merge them into one
>>> SequenceFile?
>>>
>>> Thanks for your suggestion.
>>>
>>> -Lin
>>>
>>
>

Re: How to merge several SequenceFile into one?

Posted by 丛林 <co...@gmail.com>.
Hi Christoph,

If there is no reducer, how can these sequence files be merged?

Thanks for your advice.

Best Wishes,

-Lin

On 12 May 2011 at 19:44, Christoph Schmitz <Ch...@1und1.de> wrote:
> Hi Lin,
>
> you could run a map-only job, i.e. read your data and output it from the mapper without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use job.setNumReduceTasks(0)).
>
> That way, you parallelize over your inputs through a number of mappers and do not have any sort/shuffle/reduce overhead.
>
> Regards,
> Christoph
>
> -----Original Message-----
> From: 丛林 [mailto:conglin02@gmail.com]
> Sent: Thursday, 12 May 2011 13:16
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: How to merge several SequenceFile into one?
>
> Dear Jason,
>
> If the order of the keys in the sequence files is not important to me
> (in other words, the sort process is not necessary), how can I stop the
> distributed sort to save resources?
>
> Thanks for your suggestion.
>
> Best Wishes,
>
> -Lin
>
> 2011/5/12 jason <ur...@gmail.com>:
>> An M/R job with a single reducer would do the job. This way you can
>> utilize the distributed sort and merge/combine/dedupe key/values as
>> you wish.
>>
>> On 5/11/11, 丛林 <co...@gmail.com> wrote:
>>> Hi all,
>>>
>>> There are lots of SequenceFiles in HDFS; how can I merge them into one
>>> SequenceFile?
>>>
>>> Thanks for your suggestion.
>>>
>>> -Lin
>>>
>>
>

AW: How to merge several SequenceFile into one?

Posted by Christoph Schmitz <Ch...@1und1.de>.
Hi Lin,

you could run a map-only job, i.e. read your data and output it from the mapper without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use job.setNumReduceTasks(0)).

That way, you parallelize over your inputs through a number of mappers and do not have any sort/shuffle/reduce overhead.
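For completeness, a map-only variant of a pass-through driver would just set
zero reduce tasks, as in the hedged sketch below (Text/BytesWritable are
placeholder types). Note that with no reduce phase each mapper writes its own
output file, so this avoids the sort/shuffle but does not produce a single
merged file.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

  public class MapOnlyCopy {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "map-only-copy");
      job.setJarByClass(MapOnlyCopy.class);
      // Pass SequenceFile records straight through the default Mapper.
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      job.setOutputKeyClass(Text.class);            // placeholder key type
      job.setOutputValueClass(BytesWritable.class); // placeholder value type
      job.setNumReduceTasks(0); // same effect as -D mapred.reduce.tasks=0
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }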

Regards,
Christoph

-----Original Message-----
From: 丛林 [mailto:conglin02@gmail.com]
Sent: Thursday, 12 May 2011 13:16
To: mapreduce-user@hadoop.apache.org
Subject: Re: How to merge several SequenceFile into one?

Dear Jason,

If the order of the keys in the sequence files is not important to me
(in other words, the sort process is not necessary), how can I stop the
distributed sort to save resources?

Thanks for your suggestion.

Best Wishes,

-Lin

2011/5/12 jason <ur...@gmail.com>:
> An M/R job with a single reducer would do the job. This way you can
> utilize the distributed sort and merge/combine/dedupe key/values as
> you wish.
>
> On 5/11/11, 丛林 <co...@gmail.com> wrote:
>> Hi all,
>>
>> There are lots of SequenceFiles in HDFS; how can I merge them into one
>> SequenceFile?
>>
>> Thanks for your suggestion.
>>
>> -Lin
>>
>

Re: How to merge several SequenceFile into one?

Posted by 丛林 <co...@gmail.com>.
Dear Jason,

If the order of the keys in the sequence files is not important to me
(in other words, the sort process is not necessary), how can I stop the
distributed sort to save resources?

Thanks for your suggestion.

Best Wishes,

-Lin

2011/5/12 jason <ur...@gmail.com>:
> An M/R job with a single reducer would do the job. This way you can
> utilize the distributed sort and merge/combine/dedupe key/values as
> you wish.
>
> On 5/11/11, 丛林 <co...@gmail.com> wrote:
>> Hi all,
>>
>> There are lots of SequenceFiles in HDFS; how can I merge them into one
>> SequenceFile?
>>
>> Thanks for your suggestion.
>>
>> -Lin
>>
>

Re: How to merge several SequenceFile into one?

Posted by jason <ur...@gmail.com>.
An M/R job with a single reducer would do the job. This way you can
utilize the distributed sort and merge/combine/dedupe key/values as
you wish.
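To illustrate the merge/combine/dedupe part, a custom reducer could, for
example, keep only the first value seen for each key. The class below is
hypothetical (as are its Text/BytesWritable types) and would be plugged into
the single-reducer job with job.setReducerClass(DedupeReducer.class).

  import java.io.IOException;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class DedupeReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
        throws IOException, InterruptedException {
      // All values for a key arrive together after the distributed sort;
      // keep only the first one to drop duplicate keys.
      for (BytesWritable value : values) {
        context.write(key, value);
        break;
      }
    }
  }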

On 5/11/11, 丛林 <co...@gmail.com> wrote:
> Hi all,
>
> There are lots of SequenceFiles in HDFS; how can I merge them into one
> SequenceFile?
>
> Thanks for your suggestion.
>
> -Lin
>