Posted to mapreduce-user@hadoop.apache.org by Jerry Lam <ch...@gmail.com> on 2013/09/10 17:07:10 UTC

Concatenate multiple sequence files into 1 big sequence file

Hi Hadoop users,

I have been trying to concatenate multiple sequence files into one.
Since the total size of the sequence files is quite big (1TB), I won't use
MapReduce, because it would require 1TB on the reducer host to hold the
temporary data.

I ended up doing what has been suggested in this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAOcnVr2CuBdNkXutyydGjw2td19HHYiMwo4=JUa=SrXi51717w@mail.gmail.com%3E

It works very well. I wonder if there is a faster way to append to a
sequence file.

Currently, the code looks like this (omitting opening and closing the
sequence files, exception handling, etc.):

// each seq is a sequence file (e.g. a FileStatus from FileSystem#listStatus)
// writer is a SequenceFile.Writer for the output file
for (FileStatus seq : seqs) {
    reader = new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
    while (reader.next(readerKey, readerValue)) {
        writer.append(readerKey, readerValue);
    }
}

Is there a better way to do this? Note that I think it is wasteful to
deserialize and serialize the key and value in the while loop, because the
program simply appends them to the sequence file. Also, I don't seem to be
able to read and write fast enough (about 6 MB/sec).
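
For reference, SequenceFile also exposes a raw-record API (Reader.nextRaw
together with Writer.appendRaw) that copies the key and value bytes without
deserializing them. A rough, untested sketch of that approach, assuming every
input uses the same key/value classes and the same compression type and codec
as the output writer (the class and method names below are illustrative only):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.SequenceFile.Writer;

public class RawSeqFileConcat {

    // Copies every record from each input into the already-open writer without
    // deserializing keys or values. Valid only when the inputs and the writer
    // share the same key/value classes and the same compression type and codec.
    public static void rawConcat(Configuration conf, FileStatus[] seqs, Writer writer)
            throws IOException {
        DataOutputBuffer rawKey = new DataOutputBuffer();
        for (FileStatus seq : seqs) {
            SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
            try {
                SequenceFile.ValueBytes rawValue = reader.createValueBytes();
                rawKey.reset();
                // nextRaw returns the record length, or -1 at end of file
                while (reader.nextRaw(rawKey, rawValue) != -1) {
                    writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(), rawValue);
                    rawKey.reset();
                }
            } finally {
                reader.close();
            }
        }
    }
}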

Any advice is appreciated,


Jerry

Re: Concatenate multiple sequence files into 1 big sequence file

Posted by John Meagher <jo...@gmail.com>.
Here's a great tool for exactly what you're looking for
https://github.com/edwardcapriolo/filecrush

On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <ch...@gmail.com> wrote:
> Hi Hadoop users,
>
> I have been trying to concatenate multiple sequence files into one.
> Since the total size of the sequence files is quite big (1TB), I won't use
> mapreduce because it requires 1TB in the reducer host to hold the temporary
> data.
>
> I ended up doing what have been suggested in this thread:
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAOcnVr2CuBdNkXutyydGjw2td19HHYiMwo4=JUa=SrXi51717w@mail.gmail.com%3E
>
> It works very well. I wonder if there is a faster way to append to a
> sequence file.
>
> Currently, the code looks like this (omit opening and closing sequence
> files, exception handling etc):
>
> // each seq is a sequence file
> // writer is a sequence file writer
>         for (val seq : seqs) {
>
>           reader =new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
>
>             while (reader.next(readerKey, readerValue)) {
>
>               writer.append(readerKey, readerValue);
>
>             }
>
>         }
>
> Is there a better way to do this? Note that I think it is wasteful to
> deserialize and serialize the key and value in the while loop because the
> program simply append to the sequence file. Also, I don't seem to be able to
> read and write fast enough (about 6MB/sec).
>
> Any advice is appreciated,
>
>
> Jerry

Re: Concatenate multiple sequence files into 1 big sequence file

Posted by Jerry Lam <ch...@gmail.com>.
Hi guys,

Thank you for all the advice here. I really appreciate it.

I read through the code in filecrush and found out that it is doing exactly
what I'm currently doing. The logic resides in CrushReducer.java, in the
following lines that do the concatenation:

while (reader.next(key, value)) {
    sink.write(key, value);
    reporter.incrCounter(ReducerCounter.RECORDS_CRUSHED, 1);
}

I wonder if there are other, faster ways to do this? Preferably a solution
that only streams a set of sequence files into the final sequence file.
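
For what it's worth, the output writer can be opened with the key/value
classes and compression probed from the first input, so that whatever copies
records into it (record by record or as raw bytes) never has to recompress
them. A small sketch along those lines, assuming the Hadoop 2 Writer.Option
API (the class name below is illustrative only):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.SequenceFile.Writer;

public class MatchingWriter {

    // Opens the output writer with the key/value classes and compression probed
    // from the first input file.
    public static Writer openLike(Configuration conf, Path firstInput, Path output)
            throws IOException {
        SequenceFile.Reader probe = new SequenceFile.Reader(conf, Reader.file(firstInput));
        try {
            CompressionType type = probe.getCompressionType();
            Writer.Option compression = (type == CompressionType.NONE)
                ? Writer.compression(type)
                : Writer.compression(type, probe.getCompressionCodec());
            return SequenceFile.createWriter(conf,
                Writer.file(output),
                Writer.keyClass(probe.getKeyClass()),
                Writer.valueClass(probe.getValueClass()),
                compression);
        } finally {
            probe.close();
        }
    }
}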

Best Regards,


Jerry


On Tue, Sep 10, 2013 at 11:20 AM, Adam Muise <am...@hortonworks.com> wrote:

> Jerry,
>
> It might not help with this particular file, but you might considered the
> approach used at Blackberry when dealing with your data. They block
> compressed into small avro files and then concatenated into large avro
> files without decompressing. Check out the boom file format here:
>
> https://github.com/blackberry/hadoop-logdriver
>
> for now, use filecrush:
> https://github.com/edwardcapriolo/filecrush
>
> Cheers,
>
>
>
>
> On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <ch...@gmail.com> wrote:
>
>> Hi Hadoop users,
>>
>> I have been trying to concatenate multiple sequence files into one.
>> Since the total size of the sequence files is quite big (1TB), I won't
>> use mapreduce because it requires 1TB in the reducer host to hold the
>> temporary data.
>>
>> I ended up doing what have been suggested in this thread:
>> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAOcnVr2CuBdNkXutyydGjw2td19HHYiMwo4=JUa=SrXi51717w@mail.gmail.com%3E
>>
>> It works very well. I wonder if there is a faster way to append to a
>> sequence file.
>>
>> Currently, the code looks like this (omit opening and closing sequence
>> files, exception handling etc):
>>
>> // each seq is a sequence file
>> // writer is a sequence file writer
>>         for (val seq : seqs) {
>>
>>           reader =new SequenceFile.Reader(conf,
>> Reader.file(seq.getPath()));
>>
>>             while (reader.next(readerKey, readerValue)) {
>>
>>               writer.append(readerKey, readerValue);
>>
>>             }
>>
>>         }
>>
>> Is there a better way to do this? Note that I think it is wasteful to
>> deserialize and serialize the key and value in the while loop because the
>> program simply append to the sequence file. Also, I don't seem to be able
>> to read and write fast enough (about 6MB/sec).
>>
>> Any advice is appreciated,
>>
>>
>> Jerry
>>

Re: Concatenate multiple sequence files into 1 big sequence file

Posted by Adam Muise <am...@hortonworks.com>.
Jerry,

It might not help with this particular file, but you might consider the
approach used at Blackberry when dealing with your data. They block-compress
data into small Avro files and then concatenate those into large Avro files
without decompressing. Check out the Boom file format here:

https://github.com/blackberry/hadoop-logdriver
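
As a plain illustration of that idea (not of the Boom format itself), Avro's
DataFileWriter.appendAllFrom can copy another container file's blocks verbatim
instead of re-encoding them. A sketch, assuming Avro 1.7+, local files, and
that every input shares the first file's schema and codec (the class name
below is illustrative only):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroConcat {

    // Concatenates Avro container files into one output file by copying their
    // data blocks as-is (recompress = false), i.e. without decompressing.
    // Assumes all inputs share the schema and codec of the first file.
    public static void concat(List<File> inputs, File output) throws IOException {
        DataFileStream<GenericRecord> first = new DataFileStream<GenericRecord>(
            new FileInputStream(inputs.get(0)), new GenericDatumReader<GenericRecord>());
        Schema schema = first.getSchema();
        String codec = first.getMetaString("avro.codec");   // null means uncompressed

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.fromString(codec == null ? "null" : codec));
        writer.create(schema, output);
        try {
            writer.appendAllFrom(first, false);
            first.close();
            for (File input : inputs.subList(1, inputs.size())) {
                DataFileStream<GenericRecord> in = new DataFileStream<GenericRecord>(
                    new FileInputStream(input), new GenericDatumReader<GenericRecord>());
                writer.appendAllFrom(in, false);
                in.close();
            }
        } finally {
            writer.close();
        }
    }
}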

For now, use filecrush:
https://github.com/edwardcapriolo/filecrush

Cheers,




On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Hadoop users,
>
> I have been trying to concatenate multiple sequence files into one.
> Since the total size of the sequence files is quite big (1TB), I won't use
> mapreduce because it requires 1TB in the reducer host to hold the temporary
> data.
>
> I ended up doing what have been suggested in this thread:
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAOcnVr2CuBdNkXutyydGjw2td19HHYiMwo4=JUa=SrXi51717w@mail.gmail.com%3E
>
> It works very well. I wonder if there is a faster way to append to a
> sequence file.
>
> Currently, the code looks like this (omit opening and closing sequence
> files, exception handling etc):
>
> // each seq is a sequence file
> // writer is a sequence file writer
>         for (val seq : seqs) {
>
>           reader =new SequenceFile.Reader(conf,
> Reader.file(seq.getPath()));
>
>             while (reader.next(readerKey, readerValue)) {
>
>               writer.append(readerKey, readerValue);
>
>             }
>
>         }
>
> Is there a better way to do this? Note that I think it is wasteful to
> deserialize and serialize the key and value in the while loop because the
> program simply append to the sequence file. Also, I don't seem to be able
> to read and write fast enough (about 6MB/sec).
>
> Any advice is appreciated,
>
>
> Jerry
>



-- 
Adam Muise
Solution Engineer
Hortonworks
amuise@hortonworks.com
416-417-4037

Hortonworks - Develops, Distributes and Supports Enterprise Apache Hadoop. <http://hortonworks.com/>

Hortonworks Virtual Sandbox <http://hortonworks.com/sandbox>

Hadoop: Disruptive Possibilities by Jeff Needham <http://hortonworks.com/resources/?did=72&cat=1>

Re: Concatenate multiple sequence files into 1 big sequence file

Posted by Jay Vyas <ja...@gmail.com>.
IIRC, sequence files can be concatenated as-is and read as one large file,
but maybe I'm forgetting something.
