Posted to common-user@hadoop.apache.org by John Howland <jo...@gmail.com> on 2008/09/15 04:15:58 UTC

SequenceFiles and binary data

If I want to read values out of input files as binary data, is this
what BytesWritable is for?

I've successfully run my first task that uses a SequenceFile for
output. Are there any examples of SequenceFile usage out there? I'd
like to see the full range of what SequenceFile can do. What are the
trade-offs between record compression and block compression? What are
the limits on the key and value sizes? How do you use the per-file
metadata?

My intended use is to read files on a local filesystem into a
SequenceFile, with the value of each record being the contents of each
file. I hacked MultiFileWordCount to get the basic concept working...
but I'd appreciate any advice from the experts. In particular, what's
the most efficient way to read data from an
InputStreamReader/BufferedReader into a BytesWritable object?

Thanks,

John

Re: SequenceFiles and binary data

Posted by Owen O'Malley <om...@apache.org>.
On Sep 14, 2008, at 7:15 PM, John Howland wrote:

> If I want to read values out of input files as binary data, is this
> what BytesWritable is for?

yes
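
For what it's worth, a minimal sketch (not from this thread, just
illustrating the idea): a BytesWritable simply wraps an arbitrary byte[]
payload, which is what lets it serve as a SequenceFile value class.

import org.apache.hadoop.io.BytesWritable;

public class BytesWritableSketch {
  public static void main(String[] args) {
    byte[] payload = {0x00, (byte) 0xFF, 0x10, 0x7F};  // any binary data
    BytesWritable value = new BytesWritable(payload);  // wrap it as a Writable
    // 'value' can now be appended to a SequenceFile.Writer that was created
    // with BytesWritable.class as its value class.
  }
}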

> I've successfully run my first task that uses a SequenceFile for
> output. Are there any examples of SequenceFile usage out there? I'd
> like to see the full range of what SequenceFile can do.

If you want serious usage, I'd suggest pulling up Nutch. Distcp also  
uses sequence files as its input.

You should also probably look at the TFile package that Hong is writing.

https://issues.apache.org/jira/browse/HADOOP-3315

Once it is ready, it will likely be exactly what you are looking for.

> What are the
> trade-offs between record compression and block compression?

You pretty much always want block compression. The only place where
record compression is OK is if your value is web pages or some other
huge chunk of text.
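
The choice is made when the writer is created. A rough sketch
(hypothetical output paths, default codec) against the SequenceFile API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CompressionChoiceSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // BLOCK compression buffers many records and compresses them together;
    // usually the better trade-off.
    SequenceFile.Writer blockWriter = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/block.seq"),            // hypothetical path
        Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);

    // RECORD compression compresses each value on its own; it only pays off
    // when individual values are large (e.g. whole web pages).
    SequenceFile.Writer recordWriter = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/record.seq"),           // hypothetical path
        Text.class, BytesWritable.class,
        SequenceFile.CompressionType.RECORD);

    blockWriter.close();
    recordWriter.close();
  }
}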

> What are
> the limits on the key and value sizes?

Large. I think I've seen keys and/or values of around 50-100 MB. They
certainly can't be bigger than 1 GB. I believe the TFile limit on keys
may be 64 KB.

> How do you use the per-file
> metadata?

It is just an application-specific string-to-string map in the header
of the file.
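
A rough sketch of writing the map and reading it back (hypothetical path
and key names; the createWriter overload that takes a Metadata argument
is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.Progressable;

public class MetadataSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/with-metadata.seq");      // hypothetical path

    // Fill in the string-to-string map that goes into the file header.
    SequenceFile.Metadata meta = new SequenceFile.Metadata();
    meta.set(new Text("source"), new Text("local-import"));

    Progressable noOp = new Progressable() { public void progress() { } };
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec(), noOp, meta);
    writer.close();

    // Read the map back out of the header.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    System.out.println("source = " + reader.getMetadata().get(new Text("source")));
    reader.close();
  }
}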

> My intended use is to read files on a local filesystem into a
> SequenceFile, with the value of each record being the contents of each
> file. I hacked MultiFileWordCount to get the basic concept working...

You should also look at the Hadoop archives.
http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html

> but I'd appreciate any advice from the experts. In particular, what's
> the most efficient way to read data from an
> InputStreamReader/BufferedReader into a BytesWritable object?

The easiest way is the way you've done it. You probably want to use  
lzo compression too.
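
One note on the Reader classes: InputStreamReader/BufferedReader decode
bytes into characters, so for binary payloads a plain InputStream is the
natural fit. A rough end-to-end sketch (hypothetical class name and paths;
assumes each file fits comfortably in memory):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalFilesToSequenceFile {
  // Usage: LocalFilesToSequenceFile <output.seq> <localFile1> [<localFile2> ...]
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]),                  // output SequenceFile
        Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (int i = 1; i < args.length; i++) {
        File f = new File(args[i]);
        byte[] buf = new byte[(int) f.length()];      // whole-file buffer
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(buf);                          // raw bytes, no decoding
        } finally {
          in.close();
        }
        // key = original path, value = file contents
        writer.append(new Text(f.getPath()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

To compress with a specific codec (e.g. LZO where it is available), the
longer createWriter overload that takes a CompressionCodec can be used
instead.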

-- Owen