Posted to hdfs-user@hadoop.apache.org by alex bohr <al...@gmail.com> on 2013/12/05 00:40:08 UTC

Check compression codec of an HDFS file

What's the best way to check the compression codec that an HDFS file was
written with?

We use both Gzip and Snappy compression so I want a way to determine how a
specific file is compressed.

The closest I found is getCodec
<http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec(org.apache.hadoop.fs.Path)>,
but that relies on the file name suffix ... which doesn't exist, since Reducers
typically don't add a suffix to the filenames they create.

Thanks

Re: Check compression codec of an HDFS file

Posted by alex bohr <al...@gmail.com>.
The SequenceFile.Reader will work perfectly! (I should have seen that.)

As always - thanks Harsh


On Thu, Dec 5, 2013 at 2:22 AM, Harsh J <ha...@cloudera.com> wrote:

> If you're looking for file header/contents based inspection, you could
> download the file and run the Linux utility 'file' on the file, and it
> should tell you the format.
>
> I don't know about Snappy (AFAIK, we don't have snappy
> frame/container format support in Hadoop yet, although upstream Snappy
> issue 34 seems resolved now), but Gzip files can be identified simply
> by their header bytes for the magic sequence.
>
> If it's sequence files you are looking to analyse, a simple way is to
> read the first few hundred bytes, which should have the codec string
> in them. Programmatically you can use
>
> https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
> for sequence files.
>
> On Thu, Dec 5, 2013 at 5:10 AM, alex bohr <al...@gmail.com> wrote:
> > What's the best way to check the compression codec that an HDFS file was
> > written with?
> >
> > We use both Gzip and Snappy compression so I want a way to determine how a
> > specific file is compressed.
> >
> > The closest I found is the getCodec but that relies on the file name suffix
> > ... which doesn't exist, since Reducers typically don't add a suffix to the
> > filenames they create.
> >
> > Thanks
>
>
>
> --
> Harsh J
>

Re: Check compression codec of an HDFS file

Posted by Harsh J <ha...@cloudera.com>.
If you're looking for file header/contents based inspection, you could
download the file and run the Linux utility 'file' on the file, and it
should tell you the format.

I don't know about Snappy (AFAIK, we don't have snappy
frame/container format support in Hadoop yet, although upstream Snappy
issue 34 seems resolved now), but Gzip files can be identified simply
by their header bytes for the magic sequence.
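As a minimal sketch of the magic-byte check described above (the path handling
is hypothetical — in practice you would first copy the file out of HDFS with
"hadoop fs -get", or read it through the FileSystem API):

```python
# Gzip files begin with the two magic bytes 0x1f 0x8b (per RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"

def is_gzip(path):
    """Return True if the local file starts with the gzip magic sequence."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```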

If it's sequence files you are looking to analyse, a simple way is to
read the first few hundred bytes, which should have the codec string
in them. Programmatically you can use
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
for sequence files.
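The header-scanning tip above can be sketched roughly as follows. This is an
illustrative approximation, not the Hadoop API: it only checks for the "SEQ"
magic and then searches the first bytes for a known codec class name, and the
codec list is an assumption you may need to extend.

```python
# Codec class names that commonly appear in a SequenceFile header.
KNOWN_CODECS = [
    b"org.apache.hadoop.io.compress.GzipCodec",
    b"org.apache.hadoop.io.compress.SnappyCodec",
    b"org.apache.hadoop.io.compress.DefaultCodec",
    b"org.apache.hadoop.io.compress.BZip2Codec",
]

def sniff_sequencefile_codec(path, window=512):
    """Return the codec class name found in the header, or None."""
    with open(path, "rb") as f:
        header = f.read(window)
    if not header.startswith(b"SEQ"):
        return None  # not a sequence file
    for codec in KNOWN_CODECS:
        if codec in header:
            return codec.decode("ascii")
    return None
```

For sequence files read through the Java API, SequenceFile.Reader's
getCompressionCodec() (linked above) is the authoritative answer.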

On Thu, Dec 5, 2013 at 5:10 AM, alex bohr <al...@gmail.com> wrote:
> What's the best way to check the compression codec that an HDFS file was
> written with?
>
> We use both Gzip and Snappy compression so I want a way to determine how a
> specific file is compressed.
>
> The closest I found is the getCodec but that relies on the file name suffix
> ... which doesn't exist, since Reducers typically don't add a suffix to the
> filenames they create.
>
> Thanks



-- 
Harsh J
