Posted to common-user@hadoop.apache.org by Robert Dyer <ps...@gmail.com> on 2013/11/23 22:14:03 UTC

Uncompressed size of Sequence files

Is there an easy way to get the uncompressed size of a sequence file that
is block compressed?  I am using the Snappy compressor.

I realize I can obviously just decompress them to temporary files to get
the size, but I would assume there is an easier way.  Perhaps an existing
tool that my search did not turn up?

If not, I will have to run an MR job to load each compressed block and read the
Snappy header to get the size.  I need to do this for a large number of
files, so I'd prefer a simple CLI tool (sort of like 'hadoop fs -du').
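
In case a short program is an acceptable stand-in for a CLI tool, here is a
minimal sketch (a hypothetical SeqFileSize class, not an existing tool) that
streams one file with SequenceFile.Reader and sums the decompressed, serialized
size of every key/value pair.  It assumes Writable keys and values and counts
record payload bytes only, not SequenceFile headers, sync markers, or block
framing:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Instantiate key/value objects of whatever types the file declares.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      DataOutputBuffer buffer = new DataOutputBuffer();
      long total = 0;
      while (reader.next(key, value)) {
        buffer.reset();
        key.write(buffer);    // re-serialize the (already decompressed) record
        value.write(buffer);  // to count its uncompressed byte length
        total += buffer.getLength();
      }
      System.out.println(total + "\t" + path);
    } finally {
      reader.close();
    }
  }
}

It takes a single path argument, so for many files it would have to be run in a
shell loop (or extended to walk a directory).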

- Robert

Re: Uncompressed size of Sequence files

Posted by Robert Dyer <ps...@gmail.com>.
I should probably mention that my attempt to use the 'hadoop' command for this
task fails (the file is fairly large, about 80GB compressed):

$ HADOOP_HEAPSIZE=3000 hadoop fs -text /path/to/file | wc -c
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
    at java.lang.StringCoding.encode(StringCoding.java:344)
    at java.lang.StringCoding.encode(StringCoding.java:387)
    at java.lang.String.getBytes(String.java:956)
    at org.apache.hadoop.fs.FsShell$TextRecordInputStream.read(FsShell.java:391)
    at java.io.InputStream.read(InputStream.java:179)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
    at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:122)
    at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:50)
    at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:427)
    at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1934)
    at org.apache.hadoop.fs.FsShell.text(FsShell.java:421)
    at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1597)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:1798)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:1916)
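
For what it's worth, the trace suggests 'fs -text' converts each record to a
String before writing it out (the String.getBytes frame), so a single very
large record can exhaust the heap even with a bigger HADOOP_HEAPSIZE.  A
variant of the sketch above that never deserializes records is to sum the raw
record lengths via the Reader's raw API (hypothetical RawSeqFileSize class;
that those raw lengths equal the decompressed sizes for a block-compressed
file is an assumption worth verifying):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;

public class RawSeqFileSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      DataOutputBuffer rawKey = new DataOutputBuffer();
      SequenceFile.ValueBytes rawValue = reader.createValueBytes();
      long total = 0;
      int keyLength;
      // nextRawKey returns -1 at end of file, otherwise the key length.
      while ((keyLength = reader.nextRawKey(rawKey)) != -1) {
        total += keyLength;
        total += reader.nextRawValue(rawValue);  // value length in bytes
        rawKey.reset();                          // reuse the key buffer
      }
      System.out.println(total + "\t" + path);
    } finally {
      reader.close();
    }
  }
}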

On Sat, Nov 23, 2013 at 3:14 PM, Robert Dyer <ps...@gmail.com> wrote:

> Is there an easy way to get the uncompressed size of a sequence file that
> is block compressed?  I am using the Snappy compressor.
>
> I realize I can obviously just decompress them to temporary files to get
> the size, but I would assume there is an easier way.  Perhaps an existing
> tool that my search did not turn up?
>
> If not, I will have to run an MR job to load each compressed block and read
> the Snappy header to get the size.  I need to do this for a large number of
> files so I'd prefer a simple CLI tool (sort of like 'hadoop fs -du').
>
> - Robert
>
>
