Posted to user@ant.apache.org by Frank Astier <fa...@yahoo-inc.com> on 2011/12/19 19:25:40 UTC

Question about using Tar with Hadoop files

Hi -

I’m trying to use the Apache Tar package (1.8.2) for a Java program that tars large files in Hadoop. It currently fails on a file that’s 17 GB long (exactly 17456999265 bytes at the time of the error). The same code tars smaller HDFS files all day long without any problem; it fails only on that 17 GB file. After looking at the source code for 3 days now, I still have a hard time making sense of the error message. The exception I’m seeing is:

12/19/11 5:54 PM [BDM.main] EXCEPTION request to write '65535' bytes exceeds size in header of '277130081' bytes
12/19/11 5:54 PM [BDM.main] EXCEPTION org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:238)
12/19/11 5:54 PM [BDM.main] EXCEPTION com.yahoo.ads.ngdstone.tpbdm.HDFSTar.archive(HDFSTar.java:149)

My code is:

           TarEntry entry = new TarEntry(p.getName());
           Path absolutePath = p.isAbsolute() ? p : new Path(baseDir, p); // HDFS Path
           FileStatus fileStatus = fs.getFileStatus(absolutePath); // HDFS FileStatus
           entry.setNames(fileStatus.getOwner(), fileStatus.getGroup());
           entry.setUserName(user);
           entry.setGroupName(group);
           entry.setName(name);
           entry.setSize(fileStatus.getLen());
           entry.setMode(Integer.parseInt("0100" + permissions, 8));
           out.putNextEntry(entry); // out = TarOutputStream

           if (fileStatus.getLen() > 0) {

               InputStream in = fs.open(absolutePath); // large file in HDFS

               try {
                   ++nEntries;

                   // stream the HDFS file into the current tar entry
                   int bytesRead = in.read(buf);

                   while (bytesRead >= 0) {
                       out.write(buf, 0, bytesRead);
                       bytesRead = in.read(buf);
                   }

               } finally {
                   in.close();
               }
           }

           out.closeEntry();

Any idea? Am I missing anything in the way I’m setting up the TarOutputStream or TarEntry? Or does tar have implicit limits that are never going to work for multi-gigabyte files?

Thanks!

Frank

Re: Question about using Tar with Hadoop files

Posted by Andy Stevens <in...@googlemail.com>.
You didn't mention what version of Ant was involved...

Andy.

Re: Question about using Tar with Hadoop files

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-12-22, Stefan Bodewig wrote:

> On 2011-12-19, Frank Astier wrote:

> Later versions of tar support workarounds (namely ustar later used by
> GNU tar and BSD tar as well) and even later the newer POSIX standard
> added PAX extension headers to address this (and other things like file
> names longer than 100 characters).

> The trunk of Commons Compress supports both ustar and PAX by now but the
> latest release of it doesn't.  I expect the next release of Commons
> Compress to happen pretty soon, though.

s/ustar/star/

It is Jörg Schilling's tar implementation that pioneered the binary
representation of entry sizes that allows for a much bigger maximum
size.
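
For the curious, that binary representation is the so-called "base-256" encoding: the first byte of the 12-byte size field gets its high bit set as a flag, and the remaining bytes carry the value in big-endian binary instead of octal text. A minimal sketch of the idea, for illustration only (this is not the exact code of Ant, star or Commons Compress):

    public final class Base256Size {
        /** Writes a size that doesn't fit into eleven octal digits:
         *  flag byte 0x80, then the value big-endian over the rest
         *  of the 12-byte header field. */
        public static void write(long size, byte[] header, int offset) {
            header[offset] = (byte) 0x80;        // high bit marks binary encoding
            for (int i = 11; i >= 1; i--) {      // least significant byte goes last
                header[offset + i] = (byte) (size & 0xff);
                size >>>= 8;
            }
        }
    }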

Stefan

Re: Question about using Tar with Hadoop files

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-12-19, Frank Astier wrote:

> I’m trying to use the Apache Tar package (1.8.2) for a Java program
> that tars large files in Hadoop. I am currently failing on a file
> that’s 17 GB long.

First of all, do yourself a favor and use Commons Compress rather than
Ant's tar package.

Traditional tar stores an entry's size as eleven octal digits, so it can't
represent anything bigger than 8589934591 bytes, i.e. a little under
8 GiB (eleven octal sevens).  Ant's tar package doesn't go beyond that.
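
That also lines up with the numbers in the log: 17456999265 needs twelve
octal digits, and its low eleven digits are 277130081 in decimal, exactly
the "size in header" from the exception, so the header field apparently
just lost the top digit.  A quick arithmetic check (plain Java, nothing
Ant-specific):

    public class TarSizeCheck {
        public static void main(String[] args) {
            long maxFieldValue = 077777777777L;   // eleven octal sevens = 8589934591 (~8 GiB)
            long fileSize = 17456999265L;         // the failing HDFS file
            System.out.println(maxFieldValue);                  // 8589934591
            System.out.println(fileSize % (maxFieldValue + 1)); // 277130081, as in the error message
        }
    }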

Later versions of tar support workarounds (namely ustar later used by
GNU tar and BSD tar as well) and even later the newer POSIX standard
added PAX extension headers to address this (and other things like file
names longer than 100 characters).
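
Roughly speaking, a PAX extended header is just an extra archive entry
written in front of the real one, made up of "length key=value" records.
For this particular file the size record would be the 20-byte line below,
where the leading number counts the whole record, trailing newline
included:

    20 size=17456999265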

The trunk of Commons Compress supports both ustar and PAX by now but the
latest release of it doesn't.  I expect the next release of Commons
Compress to happen pretty soon, though.
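
For when that release is out, here is a rough sketch of what the Commons
Compress version of the original snippet could look like.  The class and
constant names (TarArchiveOutputStream, BIGNUMBER_POSIX, LONGFILE_POSIX)
are taken from trunk / later releases, so treat the exact names as an
assumption rather than a promise; the HDFS side is elided just as in the
original code:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

    public final class HdfsTarSketch {

        /** Wraps a raw output stream in a tar stream with PAX support enabled. */
        public static TarArchiveOutputStream openTar(OutputStream raw) {
            TarArchiveOutputStream out = new TarArchiveOutputStream(raw);
            // PAX extended headers lift both the ~8 GiB size limit and the
            // 100-character entry-name limit.
            out.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_POSIX);
            out.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX);
            return out;
        }

        /** Writes one already-opened stream of known size as a single tar entry. */
        public static void addLargeEntry(TarArchiveOutputStream out, String name,
                                         long size, InputStream in) throws IOException {
            TarArchiveEntry entry = new TarArchiveEntry(name);
            entry.setSize(size);                       // > 8 GiB is fine in POSIX mode
            out.putArchiveEntry(entry);
            byte[] buf = new byte[64 * 1024];          // same copy loop as in the question
            int bytesRead;
            while ((bytesRead = in.read(buf)) >= 0) {
                out.write(buf, 0, bytesRead);
            }
            out.closeArchiveEntry();
        }
    }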

Stefan
