Posted to user@hadoop.apache.org by Dave Christianson <da...@thetradedesk.com> on 2019/01/30 09:03:49 UTC

EOFException when using S3AFileSystem with random input policy

I'm seeing a problem using the S3AFileSystem with the ParquetInputFormat that causes a non-transient EOFException for certain files. I have traced what looks like the source of the problem to the use of the "random" input policy, which is needed to support the seek behavior Parquet requires.

I've written a sample program that illustrates the problem given a path in S3. It doesn't use Parquet and reproduces on any file > 1024K:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;

final Configuration conf = new Configuration();
conf.set("fs.s3a.readahead.range", "1K");
conf.set("fs.s3a.experimental.input.fadvise", "random");

// `path` is a Path to any sufficiently large S3 object
final FileSystem fs = FileSystem.get(path.toUri(), conf);

// forward seek reading across the readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1023, temp); // <-- works
}

// forward seek reading from the end of the readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1024, temp); // <-- throws EOFException
}

I'm wondering two things:
- Is this a known problem that I simply haven't found a ticket or question for? If not, what are the steps to discuss/contribute a fix? (I have a potential solution in S3AInputStream.seekInStream.)
- Is the "random" input policy not expected to work fully? As it stands, seek - especially backwards seek - against S3 seems different, although for certain use cases it could prevent having to download the entire file to local storage.
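In the meantime, a possible workaround - assuming sequential access performance is acceptable for your workload - is to leave the policy at its default rather than "random", since the failure above only shows up with the random policy:

```java
import org.apache.hadoop.conf.Configuration;

// Workaround sketch: use the default "normal" policy instead of "random".
// This avoids the boundary behavior above at the cost of less efficient
// random reads (e.g. for Parquet footers).
Configuration conf = new Configuration();
conf.set("fs.s3a.experimental.input.fadvise", "normal");
```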
Regards,
Dave