Posted to common-dev@hadoop.apache.org by Dave Christianson <da...@thetradedesk.com> on 2019/02/10 22:40:19 UTC

S3AFileSystem premature EOF with random InputPolicy

I'm seeing a problem with the S3AFileSystem where I get a premature EOF, in particular when reading Parquet files using projection. The projection appears to cause a backward seek on the S3A input stream, which triggers a bug: the input stream switches to the "random" input policy and eventually (depending on the file and the amount of data read) reads past the end of its readahead buffer without reopening the stream, resulting in an EOF.
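
For context, the reads in question come from something like the following parquet-avro sketch (the path, record name, and projected column are placeholders for illustration). The projected read skips around the file rather than scanning it sequentially, which as far as I can tell is what produces the seek pattern described above:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

// Placeholder projection schema: a single column out of a wider file schema.
final Schema projection = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"Row\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

final Configuration conf = new Configuration();
AvroReadSupport.setRequestedProjection(conf, projection);

try (ParquetReader<GenericRecord> reader =
         AvroParquetReader.<GenericRecord>builder(new Path("s3a://bucket/key.parquet"))
             .withConf(conf)
             .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        // process the projected record
    }
}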

I haven't seen this issue reported anywhere. I'm wondering whether this is worth a fix (it looks like the stream reopening behavior just needs to be more aggressive), or whether it's better to retrieve the whole file sequentially before attempting to parse it (I was surprised it works at all).
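
If it comes down to the latter, the workaround I have in mind is roughly the following (an untested sketch; the bucket/key and local staging path are placeholders): pull the object down in one sequential copy and point the reader at the local file instead of the s3a:// path.

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

final Configuration conf = new Configuration();
final Path remote = new Path("s3a://bucket/key.parquet");                  // placeholder
final Path local = new Path(new File("/tmp/staging/key.parquet").toURI()); // placeholder

final FileSystem s3 = FileSystem.get(remote.toUri(), conf);
final FileSystem localFs = FileSystem.getLocal(conf);

// One sequential GET of the whole object; no seeks against S3 at all.
FileUtil.copy(s3, remote, localFs, local, false /* deleteSource */, conf);

// ... then open `local` with the Parquet reader instead of `remote`.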


I've written a sample program that illustrates the problem, given a path in S3 (no Parquet involved):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// `path` is the s3a:// Path supplied to the program.
final Configuration conf = new Configuration();
conf.set("fs.s3a.readahead.range", "1K");                // small readahead so the boundary is easy to hit
conf.set("fs.s3a.experimental.input.fadvise", "random");

final FileSystem fs = FileSystem.get(path.toUri(), conf);

// forward seek reading across readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1023, temp); // <-- works
}

// forward seek reading from end of readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1024, temp); // <-- throws EOFException
}
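
For completeness, the configuration-level mitigations I can think of (a guess on my part, not something I've verified against the Parquet workload) are to stay off the "random" policy or to raise the readahead range past the largest forward skip the reader will make:

import org.apache.hadoop.conf.Configuration;

final Configuration conf = new Configuration();

// Option 1: use the sequential policy, so the stream keeps reading towards the
// end of the object instead of stopping at a bounded readahead range.
conf.set("fs.s3a.experimental.input.fadvise", "sequential");

// Option 2: keep "random" but make the readahead range comfortably larger
// than any forward skip (64K here purely as an example value).
// conf.set("fs.s3a.experimental.input.fadvise", "random");
// conf.set("fs.s3a.readahead.range", "64K");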


Regards, Dave Christianson