Posted to issues@drill.apache.org by "Uwe L. Korn (JIRA)" <ji...@apache.org> on 2016/10/28 08:05:58 UTC

[jira] [Created] (DRILL-4977) Reading parquet metadata cache from S3 with fadvise=random and Hadoop 3 generates a large number of requests

Uwe L. Korn created DRILL-4977:
----------------------------------

             Summary: Reading parquet metadata cache from S3 with fadvise=random and Hadoop 3 generates a large number of requests
                 Key: DRILL-4977
                 URL: https://issues.apache.org/jira/browse/DRILL-4977
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.8.0
         Environment: Hadoop 3.0
            Reporter: Uwe L. Korn


When using the new {{fs.s3a.experimental.input.fadvise=random}} mode to access Parquet files stored in S3, we see a significant improvement in query performance but a slowdown in query planning. The slowdown is caused by the way the metadata cache file is read: each chunk of 8000 bytes generates a new GET request to S3. Calling {{FSDataInputStream.setReadahead(metadata-filesize)}} to indicate that the whole file will be read avoids this behaviour.
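
To give a rough sense of the request count described above, here is a minimal sketch of a simplified model. It is not the S3A implementation. It assumes that each GET covers the larger of the 8000-byte read size and the configured readahead, and the class name, method name and the 1,000,000-byte file size are all hypothetical.

```java
// Simplified model of the GET count; an illustration only, not S3A code.
final class GetRequestModel {
    // Read size reported in the issue: each 8000-byte chunk triggers one GET.
    static final long CHUNK_BYTES = 8000L;

    // Assumption: one GET fetches max(chunk size, readahead) bytes.
    static long requestsFor(long fileSize, long readahead) {
        if (fileSize <= 0) {
            return 0;
        }
        long bytesPerGet = Math.max(CHUNK_BYTES, readahead);
        // Ceiling division: a partial last range still costs one request.
        return (fileSize + bytesPerGet - 1) / bytesPerGet;
    }

    public static void main(String[] args) {
        long fileSize = 1_000_000L; // hypothetical metadata cache size in bytes
        // No readahead hint: one GET per 8000-byte chunk.
        System.out.println(requestsFor(fileSize, 0L));
        // Readahead set to the file size: the whole file in one GET.
        System.out.println(requestsFor(fileSize, fileSize));
    }
}
```

Under these assumptions the 1,000,000-byte file needs 125 requests without the hint and a single request once the readahead equals the file size.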



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)