Posted to issues@impala.apache.org by "Sahil Takiar (Jira)" <ji...@apache.org> on 2020/01/22 18:00:00 UTC
[jira] [Created] (IMPALA-9316) Consider coalescing S3 scans

Sahil Takiar created IMPALA-9316:
------------------------------------

             Summary: Consider coalescing S3 scans
                 Key: IMPALA-9316
                 URL: https://issues.apache.org/jira/browse/IMPALA-9316
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Sahil Takiar


We should consider coalescing S3 reads. IIUC, the current {{DiskIoMgr}} code for S3A does not do anything special when scheduling S3 scan ranges; it simply assigns scan ranges to IO threads round-robin.

I think there might be a smarter algorithm we could employ when scheduling S3 reads. A few things to consider:
* With the migration to {{hdfsPreadFully}}, each S3 scan range should correspond to a single HTTP GET request (assuming the 8 MB limit is not hit, see below)
* {{read_size}} limits the size of a single read to 8 MB. I believe that if a scan range exceeds this limit, it is broken up into multiple HTTP GET requests that are issued sequentially on the same IO thread
* S3A has a readahead option that defaults to 64 KB; however, it only applies in certain situations
** If {{fs.s3a.experimental.input.fadvise=random}} (the recommended value when reading Parquet / ORC data), the readahead applies if (1) it won't cause the read to go past the end of the file, and (2) the requested read length is under 64 KB. In that case S3A reads up to {{Math.max(requested-read-length, 64 KB)}}, so the readahead most likely only applies to small reads
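
To make the sizes in the bullets above concrete, here is a rough Python model of them. It is only a sketch of the behaviour as described in this issue, not the actual {{DiskIoMgr}} or S3A code; the function names are made up for illustration, and the second function assumes the requested read itself fits inside the file.

```python
KB = 1024
MB = 1024 * KB


def num_get_requests(scan_range_len, read_size=8 * MB):
    """Sequential reads (one HTTP GET each) needed for one scan range.

    Assumption: a scan range larger than read_size is split into
    read_size chunks, as described in the bullets above.
    """
    # Integer ceiling division; a range always needs at least one request.
    return max(1, (scan_range_len + read_size - 1) // read_size)


def effective_read_len(pos, requested_len, file_len, readahead=64 * KB):
    """Bytes fetched for one read with fadvise=random, per the description above.

    Assumption: pos + requested_len <= file_len.
    """
    # Small reads are padded up to the readahead size ...
    wanted = max(requested_len, readahead)
    # ... but never past the end of the file.
    return min(wanted, file_len - pos)
```

For example, under this model a 20 MB scan range needs 3 GET requests, and a 4 KB read at the start of a 1 MB file fetches 64 KB.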

Coalescing reads would allow Impala to combine multiple small HTTP GET requests into fewer, larger ones. Some data between the original ranges may have to be read and thrown away, but the cost of reading that extra data may well be lower than the cost of issuing multiple HTTP requests. Since each HTTP request requires a round-trip to S3, issuing a lot of GET requests can be costly, especially if each one only reads a small amount of data.
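
A back-of-the-envelope way to think about this trade-off: reading through a gap is worthwhile when transferring the gap takes less time than one extra round-trip. The sketch below encodes just that comparison. It is a deliberately simplified model (it ignores parallelism across IO threads, per-request charges, and variance in S3 latency), and the function names and any numbers plugged into it are assumptions, not measurements.

```python
def break_even_gap_bytes(round_trip_ms, throughput_bytes_per_s):
    """Largest gap (in bytes) that is cheaper to read through than to skip.

    Reading a gap of g bytes costs g / throughput seconds; skipping it
    costs one extra round-trip. The two are equal at g = rtt * throughput.
    """
    return round_trip_ms * throughput_bytes_per_s // 1000


def should_coalesce(gap_bytes, round_trip_ms, throughput_bytes_per_s):
    """True if reading the gap is cheaper than issuing a separate GET."""
    # Compare in integer units (byte-milliseconds) to avoid float rounding.
    return gap_bytes * 1000 < round_trip_ms * throughput_bytes_per_s
```

With purely hypothetical values of a 20 ms round-trip and 50,000,000 bytes/s of per-connection throughput, the break-even gap under this model is 1,000,000 bytes, so a 64 KB gap would be read through while a 4 MB gap would not.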

Some implementation factors to consider:
* There should probably be a limit on the maximum size of a read request (is 8 MB the right value for S3?)
* Since S3A uses a default of 64 KB for its readahead, we can probably use a similar value for the amount of data we are willing to read through when coalescing
* Should the number of disk IO threads be considered when coalescing reads? E.g. by default there are 16 IO threads; if there are 16 small scan ranges, does it make more sense to coalesce them into a single large scan range, or would we get better throughput by issuing all 16 in parallel?
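
One possible shape for the coalescing step, taking the first two factors above as parameters, is sketched below in Python. This is an illustration under stated assumptions rather than a proposed patch: {{coalesce_ranges}}, {{max_gap}} and {{max_read}} are hypothetical names, all ranges are assumed to belong to the same file, and the question about the number of IO threads is not addressed.

```python
KB = 1024
MB = 1024 * KB


def coalesce_ranges(ranges, max_gap=64 * KB, max_read=8 * MB):
    """Merge (offset, length) ranges of a single file into fewer reads.

    Two neighbouring ranges are merged when the gap between them is at
    most max_gap and the merged read does not exceed max_read.
    """
    merged = []
    for offset, length in sorted(ranges):
        end = offset + length
        if merged:
            cur_off, cur_len = merged[-1]
            cur_end = cur_off + cur_len
            gap = offset - cur_end          # negative if ranges overlap
            new_end = max(cur_end, end)
            if gap <= max_gap and new_end - cur_off <= max_read:
                # Extend the previous read to also cover this range.
                merged[-1] = (cur_off, new_end - cur_off)
                continue
        merged.append((offset, length))
    return merged
```

For instance, the ranges (0, 4 KB), (8 KB, 4 KB) and (1 MB, 4 KB) would become two reads: (0, 12 KB) and (1 MB, 4 KB). Two adjacent 6 MB ranges would stay separate because merging them would exceed the 8 MB cap.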



--
This message was sent by Atlassian Jira
(v8.3.4#803005)