Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2017/11/10 11:41:00 UTC

[jira] [Commented] (HADOOP-15027) Improvements for Hadoop read from AliyunOSS

    [ https://issues.apache.org/jira/browse/HADOOP-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247373#comment-16247373 ] 

Steve Loughran commented on HADOOP-15027:
-----------------------------------------

* There's an uber-JIRA to track all Aliyun OSS issues; moved this under it: HADOOP-13377
* and added you to the list of developers; assigned the work to you 
* Make sure that [~unclegen] reviews, tests & is happy with this: he's the keeper of the module right now
* All patches for the object stores require the submitter to state which endpoint they ran all the tests against. This ensures that you are confident you haven't broken anything before anyone else has a go.

Looking at the patch, I see what you are trying to do: speed up reads by pre-emptively fetching data ahead of the client code, so that the next block is already buffered while the client thread is working on slow stuff.

I see the benefits of this on a sequential read from the start to the end of a file, but the common high-performance column formats, ORC & Parquet, don't follow that IO pattern. Instead you

1. open
2. seek(EOF - some offset)
3. read(footer)
4. seek(first column + length)
5. read(some summary data)

then either seek(first column), read(column, length), process, or seek(next column of that type)

or something similar: aggressive random IO, where the data already in flight needs to be discarded. If the https connection has to be aborted, that's very expensive, so S3A and wasb now have random IO modes where a readFully(position, length) call issues a GET for max(min-read-size, length) bytes starting at position, and forward seeks discard data within the stream wherever possible.
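To make that pattern concrete, here is a minimal sketch against the generic Hadoop FileSystem API; the buffer sizes and offsets are invented for illustration, real readers get them from the footer:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ColumnarReadPattern {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
    long len = fs.getFileStatus(path).getLen();

    byte[] footer = new byte[16 * 1024];          // size invented; assumes file is bigger than this
    byte[] column = new byte[64 * 1024];          // ditto
    try (FSDataInputStream in = fs.open(path)) {  // 1: open
      in.readFully(len - footer.length, footer);  // 2+3: seek near EOF, read footer
      in.readFully(0, column);                    // 4+5: backward seek, read column data
      // ...then hop between whichever column stripes the query needs: pure random IO
    }
  }
}
{code}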

I would focus on the performance of those data formats rather than sequential IO, which primarily gets used for .gz, .csv and Avro ingest before the Parquet/ORC data is generated & used for all the other queries (and for distcp too, of course).

Take a look at HADOOP-13203 for the S3A work there, where we added a switch between sequential and random IO, plus tests of random IO performance.
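For reference, something like this is all it takes to turn it on; the option name is the one added in HADOOP-13203, so treat it as version-dependent, and the bucket/path here are placeholders:

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomIoSwitch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "sequential" streams ahead; "random" keeps each GET short so a seek
    // doesn't force an expensive abort of a long-lived HTTPS connection.
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    try (FileSystem fs = FileSystem.newInstance(URI.create("s3a://mybucket/"), conf)) {
      // streams opened from this instance now use the random IO policy
      fs.open(new Path("s3a://mybucket/data/part-00000.parquet")).close();
    }
  }
}
{code}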

HADOOP-14535 did something better for Wasb: it starts off in sequential mode, but as soon as you do a backwards seek (operation 4 in the list above), it says "this is columnar data" and switches to random IO. There's a patch pending for S3A to do that too, as it makes it easy to mix sequential data sources with random ones.
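The core of that adaptive trick is tiny. A sketch of the idea only, not the actual Wasb code, with all names invented:

{code:java}
import java.io.IOException;

// Sketch: start in sequential mode, flip to random IO on the first
// backward seek, which is the signature of a columnar reader.
public abstract class AdaptiveSeekSketch {
  private boolean randomIO = false;   // hypothetical flag read by the GET-issuing logic
  private long pos = 0;

  public synchronized void seek(long newPos) throws IOException {
    if (newPos < pos) {
      randomIO = true;   // columnar access detected: issue short range GETs from now on
    }
    pos = newPos;
  }
}
{code}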

I would start with that, then worry about how best to prefetch data, which probably only matters in sequential reads.

Having a quick look at your code:

* The thread pool should belong to the FileSystem itself, not to each input stream. A single process can have many open input streams (especially under Spark and Hive); creating a thread pool per stream is slow and expensive. See the first sketch after this list.

* The retry logic needs to be reworked: right now it retries every exception, without any delay. Some failures (UnknownHostException, NoRouteToHostException, auth failures, any RuntimeException) aren't going to be recoverable, while those we can recover from need some sleep & backoff policy. The good news: {{org.apache.hadoop.io.retry.RetryPolicy}} handles all of this, with {{RetryPolicies.retryByException}} letting you declare the map of which exceptions to fail fast on and which to retry. Have a look at where other code uses it; see the second sketch below.
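First sketch, on pool ownership. The configuration key and sizes are invented; the point is only where the pool lives:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

// Sketch: the pool lives in the FileSystem and every stream borrows it.
public class PoolOwnershipSketch {
  private ExecutorService prefetchPool;

  public void initialize(Configuration conf) {
    int threads = conf.getInt("fs.oss.prefetch.threads", 16); // hypothetical key
    prefetchPool = Executors.newFixedThreadPool(threads);
  }

  /** Handed to each input stream at open() instead of a pool per stream. */
  public ExecutorService getPrefetchPool() {
    return prefetchPool;
  }

  public void close() {
    prefetchPool.shutdownNow(); // one place to stop all outstanding prefetches
  }
}
{code}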

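Second sketch, roughly what I mean by the retry policy (untested; tune the exception map and limits to taste):

{code:java}
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Fail fast on unrecoverable failures, exponential backoff on everything else.
public class OssRetryPolicySketch {
  public static RetryPolicy build() {
    Map<Class<? extends Exception>, RetryPolicy> failFast = new HashMap<>();
    failFast.put(UnknownHostException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
    failFast.put(NoRouteToHostException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
    failFast.put(RuntimeException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
    // default: up to 4 retries with exponentially growing sleeps, starting ~500ms
    return RetryPolicies.retryByException(
        RetryPolicies.exponentialBackoffRetry(4, 500, TimeUnit.MILLISECONDS),
        failFast);
  }
}
{code}

The read loop then calls {{shouldRetry()}} on each caught exception and either sleeps for the returned delay before retrying, or rethrows.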

I like the look of the overall idea, and know that read performance matters. But focus on seek() first. Talk to [~unclegen] and see what he suggests.

> Improvements for Hadoop read from AliyunOSS
> -------------------------------------------
>
>                 Key: HADOOP-15027
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15027
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/oss
>    Affects Versions: 3.0.0
>            Reporter: wujinhu
>            Assignee: wujinhu
>         Attachments: HADOOP-15027.001.patch
>
>
> Currently, read performance is poor when Hadoop reads from AliyunOSS: it takes about 1 minute to read 1 GB from OSS.
> Class AliyunOSSInputStream uses a single thread to read data from AliyunOSS, so we can refactor it to use multi-threaded pre-reads to improve this.


