Posted to user@drill.apache.org by Uwe Korn <uw...@xhochy.com> on 2016/10/06 19:35:27 UTC

First impressions with Drill+Parquet+S3

Hello,

We ran some tests with Drill 1.8 over the last few days and wanted to share our experience, as we made some interesting findings that astonished us. We ran on our internal company cluster and thus used the S3 API to access our internal storage cluster, not AWS (the behavior should still be the same).

Setup experience: Awesome. It took me less than 30 minutes to get a multi-node Drill setup running on Mesos+Aurora with S3 configured. Really nice.

Performance with the 1.8 release: Awful. Compared to the queries I ran locally with Drill on a small dataset, runtimes were orders of magnitude higher than on my laptop. After some debugging, I saw that hadoop-s3a always requests via S3 the byte range from the position where we want to start reading to the end of the file. This gave the following HTTP pattern:
 * GET bytes=8k-100M
 * GET bytes=2M-100M
 * GET bytes=4M-100M
Although the HTTP requests were normally aborted before all the data was sent by the server, roughly 10-15x the size of the input files still went over the network. Using Parquet, I had actually hoped to achieve the opposite, i.e. that less than the whole file would be transferred (my test queries used only 2 of 15 columns).

In Hadoop 3.0.0-alpha1 [2], there are a lot of improvements w.r.t. S3 access. Via fs.s3a.experimental.input.fadvise=random you can now select a new reading mode that requests via S3 only the asked-for range plus a small readahead buffer. While this keeps the number of requests constant, we now transfer only the data we actually need. With that, performance is not amazing but in an acceptable range.
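
For reference, here is a minimal sketch of how the setting can be applied programmatically; it can equally go into core-site.xml. The bucket and class names below are placeholders for illustration, not our actual setup:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class S3ARandomReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Ask s3a to fetch only the requested range (plus a small
            // readahead buffer) instead of issuing an open-ended GET to EOF.
            conf.set("fs.s3a.experimental.input.fadvise", "random");
            // Placeholder bucket; we point this at our internal S3 endpoint.
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
            System.out.println("Connected to " + fs.getUri());
        }
    }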

Still, query planning always took at least 35s. This turned out to be a side effect of fs.s3a.experimental.input.fadvise=random: while the Parquet reader specifies quite precisely which ranges it wants to read, the parser for the metadata cache requests only 8000 bytes at a time, which under random fadvise led to several thousand HTTP requests for a single sequential read. As a workaround, we added a call to FSDataInputStream.setReadahead(metadata-filesize) to reduce the access to a single request. This brought reading the metadata down to 3s.
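
The workaround looks roughly like the following sketch (the helper name is made up; this is not the exact patch we applied): query the file length first and set it as readahead, so the sequential parse of the metadata cache is served by a single ranged GET even under fadvise=random:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class MetadataCacheOpener {
        // Open a file with readahead covering its whole length, so the
        // small sequential reads of the metadata parser are served by
        // one HTTP request instead of thousands.
        static FSDataInputStream openWithFullReadahead(FileSystem fs, Path path)
                throws IOException {
            FileStatus status = fs.getFileStatus(path);
            FSDataInputStream in = fs.open(path);
            in.setReadahead(status.getLen()); // one GET for the full file
            return in;
        }
    }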

Another problem with the metadata cache was that it was actually rebuilt on every query. Drill relies here on the modification timestamp of the directory, which is not supported by S3 [1], so the current time was always returned as the modification date of the directory.
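
The effect is easy to reproduce with a few lines against the FileSystem API (again only a sketch; the paths are placeholders): on s3a the modification time of a directory is synthesized rather than tracked, so in our runs it always came back as the current time and any cache keyed on it looked stale:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DirectoryMtimeCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"),
                    new Configuration());
            FileStatus dir = fs.getFileStatus(new Path("s3a://my-bucket/table/"));
            // On a real file system this changes only when the directory
            // changes; on s3a we observed the current time on every call,
            // which makes Drill rebuild its metadata cache on every query.
            System.out.println(dir.getModificationTime());
        }
    }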

These were just our initial, basic findings with Drill. At the moment it looks promising enough that we will probably do some more usability and performance testing. If we already did something wrong in the initial S3 tests, some pointers on what it might have been would be appreciated. The bad S3 I/O performance really surprised us.

Kind regards,
Uwe

[1] https://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-aws/tools/hadoop-aws/index.html#Warning_2:_Because_Object_stores_dont_track_modification_times_of_directories
[2] From here on, the tests were made with Drill-master+hadoop-3.0.0-alpha1+aws-sdk-1.11.35, i.e. custom Drill and Hadoop builds to get the newer dependency versions.

Re: First impressions with Drill+Parquet+S3

Posted by Uwe Korn <uw...@xhochy.com>.
Hello Parth,

I filed JIRAs for S3 performance:
  * https://issues.apache.org/jira/browse/DRILL-4977
  * https://issues.apache.org/jira/browse/DRILL-4976
  * https://issues.apache.org/jira/browse/DRILL-4978

and one for execution of drillbits inside Apache Mesos+Aurora:
  * https://issues.apache.org/jira/browse/DRILL-4979

As a start, I would look into the latter first, as it is a requirement 
for actually using Drill safely on such a cluster. I have commented with 
a basic implementation idea; I'd love to get some feedback on it, as it 
would be my first Drill contribution.

Uwe


On 28.10.16 00:26, Parth Chandra wrote:
> Hi Uwe,
>
>    Can you log JIRAs for the performance issues that you encounter while
> working on S3? Not many folks are working on optimizing that path, so any
> patches that you might be able to contribute would be appreciated.
>
> Parth


Re: First impressions with Drill+Parquet+S3

Posted by Parth Chandra <pc...@maprtech.com>.
Hi Uwe,

  Can you log JIRAs for the performance issues that you encounter while
working on S3? Not many folks are working on optimizing that path, so any
patches that you might be able to contribute would be appreciated.

Parth

On Thu, Oct 6, 2016 at 1:56 PM, Uwe Korn <uw...@xhochy.com> wrote:

> Yes. Performance was much better with a real file system (i.e. I ran
> locally on my laptop using the SSD installed there). I don't expect the
> exact same performance with S3, as I don't have things like data
> locality there. My use case is mainly querying "cold" datasets, i.e.
> ones that are not touched often, and when they are, only a few queries
> are run on them.

Re: First impressions with Drill+Parquet+S3

Posted by Uwe Korn <uw...@xhochy.com>.
Yes. Performance was much better with a real file system (i.e. I ran locally on my laptop using the SSD installed there). I don't expect the exact same performance with S3, as I don't have things like data locality there. My use case is mainly querying "cold" datasets, i.e. ones that are not touched often, and when they are, only a few queries are run on them.


> On 06.10.2016 at 22:47, Ted Dunning <te...@gmail.com> wrote:
> 
> Have you tried running against a real file system interface? Or even just
> against HDFS?


Re: First impressions with Drill+Parquet+S3

Posted by Ted Dunning <te...@gmail.com>.
Have you tried running against a real file system interface? Or even just
against HDFS?


