Posted to user@drill.apache.org by Hafiz Mujadid <ha...@gmail.com> on 2015/07/25 11:08:10 UTC

Best Performance of drill

Hi!

I have terabytes of data on S3 and I want to query this data using Drill. I
want to know which data format gives the best performance with Drill: will
the CSV format or the Parquet format be better? Also, what should the file
size be? Are small files or large files more appropriate for Drill?


Thanks

Re: Best Performance of drill

Posted by Hafiz Mujadid <ha...@gmail.com>.
Thanks, David and Stefan :)




-- 
Regards: HAFIZ MUJADID

Re: Best Performance of drill

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Hafiz,

We are trying to find out whether the Tachyon project
(http://tachyon-project.org/) can be used as a bridge between the two worlds
(a local S3 cache with some built-in intelligence).

Others may know of alternative "S3 sweeteners".

Regards,
 -Stefan




Re: Best Performance of drill

Posted by David Tucker <dt...@maprtech.com>.
One more thing to remember ... S3 is an object store, not a file system in the traditional sense. That means that when a drillbit accesses a file from S3, the whole thing is transferred ... whether it's 100 bytes or 100 megabytes. The advantages of the Parquet format are far more obvious in a file-system environment, where basic operations like lseek and partial file reads are supported.

Stefan is absolutely correct that you're still better off with Parquet files ... if only because the absolute volume of data you'll be pulling in from S3 will be reduced. In terms of file size, you'll likely see better performance with larger files rather than smaller ones (thousands of GB-sized files for your TB of data rather than millions of MB-sized files). This will definitely be a balancing act; you'll want to test the scaling __slowly__ and identify the sweet spot. IMHO, you really may be better off pulling the data down from S3 onto local storage if you'll be accessing it with multiple queries. Ephemeral storage on Amazon EC2 instances is fairly cheap if you're only talking about a few TB.

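One way to do that with Drill itself (the workspace and path names here are
just placeholders; any writable local dfs workspace will do) is a one-off
CTAS that materializes the S3 data as local Parquet, so that later queries
hit the local copy instead of S3:

    CREATE TABLE dfs.tmp.`s3_local_copy` AS
    SELECT * FROM s3.`mydata`;
    -- subsequent queries: SELECT ... FROM dfs.tmp.`s3_local_copy`
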
-- David



Re: Best Performance of drill

Posted by Hafiz Mujadid <ha...@gmail.com>.
Thanks a lot, Stefan :)




-- 
Regards: HAFIZ MUJADID

Re: Best Performance of drill

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

I'm pretty new around here but let me attempt to answer you.

   - Parquet will always be (a lot) faster than CSV, especially if you're
   querying only a subset of the columns
   - Parquet has various compression techniques and is more "scan
   friendly" (optimized for scanning compressed data)

   - The optimal file size is linked to the file system segment and block
   sizes (I'm not sure how that affects S3); a sketch of tuning this from a
   Drill session follows below
   - have a look at this:
   http://ingest.tips/2015/01/31/parquet-row-group-size/

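A minimal sketch of that tuning, using the store.parquet.block-size session
option and assuming a 256 MB target (the value is only an illustration, not
a recommendation):

    ALTER SESSION SET `store.parquet.block-size` = 268435456; -- ~256 MB row groups
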
   - Read up on partitioning of Parquet files, which is supported by Drill
   and can improve your performance quite a bit (a small pruning example
   follows below)
   - partitioning helps you filter data efficiently and prevents scanning
   of data that is not relevant to your query

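In that example (the s3 workspace name and the year/month directory layout
are assumptions), Drill maps subdirectory levels to the dir0, dir1, ...
columns, so filtering on them can avoid reading directories that don't
match:

    -- data laid out as events/2015/07/*.parquet
    SELECT COUNT(*)
    FROM s3.`events`
    WHERE dir0 = '2015' AND dir1 = '07';
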
   - Spend a little time planning how you will map your CSV to Parquet so
   that columns are imported as the appropriate data type (see the CTAS
   sketch below)
   - this matters for compression and efficiency (storing numbers as
   strings, for example, will prevent Parquet from doing some of its
   optimization magic)
   - See this:
   http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2 (or
   some of the other presentations on Parquet)

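A minimal CTAS sketch for that mapping, assuming a hypothetical events.csv
with an id, a timestamp and an amount column (Drill exposes CSV fields as
the columns array; adjust names, types and the timestamp format to your
data):

    CREATE TABLE dfs.tmp.`events_parquet` AS
    SELECT CAST(columns[0] AS INT)                         AS id,
           TO_TIMESTAMP(columns[1], 'yyyy-MM-dd HH:mm:ss') AS event_time,
           CAST(columns[2] AS DOUBLE)                      AS amount
    FROM s3.`events.csv`;
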
   - Optimize your drillbits (Drill machines) so they share the workload
   (one of the relevant knobs is shown below)

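As an example of the kind of knob involved in spreading that workload (the
value is a placeholder; the right number depends on your hardware):

    ALTER SYSTEM SET `planner.width.max_per_node` = 8; -- parallel fragments per drillbit
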
   - Get to know S3 best practices
   - https://www.youtube.com/watch?v=_FHRzq7eHQc
   - https://aws.amazon.com/articles/1904

Hope this helps,
 -Stefan
