Posted to user@drill.apache.org by Andy Pernsteiner <ap...@maprtech.com> on 2015/10/07 17:27:47 UTC

Drill + gzipped-CSV performance

I'm running some experimental queries, both against plain CSV and against
gzipped CSV (same data, same file count, etc.).

I'm doing a simple:

> select count(columns[0]) from dfs.workspace.`/csv`

and

> select count(columns[0]) from dfs.workspace.`/gz`

Here are my results:

70 files, plain CSV, 5 GB on disk: *4.8s*

70 files, gzipped CSV, 1.7 GB on disk (5 GB uncompressed): *30.4s*


Looking at the profiles, it appears that most of the time is spent in the
TEXT_SUB_SCAN operation.  Both queries spawn the same number of
minor fragments for this phase (68), but the process_time for those minor
fragments averages 24s for the gzipped data (most of the fragments are
pretty close to the mean) and 700ms for the plain CSV data.
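
As a rough back-of-the-envelope on those numbers: 5 GB / 68 fragments is
about 75 MB per fragment, so:

  plain CSV:   ~75 MB in ~0.7s  -> ~100 MB/s per scan thread
  gzipped CSV: ~75 MB (uncompressed) in ~24s  -> ~3 MB/s per scan thread

i.e. each scan thread is roughly 34x slower on the gzipped data.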

Is this expected?

-- 
 Andy Pernsteiner
 Manager, Field Enablement
ph: 206.228.0737

www.mapr.com

Now Available - Free Hadoop On-Demand Training

Re: Drill + gzipped-CSV performance

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Oct 7, 2015 at 2:03 PM, Jason Altekruse <al...@gmail.com>
wrote:

> Here is a presentation with some helpful information (I haven't read all
> of it, but the table on slide 7 gives a nice overview of the features in
> each codec).
>
> http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
>
> I am skeptical of their assertion that only bzip2 is splittable. This page
> from the Cloudera docs claims that only gzip is not splittable. You might
> have to try out a few and see what results you get.
>
>
> http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/admin_data_compression_performance.html
>

For one, the MapR built-in compression is supremely splittable. The
application doesn't even know the file is compressed and can split it
anywhere at all.  Splitting on 64kB boundaries is more efficient, but
overall it doesn't much matter.

Re: Drill + gzipped-CSV performance

Posted by Jason Altekruse <al...@gmail.com>.
The issue you are likely hitting is not being CPU-bound but
under-parallelization. Files that are gzip-compressed are not splittable in
HDFS, so we will be reading each whole file on a single thread.

Plain text files, as well as those compressed with splittable
compression codecs, will be read in parallel.

Here is a presentation with some helpful information (I haven't read all
of it, but the table on slide 7 gives a nice overview of the features in
each codec).

http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2

I am skeptical of their assertion that only bzip2 is splittable. This page
from the Cloudera docs claims that only gzip is not splittable. You might
have to try out a few and see what results you get.

http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/admin_data_compression_performance.html
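
If you want to test it, a quick (untested) sketch, assuming your CSV files
live under /data/csv and your workspace points at /data, would be to
recompress a copy with bzip2, which is splittable, and rerun the count:

  # hypothetical paths; bzip2 is splittable, gzip is not
  mkdir -p /data/bz2
  for f in /data/csv/*.csv; do
    bzip2 -c "$f" > /data/bz2/"$(basename "$f" .csv)".csv.bz2
  done

and then in Drill:

  select count(columns[0]) from dfs.workspace.`/bz2`

If the bzip2 run parallelizes the way plain CSV does, that points at
splittability rather than raw decompression cost.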

On Wed, Oct 7, 2015 at 1:51 PM, Andy Pernsteiner <ap...@maprtech.com>
wrote:

> Ya, that makes sense. I'll check the system next time I run this to see
> how much CPU the Drillbits wind up taking. For now I'll just accept the
> penalty :)

Re: Drill + gzipped-CSV performance

Posted by Andy Pernsteiner <ap...@maprtech.com>.
Ya, that makes sense. I'll check the system next time I run this to see how much CPU the Drillbits wind up taking. For now I'll just accept the penalty :)
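
A quick way to watch that (untested, and assuming the process shows up
under the usual Drillbit name) would be something like:

  # pin top to just the Drillbit JVM's PID(s)
  top -p "$(pgrep -d',' -f Drillbit)"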



 Andy Pernsteiner
 Manager, Field Enablement
ph: 206.228.0737

www.mapr.com
Now Available - Free Hadoop On-Demand Training



From: Alexander Reshetov <al...@gmail.com>
Reply: user@drill.apache.org <us...@drill.apache.org>
Date: October 7, 2015 at 4:37:29 PM
To: user@drill.apache.org <us...@drill.apache.org>
Subject: Re: Drill + gzipped-CSV performance

Hi Andy,  

I think that in your specific setup the CPU becomes the bottleneck, which
leads to the slower query time. You could try the query on another system
with a faster CPU, and/or try a lower compression ratio.


Re: Drill + gzipped-CSV performance

Posted by Alexander Reshetov <al...@gmail.com>.
Hi Andy,

I think that in your specific setup the CPU becomes the bottleneck, which
leads to the slower query time. You could try the query on another system
with a faster CPU, and/or try a lower compression ratio.
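
For example (hypothetical file name), the compression ratio is a knob you
turn when creating the files:

  gzip -1 part-00000.csv   # fastest, lowest compression ratio
  gzip -9 part-00000.csv   # slowest, highest compression ratio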

On Wed, Oct 7, 2015 at 9:15 PM, Andy Pernsteiner
<ap...@maprtech.com> wrote:
> In thinking this through, it probably is somewhat expected to see a slowdown when having to decompress data (esp gzip) as part of running a Drill query.

Re: Drill + gzipped-CSV performance

Posted by Andy Pernsteiner <ap...@maprtech.com>.
In thinking this through, it probably is somewhat expected to see a slowdown when having to decompress data (esp gzip) as part of running a Drill query.  



 Andy Pernsteiner
 Manager, Field Enablement
ph: 206.228.0737

www.mapr.com
Now Available - Free Hadoop On-Demand Training




Re: Drill + gzipped-CSV performance

Posted by Jacques Nadeau <ja...@dremio.com>.
The other issue you might be running into is that I have seen situations
where gzip is not using the native library for decompression. You should
take a look at whether the native codec is being used.
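
A quick way to check (assuming a reasonably recent Hadoop install on the
nodes) is:

  # reports true/false for each native codec, including zlib
  hadoop checknative -a

If zlib shows false, gzip decompression falls back to the built-in Java
implementation.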

--
Jacques Nadeau
CTO and Co-Founder, Dremio
