You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by Khurram Faraaz <kf...@maprtech.com> on 2016/08/01 09:57:47 UTC

Re: [Drill-Questions] Speed difference between GZ and BZ2

What is the data format within those .gz and .bz2 files ? It is parquet or
JSON or plain text (CSV) ?
Also, what was this config parameter `store.parquet.compression` set to,
when ypu ran your test ?

- Khurram

On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <sh...@games24x7.com>
wrote:

> Awaiting for response..
>
> On 30-Jul-2016 3:20 PM, "Shankar Mane" <sh...@games24x7.com> wrote:
>
> >
>
> > I am Comparing Querying speed between GZ and BZ2.
> >
> > Below are the 2 files and their sizes (This 2 files have same data):
> > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> >
> >
> >
> > Results:
> >
> > 0: jdbc:drill:> select channelid, count(serverTime) from
> dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > +------------+----------+
> > | channelid  |  EXPR$1  |
> > +------------+----------+
> > | 3          | 977134   |
> > | 0          | 836850   |
> > | 2          | 3202854  |
> > +------------+----------+
> > 3 rows selected (86.034 seconds)
> >
> >
> >
> > 0: jdbc:drill:> select channelid, count(serverTime) from
> dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> > +------------+----------+
> > | channelid  |  EXPR$1  |
> > +------------+----------+
> > | 3          | 977134   |
> > | 0          | 836850   |
> > | 2          | 3202854  |
> > +------------+----------+
> > 3 rows selected (459.079 seconds)
> >
> >
> >
> > Questions:
> > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > 2. How can we speed to up Bz2.  Are there any configuration to do ?
> > 3. As bz2 is splittable format, How drill using it ?
> >
> >
> > regards,
> > shankar
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Shankar Mane <sh...@games24x7.com>.

Yes, i went through the benchmarks and started testing this one.

I have tested this one using Hadoop Map-Reduce. And it seems BZ worked
faster than GZ.  As i know GZ is non-splittable and BZ is splittable.
Hadoop MR takes the advantage of this splittable property and launched
multiple mappers and reducers (multiple CPU's) whereas in case of GZ only
single mapper runs (single CPU) .

Can't drill use this splittable property ?



On Fri, Aug 5, 2016 at 8:50 PM, Khurram Faraaz <kf...@maprtech.com> wrote:

> Shankar,
>
> This is expected behavior, bzip2 decompression is four to twelve times
> slower than decompressing gzip compressed files.
> You can look at the comparison benchmark here for numbers -
> http://tukaani.org/lzma/benchmarks.html
>
> On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane <sh...@games24x7.com>
> wrote:
>
> > Please find the query plan for both queries. FYI: I am not seeing
> > any planning difference between these 2 queries except Cost.
> >
> >
> > /******************************** Query on GZ
> > ****************************************/
> >
> > 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(channelid=[$0], EXPR$1=[$1])
> > 00-02        UnionExchange
> > 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> > 01-02            Project(channelid=[$0], EXPR$1=[$1])
> > 01-03              HashToRandomExchange(dist0=[[$0]])
> > 02-01                UnorderedMuxExchange
> > 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> > 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> > 03-03                      Scan(groupscan=[EasyGroupScan
> > [selectionRoot=hdfs://namenode:9000/tmp/stest-gz/
> > kafka_3_25-Jul-2016-12a.json.gz,
> > numFiles=1, columns=[`channelid`, `serverTime`],
> > files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-
> > 2016-12a.json.gz]]])
> >  | {
> >   "head" : {
> >     "version" : 1,
> >     "generator" : {
> >       "type" : "ExplainHandler",
> >       "info" : ""
> >     },
> >     "type" : "APACHE_DRILL_PHYSICAL",
> >     "options" : [ ],
> >     "queue" : 0,
> >     "resultMode" : "EXEC"
> >   },
> >   "graph" : [ {
> >     "pop" : "fs-scan",
> >     "@id" : 196611,
> >     "userName" : "hadoop",
> >     "files" : [
> > "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
> >     "storage" : {
> >       "type" : "file",
> >       "enabled" : true,
> >       "connection" : "hdfs://namenode:9000",
> >       "config" : null,
> >       "workspaces" : {
> >         "root" : {
> >           "location" : "/tmp/",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         },
> >         "tmp" : {
> >           "location" : "/tmp",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         }
> >       },
> >       "formats" : {
> >         "psv" : {
> >           "type" : "text",
> >           "extensions" : [ "tbl" ],
> >           "delimiter" : "|"
> >         },
> >         "csv" : {
> >           "type" : "text",
> >           "extensions" : [ "csv" ],
> >           "delimiter" : ","
> >         },
> >         "tsv" : {
> >           "type" : "text",
> >           "extensions" : [ "tsv" ],
> >           "delimiter" : "\t"
> >         },
> >         "parquet" : {
> >           "type" : "parquet"
> >         },
> >         "json" : {
> >           "type" : "json",
> >           "extensions" : [ "json" ]
> >         },
> >         "avro" : {
> >           "type" : "avro"
> >         }
> >       }
> >     },
> >     "format" : {
> >       "type" : "json",
> >       "extensions" : [ "json" ]
> >     },
> >     "columns" : [ "`channelid`", "`serverTime`" ],
> >     "selectionRoot" :
> > "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
> >     "cost" : 1800981.0
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 196610,
> >     "child" : 196611,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "count(`serverTime`) "
> >     } ],
> >     "cost" : 900490.5
> >   }, {
> >     "pop" : "project",
> >     "@id" : 196609,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     }, {
> >       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >       "expr" : "hash32asdouble(`channelid`) "
> >     } ],
> >     "child" : 196610,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "unordered-mux-exchange",
> >     "@id" : 131073,
> >     "child" : 196609,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "hash-to-random-exchange",
> >     "@id" : 65539,
> >     "child" : 131073,
> >     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "project",
> >     "@id" : 65538,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 65539,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 65537,
> >     "child" : 65538,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "$sum0(`EXPR$1`) "
> >     } ],
> >     "cost" : 90049.05
> >   }, {
> >     "pop" : "union-exchange",
> >     "@id" : 2,
> >     "child" : 65537,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   }, {
> >     "pop" : "project",
> >     "@id" : 1,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 2,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   }, {
> >     "pop" : "screen",
> >     "@id" : 0,
> >     "child" : 1,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   } ]
> > } |
> > +------+------+
> > 1 row selected (0.729 seconds)
> > 0: jdbc:drill:>
> >
> >
> >
> > /******************************** Query on BZ
> > ****************************************/
> >
> >
> > 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> channelid ;
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(channelid=[$0], EXPR$1=[$1])
> > 00-02        UnionExchange
> > 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> > 01-02            Project(channelid=[$0], EXPR$1=[$1])
> > 01-03              HashToRandomExchange(dist0=[[$0]])
> > 02-01                UnorderedMuxExchange
> > 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> > 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> > 03-03                      Scan(groupscan=[EasyGroupScan
> > [selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/
> > kafka_3_25-Jul-2016-12a.json.bz2,
> > numFiles=1, columns=[`channelid`, `serverTime`],
> > files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-
> > 2016-12a.json.bz2]]])
> >  | {
> >   "head" : {
> >     "version" : 1,
> >     "generator" : {
> >       "type" : "ExplainHandler",
> >       "info" : ""
> >     },
> >     "type" : "APACHE_DRILL_PHYSICAL",
> >     "options" : [ ],
> >     "queue" : 0,
> >     "resultMode" : "EXEC"
> >   },
> >   "graph" : [ {
> >     "pop" : "fs-scan",
> >     "@id" : 196611,
> >     "userName" : "hadoop",
> >     "files" : [
> > "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
> >     "storage" : {
> >       "type" : "file",
> >       "enabled" : true,
> >       "connection" : "hdfs://namenode:9000",
> >       "config" : null,
> >       "workspaces" : {
> >         "root" : {
> >           "location" : "/tmp/",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         },
> >         "tmp" : {
> >           "location" : "/tmp",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         }
> >       },
> >       "formats" : {
> >         "psv" : {
> >           "type" : "text",
> >           "extensions" : [ "tbl" ],
> >           "delimiter" : "|"
> >         },
> >         "csv" : {
> >           "type" : "text",
> >           "extensions" : [ "csv" ],
> >           "delimiter" : ","
> >         },
> >         "tsv" : {
> >           "type" : "text",
> >           "extensions" : [ "tsv" ],
> >           "delimiter" : "\t"
> >         },
> >         "parquet" : {
> >           "type" : "parquet"
> >         },
> >         "json" : {
> >           "type" : "json",
> >           "extensions" : [ "json" ]
> >         },
> >         "avro" : {
> >           "type" : "avro"
> >         }
> >       }
> >     },
> >     "format" : {
> >       "type" : "json",
> >       "extensions" : [ "json" ]
> >     },
> >     "columns" : [ "`channelid`", "`serverTime`" ],
> >     "selectionRoot" :
> > "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
> >     "cost" : 1148224.0
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 196610,
> >     "child" : 196611,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "count(`serverTime`) "
> >     } ],
> >     "cost" : 574112.0
> >   }, {
> >     "pop" : "project",
> >     "@id" : 196609,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     }, {
> >       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >       "expr" : "hash32asdouble(`channelid`) "
> >     } ],
> >     "child" : 196610,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "unordered-mux-exchange",
> >     "@id" : 131073,
> >     "child" : 196609,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "hash-to-random-exchange",
> >     "@id" : 65539,
> >     "child" : 131073,
> >     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "project",
> >     "@id" : 65538,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 65539,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 65537,
> >     "child" : 65538,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "$sum0(`EXPR$1`) "
> >     } ],
> >     "cost" : 57411.2
> >   }, {
> >     "pop" : "union-exchange",
> >     "@id" : 2,
> >     "child" : 65537,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   }, {
> >     "pop" : "project",
> >     "@id" : 1,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 2,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   }, {
> >     "pop" : "screen",
> >     "@id" : 0,
> >     "child" : 1,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   } ]
> > } |
> > +------+------+
> > 1 row selected (0.381 seconds)
> > 0: jdbc:drill:>
> >
> >
> > On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com>
> > wrote:
> >
> > > Can you please do an explain plan over the two aggregate queries. That
> > way
> > > we can know where most of the time is being spent, is it in the query
> > > planning phase or is it query execution that is taking longer. Please
> > share
> > > the query plans and the time taken for those explain plan statements.
> > >
> > > On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <
> shankar.mane@games24x7.com
> > >
> > > wrote:
> > >
> > > > It is plain json (1 json per line).
> > > > Each json message size = ~4kb
> > > > no. of json messages = ~5 Millions.
> > > >
> > > > store.parquet.compression = snappy ( i don't think, this parameter
> get
> > > > used. As I am querying select only.)
> > > >
> > > >
> > > > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kfaraaz@maprtech.com
> >
> > > > wrote:
> > > >
> > > > > What is the data format within those .gz and .bz2 files ? It is
> > parquet
> > > > or
> > > > > JSON or plain text (CSV) ?
> > > > > Also, what was this config parameter `store.parquet.compression`
> set
> > > to,
> > > > > when ypu ran your test ?
> > > > >
> > > > > - Khurram
> > > > >
> > > > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > > > shankar.mane@games24x7.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Awaiting for response..
> > > > > >
> > > > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <
> shankar.mane@games24x7.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > > > >
> > > > > > > Below are the 2 files and their sizes (This 2 files have same
> > > data):
> > > > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Results:
> > > > > > >
> > > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> > > channelid
> > > > ;
> > > > > > > +------------+----------+
> > > > > > > | channelid  |  EXPR$1  |
> > > > > > > +------------+----------+
> > > > > > > | 3          | 977134   |
> > > > > > > | 0          | 836850   |
> > > > > > > | 2          | 3202854  |
> > > > > > > +------------+----------+
> > > > > > > 3 rows selected (86.034 seconds)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > > > channelid
> > > > > ;
> > > > > > > +------------+----------+
> > > > > > > | channelid  |  EXPR$1  |
> > > > > > > +------------+----------+
> > > > > > > | 3          | 977134   |
> > > > > > > | 0          | 836850   |
> > > > > > > | 2          | 3202854  |
> > > > > > > +------------+----------+
> > > > > > > 3 rows selected (459.079 seconds)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Questions:
> > > > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > > > 2. How can we speed to up Bz2.  Are there any configuration to
> > do ?
> > > > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > > > >
> > > > > > >
> > > > > > > regards,
> > > > > > > shankar
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Shankar Mane <sh...@games24x7.com>.

Yes, i went through the benchmarks and started testing this one.

I have tested this one using Hadoop Map-Reduce. And it seems BZ worked
faster than GZ.  As i know GZ is non-splittable and BZ is splittable.
Hadoop MR takes the advantage of this splittable property and launched
multiple mappers and reducers (multiple CPU's) whereas in case of GZ only
single mapper runs (single CPU) .

Can't drill use this splittable property ?



On Fri, Aug 5, 2016 at 8:50 PM, Khurram Faraaz <kf...@maprtech.com> wrote:

> Shankar,
>
> This is expected behavior, bzip2 decompression is four to twelve times
> slower than decompressing gzip compressed files.
> You can look at the comparison benchmark here for numbers -
> http://tukaani.org/lzma/benchmarks.html
>
> On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane <sh...@games24x7.com>
> wrote:
>
> > Please find the query plan for both queries. FYI: I am not seeing
> > any planning difference between these 2 queries except Cost.
> >
> >
> > /******************************** Query on GZ
> > ****************************************/
> >
> > 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(channelid=[$0], EXPR$1=[$1])
> > 00-02        UnionExchange
> > 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> > 01-02            Project(channelid=[$0], EXPR$1=[$1])
> > 01-03              HashToRandomExchange(dist0=[[$0]])
> > 02-01                UnorderedMuxExchange
> > 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> > 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> > 03-03                      Scan(groupscan=[EasyGroupScan
> > [selectionRoot=hdfs://namenode:9000/tmp/stest-gz/
> > kafka_3_25-Jul-2016-12a.json.gz,
> > numFiles=1, columns=[`channelid`, `serverTime`],
> > files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-
> > 2016-12a.json.gz]]])
> >  | {
> >   "head" : {
> >     "version" : 1,
> >     "generator" : {
> >       "type" : "ExplainHandler",
> >       "info" : ""
> >     },
> >     "type" : "APACHE_DRILL_PHYSICAL",
> >     "options" : [ ],
> >     "queue" : 0,
> >     "resultMode" : "EXEC"
> >   },
> >   "graph" : [ {
> >     "pop" : "fs-scan",
> >     "@id" : 196611,
> >     "userName" : "hadoop",
> >     "files" : [
> > "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
> >     "storage" : {
> >       "type" : "file",
> >       "enabled" : true,
> >       "connection" : "hdfs://namenode:9000",
> >       "config" : null,
> >       "workspaces" : {
> >         "root" : {
> >           "location" : "/tmp/",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         },
> >         "tmp" : {
> >           "location" : "/tmp",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         }
> >       },
> >       "formats" : {
> >         "psv" : {
> >           "type" : "text",
> >           "extensions" : [ "tbl" ],
> >           "delimiter" : "|"
> >         },
> >         "csv" : {
> >           "type" : "text",
> >           "extensions" : [ "csv" ],
> >           "delimiter" : ","
> >         },
> >         "tsv" : {
> >           "type" : "text",
> >           "extensions" : [ "tsv" ],
> >           "delimiter" : "\t"
> >         },
> >         "parquet" : {
> >           "type" : "parquet"
> >         },
> >         "json" : {
> >           "type" : "json",
> >           "extensions" : [ "json" ]
> >         },
> >         "avro" : {
> >           "type" : "avro"
> >         }
> >       }
> >     },
> >     "format" : {
> >       "type" : "json",
> >       "extensions" : [ "json" ]
> >     },
> >     "columns" : [ "`channelid`", "`serverTime`" ],
> >     "selectionRoot" :
> > "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
> >     "cost" : 1800981.0
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 196610,
> >     "child" : 196611,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "count(`serverTime`) "
> >     } ],
> >     "cost" : 900490.5
> >   }, {
> >     "pop" : "project",
> >     "@id" : 196609,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     }, {
> >       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >       "expr" : "hash32asdouble(`channelid`) "
> >     } ],
> >     "child" : 196610,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "unordered-mux-exchange",
> >     "@id" : 131073,
> >     "child" : 196609,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "hash-to-random-exchange",
> >     "@id" : 65539,
> >     "child" : 131073,
> >     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "project",
> >     "@id" : 65538,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 65539,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 180098.1
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 65537,
> >     "child" : 65538,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "$sum0(`EXPR$1`) "
> >     } ],
> >     "cost" : 90049.05
> >   }, {
> >     "pop" : "union-exchange",
> >     "@id" : 2,
> >     "child" : 65537,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   }, {
> >     "pop" : "project",
> >     "@id" : 1,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 2,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   }, {
> >     "pop" : "screen",
> >     "@id" : 0,
> >     "child" : 1,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 18009.81
> >   } ]
> > } |
> > +------+------+
> > 1 row selected (0.729 seconds)
> > 0: jdbc:drill:>
> >
> >
> >
> > /******************************** Query on BZ
> > ****************************************/
> >
> >
> > 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> channelid ;
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(channelid=[$0], EXPR$1=[$1])
> > 00-02        UnionExchange
> > 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> > 01-02            Project(channelid=[$0], EXPR$1=[$1])
> > 01-03              HashToRandomExchange(dist0=[[$0]])
> > 02-01                UnorderedMuxExchange
> > 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> > 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> > 03-03                      Scan(groupscan=[EasyGroupScan
> > [selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/
> > kafka_3_25-Jul-2016-12a.json.bz2,
> > numFiles=1, columns=[`channelid`, `serverTime`],
> > files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-
> > 2016-12a.json.bz2]]])
> >  | {
> >   "head" : {
> >     "version" : 1,
> >     "generator" : {
> >       "type" : "ExplainHandler",
> >       "info" : ""
> >     },
> >     "type" : "APACHE_DRILL_PHYSICAL",
> >     "options" : [ ],
> >     "queue" : 0,
> >     "resultMode" : "EXEC"
> >   },
> >   "graph" : [ {
> >     "pop" : "fs-scan",
> >     "@id" : 196611,
> >     "userName" : "hadoop",
> >     "files" : [
> > "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
> >     "storage" : {
> >       "type" : "file",
> >       "enabled" : true,
> >       "connection" : "hdfs://namenode:9000",
> >       "config" : null,
> >       "workspaces" : {
> >         "root" : {
> >           "location" : "/tmp/",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         },
> >         "tmp" : {
> >           "location" : "/tmp",
> >           "writable" : true,
> >           "defaultInputFormat" : null
> >         }
> >       },
> >       "formats" : {
> >         "psv" : {
> >           "type" : "text",
> >           "extensions" : [ "tbl" ],
> >           "delimiter" : "|"
> >         },
> >         "csv" : {
> >           "type" : "text",
> >           "extensions" : [ "csv" ],
> >           "delimiter" : ","
> >         },
> >         "tsv" : {
> >           "type" : "text",
> >           "extensions" : [ "tsv" ],
> >           "delimiter" : "\t"
> >         },
> >         "parquet" : {
> >           "type" : "parquet"
> >         },
> >         "json" : {
> >           "type" : "json",
> >           "extensions" : [ "json" ]
> >         },
> >         "avro" : {
> >           "type" : "avro"
> >         }
> >       }
> >     },
> >     "format" : {
> >       "type" : "json",
> >       "extensions" : [ "json" ]
> >     },
> >     "columns" : [ "`channelid`", "`serverTime`" ],
> >     "selectionRoot" :
> > "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
> >     "cost" : 1148224.0
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 196610,
> >     "child" : 196611,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "count(`serverTime`) "
> >     } ],
> >     "cost" : 574112.0
> >   }, {
> >     "pop" : "project",
> >     "@id" : 196609,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     }, {
> >       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >       "expr" : "hash32asdouble(`channelid`) "
> >     } ],
> >     "child" : 196610,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "unordered-mux-exchange",
> >     "@id" : 131073,
> >     "child" : 196609,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "hash-to-random-exchange",
> >     "@id" : 65539,
> >     "child" : 131073,
> >     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "project",
> >     "@id" : 65538,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 65539,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 114822.4
> >   }, {
> >     "pop" : "hash-aggregate",
> >     "@id" : 65537,
> >     "child" : 65538,
> >     "cardinality" : 1.0,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "groupByExprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     } ],
> >     "aggrExprs" : [ {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "$sum0(`EXPR$1`) "
> >     } ],
> >     "cost" : 57411.2
> >   }, {
> >     "pop" : "union-exchange",
> >     "@id" : 2,
> >     "child" : 65537,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   }, {
> >     "pop" : "project",
> >     "@id" : 1,
> >     "exprs" : [ {
> >       "ref" : "`channelid`",
> >       "expr" : "`channelid`"
> >     }, {
> >       "ref" : "`EXPR$1`",
> >       "expr" : "`EXPR$1`"
> >     } ],
> >     "child" : 2,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   }, {
> >     "pop" : "screen",
> >     "@id" : 0,
> >     "child" : 1,
> >     "initialAllocation" : 1000000,
> >     "maxAllocation" : 10000000000,
> >     "cost" : 11482.24
> >   } ]
> > } |
> > +------+------+
> > 1 row selected (0.381 seconds)
> > 0: jdbc:drill:>
> >
> >
> > On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com>
> > wrote:
> >
> > > Can you please do an explain plan over the two aggregate queries. That
> > way
> > > we can know where most of the time is being spent, is it in the query
> > > planning phase or is it query execution that is taking longer. Please
> > share
> > > the query plans and the time taken for those explain plan statements.
> > >
> > > On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <
> shankar.mane@games24x7.com
> > >
> > > wrote:
> > >
> > > > It is plain json (1 json per line).
> > > > Each json message size = ~4kb
> > > > no. of json messages = ~5 Millions.
> > > >
> > > > store.parquet.compression = snappy ( i don't think, this parameter
> get
> > > > used. As I am querying select only.)
> > > >
> > > >
> > > > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kfaraaz@maprtech.com
> >
> > > > wrote:
> > > >
> > > > > What is the data format within those .gz and .bz2 files ? It is
> > parquet
> > > > or
> > > > > JSON or plain text (CSV) ?
> > > > > Also, what was this config parameter `store.parquet.compression`
> set
> > > to,
> > > > > when ypu ran your test ?
> > > > >
> > > > > - Khurram
> > > > >
> > > > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > > > shankar.mane@games24x7.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Awaiting for response..
> > > > > >
> > > > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <
> shankar.mane@games24x7.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > > > >
> > > > > > > Below are the 2 files and their sizes (This 2 files have same
> > > data):
> > > > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Results:
> > > > > > >
> > > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> > > channelid
> > > > ;
> > > > > > > +------------+----------+
> > > > > > > | channelid  |  EXPR$1  |
> > > > > > > +------------+----------+
> > > > > > > | 3          | 977134   |
> > > > > > > | 0          | 836850   |
> > > > > > > | 2          | 3202854  |
> > > > > > > +------------+----------+
> > > > > > > 3 rows selected (86.034 seconds)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > > > channelid
> > > > > ;
> > > > > > > +------------+----------+
> > > > > > > | channelid  |  EXPR$1  |
> > > > > > > +------------+----------+
> > > > > > > | 3          | 977134   |
> > > > > > > | 0          | 836850   |
> > > > > > > | 2          | 3202854  |
> > > > > > > +------------+----------+
> > > > > > > 3 rows selected (459.079 seconds)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Questions:
> > > > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > > > 2. How can we speed to up Bz2.  Are there any configuration to
> > do ?
> > > > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > > > >
> > > > > > >
> > > > > > > regards,
> > > > > > > shankar
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Khurram Faraaz <kf...@maprtech.com>.

Shankar,

This is expected behavior, bzip2 decompression is four to twelve times
slower than decompressing gzip compressed files.
You can look at the comparison benchmark here for numbers -
http://tukaani.org/lzma/benchmarks.html

On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane <sh...@games24x7.com>
wrote:

> Please find the query plan for both queries. FYI: I am not seeing
> any planning difference between these 2 queries except Cost.
>
>
> /******************************** Query on GZ
> ****************************************/
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-gz/
> kafka_3_25-Jul-2016-12a.json.gz,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-
> 2016-12a.json.gz]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
>     "cost" : 1800981.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 900490.5
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 90049.05
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   } ]
> } |
> +------+------+
> 1 row selected (0.729 seconds)
> 0: jdbc:drill:>
>
>
>
> /******************************** Query on BZ
> ****************************************/
>
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/
> kafka_3_25-Jul-2016-12a.json.bz2,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-
> 2016-12a.json.bz2]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
>     "cost" : 1148224.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 574112.0
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 57411.2
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   } ]
> } |
> +------+------+
> 1 row selected (0.381 seconds)
> 0: jdbc:drill:>
>
>
> On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com>
> wrote:
>
> > Can you please do an explain plan over the two aggregate queries. That
> way
> > we can know where most of the time is being spent, is it in the query
> > planning phase or is it query execution that is taking longer. Please
> share
> > the query plans and the time taken for those explain plan statements.
> >
> > On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <shankar.mane@games24x7.com
> >
> > wrote:
> >
> > > It is plain json (1 json per line).
> > > Each json message size = ~4kb
> > > no. of json messages = ~5 Millions.
> > >
> > > store.parquet.compression = snappy ( i don't think, this parameter get
> > > used. As I am querying select only.)
> > >
> > >
> > > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> > > wrote:
> > >
> > > > What is the data format within those .gz and .bz2 files ? It is
> parquet
> > > or
> > > > JSON or plain text (CSV) ?
> > > > Also, what was this config parameter `store.parquet.compression` set
> > to,
> > > > when ypu ran your test ?
> > > >
> > > > - Khurram
> > > >
> > > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > > shankar.mane@games24x7.com
> > > > >
> > > > wrote:
> > > >
> > > > > Awaiting for response..
> > > > >
> > > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <shankar.mane@games24x7.com
> >
> > > > wrote:
> > > > >
> > > > > >
> > > > >
> > > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > > >
> > > > > > Below are the 2 files and their sizes (This 2 files have same
> > data):
> > > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > > >
> > > > > >
> > > > > >
> > > > > > Results:
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> > channelid
> > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (86.034 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > > channelid
> > > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (459.079 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > Questions:
> > > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > > 2. How can we speed to up Bz2.  Are there any configuration to
> do ?
> > > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > > >
> > > > > >
> > > > > > regards,
> > > > > > shankar
> > > > >
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Khurram Faraaz <kf...@maprtech.com>.

Shankar,

This is expected behavior, bzip2 decompression is four to twelve times
slower than decompressing gzip compressed files.
You can look at the comparison benchmark here for numbers -
http://tukaani.org/lzma/benchmarks.html

On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane <sh...@games24x7.com>
wrote:

> Please find the query plan for both queries. FYI: I am not seeing
> any planning difference between these 2 queries except Cost.
>
>
> /******************************** Query on GZ
> ****************************************/
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-gz/
> kafka_3_25-Jul-2016-12a.json.gz,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-
> 2016-12a.json.gz]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
>     "cost" : 1800981.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 900490.5
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 90049.05
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   } ]
> } |
> +------+------+
> 1 row selected (0.729 seconds)
> 0: jdbc:drill:>
>
>
>
> /******************************** Query on BZ
> ****************************************/
>
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/
> kafka_3_25-Jul-2016-12a.json.bz2,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-
> 2016-12a.json.bz2]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
>     "cost" : 1148224.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 574112.0
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 57411.2
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   } ]
> } |
> +------+------+
> 1 row selected (0.381 seconds)
> 0: jdbc:drill:>
>
>
> On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com>
> wrote:
>
> > Can you please do an explain plan over the two aggregate queries. That
> way
> > we can know where most of the time is being spent, is it in the query
> > planning phase or is it query execution that is taking longer. Please
> share
> > the query plans and the time taken for those explain plan statements.
> >
> > On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <shankar.mane@games24x7.com
> >
> > wrote:
> >
> > > It is plain json (1 json per line).
> > > Each json message size = ~4kb
> > > no. of json messages = ~5 Millions.
> > >
> > > store.parquet.compression = snappy ( i don't think, this parameter get
> > > used. As I am querying select only.)
> > >
> > >
> > > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> > > wrote:
> > >
> > > > What is the data format within those .gz and .bz2 files ? It is
> parquet
> > > or
> > > > JSON or plain text (CSV) ?
> > > > Also, what was this config parameter `store.parquet.compression` set
> > to,
> > > > when ypu ran your test ?
> > > >
> > > > - Khurram
> > > >
> > > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > > shankar.mane@games24x7.com
> > > > >
> > > > wrote:
> > > >
> > > > > Awaiting for response..
> > > > >
> > > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <shankar.mane@games24x7.com
> >
> > > > wrote:
> > > > >
> > > > > >
> > > > >
> > > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > > >
> > > > > > Below are the 2 files and their sizes (This 2 files have same
> > data):
> > > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > > >
> > > > > >
> > > > > >
> > > > > > Results:
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> > channelid
> > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (86.034 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > > channelid
> > > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (459.079 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > Questions:
> > > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > > 2. How can we speed to up Bz2.  Are there any configuration to
> do ?
> > > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > > >
> > > > > >
> > > > > > regards,
> > > > > > shankar
> > > > >
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Khurram Faraaz <kf...@maprtech.com>.

Ok so query planning took less than one second in both the aggregate
queries.
Looks like most of the time is getting spent in query execution.

On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane <sh...@games24x7.com>
wrote:

> Please find the query plan for both queries. FYI: I am not seeing
> any planning difference between these 2 queries except Cost.
>
>
> /******************************** Query on GZ
> ****************************************/
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-gz/
> kafka_3_25-Jul-2016-12a.json.gz,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-
> 2016-12a.json.gz]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
>     "cost" : 1800981.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 900490.5
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 90049.05
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   } ]
> } |
> +------+------+
> 1 row selected (0.729 seconds)
> 0: jdbc:drill:>
>
>
>
> /******************************** Query on BZ
> ****************************************/
>
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/
> kafka_3_25-Jul-2016-12a.json.bz2,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-
> 2016-12a.json.bz2]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
>     "cost" : 1148224.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 574112.0
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 57411.2
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   } ]
> } |
> +------+------+
> 1 row selected (0.381 seconds)
> 0: jdbc:drill:>
>
>
> On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com>
> wrote:
>
> > Can you please do an explain plan over the two aggregate queries. That
> way
> > we can know where most of the time is being spent, is it in the query
> > planning phase or is it query execution that is taking longer. Please
> share
> > the query plans and the time taken for those explain plan statements.
> >
> > On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <shankar.mane@games24x7.com
> >
> > wrote:
> >
> > > It is plain json (1 json per line).
> > > Each json message size = ~4kb
> > > no. of json messages = ~5 Millions.
> > >
> > > store.parquet.compression = snappy ( i don't think, this parameter get
> > > used. As I am querying select only.)
> > >
> > >
> > > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> > > wrote:
> > >
> > > > What is the data format within those .gz and .bz2 files ? It is
> parquet
> > > or
> > > > JSON or plain text (CSV) ?
> > > > Also, what was this config parameter `store.parquet.compression` set
> > to,
> > > > when ypu ran your test ?
> > > >
> > > > - Khurram
> > > >
> > > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > > shankar.mane@games24x7.com
> > > > >
> > > > wrote:
> > > >
> > > > > Awaiting for response..
> > > > >
> > > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <shankar.mane@games24x7.com
> >
> > > > wrote:
> > > > >
> > > > > >
> > > > >
> > > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > > >
> > > > > > Below are the 2 files and their sizes (This 2 files have same
> > data):
> > > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > > >
> > > > > >
> > > > > >
> > > > > > Results:
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> > channelid
> > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (86.034 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > > channelid
> > > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (459.079 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > Questions:
> > > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > > 2. How can we speed to up Bz2.  Are there any configuration to
> do ?
> > > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > > >
> > > > > >
> > > > > > regards,
> > > > > > shankar
> > > > >
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Khurram Faraaz <kf...@maprtech.com>.

Ok so query planning took less than one second in both the aggregate
queries.
Looks like most of the time is getting spent in query execution.

On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane <sh...@games24x7.com>
wrote:

> Please find the query plan for both queries. FYI: I am not seeing
> any planning difference between these 2 queries except Cost.
>
>
> /******************************** Query on GZ
> ****************************************/
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-gz/
> kafka_3_25-Jul-2016-12a.json.gz,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-
> 2016-12a.json.gz]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
>     "cost" : 1800981.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 900490.5
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 180098.1
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 90049.05
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 18009.81
>   } ]
> } |
> +------+------+
> 1 row selected (0.729 seconds)
> 0: jdbc:drill:>
>
>
>
> /******************************** Query on BZ
> ****************************************/
>
>
> 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
> dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(channelid=[$0], EXPR$1=[$1])
> 00-02        UnionExchange
> 01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
> 01-02            Project(channelid=[$0], EXPR$1=[$1])
> 01-03              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(channelid=[$0], EXPR$1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 03-03                      Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/
> kafka_3_25-Jul-2016-12a.json.bz2,
> numFiles=1, columns=[`channelid`, `serverTime`],
> files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-
> 2016-12a.json.bz2]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 196611,
>     "userName" : "hadoop",
>     "files" : [
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "hdfs://namenode:9000",
>       "config" : null,
>       "workspaces" : {
>         "root" : {
>           "location" : "/tmp/",
>           "writable" : true,
>           "defaultInputFormat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "defaultInputFormat" : null
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json",
>           "extensions" : [ "json" ]
>         },
>         "avro" : {
>           "type" : "avro"
>         }
>       }
>     },
>     "format" : {
>       "type" : "json",
>       "extensions" : [ "json" ]
>     },
>     "columns" : [ "`channelid`", "`serverTime`" ],
>     "selectionRoot" :
> "hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
>     "cost" : 1148224.0
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 196610,
>     "child" : 196611,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "count(`serverTime`) "
>     } ],
>     "cost" : 574112.0
>   }, {
>     "pop" : "project",
>     "@id" : 196609,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     }, {
>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>       "expr" : "hash32asdouble(`channelid`) "
>     } ],
>     "child" : 196610,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "unordered-mux-exchange",
>     "@id" : 131073,
>     "child" : 196609,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-to-random-exchange",
>     "@id" : 65539,
>     "child" : 131073,
>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "project",
>     "@id" : 65538,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 65539,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 114822.4
>   }, {
>     "pop" : "hash-aggregate",
>     "@id" : 65537,
>     "child" : 65538,
>     "cardinality" : 1.0,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "groupByExprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     } ],
>     "aggrExprs" : [ {
>       "ref" : "`EXPR$1`",
>       "expr" : "$sum0(`EXPR$1`) "
>     } ],
>     "cost" : 57411.2
>   }, {
>     "pop" : "union-exchange",
>     "@id" : 2,
>     "child" : 65537,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`channelid`",
>       "expr" : "`channelid`"
>     }, {
>       "ref" : "`EXPR$1`",
>       "expr" : "`EXPR$1`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 11482.24
>   } ]
> } |
> +------+------+
> 1 row selected (0.381 seconds)
> 0: jdbc:drill:>
>
>
> On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com>
> wrote:
>
> > Can you please do an explain plan over the two aggregate queries. That
> way
> > we can know where most of the time is being spent, is it in the query
> > planning phase or is it query execution that is taking longer. Please
> share
> > the query plans and the time taken for those explain plan statements.
> >
> > On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <shankar.mane@games24x7.com
> >
> > wrote:
> >
> > > It is plain json (1 json per line).
> > > Each json message size = ~4kb
> > > no. of json messages = ~5 Millions.
> > >
> > > store.parquet.compression = snappy ( i don't think, this parameter get
> > > used. As I am querying select only.)
> > >
> > >
> > > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> > > wrote:
> > >
> > > > What is the data format within those .gz and .bz2 files ? It is
> parquet
> > > or
> > > > JSON or plain text (CSV) ?
> > > > Also, what was this config parameter `store.parquet.compression` set
> > to,
> > > > when ypu ran your test ?
> > > >
> > > > - Khurram
> > > >
> > > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > > shankar.mane@games24x7.com
> > > > >
> > > > wrote:
> > > >
> > > > > Awaiting for response..
> > > > >
> > > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <shankar.mane@games24x7.com
> >
> > > > wrote:
> > > > >
> > > > > >
> > > > >
> > > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > > >
> > > > > > Below are the 2 files and their sizes (This 2 files have same
> > data):
> > > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > > >
> > > > > >
> > > > > >
> > > > > > Results:
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> > channelid
> > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (86.034 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > > channelid
> > > > ;
> > > > > > +------------+----------+
> > > > > > | channelid  |  EXPR$1  |
> > > > > > +------------+----------+
> > > > > > | 3          | 977134   |
> > > > > > | 0          | 836850   |
> > > > > > | 2          | 3202854  |
> > > > > > +------------+----------+
> > > > > > 3 rows selected (459.079 seconds)
> > > > > >
> > > > > >
> > > > > >
> > > > > > Questions:
> > > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > > 2. How can we speed to up Bz2.  Are there any configuration to
> do ?
> > > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > > >
> > > > > >
> > > > > > regards,
> > > > > > shankar
> > > > >
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Shankar Mane <sh...@games24x7.com>.

Please find the query plan for both queries. FYI: I am not seeing
any planning difference between these 2 queries except Cost.


/******************************** Query on GZ
****************************************/

0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(channelid=[$0], EXPR$1=[$1])
00-02        UnionExchange
01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
01-02            Project(channelid=[$0], EXPR$1=[$1])
01-03              HashToRandomExchange(dist0=[[$0]])
02-01                UnorderedMuxExchange
03-01                  Project(channelid=[$0], EXPR$1=[$1],
E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
03-03                      Scan(groupscan=[EasyGroupScan
[selectionRoot=hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz,
numFiles=1, columns=[`channelid`, `serverTime`],
files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 196611,
    "userName" : "hadoop",
    "files" : [
"hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "hdfs://namenode:9000",
      "config" : null,
      "workspaces" : {
        "root" : {
          "location" : "/tmp/",
          "writable" : true,
          "defaultInputFormat" : null
        },
        "tmp" : {
          "location" : "/tmp",
          "writable" : true,
          "defaultInputFormat" : null
        }
      },
      "formats" : {
        "psv" : {
          "type" : "text",
          "extensions" : [ "tbl" ],
          "delimiter" : "|"
        },
        "csv" : {
          "type" : "text",
          "extensions" : [ "csv" ],
          "delimiter" : ","
        },
        "tsv" : {
          "type" : "text",
          "extensions" : [ "tsv" ],
          "delimiter" : "\t"
        },
        "parquet" : {
          "type" : "parquet"
        },
        "json" : {
          "type" : "json",
          "extensions" : [ "json" ]
        },
        "avro" : {
          "type" : "avro"
        }
      }
    },
    "format" : {
      "type" : "json",
      "extensions" : [ "json" ]
    },
    "columns" : [ "`channelid`", "`serverTime`" ],
    "selectionRoot" :
"hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
    "cost" : 1800981.0
  }, {
    "pop" : "hash-aggregate",
    "@id" : 196610,
    "child" : 196611,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "count(`serverTime`) "
    } ],
    "cost" : 900490.5
  }, {
    "pop" : "project",
    "@id" : 196609,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    }, {
      "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
      "expr" : "hash32asdouble(`channelid`) "
    } ],
    "child" : 196610,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "unordered-mux-exchange",
    "@id" : 131073,
    "child" : 196609,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "hash-to-random-exchange",
    "@id" : 65539,
    "child" : 131073,
    "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "project",
    "@id" : 65538,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 65539,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "hash-aggregate",
    "@id" : 65537,
    "child" : 65538,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "$sum0(`EXPR$1`) "
    } ],
    "cost" : 90049.05
  }, {
    "pop" : "union-exchange",
    "@id" : 2,
    "child" : 65537,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 18009.81
  }, {
    "pop" : "project",
    "@id" : 1,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 2,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 18009.81
  }, {
    "pop" : "screen",
    "@id" : 0,
    "child" : 1,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 18009.81
  } ]
} |
+------+------+
1 row selected (0.729 seconds)
0: jdbc:drill:>



/******************************** Query on BZ
****************************************/


0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(channelid=[$0], EXPR$1=[$1])
00-02        UnionExchange
01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
01-02            Project(channelid=[$0], EXPR$1=[$1])
01-03              HashToRandomExchange(dist0=[[$0]])
02-01                UnorderedMuxExchange
03-01                  Project(channelid=[$0], EXPR$1=[$1],
E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
03-03                      Scan(groupscan=[EasyGroupScan
[selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2,
numFiles=1, columns=[`channelid`, `serverTime`],
files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 196611,
    "userName" : "hadoop",
    "files" : [
"hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "hdfs://namenode:9000",
      "config" : null,
      "workspaces" : {
        "root" : {
          "location" : "/tmp/",
          "writable" : true,
          "defaultInputFormat" : null
        },
        "tmp" : {
          "location" : "/tmp",
          "writable" : true,
          "defaultInputFormat" : null
        }
      },
      "formats" : {
        "psv" : {
          "type" : "text",
          "extensions" : [ "tbl" ],
          "delimiter" : "|"
        },
        "csv" : {
          "type" : "text",
          "extensions" : [ "csv" ],
          "delimiter" : ","
        },
        "tsv" : {
          "type" : "text",
          "extensions" : [ "tsv" ],
          "delimiter" : "\t"
        },
        "parquet" : {
          "type" : "parquet"
        },
        "json" : {
          "type" : "json",
          "extensions" : [ "json" ]
        },
        "avro" : {
          "type" : "avro"
        }
      }
    },
    "format" : {
      "type" : "json",
      "extensions" : [ "json" ]
    },
    "columns" : [ "`channelid`", "`serverTime`" ],
    "selectionRoot" :
"hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
    "cost" : 1148224.0
  }, {
    "pop" : "hash-aggregate",
    "@id" : 196610,
    "child" : 196611,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "count(`serverTime`) "
    } ],
    "cost" : 574112.0
  }, {
    "pop" : "project",
    "@id" : 196609,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    }, {
      "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
      "expr" : "hash32asdouble(`channelid`) "
    } ],
    "child" : 196610,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "unordered-mux-exchange",
    "@id" : 131073,
    "child" : 196609,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "hash-to-random-exchange",
    "@id" : 65539,
    "child" : 131073,
    "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "project",
    "@id" : 65538,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 65539,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "hash-aggregate",
    "@id" : 65537,
    "child" : 65538,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "$sum0(`EXPR$1`) "
    } ],
    "cost" : 57411.2
  }, {
    "pop" : "union-exchange",
    "@id" : 2,
    "child" : 65537,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 11482.24
  }, {
    "pop" : "project",
    "@id" : 1,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 2,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 11482.24
  }, {
    "pop" : "screen",
    "@id" : 0,
    "child" : 1,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 11482.24
  } ]
} |
+------+------+
1 row selected (0.381 seconds)
0: jdbc:drill:>


On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com> wrote:

> Can you please do an explain plan over the two aggregate queries. That way
> we can know where most of the time is being spent, is it in the query
> planning phase or is it query execution that is taking longer. Please share
> the query plans and the time taken for those explain plan statements.
>
> On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <sh...@games24x7.com>
> wrote:
>
> > It is plain json (1 json per line).
> > Each json message size = ~4kb
> > no. of json messages = ~5 Millions.
> >
> > store.parquet.compression = snappy ( i don't think, this parameter get
> > used. As I am querying select only.)
> >
> >
> > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> > wrote:
> >
> > > What is the data format within those .gz and .bz2 files ? It is parquet
> > or
> > > JSON or plain text (CSV) ?
> > > Also, what was this config parameter `store.parquet.compression` set
> to,
> > > when ypu ran your test ?
> > >
> > > - Khurram
> > >
> > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > shankar.mane@games24x7.com
> > > >
> > > wrote:
> > >
> > > > Awaiting for response..
> > > >
> > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <sh...@games24x7.com>
> > > wrote:
> > > >
> > > > >
> > > >
> > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > >
> > > > > Below are the 2 files and their sizes (This 2 files have same
> data):
> > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > >
> > > > >
> > > > >
> > > > > Results:
> > > > >
> > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> channelid
> > ;
> > > > > +------------+----------+
> > > > > | channelid  |  EXPR$1  |
> > > > > +------------+----------+
> > > > > | 3          | 977134   |
> > > > > | 0          | 836850   |
> > > > > | 2          | 3202854  |
> > > > > +------------+----------+
> > > > > 3 rows selected (86.034 seconds)
> > > > >
> > > > >
> > > > >
> > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > channelid
> > > ;
> > > > > +------------+----------+
> > > > > | channelid  |  EXPR$1  |
> > > > > +------------+----------+
> > > > > | 3          | 977134   |
> > > > > | 0          | 836850   |
> > > > > | 2          | 3202854  |
> > > > > +------------+----------+
> > > > > 3 rows selected (459.079 seconds)
> > > > >
> > > > >
> > > > >
> > > > > Questions:
> > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > 2. How can we speed to up Bz2.  Are there any configuration to do ?
> > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > >
> > > > >
> > > > > regards,
> > > > > shankar
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Shankar Mane <sh...@games24x7.com>.

Please find the query plan for both queries. FYI: I am not seeing
any planning difference between these 2 queries except Cost.


/******************************** Query on GZ
****************************************/

0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(channelid=[$0], EXPR$1=[$1])
00-02        UnionExchange
01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
01-02            Project(channelid=[$0], EXPR$1=[$1])
01-03              HashToRandomExchange(dist0=[[$0]])
02-01                UnorderedMuxExchange
03-01                  Project(channelid=[$0], EXPR$1=[$1],
E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
03-03                      Scan(groupscan=[EasyGroupScan
[selectionRoot=hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz,
numFiles=1, columns=[`channelid`, `serverTime`],
files=[hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 196611,
    "userName" : "hadoop",
    "files" : [
"hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz" ],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "hdfs://namenode:9000",
      "config" : null,
      "workspaces" : {
        "root" : {
          "location" : "/tmp/",
          "writable" : true,
          "defaultInputFormat" : null
        },
        "tmp" : {
          "location" : "/tmp",
          "writable" : true,
          "defaultInputFormat" : null
        }
      },
      "formats" : {
        "psv" : {
          "type" : "text",
          "extensions" : [ "tbl" ],
          "delimiter" : "|"
        },
        "csv" : {
          "type" : "text",
          "extensions" : [ "csv" ],
          "delimiter" : ","
        },
        "tsv" : {
          "type" : "text",
          "extensions" : [ "tsv" ],
          "delimiter" : "\t"
        },
        "parquet" : {
          "type" : "parquet"
        },
        "json" : {
          "type" : "json",
          "extensions" : [ "json" ]
        },
        "avro" : {
          "type" : "avro"
        }
      }
    },
    "format" : {
      "type" : "json",
      "extensions" : [ "json" ]
    },
    "columns" : [ "`channelid`", "`serverTime`" ],
    "selectionRoot" :
"hdfs://namenode:9000/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz",
    "cost" : 1800981.0
  }, {
    "pop" : "hash-aggregate",
    "@id" : 196610,
    "child" : 196611,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "count(`serverTime`) "
    } ],
    "cost" : 900490.5
  }, {
    "pop" : "project",
    "@id" : 196609,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    }, {
      "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
      "expr" : "hash32asdouble(`channelid`) "
    } ],
    "child" : 196610,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "unordered-mux-exchange",
    "@id" : 131073,
    "child" : 196609,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "hash-to-random-exchange",
    "@id" : 65539,
    "child" : 131073,
    "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "project",
    "@id" : 65538,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 65539,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 180098.1
  }, {
    "pop" : "hash-aggregate",
    "@id" : 65537,
    "child" : 65538,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "$sum0(`EXPR$1`) "
    } ],
    "cost" : 90049.05
  }, {
    "pop" : "union-exchange",
    "@id" : 2,
    "child" : 65537,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 18009.81
  }, {
    "pop" : "project",
    "@id" : 1,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 2,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 18009.81
  }, {
    "pop" : "screen",
    "@id" : 0,
    "child" : 1,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 18009.81
  } ]
} |
+------+------+
1 row selected (0.729 seconds)
0: jdbc:drill:>



/******************************** Query on BZ
****************************************/


0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(channelid=[$0], EXPR$1=[$1])
00-02        UnionExchange
01-01          HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
01-02            Project(channelid=[$0], EXPR$1=[$1])
01-03              HashToRandomExchange(dist0=[[$0]])
02-01                UnorderedMuxExchange
03-01                  Project(channelid=[$0], EXPR$1=[$1],
E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
03-02                    HashAgg(group=[{0}], EXPR$1=[COUNT($1)])
03-03                      Scan(groupscan=[EasyGroupScan
[selectionRoot=hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2,
numFiles=1, columns=[`channelid`, `serverTime`],
files=[hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 196611,
    "userName" : "hadoop",
    "files" : [
"hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2" ],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "hdfs://namenode:9000",
      "config" : null,
      "workspaces" : {
        "root" : {
          "location" : "/tmp/",
          "writable" : true,
          "defaultInputFormat" : null
        },
        "tmp" : {
          "location" : "/tmp",
          "writable" : true,
          "defaultInputFormat" : null
        }
      },
      "formats" : {
        "psv" : {
          "type" : "text",
          "extensions" : [ "tbl" ],
          "delimiter" : "|"
        },
        "csv" : {
          "type" : "text",
          "extensions" : [ "csv" ],
          "delimiter" : ","
        },
        "tsv" : {
          "type" : "text",
          "extensions" : [ "tsv" ],
          "delimiter" : "\t"
        },
        "parquet" : {
          "type" : "parquet"
        },
        "json" : {
          "type" : "json",
          "extensions" : [ "json" ]
        },
        "avro" : {
          "type" : "avro"
        }
      }
    },
    "format" : {
      "type" : "json",
      "extensions" : [ "json" ]
    },
    "columns" : [ "`channelid`", "`serverTime`" ],
    "selectionRoot" :
"hdfs://namenode:9000/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2",
    "cost" : 1148224.0
  }, {
    "pop" : "hash-aggregate",
    "@id" : 196610,
    "child" : 196611,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "count(`serverTime`) "
    } ],
    "cost" : 574112.0
  }, {
    "pop" : "project",
    "@id" : 196609,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    }, {
      "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
      "expr" : "hash32asdouble(`channelid`) "
    } ],
    "child" : 196610,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "unordered-mux-exchange",
    "@id" : 131073,
    "child" : 196609,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "hash-to-random-exchange",
    "@id" : 65539,
    "child" : 131073,
    "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "project",
    "@id" : 65538,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 65539,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 114822.4
  }, {
    "pop" : "hash-aggregate",
    "@id" : 65537,
    "child" : 65538,
    "cardinality" : 1.0,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "groupByExprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    } ],
    "aggrExprs" : [ {
      "ref" : "`EXPR$1`",
      "expr" : "$sum0(`EXPR$1`) "
    } ],
    "cost" : 57411.2
  }, {
    "pop" : "union-exchange",
    "@id" : 2,
    "child" : 65537,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 11482.24
  }, {
    "pop" : "project",
    "@id" : 1,
    "exprs" : [ {
      "ref" : "`channelid`",
      "expr" : "`channelid`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`EXPR$1`"
    } ],
    "child" : 2,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 11482.24
  }, {
    "pop" : "screen",
    "@id" : 0,
    "child" : 1,
    "initialAllocation" : 1000000,
    "maxAllocation" : 10000000000,
    "cost" : 11482.24
  } ]
} |
+------+------+
1 row selected (0.381 seconds)
0: jdbc:drill:>


On Thu, Aug 4, 2016 at 3:07 PM, Khurram Faraaz <kf...@maprtech.com> wrote:

> Can you please do an explain plan over the two aggregate queries. That way
> we can know where most of the time is being spent, is it in the query
> planning phase or is it query execution that is taking longer. Please share
> the query plans and the time taken for those explain plan statements.
>
> On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <sh...@games24x7.com>
> wrote:
>
> > It is plain json (1 json per line).
> > Each json message size = ~4kb
> > no. of json messages = ~5 Millions.
> >
> > store.parquet.compression = snappy ( i don't think, this parameter get
> > used. As I am querying select only.)
> >
> >
> > On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> > wrote:
> >
> > > What is the data format within those .gz and .bz2 files ? It is parquet
> > or
> > > JSON or plain text (CSV) ?
> > > Also, what was this config parameter `store.parquet.compression` set
> to,
> > > when ypu ran your test ?
> > >
> > > - Khurram
> > >
> > > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> > shankar.mane@games24x7.com
> > > >
> > > wrote:
> > >
> > > > Awaiting for response..
> > > >
> > > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <sh...@games24x7.com>
> > > wrote:
> > > >
> > > > >
> > > >
> > > > > I am Comparing Querying speed between GZ and BZ2.
> > > > >
> > > > > Below are the 2 files and their sizes (This 2 files have same
> data):
> > > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > > >
> > > > >
> > > > >
> > > > > Results:
> > > > >
> > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by
> channelid
> > ;
> > > > > +------------+----------+
> > > > > | channelid  |  EXPR$1  |
> > > > > +------------+----------+
> > > > > | 3          | 977134   |
> > > > > | 0          | 836850   |
> > > > > | 2          | 3202854  |
> > > > > +------------+----------+
> > > > > 3 rows selected (86.034 seconds)
> > > > >
> > > > >
> > > > >
> > > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> > channelid
> > > ;
> > > > > +------------+----------+
> > > > > | channelid  |  EXPR$1  |
> > > > > +------------+----------+
> > > > > | 3          | 977134   |
> > > > > | 0          | 836850   |
> > > > > | 2          | 3202854  |
> > > > > +------------+----------+
> > > > > 3 rows selected (459.079 seconds)
> > > > >
> > > > >
> > > > >
> > > > > Questions:
> > > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > > 2. How can we speed to up Bz2.  Are there any configuration to do ?
> > > > > 3. As bz2 is splittable format, How drill using it ?
> > > > >
> > > > >
> > > > > regards,
> > > > > shankar
> > > >
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Khurram Faraaz <kf...@maprtech.com>.

Can you please do an explain plan over the two aggregate queries. That way
we can know where most of the time is being spent, is it in the query
planning phase or is it query execution that is taking longer. Please share
the query plans and the time taken for those explain plan statements.

On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <sh...@games24x7.com>
wrote:

> It is plain json (1 json per line).
> Each json message size = ~4kb
> no. of json messages = ~5 Millions.
>
> store.parquet.compression = snappy ( i don't think, this parameter get
> used. As I am querying select only.)
>
>
> On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> wrote:
>
> > What is the data format within those .gz and .bz2 files ? It is parquet
> or
> > JSON or plain text (CSV) ?
> > Also, what was this config parameter `store.parquet.compression` set to,
> > when ypu ran your test ?
> >
> > - Khurram
> >
> > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> shankar.mane@games24x7.com
> > >
> > wrote:
> >
> > > Awaiting for response..
> > >
> > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <sh...@games24x7.com>
> > wrote:
> > >
> > > >
> > >
> > > > I am Comparing Querying speed between GZ and BZ2.
> > > >
> > > > Below are the 2 files and their sizes (This 2 files have same data):
> > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > >
> > > >
> > > >
> > > > Results:
> > > >
> > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid
> ;
> > > > +------------+----------+
> > > > | channelid  |  EXPR$1  |
> > > > +------------+----------+
> > > > | 3          | 977134   |
> > > > | 0          | 836850   |
> > > > | 2          | 3202854  |
> > > > +------------+----------+
> > > > 3 rows selected (86.034 seconds)
> > > >
> > > >
> > > >
> > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> channelid
> > ;
> > > > +------------+----------+
> > > > | channelid  |  EXPR$1  |
> > > > +------------+----------+
> > > > | 3          | 977134   |
> > > > | 0          | 836850   |
> > > > | 2          | 3202854  |
> > > > +------------+----------+
> > > > 3 rows selected (459.079 seconds)
> > > >
> > > >
> > > >
> > > > Questions:
> > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > 2. How can we speed to up Bz2.  Are there any configuration to do ?
> > > > 3. As bz2 is splittable format, How drill using it ?
> > > >
> > > >
> > > > regards,
> > > > shankar
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Khurram Faraaz <kf...@maprtech.com>.

Can you please do an explain plan over the two aggregate queries. That way
we can know where most of the time is being spent, is it in the query
planning phase or is it query execution that is taking longer. Please share
the query plans and the time taken for those explain plan statements.

On Mon, Aug 1, 2016 at 3:46 PM, Shankar Mane <sh...@games24x7.com>
wrote:

> It is plain json (1 json per line).
> Each json message size = ~4kb
> no. of json messages = ~5 Millions.
>
> store.parquet.compression = snappy ( i don't think, this parameter get
> used. As I am querying select only.)
>
>
> On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com>
> wrote:
>
> > What is the data format within those .gz and .bz2 files ? It is parquet
> or
> > JSON or plain text (CSV) ?
> > Also, what was this config parameter `store.parquet.compression` set to,
> > when ypu ran your test ?
> >
> > - Khurram
> >
> > On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <
> shankar.mane@games24x7.com
> > >
> > wrote:
> >
> > > Awaiting for response..
> > >
> > > On 30-Jul-2016 3:20 PM, "Shankar Mane" <sh...@games24x7.com>
> > wrote:
> > >
> > > >
> > >
> > > > I am Comparing Querying speed between GZ and BZ2.
> > > >
> > > > Below are the 2 files and their sizes (This 2 files have same data):
> > > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > > >
> > > >
> > > >
> > > > Results:
> > > >
> > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid
> ;
> > > > +------------+----------+
> > > > | channelid  |  EXPR$1  |
> > > > +------------+----------+
> > > > | 3          | 977134   |
> > > > | 0          | 836850   |
> > > > | 2          | 3202854  |
> > > > +------------+----------+
> > > > 3 rows selected (86.034 seconds)
> > > >
> > > >
> > > >
> > > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by
> channelid
> > ;
> > > > +------------+----------+
> > > > | channelid  |  EXPR$1  |
> > > > +------------+----------+
> > > > | 3          | 977134   |
> > > > | 0          | 836850   |
> > > > | 2          | 3202854  |
> > > > +------------+----------+
> > > > 3 rows selected (459.079 seconds)
> > > >
> > > >
> > > >
> > > > Questions:
> > > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > > 2. How can we speed to up Bz2.  Are there any configuration to do ?
> > > > 3. As bz2 is splittable format, How drill using it ?
> > > >
> > > >
> > > > regards,
> > > > shankar
> > >
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Shankar Mane <sh...@games24x7.com>.

It is plain json (1 json per line).
Each json message size = ~4kb
no. of json messages = ~5 Millions.

store.parquet.compression = snappy ( i don't think, this parameter get
used. As I am querying select only.)


On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com> wrote:

> What is the data format within those .gz and .bz2 files ? It is parquet or
> JSON or plain text (CSV) ?
> Also, what was this config parameter `store.parquet.compression` set to,
> when ypu ran your test ?
>
> - Khurram
>
> On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <shankar.mane@games24x7.com
> >
> wrote:
>
> > Awaiting for response..
> >
> > On 30-Jul-2016 3:20 PM, "Shankar Mane" <sh...@games24x7.com>
> wrote:
> >
> > >
> >
> > > I am Comparing Querying speed between GZ and BZ2.
> > >
> > > Below are the 2 files and their sizes (This 2 files have same data):
> > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > >
> > >
> > >
> > > Results:
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > > +------------+----------+
> > > | channelid  |  EXPR$1  |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (86.034 seconds)
> > >
> > >
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid
> ;
> > > +------------+----------+
> > > | channelid  |  EXPR$1  |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (459.079 seconds)
> > >
> > >
> > >
> > > Questions:
> > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > 2. How can we speed to up Bz2.  Are there any configuration to do ?
> > > 3. As bz2 is splittable format, How drill using it ?
> > >
> > >
> > > regards,
> > > shankar
> >
>

Re: [Drill-Questions] Speed difference between GZ and BZ2

Posted by Shankar Mane <sh...@games24x7.com>.

It is plain json (1 json per line).
Each json message size = ~4kb
no. of json messages = ~5 Millions.

store.parquet.compression = snappy ( i don't think, this parameter get
used. As I am querying select only.)


On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kf...@maprtech.com> wrote:

> What is the data format within those .gz and .bz2 files ? It is parquet or
> JSON or plain text (CSV) ?
> Also, what was this config parameter `store.parquet.compression` set to,
> when ypu ran your test ?
>
> - Khurram
>
> On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <shankar.mane@games24x7.com
> >
> wrote:
>
> > Awaiting for response..
> >
> > On 30-Jul-2016 3:20 PM, "Shankar Mane" <sh...@games24x7.com>
> wrote:
> >
> > >
> >
> > > I am Comparing Querying speed between GZ and BZ2.
> > >
> > > Below are the 2 files and their sizes (This 2 files have same data):
> > > kafka_3_25-Jul-2016-12a.json.gz = 1.8G
> > > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G
> > >
> > >
> > >
> > > Results:
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > > +------------+----------+
> > > | channelid  |  EXPR$1  |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (86.034 seconds)
> > >
> > >
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid
> ;
> > > +------------+----------+
> > > | channelid  |  EXPR$1  |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (459.079 seconds)
> > >
> > >
> > >
> > > Questions:
> > > 1. As per above Test: Gz is 6x fast than Bz2. why is that ?
> > > 2. How can we speed to up Bz2.  Are there any configuration to do ?
> > > 3. As bz2 is splittable format, How drill using it ?
> > >
> > >
> > > regards,
> > > shankar
> >
>