Posted to dev@drill.apache.org by Khurram Faraaz <kf...@maprtech.com> on 2015/06/10 03:55:42 UTC

Window function query takes too long to complete and return results

A query that uses window functions takes too long to complete and return
results. It returns close to a million records (1,000,007 rows), which took
533.8 seconds, roughly nine minutes.
The input CSV file has two columns: one integer and one varchar. Please let
me know if this needs to be investigated; I can file a JIRA to track it if
required.

Size of the input CSV file:

[root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv

-rwxr-xr-x   3 root root   27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv

{code}

select count(*) over(partition by cast(columns[1] as varchar(25)) order by
cast(columns[0] as bigint)) from `manyDuplicates.csv`;

...

1,000,007 rows selected (533.857 seconds)
{code}

There are five distinct values in columns[1] in the CSV file, i.e. five
window partitions.

{code}

0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
`manyDuplicates.csv`;

+-----------------------+
|        EXPR$0         |
+-----------------------+
| FFFFGGGGHHHHIIIIJJJJ  |
| PPPPQQQQRRRRSSSSTTTT  |
| AAAABBBBCCCCDDDDEEEE  |
| UUUUVVVVWWWWXXXXZZZZ  |
| KKKKLLLLMMMMNNNNOOOO  |
+-----------------------+

5 rows selected (1.906 seconds)
{code}

Here is the count for each of those values in columns[1]

{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';

+---------+
| EXPR$0  |
+---------+
| 200484  |
+---------+

1 row selected (0.961 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';

+---------+
| EXPR$0  |
+---------+
| 199353  |
+---------+

1 row selected (0.86 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';

+---------+
| EXPR$0  |
+---------+
| 200702  |
+---------+

1 row selected (0.826 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';

+---------+
| EXPR$0  |
+---------+
| 199916  |
+---------+

1 row selected (0.851 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';

+---------+
| EXPR$0  |
+---------+
| 199552  |
+---------+

1 row selected (0.827 seconds)
{code}

Thanks,
Khurram

Re: Window function query takes too long to complete and return results

Posted by Khurram Faraaz <kf...@maprtech.com>.
Great! I will re-run the query on the latest build and also verify the
results against Postgres.
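As an independent sanity check before comparing against another database: with the default RANGE frame, `count(*) over (partition by k order by v)` counts, for each row, all rows in its partition whose order key is less than or equal to its own (peers included). That expected semantics can be sketched in plain Python (this is an editorial sketch, not Drill code):

```python
from bisect import bisect_right
from collections import defaultdict

def running_count_over(rows):
    """count(*) over (partition by k order by v) for (k, v) pairs.

    With the default RANGE frame, each row's count is the number of rows
    in its partition whose order key is <= its own (peers included)."""
    sorted_vals = defaultdict(list)
    for k, v in rows:
        sorted_vals[k].append(v)
    for vals in sorted_vals.values():
        vals.sort()
    # bisect_right counts all values <= v in the sorted partition.
    return [bisect_right(sorted_vals[k], v) for k, v in rows]

rows = [("A", 1), ("A", 2), ("A", 2), ("B", 5)]
print(running_count_over(rows))  # -> [1, 3, 3, 1]
```

Note the two ("A", 2) rows are peers, so both report 3.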


Re: Window function query takes too long to complete and return results

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
I tried the query using the new implementation (DRILL-3200) and it's much
faster: 14 seconds compared to 523 seconds with the current implementation.
I didn't check the results, though.




-- 

Abdelhakim Deneche

Software Engineer


Re: Window function query takes too long to complete and return results

Posted by Khurram Faraaz <kf...@maprtech.com>.
JIRA DRILL-3269 is opened to track this behavior.
I tried iterating over the ResultSet from a JDBC program without doing any
real processing of the returned rows; it still took close to nine minutes
to complete execution.

Here is a snippet of what I did from JDBC:

{code}
String query = "select count(*) over(partition by cast(columns[1] as varchar(25)) "
    + "order by cast(columns[0] as bigint)) from `manyDuplicates.csv`";

ResultSet rs = stmt.executeQuery(query);
while (rs.next()) {
    System.out.println("1");
}
{code}


Re: Window function query takes too long to complete and return results

Posted by Steven Phillips <sp...@maprtech.com>.
In cases like this, where you are printing millions of records in sqlline,
you should pipe the output to /dev/null or to a file and measure the
performance that way. I'm guessing that most of the time in this case is
spent printing the output to the console, and is thus really unrelated to
Drill performance. If piping the data to a file or /dev/null makes the
query run much faster, then it probably isn't a real issue.

Also, any time you are investigating a performance-related issue, you
should check the query profile. In this case, I suspect you might see that
most of the time is spent in the WAIT time of the SCREEN operator, which
would indicate that client-side processing is slowing the query down.
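The measurement idea can be sketched with a simulated million-row result set; with Drill's sqlline the equivalent would be something like `sqlline -u <jdbc-url> --run=query.sql > /dev/null` (exact flags vary by version, so treat that invocation as an assumption):

```shell
# Simulate a ~1M-row result set; discarding the output removes
# console-rendering cost from the measurement.
seq 1000000 > /dev/null

# Capturing to a file also avoids terminal rendering while keeping the rows
# for later inspection.
seq 1000000 > /tmp/result_rows.txt
wc -l < /tmp/result_rows.txt
```

Wrapping either command in `time` then measures the query and transfer cost without the terminal in the loop.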




-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Window function query takes too long to complete and return results

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
Please open a JIRA issue, and please provide the test file (compressed) or
a script to generate similar data.

Thanks!
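A generator for similar data could look like the following sketch. The output path, rows-per-key count, and integer range are assumptions; the five varchar values are the distinct columns[1] values reported in this thread:

```python
import csv
import random

# The five distinct columns[1] values reported in the thread.
KEYS = [
    "AAAABBBBCCCCDDDDEEEE",
    "FFFFGGGGHHHHIIIIJJJJ",
    "KKKKLLLLMMMMNNNNOOOO",
    "PPPPQQQQRRRRSSSSTTTT",
    "UUUUVVVVWWWWXXXXZZZZ",
]

def write_many_duplicates(path, rows_per_key=200_000, seed=42):
    """Write a two-column CSV (random int, one of five varchar keys),
    roughly matching the ~1M-row file described in the report."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for key in KEYS:
            for _ in range(rows_per_key):
                writer.writerow([rng.randint(0, 1_000_000), key])

if __name__ == "__main__":
    write_many_duplicates("/tmp/manyDuplicates.csv")
```

With the defaults this produces five partitions of 200,000 rows each, close to the per-key counts measured above.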




-- 

Abdelhakim Deneche

Software Engineer
