Posted to user@hawq.apache.org by 陶进 <to...@outlook.com> on 2016/01/29 06:34:04 UTC

HAWQ performance on a 10 billion row table

hi guys,

We have several huge tables, and some of them will have more than 10
billion rows. Each table has the same columns, and each row is about 100
bytes.
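(As a rough back-of-the-envelope figure, assuming no compression, that is
about 10,000,000,000 rows x 100 bytes, so roughly 1 TB of raw data per
table.)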

Our queries run against a single table to filter and sort some records,
for example:

    select a, b, c from t where a = 1 and b = 'hello' order by 1, 2;

Today we use MongoDB, and the biggest table has 4 billion rows; such a
query returns in about 10 seconds. Now we want to use HAWQ as our query
engine. Could it run the above query in 10 seconds? What server hardware
would be needed, and how many nodes?


Thanks.



Re: HAWQ performance on a 10 billion row table

Posted by "yuwei.sung@gmail.com" <yu...@gmail.com>.
If the speed of filtering and sorting is the main focus of your queries,
with not many join operations, you might not get much benefit from HAWQ or
GPDB; they are analytic databases.
What's the size of the total dataset? Maybe Geode can help in your case.

On Thursday, January 28, 2016, 陶进 <to...@outlook.com> wrote:


-- 
Yu-wei Sung

Re: HAWQ performance on a 10 billion row table

Posted by Konstantin Boudnik <co...@apache.org>.
billions with 'B': looks like MongoDB is web-scale after all!

On Fri, Jan 29, 2016 at 11:59AM, Alexey Grishchenko wrote:

Re: HAWQ performance on a 10 billion row table

Posted by Alexey Grishchenko <ag...@pivotal.io>.
The main thing to consider is that HAWQ does not have indexes, so the only
way to limit the amount of data it scans is to use partitioning plus
columnar tables (Parquet).
In contrast, Greenplum has indexes, and if your query returns hundreds of
records from a 10,000,000,000-row table that can be a good fit for you.
But you should be careful here: if you have WHERE conditions on different
columns, you might end up building many indexes, which can lead to a
situation where the index size for the table is greater than the size of
its data.
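As a rough sketch only (the index names below are made up; the table and
columns come from the example query earlier in the thread), the Greenplum
side could look like this:

    -- Greenplum, not HAWQ: a composite b-tree index matching the filter
    -- "where a = 1 and b = 'hello'" from the example query.
    create index idx_t_a_b on t (a, b);

    -- Each additional filter pattern tends to need its own index, e.g.
    --   create index idx_t_c on t (c);
    -- which is how the combined index footprint can outgrow the table.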

On Fri, Jan 29, 2016 at 10:26 AM, 陶进 <to...@outlook.com> wrote:




-- 
Best regards,
Alexey Grishchenko

Re: HAWQ performance on a 10 billion row table

Posted by 陶进 <to...@outlook.com>.
Hi Martin,

Many thanks for your kind help.

I could find very few performance reports for Greenplum/HAWQ on Google,
especially on 10 billion rows of data. Your reply inspires confidence
in me. :-)

Our real-time queries only return hundreds of rows from a huge table. I'll
test and tune HAWQ once our machines are available to verify the
performance.
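As an illustration of the kind of check I mean (a sketch only, reusing the
hypothetical table from the example query), something like:

    -- show the query plan and the actual run time for the example query
    explain analyze
    select a, b, c from t where a = 1 and b = 'hello' order by 1, 2;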

Thank you again for your prompt reply.


Best regards!

Tony.


On 2016/1/29 17:29, Martin Visser wrote:

Re: HAWQ performance on a 10 billion row table

Posted by Martin Visser <mv...@pivotal.io>.
Hi,

For queries like that, there are a couple of HAWQ features that will help
you. One is columnar storage such as Parquet; this helps when you are only
selecting columns a, b, c and the table has columns a, b, ..., z. The other
is partitioning, which reduces the initial set without having to read the
data. How to choose the partitioning will depend on your query patterns and
the selectivity of the column values. For example, in your query you could
partition on column a, but as mentioned, if a only had the values 1 and 2,
that would only halve the number of rows being scanned.
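As an illustrative sketch only, under assumed names and storage options
(the table, column types, and partition values are hypothetical, modelled
on the example query), such a table definition could look roughly like:

    -- HAWQ: columnar (Parquet) storage plus list partitioning on column a,
    -- so a filter like "where a = 1" only scans the matching partition.
    create table t (
        a int,
        b varchar(64),
        c varchar(64)
        -- ... remaining columns ...
    )
    with (appendonly = true, orientation = parquet, compresstype = snappy)
    distributed randomly
    partition by list (a)
    (
        partition p1 values (1),
        partition p2 values (2),
        default partition other
    );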

Another observation is that you are selecting individual rows in your
example rather than grouped results. Potentially this could result in a lot
of data having to be returned by the query.  Is that the case?  How many
rows would you expect queries to return?

The answer to your 10-second question is that it is certainly possible,
thanks to HAWQ's linear scalability, but it depends on a number of factors.

hth
Martin

On Fri, Jan 29, 2016 at 5:34 AM, 陶进 <to...@outlook.com> wrote:
