You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@phoenix.apache.org by ashish tapdiya <as...@gmail.com> on 2015/12/03 21:34:15 UTC

understanding execution semantics...

Hi,

I am profiling TPC-H queries using Phoenix. For query no. 6 and db size 1GB
(lineitem table size is around 760 MB),

Query 6 : select sum(l_extendedprice * l_discount) as revenue from
lineitem_sf1 where l_shipdate >= TO_DATE('1993-01-01') and l_shipdate <
TO_DATE('1994-01-01') and l_discount between 0.06 and 0.08 and l_quantity <
24

the execution time recorded using following code:

long startTime = System.currentTimeMillis();
rset = stmt.executeQuery();
long stopTime = System.currentTimeMillis();
long elapsedTime = stopTime - startTime;

first run - 240 ms
second run onwards - 80 ms

However, when I iterate the result set (single row), query response time
including result set iteration shoots up to 19 seconds.

Does the query gets executed with stmt.executeQuery() or phoenix does not
execute query until first resultset.next() is invoked?

Cluster includes 4 slaves nodes. Phoenix version is 4.3.0

Thanks,
~Ashish

Re: understanding execution semantics...

Posted by James Taylor <ja...@apache.org>.

On Tue, Dec 8, 2015 at 9:33 PM, ashish tapdiya <as...@gmail.com>
wrote:

> Hello James,
>
> Thanks for the response. I will look into adding relevant TPC-H indexes.
>
> We are doing a performance study comparing Impala, Phoenix using TPC-H
> queries. For scale factor 1, I am not able to get numbers exhibiting trend
> similar to performance graph on our website comparing Impala and Phoenix
> .
>
> I have few questions regarding the performance graph comparing Phoenix and
> Impala for aggregation query (select count(1))
>
> Q1: Was the Impala data stored in parquet/text format?
>

No, the data is stored in HBase for both Impala and Phoenix as stated by
the title "Phoenix vs Impala (running over HBase)"


> Q2: What and how performance parameters (hadoop, hbase) were tuned for
> phoenix.
>

No tuning, but I'd recommend doing a major compaction prior to running the
queries.


> Q3. Can I get the same population script so that I can report numbers from
> the local cluster.
>

You can use our bin/performance.py script to generate the data.

>
> Thanks,
> Ashish
>
> On Thu, Dec 3, 2015 at 5:35 PM, James Taylor <ja...@apache.org>
> wrote:
>
>> Hi Ashish,
>> Please make sure to add a secondary index on l_shipdate + l_discount + l_quantity
>> for that query. There's a kind of well-known, canonical description of the
>> optimal indexes to add for the TPC benchmarks that I can't seem to find
>> (but maybe someone else can point you to it). You should include the
>> time to call resultset.next() as that call will block until the result is
>> calculated. Also, subsequent runs of the query will always be faster due to
>> various caching that happens (OS, HBase block cache, etc.) as well as the
>> JIT compiler kicking in.
>>
>> Thanks,
>> James
>>
>> On Thu, Dec 3, 2015 at 12:34 PM, ashish tapdiya <as...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am profiling TPC-H queries using Phoenix. For query no. 6 and db size
>>> 1GB (lineitem table size is around 760 MB),
>>>
>>> Query 6 : select sum(l_extendedprice * l_discount) as revenue from
>>> lineitem_sf1 where l_shipdate >= TO_DATE('1993-01-01') and l_shipdate <
>>> TO_DATE('1994-01-01') and l_discount between 0.06 and 0.08 and l_quantity <
>>> 24
>>>
>>> the execution time recorded using following code:
>>>
>>> long startTime = System.currentTimeMillis();
>>> rset = stmt.executeQuery();
>>> long stopTime = System.currentTimeMillis();
>>> long elapsedTime = stopTime - startTime;
>>>
>>> first run - 240 ms
>>> second run onwards - 80 ms
>>>
>>> However, when I iterate the result set (single row), query response time
>>> including result set iteration shoots up to 19 seconds.
>>>
>>> Does the query gets executed with stmt.executeQuery() or phoenix does
>>> not execute query until first resultset.next() is invoked?
>>>
>>> Cluster includes 4 slaves nodes. Phoenix version is 4.3.0
>>>
>>> Thanks,
>>> ~Ashish
>>>
>>
>>
>

Re: understanding execution semantics...

Posted by ashish tapdiya <as...@gmail.com>.

Hello James,

Thanks for the response. I will look into adding relevant TPC-H indexes.

We are doing a performance study comparing Impala, Phoenix using TPC-H
queries. For scale factor 1, I am not able to get numbers exhibiting trend
similar to performance graph on our website comparing Impala and Phoenix
.

I have few questions regarding the performance graph comparing Phoenix and
Impala for aggregation query (select count(1))

Q1: Was the Impala data stored in parquet/text format?

Q2: What and how performance parameters (hadoop, hbase) were tuned for
phoenix.

Q3. Can I get the same population script so that I can report numbers from
the local cluster.

Thanks,
Ashish

On Thu, Dec 3, 2015 at 5:35 PM, James Taylor <ja...@apache.org> wrote:

> Hi Ashish,
> Please make sure to add a secondary index on l_shipdate + l_discount + l_quantity
> for that query. There's a kind of well-known, canonical description of the
> optimal indexes to add for the TPC benchmarks that I can't seem to find
> (but maybe someone else can point you to it). You should include the time
> to call resultset.next() as that call will block until the result is
> calculated. Also, subsequent runs of the query will always be faster due to
> various caching that happens (OS, HBase block cache, etc.) as well as the
> JIT compiler kicking in.
>
> Thanks,
> James
>
> On Thu, Dec 3, 2015 at 12:34 PM, ashish tapdiya <as...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am profiling TPC-H queries using Phoenix. For query no. 6 and db size
>> 1GB (lineitem table size is around 760 MB),
>>
>> Query 6 : select sum(l_extendedprice * l_discount) as revenue from
>> lineitem_sf1 where l_shipdate >= TO_DATE('1993-01-01') and l_shipdate <
>> TO_DATE('1994-01-01') and l_discount between 0.06 and 0.08 and l_quantity <
>> 24
>>
>> the execution time recorded using following code:
>>
>> long startTime = System.currentTimeMillis();
>> rset = stmt.executeQuery();
>> long stopTime = System.currentTimeMillis();
>> long elapsedTime = stopTime - startTime;
>>
>> first run - 240 ms
>> second run onwards - 80 ms
>>
>> However, when I iterate the result set (single row), query response time
>> including result set iteration shoots up to 19 seconds.
>>
>> Does the query gets executed with stmt.executeQuery() or phoenix does not
>> execute query until first resultset.next() is invoked?
>>
>> Cluster includes 4 slaves nodes. Phoenix version is 4.3.0
>>
>> Thanks,
>> ~Ashish
>>
>
>

Re: understanding execution semantics...

Posted by James Taylor <ja...@apache.org>.

Hi Ashish,
Please make sure to add a secondary index on l_shipdate + l_discount +
l_quantity
for that query. There's a kind of well-known, canonical description of the
optimal indexes to add for the TPC benchmarks that I can't seem to find
(but maybe someone else can point you to it). You should include the time
to call resultset.next() as that call will block until the result is
calculated. Also, subsequent runs of the query will always be faster due to
various caching that happens (OS, HBase block cache, etc.) as well as the
JIT compiler kicking in.

Thanks,
James

On Thu, Dec 3, 2015 at 12:34 PM, ashish tapdiya <as...@gmail.com>
wrote:

> Hi,
>
> I am profiling TPC-H queries using Phoenix. For query no. 6 and db size
> 1GB (lineitem table size is around 760 MB),
>
> Query 6 : select sum(l_extendedprice * l_discount) as revenue from
> lineitem_sf1 where l_shipdate >= TO_DATE('1993-01-01') and l_shipdate <
> TO_DATE('1994-01-01') and l_discount between 0.06 and 0.08 and l_quantity <
> 24
>
> the execution time recorded using following code:
>
> long startTime = System.currentTimeMillis();
> rset = stmt.executeQuery();
> long stopTime = System.currentTimeMillis();
> long elapsedTime = stopTime - startTime;
>
> first run - 240 ms
> second run onwards - 80 ms
>
> However, when I iterate the result set (single row), query response time
> including result set iteration shoots up to 19 seconds.
>
> Does the query gets executed with stmt.executeQuery() or phoenix does not
> execute query until first resultset.next() is invoked?
>
> Cluster includes 4 slaves nodes. Phoenix version is 4.3.0
>
> Thanks,
> ~Ashish
>