You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by Lew Jackman <le...@netzero.net> on 2017/07/27 20:36:19 UTC

Full Scan - is it really?

I am joining two tables by using only a key fields in two tables.
(if this were straight hbase, I know I would code with some range scans)
There are many billions of rows in each table.
I am trying to understand the explain plan as I am having difficulties with query 
failures under various load testing scenarios.
When used in sqlline, the query for which I am posting the plan  can take 6-30 seconds.

Below is my explain plan.

CLIENT 173251-CHUNK 87628582101 ROWS 53384270894992 BYTES PARALLEL 1-WAY ROUND ROBIN FULL 
SCAN OVER CLICKS                                                         
     SERVER FILTER BY FIRST KEY ONLY                                                      
                                                                                    
     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)                                             
                                                                                    
         CLIENT 5181-CHUNK 14470353216 ROWS 1562798147328 BYTES PARALLEL 256-WAY ROUND ROBIN SKIP SCAN ON 25600 KEYS OVER PRODUCT [0,'020lj7'] - [255,'z4l777']   
             SERVER FILTER BY FIRST KEY ONLY                                              
                                                                                    
     DYNAMIC SERVER FILTER BY CLICKS.SHA_KEY IN (PRODUCT.SHA_KEY)   

Question: Does " FULL SCAN " truly mean full scan?

I do know that running the hbase map reduce RowCounter to perform a row count, which 
truly is a full scan, can take over 30 minutes. So I am very confused about - when is a 
FULL SCAN really a full scan?

Thanks for any insight.

____________________________________________________________
Actress Maggie Q Shocks With Her Solution To Tummy Troubles
ActivatedYou
http://thirdpartyoffers.netzero.net/TGL3231/597a4efb256f54efb5d0cst02vuc

Re: Full Scan - is it really?

Posted by James Taylor <ja...@apache.org>.
I agree - the explain plan is confusing. Would be good to capture fixing
that in a JIRA. Would you mind sharing your DDL and queries too?

On Thu, Jul 27, 2017 at 3:55 PM, Josh Elser <el...@apache.org> wrote:

>
>
> On 7/27/17 4:36 PM, Lew Jackman wrote:
>
>> I am joining two tables by using only a key fields in two tables.
>> (if this were straight hbase, I know I would code with some range scans)
>> There are many billions of rows in each table.
>> I am trying to understand the explain plan as I am having difficulties
>> with query
>> failures under various load testing scenarios.
>> When used in sqlline, the query for which I am posting the plan  can take
>> 6-30 seconds.
>>
>> Below is my explain plan.
>>
>> CLIENT 173251-CHUNK 87628582101 ROWS 53384270894992 BYTES PARALLEL 1-WAY
>> ROUND ROBIN FULL
>> SCAN OVER CLICKS
>>       SERVER FILTER BY FIRST KEY ONLY
>>
>>                  PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)
>>
>>                      CLIENT 5181-CHUNK 14470353216 ROWS 1562798147328 BYTES
>> PARALLEL 256-WAY ROUND ROBIN SKIP SCAN ON 25600 KEYS OVER PRODUCT
>> [0,'020lj7'] - [255,'z4l777']
>>               SERVER FILTER BY FIRST KEY ONLY
>>
>>                  DYNAMIC SERVER FILTER BY CLICKS.SHA_KEY IN
>> (PRODUCT.SHA_KEY)
>>
>> Question: Does " FULL SCAN " truly mean full scan?
>>
>> I do know that running the hbase map reduce RowCounter to perform a row
>> count, which
>> truly is a full scan, can take over 30 minutes. So I am very confused
>> about - when is a
>> FULL SCAN really a full scan?
>>
>> Thanks for any insight.
>>
>
> Yes and no, I think.
>
> You're joining SHA_KEY from the CLICKS and PRODUCT table, but your "search
> space" on CLICKS is the entire table -- the query doesn't have any
> information that it can use to limit your query to only a portion of the
> CLICKS table. Phoenix can make intelligent decisions about filtering data
> based on that inner-join which preclude the query from enumerating all of
> the key-values pairs in CLICKS.
>
> So, yes the query is doing less work than a RowCounter would take, but
> technically it's still looking at the entire CLICKS table.
>

Re: Full Scan - is it really?

Posted by Josh Elser <el...@apache.org>.

On 7/27/17 4:36 PM, Lew Jackman wrote:
> I am joining two tables by using only a key fields in two tables.
> (if this were straight hbase, I know I would code with some range scans)
> There are many billions of rows in each table.
> I am trying to understand the explain plan as I am having difficulties with query
> failures under various load testing scenarios.
> When used in sqlline, the query for which I am posting the plan  can take 6-30 seconds.
> 
> Below is my explain plan.
> 
> CLIENT 173251-CHUNK 87628582101 ROWS 53384270894992 BYTES PARALLEL 1-WAY ROUND ROBIN FULL
> SCAN OVER CLICKS
>       SERVER FILTER BY FIRST KEY ONLY
>                                                                                      
>       PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)
>                                                                                      
>           CLIENT 5181-CHUNK 14470353216 ROWS 1562798147328 BYTES PARALLEL 256-WAY ROUND ROBIN SKIP SCAN ON 25600 KEYS OVER PRODUCT [0,'020lj7'] - [255,'z4l777']
>               SERVER FILTER BY FIRST KEY ONLY
>                                                                                      
>       DYNAMIC SERVER FILTER BY CLICKS.SHA_KEY IN (PRODUCT.SHA_KEY)
> 
> Question: Does " FULL SCAN " truly mean full scan?
> 
> I do know that running the hbase map reduce RowCounter to perform a row count, which
> truly is a full scan, can take over 30 minutes. So I am very confused about - when is a
> FULL SCAN really a full scan?
> 
> Thanks for any insight.

Yes and no, I think.

You're joining SHA_KEY from the CLICKS and PRODUCT table, but your 
"search space" on CLICKS is the entire table -- the query doesn't have 
any information that it can use to limit your query to only a portion of 
the CLICKS table. Phoenix can make intelligent decisions about filtering 
data based on that inner-join which preclude the query from enumerating 
all of the key-values pairs in CLICKS.

So, yes the query is doing less work than a RowCounter would take, but 
technically it's still looking at the entire CLICKS table.