You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by David Morel <dm...@gmail.com> on 2013/01/28 10:39:57 UTC

Cartesian product detection in the query plan?

Hi everyone,

I had to kill some queries that were taking forever, and it turns out
they were doing cartesian products (missing ON clause on a JOIN).

I wonder how I could see that in the EXPLAIN output (which I still find
a bit cryptic). Specifically, the stage that it was stuck in was this:

   Stage: Stage-7
     Map Reduce
       Alias -> Map Operator Tree:
         $INTNAME
             Reduce Output Operator
               sort order:
               tag: 1
               value expressions:
                     expr: _col1
                     type: int
         $INTNAME1
             Reduce Output Operator
               sort order:
               tag: 0
               value expressions:
                     expr: _col0
                     type: bigint
                     expr: _col1
                     type: string
       Reduce Operator Tree:
         Join Operator
           condition map:
                Inner Join 0 to 1
           condition expressions:
             0 {VALUE._col0} {VALUE._col1}
             1 {VALUE._col1}
           handleSkewJoin: false
           outputColumnNames: _col0, _col1, _col3
           File Output Operator
             compressed: true
             GlobalTableId: 0
             table:
                 input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                 output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Is there anything in there that should have alerted me?

I found out by looking at the query, but I wonder if the query plan (if
I could read it) would have given me that information.

Thanks a lot

David Morel

Re: Cartesian product detection in the query plan?

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
By the way, v0.10.0 adds a new CROSS JOIN feature, when you really intended
to do one ;)

SELECT a.x, b.y FROM tablea a CROSS JOIN tableb b;



On Mon, Jan 28, 2013 at 10:58 AM, Edward Capriolo <ed...@gmail.com>wrote:

> Currently order by is very inefficient in hive. Hopefully you can use sort
> by in most cases.
>
> It should be "visible" in the plan. The plan should be very different if
> you are using the ON clause vs not.  Can it be easily detected is another
> question.
>
>
> On Mon, Jan 28, 2013 at 11:45 AM, David Morel <dm...@gmail.com> wrote:
>
>> On 28 Jan 2013, at 14:29, Edward Capriolo wrote:
>>
>>  Iirc hive.mapred.mode strict should prevent this. If not we should add
>>> it.
>>>
>>
>> hi Edward,
>>
>> Yes, that's indeed what the book claims (quoting):
>>
>>   hive> SELECT * FROM fracture_act JOIN fracture_ads
>>  > WHERE fracture_act.planner_id = fracture_ads.planner_id;
>>   FAILED: Error in semantic analysis: In strict mode, cartesian product
>> is not allowed. If you really want to perform the operation,
>>   +set hive.mapred.mode=nonstrict+
>>
>> I am about to re-enable this setting on my cluster (after fixing all the
>> queries that it broke, especially all the ORDER BY ones :-) but I hoped
>> it was visible right there in the query plan, or in some other way. If
>> Hive can detect it, it should be visible somewhere, right?
>>
>> Thanks!
>>
>> david
>>
>>
>>> On Monday, January 28, 2013, David Morel <dm...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I had to kill some queries that were taking forever, and it turns out
>>>> they were doing cartesian products (missing ON clause on a JOIN).
>>>>
>>>> I wonder how I could see that in the EXPLAIN output (which I still find
>>>> a bit cryptic). Specifically, the stage that it was stuck in was this:
>>>>
>>>> Stage: Stage-7
>>>> Map Reduce
>>>> Alias -> Map Operator Tree:
>>>>   $INTNAME
>>>>       Reduce Output Operator
>>>>         sort order:
>>>>         tag: 1
>>>>         value expressions:
>>>>               expr: _col1
>>>>               type: int
>>>>   $INTNAME1
>>>>       Reduce Output Operator
>>>>         sort order:
>>>>         tag: 0
>>>>         value expressions:
>>>>               expr: _col0
>>>>               type: bigint
>>>>               expr: _col1
>>>>               type: string
>>>> Reduce Operator Tree:
>>>>   Join Operator
>>>>     condition map:
>>>>          Inner Join 0 to 1
>>>>     condition expressions:
>>>>       0 {VALUE._col0} {VALUE._col1}
>>>>       1 {VALUE._col1}
>>>>     handleSkewJoin: false
>>>>     outputColumnNames: _col0, _col1, _col3
>>>>     File Output Operator
>>>>       compressed: true
>>>>       GlobalTableId: 0
>>>>       table:
>>>>           input format:
>>>>
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>>
>>>>           output format:
>>>>
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>
>>>>
>>>> Is there anything in there that should have alerted me?
>>>>
>>>> I found out by looking at the query, but I wonder if the query plan (if
>>>> I could read it) would have given me that information.
>>>>
>>>> Thanks a lot
>>>>
>>>> David Morel
>>>>
>>>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Cartesian product detection in the query plan?

Posted by David Morel <dm...@gmail.com>.
On 28 Jan 2013, at 17:58, Edward Capriolo wrote:

> Currently order by is very inefficient in hive. Hopefully you can use 
> sort by in most cases.

Makes complete sense, I should have thought of it sooner. Thanks a lot!

> It should be "visible" in the plan. The plan should be very different 
> if you are using the ON clause vs not.  Can it be easily detected is 
> another question.

Well, I guess that answers the question. Thanks again :-)

David

Re: Cartesian product detection in the query plan?

Posted by Edward Capriolo <ed...@gmail.com>.
Currently order by is very inefficient in hive. Hopefully you can use sort
by in most cases.

It should be "visible" in the plan. The plan should be very different if
you are using the ON clause vs not.  Can it be easily detected is another
question.


On Mon, Jan 28, 2013 at 11:45 AM, David Morel <dm...@gmail.com> wrote:

> On 28 Jan 2013, at 14:29, Edward Capriolo wrote:
>
>  Iirc hive.mapred.mode strict should prevent this. If not we should add it.
>>
>
> hi Edward,
>
> Yes, that's indeed what the book claims (quoting):
>
>   hive> SELECT * FROM fracture_act JOIN fracture_ads
>  > WHERE fracture_act.planner_id = fracture_ads.planner_id;
>   FAILED: Error in semantic analysis: In strict mode, cartesian product is
> not allowed. If you really want to perform the operation,
>   +set hive.mapred.mode=nonstrict+
>
> I am about to re-enable this setting on my cluster (after fixing all the
> queries that it broke, especially all the ORDER BY ones :-) but I hoped
> it was visible right there in the query plan, or in some other way. If
> Hive can detect it, it should be visible somewhere, right?
>
> Thanks!
>
> david
>
>
>> On Monday, January 28, 2013, David Morel <dm...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I had to kill some queries that were taking forever, and it turns out
>>> they were doing cartesian products (missing ON clause on a JOIN).
>>>
>>> I wonder how I could see that in the EXPLAIN output (which I still find
>>> a bit cryptic). Specifically, the stage that it was stuck in was this:
>>>
>>> Stage: Stage-7
>>> Map Reduce
>>> Alias -> Map Operator Tree:
>>>   $INTNAME
>>>       Reduce Output Operator
>>>         sort order:
>>>         tag: 1
>>>         value expressions:
>>>               expr: _col1
>>>               type: int
>>>   $INTNAME1
>>>       Reduce Output Operator
>>>         sort order:
>>>         tag: 0
>>>         value expressions:
>>>               expr: _col0
>>>               type: bigint
>>>               expr: _col1
>>>               type: string
>>> Reduce Operator Tree:
>>>   Join Operator
>>>     condition map:
>>>          Inner Join 0 to 1
>>>     condition expressions:
>>>       0 {VALUE._col0} {VALUE._col1}
>>>       1 {VALUE._col1}
>>>     handleSkewJoin: false
>>>     outputColumnNames: _col0, _col1, _col3
>>>     File Output Operator
>>>       compressed: true
>>>       GlobalTableId: 0
>>>       table:
>>>           input format:
>>>
>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>
>>>           output format:
>>>
>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>
>>>
>>> Is there anything in there that should have alerted me?
>>>
>>> I found out by looking at the query, but I wonder if the query plan (if
>>> I could read it) would have given me that information.
>>>
>>> Thanks a lot
>>>
>>> David Morel
>>>
>>>

Re: Cartesian product detection in the query plan?

Posted by David Morel <dm...@gmail.com>.
On 28 Jan 2013, at 14:29, Edward Capriolo wrote:

> Iirc hive.mapred.mode strict should prevent this. If not we should add 
> it.

hi Edward,

Yes, that's indeed what the book claims (quoting):

   hive> SELECT * FROM fracture_act JOIN fracture_ads
  > WHERE fracture_act.planner_id = fracture_ads.planner_id;
   FAILED: Error in semantic analysis: In strict mode, cartesian product 
is not allowed. If you really want to perform the operation,
   +set hive.mapred.mode=nonstrict+

I am about to re-enable this setting on my cluster (after fixing all the
queries that it broke, especially all the ORDER BY ones :-) but I hoped
it was visible right there in the query plan, or in some other way. If
Hive can detect it, it should be visible somewhere, right?

Thanks!

david

>
> On Monday, January 28, 2013, David Morel <dm...@gmail.com> wrote:
>> Hi everyone,
>>
>> I had to kill some queries that were taking forever, and it turns out
>> they were doing cartesian products (missing ON clause on a JOIN).
>>
>> I wonder how I could see that in the EXPLAIN output (which I still 
>> find
>> a bit cryptic). Specifically, the stage that it was stuck in was 
>> this:
>>
>> Stage: Stage-7
>> Map Reduce
>> Alias -> Map Operator Tree:
>>   $INTNAME
>>       Reduce Output Operator
>>         sort order:
>>         tag: 1
>>         value expressions:
>>               expr: _col1
>>               type: int
>>   $INTNAME1
>>       Reduce Output Operator
>>         sort order:
>>         tag: 0
>>         value expressions:
>>               expr: _col0
>>               type: bigint
>>               expr: _col1
>>               type: string
>> Reduce Operator Tree:
>>   Join Operator
>>     condition map:
>>          Inner Join 0 to 1
>>     condition expressions:
>>       0 {VALUE._col0} {VALUE._col1}
>>       1 {VALUE._col1}
>>     handleSkewJoin: false
>>     outputColumnNames: _col0, _col1, _col3
>>     File Output Operator
>>       compressed: true
>>       GlobalTableId: 0
>>       table:
>>           input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
>>           output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>
>> Is there anything in there that should have alerted me?
>>
>> I found out by looking at the query, but I wonder if the query plan 
>> (if
>> I could read it) would have given me that information.
>>
>> Thanks a lot
>>
>> David Morel
>>

Re: Cartesian product detection in the query plan?

Posted by Edward Capriolo <ed...@gmail.com>.
Iirc hive.mapred.mode strict should prevent this. If not we should add it.

On Monday, January 28, 2013, David Morel <dm...@gmail.com> wrote:
> Hi everyone,
>
> I had to kill some queries that were taking forever, and it turns out
> they were doing cartesian products (missing ON clause on a JOIN).
>
> I wonder how I could see that in the EXPLAIN output (which I still find
> a bit cryptic). Specifically, the stage that it was stuck in was this:
>
>   Stage: Stage-7
>     Map Reduce
>       Alias -> Map Operator Tree:
>         $INTNAME
>             Reduce Output Operator
>               sort order:
>               tag: 1
>               value expressions:
>                     expr: _col1
>                     type: int
>         $INTNAME1
>             Reduce Output Operator
>               sort order:
>               tag: 0
>               value expressions:
>                     expr: _col0
>                     type: bigint
>                     expr: _col1
>                     type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Inner Join 0 to 1
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1}
>             1 {VALUE._col1}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col3
>           File Output Operator
>             compressed: true
>             GlobalTableId: 0
>             table:
>                 input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
>                 output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
> Is there anything in there that should have alerted me?
>
> I found out by looking at the query, but I wonder if the query plan (if
> I could read it) would have given me that information.
>
> Thanks a lot
>
> David Morel
>