You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Rajappa Iyer <rs...@mayin.org> on 2010/08/23 22:36:52 UTC

(Self) Joins on NULLable columns takes forever

Consider the following table (I've omitted things like additional columns
and the serde specification since I think they are mostly irrelevant):

CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id string
...) PARTITIONED BY (dt string) ROW FORMAT ...;

Where visitor_id is assigned to every visitor and user_id is only present
for logged in users.  In other words, visitor_id is never NULL, but user_id
can be.

I want to find out, for a given day, how many visitors were also seen
yesterday.  That is, I want the subset of visitor_id's which were also
present in the previous day.

Here's a query I wrote:

SELECT count(1) FROM (
      SELECT DISTINCT e1.visitor_id FROM
            event_log v1 JOIN event_log e2 ON
                    (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22'
AND e2.dt='2010-08-21')) event_log;

The average number of total records for each partition is about 350K.

This works, no problems.

But if I issue the same query with visitor_id replaced by user_id, the
Stage-1 reducer does not seem to make progress -- it remains stuck around
say 68% or so in several runs.

The last few lines from the log:

2010-08-23 13:13:27,036 INFO
org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000
rows
2010-08-23 13:13:27,036 INFO
org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000
rows
2010-08-23 13:13:29,230 INFO
org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000
rows
2010-08-23 13:13:29,230 INFO
org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000
rows
2010-08-23 13:13:31,489 INFO
org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000
rows
2010-08-23 13:13:31,490 INFO
org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000
rows
2010-08-23 13:13:33,680 INFO
org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000
rows
2010-08-23 13:13:33,680 INFO
org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000
rows

The number of rows keep on increasing without bound it seems.

Adding the condition "e1.user_id IS NOT NULL" to the join condition helps
matters considerably in that the job finishes but it is still not as fast as
the query for visitor_id.

Any ideas on how to improve the speed of the query?

This was on hive-0.5.0+20 from CDH3.

Thanks,
Raj

Re: (Self) Joins on NULLable columns takes forever

Posted by Rajappa Iyer <rs...@mayin.org>.

Sure.  I did three runs each.

The times for the visitor_id query were
124.515, 118.673 and 115.33 seconds

for user_id query with "e1.user_id is not null":
241.201, 252.091 and 238.144 seconds

The slowness seems to be mainly due to Stage-1 and Stage-2 reduce.

FYI, the row counts are as follows:

2010-08-22, total 295465, with user_id NOT NULL: 230813
2010-08-21, total 267236, with user_id NOT NULL: 213522

Hope this helps.

Raj

On Mon, Aug 23, 2010 at 3:59 PM, Ted Yu <yu...@gmail.com> wrote:

> Can you publish run time statistics for user_id query with the condition
> "e1.user_id IS NOT NULL" ?
>
> This would allow us to see how much slower that query is compared to the
> visitor_id query.
>
> Thanks
>
>
> On Mon, Aug 23, 2010 at 3:52 PM, Rajappa Iyer <rs...@mayin.org> wrote:
>
>> Yep, that was a typo... sorry.  It should read "event_log e1"
>>
>> Thanks,
>> Raj
>>
>>
>> On Mon, Aug 23, 2010 at 3:42 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> Was there a typo below (v1 -> e1) ?
>>>
>>>
>>> event_log v1 JOIN event_log e2 ON
>>>
>>> On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <rs...@mayin.org> wrote:
>>>
>>>> Consider the following table (I've omitted things like additional
>>>> columns and the serde specification since I think they are mostly
>>>> irrelevant):
>>>>
>>>> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id
>>>> string ...) PARTITIONED BY (dt string) ROW FORMAT ...;
>>>>
>>>> Where visitor_id is assigned to every visitor and user_id is only
>>>> present for logged in users.  In other words, visitor_id is never NULL, but
>>>> user_id can be.
>>>>
>>>> I want to find out, for a given day, how many visitors were also seen
>>>> yesterday.  That is, I want the subset of visitor_id's which were also
>>>> present in the previous day.
>>>>
>>>> Here's a query I wrote:
>>>>
>>>> SELECT count(1) FROM (
>>>>       SELECT DISTINCT e1.visitor_id FROM
>>>>             event_log v1 JOIN event_log e2 ON
>>>>                     (e1.visitor_id = e2.visitor_id AND
>>>> e1.dt='2010-08-22' AND e2.dt='2010-08-21')) event_log;
>>>>
>>>> The average number of total records for each partition is about 350K.
>>>>
>>>> This works, no problems.
>>>>
>>>> But if I issue the same query with visitor_id replaced by user_id, the
>>>> Stage-1 reducer does not seem to make progress -- it remains stuck around
>>>> say 68% or so in several runs.
>>>>
>>>> The last few lines from the log:
>>>>
>>>> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows
>>>> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows
>>>> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows
>>>> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows
>>>> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows
>>>> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows
>>>> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows
>>>> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows
>>>>
>>>> The number of rows keep on increasing without bound it seems.
>>>>
>>>> Adding the condition "e1.user_id IS NOT NULL" to the join condition
>>>> helps matters considerably in that the job finishes but it is still not as
>>>> fast as the query for visitor_id.
>>>>
>>>> Any ideas on how to improve the speed of the query?
>>>>
>>>> This was on hive-0.5.0+20 from CDH3.
>>>>
>>>> Thanks,
>>>> Raj
>>>>
>>>>
>>>
>>
>

Re: (Self) Joins on NULLable columns takes forever

Posted by Rajappa Iyer <rs...@mayin.org>.

Yep, that was a typo... sorry.  It should read "event_log e1"

Thanks,
Raj

On Mon, Aug 23, 2010 at 3:42 PM, Ted Yu <yu...@gmail.com> wrote:

> Was there a typo below (v1 -> e1) ?
>
>
> event_log v1 JOIN event_log e2 ON
>
> On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <rs...@mayin.org> wrote:
>
>> Consider the following table (I've omitted things like additional columns
>> and the serde specification since I think they are mostly irrelevant):
>>
>> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id
>> string ...) PARTITIONED BY (dt string) ROW FORMAT ...;
>>
>> Where visitor_id is assigned to every visitor and user_id is only present
>> for logged in users.  In other words, visitor_id is never NULL, but user_id
>> can be.
>>
>> I want to find out, for a given day, how many visitors were also seen
>> yesterday.  That is, I want the subset of visitor_id's which were also
>> present in the previous day.
>>
>> Here's a query I wrote:
>>
>> SELECT count(1) FROM (
>>       SELECT DISTINCT e1.visitor_id FROM
>>             event_log v1 JOIN event_log e2 ON
>>                     (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22'
>> AND e2.dt='2010-08-21')) event_log;
>>
>> The average number of total records for each partition is about 350K.
>>
>> This works, no problems.
>>
>> But if I issue the same query with visitor_id replaced by user_id, the
>> Stage-1 reducer does not seem to make progress -- it remains stuck around
>> say 68% or so in several runs.
>>
>> The last few lines from the log:
>>
>> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows
>> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows
>> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows
>> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows
>> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows
>> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows
>> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows
>> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows
>>
>> The number of rows keep on increasing without bound it seems.
>>
>> Adding the condition "e1.user_id IS NOT NULL" to the join condition helps
>> matters considerably in that the job finishes but it is still not as fast as
>> the query for visitor_id.
>>
>> Any ideas on how to improve the speed of the query?
>>
>> This was on hive-0.5.0+20 from CDH3.
>>
>> Thanks,
>> Raj
>>
>>
>

Re: (Self) Joins on NULLable columns takes forever

Posted by Ted Yu <yu...@gmail.com>.

Was there a typo below (v1 -> e1) ?

event_log v1 JOIN event_log e2 ON

On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <rs...@mayin.org> wrote:

> Consider the following table (I've omitted things like additional columns
> and the serde specification since I think they are mostly irrelevant):
>
> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id
> string ...) PARTITIONED BY (dt string) ROW FORMAT ...;
>
> Where visitor_id is assigned to every visitor and user_id is only present
> for logged in users.  In other words, visitor_id is never NULL, but user_id
> can be.
>
> I want to find out, for a given day, how many visitors were also seen
> yesterday.  That is, I want the subset of visitor_id's which were also
> present in the previous day.
>
> Here's a query I wrote:
>
> SELECT count(1) FROM (
>       SELECT DISTINCT e1.visitor_id FROM
>             event_log v1 JOIN event_log e2 ON
>                     (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22'
> AND e2.dt='2010-08-21')) event_log;
>
> The average number of total records for each partition is about 350K.
>
> This works, no problems.
>
> But if I issue the same query with visitor_id replaced by user_id, the
> Stage-1 reducer does not seem to make progress -- it remains stuck around
> say 68% or so in several runs.
>
> The last few lines from the log:
>
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows
> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows
> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows
>
> The number of rows keep on increasing without bound it seems.
>
> Adding the condition "e1.user_id IS NOT NULL" to the join condition helps
> matters considerably in that the job finishes but it is still not as fast as
> the query for visitor_id.
>
> Any ideas on how to improve the speed of the query?
>
> This was on hive-0.5.0+20 from CDH3.
>
> Thanks,
> Raj
>
>