You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Cam Bazz <ca...@gmail.com> on 2011/02/23 12:16:28 UTC

why this query gives wrong results

Hello,

I have three tables, one that counts hits, the other unique visits,
and the other clicks on that page:

The query below will fail to produce correct results: (number of
uniques is wrong, always set to 8, same number for all)

select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
c.clicks, 0) from selection_daily_hits h left outer join
selection_daily_uniques u left outer join selection_daily_clicks c on
(h.sel_sid = u.sel_sid and h.sel_sid = c.sel_sid and h.date_day =
'20110211' and u.date_day = '20110211' and c.date_day = '20110211');

where the query below will work and provide correct results

select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
c.clicks, 0) from selection_daily_hits h left outer join
selection_daily_uniques u left outer join selection_daily_clicks c on
(h.sel_sid = u.sel_sid and c.sel_sid = h.sel_sid and h.date_day =
'20110211' and u.date_day = '20110211' and c.date_day = '20110211');

the only difference is, on the non working query I have h.sel_sid =
c.sel_sid and in the working query I have c.sel_sid = h.sel_sid

notice that while the first and second table will always have the same
number keys, the third table might not have some keys, hence those
lines are converted to 0.

Best Regards,
-C.B.

Re: why this query gives wrong results

Posted by Viral Bajaria <vi...@gmail.com>.
my query had a bug in it ... it had on ON (it was repeated twice). try this

SELECT
   h.sel_sid, h.hits, u.uniques, if(c.clicks is not null, c.clicks, 0)
FROM
   selection_daily_hits h
   left outer join selection_daily_uniques u ON (h.sel_sid = u.sel_sid AND
h.date_day = u.date_day AND u.date_day = '20110211')
   left outer join selection_daily_clicks c ON (h.sel_sid = c.sel_sid AND
h.date_day = c.date_day AND c.date_day = '20110211')
WHERE h.date_day = '20110211';

On Wed, Feb 23, 2011 at 10:51 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello,
>
> Here are the table descriptions. they only have the identifier, hits,
> unqiques and date_day which is the partition
>
> hive> describe selection_daily_hits;
> OK
> sel_sid int
> hits    int
> date_day        string
>
> hive> describe selection_daily_uniques;
> OK
> sel_sid int
> uniques int
> date_day        string
>
> hive> describe selection_daily_clicks;
> OK
> sel_sid int
> clicks  int
> date_day        string
>
> i tried to query you have, which unfortunately returns:
>
> FAILED: Parse Error: line 1:272 cannot recognize input 'ON' in
> expression specification
>
> In the query I wrote, I thought the on clause covered all three joins,
> but I from your query how i should do it from your query, except, it
> will return parse error...? but why?
>
> best regards,
> c,b,
>
>
> On Wed, Feb 23, 2011 at 10:31 PM, Viral Bajaria <vi...@gmail.com>
> wrote:
> >
> > I have a few questions as follows:
> > 1) what's the schema of all 3 tables ? Do these tables only have
>  (sel_sid,
> > date_day) as the columns along with the facts that they represent or do
> they
> > have more columns besides those 2 columns ?
> > 2) why do you do a left outer join without an ON clause, not too sure if
> > MySql syntax is like that but I normally don't prefer to do a JOIN
> without
> > specifying anything in the ON clause. I just don't trust the result set
> and
> > the query is not really readable.
> > How about trying this query:
> > SELECT
> >    h.sel_sid, h.hits, u.uniques, if(c.clicks is not null, c.clicks, 0)
> > FROM
> >    selection_daily_hits h
> >    left outer join selection_daily_uniques u ON (h.sel_sid = u.sel_sid
> AND
> > h.date_day = u.date_day AND u.date_day = '20110211')
> >    left outer join selection_daily_clicks c on ON (h.sel_sid = c.sel_sid
> AND
> > h.date_day = c.date_day AND c.date_day = '20110211')
> > WHERE h.date_day = '20110211';
> > some notes about the query:
> > - I restrict the uniques and clicks to 20110211 in the JOIN clause
> because
> > in hive 0.5.0 if you put them in the WHERE clause the partitions don't
> get
> > trimmed and it scans the entire table before limiting the data. If your
> data
> > is not partitioned please go ahead and remove that restriction.
> > - I join on the date_day columns to make sure the data is correct if the
> > tables are not partitioned or the query plan causes table scans because
> > there are chances you can see the same sel_sid on different days (this is
> an
> > assumption)
> > -Viral
> > On Wed, Feb 23, 2011 at 3:16 AM, Cam Bazz <ca...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I have three tables, one that counts hits, the other unique visits,
> >> and the other clicks on that page:
> >>
> >> The query below will fail to produce correct results: (number of
> >> uniques is wrong, always set to 8, same number for all)
> >>
> >> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
> >> c.clicks, 0) from selection_daily_hits h left outer join
> >> selection_daily_uniques u left outer join selection_daily_clicks c on
> >> (h.sel_sid = u.sel_sid and h.sel_sid = c.sel_sid and h.date_day =
> >> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
> >>
> >> where the query below will work and provide correct results
> >>
> >> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
> >> c.clicks, 0) from selection_daily_hits h left outer join
> >> selection_daily_uniques u left outer join selection_daily_clicks c on
> >> (h.sel_sid = u.sel_sid and c.sel_sid = h.sel_sid and h.date_day =
> >> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
> >>
> >> the only difference is, on the non working query I have h.sel_sid =
> >> c.sel_sid and in the working query I have c.sel_sid = h.sel_sid
> >>
> >> notice that while the first and second table will always have the same
> >> number keys, the third table might not have some keys, hence those
> >> lines are converted to 0.
> >>
> >> Best Regards,
> >> -C.B.
> >
> >
>

Re: why this query gives wrong results

Posted by Cam Bazz <ca...@gmail.com>.
Hello,

Here are the table descriptions. they only have the identifier, hits,
unqiques and date_day which is the partition

hive> describe selection_daily_hits;
OK
sel_sid int
hits    int
date_day        string

hive> describe selection_daily_uniques;
OK
sel_sid int
uniques int
date_day        string

hive> describe selection_daily_clicks;
OK
sel_sid int
clicks  int
date_day        string

i tried to query you have, which unfortunately returns:

FAILED: Parse Error: line 1:272 cannot recognize input 'ON' in
expression specification

In the query I wrote, I thought the on clause covered all three joins,
but I from your query how i should do it from your query, except, it
will return parse error...? but why?

best regards,
c,b,


On Wed, Feb 23, 2011 at 10:31 PM, Viral Bajaria <vi...@gmail.com> wrote:
>
> I have a few questions as follows:
> 1) what's the schema of all 3 tables ? Do these tables only have  (sel_sid,
> date_day) as the columns along with the facts that they represent or do they
> have more columns besides those 2 columns ?
> 2) why do you do a left outer join without an ON clause, not too sure if
> MySql syntax is like that but I normally don't prefer to do a JOIN without
> specifying anything in the ON clause. I just don't trust the result set and
> the query is not really readable.
> How about trying this query:
> SELECT
>    h.sel_sid, h.hits, u.uniques, if(c.clicks is not null, c.clicks, 0)
> FROM
>    selection_daily_hits h
>    left outer join selection_daily_uniques u ON (h.sel_sid = u.sel_sid AND
> h.date_day = u.date_day AND u.date_day = '20110211')
>    left outer join selection_daily_clicks c on ON (h.sel_sid = c.sel_sid AND
> h.date_day = c.date_day AND c.date_day = '20110211')
> WHERE h.date_day = '20110211';
> some notes about the query:
> - I restrict the uniques and clicks to 20110211 in the JOIN clause because
> in hive 0.5.0 if you put them in the WHERE clause the partitions don't get
> trimmed and it scans the entire table before limiting the data. If your data
> is not partitioned please go ahead and remove that restriction.
> - I join on the date_day columns to make sure the data is correct if the
> tables are not partitioned or the query plan causes table scans because
> there are chances you can see the same sel_sid on different days (this is an
> assumption)
> -Viral
> On Wed, Feb 23, 2011 at 3:16 AM, Cam Bazz <ca...@gmail.com> wrote:
>>
>> Hello,
>>
>> I have three tables, one that counts hits, the other unique visits,
>> and the other clicks on that page:
>>
>> The query below will fail to produce correct results: (number of
>> uniques is wrong, always set to 8, same number for all)
>>
>> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
>> c.clicks, 0) from selection_daily_hits h left outer join
>> selection_daily_uniques u left outer join selection_daily_clicks c on
>> (h.sel_sid = u.sel_sid and h.sel_sid = c.sel_sid and h.date_day =
>> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
>>
>> where the query below will work and provide correct results
>>
>> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
>> c.clicks, 0) from selection_daily_hits h left outer join
>> selection_daily_uniques u left outer join selection_daily_clicks c on
>> (h.sel_sid = u.sel_sid and c.sel_sid = h.sel_sid and h.date_day =
>> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
>>
>> the only difference is, on the non working query I have h.sel_sid =
>> c.sel_sid and in the working query I have c.sel_sid = h.sel_sid
>>
>> notice that while the first and second table will always have the same
>> number keys, the third table might not have some keys, hence those
>> lines are converted to 0.
>>
>> Best Regards,
>> -C.B.
>
>

Re: why this query gives wrong results

Posted by Viral Bajaria <vi...@gmail.com>.
I have a few questions as follows:

1) what's the schema of all 3 tables ? Do these tables only have  (sel_sid,
date_day) as the columns along with the facts that they represent or do they
have more columns besides those 2 columns ?

2) why do you do a left outer join without an ON clause, not too sure if
MySql syntax is like that but I normally don't prefer to do a JOIN without
specifying anything in the ON clause. I just don't trust the result set and
the query is not really readable.

How about trying this query:

SELECT
   h.sel_sid, h.hits, u.uniques, if(c.clicks is not null, c.clicks, 0)
FROM
   selection_daily_hits h
   left outer join selection_daily_uniques u ON (h.sel_sid = u.sel_sid AND
h.date_day = u.date_day AND u.date_day = '20110211')
   left outer join selection_daily_clicks c on ON (h.sel_sid = c.sel_sid AND
h.date_day = c.date_day AND c.date_day = '20110211')
WHERE h.date_day = '20110211';

some notes about the query:
- I restrict the uniques and clicks to 20110211 in the JOIN clause because
in hive 0.5.0 if you put them in the WHERE clause the partitions don't get
trimmed and it scans the entire table before limiting the data. If your data
is not partitioned please go ahead and remove that restriction.
- I join on the date_day columns to make sure the data is correct if the
tables are not partitioned or the query plan causes table scans because
there are chances you can see the same sel_sid on different days (this is an
assumption)

-Viral

On Wed, Feb 23, 2011 at 3:16 AM, Cam Bazz <ca...@gmail.com> wrote:

> Hello,
>
> I have three tables, one that counts hits, the other unique visits,
> and the other clicks on that page:
>
> The query below will fail to produce correct results: (number of
> uniques is wrong, always set to 8, same number for all)
>
> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
> c.clicks, 0) from selection_daily_hits h left outer join
> selection_daily_uniques u left outer join selection_daily_clicks c on
> (h.sel_sid = u.sel_sid and h.sel_sid = c.sel_sid and h.date_day =
> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
>
> where the query below will work and provide correct results
>
> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
> c.clicks, 0) from selection_daily_hits h left outer join
> selection_daily_uniques u left outer join selection_daily_clicks c on
> (h.sel_sid = u.sel_sid and c.sel_sid = h.sel_sid and h.date_day =
> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
>
> the only difference is, on the non working query I have h.sel_sid =
> c.sel_sid and in the working query I have c.sel_sid = h.sel_sid
>
> notice that while the first and second table will always have the same
> number keys, the third table might not have some keys, hence those
> lines are converted to 0.
>
> Best Regards,
> -C.B.
>