Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2010/02/04 17:41:19 UTC

Tracking down join issues

hive> select count(1) from pageviews;
OK
55504011
Time taken: 290.216 seconds

select count(1) from files f;
Ended Job = job_200909171715_20347
OK
10164516
Time taken: 29.946 seconds

select count(1) from files f join pageviews p on f.id = p.file_id

OK
89375203
Time taken: 164.767 seconds

Any hint on what is going wrong here? In our dataset, each pageview
should be related to 1 or 0 files.

Thanks,
Edward

Re: Tracking down join issues

Posted by Edward Capriolo <ed...@gmail.com>.
On Thu, Feb 4, 2010 at 12:41 PM, Zheng Shao <zs...@gmail.com> wrote:
> Can you post the results of "explain" for all 3 queries?
>
>
> Zheng
>
> On Thu, Feb 4, 2010 at 8:41 AM, Edward Capriolo <ed...@gmail.com> wrote:
>> OK
>> 55504011
>> Time taken: 290.216 seconds
>> hive> select count(1) from pageviews;
>>
>> select count(1) from files f;
>> Ended Job = job_200909171715_20347
>> OK
>> 10164516
>> Time taken: 29.946 seconds
>>
>> select count(1) from files f join pageviews p on f.id = p.file_id
>>
>> OK
>> 89375203
>> Time taken: 164.767 seconds
>>
>> Any hint on what is going wrong here? from our dataset each pageview
>> should be related to 1 or 0 files?
>>
>> Thanks,
>> Edward
>>
>
>
>
> --
> Yours,
> Zheng
>

Zheng,

My mistake. I made some incorrect assumptions about my source data. We
should add referential integrity to prevent me from making this
mistake again. NOT!

Thanks again,
Edward
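
For readers who hit the same symptom: the thread does not say which
assumption about the data was wrong. But an inner join can only return
more rows than pageviews has if some file_id matches more than one row
in files, which means files.id is not unique. Below is a minimal sketch
of that effect and of a duplicate-key check. It uses Python's built-in
sqlite3 instead of Hive so that it runs anywhere. The toy rows and the
"name" column are hypothetical and are not taken from the thread.

```python
import sqlite3

# In-memory stand-in for the two Hive tables (toy data, hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER, name TEXT);
    CREATE TABLE pageviews (file_id INTEGER);
    -- id 1 appears twice, so the "one file per id" assumption is violated
    INSERT INTO files VALUES (1, 'a'), (1, 'a-dup'), (2, 'b');
    -- file_id 99 has no matching file, so that pageview relates to 0 files
    INSERT INTO pageviews VALUES (1), (1), (2), (99);
""")


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


pageview_count = scalar("SELECT count(1) FROM pageviews")
join_count = scalar(
    "SELECT count(1) FROM files f JOIN pageviews p ON f.id = p.file_id"
)

# Join keys that occur more than once on the side assumed to be unique.
duplicate_ids = conn.execute("""
    SELECT id, n
    FROM (SELECT id, count(1) AS n FROM files GROUP BY id) t
    WHERE n > 1
    ORDER BY id
""").fetchall()

print(pageview_count, join_count, duplicate_ids)
# prints: 4 5 [(1, 2)]
```

The join returns 5 rows from 4 pageviews because each pageview with
file_id 1 matches both files rows with id 1. The GROUP BY subquery is
written without HAVING on purpose, and it should carry over to HiveQL
more or less unchanged, although that has not been checked against any
particular Hive release.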

Re: Tracking down join issues

Posted by Zheng Shao <zs...@gmail.com>.
Can you post the results of "explain" for all 3 queries?


Zheng

On Thu, Feb 4, 2010 at 8:41 AM, Edward Capriolo <ed...@gmail.com> wrote:
> OK
> 55504011
> Time taken: 290.216 seconds
> hive> select count(1) from pageviews;
>
> select count(1) from files f;
> Ended Job = job_200909171715_20347
> OK
> 10164516
> Time taken: 29.946 seconds
>
> select count(1) from files f join pageviews p on f.id = p.file_id
>
> OK
> 89375203
> Time taken: 164.767 seconds
>
> Any hint on what is going wrong here? from our dataset each pageview
> should be related to 1 or 0 files?
>
> Thanks,
> Edward
>



-- 
Yours,
Zheng