Posted to user@hive.apache.org by Viral Parikh <vi...@gmail.com> on 2014/09/15 13:16:17 UTC
Correlated Subqueries Workaround in Hive!
To Whomsoever It May Concern,
I posted this question last week but still haven't heard from anyone; I'd
appreciate any reply.
I've got a table that contains a LocationId field. In some cases, where a
record shares the same foreign key, the LocationId might come through as -1.
What I want to do in my select query is, when this happens, return the
previous location (the most recent one for the same foreign key that is not -1).
Example data:
Record  FK   StartTime         EndTime          Location
1       110  2011/01/01 12.30  2011/01/01 6.10  456
2       110  2011/01/01 3.40   2011/01/01 4.00  -1
3       110  2011/01/02 1.00   2011/01/02 8.00  891
4       110  2011/01/02 5.00   2011/01/02 6.00  -1
5       110  2011/01/02 6.10   2011/01/02 6.30  -1
The -1 should come out as 456 for record 2, and as 891 for records 4 and 5.
Can someone help me do this with Hive syntax?
I can do it using SQL syntax (as below), but since Hive doesn't support
correlated subqueries in the SELECT clause, I am unable to get it working there.
SELECT T1.record,
       T1.fk,
       T1.start_time,
       T1.end_time,
       CASE WHEN T1.location != -1 THEN T1.location
            ELSE
            (
                SELECT TOP (1)
                       T2.location
                FROM #temp1 AS T2
                WHERE T2.record < T1.record
                  AND T2.fk = T1.fk
                  AND T2.location != -1
                ORDER BY T2.record DESC
            )
       END AS location
FROM #temp1 AS T1
Thank you for your help in advance!
Re: Correlated Subqueries Workaround in Hive!
Posted by Furcy Pin <fu...@flaminem.com>.
Hi,
What you are trying to do looks very much like what the LAG windowing
function does.
If your version of Hive is 0.11 or higher, I suggest trying it.
The Hive documentation for windowing functions is here (though it is quite sparse):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Fortunately, since the syntax is the same as in standard SQL, you can find
better documentation for it elsewhere:
http://www.oracle-base.com/articles/misc/lag-lead-analytic-functions.php
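A minimal sketch of the windowing approach, not a tested solution. It assumes the data sits in a Hive table called temp1 (a placeholder name) with the columns used in the original query. LAG on its own only looks one row back, so with two consecutive -1 rows (records 4 and 5 in the example) the second one would still get -1. The sketch therefore turns -1 into NULL and uses LAST_VALUE with its optional skip-nulls flag; whether that flag is accepted depends on the Hive version, so treat it as an assumption to verify.

```sql
-- Sketch only: temp1 is a placeholder table name.
-- Step 1: the inner CASE maps -1 to NULL (no ELSE branch, so it yields NULL).
-- Step 2: LAST_VALUE(..., true) skips NULLs and returns the most recent
--         real location seen so far within the same fk, ordered by record.
SELECT record,
       fk,
       start_time,
       end_time,
       LAST_VALUE(CASE WHEN location != -1 THEN location END, true) OVER (
           PARTITION BY fk
           ORDER BY record
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS location
FROM temp1;
```

Because the window frame includes the current row, rows that already have a real location keep it. With the example data this should yield 456 for record 2 and 891 for records 4 and 5; a -1 row with no earlier real location in its fk would come out as NULL.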
Hope this helps,
Furcy
2014-09-15 16:12 GMT+02:00 Nitin Pawar <ni...@gmail.com>:
> Another way I can think of doing this:
>
> 1) Ignore all the -1 rows and create a tmp table.
> 2) I see there are a couple of timestamps.
> 3) Order the table by timestamp.
> 4) From this tmp table, create another tmp table with the columns FK,
> MinStartTime, MaxEndTime, Location.
> 5) Join the tmp table from step 4 with your raw data, with a WHERE
> clause on the min and max times.
>
> I hope this is not confusing
Re: Correlated Subqueries Workaround in Hive!
Posted by Nitin Pawar <ni...@gmail.com>.
Another way I can think of doing this:
1) Ignore all the -1 rows and create a tmp table.
2) I see there are a couple of timestamps.
3) Order the table by timestamp.
4) From this tmp table, create another tmp table with the columns FK,
MinStartTime, MaxEndTime, Location.
5) Join the tmp table from step 4 with your raw data, with a WHERE
clause on the min and max times.
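One way a join-only variant of this idea might look in HiveQL is sketched below. It is not the exact procedure in the steps above: it keys on record rather than on the start and end times, assumes record is unique within each fk, and uses temp1 as a placeholder table name. The inequality is placed in the WHERE clause because older Hive versions accept only equality conditions in ON.

```sql
-- Sketch only: temp1 is a placeholder; record is assumed unique per fk.
SELECT t1.record,
       t1.fk,
       t1.start_time,
       t1.end_time,
       CASE WHEN t1.location != -1 THEN t1.location
            ELSE prev_row.location
       END AS location
FROM temp1 t1
LEFT OUTER JOIN (
    -- For each -1 row, find the latest earlier record with a real location.
    SELECT a.record      AS record,
           a.fk          AS fk,
           MAX(b.record) AS prev_record
    FROM temp1 a
    JOIN temp1 b ON a.fk = b.fk
    WHERE a.location = -1
      AND b.location != -1
      AND b.record < a.record
    GROUP BY a.record, a.fk
) p ON t1.record = p.record AND t1.fk = p.fk
-- Look up the location stored on that earlier record.
LEFT OUTER JOIN temp1 prev_row
    ON p.prev_record = prev_row.record AND p.fk = prev_row.fk;
```

The inner self-join pairs every -1 row with every earlier real row of the same fk before aggregating, so it can get expensive when an fk has many records.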
I hope this is not confusing
On Mon, Sep 15, 2014 at 6:25 PM, Viral Parikh <vi...@gmail.com>
wrote:
> Thanks!
>
> Is there any way other than writing a Python UDF?
>
> Is there any way I can leverage Hive joins to get this working?
--
Nitin Pawar
Re: Correlated Subqueries Workaround in Hive!
Posted by Viral Parikh <vi...@gmail.com>.
Thanks!
Is there any way other than writing a Python UDF?
Is there any way I can leverage Hive joins to get this working?
On Mon, Sep 15, 2014 at 6:56 AM, Sreenath <sr...@gmail.com> wrote:
> How about writing a Python UDF that takes the input line by line,
> saves the previous line's location, and substitutes it whenever the
> location turns out to be '-1'?
Re: Correlated Subqueries Workaround in Hive!
Posted by Sreenath <sr...@gmail.com>.
How about writing a Python UDF that takes the input line by line,
saves the previous line's location, and substitutes it whenever the
location turns out to be '-1'?
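On the Hive side, a script like this would typically be wired in through TRANSFORM rather than registered as a UDF. The sketch below is a guess at that wiring: fill_location.py is a hypothetical script (not shown) that reads tab-separated rows from stdin, remembers the last location that was not -1 for the current fk, and writes each row back with -1 replaced. temp1 is a placeholder table name.

```sql
-- Sketch only: fill_location.py and temp1 are hypothetical names.
ADD FILE fill_location.py;

-- DISTRIBUTE BY sends all rows of one fk to the same script instance;
-- SORT BY makes them arrive in record order, which the script relies on.
SELECT TRANSFORM (record, fk, start_time, end_time, location)
       USING 'python fill_location.py'
       -- Output columns come back as strings unless typed here.
       AS (record, fk, start_time, end_time, location)
FROM (
    SELECT record, fk, start_time, end_time, location
    FROM temp1
    DISTRIBUTE BY fk
    SORT BY fk, record
) sorted_rows;
```

Without the DISTRIBUTE BY and SORT BY, the script could see rows of the same fk out of order or split across instances, and "the previous line" would no longer mean the previous record.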
On 15 September 2014 17:01, Nitin Pawar <ni...@gmail.com> wrote:
> Have you taken a look at the lag and lead functions?
--
Sreenath S Kamath
Bangalore
Ph No:+91-9590989106
Re: Correlated Subqueries Workaround in Hive!
Posted by Nitin Pawar <ni...@gmail.com>.
Have you taken a look at the lag and lead functions?
--
Nitin Pawar