You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Viral Parikh <vi...@gmail.com> on 2014/09/15 13:16:17 UTC

Correlated Subqueries Workaround in Hive!

To Whomsoever It May Concern,

I posted this question last week but still haven't heard from anyone; I'd
appreciate any reply.

I've got a table that contains a LocationId field. In some cases, where a
record shares the same foreign key, the LocationId might come through as -1.

What I want to do is in my select query is in the case of this happening,
the previous location.

Example data:

Record  FK     StartTime               EndTime          Location1
 110  2011/01/01 12.30        2011/01/01 6.10      4562       110
2011/01/01 3.40         2011/01/01 4.00       -13       110
2011/01/02 1.00         2011/01/02 8.00      8914       110
2011/01/02 5.00         2011/01/02 6.00       -15       110
2011/01/02 6.10         2011/01/02 6.30       -1

The -1 should come out as 456 for record 2, and 891 for record 4 and 5

Can someone help me do this with Hive syntax?

I can do it using SQL syntax (as below) but since Hive doesnt support
correlated subqueries in select clauses and so I am unable to get it.

SELECT  T1.record,
        T1.fk,
        T1.start_time,
        T1.end_time,
        CASE WHEN T1.location != -1 THEN Location
        ELSE
            (
            SELECT  TOP (1)
                    T2.location
            FROM    #temp1 AS T2
            WHERE   T2.record < T1.record
            AND     T2.fk = T1.fk
            AND     T2.location != -1
            ORDER   BY T2.Record DESC
            )
        ENDFROM    #temp1 AS T1

Thank you for your help in advance!

Re: Correlated Subqueries Workaround in Hive!

Posted by Furcy Pin <fu...@flaminem.com>.

Hi,

what you are trying to do looks very much like what the LAG windowing
function does.
If your version of Hive is 0.11 or higher, I suggest trying it.
The hive doc for windowing function is here (but is quite poor):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

Fortunately, as it is the same syntax as standard SQL, you can find better
doc for it:
http://www.oracle-base.com/articles/misc/lag-lead-analytic-functions.php

Hope this helps,

Furcy



2014-09-15 16:12 GMT+02:00 Nitin Pawar <ni...@gmail.com>:

> Other way I can think at this is ..
>
> 1) ignore all -1 and create a tmp table
> 2) I see there are couple of time stamps
> 3) Oder the table by timestamp
> 4) from this tmp tabel create anothe tmp table which says FK MinStartTime
> MaxEndTime Location
> 5) Now this tmp table from step 4 join with ur raw data and put where
> clause with min and max times
>
> I hope this is not confusing
>
> On Mon, Sep 15, 2014 at 6:25 PM, Viral Parikh <vi...@gmail.com>
> wrote:
>
>> thanks!
>>
>> is there any other way than writing python UDF etc.
>>
>> any way i can leverage hive joins to get this working?
>>
>> On Mon, Sep 15, 2014 at 6:56 AM, Sreenath <sr...@gmail.com>
>> wrote:
>>
>>> How about writing a python UDF that takes input line by line
>>> and it saves the previous lines location and can replace it with that
>>> if location turns out to be '-1'
>>>
>>> On 15 September 2014 17:01, Nitin Pawar <ni...@gmail.com> wrote:
>>>
>>>> have you taken a look at lag and lead functions ?
>>>>
>>>> On Mon, Sep 15, 2014 at 4:46 PM, Viral Parikh <viral.j.parikh@gmail.com
>>>> > wrote:
>>>>
>>>>> To Whomsoever It May Concern,
>>>>>
>>>>> I posted this question last week but still haven't heard from anyone;
>>>>> I'd appreciate any reply.
>>>>>
>>>>> I've got a table that contains a LocationId field. In some cases,
>>>>> where a record shares the same foreign key, the LocationId might come
>>>>> through as -1.
>>>>>
>>>>> What I want to do is in my select query is in the case of this
>>>>> happening, the previous location.
>>>>>
>>>>> Example data:
>>>>>
>>>>> Record  FK     StartTime               EndTime          Location1       110  2011/01/01 12.30        2011/01/01 6.10      4562       110  2011/01/01 3.40         2011/01/01 4.00       -13       110  2011/01/02 1.00         2011/01/02 8.00      8914       110  2011/01/02 5.00         2011/01/02 6.00       -15       110  2011/01/02 6.10         2011/01/02 6.30       -1
>>>>>
>>>>> The -1 should come out as 456 for record 2, and 891 for record 4 and 5
>>>>>
>>>>> Can someone help me do this with Hive syntax?
>>>>>
>>>>> I can do it using SQL syntax (as below) but since Hive doesnt support
>>>>> correlated subqueries in select clauses and so I am unable to get it.
>>>>>
>>>>> SELECT  T1.record,
>>>>>         T1.fk,
>>>>>         T1.start_time,
>>>>>         T1.end_time,
>>>>>         CASE WHEN T1.location != -1 THEN Location
>>>>>         ELSE
>>>>>             (
>>>>>             SELECT  TOP (1)
>>>>>                     T2.location
>>>>>             FROM    #temp1 AS T2
>>>>>             WHERE   T2.record < T1.record
>>>>>             AND     T2.fk = T1.fk
>>>>>             AND     T2.location != -1
>>>>>             ORDER   BY T2.Record DESC
>>>>>             )
>>>>>         ENDFROM    #temp1 AS T1
>>>>>
>>>>> Thank you for your help in advance!
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>>
>>> --
>>> Sreenath S Kamath
>>> Bangalore
>>> Ph No:+91-9590989106
>>>
>>
>>
>
>
> --
> Nitin Pawar
>

Re: Correlated Subqueries Workaround in Hive!

Posted by Nitin Pawar <ni...@gmail.com>.

Other way I can think at this is ..

1) ignore all -1 and create a tmp table
2) I see there are couple of time stamps
3) Oder the table by timestamp
4) from this tmp tabel create anothe tmp table which says FK MinStartTime
MaxEndTime Location
5) Now this tmp table from step 4 join with ur raw data and put where
clause with min and max times

I hope this is not confusing

On Mon, Sep 15, 2014 at 6:25 PM, Viral Parikh <vi...@gmail.com>
wrote:

> thanks!
>
> is there any other way than writing python UDF etc.
>
> any way i can leverage hive joins to get this working?
>
> On Mon, Sep 15, 2014 at 6:56 AM, Sreenath <sr...@gmail.com> wrote:
>
>> How about writing a python UDF that takes input line by line
>> and it saves the previous lines location and can replace it with that
>> if location turns out to be '-1'
>>
>> On 15 September 2014 17:01, Nitin Pawar <ni...@gmail.com> wrote:
>>
>>> have you taken a look at lag and lead functions ?
>>>
>>> On Mon, Sep 15, 2014 at 4:46 PM, Viral Parikh <vi...@gmail.com>
>>> wrote:
>>>
>>>> To Whomsoever It May Concern,
>>>>
>>>> I posted this question last week but still haven't heard from anyone;
>>>> I'd appreciate any reply.
>>>>
>>>> I've got a table that contains a LocationId field. In some cases, where
>>>> a record shares the same foreign key, the LocationId might come through as
>>>> -1.
>>>>
>>>> What I want to do is in my select query is in the case of this
>>>> happening, the previous location.
>>>>
>>>> Example data:
>>>>
>>>> Record  FK     StartTime               EndTime          Location1       110  2011/01/01 12.30        2011/01/01 6.10      4562       110  2011/01/01 3.40         2011/01/01 4.00       -13       110  2011/01/02 1.00         2011/01/02 8.00      8914       110  2011/01/02 5.00         2011/01/02 6.00       -15       110  2011/01/02 6.10         2011/01/02 6.30       -1
>>>>
>>>> The -1 should come out as 456 for record 2, and 891 for record 4 and 5
>>>>
>>>> Can someone help me do this with Hive syntax?
>>>>
>>>> I can do it using SQL syntax (as below) but since Hive doesnt support
>>>> correlated subqueries in select clauses and so I am unable to get it.
>>>>
>>>> SELECT  T1.record,
>>>>         T1.fk,
>>>>         T1.start_time,
>>>>         T1.end_time,
>>>>         CASE WHEN T1.location != -1 THEN Location
>>>>         ELSE
>>>>             (
>>>>             SELECT  TOP (1)
>>>>                     T2.location
>>>>             FROM    #temp1 AS T2
>>>>             WHERE   T2.record < T1.record
>>>>             AND     T2.fk = T1.fk
>>>>             AND     T2.location != -1
>>>>             ORDER   BY T2.Record DESC
>>>>             )
>>>>         ENDFROM    #temp1 AS T1
>>>>
>>>> Thank you for your help in advance!
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Sreenath S Kamath
>> Bangalore
>> Ph No:+91-9590989106
>>
>
>


-- 
Nitin Pawar

Re: Correlated Subqueries Workaround in Hive!

Posted by Viral Parikh <vi...@gmail.com>.

thanks!

is there any other way than writing python UDF etc.

any way i can leverage hive joins to get this working?

On Mon, Sep 15, 2014 at 6:56 AM, Sreenath <sr...@gmail.com> wrote:

> How about writing a python UDF that takes input line by line
> and it saves the previous lines location and can replace it with that
> if location turns out to be '-1'
>
> On 15 September 2014 17:01, Nitin Pawar <ni...@gmail.com> wrote:
>
>> have you taken a look at lag and lead functions ?
>>
>> On Mon, Sep 15, 2014 at 4:46 PM, Viral Parikh <vi...@gmail.com>
>> wrote:
>>
>>> To Whomsoever It May Concern,
>>>
>>> I posted this question last week but still haven't heard from anyone;
>>> I'd appreciate any reply.
>>>
>>> I've got a table that contains a LocationId field. In some cases, where
>>> a record shares the same foreign key, the LocationId might come through as
>>> -1.
>>>
>>> What I want to do is in my select query is in the case of this
>>> happening, the previous location.
>>>
>>> Example data:
>>>
>>> Record  FK     StartTime               EndTime          Location1       110  2011/01/01 12.30        2011/01/01 6.10      4562       110  2011/01/01 3.40         2011/01/01 4.00       -13       110  2011/01/02 1.00         2011/01/02 8.00      8914       110  2011/01/02 5.00         2011/01/02 6.00       -15       110  2011/01/02 6.10         2011/01/02 6.30       -1
>>>
>>> The -1 should come out as 456 for record 2, and 891 for record 4 and 5
>>>
>>> Can someone help me do this with Hive syntax?
>>>
>>> I can do it using SQL syntax (as below) but since Hive doesnt support
>>> correlated subqueries in select clauses and so I am unable to get it.
>>>
>>> SELECT  T1.record,
>>>         T1.fk,
>>>         T1.start_time,
>>>         T1.end_time,
>>>         CASE WHEN T1.location != -1 THEN Location
>>>         ELSE
>>>             (
>>>             SELECT  TOP (1)
>>>                     T2.location
>>>             FROM    #temp1 AS T2
>>>             WHERE   T2.record < T1.record
>>>             AND     T2.fk = T1.fk
>>>             AND     T2.location != -1
>>>             ORDER   BY T2.Record DESC
>>>             )
>>>         ENDFROM    #temp1 AS T1
>>>
>>> Thank you for your help in advance!
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Sreenath S Kamath
> Bangalore
> Ph No:+91-9590989106
>

Re: Correlated Subqueries Workaround in Hive!

Posted by Sreenath <sr...@gmail.com>.

How about writing a python UDF that takes input line by line
and it saves the previous lines location and can replace it with that
if location turns out to be '-1'

On 15 September 2014 17:01, Nitin Pawar <ni...@gmail.com> wrote:

> have you taken a look at lag and lead functions ?
>
> On Mon, Sep 15, 2014 at 4:46 PM, Viral Parikh <vi...@gmail.com>
> wrote:
>
>> To Whomsoever It May Concern,
>>
>> I posted this question last week but still haven't heard from anyone; I'd
>> appreciate any reply.
>>
>> I've got a table that contains a LocationId field. In some cases, where a
>> record shares the same foreign key, the LocationId might come through as -1.
>>
>> What I want to do is in my select query is in the case of this happening,
>> the previous location.
>>
>> Example data:
>>
>> Record  FK     StartTime               EndTime          Location1       110  2011/01/01 12.30        2011/01/01 6.10      4562       110  2011/01/01 3.40         2011/01/01 4.00       -13       110  2011/01/02 1.00         2011/01/02 8.00      8914       110  2011/01/02 5.00         2011/01/02 6.00       -15       110  2011/01/02 6.10         2011/01/02 6.30       -1
>>
>> The -1 should come out as 456 for record 2, and 891 for record 4 and 5
>>
>> Can someone help me do this with Hive syntax?
>>
>> I can do it using SQL syntax (as below) but since Hive doesnt support
>> correlated subqueries in select clauses and so I am unable to get it.
>>
>> SELECT  T1.record,
>>         T1.fk,
>>         T1.start_time,
>>         T1.end_time,
>>         CASE WHEN T1.location != -1 THEN Location
>>         ELSE
>>             (
>>             SELECT  TOP (1)
>>                     T2.location
>>             FROM    #temp1 AS T2
>>             WHERE   T2.record < T1.record
>>             AND     T2.fk = T1.fk
>>             AND     T2.location != -1
>>             ORDER   BY T2.Record DESC
>>             )
>>         ENDFROM    #temp1 AS T1
>>
>> Thank you for your help in advance!
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Sreenath S Kamath
Bangalore
Ph No:+91-9590989106

Re: Correlated Subqueries Workaround in Hive!

Posted by Nitin Pawar <ni...@gmail.com>.

have you taken a look at lag and lead functions ?

On Mon, Sep 15, 2014 at 4:46 PM, Viral Parikh <vi...@gmail.com>
wrote:

> To Whomsoever It May Concern,
>
> I posted this question last week but still haven't heard from anyone; I'd
> appreciate any reply.
>
> I've got a table that contains a LocationId field. In some cases, where a
> record shares the same foreign key, the LocationId might come through as -1.
>
> What I want to do is in my select query is in the case of this happening,
> the previous location.
>
> Example data:
>
> Record  FK     StartTime               EndTime          Location1       110  2011/01/01 12.30        2011/01/01 6.10      4562       110  2011/01/01 3.40         2011/01/01 4.00       -13       110  2011/01/02 1.00         2011/01/02 8.00      8914       110  2011/01/02 5.00         2011/01/02 6.00       -15       110  2011/01/02 6.10         2011/01/02 6.30       -1
>
> The -1 should come out as 456 for record 2, and 891 for record 4 and 5
>
> Can someone help me do this with Hive syntax?
>
> I can do it using SQL syntax (as below) but since Hive doesnt support
> correlated subqueries in select clauses and so I am unable to get it.
>
> SELECT  T1.record,
>         T1.fk,
>         T1.start_time,
>         T1.end_time,
>         CASE WHEN T1.location != -1 THEN Location
>         ELSE
>             (
>             SELECT  TOP (1)
>                     T2.location
>             FROM    #temp1 AS T2
>             WHERE   T2.record < T1.record
>             AND     T2.fk = T1.fk
>             AND     T2.location != -1
>             ORDER   BY T2.Record DESC
>             )
>         ENDFROM    #temp1 AS T1
>
> Thank you for your help in advance!
>



-- 
Nitin Pawar