Posted to user@pig.apache.org by Johannes Schwenk <jo...@adition.com> on 2012/07/04 14:42:41 UTC

Re: Does pig support in clause?

Hi Alan,

I'd like to use this method to exclude records from my output that are
already present in previously computed data. So I tried to use your
suggestion like this:

grunt> cat in.dat
1
2
3
4
5
6
7
8
9
grunt> C = LOAD 'in.dat' AS (A1); -- previously generated data
grunt> cat in2.dat
12
2
13
1
10
9
11
8
grunt> A = LOAD 'in2.dat' AS (A1); -- new data
grunt> B1 = join A by A1, C by A1;
grunt> B2 = filter B1 by SIZE(C) == 0;

Which gives me this error:

2012-07-04 14:36:16,768 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: Pig script failed to parse:
<line 14, column 23> Invalid scalar projection: C : A column needs to be
projected from a relation for it to be used as a scalar
Details at logfile: /home/schwenk/pig-0.10.0/pig_1341403702015.log

The relevant pig stack trace from the logfile can be found at

http://pastebin.com/MxPfduWS

What am I doing wrong?

Greetings,
Johannes

On 25.06.2012 at 18:39, Alan Gates wrote:
> This type of in is really a semi-join.  So you could rewrite this as:
> 
> B1 = join A by A1, C by A1;
> B2 = filter B1 by SIZE(C) > 0;
> B = foreach B2 flatten(A);
> 
> Alan.
> 
> On Jun 25, 2012, at 2:50 AM, yonghu wrote:
> 
>> Dear all,
>>
>> In SQL, there is an IN clause, which is used to check whether a value
>> is in a set. Does Pig have an equivalent clause? Such as:
>>
>> B = filter A by A1 in C;
>>
>> A, B and C are relation names and A1 is a column name of A.
>>
>> Thanks!
>>
>> Yong
> 



Johannes Schwenk

-- 
Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION

(Landline rate 14 ct/min; mobile rates at most 42 ct/min)

Registered at Amtsgericht Düsseldorf (district court) under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Daniel Raimer, attorney at law
VAT ID No.: DE 218 858 434




Re: Does pig support in clause?

Posted by Johannes Schwenk <jo...@adition.com>.
Thank you very much, Ruslan! That works well!

Greetings,
Johannes

On 04.07.2012 at 15:53, Ruslan Al-Fakikh wrote:
> Hi Johannes,
> 
> Try this
> C = LOAD 'in.dat' AS (A1);
> A = LOAD 'in2.dat' AS (A1);
> 
> joined = JOIN A BY A1 LEFT OUTER, C BY A1;
> 
> DESCRIBE joined;
> 
> newEntries = FILTER joined BY C::A1 IS NULL;
> 
> DUMP newEntries;
> 
> Ruslan
> 







Re: Does pig support in clause?

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Hi Johannes,

Try this
C = LOAD 'in.dat' AS (A1);
A = LOAD 'in2.dat' AS (A1);

joined = JOIN A BY A1 LEFT OUTER, C BY A1;

DESCRIBE joined;

newEntries = FILTER joined BY C::A1 IS NULL;

DUMP newEntries;

Ruslan
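
A note on the error earlier in this thread, offered as a sketch rather
than a definitive answer: after a JOIN the result is a flat relation
(DESCRIBE shows fields such as A::A1 and C::A1), so C is not a bag inside
B1. SIZE(C) is then read as a reference to the relation C itself, which is
what the "Invalid scalar projection" message complains about. A bag-based
test needs COGROUP instead of JOIN. The variant below assumes the same
in.dat and in2.dat files; the alias names grouped, onlyNew and newEntries
are made up for illustration, and IsEmpty is Pig's built-in test for an
empty bag.

```pig
C = LOAD 'in.dat' AS (A1);   -- previously generated data
A = LOAD 'in2.dat' AS (A1);  -- new data

-- COGROUP keeps one bag per input relation for every key
grouped = COGROUP A BY A1, C BY A1;

-- keep only the keys that have no matching record in C
onlyNew = FILTER grouped BY IsEmpty(C);

-- unnest the surviving records of A (keys found only in C yield nothing)
newEntries = FOREACH onlyNew GENERATE FLATTEN(A);

DUMP newEntries;
```

With the sample files this should leave the records 10, 11, 12 and 13
(output order is not guaranteed). Changing the filter to NOT IsEmpty(C)
would give the semi-join, i.e. the IN case from the original question.
Unlike the LEFT OUTER join version, the result carries only A's columns,
so no extra projection is needed afterwards.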
