You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Rob Stewart <ro...@googlemail.com> on 2010/01/12 23:57:22 UTC

Conditional Selects

Hi folks,

I have a somewhat obvious question, that needs asking (for my sakes).

Pig can do Joins, I realise that. But take for example:
Table_1
----------------------
| ID | fileName |
  1     foo.dat
  2     bar.dat
  3     harry.dat

Table_2
----------------------
| ID | fileName |
  1      tom.dat
  2      bar.dat
  3      gamma.dat


SQL Syntax for conditional select:
"select t1.fileName from Table_1 t1, Table_2 t2 where t1.fileName =
t2.fileName"

Result
--------
bar.dat

How is such a query represented in Pig?
tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int,
filename:chararray);
tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int,
filename:chararray);
[Now what??]
STORE query INTO 'Output.pig' USING PigStorage();


As a bonus question, can anybody tell me if this sort of conditional select
query is possible writing in Java MapReduce?

thanks,

Rob Stewart

Re: Conditional Selects

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Rob,
I get confused how the fields are named sometimes, especially after a
series of joins and groups. The describe command helps (and the
illustrate command, which unfortunately doesn't support some
operators).

On Tue, Jan 12, 2010 at 3:27 PM, Mridul Muralidharan
<mr...@yahoo-inc.com> wrote:
>
> As a suffix to what Dmitriy described - just add a project to pick the
> columns you need.
> c = join a by filename, b by filename PARALLEL $MY_PARALLELISM;
> --- Please check this syntax though with pig latin docs.
> d = foreach c generate a::filename; --- Or anything else you want to pick.
>
> if you need, just do a distinct of d's output to remove duplicates ...
> though this might result in more MR jobs.
>
>
> - Mridul
>
>
>
> Rob Stewart wrote:
>>
>> Hi, yeah I thought so,
>>
>> the only slightly confusing issue is that the output would be:
>> bar.dat bar.dat
>>
>> ? (i.e. - showing you a.filename b.filename ) ?
>>
>> Rob.
>>
>>
>>
>> 2010/1/12 Dmitriy Ryaboy <dv...@gmail.com>
>>
>>> Rob, it's just a join.
>>>
>>> a = load 'rel1' using FooStorage() as (id, filename);
>>> b = load 'rel2' using FooStorage() as (id, filename);
>>> c = join a by filename, b by filename;
>>>
>>> Rows that don't match won't make it.
>>> If you DO want them to make it in, you need to use "outer" for the
>>> relations whose non-matching rows you want retained (the rest of the
>>> fields in the resulting relation will be filled in with nulls).
>>>
>>> Naturally, since Pig can do it, MR can do it.
>>>
>>> -D
>>>
>>> On Tue, Jan 12, 2010 at 2:57 PM, Rob Stewart
>>> <ro...@googlemail.com> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> I have a somewhat obvious question, that needs asking (for my sakes).
>>>>
>>>> Pig can do Joins, I realise that. But take for example:
>>>> Table_1
>>>> ----------------------
>>>> | ID | fileName |
>>>>  1     foo.dat
>>>>  2     bar.dat
>>>>  3     harry.dat
>>>>
>>>> Table_2
>>>> ----------------------
>>>> | ID | fileName |
>>>>  1      tom.dat
>>>>  2      bar.dat
>>>>  3      gamma.dat
>>>>
>>>>
>>>> SQL Syntax for conditional select:
>>>> "select t1.fileName from Table_1 t1, Table_2 t2 where t1.fileName =
>>>> t2.fileName"
>>>>
>>>> Result
>>>> --------
>>>> bar.dat
>>>>
>>>> How is such a query represented in Pig?
>>>> tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int,
>>>> filename:chararray);
>>>> tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int,
>>>> filename:chararray);
>>>> [Now what??]
>>>> STORE query INTO 'Output.pig' USING PigStorage();
>>>>
>>>>
>>>> As a bonus question, can anybody tell me if this sort of conditional
>>>
>>> select
>>>>
>>>> query is possible writing in Java MapReduce?
>>>>
>>>> thanks,
>>>>
>>>> Rob Stewart
>>>>
>
>

Re: Conditional Selects

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
As a suffix to what Dmitriy described - just add a project to pick the 
columns you need.
c = join a by filename, b by filename PARALLEL $MY_PARALLELISM;
--- Please check this syntax though with pig latin docs.
d = foreach c generate a::filename; --- Or anything else you want to pick.

if you need, just do a distinct of d's output to remove duplicates ... 
though this might result in more MR jobs.


- Mridul



Rob Stewart wrote:
> Hi, yeah I thought so,
> 
> the only slightly confusing issue is that the output would be:
> bar.dat bar.dat
> 
> ? (i.e. - showing you a.filename b.filename ) ?
> 
> Rob.
> 
> 
> 
> 2010/1/12 Dmitriy Ryaboy <dv...@gmail.com>
> 
>> Rob, it's just a join.
>>
>> a = load 'rel1' using FooStorage() as (id, filename);
>> b = load 'rel2' using FooStorage() as (id, filename);
>> c = join a by filename, b by filename;
>>
>> Rows that don't match won't make it.
>> If you DO want them to make it in, you need to use "outer" for the
>> relations whose non-matching rows you want retained (the rest of the
>> fields in the resulting relation will be filled in with nulls).
>>
>> Naturally, since Pig can do it, MR can do it.
>>
>> -D
>>
>> On Tue, Jan 12, 2010 at 2:57 PM, Rob Stewart
>> <ro...@googlemail.com> wrote:
>>> Hi folks,
>>>
>>> I have a somewhat obvious question, that needs asking (for my sakes).
>>>
>>> Pig can do Joins, I realise that. But take for example:
>>> Table_1
>>> ----------------------
>>> | ID | fileName |
>>>  1     foo.dat
>>>  2     bar.dat
>>>  3     harry.dat
>>>
>>> Table_2
>>> ----------------------
>>> | ID | fileName |
>>>  1      tom.dat
>>>  2      bar.dat
>>>  3      gamma.dat
>>>
>>>
>>> SQL Syntax for conditional select:
>>> "select t1.fileName from Table_1 t1, Table_2 t2 where t1.fileName =
>>> t2.fileName"
>>>
>>> Result
>>> --------
>>> bar.dat
>>>
>>> How is such a query represented in Pig?
>>> tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int,
>>> filename:chararray);
>>> tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int,
>>> filename:chararray);
>>> [Now what??]
>>> STORE query INTO 'Output.pig' USING PigStorage();
>>>
>>>
>>> As a bonus question, can anybody tell me if this sort of conditional
>> select
>>> query is possible writing in Java MapReduce?
>>>
>>> thanks,
>>>
>>> Rob Stewart
>>>


Re: Conditional Selects

Posted by Rob Stewart <ro...@googlemail.com>.
Hi, yeah I thought so,

the only slightly confusing issue is that the output would be:
bar.dat bar.dat

? (i.e. - showing you a.filename b.filename ) ?

Rob.



2010/1/12 Dmitriy Ryaboy <dv...@gmail.com>

> Rob, it's just a join.
>
> a = load 'rel1' using FooStorage() as (id, filename);
> b = load 'rel2' using FooStorage() as (id, filename);
> c = join a by filename, b by filename;
>
> Rows that don't match won't make it.
> If you DO want them to make it in, you need to use "outer" for the
> relations whose non-matching rows you want retained (the rest of the
> fields in the resulting relation will be filled in with nulls).
>
> Naturally, since Pig can do it, MR can do it.
>
> -D
>
> On Tue, Jan 12, 2010 at 2:57 PM, Rob Stewart
> <ro...@googlemail.com> wrote:
> > Hi folks,
> >
> > I have a somewhat obvious question, that needs asking (for my sakes).
> >
> > Pig can do Joins, I realise that. But take for example:
> > Table_1
> > ----------------------
> > | ID | fileName |
> >  1     foo.dat
> >  2     bar.dat
> >  3     harry.dat
> >
> > Table_2
> > ----------------------
> > | ID | fileName |
> >  1      tom.dat
> >  2      bar.dat
> >  3      gamma.dat
> >
> >
> > SQL Syntax for conditional select:
> > "select t1.fileName from Table_1 t1, Table_2 t2 where t1.fileName =
> > t2.fileName"
> >
> > Result
> > --------
> > bar.dat
> >
> > How is such a query represented in Pig?
> > tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int,
> > filename:chararray);
> > tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int,
> > filename:chararray);
> > [Now what??]
> > STORE query INTO 'Output.pig' USING PigStorage();
> >
> >
> > As a bonus question, can anybody tell me if this sort of conditional
> select
> > query is possible writing in Java MapReduce?
> >
> > thanks,
> >
> > Rob Stewart
> >
>

Re: Conditional Selects

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Rob, it's just a join.

a = load 'rel1' using FooStorage() as (id, filename);
b = load 'rel2' using FooStorage() as (id, filename);
c = join a by filename, b by filename;

Rows that don't match won't make it.
If you DO want them to make it in, you need to use "outer" for the
relations whose non-matching rows you want retained (the rest of the
fields in the resulting relation will be filled in with nulls).

Naturally, since Pig can do it, MR can do it.

-D

On Tue, Jan 12, 2010 at 2:57 PM, Rob Stewart
<ro...@googlemail.com> wrote:
> Hi folks,
>
> I have a somewhat obvious question, that needs asking (for my sakes).
>
> Pig can do Joins, I realise that. But take for example:
> Table_1
> ----------------------
> | ID | fileName |
>  1     foo.dat
>  2     bar.dat
>  3     harry.dat
>
> Table_2
> ----------------------
> | ID | fileName |
>  1      tom.dat
>  2      bar.dat
>  3      gamma.dat
>
>
> SQL Syntax for conditional select:
> "select t1.fileName from Table_1 t1, Table_2 t2 where t1.fileName =
> t2.fileName"
>
> Result
> --------
> bar.dat
>
> How is such a query represented in Pig?
> tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int,
> filename:chararray);
> tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int,
> filename:chararray);
> [Now what??]
> STORE query INTO 'Output.pig' USING PigStorage();
>
>
> As a bonus question, can anybody tell me if this sort of conditional select
> query is possible writing in Java MapReduce?
>
> thanks,
>
> Rob Stewart
>