Posted to user@pig.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/04/12 02:27:36 UTC

FILTER and fields from tuple/bags

I am new to Pig and have gone through the reference. I am getting used to
how it works, but questions keep coming up as I write my scripts. I have a
couple of questions:

1) Can I use FILTER with FOREACH? Below I am trying to FILTER, JOIN and merge
the results into one row. But in the end I get all the fields as a row that
seems to have bags inside tuples. All I want is to output the values
of some of the fields from each row in "a,b,c" format. How can I do that?


NM_CT_ST_FILTER = FILTER A by (FIELD_ID == 'NAM2' OR FIELD_ID == 'CITY' OR
FIELD_ID == 'ST' OR FIELD_ID == 'ZIP');

AG_OC_MT_FILTER = FILTER A by (FIELD_ID == 'AGE' OR FIELD_ID == 'OCCUP' OR
FIELD_ID == 'MARITAL') AND FORM_ID == 'FPERSWKS' AND FORM_COPY_NUM == '1';

NM_CT_ST = JOIN NM_CT_ST_FILTER BY (FILE_NAME,CREATED_DATE), D BY
(FILE_NAME,CREATED_DATE);

AG_OC_MT = JOIN AG_OC_MT_FILTER BY
(FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT), D BY
(FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT);

FINAL = COGROUP NM_CT_ST BY (D::FILE_NAME,D::CREATED_DATE), AG_OC_MT BY
(D::FILE_NAME,D::CREATED_DATE);

2) Is it possible to use FILTER inside a FOREACH? Something like: foreach A
GENERATE B FILTER FIELD BY .. OR FIELD BY ..

Re: FILTER and fields from tuple/bags

Posted by Mohit Anchlia <mo...@gmail.com>.
Could someone please help me answer the questions in my original message?


Re: FILTER and fields from tuple/bags

Posted by Mohit Anchlia <mo...@gmail.com>.
Could someone help with this? Is FLATTEN the best approach in this case,
or am I doing something entirely wrong?


Re: FILTER and fields from tuple/bags

Posted by Mohit Anchlia <mo...@gmail.com>.
This is the output I am expecting

NC,28613,55


Re: FILTER and fields from tuple/bags

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Could you type out the actual results you would like to see? I am not
sure what you expect the results of "foreach rel GENERATE FIELD ==
ST, FIELD == ZIP, FIELD == AGE" to look like.

Also, your Pig scripts do not have to be all caps. Using something
other than all caps will make them a lot more readable...

D


Re: FILTER and fields from tuple/bags

Posted by Mohit Anchlia <mo...@gmail.com>.
This is my Pig script so far, and it produces output. What I want to do is
arrange the values in this order, NC,28613,55, from the output below.

My question: from this relation, how can I extract specific fields from
bags and tuples? Essentially I want to do something like:

foreach rel GENERATE FIELD == ST, FIELD == ZIP, FIELD == AGE -- I want
fields in this order from a given relation. But the problem is that they are
arranged in a bag containing multiple tuples.


(1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,04/03/12 11:36:25)
{(1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,FINFOWKS,PER,FINFOWKS,04/03/12 11:36:25,ST,NC),
 (1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,FINFOWKS,PER,FINFOWKS,04/03/12 11:36:25,ZIP,28613),
 (1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,FINFOWKS,PER,FINFOWKS,04/03/12 11:36:25,CITY,Xxxxxxx),
 (1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,FINFOWKS,PER,FINFOWKS,04/03/12 11:36:25,NAM2,Xxxxx X &xxx; Xxxxx X Xxxxxx)}
{(1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,WKS,PER,WKS,04/03/12 11:36:25,AGE,55),
 (1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,WKS,PER,WKS,04/03/12 11:36:25,OCCUP,xxxxxxx xxxxx),
 (1333477861077/home/hadoop/pigtest/./formml_dat/999000093_tax_return.xml,WKS,S2011US1040PER,WKS,04/03/12 11:36:25,MARITAL,Married)}

Snippet of the script:

D = FILTER A BY F_ID == 'FINFOWKS' AND FIELD_ID == 'TSN';
NM_CT_ST_FILTER = FILTER A BY (FIELD_ID == 'NAM2' OR FIELD_ID == 'CITY' OR
    FIELD_ID == 'ST' OR FIELD_ID == 'ZIP');
AG_OC_MT_FILTER = FILTER A BY (FIELD_ID == 'AGE' OR FIELD_ID == 'OCCUP' OR
    FIELD_ID == 'MARITAL') AND F_ID == 'WKS' AND F_COPY_NUM == '1';
NM_CT_ST_FIELDS = FOREACH NM_CT_ST_FILTER GENERATE
    FILE_NAME AS A_FILE_NAME, F_ID AS A_F_ID, FSET_ID AS A_FSET_ID,
    F_ID_ROOT AS A_F_ID_ROOT, CREATED_DATE AS A_CREATED_DATE,
    FIELD_ID AS A_FIELD_ID, FIELD_VALUE AS A_FIELD_VALUE;
AG_OC_MT_FIELDS = FOREACH AG_OC_MT_FILTER GENERATE
    FILE_NAME AS B_FILE_NAME, F_ID AS B_F_ID, FSET_ID AS B_FSET_ID,
    F_ID_ROOT AS B_F_ID_ROOT, CREATED_DATE AS B_CREATED_DATE,
    FIELD_ID AS B_FIELD_ID, FIELD_VALUE AS B_FIELD_VALUE;
A_JOIN = JOIN NM_CT_ST_FIELDS BY
    (A_FILE_NAME,A_CREATED_DATE,A_F_ID,A_F_ID_ROOT), D BY
    (FILE_NAME,CREATED_DATE,F_ID,F_ID_ROOT);
B_JOIN = JOIN AG_OC_MT_FIELDS BY (B_FILE_NAME,B_CREATED_DATE), D BY
    (FILE_NAME,CREATED_DATE);
A_JOIN_F = FOREACH A_JOIN GENERATE A_FILE_NAME, A_F_ID, A_FSET_ID,
    A_F_ID_ROOT, A_CREATED_DATE, A_FIELD_ID, A_FIELD_VALUE, FIELD_VALUE;
B_JOIN_F = FOREACH B_JOIN GENERATE B_FILE_NAME, B_F_ID, B_FSET_ID,
    B_F_ID_ROOT, B_CREATED_DATE, B_FIELD_ID, B_FIELD_VALUE;
FINAL = COGROUP A_JOIN_F BY (A_FILE_NAME,A_CREATED_DATE), B_JOIN_F BY
    (B_FILE_NAME,B_CREATED_DATE);
FINAL_DISTINCT = DISTINCT FINAL;
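
One possible way to get a row such as NC,28613,55 out of FINAL is to filter
each bag inside a nested FOREACH and flatten the single matching value. This
is a sketch, not a tested solution. It assumes the field names from the
script above resolve without a relation prefix, and that each group holds
exactly one ST, one ZIP and one AGE tuple. The alias PIVOTED, the nested
aliases, and the output path are made up for the example.

```pig
PIVOTED = FOREACH FINAL {
    -- Pick the tuple for each wanted field out of the cogrouped bags.
    ST_ROWS  = FILTER A_JOIN_F BY A_FIELD_ID == 'ST';
    ZIP_ROWS = FILTER A_JOIN_F BY A_FIELD_ID == 'ZIP';
    AGE_ROWS = FILTER B_JOIN_F BY B_FIELD_ID == 'AGE';
    -- FLATTEN turns each one-element bag into a plain field.
    GENERATE FLATTEN(ST_ROWS.A_FIELD_VALUE)  AS ST_VALUE,
             FLATTEN(ZIP_ROWS.A_FIELD_VALUE) AS ZIP_VALUE,
             FLATTEN(AGE_ROWS.B_FIELD_VALUE) AS AGE_VALUE;
};

-- PigStorage(',') writes the fields comma-separated (hypothetical path).
STORE PIVOTED INTO 'pivoted_out' USING PigStorage(',');
```

Two caveats with this approach: if any of the filtered bags is empty,
FLATTEN drops the whole row, and if a bag holds more than one match, the
flattened values are crossed and the row is repeated.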



Re: FILTER and fields from tuple/bags

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
It's not clear to me what exactly you are trying to accomplish. Could you provide some sample inputs and expected outputs?

You can use filter inside a foreach: 

Foreach foo { a = filter bag_in_foo by condition; generate a; }
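
To make the one-liner above concrete, here is a minimal sketch of a nested
FOREACH. The relation names, the input path, and the schema are
hypothetical and chosen only for illustration; the field names echo the
ones used elsewhere in this thread.

```pig
-- Hypothetical input: one row per (file_name, field_id, field_value).
records = LOAD 'input.tsv'
          AS (file_name:chararray, field_id:chararray, field_value:chararray);

-- Grouping yields one row per file_name, with a bag named 'records'.
by_file = GROUP records BY file_name;

-- Inside the nested block, FILTER is applied to the bag of each group.
st_only = FOREACH by_file {
    st = FILTER records BY field_id == 'ST';
    GENERATE group AS file_name, st.field_value AS st_values;
};
```

Here st_values is still a bag. Wrapping the projection in FLATTEN would
turn each matching value into a plain field.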
