Posted to user@pig.apache.org by Syed Wasti <md...@hotmail.com> on 2010/05/05 01:36:57 UTC

Pig Latin Questions


Hi,
I am new to Hadoop and the Pig Latin language.
I am trying to convert the Hive QL below to Pig Latin. Any suggestions, please?

INSERT OVERWRITE TABLE A
SELECT id, org_type, dept_type, cnt, cnt_distinct
FROM (SELECT id, 'S' org_type, dept_type, COUNT(1) cnt, COUNT(DISTINCT dept_id) cnt_distinct
         FROM B
         WHERE visible_flag = 1
         GROUP BY id, dept_type) t;

Questions:
1. Is there an option to overwrite the table? If not, what does Pig Latin offer instead?
2. In the inner query, "'S' org_type" creates a new column with the constant value 'S'. What is the Pig Latin equivalent?
3. Related to Q2, "COUNT(1) cnt" counts the rows for each (id, dept_type) group and puts the count in a new column. How can I do this in Pig?

Thanks for your help.

Regards
MD

 		 	   		  

RE: Pig Latin Questions

Posted by Syed Wasti <md...@hotmail.com>.
Thank you all for your expert suggestions. This was of great help. 

Regards
Syed Wasti



Re: Pig Latin Questions

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
In that case:

store FE into 'tmpdir';
exec;
rmf 'destdir'
mv 'tmpdir' 'destdir'

-D

On Wed, May 5, 2010 at 12:13 PM, Edward Capriolo <ed...@gmail.com> wrote:
> The semantics of INSERT OVERWRITE are simple. The output of your query is
> written to a temp folder, and in the final step it is moved to its final
> destination. So you should never end up with partial files in the final
> directory.
>

Re: Pig Latin Questions

Posted by Edward Capriolo <ed...@gmail.com>.
The semantics of INSERT OVERWRITE are simple. The output of your query is
written to a temp folder, and in the final step it is moved to its final
destination. So you should never end up with partial files in the final
directory.

Re: Pig Latin Questions

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Under the hood Hive tables are just files too.
I am not sure what the INSERT OVERWRITE semantics are in edge cases
(like if your query fails), but you may be able to simulate it using
'mv' and 'rmf' commands that Pig provides to operate on the
Hadoop file system.
Note that for safety, Pig will refuse to run if you are trying to
write into a directory that already exists, so you *must* use a move
or a remove if you might already have data in the target location.

All of that goes out the window for both Hive and Pig if you are using
custom SerDes/StoreFuncs, which can do more or less whatever they
want.

-D


Re: Pig Latin Questions

Posted by Thejas Nair <te...@yahoo-inc.com>.
Hi Syed,

1. Released versions of Pig don't support the concept of a table; there will
be one in Owl-specific loaders once they are available. Pig Latin output goes
into files (if the store command is used) or to STDOUT (if dump is used). The
behavior if the file already exists is determined by the StoreFunc; PigStorage
will give an error if the file already exists.


Re 2 & 3: here is the translation to Pig Latin.

L = load 'B' as (id, dept_type, dept_id, visible_flag, org_type);

FIL = filter L by visible_flag == 1;

G = group FIL BY (id, dept_type);

FE = foreach G {
  DEPT_IDS = FIL.dept_id;
  DIST_DEPT_IDS = distinct DEPT_IDS;
  generate group.id, 'S' as org_type, group.dept_type,
           COUNT_STAR(FIL) as cnt, COUNT(DIST_DEPT_IDS) as cnt_distinct;
}

describe FE;
FE: {cnt_distinct: long,cnt: long,id: bytearray,dept_type: bytearray,org_type: chararray}

store FE into 'A';
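
The fields above show up as bytearray because the load statement declares no types. A minimal sketch of the same script with declared types follows. The field types and the tab delimiter are assumptions, not something known about B, so adjust them to the real data. org_type is left out of the load schema in this sketch because the generate produces it as a constant.

```pig
-- Sketch only: field types and the tab delimiter are assumed, not taken from B.
L = load 'B' using PigStorage('\t')
    as (id:chararray, dept_type:chararray, dept_id:chararray, visible_flag:int);

-- Equivalent of WHERE visible_flag = 1
FIL = filter L by visible_flag == 1;

-- Equivalent of GROUP BY id, dept_type
G = group FIL by (id, dept_type);

FE = foreach G {
  DEPT_IDS = FIL.dept_id;
  DIST_DEPT_IDS = distinct DEPT_IDS;
  -- 'S' as org_type is the constant column; COUNT_STAR counts every row in the group
  generate group.id as id, 'S' as org_type, group.dept_type as dept_type,
           COUNT_STAR(FIL) as cnt, COUNT(DIST_DEPT_IDS) as cnt_distinct;
}

store FE into 'A';
```

With types declared, the comparison visible_flag == 1 is done on an int rather than on a bytearray that Pig has to cast.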


On 5/4/10 4:36 PM, "Syed Wasti" <md...@hotmail.com> wrote:

> 
> 
> Hi,
> I am new to Hadoop and Pig Latin Language.
> I am trying to convert the below Hive QL to Pig Latin. Any suggestions please.
> 
> INSERT OVERWRITE TABLE A
> SELECT id, org_type, dept_type, cnt, cnt_distinct
> FROM (SELECT id, 'S' org_type, dept_type, COUNT(1) cnt, COUNT(DISTINCT
> dept_id) cnt_distinct
>          FROM B
>          WHERE visible_flag = 1
>          GROUP BY id, dept_type
> 
> Questions:
> 1. Is there an option to overwrite the table ? OR what does Pig Latin offer ?
> 2. You can see in the inner Query "'S' org_type" I am creating a new column
> and inserting 'S' as the value to this. what does Pig Latin offer ?
> 3. Related to Q2, "COUNT(1) cnt" for every id I am incrementing the count
> based on how many dept_type and id has and generating a new column and
> inserting the count in there. How can I do this in pig ?
> 
> Thanks for you help.
> 
> Regards
> MD
> 
>  
> _________________________________________________________________
> Hotmail is redefining busy with tools for the New Busy. Get more from your
> inbox.
> http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:
> en-US:WM_HMP:042010_2