Posted to user@pig.apache.org by felix gao <gr...@gmail.com> on 2012/03/17 01:32:16 UTC

Distributed Cache in Pig0.7

I need to put a small shared file in the distributed cache so that I can load
it in my UDF in Pig 0.7.  We are using Hadoop 0.20.2+228.  I tried running it with

PIG_OPTS="-Dmapred.cache.archives=hdfs://namenode.host:5001/user/gen/categories/exclude/2012-03-15/exclude-categories#excludeCategory -Dmapred.create.symlink=yes", runpig ~felix/testingr.pig

and

PIG_OPTS="-Dmapred.cache.files=hdfs://namenode.host:5001/user/gen/categories/exclude/2012-03-15/exclude-categories#excludeCategory -Dmapred.create.symlink=yes", runpig ~felix/testingr.pig

When I run

hadoop fs -ls hdfs://namenode.host:5001/user/gen/categories/exclude/2012-03-15/exclude-categories

I do see the file there.

However, on the UDF side I see
java.io.FileNotFoundException: excludeCategory (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileInputStream.<init>(FileInputStream.java:66)
    at java.io.FileReader.<init>(FileReader.java:41)

What did I do wrong?

RE: Selective removal of data from a relation

Posted by rakesh sharma <ra...@hotmail.com>.

Hi Dmitriy,
Thanks for sending it out. My problem is slightly different. However, you have given me some ideas; I am going to try them and will provide an update.
Thanks,
Rakesh
> Date: Mon, 19 Mar 2012 09:51:09 -0700
> Subject: Re: Selective removal of data from a relation
> From: dvryaboy@gmail.com
> To: user@pig.apache.org
> 
> Assume A: {id, foo} and B: {id, bar}
> 
> To get all rows that have ids in both A and B:
> 
> C = join A by id, B by id;
> 
> To get all rows that have ids in A but not in B:
> 
> C = filter (join A by id left outer, B by id) by B::id is null;
> 
> To get all rows that have ids in B but not in A:
> C = filter (join A by id right outer, B by id) by A::id is null;
> 
> To get all rows that don't have a matching row in another relation:
> C = filter (join A by id outer, B by id) by A::id is null OR B::id is null;

Re: Selective removal of data from a relation

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Assume A: {id, foo} and B: {id, bar}

To get all rows that have ids in both A and B:

C = join A by id, B by id;

To get all rows that have ids in A but not in B:

C = filter (join A by id left outer, B by id) by B::id is null;

To get all rows that have ids in B but not in A:

C = filter (join A by id right outer, B by id) by A::id is null;

To get all rows that don't have a matching row in the other relation:

C = filter (join A by id full outer, B by id) by A::id is null OR B::id is null;
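
The four patterns above can also be traced outside Pig. The following is only a
rough Python sketch of the same idea, not Pig code: the toy data, the helper
name full_outer_join, and the assumption that each id appears at most once per
relation are all illustrative. A missing side of the join is represented by
None, which plays the role of Pig's null join key.

```python
# Toy relations keyed by id (illustrative data; ids assumed unique per relation).
A = {1: "foo1", 2: "foo2", 3: "foo3"}
B = {2: "bar2", 3: "bar3", 4: "bar4"}


def full_outer_join(left, right):
    """Return rows (left_id, left_val, right_id, right_val).

    A side with no matching id is filled with None, like a null in Pig.
    """
    rows = []
    for key in sorted(left.keys() | right.keys()):
        left_id = key if key in left else None
        right_id = key if key in right else None
        rows.append((left_id, left.get(key), right_id, right.get(key)))
    return rows


joined = full_outer_join(A, B)

# ids in both A and B (what an inner join returns)
in_both = [r for r in joined if r[0] is not None and r[2] is not None]
# ids in A but not in B (left outer join, then keep rows where B's id is null)
only_in_a = [r for r in joined if r[0] is not None and r[2] is None]
# ids in B but not in A (right outer join, then keep rows where A's id is null)
only_in_b = [r for r in joined if r[2] is not None and r[0] is None]
# rows without a match in the other relation (full outer join, either id null)
unmatched = [r for r in joined if r[0] is None or r[2] is None]

print(in_both)
print(only_in_a)
print(only_in_b)
print(unmatched)
```

The filter always tests the join key of the side that may be missing, never a
value column, because a value column could legitimately be null on its own.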

2012/3/18 rakesh sharma <ra...@hotmail.com>:
> Dmitriy,
> I tried it. However, I don't seem to be getting a handle on it. Some pseudo code will be highly appreciated.
>
> Thanks,
> Rakesh

RE: Selective removal of data from a relation

Posted by rakesh sharma <ra...@hotmail.com>.

Dmitriy,
I tried it. However, I don't seem to be getting a handle on it. Some pseudocode would be highly appreciated.

Thanks,
Rakesh

> Date: Sun, 18 Mar 2012 14:27:25 -0700
> Subject: Re: Selective removal of data from a relation
> From: dvryaboy@gmail.com
> To: user@pig.apache.org
> 
> Rakesh,
> Just like in SQL, this is achieved by doing an outer join and
> filtering for nulls (a null join key indicates absence of a matching
> row).
> 
> D

Re: Selective removal of data from a relation

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Rakesh,
Just like in SQL, this is achieved by doing an outer join and
filtering for nulls (a null join key indicates absence of a matching
row).

D

2012/3/18 rakesh sharma <ra...@hotmail.com>:
>
> Thanks to Dan for suggesting to post it on gist. Here is the link to the post:
> https://raw.github.com/gist/2079527/bf68dd2f0a7ee3864ef066f126c34880b20b6b04/SelectiveDataRemoval‏
> Please take a look and I am sure many of you have solution to this problem.
> Thanks,Rakesh

RE: Selective removal of data from a relation

Posted by rakesh sharma <ra...@hotmail.com>.

Thanks to Dan for suggesting that I post it on gist. Here is the link to the post:
https://raw.github.com/gist/2079527/bf68dd2f0a7ee3864ef066f126c34880b20b6b04/SelectiveDataRemoval
Please take a look; I am sure many of you have a solution to this problem.
Thanks,
Rakesh
> Date: Sun, 18 Mar 2012 12:35:33 -0600
> Subject: RE: Selective removal of data from a relation
> From: danoyoung@gmail.com
> To: user@pig.apache.org
> 
> Post it on https://gist.github.com/ and email out the gist.
> 
> Regards,
> 
> Dan

RE: Selective removal of data from a relation

Posted by Dan Young <da...@gmail.com>.

Post it on https://gist.github.com/ and email out the gist.

Regards,

Dan
On Mar 18, 2012 12:33 PM, "rakesh sharma" <ra...@hotmail.com>
wrote:

>
> All indentations get removed when message comes back from
> user@pig.apache.org. Any idea how I can make it work.

RE: Selective removal of data from a relation

Posted by rakesh sharma <ra...@hotmail.com>.

All indentation gets removed when the message comes back from user@pig.apache.org. Any idea how I can make it work?

> From: rakesh_sharma66@hotmail.com
> To: user@pig.apache.org
> Subject: RE: Selective removal of data from a relation
> Date: Sun, 18 Mar 2012 18:26:01 +0000
> 
> 
> I am sorry for so many re-sends. Resending in Rich text format...
> Hi All,
> I have two relations "mix" and "child_parent". Relation "mix" contains rows of ids. Each Id can be a parent or a child. Another relation "child-parent" has rows of children and associated parents. It may not have data for every child existing in relation "mix". Also, it can have some data for which there is no matching data in relation "mix". I need to remove all children from relation "mix" whose parent exists in the relation. Here is an example to show what I am trying to achieve:mix = load "all_data" as (id:chararray);dump mix;
> 13469
> child_parent = load "mapping" as (childId:chararray, parentId:chararray);dump child_parent;
> (3       1)(6       1)(9      15)
> Children "3" and "6" has matching parent "1". Hence, 3 and 6 need to be removed from "all_data". However, child "9" will stay as its parent "15" does not exist in "all_data". The outcome will be:149I am having hard time in solving it due to lack of experience with pig. Any help/suggestion will be highly appreciated.
> Thanks,Rakesh

RE: Selective removal of data from a relation

Posted by rakesh sharma <ra...@hotmail.com>.

I am sorry for so many re-sends. Resending in Rich text format...
Hi All,
I have two relations, "mix" and "child_parent". Relation "mix" contains rows of ids. Each id can be a parent or a child. The other relation, "child_parent", has rows of children and their associated parents. It may not have data for every child in "mix", and it can have some data for which there is no matching data in "mix". I need to remove from "mix" every child whose parent also exists in "mix". Here is an example to show what I am trying to achieve:

mix = load 'all_data' as (id:chararray);
dump mix;
1
3
4
6
9

child_parent = load 'mapping' as (childId:chararray, parentId:chararray);
dump child_parent;
(3       1)
(6       1)
(9      15)

Children "3" and "6" have a matching parent, "1". Hence, 3 and 6 need to be removed from "all_data". However, child "9" will stay, as its parent "15" does not exist in "all_data". The outcome will be:
1
4
9

I am having a hard time solving this due to my lack of experience with Pig. Any help or suggestions will be highly appreciated.
Thanks,
Rakesh

RE: Selective removal of data from a relation

Posted by rakesh sharma <ra...@hotmail.com>.

Reformatting for clarity....
Hi All,
I have two relations, "mix" and "child_parent". Relation "mix" contains rows of ids. Each id can be a parent or a child. The other relation, "child_parent", has rows of children and their associated parents. It may not have data for every child in "mix", and it can have some data for which there is no matching data in "mix". I need to remove from "mix" every child whose parent also exists in "mix". Here is an example to show what I am trying to achieve:

mix = load 'all_data' as (id:chararray);
dump mix;
1
3
4
6
9

child_parent = load 'mapping' as (childId:chararray, parentId:chararray);
dump child_parent;
(3, 1)
(6, 1)
(9, 15)

Children "3" and "6" have a matching parent, "1". Hence, 3 and 6 need to be removed from "all_data". However, child "9" will stay, as its parent "15" does not exist in "all_data". The outcome will be:
1
4
9

I am having a hard time solving this due to my lack of experience with Pig. Any help or suggestions will be highly appreciated.

Thanks,
Rakesh

Selective removal of data from a relation

Posted by rakesh sharma <ra...@hotmail.com>.

Hi All,
I have two relations, "mix" and "child_parent". Relation "mix" contains rows of ids. Each id can be a parent or a child. The other relation, "child_parent", has rows of children and their associated parents. It may not have data for every child in "mix", and it can have some data for which there is no matching data in "mix". I need to remove from "mix" every child whose parent also exists in "mix". Here is an example to show what I am trying to achieve:

mix = load 'all_data' as (id:chararray);
dump mix;
1
3
4
6
9

child_parent = load 'mapping' as (childId:chararray, parentId:chararray);
dump child_parent;
(3       1)
(6       1)
(9      15)

Children "3" and "6" have a matching parent, "1". Hence, 3 and 6 need to be removed from "all_data". However, child "9" will stay, as its parent "15" does not exist in "all_data". The outcome will be:
1
4
9

I am having a hard time solving this due to my lack of experience with Pig. Any help or suggestions will be highly appreciated.
Thanks,
Rakesh
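
For reference, the example in this message can be worked through by hand. The
following is a minimal Python sketch of the intended logic, not a Pig script;
the variable names mix_ids and children_to_remove are illustrative. It assumes
"remove a child" means dropping its id from "mix" whenever the child's parent
id is also present in "mix". In Pig terms, the first step corresponds to
joining child_parent with mix on parentId, and the second to an outer join of
mix against that result followed by a filter for nulls.

```python
# Sample data taken from the example in the message.
mix = ["1", "3", "4", "6", "9"]
child_parent = [("3", "1"), ("6", "1"), ("9", "15")]

mix_ids = set(mix)

# Step 1: children whose parent is present in mix
# (the equivalent of an inner join of child_parent with mix on parentId).
children_to_remove = {
    child for child, parent in child_parent if parent in mix_ids
}

# Step 2: keep every id of mix that is not such a child
# (the equivalent of a left outer join followed by a null filter).
result = [i for i in mix if i not in children_to_remove]

print(result)  # ['1', '4', '9']
```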

Re: Distributed Cache in Pig0.7

Posted by felix gao <gr...@gmail.com>.

Thanks guys,

What is the expected release date for 0.10?

Felix

On Sat, Mar 17, 2012 at 6:13 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> I personally did a lot of work to migrate from pig 8 to pig 9. It's a
> nontrivial jump, and not just for the UDFs (I'd argue they changed less):
> mainly it's because the parser changed. My recommendation would be to wait
> until the 0.10 release is baked in and move up to that, but at this point,
> 0.7 is really really old and it is worth the pain to upgrade.
>
> Hopefully in the future we'll have ways to help facilitate that
> process...the e2e tests help a lot, but nothing beats running your batch
> jobs against the new version, catching errors, and hopefully filing JIRAs
> if you hit weird bugs :)

Re: Distributed Cache in Pig0.7

Posted by Jonathan Coveney <jc...@gmail.com>.

I personally did a lot of work to migrate from Pig 0.8 to Pig 0.9. It's a
nontrivial jump, and not just for the UDFs (I'd argue they changed less):
mainly it's because the parser changed. My recommendation would be to wait
until the 0.10 release is baked in and move up to that, but at this point,
0.7 is really really old and it is worth the pain to upgrade.

Hopefully in the future we'll have ways to help facilitate that
process...the e2e tests help a lot, but nothing beats running your batch
jobs against the new version, catching errors, and hopefully filing JIRAs
if you hit weird bugs :)

2012/3/16 felix gao <gr...@gmail.com>

> We haven't upgrade because we have a lot of UDFs that is written for
> pig0.7. If I upgrade I am afraid that I have to re-write many of them to
> support the new version.  Do you know if the upgrade from pig0.7 to pig
> 0.9' with respect to the Udfs need any migration work?
>
> Thanks,
>
> Felix

Re: Distributed Cache in Pig0.7

Posted by felix gao <gr...@gmail.com>.

We haven't upgraded because we have a lot of UDFs that were written for
Pig 0.7. If I upgrade, I am afraid that I will have to rewrite many of them to
support the new version.  Do you know whether upgrading from Pig 0.7 to Pig
0.9 requires any migration work for the UDFs?

Thanks,

Felix

On Fri, Mar 16, 2012 at 5:37 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> Felix,
>
> 0.7 does not support distributed cache within Pig UDFs. Is there a reason
> you are using such an old version of Pig?
>
> 0.9 and later would support this for you. Alan's book has great info on
> doing this http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html
>
> Thanks,
> Prashant

Re: Distributed Cache in Pig0.7

Posted by Prashant Kommireddi <pr...@gmail.com>.

Felix,

0.7 does not support distributed cache within Pig UDFs. Is there a reason
you are using such an old version of Pig?

0.9 and later would support this for you. Alan's book has great info on
doing this http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html

Thanks,
Prashant


On Fri, Mar 16, 2012 at 5:32 PM, felix gao <gr...@gmail.com> wrote:

> I need to put a small shared file on distributed cache so I can load it my
> udf in pig0.7.  We are using Hadoop 0.20.2+228.  I tried to run it using
>
>
> PIG_OPTS="-Dmapred.cache.archives=hdfs://namenode.host:5001/user/gen/categories/exclude/2012-03-15/exclude-categories#excludeCategory
> -Dmapred.create.symlink=yes", runpig ~felix/testingr.pig
> and
>
> PIG_OPTS="-Dmapred.cache.files=hdfs://namenode.host:5001/user/gen/categories/exclude/2012-03-15/exclude-categories#excludeCategory
> -Dmapred.create.symlink=yes", runpig ~felix/testingr.pig
>
>
> when I do
> hadoop fs -ls
>
> hdfs://namenode.host:5001/user/gen/categories/exclude/2012-03-15/exclude-categories
> I do see the file there.
>
> However, on the UDF side I see
> java.io.FileNotFoundException: excludeCategory (No such file or directory)
>    at java.io.FileInputStream.open(Native Method)
>    at java.io.FileInputStream.<init>(FileInputStream.java:106)
>    at java.io.FileInputStream.<init>(FileInputStream.java:66)
>    at java.io.FileReader.<init>(FileReader.java:41)
>
> What did I do wrong?
>