Posted to user@pig.apache.org by Tamir Kamara <ta...@gmail.com> on 2009/07/06 07:55:01 UTC

Using the ConfigurationUtil

Hi,

I modified a function that I saw on JIRA that filters based on a (small)
list of values present in a file, in order to avoid another cogroup followed
by a filter. The function gets the DFS path to a file, loads it into memory,
and then does the actual matching/filtering.
It seems that when the function is used in the reduce phase, it has a
problem with the ConfigurationUtil:

java.lang.NullPointerException
	at org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil.toProperties(ConfigurationUtil.java:45)
	at pigUDF.InList.init(InList.java:42)
	at pigUDF.InList.exec(InList.java:67)


private void init() throws IOException {
        hs = new HashSet<String>();
        // Convert the job configuration to Properties so FileLocalizer can use it.
        Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob); // ***** line 42 *****
        InputStream is = FileLocalizer.openDFSFile(FilterFileName, props);
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        // Load the filter values into the hash set, one per line (tab-separated).
        String line;
        while ((line = reader.readLine()) != null) {
            String filterField = line.split("\t")[FieldIndex];
            hs.add(filterField);
        }
        reader.close();
        System.out.println("Hash Size: " + hs.size());
}

When the function is used in the map phase it works perfectly.

The script I'm using:
REGISTER pigUDF.jar;
%declare bots_file 'bots.txt'
b01 = load 'file01' as (key: long, value: int);
b02 = load 'file01' as (key: long, value: int);
b03 = load 'file01' as (key: long, value: int);
b04 = load 'file01' as (key: long, value: int);
b05 = load 'file01' as (key: long, value: int);
c = cogroup b01 by key, b02 by key, b03 by key, b04 by key, b05 by key;
DEFINE INLIST pigUDF.InList('$bots_file', '0');
c1 = filter c by COUNT(b01)>0 and not INLIST(group);


Am I doing something wrong?

Thanks,
Tamir

Re: Using the ConfigurationUtil

Posted by Tamir Kamara <ta...@gmail.com>.
I tried using PigMapReduce.sJobConf as Dmitriy suggested and it worked.
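
For the archive, the whole init() after that change (a sketch; only the
source of the configuration differs from my original code, and
PigMapReduce.sJobConf is the class/field from Dmitriy's suggestion):

private void init() throws IOException {
        hs = new HashSet<String>();
        // PigMapReduce.sJobConf is populated in the reduce phase as well,
        // whereas PigInputFormat.sJob was apparently null there (hence the NPE).
        Properties props = ConfigurationUtil.toProperties(PigMapReduce.sJobConf);
        InputStream is = FileLocalizer.openDFSFile(FilterFileName, props);
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        String line;
        while ((line = reader.readLine()) != null) {
            hs.add(line.split("\t")[FieldIndex]);
        }
        reader.close();
}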

Thanks!



RE: Using the ConfigurationUtil

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
I think your initial problem was caused by
http://issues.apache.org/jira/browse/PIG-67. We actually have a solution,
but it is part of the skew join patch. We will try to separate it and make
it available within the next few days.

Olga


Re: Using the ConfigurationUtil

Posted by Dmitriy Ryaboy <dv...@cloudera.com>.
Tamir,
Try this:
ConfigurationUtil.toProperties(PigMapReduce.sJobConf)

-D


RE: Using the ConfigurationUtil

Posted by Tamir Kamara <ta...@kamarafamily.com>.
Maybe the script I originally gave isn't a good example; consider this one:
a = load 'bigfile' as (key1, key2, value);
badkeys = load 'badkeysfile' as (key1);
b = cogroup badkeys by key1, a by key1;
c = filter b by COUNT(badkeys)==0;
d = foreach c generate flatten(a);
e = group d by key2;
f = foreach e generate group, SUM(d.value);
dump f;

This script requires 2 separate jobs; given that badkeysfile is known to be
small, it could be done faster with only 1 job, without the cogroup (see the
sketch below).
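
For concreteness, here is roughly the one-job shape I'm after, using the
InList UDF from my first message (a sketch; the file path and field index
are placeholders):

REGISTER pigUDF.jar;
DEFINE INLIST pigUDF.InList('badkeysfile', '0');
a = load 'bigfile' as (key1, key2, value);
d = filter a by not INLIST(key1);  -- the NOT IN step, evaluated map-side
e = group d by key2;
f = foreach e generate group, SUM(d.value);
dump f;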

Maybe we can move past the reason why I need it and concentrate on the how:
is there another way to load a dfs file in a UDF (other than the one I
used)?

Tamir


RE: Using the ConfigurationUtil

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
I think you can just add it to the original cogroup.


RE: Using the ConfigurationUtil

Posted by Tamir Kamara <ta...@gmail.com>.
Hi Olga,

As I wrote in my previous response, I do not want to use cogroup (and
filter), because it requires a separate reduce phase, which doesn't make
sense when the second dataset is small enough; it's simply a waste of
resources.

Please read my earlier message to see why exactly I think a UDF would be
better here.

Thanks,
Tamir


RE: Using the ConfigurationUtil

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
I don't think you need either a join or a UDF. You can just load your
dataset and include it in the cogroup, and then use a filter to keep only
the rows where the count for your list is equal to 0. You can look at L5 in
http://wiki.apache.org/pig/PigMix.
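
For example, applied to the bots script from the first message, that would
look roughly like this (a sketch; it assumes bots.txt carries the key in
its first column):

bots = load '$bots_file' as (key: long);
c = cogroup b01 by key, b02 by key, b03 by key, b04 by key, b05 by key, bots by key;
c1 = filter c by COUNT(b01)>0 and COUNT(bots)==0;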

Olga


RE: Using the ConfigurationUtil

Posted by Tamir Kamara <ta...@gmail.com>.
Hi,

Actually, this is not a replicated join, because as you can see in my script
I use the *not* keyword before the function.
In fact, if there was an option to do an outer (replicated) join, that would
work, but I remember someone once said that all joins are inner.

I have many cases where I want to keep only the rows of one list whose keys
don't exist in the other (similar to NOT IN or MINUS in SQL). Currently the
only way of doing this is to use a cogroup followed by a filter
COUNT(something)==0, and to me it looks like a waste of an MR cycle that
costs me a lot of time. What I tried to do is to write this NOT IN for Pig,
to save time and cycles.

So, back to my original question: what's wrong with the ConfigurationUtil
at the reduce phase?
Or, is there another way of accessing DFS files within UDFs?

Thanks,
Tamir


Re: Using the ConfigurationUtil

Posted by Dmitriy Ryaboy <dv...@cloudera.com>.
JOIN A by id, B by id USING 'replicated' PARALLEL 30
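
Applied to the badkeys example upthread, that would look roughly like this
(a sketch; the small relation goes last so it is the one replicated into
memory on each mapper, and note a plain replicated join covers IN, not the
NOT IN case):

a = load 'bigfile' as (key1, key2, value);
badkeys = load 'badkeysfile' as (key1);
j = join a by key1, badkeys by key1 using 'replicated';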


Re: Using the ConfigurationUtil

Posted by Chris Olston <ol...@yahoo-inc.com>.
On 7/5/09 10:55 PM, "Tamir Kamara" <ta...@gmail.com> wrote:

> 
> filters based on a (small)
> list of values present in a file in order to avoid another cogroup followed
> by a filter. 

That's a fragment-and-replicate join. Pig has a built-in command for that
now (anybody know the syntax off-hand?).

-Chris



--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research