Posted to user@mahout.apache.org by Timothy Potter <th...@gmail.com> on 2012/05/11 20:38:58 UTC

Question about storage in Pig-vector (Pig + Mahout)

I'm trying to run the simple 20-newsgroups example to train a Mahout
classifier using Pig and am unsure about the elephant-bird stuff.

First, after battling with getting a build of elephant-bird, the store to
SequenceFile didn't work for me. Then I saw the PigModelStorage and just
used that and it works just fine. Here is my script (with comments removed
for brevity):

-- Train:

register '.../target/pig-vector-1.0-jar-with-dependencies.jar';

define train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

docs = load '20news-bydate-train/*/*' using
org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);

define encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000',
'subject+body', 'group:word, article:numeric, subject:text, body:text');
vectors = foreach docs generate newsgroup, encodeVector(*) as v;

grouped = group vectors all;

model = foreach grouped generate 1 as key, train(vectors) as model;

store model into 'pv-tmp/news_model2' using
org.apache.mahout.pig.PigModelStorage();


-- Eval:

define evaluate org.apache.mahout.pig.LogisticRegressionEval(
    'sequence=pv-tmp/news_model2/part-r-00000, key=1');
test = load '20news-bydate-test/*/*' using
org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);
testvecs = foreach test generate newsgroup, encodeVector(*) as v;
describe testvecs;
evalvecs = foreach testvecs generate evaluate(v);

dump evalvecs;

----

So my main question is what does the elephant-bird model storage stuff do
that PigModelStorage doesn't?

Cheers,
Tim

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Jake Mannix <ja...@gmail.com>.
Well, actually elephant-bird has a generic
com.twitter.elephantbird.pig.store.SequenceFileStorage which lets you use
WritableConverters (com.twitter.elephantbird.pig.util.TextConverter,
com.twitter.elephantbird.pig.util.IntWritableConverter, etc.) to produce
*Writable types as keys and values, given input Pig tuple types.  And yes,
one of these is VectorWritableConverter.

On Fri, May 11, 2012 at 2:59 PM, Ted Dunning <te...@gmail.com> wrote:

> PigModelStorage stores SGD models.
>
> The elephant bird stuff stores data in the form of vectors.
>
> On Fri, May 11, 2012 at 11:38 AM, Timothy Potter <thelabdude@gmail.com> wrote:
>
> > So my main question is what does the elephant-bird model storage stuff do
> > that PigModelStorage doesn't?
> >
>



-- 

  -jake

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Ted Dunning <te...@gmail.com>.
PigModelStorage stores SGD models.

The elephant bird stuff stores data in the form of vectors.

On Fri, May 11, 2012 at 11:38 AM, Timothy Potter <th...@gmail.com> wrote:

> So my main question is what does the elephant-bird model storage stuff do
> that PigModelStorage doesn't?
>

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Timothy Potter <th...@gmail.com>.
My pleasure and hoping to do more with it ;-)

Cheers,
Tim

On Mon, May 14, 2012 at 1:11 PM, Ted Dunning <te...@gmail.com> wrote:

> Tim,
>
> Sorry for the confusion and lack of help.  Pig-vector is half-done and not
> even quite half-baked.
>
> Your help in updating the readme is very much appreciated.

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Ted Dunning <te...@gmail.com>.
Tim,

Sorry for the confusion and lack of help.  Pig-vector is half-done and not
even quite half-baked.

Your help in updating the readme is very much appreciated.

On Mon, May 14, 2012 at 10:17 AM, Timothy Potter <th...@gmail.com> wrote:

> Hi Ted,
>
> Re:
>
> In the readme, there is an example of using elephant-bird to store the
> Classifier in a SequenceFile, i.e.
>
>    /* the trained model is passed to us as a bytearray so we just pass it
>       on out.  The classifier class just contains the list of target values
>       and the OnlineLogisticRegression object itself. */
>
>    store model into 'model.dat' using
>       com.twitter.elephantbird.pig.store.SequenceFileStorage (
>       '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>       '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.pig.Classifier'
>    );
>
> Is this intended? If so, this is what is not working and is confusing. If
> not intended, I'll update the readme to use PigModelStorage instead.
>
>
> Also, I'll try with elephant-bird 2.2.2 instead and see where that gets me.
>
>
> Cheers,
>
> Tim

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Timothy Potter <th...@gmail.com>.
Hi Ted,

Re:

In the readme, there is an example of using elephant-bird to store the
Classifier in a SequenceFile, i.e.

    /* the trained model is passed to us as a bytearray so we just pass it
       on out.  The classifier class just contains the list of target values
       and the OnlineLogisticRegression object itself. */

    store model into 'model.dat' using
       com.twitter.elephantbird.pig.store.SequenceFileStorage (
       '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
       '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.pig.Classifier'
    );

Is this intended? If so, this is what is not working and is confusing. If
not intended, I'll update the readme to use PigModelStorage instead.


Also, I'll try with elephant-bird 2.2.2 instead and see where that gets me.


Cheers,

Tim

On Sun, May 13, 2012 at 11:54 AM, Andy Schlaikjer <andrew.schlaikjer@gmail.com> wrote:

> I don't see a 2.1.2 release in the repo:
>
>
> https://github.com/kevinweil/elephant-bird/tree/master/repo/com/twitter/elephant-bird
>
> Latest release is 2.2.2.
>
> Andy
>

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Andy Schlaikjer <an...@gmail.com>.
On Fri, May 11, 2012 at 12:40 PM, Timothy Potter <th...@gmail.com> wrote:
> [ERROR] Failed to execute goal on project pig-vector: Could not resolve
> dependencies for project pig-vector:pig-vector:jar:1.0: Could not find
> artifact com.twitter:elephant-bird:jar:2.1.2 in elephant-bird (
> https://raw.github.com/kevinweil/elephant-bird/master/repo) -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal on project pig-vector: Could not resolve dependencies for project
> pig-vector:pig-vector:jar:1.0: Could not find artifact
> com.twitter:elephant-bird:jar:2.1.2 in elephant-bird (
> https://raw.github.com/kevinweil/elephant-bird/master/repo)
>  at
> org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependency

I don't see a 2.1.2 release in the repo:

https://github.com/kevinweil/elephant-bird/tree/master/repo/com/twitter/elephant-bird

Latest release is 2.2.2.

Andy

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Timothy Potter <th...@gmail.com>.
Thanks for the help, Jake. Makes sense about interfacing with other Mahout
classes. What is confusing is that the PigModelStorage class also seems to
produce a SequenceFile, i.e.

public OutputFormat getOutputFormat() throws IOException {
    return new SequenceFileOutputFormat();
}

Maven couldn't resolve elephant-bird at the time I tried to build
pig-vector ... just tried again and am getting:

[ERROR] Failed to execute goal on project pig-vector: Could not resolve
dependencies for project pig-vector:pig-vector:jar:1.0: Could not find
artifact com.twitter:elephant-bird:jar:2.1.2 in central (
http://repo1.maven.org/maven2) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal on project pig-vector: Could not resolve dependencies for project
pig-vector:pig-vector:jar:1.0: Could not find artifact
com.twitter:elephant-bird:jar:2.1.2 in central (
http://repo1.maven.org/maven2)

So looking at the elephant-bird readme, I see mention of a Maven repo:
https://raw.github.com/kevinweil/elephant-bird/master/repo

That didn't work either :-(

[ERROR] Failed to execute goal on project pig-vector: Could not resolve
dependencies for project pig-vector:pig-vector:jar:1.0: Could not find
artifact com.twitter:elephant-bird:jar:2.1.2 in elephant-bird (
https://raw.github.com/kevinweil/elephant-bird/master/repo) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal on project pig-vector: Could not resolve dependencies for project
pig-vector:pig-vector:jar:1.0: Could not find artifact
com.twitter:elephant-bird:jar:2.1.2 in elephant-bird (
https://raw.github.com/kevinweil/elephant-bird/master/repo)
 at
org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependency

In any case, I got a 2.1.2 version to build, but when running the training
example, storing the model using elephant-bird fails. It looks to me
like the serialized model is corrupted somehow. I added some debug
statements to Classifier to see what it thinks it's serializing and
deserializing in write and readFields respectively. There's clearly a
mismatch (see below). I've tried this with elephant-bird 2.1.2 and the
latest 2.2.3-SNAPSHOT from GitHub.

Here is the debug output from the code I added to Classifier:

*In write(DataOutput dataOutput)*
>> wrote size: 20
>> wrote category: alt.atheism
>> wrote category: comp.sys.mac.hardware
>> wrote category: rec.motorcycles
>> wrote category: sci.electronics
>> wrote category: talk.politics.guns
>> wrote category: comp.graphics
>> wrote category: comp.windows.x
>> wrote category: rec.sport.baseball
>> wrote category: sci.med
>> wrote category: talk.politics.mideast
>> wrote category: comp.os.ms-windows.misc
>> wrote category: misc.forsale
>> wrote category: rec.sport.hockey
>> wrote category: sci.space
>> wrote category: talk.politics.misc
>> wrote category: comp.sys.ibm.pc.hardware
>> wrote category: rec.autos
>> wrote category: sci.crypt
>> wrote category: soc.religion.christian
>> wrote category: talk.religion.misc
>> wrote model

so far so good ...

*In readFields(DataInput dataInput)*
>> read size: 2125682
>> read category: .apache.mahout.pig.Classifier

alt.atheismcomp.sys.mac.hardwarerec.motorcyclessci.electronicstal
comp.graphicscomp.windows.xrec.sport.baseballsci.medtalk.politics.mideastcomp.os.ms-win
>> read category: ows.misc
                          misc.forsalerec.sport.hockey
sci.spacetalk.politics.misccomp.sys.ibm.pc.hardware    rec.aut
>> read category: s    sci.cryptsoc.religion.christianta


Cheers,
Tim


On Fri, May 11, 2012 at 1:09 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Fri, May 11, 2012 at 11:38 AM, Timothy Potter <thelabdude@gmail.com> wrote:
>
> > I'm trying to run the simple 20-newsgroups example to train a Mahout
> > classifier using Pig and am unsure about the elephant-bird stuff.
> >
> > First, after battling with getting a build of elephant-bird,
>
>
> Why did you have to build it?  Aren't the jars available via maven?
>
>
> > So my main question is what does the elephant-bird model storage stuff do
> > that PigModelStorage doesn't?
> >
>
> SequenceFileStorage produces data in a format which many of the other
> Mahout utilities can read (they typically assume things like SequenceFiles
> of Text, IntWritable, and/or VectorWritable).
>
>
> >
> > Cheers,
> > Tim
> >
>
>
>
> --
>
>  -jake
>

Re: Question about storage in Pig-vector (Pig + Mahout)

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, May 11, 2012 at 11:38 AM, Timothy Potter <th...@gmail.com> wrote:

> I'm trying to run the simple 20-newsgroups example to train a Mahout
> classifier using Pig and am unsure about the elephant-bird stuff.
>
> First, after battling with getting a build of elephant-bird,


Why did you have to build it?  Aren't the jars available via maven?


> So my main question is what does the elephant-bird model storage stuff do
> that PigModelStorage doesn't?
>

SequenceFileStorage produces data in a format which many of the other
Mahout utilities can read (they typically assume things like SequenceFiles
of Text, IntWritable, and/or VectorWritable).


>
> Cheers,
> Tim
>



-- 

  -jake