Posted to user@pig.apache.org by Dan Brickley <da...@danbri.org> on 2011/09/07 17:08:52 UTC

Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Hi all! I have been experimenting with wrapping some of Apache
Mahout's machine-learning-related jobs inside Pig macros, via the
MAPREDUCE keyword. This seemed very nearly doable, but I hit a few
issues, hence this mail.

While I enjoyed an initial minor success, I hit a problem because the
job I was trying actually wanted to take its input from existing data in
hdfs, rather than from Pig. However, it seems Pig requires a 'STORE FOO
INTO' clause when using MAPREDUCE. Is there any reason this is not
optional?

2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
<file mig.macro, line 6, column 1>  mismatched input 'LOAD' expecting
STORE

Complicating things further, I couldn't see a way of creating data for
this dummy input within Pig Latin (or at least the Grunt shell), other
than loading an empty file (which needed creating, cleaning up, etc.).
Is there a syntax for declaring relations as literal data inline that
I'm missing? Also, experimenting in Grunt, I found it tricky that
piggybank.jar couldn't be registered within the macro I 'IMPORT', and
that it was all too easy to get an error from importing the same macro
twice within one session.
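
For concreteness, the empty-file workaround looks roughly like this in
Grunt (a minimal sketch; the paths here are made up):

fs -touchz migtest/empty.txt                          -- zero-byte placeholder
IGNORE = LOAD 'migtest/empty.txt' AS (x: chararray);  -- dummy relation to satisfy STORE
-- ... invoke the macro, passing IGNORE as its throwaway input ...
fs -rm migtest/empty.txt                              -- clean up afterwards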

The Mahout/Pig proof of concept examples are at
https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt

Details of the Mahout side of things at
http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1CvjL8Cs2Q@mail.gmail.com%3E

If I'm missing something obvious that will provide for smoother
integration, I'd be very happy to learn. Currently what I have is just
this example (the simplest case: reading a seq directory in Mahout and
doing downstream filtering of the Mahout results in Pig Latin):


run miglib.pig; -- basic setup, including macro definitions

-- get collocated phrases from a seqdir
reuters_phrases = collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);

political_phrases = FILTER reuters_phrases BY phrase MATCHES
'.*(president|minister|government|election).*' AND score > (float)10;

I'd love to get rid of the 'IGNORE' here, but this is the macro definition:

DEFINE collocations (SEQDIR, IGNORE) RETURNS sorted_concepts {
  DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
  -- STORE ... INTO is mandatory, so the IGNORE relation is written to a throwaway path;
  -- the real input/output paths are the ones passed to Mahout below
  raw_concepts = MAPREDUCE '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
    STORE IGNORE INTO 'migtest/dummy-input'
    LOAD 'migtest/collocations_output/ngrams/part-r-*' USING SequenceFileLoader
      AS (phrase: chararray, score: float)
    `org.apache.mahout.driver.MahoutDriver
     org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i $SEQDIR
     -o migtest/collocations_output --analyzerName
     org.apache.mahout.vectorizer.DefaultAnalyzer --maxNGramSize 2
     --preprocess --overwrite`;
  $sorted_concepts = ORDER raw_concepts BY score DESC;
};


Is this a reasonable thing to attempt? At least in the Mahout case, it
looks to me as though it will be common for input to come from other
files in hdfs rather than from Pig relations, so maybe the requirement
for STORE ... INTO could be softened?

Thanks for any suggestions...

Dan

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Daniel Dai <da...@hortonworks.com>.
On Fri, Sep 9, 2011 at 12:18 AM, Dan Brickley <da...@danbri.org> wrote:

> On 9 September 2011 01:28, Daniel Dai <da...@hortonworks.com> wrote:
> > Yes, makes sense to change it in Pig anyway. The code is in
> > org.apache.pig.parser.LogicalPlanBuilder.buildNativeOp. You may also need
> > to change the parser to make Load/Store optional. Would you want to give
> > it a try?
>
> Having slept on this, I'm not so sure now. If we lose the LOAD/STORE
> clauses, Pig still knows (from B's remaining STORE) that relation B needs
> A, but it doesn't see that relation C and relation D are each defined in
> terms of the (final, complete) result of B.
>
> Without this information, how is Pig's execution engine supposed to
> plan dependencies appropriately? Is there not a risk that these
> logically sequential jobs are initiated in parallel?
>

Yes, we also need a way to specify the dependency, like:
B = MAPREDUCE A jar ......
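
Spelled out a little more (purely hypothetical syntax, nothing
implemented yet; the path and schema below are made up): naming A up
front would give Pig the dependency edge without materializing A
through a STORE clause:

-- hypothetical: A marks the dependency only; the jar reads its input
-- directly from hdfs, and Pig just sequences the jobs
B = MAPREDUCE A 'mahout.jar'
        LOAD 'mahout-output/part-*' AS (phrase: chararray, score: float)
        `...`;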


> Re Shawn's suggestion to drive everything from Python: I'm open-minded.
> Whatever works, really. I've not tried wrapping Pig in Python yet;
> I've only seen it used for UDFs.
>

This should be a good approach. With Python you get more flexibility, and
it's easy to embed a Pig script in it.


>
> Dan
>
> >> > A = xxxxx -- Pig pipeline
> >> > B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
> >> > seqdirectory --input <PATH>/content/reuters/reuters-out --output
> >> > <PATH>/content/reuters/seqfiles --charset UTF-8
> >> > C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
> >> > --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
> >> > D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
> >> > seq2sparse --input <PATH>/content/reuters/seqfiles --output
> >> > <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
> >> > E = foreach D generate ....   -- Pig pipeline
> >> >
> >> > You only need to interface with Pig in the first and last steps, but Pig
> >> > requires you to do a LOAD/STORE for each job, and that's the problem. If
> >> > we make Store/Load optional, that will solve your problem, right?
>

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Dan Brickley <da...@danbri.org>.
On 9 September 2011 01:28, Daniel Dai <da...@hortonworks.com> wrote:
> Yes, makes sense to change it in Pig anyway. The code is in
> org.apache.pig.parser.LogicalPlanBuilder.buildNativeOp. You may also need to
> change the parser to make Load/Store optional. Would you want to give it a try?

Having slept on this, I'm not so sure now. If we lose the LOAD/STORE
clauses, Pig still knows (from B's remaining STORE) that relation B needs
A, but it doesn't see that relation C and relation D are each defined in
terms of the (final, complete) result of B.

Without this information, how is Pig's execution engine supposed to
plan dependencies appropriately? Is there not a risk that these
logically sequential jobs are initiated in parallel?
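
To make that concrete, under a hypothetical relaxed syntax (no
STORE/LOAD clauses; the jar name and paths are made up) these two
statements would carry no dependency edge that Pig can see:

B = MAPREDUCE 'mahout.jar' `step1 -i input -o intermediate`;  -- writes straight to hdfs
C = MAPREDUCE 'mahout.jar' `step2 -i intermediate -o output`; -- reads B's output, invisibly to Pig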

Re Shawn's suggestion to drive everything from Python: I'm open-minded.
Whatever works, really. I've not tried wrapping Pig in Python yet;
I've only seen it used for UDFs.

Dan

>> > A = xxxxx -- Pig pipeline
>> > B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
>> > seqdirectory --input <PATH>/content/reuters/reuters-out --output
>> > <PATH>/content/reuters/seqfiles --charset UTF-8
>> > C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
>> > --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
>> > D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
>> > seq2sparse --input <PATH>/content/reuters/seqfiles --output
>> > <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
>> > E = foreach D generate ....   -- Pig pipeline
>> >
>> > You only need to interface with Pig in the first and last steps, but Pig
>> > requires you to do a LOAD/STORE for each job, and that's the problem. If
>> > we make Store/Load optional, that will solve your problem, right?

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Xiaomeng Wan <sh...@gmail.com>.
Dan, is there any reason you do not want to wrap both Pig and Mahout in
Python? It is much easier.

Shawn

On Thu, Sep 8, 2011 at 5:28 PM, Daniel Dai <da...@hortonworks.com> wrote:
> Yes, makes sense to change it in Pig anyway. The code is in
> org.apache.pig.parser.LogicalPlanBuilder.buildNativeOp. You may also need to
> change the parser to make Load/Store optional. Would you want to give it a try?
>
> Daniel
>
> On Thu, Sep 8, 2011 at 2:50 PM, Dan Brickley <da...@danbri.org> wrote:
>
>> On 8 Sep 2011, at 23:36, Daniel Dai <da...@hortonworks.com> wrote:
>>
>> > It seems like you want to do something like this:
>> >
>> > A = xxxxx -- Pig pipeline
>> > B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
>> > seqdirectory --input <PATH>/content/reuters/reuters-out --output
>> > <PATH>/content/reuters/seqfiles --charset UTF-8
>> > C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
>> > --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
>> > D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
>> > seq2sparse --input <PATH>/content/reuters/seqfiles --output
>> > <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
>> > E = foreach D generate ....   -- Pig pipeline
>> >
>> > You only need to interface with Pig in the first and last steps, but Pig
>> > requires you to do a LOAD/STORE for each job, and that's the problem. If
>> > we make Store/Load optional, that will solve your problem, right?
>>
>> I think so. I'd like to confirm that this really works ok before asking for
>> a change to Pig. But I guess there should be other non-Mahout scenarios that
>> have similar needs. Can you suggest where to patch Pig to make store/load
>> optional?
>>
>> Dan
>

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Daniel Dai <da...@hortonworks.com>.
Yes, makes sense to change it in Pig anyway. The code is in
org.apache.pig.parser.LogicalPlanBuilder.buildNativeOp. You may also need to
change the parser to make Load/Store optional. Would you want to give it a try?

Daniel

On Thu, Sep 8, 2011 at 2:50 PM, Dan Brickley <da...@danbri.org> wrote:

> On 8 Sep 2011, at 23:36, Daniel Dai <da...@hortonworks.com> wrote:
>
> > It seems like you want to do something like this:
> >
> > A = xxxxx -- Pig pipeline
> > B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
> > seqdirectory --input <PATH>/content/reuters/reuters-out --output
> > <PATH>/content/reuters/seqfiles --charset UTF-8
> > C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
> > --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
> > D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
> > seq2sparse --input <PATH>/content/reuters/seqfiles --output
> > <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
> > E = foreach D generate ....   -- Pig pipeline
> >
> > You only need to interface with Pig in the first and last steps, but Pig
> > requires you to do a LOAD/STORE for each job, and that's the problem. If
> > we make Store/Load optional, that will solve your problem, right?
>
> I think so. I'd like to confirm that this really works ok before asking for
> a change to Pig. But I guess there should be other non-Mahout scenarios that
> have similar needs. Can you suggest where to patch Pig to make store/load
> optional?
>
> Dan

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Dan Brickley <da...@danbri.org>.
On 8 Sep 2011, at 23:36, Daniel Dai <da...@hortonworks.com> wrote:

> It seems like you want to do something like this:
> 
> A = xxxxx -- Pig pipeline
> B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
> seqdirectory --input <PATH>/content/reuters/reuters-out --output
> <PATH>/content/reuters/seqfiles --charset UTF-8
> C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
> --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
> D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
> seq2sparse --input <PATH>/content/reuters/seqfiles --output
> <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
> E = foreach D generate ....   -- Pig pipeline
>
> You only need to interface with Pig in the first and last steps, but Pig
> requires you to do a LOAD/STORE for each job, and that's the problem. If
> we make Store/Load optional, that will solve your problem, right?

I think so. I'd like to confirm that this really works ok before asking for a change to Pig. But I guess there should be other non-Mahout scenarios that have similar needs. Can you suggest where to patch Pig to make store/load optional?

Dan

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Daniel Dai <da...@hortonworks.com>.
It seems like you want to do something like this:

A = xxxxx -- Pig pipeline
B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
seqdirectory --input <PATH>/content/reuters/reuters-out --output
<PATH>/content/reuters/seqfiles --charset UTF-8
C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
--output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
seq2sparse --input <PATH>/content/reuters/seqfiles --output
<PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
E = foreach D generate ....   -- Pig pipeline

You only need to interface with Pig in the first and last steps, but Pig
requires you to do a LOAD/STORE for each job, and that's the problem. If
we make Store/Load optional, that will solve your problem, right?

Daniel

On Thu, Sep 8, 2011 at 1:22 PM, Dan Brickley <da...@danbri.org> wrote:

> On 8 September 2011 20:29, Daniel Dai <da...@hortonworks.com> wrote:
> > Thanks Dan, see my comments inline.
> > On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <da...@danbri.org> wrote:
> >
> >> Hi all! I have been experimenting with wrapping some of Apache
> >> Mahout's machine-learning-related jobs inside Pig macros, via the
> >> MAPREDUCE keyword. This seemed very nearly doable, but I hit a few
> >> issues, hence this mail.
> >>
> >
> >> While I enjoyed an initial minor success, I hit a problem because the
> >> job I was trying actually wanted to take its input from existing data in
> >> hdfs, rather than from Pig. However, it seems Pig requires a 'STORE FOO
> >> INTO' clause when using MAPREDUCE. Is there any reason this is not
> >> optional?
> >>
> >
> > We expect the native mapreduce job to take one input produced by Pig and
> > produce one output feeding into the rest of the Pig script. This is the
> > interface between Pig and MapReduce.
> > Take WordCount as an example:
> > b = mapreduce 'hadoop-examples.jar' Store a into 'input' Load 'output'
> > `wordcount input output`;
> >
> > Pig will save a into 'input' and wordcount will take it as its input.
> >
> > In your script, I saw you hard-code the Mahout input/output. I believe
> > this is just a test; in the real world you will use Pig to prepare and
> > consume the input/output. Otherwise, what's the point of binding Pig and
> > Mahout?
>
> Yes, I would expect Pig could take on more of the data preparation and
> filtering tasks. However, Mahout itself offers several different
> components that typically get pipelined together to solve problems. I
> was trying to extend the example by also making a macro for the Mahout
> task 'seqdirectory',
> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html ...
> I'm not sure if that can be directly 'piggified', but I was expecting
> that Pig could be used to express the data flow, and that a common
> pattern would be for data to start with Pig, pass through perhaps one,
> two, or three Mahout-based tasks, and then feed final output back into
> Pig's world.
>
> Maybe it would help to take some of the concrete examples that show up
> in typical Mahout howtos, and think through how those might be
> expressed in a more Piggy way? For example
>
> http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
> shows a sequence of Mahout jobs, beginning with fetching a Reuters
> dataset (collection of documents), and then creating sequence files,
> and then from those, creating different flavoured Sparse Vector
> representations via different arguments/parameters, for subsequent
> consumption in LDA and kmeans clustering jobs. Oh, and then the
> results are printed/explored. Is that the kind of data flow that Pig
> could reasonably be expected to manage via 'MAPREDUCE', or am I
> over-stretching the mechanism?
>
> Another example (clustering again), from
>
> http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/
>
> https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_vectors.sh
> then
> https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_kmeans.sh
>
> So again the flow here from those .sh scripts (I'll trim some params,
> leaving just the in/out pipeline) is:
>
> bin/mahout seqdirectory --input  examples/src/main/resources/seinfeld-scripts-preprocessed \
>                         --output out-seinfeld-seqfiles [...]
> bin/mahout seq2sparse   --input  out-seinfeld-seqfiles \
>                         --output out-seinfeld-vectors [...]
> bin/mahout kmeans       --input    out-seinfeld-vectors/tfidf-vectors \
>                         --output   out-seinfeld-kmeans/clusters \
>                         --clusters out-seinfeld-kmeans/initialclusters [...]
> bin/mahout clusterdump  --seqFileDir     out-seinfeld-kmeans/clusters/clusters-1 \
>                         --pointsDir      out-seinfeld-kmeans/clusters/clusteredPoints \
>                         --numWords       5 \
>                         --dictionary     out-seinfeld-vectors/dictionary.file-0 \
>                         --dictionaryType sequencefile
>
> I should say I'm no expert on the Mahout details either, but since a
> lot of my base input data is being handled (and joined, filtered, etc.)
> very nicely by Pig, I'm very curious about having some closer
> integration here. I also have no strong intuition about the impact of
> all this on efficiency, in terms either of parallelism or of the costs
> of storing intermediates on disk rather than keeping everything in Pig
> data structures.
>
> >> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
> >> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
> >> <file mig.macro, line 6, column 1>  mismatched input 'LOAD' expecting
> >> STORE
> >>
> >> Complicating things further, I couldn't see a way of creating data for
> >> this dummy input within Pig Latin (or at least the Grunt shell), other
> >> than loading an empty file (which needed creating, cleaning up, etc.).
> >> Is there a syntax for declaring relations as literal data inline that
> >> I'm missing? Also, experimenting in Grunt, I found it tricky that
> >> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
> >> that it was all too easy to get an error from importing the same macro
> >> twice within one session.
> >>
> >
> > This we definitely want to fix.
>
> Thanks. Let me know if you need a more detailed report or a filed issue.
>
> >> The Mahout/Pig proof of concept examples are at
> >>
> >>
> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
> >>
> >> Details of the Mahout side of things at
> >>
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1CvjL8Cs2Q@mail.gmail.com%3E
> >>
> >> If I'm missing something obvious that will provide for smoother
> >> integration, I'd be very happy to learn. [...]
> >> Is this a reasonable thing to attempt? At least in the Mahout case, it
> >> looks to me as though it will be common for input to come from other
> >> files in hdfs rather than from Pig relations, so maybe the requirement
> >> for STORE ... INTO could be softened?
> >>
> >
> >> Thanks for any suggestions...
>
> > That seems to be a very interesting project. Let me know your progress
> > and anything I can help with.
>
> Thanks. I hit a few issues on the Mahout side too, but I'll see how
> far I can get with a simple set of macros, even if I have to use the
> 'IGNORE' hack for now. If you have any suggestion for a cleaner
> syntax/approach that'll work in Pig 0.9, I'd love to hear it.
>
> Whether this will ever be truly useful I think depends on the kind of
> pipeline scenarios sketched above, i.e. where more than one consecutive
> step happens outside of Pig. There might be a case for interacting
> with those external programs without having each step's results
> written into hdfs, but I'm not sure how that would best be
> implemented.
>
> cheers,
>
> Dan
>

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Dan Brickley <da...@danbri.org>.
On 8 September 2011 20:29, Daniel Dai <da...@hortonworks.com> wrote:
> Thanks Dan, see my comments inline.
> On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <da...@danbri.org> wrote:
>
>> Hi all! I have been experimenting with wrapping some of Apache
>> Mahout's machine-learning-related jobs inside Pig macros, via the
>> MAPREDUCE keyword. This seemed very nearly doable, but I hit a few
>> issues, hence this mail.
>>
>
>> While I enjoyed an initial minor success, I hit a problem because the
>> job I was trying actually wanted to take its input from existing data in
>> hdfs, rather than from Pig. However, it seems Pig requires a 'STORE FOO
>> INTO' clause when using MAPREDUCE. Is there any reason this is not
>> optional?
>>
>
> We expect the native mapreduce job to take one input produced by Pig and
> produce one output feeding into the rest of the Pig script. This is the
> interface between Pig and MapReduce.
> Take WordCount as an example:
> b = mapreduce 'hadoop-examples.jar' Store a into 'input' Load 'output'
> `wordcount input output`;
>
> Pig will save a into 'input' and wordcount will take it as its input.
>
> In your script, I saw you hard-code the Mahout input/output. I believe this
> is just a test; in the real world you will use Pig to prepare and consume
> the input/output. Otherwise, what's the point of binding Pig and Mahout?

Yes, I would expect Pig could take on more of the data preparation and
filtering tasks. However, Mahout itself offers several different
components that typically get pipelined together to solve problems. I
was trying to extend the example by also making a macro for the Mahout
task 'seqdirectory',
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html ...
I'm not sure if that can be directly 'piggified', but I was expecting
that Pig could be used to express the data flow, and that a common
pattern would be for data to start with Pig, pass through perhaps one,
two, or three Mahout-based tasks, and then feed final output back into
Pig's world.

Maybe it would help to take some of the concrete examples that show up
in typical Mahout howtos, and think through how those might be
expressed in a more Piggy way? For example
http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
shows a sequence of Mahout jobs, beginning with fetching a Reuters
dataset (collection of documents), and then creating sequence files,
and then from those, creating different flavoured Sparse Vector
representations via different arguments/parameters, for subsequent
consumption in LDA and kmeans clustering jobs. Oh, and then the
results are printed/explored. Is that the kind of data flow that Pig
could reasonably be expected to manage via 'MAPREDUCE', or am I
over-stretching the mechanism?

Another example (clustering again), from
http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/
https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_vectors.sh
then https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_kmeans.sh

So again the flow here from those .sh scripts (I'll trim some params,
leaving just the in/out pipeline) is:

bin/mahout seqdirectory --input  examples/src/main/resources/seinfeld-scripts-preprocessed \
                        --output out-seinfeld-seqfiles [...]
bin/mahout seq2sparse   --input  out-seinfeld-seqfiles \
                        --output out-seinfeld-vectors [...]
bin/mahout kmeans       --input    out-seinfeld-vectors/tfidf-vectors \
                        --output   out-seinfeld-kmeans/clusters \
                        --clusters out-seinfeld-kmeans/initialclusters [...]
bin/mahout clusterdump  --seqFileDir     out-seinfeld-kmeans/clusters/clusters-1 \
                        --pointsDir      out-seinfeld-kmeans/clusters/clusteredPoints \
                        --numWords       5 \
                        --dictionary     out-seinfeld-vectors/dictionary.file-0 \
                        --dictionaryType sequencefile
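
As a thought experiment, with today's syntax the first two steps might
look something like this in Pig Latin. This is just a sketch: 'dummy' is
a throwaway relation as in my 'IGNORE' hack, SequenceFileLoader is the
piggybank loader from my earlier macro, and the schemas and dummy paths
are guesses:

step1 = MAPREDUCE 'mahout-core-0.6-SNAPSHOT-job.jar'
            STORE dummy INTO 'tmp/dummy1'
            LOAD 'out-seinfeld-seqfiles/part-*' USING SequenceFileLoader
                AS (k: chararray, v: chararray)
            `org.apache.mahout.driver.MahoutDriver seqdirectory
             --input examples/src/main/resources/seinfeld-scripts-preprocessed
             --output out-seinfeld-seqfiles`;

-- 'step1' is stored only to give Pig a dependency edge; the Mahout job
-- actually reads the out-seinfeld-seqfiles directory named on its command line
step2 = MAPREDUCE 'mahout-core-0.6-SNAPSHOT-job.jar'
            STORE step1 INTO 'tmp/dummy2'
            LOAD 'out-seinfeld-vectors/tfidf-vectors/part-*' USING SequenceFileLoader
                AS (k: chararray, v: chararray)
            `org.apache.mahout.driver.MahoutDriver seq2sparse
             --input out-seinfeld-seqfiles --output out-seinfeld-vectors`;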

I should say I'm no expert on the Mahout details either, but since a
lot of my base input data is being handled (and joined, filtered, etc.)
very nicely by Pig, I'm very curious about having some closer
integration here. I also have no strong intuition about the impact of
all this on efficiency, in terms either of parallelism or of the costs
of storing intermediates on disk rather than keeping everything in Pig
data structures.

>> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
>> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
>> <file mig.macro, line 6, column 1>  mismatched input 'LOAD' expecting
>> STORE
>>
>> Complicating things further, I couldn't see a way of creating data for
>> this dummy input within Pig Latin (or at least the Grunt shell), other
>> than loading an empty file (which needed creating, cleaning up, etc.).
>> Is there a syntax for declaring relations as literal data inline that
>> I'm missing? Also, experimenting in Grunt, I found it tricky that
>> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
>> that it was all too easy to get an error from importing the same macro
>> twice within one session.
>>
>
> This we definitely want to fix.

Thanks. Let me know if you need a more detailed report or a filed issue.

>> The Mahout/Pig proof of concept examples are at
>>
>> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
>>
>> Details of the Mahout side of things at
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1CvjL8Cs2Q@mail.gmail.com%3E
>>
>> If I'm missing something obvious that will provide for smoother
>> integration, I'd be very happy to learn. [...]
>> Is this a reasonable thing to attempt? At least in the Mahout case, it
>> looks to me as though it will be common for input to come from other
>> files in hdfs rather than from Pig relations, so maybe the requirement
>> for STORE ... INTO could be softened?
>>
>
>> Thanks for any suggestions...

> That seems to be a very interesting project. Let me know your progress and
> anything I can help with.

Thanks. I hit a few issues on the Mahout side too, but I'll see how
far I can get with a simple set of macros, even if I have to use the
'IGNORE' hack for now. If you have any suggestion for a cleaner
syntax/approach that'll work in Pig 0.9, I'd love to hear it.

Whether this will ever be truly useful I think depends on the kind of
pipeline scenarios sketched above, i.e. where more than one consecutive
step happens outside of Pig. There might be a case for interacting
with those external programs without having each step's results
written into hdfs, but I'm not sure how that would best be
implemented.

cheers,

Dan

Re: Pig's MAPREDUCE keyword syntax assumes input must come from Pig?

Posted by Daniel Dai <da...@hortonworks.com>.
Thanks Dan, see my comments inline.

Daniel

On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <da...@danbri.org> wrote:

> Hi all! I have been experimenting with wrapping some of Apache
> Mahout's machine-learning-related jobs inside Pig macros, via the
> MAPREDUCE keyword. This seemed very nearly doable, but I hit a few
> issues, hence this mail.
>

> While I enjoyed an initial minor success, I hit a problem because the
> job I was trying actually wanted to take its input from existing data in
> hdfs, rather than from Pig. However, it seems Pig requires a 'STORE FOO
> INTO' clause when using MAPREDUCE. Is there any reason this is not
> optional?
>

We expect the native mapreduce job to take one input produced by Pig and
produce one output feeding into the rest of the Pig script. This is the
interface between Pig and MapReduce.
Take WordCount as an example:
b = mapreduce 'hadoop-examples.jar' Store a into 'input' Load 'output'
`wordcount input output`;

Pig will save a into 'input' and wordcount will take it as its input.
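
For instance, a complete round trip might look like this (a minimal
sketch; the file names, schema, and threshold are made up):

a = LOAD 'docs.txt' AS (line: chararray);               -- Pig prepares the input
b = MAPREDUCE 'hadoop-examples.jar'
        STORE a INTO 'input'                            -- Pig materializes a for the jar
        LOAD 'output' AS (word: chararray, count: int)  -- Pig reads the jar's output
        `wordcount input output`;
c = FILTER b BY count > 10;                             -- the rest of the script continues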

In your script, I saw you hard-code the Mahout input/output. I believe this
is just a test; in the real world you will use Pig to prepare and consume
the input/output. Otherwise, what's the point of binding Pig and Mahout?


>
> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
> <file mig.macro, line 6, column 1>  mismatched input 'LOAD' expecting
> STORE
>
> Complicating things further, I couldn't see a way of creating data for
> this dummy input within Pig Latin (or at least the Grunt shell), other
> than loading an empty file (which needed creating, cleaning up, etc.).
> Is there a syntax for declaring relations as literal data inline that
> I'm missing? Also, experimenting in Grunt, I found it tricky that
> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
> that it was all too easy to get an error from importing the same macro
> twice within one session.
>

This we definitely want to fix.


>
> The Mahout/Pig proof of concept examples are at
>
> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
>
> Details of the Mahout side of things at
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1CvjL8Cs2Q@mail.gmail.com%3E
>
> If I'm missing something obvious that will provide for smoother
> integration, I'd be very happy to learn. Currently what I have is just
> this example (the simplest case: reading a seq directory in Mahout and
> doing downstream filtering of the Mahout results in Pig Latin):
>
>
> run miglib.pig; -- basic setup, including macro definitions
>
> -- get collocated phrases from a seqdir
> reuters_phrases =
> collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);
>
> political_phrases = FILTER reuters_phrases BY phrase MATCHES
> '.*(president|minister|government|election).*' AND score > (float)10;
>
> I'd love to get rid of the 'IGNORE' here, but this is the macro definition:
>
> DEFINE collocations (SEQDIR,IGNORE) RETURNS sorted_concepts {
> DEFINE SequenceFileLoader
> org.apache.pig.piggybank.storage.SequenceFileLoader();
> raw_concepts = MAPREDUCE
> '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar' STORE IGNORE INTO
> 'migtest/dummy-input' LOAD
> 'migtest/collocations_output/ngrams/part-r-*' USING SequenceFileLoader
> AS (phrase: chararray, score: float)
> `org.apache.mahout.driver.MahoutDriver
> org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i $SEQDIR
> -o migtest/collocations_output --analyzerName
> org.apache.mahout.vectorizer.DefaultAnalyzer --maxNGramSize 2
> --preprocess --overwrite `;
> $sorted_concepts = order raw_concepts by score desc;
> };
>
>
> Is this a reasonable thing to attempt? At least in the Mahout case, it
> looks to me as though it will be common for input to come from other
> files in hdfs rather than from Pig relations, so maybe the requirement
> for STORE ... INTO could be softened?
>

> Thanks for any suggestions...
>

That seems to be a very interesting project. Let me know your progress and
anything I can help with.


>
> Dan
>