Posted to user@mahout.apache.org by Loek Cleophas <lo...@kalooga.com> on 2010/02/07 09:56:30 UTC

mahout-0.2 - Twenty newsgroups Ant task extract-20news-18828

Hi

A few weeks ago, after some toiling, I managed to get the input data  
for the 20 newsgroups example into the format used by the Bayes  
classifiers in Mahout. I did this on the trunk, and remember that it  
took some tricks in particular to get the PrepareTwentyNewsgroups code  
to run on the expanded data and extract/collapse it into the format  
used by Mahout's Bayes classifiers.

For some reason now beyond me, I removed that copy of the trunk with
the example data. Now I'm trying to redo the same (albeit this time
on release 0.2), but am having trouble. I copied maven/build.xml
to examples/build.xml, following a September post on the user
group (http://old.nabble.com/20-newsgroups-example-td25235941.html).
That post also suggested modifying the file, i.e. taking out the
reference <classpath refid="maven.test.classpath"/> (which indeed is
not recognized when I run the extract-20news-18828 Ant target), and
adding the following lines:

       <classpath>
           <path id="lib.path.ref">
             <fileset dir="target" includes="*.jar"/>
           </path>
           <path id="lib.path.ref">
             <fileset dir="lib" includes="*.jar"/>
           </path>
       </classpath>

The "target" one makes some sense, but the lib one does not - I don't  
see any lib folder in my mahout-0.2 checkout (even after having done  
the mvn install of core and mvn compile of examples). Can anyone  
(Robin?) tell me what lines to add instead to get the Ant task to  
work? I know I managed to get it working before on my own, but can't  
remember for the life of me how I did it :-\

Regards,
Loek 
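
One way to see what those two <fileset> entries would actually pick up, sketched in shell under a few assumptions: it is run from the examples directory, and count_jars is a hypothetical helper name, not something from Mahout or Ant. The commented-out Maven command at the end uses the maven-dependency-plugin's copy-dependencies goal, which may be one way to populate a lib folder; whether that matches what the September post intended is a guess, and it has not been checked against the 0.2 build.

```shell
#!/bin/sh
# Sketch only: count the jars that <fileset dir="..." includes="*.jar"/>
# would match, i.e. *.jar files directly inside the given directory.
# count_jars is a hypothetical helper, not part of Mahout, Ant or Maven.

count_jars() {
  # $1 = directory to inspect; prints 0 if it is missing or has no jars
  n=0
  for f in "$1"/*.jar; do
    if [ -f "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Report both directories named in the suggested <classpath> block.
for dir in target lib; do
  echo "$dir: $(count_jars "$dir") jar(s)"
done

# If lib turns out to be missing or empty, the dependency plugin can copy
# the module's dependencies into it (assumption: untested with Mahout 0.2):
#   mvn dependency:copy-dependencies -DoutputDirectory=lib
```

If lib reports 0 jars, the second <path> entry contributes nothing to the classpath, which would be consistent with the Ant target failing to find its dependencies.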

Re: mahout-0.2 - Twenty newsgroups Ant task extract-20news-18828

Posted by Loek Cleophas <lo...@kalooga.com>.
Whoops! I hadn't checked it in the last few days. Thanks for updating
it - I'm sure it'll be helpful for others trying to follow the example.

Regards,
Loek

On Feb 9, 2010, at 09:23, Robin Anil wrote:

> Yes, it was updated shortly. It's here.
> http://cwiki.apache.org/MAHOUT/twentynewsgroups.html


Re: mahout-0.2 - Twenty newsgroups Ant task extract-20news-18828

Posted by Robin Anil <ro...@gmail.com>.
Yes, it was updated shortly. It's here.
http://cwiki.apache.org/MAHOUT/twentynewsgroups.html



On Tue, Feb 9, 2010 at 1:48 PM, Loek Cleophas <lo...@kalooga.com> wrote:

> Hi Robin,
>
> Thank you, that was definitely enough. I ran the PrepareTwentyNewsgroups
> task using the mvn exec command you suggested now (seems it's time for me to
> read up on Maven - useful how it takes care of finding the includes etc.).
> For training and testing, I'm using hadoop directly, which works fine.
>
> This is probably already on your/someone's to do list, but it might be a
> good idea to update the wiki page describing the example, so that it deals
> with 0.2 or the trunk vs. some pre 0.2 release version (?). I know, you
> probably have enough to work on without that..
>
> Regards,
> Loek

Re: mahout-0.2 - Twenty newsgroups Ant task extract-20news-18828

Posted by Loek Cleophas <lo...@kalooga.com>.
Hi Robin,

Thank you, that was definitely enough. I ran the  
PrepareTwentyNewsgroups task using the mvn exec command you suggested  
now (seems it's time for me to read up on Maven - useful how it takes  
care of finding the includes etc.). For training and testing, I'm  
using hadoop directly, which works fine.

This is probably already on your/someone's to do list, but it might be  
a good idea to update the wiki page describing the example, so that it  
deals with 0.2 or the trunk vs. some pre 0.2 release version (?). I  
know, you probably have enough to work on without that..

Regards,
Loek

On Feb 7, 2010, at 13:48, Robin Anil wrote:

> Are the mvn exec commands to run the 20-newsgroups example enough? I
> haven't used Ant for a while (read: 8 months), and Mahout has shifted
> to Maven anyway.
>
> So here goes. In the examples directory:
>
> $ tar zxf 20news-18828.tar.gz
> $ mkdir 20news-input
> $ mvn -e exec:java \
>     -Dexec.mainClass=org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
>     -Dexec.args="-p 20news-18828 -o 20news-input -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8"
>
> To train:
>
> $ mvn -e exec:java \
>     -Dexec.mainClass=org.apache.mahout.classifier.bayes.TrainClassifier \
>     -Dexec.args="-i 20news-input -o 20news-model -type cbayes -ng 1 -source hdfs"
>
> To test:
>
> $ mvn -e exec:java \
>     -Dexec.mainClass=org.apache.mahout.classifier.bayes.TestClassifier \
>     -Dexec.args="-m 20news-model -d 20news-input -type cbayes -ng 1 -source hdfs -method sequential"


Re: mahout-0.2 - Twenty newsgroups Ant task extract-20news-18828

Posted by Robin Anil <ro...@gmail.com>.
Are the mvn exec commands to run the 20-newsgroups example enough? I
haven't used Ant for a while (read: 8 months), and Mahout has shifted
to Maven anyway.

So here goes. In the examples directory:

$ tar zxf 20news-18828.tar.gz
$ mkdir 20news-input
$ mvn -e exec:java \
    -Dexec.mainClass=org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
    -Dexec.args="-p 20news-18828 -o 20news-input -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8"

To train:

$ mvn -e exec:java \
    -Dexec.mainClass=org.apache.mahout.classifier.bayes.TrainClassifier \
    -Dexec.args="-i 20news-input -o 20news-model -type cbayes -ng 1 -source hdfs"

To test:

$ mvn -e exec:java \
    -Dexec.mainClass=org.apache.mahout.classifier.bayes.TestClassifier \
    -Dexec.args="-m 20news-model -d 20news-input -type cbayes -ng 1 -source hdfs -method sequential"
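
The three steps above could be wrapped in a small script, roughly as follows. This is only a sketch: run_main and the DRY_RUN switch are hypothetical names introduced here, the class names and arguments are copied from the commands above, and it assumes the script is run from the examples directory with the tarball already unpacked and 20news-input already created. With DRY_RUN=1 (the default in this sketch) it only prints the Maven commands instead of running them.

```shell
#!/bin/sh
# Sketch only: wraps the prepare/train/test steps from the commands above.
# run_main and DRY_RUN are hypothetical names, not part of Mahout or Maven.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = actually run Maven
PKG=org.apache.mahout.classifier.bayes

run_main() {
  # $1 = fully qualified main class, $2 = argument string for -Dexec.args
  if [ "$DRY_RUN" = "1" ]; then
    printf 'mvn -e exec:java -Dexec.mainClass=%s -Dexec.args="%s"\n' "$1" "$2"
  else
    mvn -e exec:java "-Dexec.mainClass=$1" "-Dexec.args=$2"
  fi
}

# Prepare: convert the unpacked 20news-18828 directory into 20news-input.
run_main "$PKG.PrepareTwentyNewsgroups" \
  "-p 20news-18828 -o 20news-input -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8"

# Train a cbayes model on the prepared input.
run_main "$PKG.TrainClassifier" \
  "-i 20news-input -o 20news-model -type cbayes -ng 1 -source hdfs"

# Test the trained model against the same prepared input.
run_main "$PKG.TestClassifier" \
  "-m 20news-model -d 20news-input -type cbayes -ng 1 -source hdfs -method sequential"
```

Passing each argument string as a single quoted value keeps the whole of -Dexec.args together, which is the part that tends to break when the commands are re-wrapped by a mail client.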


