
Posted to user@mahout.apache.org by Sreejith S <sr...@gmail.com> on 2011/01/10 15:00:20 UTC

Re: Seq2Sparse and Collocation

Thank you.

I ran seq2sparse on a sequence file, and it created folders named
df-count, wordcount, tf-vectors, tfidf-vectors, tokenized-documents, etc.

While running bin/mahout
org.apache.mahout.vectorizer.collocations.llr.CollocDriver, the following
error occurred:
java.io.FileNotFoundException: File
file:/home/developer/Desktop/seqinput/wordcount/data does not exist.

What am I supposed to do?

Thank you,
Sreejith

On Thu, Dec 16, 2010 at 6:16 PM, Isabel Drost <is...@apache.org> wrote:

> On Thu, 16 Dec 2010 Federico Castanedo <fc...@inf.uc3m.es> wrote:
> > > <https://cwiki.apache.org/confluence/display/MAHOUT/Collocations>
> >
> > I think that wiki entry is old, the new class now lives at
> >
> > bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
> > --help
>
> Would you please be so kind to fix the page? Simply log in to the wiki
> (or create an account, if you don't have one already) and hit the
> "edit" button.
>
> Isabel
>

Re: Seq2Sparse and Collocation

Posted by Sreejith S <sr...@gmail.com>.

bin/mahout seq2sparse -a org.apache.mahout.vectorizer.DefaultAnalyzer -o /home/developer/Desktop/seqinput -i /home/developer/Desktop/input/text -ng 2

After that, it created 5 folders in the seqinput folder:
df-count, tfidf-vectors, tf-vectors (contains part-r-0000),
tokenized-documents (contains part-m-0000), and finally a folder named
wordcount.

The wordcount folder contains 2 subfolders, ngram and subgram (it contains
part-r-0000).
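
A quick way to capture a directory structure like this as text is a sorted
find listing. The sketch below is only an illustration: it rebuilds the layout
described above under a throwaway directory (a hypothetical stand-in for
/home/developer/Desktop/seqinput), uses the subfolder names as reported in
this message, and assumes part-r-0000 sits under subgram, which the message
does not state clearly.

```shell
#!/bin/sh
# Hypothetical stand-in for /home/developer/Desktop/seqinput.
root=$(mktemp -d)

# Folder names as reported in the message; real names on disk may differ.
mkdir -p "$root/df-count" "$root/tfidf-vectors" "$root/tf-vectors" \
         "$root/tokenized-documents" \
         "$root/wordcount/ngram" "$root/wordcount/subgram"

# Empty placeholder files standing in for the real part files.
: > "$root/tf-vectors/part-r-0000"
: > "$root/tokenized-documents/part-m-0000"
: > "$root/wordcount/subgram/part-r-0000"   # assumed location

# List every entry below the root, one path per line, in a stable order.
layout=$(cd "$root" && find . -mindepth 1 | LC_ALL=C sort)
printf '%s\n' "$layout"
```

On a real run, the same find command pointed at the actual seq2sparse output
folder would give the listing to paste into a reply.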

After this I tried:

bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i /home/developer/Desktop/seqinput -o /home/developer/Desktop/phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer

The output is:

no HADOOP_HOME set, running locally
11 Jan, 2011 9:36:09 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props found on classpath, will use command-line arguments only
11 Jan, 2011 9:36:10 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer, --endPhase=2147483647, --input=/home/developer/Desktop/seqinput, --maxNGramSize=2, --maxRed=2, --minLLR=1.0, --minSupport=2, --output=/home/developer/Desktop/seqinput, --startPhase=0, --tempDir=temp}
11 Jan, 2011 9:36:10 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Maximum n-gram size is: 2
11 Jan, 2011 9:36:10 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum Support value: 2
11 Jan, 2011 9:36:10 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum LLR value: 1.0
11 Jan, 2011 9:36:10 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Number of pass1 reduce tasks: 2
11 Jan, 2011 9:36:10 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input will NOT be preprocessed
11 Jan, 2011 9:36:10 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
11 Jan, 2011 9:36:11 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 7
Exception in thread "main" java.io.FileNotFoundException: File file:/home/developer/Desktop/seqinput/wordcount/data does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at org.apache.mahout.vectorizer.collocations.llr.CollocDriver.generateCollocations(CollocDriver.java:236)
    at org.apache.mahout.vectorizer.collocations.llr.CollocDriver.run(CollocDriver.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.vectorizer.collocations.llr.CollocDriver.main(CollocDriver.java:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
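
One possible reading of this trace, offered as an assumption rather than a
confirmed diagnosis: the exception is raised in
SequenceFileInputFormat.listStatus, which appears to treat every subdirectory
of the input path as a MapFile and to look for a file named data inside it.
Pointing the job at the top-level seq2sparse output folder would then fail on
a subdirectory such as wordcount. The sketch below imitates that check on a
throwaway mock layout; check_input is a hypothetical helper, not part of
Mahout or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: report each subdirectory of an input path that has no
# file named "data" directly inside it. This mimics the assumed MapFile
# lookup; it does not call Hadoop.
check_input() {
  input=$1
  for entry in "$input"/*; do
    [ -e "$entry" ] || continue          # skip the unexpanded glob on empty dirs
    if [ -d "$entry" ] && [ ! -f "$entry/data" ]; then
      printf 'missing: %s/data\n' "$entry"
    fi
  done
}

# Mock layout standing in for the real seq2sparse output folder.
demo=$(mktemp -d)
mkdir -p "$demo/seqinput/wordcount/ngram" "$demo/seqinput/tokenized-documents"
: > "$demo/seqinput/tokenized-documents/part-m-0000"

# The top-level folder contains subdirectories without a data file.
check_input "$demo/seqinput"

# A folder that holds only plain part files reports nothing.
check_input "$demo/seqinput/tokenized-documents"
```

If that reading is right, and since the log says "Input will NOT be
preprocessed", one thing to try is passing a folder that directly contains
the part files, for example the tokenized-documents folder, as the -i
argument. This is a suggestion to test, not a verified fix.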

Is there a problem with my procedure? Please help.

Thank you,
Sreejith



On Mon, Jan 10, 2011 at 7:54 PM, Robin Anil <ro...@gmail.com> wrote:

> Is this during the first map/reduce or the second? Can you paste the entire
> output and the directory structure of the input folder?
>
> Robin

Re: Seq2Sparse and Collocation

Posted by Robin Anil <ro...@gmail.com>.

Is this during the first map/reduce or the second? Can you paste the entire
output and the directory structure of the input folder?

Robin

