Posted to user@mahout.apache.org by Darren Govoni <da...@ontrenet.com> on 2011/01/21 23:36:25 UTC
Running CollocDriver, exception
Hi,
I'm new to Mahout and tried to research this problem a bit before asking.
After I generate a sequence file for a directory of text files, I run this:
bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
-i out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer
-ng 3
It produces a couple exceptions:
...
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
to org.apache.mahout.common.StringTuple
at
org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
...
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
How can I make this work?
Thanks for any tips,
Darren
Re: Running CollocDriver, exception
Posted by Drew Farris <dr...@apache.org>.
On Sun, Jan 23, 2011 at 11:09 PM, Darren Govoni <da...@ontrenet.com> wrote:
> Drew,
> Thanks for the tip. It works great now!
Great, glad it's working.
> PS. The sort command you suggested doesn't quite sort by LLR score:
> it's only a lexical sort, so it misses that something like 70.000 should
> be greater than 8.000.
>> Running the results through 'sort -rm -k 6,6' will give you output
>> sorted by LLR score descending.
Oops, thanks for pointing that out -- that 'm' is a typo -- should be
'n', e.g.: 'sort -rn -k 6,6'
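The difference is easy to see with two made-up lines in the seqdumper output format (the scores here are invented for illustration):

```shell
# Two fake ngram lines; the scores are chosen so that a lexical sort gets
# the order wrong ('8' > '7' as characters, but 8.0 < 70.0 as numbers).
printf 'Key: 0 0 foo: Value: 70.000\nKey: 0 0 bar: Value: 8.000\n' > ngrams_demo.txt
# Lexical descending sort: 8.000 incorrectly sorts above 70.000
sort -r -k 6,6 ngrams_demo.txt
# Numeric descending sort (-n): 70.000 correctly comes first
sort -rn -k 6,6 ngrams_demo.txt
```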
Re: Running CollocDriver, exception
Posted by Darren Govoni <da...@ontrenet.com>.
Drew,
Thanks for the tip. It works great now!
Darren
PS. The sort command you suggested doesn't quite sort by LLR score:
it's only a lexical sort, so it misses that something like 70.000 should
be greater than 8.000.
On 01/23/2011 11:59 AM, Drew Farris wrote:
> ...
Re: Running CollocDriver, exception
Posted by Drew Farris <dr...@apache.org>.
Ahh, ok. Output from seqdirectory is a SequenceFile<Text,Text>, where
the value is the un-tokenized text of each document. By default the
CollocDriver expects tokenized text as input, but if you add the '-p'
option to the CollocDriver command-line it will tokenize the text
before generating the collocations, so you can use the output of
seqdirectory as is.
for example:
./bin/mahout seqdirectory \
-i ./examples/bin/work/reuters-out/ \
-o ./examples/bin/work/reuters-out-seqdir \
-c UTF-8 -chunk 5
./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
-i ./examples/bin/work/reuters-out-seqdir \
-o ./examples/bin/work/reuters-colloc-2 \
-a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
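As a rough sketch of what that preprocessing pass amounts to -- lowercasing and splitting into alphabetic tokens -- here is a crude shell stand-in (the real analyzer is Lucene-based, so this is only an approximation; the sample sentence is invented):

```shell
# Hypothetical one-line document: lowercase it, then break runs of
# non-letters into token boundaries (a crude stand-in for the analyzer).
echo 'After I generate a SequenceFile, I run CollocDriver.' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs '[:alpha:]' '\n' \
  | grep -v '^$'
```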
Drew
On Sun, Jan 23, 2011 at 10:44 AM, Darren Govoni <da...@ontrenet.com> wrote:
> Hi Drew,
> Thanks for the tips - much appreciated. See inline.
>
> On 01/23/2011 09:22 AM, Drew Farris wrote:
>>
>> Hi Darren,
>>
>> From the error message you receive, it is not exactly clear what is
>> happening here. I suppose it could be due to the format of the input
>> sequence file, but I'm not certain.
>>
>> A couple questions that will help me answer your question:
>>
>> 1) What version of Mahout are you using?
>
> 0.4
>>
>> 2) How are you generating the sequence file you are using as input to
>> the CollocDriver?
>
> bin/mahout seqdirectory --charset ascii --input textfiles/ --output out
>
> Then I run:
>
> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
> out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>
> I am not running hadoop. The error is repeatable. Here is the full output.
> ...
Re: Running CollocDriver, exception
Posted by Darren Govoni <da...@ontrenet.com>.
Hi Drew,
Thanks for the tips - much appreciated. See inline.
On 01/23/2011 09:22 AM, Drew Farris wrote:
> Hi Darren,
>
> From the error message you receive, it is not exactly clear what is
> happening here. I suppose it could be due to the format of the input
> sequence file, but I'm not certain.
>
> A couple questions that will help me answer your question:
>
> 1) What version of Mahout are you using?
0.4
> 2) How are you generating the sequence file you are using as input to
> the CollocDriver?
bin/mahout seqdirectory --charset ascii --input textfiles/ --output out
Then I run:
bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
I am not running hadoop. The error is repeatable. Here is the full output.
-----------
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 317 ms
[darren@cobalt mahout-distribution-0.4]$ bin/mahout
org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No
org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props found
on classpath, will use command-line arguments only
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments:
{--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer,
--endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2,
--maxRed=2, --minLLR=1.0, --minSupport=2, --output=phrases,
--startPhase=0, --tempDir=temp}
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Maximum n-gram size is: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum Support value: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum LLR value: 1.0
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Number of pass1 reduce tasks: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input will NOT be preprocessed
Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Jan 23, 2011 10:42:56 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Running job: job_local_0001
Jan 23, 2011 10:42:56 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM
org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Jan 23, 2011 10:42:56 AM
org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Jan 23, 2011 10:42:56 AM
org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Max Ngram size is 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Emit Unitgrams is false
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
to org.apache.mahout.common.StringTuple
at
org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: map 0% reduce 0%
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Job complete: job_local_0001
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker,
sessionId= - already initialized
Jan 23, 2011 10:42:57 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Running job: job_local_0002
Jan 23, 2011 10:42:58 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: map 0% reduce 0%
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Job complete: job_local_0002
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 3064 ms
> ...
Re: Running CollocDriver, exception
Posted by Drew Farris <dr...@apache.org>.
Hi Darren,
From the error message you receive, it is not exactly clear what is
happening here. I suppose it could be due to the format of the input
sequence file, but I'm not certain.
A couple questions that will help me answer your question:
1) What version of Mahout are you using?
2) How are you generating the sequence file you are using as input to
the CollocDriver?
Using the latest code from trunk, I was able to run the following
sequence of commands on the data available after running
./examples/bin/build-reuters.sh
(All run from the mahout toplevel directory)
./bin/mahout seqdirectory \
-i ./examples/bin/work/reuters-out/ \
-o ./examples/bin/work/reuters-out-seqdir \
-c UTF-8 -chunk 5
./bin/mahout seq2sparse \
-i ./examples/bin/work/reuters-out-seqdir/ \
-o ./examples/bin/work/reuters-out-seqdir-sparse
./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
-i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
-o ./examples/bin/work/reuters-colloc \
-a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
./bin/mahout seqdumper -s
./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less
This produces output like:
Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.DoubleWritable
Key: 0 0 25: Value: 18.436118042416638
Key: 0 0 zen: Value: 39.36827993847055
Where the key is the trigram and the value is the llr score.
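Each of those lines can be picked apart in the shell; the field layout here is assumed from the sample output above:

```shell
line='Key: 0 0 zen: Value: 39.36827993847055'
# The ngram sits between the 'Key: ' prefix and the ': Value:' separator
ngram=$(printf '%s\n' "$line" | sed 's/^Key: \(.*\): Value: .*$/\1/')
# The LLR score is the last whitespace-separated field
score=$(printf '%s\n' "$line" | awk '{print $NF}')
echo "$ngram|$score"
```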
If there are multiple parts in
examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
them e.g:
./bin/mahout seqdumper -s
./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
./bin/mahout seqdumper -s
./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out
Running the results through 'sort -rm -k 6,6' will give you output
sorted by LLR score descending.
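Putting the concatenate-and-sort steps together on made-up part-file dumps (file name and contents invented; note that, as a later reply in the thread points out, the 'm' in '-rm' is a typo for 'n' -- the numeric sort wants '-rn'):

```shell
# Stand-ins for the per-part seqdumper dumps, appended into one file
printf 'Key: 0 0 zen: Value: 39.36827993847055\n' >  out_demo
printf 'Key: 0 0 25: Value: 18.436118042416638\n' >> out_demo
# Numeric sort on field 6 (the LLR score), descending
sort -rn -k 6,6 out_demo
```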
HTH,
Drew
On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <da...@ontrenet.com> wrote:
> ...