Posted to user@mahout.apache.org by Darren Govoni <da...@ontrenet.com> on 2011/01/21 23:36:25 UTC

Running CollocDriver, exception

Hi,
   I'm new to Mahout and tried to research this problem a bit before
posting.

After I generate a sequence file for a directory of text files, I run this:

  bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver 
-i out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer 
-ng 3

It produces a couple of exceptions:
...
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast 
to org.apache.mahout.common.StringTuple
     at 
org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient 
monitorAndPrintJob
...
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
     at java.util.ArrayList.get(ArrayList.java:322)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)

How can I make this work?

Thanks for any tips,
Darren

Re: Running CollocDriver, exception

Posted by Drew Farris <dr...@apache.org>.
On Sun, Jan 23, 2011 at 11:09 PM, Darren Govoni <da...@ontrenet.com> wrote:
> Drew,
>  Thanks for the tip. It works great now!

Great, glad it's working.

> PS. the sort command you suggested doesn't quite sort by LLR score
> because it's only a lexical sort and misses that something like 70.000
> should be greater than 8.000
>> Running the results through 'sort -rm -k 6,6' will give you output
>> sorted by LLR score descending.

Oops, thanks for pointing that out -- that 'm' is a typo -- it should be
'n', e.g.: 'sort -rn -k 6,6'
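
For example, putting it together (a sketch assuming the concatenated
seqdumper output from my earlier message is sitting in a file named
'out', and trigram output, where the LLR score lands in
whitespace-delimited field 6):

# numeric (-n), reverse (-r) sort on field 6, the LLR score
sort -rn -k 6,6 out | head -20

Note the field index depends on the n-gram size: with bigram output the
score is field 5, not 6.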

Re: Running CollocDriver, exception

Posted by Darren Govoni <da...@ontrenet.com>.
Drew,
   Thanks for the tip. It works great now!

Darren

PS. the sort command you suggested doesn't quite sort by LLR score
because it's only a lexical sort and misses that something like 70.000
should be greater than 8.000


On 01/23/2011 11:59 AM, Drew Farris wrote:
> Ahh, ok. Output from seqdirectory is a SequenceFile<Text,Text>, where
> the value is the un-tokenized text of each document. By default the
> CollocDriver expects tokenized text as input, but if you add the '-p'
> option to the CollocDriver command line, it will tokenize the text
> before generating the collocations, so you can use the output of
> seqdirectory as is.
>
> For example:
>
> ./bin/mahout seqdirectory \
>   -i ./examples/bin/work/reuters-out/ \
>   -o ./examples/bin/work/reuters-out-seqdir \
>   -c UTF-8 -chunk 5
>
> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>    -i ./examples/bin/work/reuters-out-seqdir \
>    -o ./examples/bin/work/reuters-colloc-2 \
>    -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
>
> Drew
>
> On Sun, Jan 23, 2011 at 10:44 AM, Darren Govoni<da...@ontrenet.com>  wrote:
>> Hi Drew,
>>   Thanks for the tips - much appreciated. See inline.
>>
>> On 01/23/2011 09:22 AM, Drew Farris wrote:
>>> Hi Darren,
>>>
>>>   From the error message you receive, it is not exactly clear what is
>>> happening here. I suppose it could be due to the format of the input
>>> sequence file, but I'm not certain.
>>>
>>> A couple of questions that will help me answer your question:
>>>
>>> 1) What version of Mahout are you using?
>> 0.4
>>> 2) How are you generating the sequence file you are using as input to
>>> the CollocDriver?
>> bin/mahout seqdirectory --charset ascii --input textfiles/ --output out
>>
>> Then I run:
>>
>> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
>> out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>>
>> I am not running Hadoop. The error is repeatable. Here is the full output.
>> -----------
>> no HADOOP_HOME set, running locally
>> Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Program took 317 ms
>> [darren@cobalt mahout-distribution-0.4]$ bin/mahout
>> org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o
>> phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>> no HADOOP_HOME set, running locally
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
>> WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props
>> found on classpath, will use command-line arguments only
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Command line arguments:
>> {--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer,
>> --endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, --maxRed=2,
>> --minLLR=1.0, --minSupport=2, --output=phrases, --startPhase=0,
>> --tempDir=temp}
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Maximum n-gram size is: 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Minimum Support value: 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Minimum LLR value: 1.0
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Number of pass1 reduce tasks: 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Input will NOT be preprocessed
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
>> INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
>> Jan 23, 2011 10:42:56 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 1
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Running job: job_local_0001
>> Jan 23, 2011 10:42:56 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 1
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>> <init>
>> INFO: io.sort.mb = 100
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>> <init>
>> INFO: data buffer = 79691776/99614720
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>> <init>
>> INFO: record buffer = 262144/327680
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Max Ngram size is 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Emit Unitgrams is false
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
>> WARNING: job_local_0001
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.mahout.common.StringTuple
>>     at
>> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO:  map 0% reduce 0%
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Job complete: job_local_0001
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
>> INFO: Counters: 0
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
>> INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId=
>> - already initialized
>> Jan 23, 2011 10:42:57 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 0
>> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Running job: job_local_0002
>> Jan 23, 2011 10:42:58 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 0
>> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
>> WARNING: job_local_0002
>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>     at java.util.ArrayList.get(ArrayList.java:322)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO:  map 0% reduce 0%
>> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Job complete: job_local_0002
>> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
>> INFO: Counters: 0
>> Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Program took 3064 ms
>>
>>> Using the latest code from trunk, I was able to run the following
>>> sequence of commands on the data available after running
>>> ./examples/bin/build-reuters.sh
>>>
>>> (All run from the mahout toplevel directory)
>>>
>>> ./bin/mahout seqdirectory \
>>>    -i ./examples/bin/work/reuters-out/ \
>>>    -o ./examples/bin/work/reuters-out-seqdir \
>>>    -c UTF-8 -chunk 5
>>>
>>> ./bin/mahout seq2sparse \
>>>    -i ./examples/bin/work/reuters-out-seqdir/ \
>>>    -o ./examples/bin/work/reuters-out-seqdir-sparse
>>>
>>> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>>>    -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
>>>    -o ./examples/bin/work/reuters-colloc \
>>>    -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>>
>>>   ./bin/mahout seqdumper -s \
>>> ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less
>>>
>>> This produces output like:
>>>
>>> Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.hadoop.io.DoubleWritable
>>> Key: 0 0 25: Value: 18.436118042416638
>>> Key: 0 0 zen: Value: 39.36827993847055
>>>
>>> Where the key is the trigram and the value is the LLR score.
>>>
>>> If there are multiple parts in
>>> examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
>>> them e.g:
>>>
>>> ./bin/mahout seqdumper -s \
>>> ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
>>> ./bin/mahout seqdumper -s \
>>> ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out
>>>
>>> Running the results through 'sort -rm -k 6,6' will give you output
>>> sorted by LLR score descending.
>>>
>>> HTH,
>>>
>>> Drew
>>>
>>> On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni<da...@ontrenet.com>
>>>   wrote:
>>>> Hi,
>>>>   I'm new to Mahout and tried to research this problem a bit before
>>>> posting.
>>>>
>>>> After I generate a sequence file for a directory of text files, I run this:
>>>>
>>>>   bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
>>>> out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng
>>>> 3
>>>>
>>>> It produces a couple of exceptions:
>>>> ...
>>>> WARNING: job_local_0001
>>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>>>> org.apache.mahout.common.StringTuple
>>>>     at
>>>>
>>>> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>     at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
>>>> monitorAndPrintJob
>>>> ...
>>>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>>     at java.util.ArrayList.get(ArrayList.java:322)
>>>>     at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>>>>
>>>> How can I make this work?
>>>>
>>>> Thanks for any tips,
>>>> Darren
>>>>
>>


Re: Running CollocDriver, exception

Posted by Drew Farris <dr...@apache.org>.
Ahh, ok. Output from seqdirectory is a SequenceFile<Text,Text>, where
the value is the un-tokenized text of each document. By default the
CollocDriver expects tokenized text as input, but if you add the '-p'
option to the CollocDriver command line, it will tokenize the text
before generating the collocations, so you can use the output of
seqdirectory as is.

For example:

./bin/mahout seqdirectory \
 -i ./examples/bin/work/reuters-out/ \
 -o ./examples/bin/work/reuters-out-seqdir \
 -c UTF-8 -chunk 5

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
  -i ./examples/bin/work/reuters-out-seqdir \
  -o ./examples/bin/work/reuters-colloc-2 \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
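
To sanity-check the result you can dump the scored n-grams with
seqdumper, just as in my earlier message (a sketch -- it assumes the
LLR-scored n-grams land in an 'ngrams' subdirectory of the output
directory, as they did in that run):

./bin/mahout seqdumper -s \
  ./examples/bin/work/reuters-colloc-2/ngrams/part-r-00000 | less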

Drew

On Sun, Jan 23, 2011 at 10:44 AM, Darren Govoni <da...@ontrenet.com> wrote:
> Hi Drew,
>  Thanks for the tips - much appreciated. See inline.
>
> On 01/23/2011 09:22 AM, Drew Farris wrote:
>>
>> Hi Darren,
>>
>>  From the error message you receive, it is not exactly clear what is
>> happening here. I suppose it could be due to the format of the input
>> sequence file, but I'm not certain.
>>
>> A couple of questions that will help me answer your question:
>>
>> 1) What version of Mahout are you using?
>
> 0.4
>>
>> 2) How are you generating the sequence file you are using as input to
>> the CollocDriver?
>
> bin/mahout seqdirectory --charset ascii --input textfiles/ --output out
>
> Then I run:
>
> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
> out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>
> I am not running Hadoop. The error is repeatable. Here is the full output.
> -----------
> no HADOOP_HOME set, running locally
> Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Program took 317 ms
> [darren@cobalt mahout-distribution-0.4]$ bin/mahout
> org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o
> phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
> no HADOOP_HOME set, running locally
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
> WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props
> found on classpath, will use command-line arguments only
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Command line arguments:
> {--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer,
> --endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, --maxRed=2,
> --minLLR=1.0, --minSupport=2, --output=phrases, --startPhase=0,
> --tempDir=temp}
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Maximum n-gram size is: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Minimum Support value: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Minimum LLR value: 1.0
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Number of pass1 reduce tasks: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Input will NOT be preprocessed
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> Jan 23, 2011 10:42:56 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 1
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Running job: job_local_0001
> Jan 23, 2011 10:42:56 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 1
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> <init>
> INFO: io.sort.mb = 100
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> <init>
> INFO: data buffer = 79691776/99614720
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> <init>
> INFO: record buffer = 262144/327680
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Max Ngram size is 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Emit Unitgrams is false
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0001
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.common.StringTuple
>    at
> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO:  map 0% reduce 0%
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Job complete: job_local_0001
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId=
> - already initialized
> Jan 23, 2011 10:42:57 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Running job: job_local_0002
> Jan 23, 2011 10:42:58 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0002
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>    at java.util.ArrayList.get(ArrayList.java:322)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO:  map 0% reduce 0%
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Job complete: job_local_0002
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Program took 3064 ms
>
>> Using the latest code from trunk, I was able to run the following
>> sequence of commands on the data available after running
>> ./examples/bin/build-reuters.sh
>>
>> (All run from the mahout toplevel directory)
>>
>> ./bin/mahout seqdirectory \
>>   -i ./examples/bin/work/reuters-out/ \
>>   -o ./examples/bin/work/reuters-out-seqdir \
>>   -c UTF-8 -chunk 5
>>
>> ./bin/mahout seq2sparse \
>>   -i ./examples/bin/work/reuters-out-seqdir/ \
>>   -o ./examples/bin/work/reuters-out-seqdir-sparse
>>
>> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>>   -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
>>   -o ./examples/bin/work/reuters-colloc \
>>   -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>
>>  ./bin/mahout seqdumper -s \
>> ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less
>>
>> This produces output like:
>>
>> Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class
>> org.apache.hadoop.io.DoubleWritable
>> Key: 0 0 25: Value: 18.436118042416638
>> Key: 0 0 zen: Value: 39.36827993847055
>>
>> Where the key is the trigram and the value is the LLR score.
>>
>> If there are multiple parts in
>> examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
>> them e.g:
>>
>> ./bin/mahout seqdumper -s \
>> ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
>> ./bin/mahout seqdumper -s \
>> ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out
>>
>> Running the results through 'sort -rm -k 6,6' will give you output
>> sorted by LLR score descending.
>>
>> HTH,
>>
>> Drew
>>
>> On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni<da...@ontrenet.com>
>>  wrote:
>>>
>>> Hi,
>>>  I'm new to Mahout and tried to research this problem a bit before
>>> posting.
>>>
>>> After I generate a sequence file for a directory of text files, I run this:
>>>
>>>  bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
>>> out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng
>>> 3
>>>
>>> It produces a couple of exceptions:
>>> ...
>>> WARNING: job_local_0001
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>>> org.apache.mahout.common.StringTuple
>>>    at
>>>
>>> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>    at
>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
>>> monitorAndPrintJob
>>> ...
>>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>>    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>    at java.util.ArrayList.get(ArrayList.java:322)
>>>    at
>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>>>
>>> How can I make this work?
>>>
>>> Thanks for any tips,
>>> Darren
>>>
>
>

Re: Running CollocDriver, exception

Posted by Darren Govoni <da...@ontrenet.com>.
Hi Drew,
   Thanks for the tips - much appreciated. See inline.

On 01/23/2011 09:22 AM, Drew Farris wrote:
> Hi Darren,
>
>  From the error message you receive, it is not exactly clear what is
> happening here. I suppose it could be due to the format of the input
> sequence file, but I'm not certain.
>
> A couple of questions that will help me answer your question:
>
> 1) What version of Mahout are you using?
0.4
> 2) How are you generating the sequence file you are using as input to
> the CollocDriver?
bin/mahout seqdirectory --charset ascii --input textfiles/ --output out

Then I run:

bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i 
out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer

I am not running Hadoop. The error is repeatable. Here is the full output.
-----------
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 317 ms
[darren@cobalt mahout-distribution-0.4]$ bin/mahout 
org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i 
out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No 
org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props found 
on classpath, will use command-line arguments only
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: 
{--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer, 
--endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, 
--maxRed=2, --minLLR=1.0, --minSupport=2, --output=phrases, 
--startPhase=0, --tempDir=temp}
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Maximum n-gram size is: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum Support value: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum LLR value: 1.0
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Number of pass1 reduce tasks: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input will NOT be preprocessed
Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Jan 23, 2011 10:42:56 AM 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient 
monitorAndPrintJob
INFO: Running job: job_local_0001
Jan 23, 2011 10:42:56 AM 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Jan 23, 2011 10:42:56 AM 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Jan 23, 2011 10:42:56 AM 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Max Ngram size is 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Emit Unitgrams is false
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast 
to org.apache.mahout.common.StringTuple
     at 
org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient 
monitorAndPrintJob
INFO:  map 0% reduce 0%
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient 
monitorAndPrintJob
INFO: Job complete: job_local_0001
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, 
sessionId= - already initialized
Jan 23, 2011 10:42:57 AM 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient 
monitorAndPrintJob
INFO: Running job: job_local_0002
Jan 23, 2011 10:42:58 AM 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
     at java.util.ArrayList.get(ArrayList.java:322)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient 
monitorAndPrintJob
INFO:  map 0% reduce 0%
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient 
monitorAndPrintJob
INFO: Job complete: job_local_0002
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 3064 ms

> Using the latest code from trunk, I was able to run the following
> sequence of commands on the data available after running
> ./examples/bin/build-reuters.sh
>
> (All run from the mahout toplevel directory)
>
> ./bin/mahout seqdirectory \
>    -i ./examples/bin/work/reuters-out/ \
>    -o ./examples/bin/work/reuters-out-seqdir \
>    -c UTF-8 -chunk 5
>
> ./bin/mahout seq2sparse \
>    -i ./examples/bin/work/reuters-out-seqdir/ \
>    -o ./examples/bin/work/reuters-out-seqdir-sparse
>
> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>    -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
>    -o ./examples/bin/work/reuters-colloc \
>    -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>
>   ./bin/mahout seqdumper -s \
> ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less
>
> This produces output like:
>
> Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class
> org.apache.hadoop.io.DoubleWritable
> Key: 0 0 25: Value: 18.436118042416638
> Key: 0 0 zen: Value: 39.36827993847055
>
> Where the key is the trigram and the value is the LLR score.
>
> If there are multiple parts in
> examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
> them e.g:
>
> ./bin/mahout seqdumper -s \
> ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
> ./bin/mahout seqdumper -s \
> ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out
>
> Running the results through 'sort -rm -k 6,6' will give you output
> sorted by LLR score descending.
>
> HTH,
>
> Drew
>
> On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni<da...@ontrenet.com>  wrote:
>> Hi,
>>   I'm new to Mahout and tried to research this problem a bit before
>> posting.
>>
>> After I generate a sequence file for a directory of text files, I run this:
>>
>>   bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
>> out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>
>> It produces a couple of exceptions:
>> ...
>> WARNING: job_local_0001
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.mahout.common.StringTuple
>>     at
>> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> ...
>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>     at java.util.ArrayList.get(ArrayList.java:322)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>>
>> How can I make this work?
>>
>> Thanks for any tips,
>> Darren
>>


Re: Running CollocDriver, exception

Posted by Drew Farris <dr...@apache.org>.
Hi Darren,

From the error message you receive, it is not exactly clear what is
happening here. I suppose it could be due to the format of the input
sequence file, but I'm not certain.

A couple of questions that will help me answer your question:

1) What version of Mahout are you using?
2) How are you generating the sequence file you are using as input to
the CollocDriver?

Using the latest code from trunk, I was able to run the following
sequence of commands on the data available after running
./examples/bin/build-reuters.sh

(All run from the mahout toplevel directory)

./bin/mahout seqdirectory \
  -i ./examples/bin/work/reuters-out/ \
  -o ./examples/bin/work/reuters-out-seqdir \
  -c UTF-8 -chunk 5

./bin/mahout seq2sparse \
  -i ./examples/bin/work/reuters-out-seqdir/ \
  -o ./examples/bin/work/reuters-out-seqdir-sparse

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
  -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
  -o ./examples/bin/work/reuters-colloc \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3

 ./bin/mahout seqdumper -s \
./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less

This produces output like:

Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.DoubleWritable
Key: 0 0 25: Value: 18.436118042416638
Key: 0 0 zen: Value: 39.36827993847055

Where the key is the trigram and the value is the LLR score.

If there are multiple parts in
examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
them e.g:

./bin/mahout seqdumper -s \
./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
./bin/mahout seqdumper -s \
./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out

Running the results through 'sort -rm -k 6,6' will give you output
sorted by LLR score descending.
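
Alternatively, to pull out just the score and n-gram for each entry (a
sketch that assumes the exact 'Key: ... : Value: ...' layout shown above
and the concatenated dump in 'out'):

# split on ': ' so $2 is the n-gram and $4 is the LLR score
awk -F': ' '/^Key:/ {print $4, $2}' out | sort -rn | head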

HTH,

Drew

On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <da...@ontrenet.com> wrote:
> Hi,
>  I'm new to Mahout and tried to research this problem a bit before
> posting.
>
> After I generate a sequence file for a directory of text files, I run this:
>
>  bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
> out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>
> It produces a couple of exceptions:
> ...
> WARNING: job_local_0001
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.common.StringTuple
>    at
> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> ...
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>    at java.util.ArrayList.get(ArrayList.java:322)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>
> How can I make this work?
>
> Thanks for any tips,
> Darren
>