Posted to user@mahout.apache.org by Duncan Lawie <ha...@hoopoes.com> on 2014/05/26 23:07:13 UTC

Trouble with AdaptiveLogistic command line

Hi,

I'm trying to get a grip on the mahout command line options, and getting 
caught either in gross misunderstanding or Java errors. Help greatly 
appreciated.

I've created some hand-built data which I expect to be noisy, but I still 
hoped to run it through my workflow before improving my data quality.

"id","brace","target"
000040045,0194,1
000006445,0149,1
000033554,0013,1
...

My understanding is that my workflow should be as follows:
1: Use "trainAdaptiveLogistic" with scored data to create a model (here 
called PC.model)
2: Use "validateAdaptiveLogistic" to test how good the model is on a 
holdout data set which has been scored
3: Use "runAdaptiveLogistic" on some unscored data (i.e. no third column) 
to find out new things

Firstly ... Is that a valid workflow?

runAdaptiveLogistic appears to expect scored data as well - at least, it 
fails if I give it only unscored data (i.e. the "target" column is absent).

If not, how do I productionise a model?
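
One possible stopgap, offered as an untested assumption rather than anything Mahout documents: pad the unscored file with a placeholder target column so that its header matches the training data, since the run step did complete on scored input. A minimal Java sketch follows. PadTargetColumn and pad are hypothetical names, not part of Mahout, and the placeholder value "1" is simply one of the labels seen in the training data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of Mahout: pads unscored CSV lines with a
// placeholder target column so the header matches the training data.
class PadTargetColumn {

    // Appends the quoted target name to the first (header) line and the
    // placeholder value to every other non-blank line. Blank lines are
    // passed through unchanged.
    static List<String> pad(List<String> lines, String targetName, String placeholder) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                out.add(line);
            } else if (i == 0) {
                out.add(line + ",\"" + targetName + "\"");
            } else {
                out.add(line + "," + placeholder);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Unscored rows in the same shape as the sample data, minus "target".
        List<String> unscored = Arrays.asList(
            "\"id\",\"brace\"",
            "000040045,0194",
            "000006445,0149");
        for (String line : pad(unscored, "target", "1")) {
            System.out.println(line);
        }
    }
}
```

The padded lines would then be written to the file passed as --input. Whether the placeholder label affects the scores written to --output is something to check rather than assume.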

(Note: I got the flow to work (at least with scored data for all three) 
with mahout-0.7 and mahout-0.8, but as I thought the "run" step should 
work differently I tried mahout-0.9. Here, the second step also fails.)


[cloudera@localhost ]$ mahout trainAdaptiveLogistic \
--passes 100 \
--input ./PCtrain \
--features 50 \
--output ./PC.model \
--target target \
--categories 2 \
--predictors brace \
--types t

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:56:19 WARN driver.MahoutDriver: No trainAdaptiveLogistic.props found on classpath, will use command-line arguments only
50
target ~

     0.000000000     0.051644057     0.000000000     0.000000000 
0.000000000     0.023763329     0.000000000     0.000000000 
-0.054034312    -0.000000000     0.000000000     0.021475032 
0.028820276     0.000000000     0.033145160     0.000000000 
0.000000000     0.000000000     0.000000000    -0.000000000 
0.000000000     0.000000000     0.000000000     0.000000000 
0.000000000     0.000000000     0.051755156     0.000000000 
-0.000000000    -0.000000001     0.000000000    -0.053815953 
0.030166157     0.000000000     0.000000000    -0.073127179 
0.000000000    -0.000000000     0.000000000     0.000000000 
-0.000000000     0.000000000     0.000000000    -0.108047988 
0.000000000     0.000000000     0.000000000     0.000000000 
0.000000000    -0.000000000
14/05/26 13:56:36 INFO driver.MahoutDriver: Program took 17784 ms (Minutes: 0.2964)

[cloudera@localhost]$ mahout validateAdaptiveLogistic \
--input ./PCtest \
--model ./PC.model \
--auc \
--confusion
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:56:53 WARN driver.MahoutDriver: No validateAdaptiveLogistic.props found on classpath, will use command-line arguments only

Log-likelihood:Min=-0.78, Max=-0.61, Mean=-0.68, Median=-0.69

AUC = 0.65

=======================================================
Confusion Matrix
-------------------------------------------------------
a        b        <--Classified as
182      0         |  182       a     = 1
0        18        |  18        b     = 2



Entropy Matrix: [[-0.7, -0.4], [-0.7, -0.3]]
14/05/26 13:56:54 INFO driver.MahoutDriver: Program took 1125 ms (Minutes: 0.018766666666666668)

[cloudera@localhost]$ mahout runAdaptiveLogistic \
--input ./PCrun \
--model ./PC.model \
--idcolumn id \
--output ./PC.out
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:57:09 WARN driver.MahoutDriver: No runAdaptiveLogistic.props found on classpath, will use command-line arguments only
Exception in thread "main" java.lang.NullPointerException
     at org.apache.mahout.classifier.sgd.CsvRecordFactory.firstLine(CsvRecordFactory.java:176)
     at org.apache.mahout.classifier.sgd.RunAdaptiveLogistic.mainToOutput(RunAdaptiveLogistic.java:83)
     at org.apache.mahout.classifier.sgd.RunAdaptiveLogistic.main(RunAdaptiveLogistic.java:54)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
     at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

[cloudera@localhost]$ mahout runAdaptiveLogistic \
--input ./PCtest \
--model ./PC.model \
--idcolumn id \
--output ./PC.out
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:57:35 WARN driver.MahoutDriver: No runAdaptiveLogistic.props found on classpath, will use command-line arguments only
100 records processed
200 records processed
200 records processed totally.
14/05/26 13:57:36 INFO driver.MahoutDriver: Program took 943 ms (Minutes: 0.015716666666666667)

Thanks,
Duncan.


Re: Trouble with AdaptiveLogistic command line

Posted by Duncan Lawie <ha...@hoopoes.com>.
Just briefly ...

It looks like org/apache/mahout/classifier/sgd/CsvRecordFactory.java 
throws a NullPointerException at line 197 when there is no target column:

196:  // record target column and establish dictionary for decoding target
197:     target = vars.get(targetName);

Letting vars.get(targetName) return a null without throwing an exception 
would appear to let this run and classify new data.
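
To make the suspected failure concrete, here is a standalone Java sketch. It assumes that the target field is a primitive int and that vars maps column names to Integer indexes; I have not confirmed either against the Mahout source. Under those assumptions a missing header name makes vars.get(...) return null, and auto-unboxing the null is what throws. TargetLookupSketch, NO_TARGET and the method names are hypothetical and only mimic the shape of the lookup quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration with hypothetical names. This is NOT Mahout
// source code; it only mimics the shape of the header lookup.
class TargetLookupSketch {

    // Sentinel meaning "the input has no target column" (an assumption).
    static final int NO_TARGET = -1;

    // Maps each header name to its column index.
    static Map<String, Integer> indexHeader(String[] header) {
        Map<String, Integer> vars = new HashMap<>();
        for (int i = 0; i < header.length; i++) {
            vars.put(header[i], i);
        }
        return vars;
    }

    // Unguarded: assigning a null Integer to an int auto-unboxes it and
    // throws NullPointerException when targetName is not in the header.
    static int unguardedLookup(Map<String, Integer> vars, String targetName) {
        int target = vars.get(targetName);
        return target;
    }

    // Guarded: a missing target column yields NO_TARGET instead of throwing.
    static int guardedLookup(Map<String, Integer> vars, String targetName) {
        Integer column = vars.get(targetName);
        return column == null ? NO_TARGET : column;
    }

    public static void main(String[] args) {
        // Header of an unscored file: no "target" column.
        Map<String, Integer> vars = indexHeader(new String[] {"id", "brace"});

        System.out.println("guarded lookup: " + guardedLookup(vars, "target"));
        try {
            unguardedLookup(vars, "target");
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup threw NullPointerException");
        }
    }
}
```

Any real fix would also need the code that later reads the target value from each row to skip that step when the sentinel is set, which this sketch does not attempt.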
