Posted to user@mahout.apache.org by sa...@eclinicalworks.com on 2013/02/21 10:57:46 UTC

Simple Mahout classification example


I want to train a model for text classification. My text comes from a database, and I really do not want to write it out to files just for Mahout training. I checked out the MIA source code and adapted the code below for a very basic training task. The usual issue with Mahout examples is that they either show how to run Mahout from the command prompt on the 20 newsgroups data set, or the code has a lot of dependencies on Hadoop, ZooKeeper, etc. I would really appreciate it if someone could have a look at my code, or point me to a very simple tutorial that shows how to train a model and then use it.
As of now, in the following code I never get past if (best != null), because learningAlgorithm.getBest() always returns null!
Sorry for posting the whole code, but I didn't see any other option.
 
import com.google.common.collect.LinkedListMultimap;
import com.google.common.collect.ListMultimap;
import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.ModelSerializer;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.ep.State;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleFunction;
import org.apache.mahout.math.function.Functions;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.Dictionary;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

public class Classifier {
    private static final int FEATURES = 10000;
    private static final TextValueEncoder encoder = new TextValueEncoder("body");
    private static final FeatureVectorEncoder bias = new ConstantValueEncoder("Intercept");
    private static final String[] LEAK_LABELS = {"none", "month-year", "day-month-year"};

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {
        int leakType = 0;
        AdaptiveLogisticRegression learningAlgorithm = new AdaptiveLogisticRegression(20, FEATURES, new L1());
        Dictionary newsGroups = new Dictionary();
        //ModelDissector md = new ModelDissector();

        // In-memory training data: label -> example texts
        ListMultimap<String, String> noteBySection = LinkedListMultimap.create();
        noteBySection.put("good", "I love this product, the screen is a pleasure to work with and is a great choice for any business");
        noteBySection.put("good", "What a product!! Really amazing clarity and works pretty well");
        noteBySection.put("good", "This product has good battery life and is a little bit heavy but I like it");
        noteBySection.put("bad", "I am really bored with the same UI, this is their 5th version(or fourth or sixth, who knows) and it looks just like the first one");
        noteBySection.put("bad", "The phone is bulky and useless");
        noteBySection.put("bad", "I wish i had never bought this laptop. It died in the first year and now i am not able to return it");

        encoder.setProbes(2);
        double step = 0;
        int[] bumps = {1, 2, 5};
        double averageCorrect = 0;
        double averageLL = 0;
        int k = 0;

        for (String key : noteBySection.keySet()) {
            System.out.println(key);
            for (String note : noteBySection.get(key)) {
                // Map the label to an integer id and train on the encoded text
                int actual = newsGroups.intern(key);
                Vector v = encodeFeatureVector(note);
                learningAlgorithm.train(actual, v);
                k++;

                // Reporting interval grows as 1, 2, 5, 10, 20, 50, ...
                int bump = bumps[(int) Math.floor(step) % bumps.length];
                int scale = (int) Math.pow(10, Math.floor(step / bumps.length));

                State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best = learningAlgorithm.getBest();
                double maxBeta;
                double nonZeros;
                double positive;
                double norm;
                double lambda = 0;
                double mu = 0;
                if (best != null) {
                    CrossFoldLearner state = best.getPayload().getLearner();
                    averageCorrect = state.percentCorrect();
                    averageLL = state.logLikelihood();
                    OnlineLogisticRegression model = state.getModels().get(0);
                    // finish off pending regularization
                    model.close();
                    Matrix beta = model.getBeta();
                    maxBeta = beta.aggregate(Functions.MAX, Functions.ABS);
                    // count coefficients that are effectively non-zero
                    nonZeros = beta.aggregate(Functions.PLUS, new DoubleFunction() {
                        @Override
                        public double apply(double v) {
                            return Math.abs(v) > 1.0e-6 ? 1 : 0;
                        }
                    });
                    // count positive coefficients
                    positive = beta.aggregate(Functions.PLUS, new DoubleFunction() {
                        @Override
                        public double apply(double v) {
                            return v > 0 ? 1 : 0;
                        }
                    });
                    norm = beta.aggregate(Functions.PLUS, Functions.ABS);
                    lambda = best.getMappedParams()[0];
                    mu = best.getMappedParams()[1];
                } else {
                    maxBeta = 0;
                    nonZeros = 0;
                    positive = 0;
                    norm = 0;
                }

                System.out.println(k % (bump * scale));
                if (k % (bump * scale) == 0) {
                    if (best != null) {
                        System.out.println("----------------------------");
                        ModelSerializer.writeBinary("c:/tmp/news-group-" + k + ".model",
                                best.getPayload().getLearner().getModels().get(0));
                    }
                    step += 0.25;
                    System.out.printf("%.2f\t%.2f\t%.2f\t%.2f\t%.8g\t%.8g\t", maxBeta, nonZeros, positive, norm, lambda, mu);
                    System.out.printf("%d\t%.3f\t%.2f\t%s\n",
                            k, averageLL, averageCorrect * 100, LEAK_LABELS[leakType % 3]);
                }
            }
        }
        learningAlgorithm.close();
    }

    // Encode one text as a hashed feature vector with an intercept term
    private static Vector encodeFeatureVector(String text) {
        encoder.addText(text.toLowerCase());
        //System.out.println(encoder.asString(text));
        Vector v = new RandomAccessSparseVector(FEATURES);
        bias.addToVector((byte[]) null, 1, v);
        encoder.flush(1, v);
        return v;
    }
}
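
For comparison, here is a minimal sketch of the "train from in-memory strings, then use the model" flow in plain Java, with no Mahout, Hadoop or file I/O. It is not Mahout's API: the class InMemoryTextClassifier and all of its methods are hypothetical names made up for illustration, and it swaps in a simple hashed bag-of-words encoding with a hand-written online logistic regression (fixed learning rate, no regularization) in place of TextValueEncoder and AdaptiveLogisticRegression. The learning rate and epoch count are arbitrary assumptions.

```java
// Hypothetical, dependency-free sketch; NOT the Mahout API.
class InMemoryTextClassifier {
    static final int FEATURES = 10000; // same vector size as in the post
    private final double[] weights = new double[FEATURES];

    // Hash each lower-cased token into a slot; slot 0 is reserved for the bias term.
    static double[] encode(String text) {
        double[] v = new double[FEATURES];
        v[0] = 1.0;
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            int slot = 1 + Math.floorMod(token.hashCode(), FEATURES - 1);
            v[slot] += 1.0;
        }
        return v;
    }

    private double score(double[] v) {
        double s = 0.0;
        for (int i = 0; i < FEATURES; i++) {
            s += weights[i] * v[i];
        }
        return s;
    }

    // Probability that the text belongs to the "good" class.
    double probabilityGood(String text) {
        return 1.0 / (1.0 + Math.exp(-score(encode(text))));
    }

    // One stochastic gradient step on a single example (label 1 = good, 0 = bad).
    void train(int label, String text, double learningRate) {
        double[] v = encode(text);
        double error = label - 1.0 / (1.0 + Math.exp(-score(v)));
        for (int i = 0; i < FEATURES; i++) {
            if (v[i] != 0.0) {
                weights[i] += learningRate * error * v[i];
            }
        }
    }

    // Train on the six example sentences from the post, held in memory.
    static InMemoryTextClassifier trainOnSamples() {
        String[][] samples = {
            {"good", "I love this product, the screen is a pleasure to work with and is a great choice for any business"},
            {"good", "What a product!! Really amazing clarity and works pretty well"},
            {"good", "This product has good battery life and is a little bit heavy but I like it"},
            {"bad", "I am really bored with the same UI, this is their 5th version(or fourth or sixth, who knows) and it looks just like the first one"},
            {"bad", "The phone is bulky and useless"},
            {"bad", "I wish i had never bought this laptop. It died in the first year and now i am not able to return it"},
        };
        InMemoryTextClassifier model = new InMemoryTextClassifier();
        for (int epoch = 0; epoch < 200; epoch++) {   // several passes, since the data set is tiny
            for (String[] s : samples) {
                model.train("good".equals(s[0]) ? 1 : 0, s[1], 0.1);
            }
        }
        return model;
    }

    public static void main(String[] args) {
        InMemoryTextClassifier model = trainOnSamples();
        String unseen = "The screen is a pleasure to work with";
        System.out.printf("P(good) = %.3f%n", model.probabilityGood(unseen));
    }
}
```

The point of the sketch is only the shape of the loop: encode each string to a fixed-size vector, call train(label, vector) repeatedly, then score new text with the learned weights. With six sentences the resulting model is a toy and says nothing about how Mahout's learners behave on the same data.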
 
 
Sapankumar Parikh
 Product Development
 
eClinicalWorks
2 Technology Drive | Westborough, MA 01581
T: 5084750450 x 17269
sapan.parikh@eclinicalworks.com | www.eclinicalworks.com
70,000+ physicians | 220,000+ providers | 410,000+ users | 23,000+ practices
Voted Most Interesting Vendor in 2010 by Healthcare Informatics | Top-rated vendor by IDC Health Insights | Seven Davies Award Winners – eCW Customers | Named in Inc. 5000 list 2007 - 2012 

This transmission contains confidential information belonging to the sender that is legally privileged and proprietary and may be subject to protection under the law, including the Health Insurance Portability and Accountability Act (HIPAA). If you are not the intended recipient of this e-mail, you are prohibited from sharing, copying, or otherwise using or disclosing its contents. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and permanently delete this e-mail and any attachments without reading, forwarding, or saving them. Thank you.

RE: Simple Mahout classification example

Posted by sa...@eclinicalworks.com.
Oops, I messed up the formatting in my first Mahout mailing list post!
Please see the properly formatted question here:
http://stackoverflow.com/questions/14998250/simple-mahout-classification-example/14998762#14998762
 

