Posted to dev@mahout.apache.org by "Robin Anil (JIRA)" <ji...@apache.org> on 2009/06/23 09:24:07 UTC

[jira] Updated: (MAHOUT-124) Online Classification using HBase

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-June-23.patch

Added HBase Support for CBayes Classifier

The HBase data model is

{noformat} 
{
   "feature1":
    {
              "info:label1":  "score1",
              "info:label2":  "score2",
              "info:Sigma_j":  "sum" //sum of weights
    },
    "feature2":
    {
              "info:label1":  "score1",
              "info:Sigma_j":  "sum" //sum of weights
    },
    "feature3":
    {
              "info:label2":  "score2",
              "info:Sigma_j":  "sum" //sum of weights
    }
}
{noformat} 
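
To make the layout concrete, here is a minimal in-memory sketch of the same shape: one row per feature, one "info:<label>" cell per label that has a weight, plus an "info:Sigma_j" cell. It is not the code from the patch and does not touch the HBase API. The class and method names are hypothetical, storing scores as doubles is an assumption (the layout above shows them as strings), and it assumes Sigma_j is the sum of a feature's weights over all labels.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the table layout: row key = feature, column = "info:<label>". */
final class ModelLayoutSketch {
    private static final String SIGMA_J = "info:Sigma_j";

    // feature -> (column -> value). Sorted maps are used only to keep iteration order stable.
    private final Map<String, Map<String, Double>> rows = new TreeMap<>();

    /** Stores a weight for (feature, label) and keeps the row's Sigma_j cell in sync. */
    void put(String feature, String label, double weight) {
        Map<String, Double> row = rows.computeIfAbsent(feature, k -> new TreeMap<>());
        Double previous = row.put("info:" + label, weight);
        // Assumption: Sigma_j is the sum of this feature's weights over all labels.
        double delta = weight - (previous == null ? 0.0 : previous);
        row.merge(SIGMA_J, delta, Double::sum);
    }

    /** Returns the stored weight, or null when the cell is empty. */
    Double get(String feature, String label) {
        Map<String, Double> row = rows.get(feature);
        return row == null ? null : row.get("info:" + label);
    }

    /** Returns the Sigma_j cell for a feature, or 0.0 if the feature has no row. */
    double sigmaJ(String feature) {
        Map<String, Double> row = rows.get(feature);
        return row == null ? 0.0 : row.getOrDefault(SIGMA_J, 0.0);
    }

    public static void main(String[] args) {
        ModelLayoutSketch model = new ModelLayoutSketch();
        model.put("feature1", "label1", 1.5);
        model.put("feature1", "label2", 2.0);
        model.put("feature2", "label1", 0.5);

        System.out.println(model.get("feature1", "label2")); // 2.0
        System.out.println(model.get("feature2", "label2")); // null (empty cell)
        System.out.println(model.sigmaJ("feature1"));        // 3.5
    }
}
```

The point of the sketch is that the table is sparse: a feature row only carries cells for the labels it was seen with, so a lookup for any other label comes back empty.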

Here are some links to get you started on HBase:
{noformat} 
  http://wiki.apache.org/hadoop/Hbase/MapReduce
  http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
  http://jimbojw.com/wiki/index.php?title=Understanding_HBase_column-family_performance_options
{noformat} 

* I have disabled the get-enwiki Maven task, which ran by default while compiling the examples. It should be kept as an option, not as the default. I don't like downloading 2.4 GB just to run Mahout.

* Put the hbase-0.19.3 JAR file in the core/lib directory. The Maven build files pick it up from there. (Thanks, Isabel, for pointing this out to me.)

* I have also commented out a line in the Ant jar task that was causing the mahout-example job file to double in size, because multiple copies of the dependent JAR files were getting zipped up. This was causing the MapReduce jobs to take a couple of seconds longer to start.

* This patch breaks the Bayes code: an HBase server is necessary to run the Bayes examples if you apply this patch. It uses HBase to get the sparse weight matrix, while loading the label sums from HDFS as it was doing previously.
        More work is needed to refactor the Bayes code so that users can independently use either HDFS or HBase to store the classification model.

* Added meaningful job names and reporter status to monitor the job.

* Removed the map task number setting from the code. It now uses the default number of map tasks as specified in your Hadoop conf.

* HBase inserts take place at around 4000/s on a 2.4 GHz Core 2 Duo, in a 1 GB VMware VM running Ubuntu Karmic Koala.

* The HBase reads are very slow at the moment, at around 150/s. I had enabled inMemory and BloomFilters on both the HBase table and the column. More investigation is needed to improve the speed. It seems more time is spent searching for non-existent columns. When you classify a document, it goes through every feature in the document for every label. Suppose a document has 1000 words: it then takes 1000 x 20 = 20,000 lookups (in the 20 newsgroups example). A majority of these are empty cells. HBase talks about enabling bloom filters to improve efficiency, but as of 0.19.3 I believe it is not part of the code. At least, I can't perceive any benefit.
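
The read-cost arithmetic in the last item can be worked through with a short sketch. It only restates the numbers given above (1000 features, 20 labels, about 150 reads/s) and counts empty-cell lookups against a tiny made-up sparse model. The class, its methods, and the toy data are hypothetical and are not part of the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LookupCostSketch {

    /** Cell-by-cell access: one lookup per (feature, label) pair. */
    static long cellLookups(int featuresInDocument, int labelCount) {
        return (long) featuresInDocument * labelCount;
    }

    /** Seconds needed for the given number of lookups at a fixed read rate. */
    static double secondsAt(long lookups, double readsPerSecond) {
        return lookups / readsPerSecond;
    }

    /** Counts the (feature, label) lookups that find no stored cell in a sparse model. */
    static long emptyLookups(Map<String, Set<String>> labelsByFeature,
                             List<String> documentFeatures,
                             List<String> labels) {
        long empty = 0;
        for (String feature : documentFeatures) {
            Set<String> stored = labelsByFeature.get(feature);
            for (String label : labels) {
                if (stored == null || !stored.contains(label)) {
                    empty++;
                }
            }
        }
        return empty;
    }

    /** Tiny hypothetical model: feature -> labels that have a stored weight. */
    static Map<String, Set<String>> toyModel() {
        Map<String, Set<String>> model = new HashMap<>();
        model.put("feature1", new HashSet<>(Arrays.asList("label1", "label2")));
        model.put("feature2", new HashSet<>(Arrays.asList("label1")));
        return model;
    }

    public static void main(String[] args) {
        long lookups = cellLookups(1000, 20);
        System.out.println("lookups per document: " + lookups);
        System.out.println("seconds at 150 reads/s: " + secondsAt(lookups, 150.0));

        List<String> doc = Arrays.asList("feature1", "feature2", "feature3");
        List<String> labels = Arrays.asList("label1", "label2", "label3");
        System.out.println("empty lookups in toy model: "
                + emptyLookups(toyModel(), doc, labels)
                + " of " + cellLookups(doc.size(), labels.size()));
    }
}
```

At the measured read rate, 20,000 cell lookups come to roughly 133 seconds for a single 1000-word document, which is why the empty-cell lookups dominate the cost. In the toy model, 6 of the 9 lookups hit empty cells.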

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents using a flat file model
> #       Storing the model in HBase at the end of the model building Map/Reduce stages
> #       Using the model stored in HBase, create an interface (both command line and web service) to classify a given document
> #       Using the model stored in HBase, batch classify documents stored on HDFS

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by Ted Dunning <te...@gmail.com>.
Welcome back Robin!

From the presentation at the Hadoop Summit, it seems that version 0.20 of
HBase is worlds better than 0.18 or 0.19.

On Tue, Jun 23, 2009 at 12:24 AM, Robin Anil (JIRA) <ji...@apache.org> wrote:

>
> * The HBase reads are very slow at the moment, at around 150/s. I had
> enabled inMemory and BloomFilters on both the HBase table and the column.
> More investigation is needed to improve the speed. It seems more time is
> spent searching for non-existent columns. When you classify a document, it
> goes through every feature in the document for every label. Suppose a
> document has 1000 words: it then takes 1000 x 20 = 20,000 lookups (in the
> 20 newsgroups example). A majority of these are empty cells. HBase talks
> about enabling bloom filters to improve efficiency, but as of 0.19.3 I
> believe it is not part of the code. At least, I can't perceive any benefit.


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)