Posted to dev@mahout.apache.org by "Robin Anil (JIRA)" <ji...@apache.org> on 2009/05/27 17:20:46 UTC

[jira] Created: (MAHOUT-124) Online Classification using HBase

Online Classification using HBase
---------------------------------

                 Key: MAHOUT-124
                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
             Project: Mahout
          Issue Type: New Feature
          Components: Classification
    Affects Versions: 0.2
            Reporter: Robin Anil


#       Batch classification of flat file documents and flat file model:
#       Storing the model in HBase at the end of the model-building Map/Reduce stages
#       Using the model stored in HBase, create an interface (both command line and web service) to classify a given document
#       Using the model stored in HBase, batch classify documents stored on HDFS

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by Robin Anil <ro...@gmail.com>.

Please use the trunk version; there are slight changes from the alpha. Also
increase the ZooKeeper client connection limit from 10 (the default) to 100.
The HBase developers are trying to track down a bug that prevents TCP
connections to ZooKeeper from closing, causing them to linger like zombies.

Re: [jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Edward J. Yoon" <ed...@apache.org>.

See here : http://people.apache.org/~stack/hbase-0.20.0-alpha/

On Mon, Jul 6, 2009 at 4:11 PM, zhao zhendong<zh...@gmail.com> wrote:
> Hi,
>
> How can I download the Hbase 0.20.jar?
>
> Cheers,
> Zhendong



-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org

Re: [jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by Ted Dunning <te...@gmail.com>.

I suspect that there is no such animal and that you need to build the jar
from the trunk or 0.20 branch.

On Mon, Jul 6, 2009 at 12:11 AM, zhao zhendong <zh...@gmail.com> wrote:

> How can I download the Hbase 0.20.jar?

Re: [jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by zhao zhendong <zh...@gmail.com>.

Hi,

How can I download the Hbase 0.20.jar?

Cheers,
Zhendong

On Mon, Jul 6, 2009 at 5:21 AM, Robin Anil (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Robin Anil updated MAHOUT-124:
> ------------------------------
>
>     Attachment: MAHOUT-124-July-6.patch
>
> * Added command line option dataSource to choose between hdfs or hbase+hdfs
> model storage
> * Replaced command line option -p (path) with -m (model location), which
> takes either a sequence file or an HBase table depending on the above
> * First-level refactor. Build just the HBase model or just the sequence
> file model. Further revisions will streamline the code to remove
> redundancies.
> * HBase gets are encapsulated in a hybrid cache (LFU + LRU)


-- 
Zhen-Dong Zhao (Maxim)
Department of Computer Science
School of Computing
National University of Singapore
Homepage: http://zhaozhendong.googlepages.com
Mail: zhaozhendong@gmail.com

Re: [jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by Ted Dunning <te...@gmail.com>.

Welcome back Robin!

From the presentation at the Hadoop Summit, it seems that version 0.20 of
HBase is worlds better than 0.18 or 0.19.

On Tue, Jun 23, 2009 at 12:24 AM, Robin Anil (JIRA) <ji...@apache.org> wrote:

>
> * The HBase reads are very slow at the moment, at around 150/s. I had
> enabled inMemory and BloomFilters on both the HBase table and column. More
> investigation is needed to improve the speed. It seems more time is spent
> searching for non-existent columns. When you classify a document, it looks
> up every feature in the document for every label. Suppose a document has
> 1000 words; then it takes 1000 x 20 = 20,000 lookups (in the 20 newsgroups
> example). A majority of these are empty cells. HBase talks about enabling
> bloom filters to improve efficiency, but as of 0.19.3 I believe it is not
> part of the code. At least I can't perceive any benefit.


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726211#action_12726211 ] 

Ted Dunning commented on MAHOUT-124:
------------------------------------

.bq The numbers speak for themselves. I think i will stick to LinkedHashMap for now

They do indeed.  Simple wins again.

Nice work.  Finding out the simpler solution is better is always pleasant, but many avoid doing the tests to check it.


[jira] Issue Comment Edited: (MAHOUT-124) Online Classification using HBase

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726211#action_12726211 ] 

Ted Dunning edited comment on MAHOUT-124 at 7/1/09 1:21 PM:
------------------------------------------------------------

bq. The numbers speak for themselves. I think i will stick to LinkedHashMap for now

They do indeed.  Simple wins again.

Nice work.  Finding out the simpler solution is better is always pleasant, but many avoid doing the tests to check it.


[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-July-23.patch

Fixed the code that broke after the Checkstyle update. Tests pass. Checkstyle will still throw warnings.


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733069#action_12733069 ] 

Isabel Drost commented on MAHOUT-124:
-------------------------------------

*ThetaNormalizerReducer, *BayesTFIDFReducer and *BayesSummerReducer still have dependencies on HBase - I think one can factor them out.

Interface "Algorithm" - I think it might make sense to initialise the Algorithm with a reference to the datastore instead of injecting that reference with every method call. Other than that: Looks good. Bayes and CBayes look a lot cleaner now.
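
To illustrate the suggestion about Algorithm above, here is a minimal sketch of handing the datastore to the algorithm once, at construction time. The names Datastore, MapDatastore and SummingAlgorithm are hypothetical stand-ins and not the classes from the patch, and the "algorithm" merely sums weights rather than doing any Bayes computation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the patch's datastore interface.
interface Datastore {
  double getWeight(String label, String feature);
}

// Hypothetical in-memory store, used only for this illustration.
class MapDatastore implements Datastore {
  private final Map<String, Double> weights = new HashMap<>();

  void put(String label, String feature, double weight) {
    weights.put(label + "\t" + feature, weight);
  }

  @Override
  public double getWeight(String label, String feature) {
    // Missing cells count as weight 0.
    return weights.getOrDefault(label + "\t" + feature, 0.0);
  }
}

// The algorithm receives its datastore once, in the constructor,
// so none of its methods needs a Datastore parameter.
class SummingAlgorithm {
  private final Datastore datastore;

  SummingAlgorithm(Datastore datastore) {
    this.datastore = datastore;
  }

  // Sums the stored weights of all features for one label.
  double documentWeight(String label, String[] features) {
    double sum = 0.0;
    for (String feature : features) {
      sum += datastore.getWeight(label, feature);
    }
    return sum;
  }
}
```

With this shape a caller writes new SummingAlgorithm(store) once and then calls documentWeight(label, features) without passing the store again.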

Interface Datastore looks good. I like the separation of data handling and actual algorithm implementation.

I would move Pair over to the utils package.

Good work Robin.



[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728311#action_12728311 ] 

Isabel Drost commented on MAHOUT-124:
-------------------------------------

Some initial comments on the patch:

org/apache/mahout/utils/Cache.java - I am missing some documentation for the methods. For interfaces, you can omit the public modifier on methods. For classes implementing this interface, you might want to at least use @inheritDoc to link back to the original documentation. Please also note in the class comment whether your implementation is safe to use in a multi-threaded context or not.
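
As a rough sketch of the documentation conventions described above (the method names get, set and size and the class HashMapCache are assumptions for illustration, not the contents of the patch's Cache.java):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * A minimal key/value cache. Hypothetical shape, for illustration only.
 */
interface Cache<K, V> {

  /**
   * Looks up a cached value.
   *
   * @param key the key to look up
   * @return the cached value, or {@code null} if the key is not cached
   */
  V get(K key); // no "public" needed: interface methods are implicitly public

  /**
   * Caches a value, replacing any previous value stored under the same key.
   *
   * @param key the key to store under
   * @param value the value to cache
   */
  void set(K key, V value);

  /**
   * @return the number of entries currently cached
   */
  int size();
}

/**
 * Unbounded cache backed by a {@link HashMap}.
 *
 * <p>Thread safety: NOT safe for concurrent use. Callers that share an
 * instance between threads must synchronize externally.
 */
class HashMapCache<K, V> implements Cache<K, V> {
  private final Map<K, V> entries = new HashMap<>();

  /** {@inheritDoc} */
  @Override
  public V get(K key) {
    return entries.get(key);
  }

  /** {@inheritDoc} */
  @Override
  public void set(K key, V value) {
    entries.put(key, value);
  }

  /** {@inheritDoc} */
  @Override
  public int size() {
    return entries.size();
  }
}
```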

org.apache.mahout.common.Model - To me it looks a bit weird to add a dependency to HBase directly to the model. I would prefer the HBase implementation to be less tightly coupled with the core code. Currently it looks like the model is really doing two tasks at once: Implementing an in-memory-model as well as an HBase model. I think it should be possible to refactor the code such that the two can be separated into distinct classes that can then be used interchangeably. My first guess would be that the strategy pattern should be helpful with this task. 

You probably will have to refactor CBayesModel and BayesModel as well. The same applies to org/apache/mahout/classifier/Classify.java and CBayesModel, Model, BayesTfIdfDriver, BayesTfIDFReducer, BayesWeightSummerReducer.

org.apache.mahout.classifier.cbase - I really like your additions for reporting progress back to Hadoop. I would suggest splitting these from the patch, opening a separate issue and attaching the changes there. This would keep this patch more focussed on the original task of adding HBase support.

org.apache.mahout.classifier.cbase.CBayesModel - Please remove the code you commented out if you do not need it anymore. In case of catching an IOException you should at least write some warning log message (e.g. line 60). 
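
A small sketch of the logging point above: if an IOException is caught and swallowed, at least leave a warning behind. WeightSource and SafeReader are hypothetical names, and java.util.logging is used here only to keep the example self-contained; the project's own logging framework may differ.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical source of weights that may fail with an I/O error.
interface WeightSource {
  double read(String key) throws IOException;
}

class SafeReader {
  private static final Logger LOG = Logger.getLogger(SafeReader.class.getName());

  private final WeightSource source;

  SafeReader(WeightSource source) {
    this.source = source;
  }

  // Returns the stored weight, or the fallback if the read fails.
  // The failure is logged as a warning instead of being silently dropped.
  double readOrDefault(String key, double fallback) {
    try {
      return source.read(key);
    } catch (IOException e) {
      LOG.log(Level.WARNING,
          "Could not read weight for key '" + key + "', using " + fallback, e);
      return fallback;
    }
  }
}
```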


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742227#action_12742227 ] 

Isabel Drost commented on MAHOUT-124:
-------------------------------------

> Ant config was done to decrease the job jar file size. See first comment in this issue point No:3

Ah, thanks for the reminder...

> I need the new Eclipse Code formatter for that purpose. I am still using the lucene code formatter, which is causing this break.

Ok, I see. I guess that should be no show-stopper for the code to get in.

> Docs... already on it!
> Removed all hard coded map/Reduce task number limit from code. Will conform to the cluster its being run on.

Great!

> Map/Reduce jobs doesnt do much leg work that it confuses reading the code, I could factor them out as well if needed.

I think we could leave that open for a later patch.

> TODO: Algorithm will keep datastore internally.
> TODO: add jar from latest trunk of HBase

You could probably add a JIRA task to upgrade HBase to the official release as soon as that is out. Just so we do not forget that task. Other than that, to me it looks like this code could go in by the end of this week. If anyone else would like to have a look over the code before and needs more time, please do tell.


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726203#action_12726203 ] 

Robin Anil commented on MAHOUT-124:
-----------------------------------

Implemented hybrid caching with LFU and LRU.
Found that EHCache is slower than HashMap: an in-memory get takes around 0.0018 s on average, which I thought was very poor.

I did some testing in which I check EHCache second if my LRU returns nothing, and insert into EHCache only until it has evicted an amount equal to its capacity, as given below.

{noformat}

EHcache LFU Capacity = 20000 
LinkedHashMap LRU Capacity  = 100000
nCalls = 18828;
sumTime = 1213.468s;
minTime = 2.67E-4ms;
maxTime = 67123.484ms;
meanTime = 64.45018ms;
stdDevTime = 650.7636ms;

EHcache LFU Capacity = 5000 
LinkedHashMap LRU Capacity  = 100000
nCalls = 18828;
sumTime = 622.92816s;
minTime = 1.6E-4ms;
maxTime = 20972.49ms;
meanTime = 33.0852ms;
stdDevTime = 206.95724ms;

EHcache LFU Capacity = 0
LinkedHashMap LRU Capacity  = 100000
nCalls = 18828;
sumTime = 353.0143s;
minTime = 0.0ms;
maxTime = 9331.663ms;
meanTime = 18.749432ms;
stdDevTime = 99.65778ms;

{noformat}


The numbers speak for themselves. I think I will stick to LinkedHashMap for now.


[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-August-26.patch

Refactored out the jobs in bayes.mapreduce.*
Cleaned up HTable objects at the end of map or reduce.

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Isabel Drost
>             Fix For: 0.2
>
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-August-26.patch, MAHOUT-124-August17.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase at the end of the model-building Map/Reduce stages
> #       Using the model stored in HBase, create an interface (both command line and web service) to classify a given document
> #       Using the model stored in HBase, batch classify documents stored on HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742210#action_12742210 ] 

Robin Anil commented on MAHOUT-124:
-----------------------------------

* Throw out commented code... Done
* Ant config was done to decrease the job jar file size. See point no. 3 of the first comment in this issue.
* I need the new Eclipse code formatter for that purpose. I am still using the Lucene code formatter, which is causing this break.
* Docs... already on it!
* The Map/Reduce jobs don't do much legwork, to the point that it confuses reading the code; I could factor them out as well if needed.
* Removed all hard-coded Map/Reduce task number limits from the code. It will conform to the cluster it is being run on.
* TODO: Algorithm will keep datastore internally.
* TODO: add jar from latest trunk of HBase




[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-August17.patch


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734219#action_12734219 ] 

Grant Ingersoll commented on MAHOUT-124:
----------------------------------------

FYI, patch no longer applies.


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724159#action_12724159 ] 

Otis Gospodnetic commented on MAHOUT-124:
-----------------------------------------

Just read about HBase 0.20 the other day - over an order of magnitude performance improvement, and random reads approaching the speed of an RDBMS, is what the presentation I saw claimed.



[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733075#action_12733075 ] 

Isabel Drost commented on MAHOUT-124:
-------------------------------------

Just forgot two final notes:

You should update your svn checkout. The patch was made against an old revision of trunk and no longer applies cleanly.

The patch was broken - line 988 in the patch file has a broken directive: 

@@ -48,67 +54,107 @@

should really be

@@ -48,67 +54,105 @@

the effect being that "patch" assumes a hunk length of 107 lines which makes it fail. Your hunk is only 105 lines, so better not lie to "patch" :) However, that one was trivial to fix.
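
For readers unfamiliar with the hunk header: in "@@ -a,b +c,d @@", b is the number of lines the hunk covers in the old file and d the number in the new file, and a count that is left out means 1. The following is a rough sketch of a consistency check along those lines; HunkCheck is a hypothetical helper written for this note, not part of any patch tool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HunkCheck {
  // Matches "@@ -start[,count] +start[,count] @@" with optional trailing text.
  private static final Pattern HEADER =
      Pattern.compile("^@@ -(\\d+)(?:,(\\d+))? \\+(\\d+)(?:,(\\d+))? @@.*$");

  // Returns {oldCount, newCount} as declared in the hunk header.
  static int[] declaredCounts(String header) {
    Matcher m = HEADER.matcher(header);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a hunk header: " + header);
    }
    int oldCount = m.group(2) == null ? 1 : Integer.parseInt(m.group(2));
    int newCount = m.group(4) == null ? 1 : Integer.parseInt(m.group(4));
    return new int[] {oldCount, newCount};
  }

  // Returns {oldCount, newCount} as actually present in the hunk body.
  static int[] actualCounts(List<String> body) {
    int oldCount = 0;
    int newCount = 0;
    for (String line : body) {
      if (line.startsWith("\\")) {
        continue; // "\ No newline at end of file" marker, not a content line
      }
      char kind = line.isEmpty() ? ' ' : line.charAt(0);
      if (kind == '-') {
        oldCount++;        // removed line: only in the old file
      } else if (kind == '+') {
        newCount++;        // added line: only in the new file
      } else {
        oldCount++;        // context line: present in both files
        newCount++;
      }
    }
    return new int[] {oldCount, newCount};
  }

  static boolean isConsistent(String header, List<String> body) {
    return Arrays.equals(declaredCounts(header), actualCounts(body));
  }
}
```

Running isConsistent on the header and body of each hunk before uploading a hand-edited patch would flag a mismatch like the 107 versus 105 one described above.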

(Thanks to Thilo Fromm for helping me fix and explain that.)

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-July-13.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728971#action_12728971 ] 

Robin Anil commented on MAHOUT-124:
-----------------------------------

Refactored the code. Removed Model, BayesModel, CBayesModel, Classifier, BayesClassifier and CBayesClassifier.

Now there are 4 classes: BayesAlgorithm, CBayesAlgorithm, InMemoryBayesDatastore and HBaseBayesDataStore.

Initialize a Classifier as

new ClassifierContext(new BayesAlgorithm, new HBaseBayesDataStore)

The Algorithm interface assumes that a Datastore is a collection of matrices and vectors, each accessed by its name (a String) plus the row/column or index of the cell.
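
To make the idea concrete, here is a rough sketch of a datastore seen as named matrices and vectors. All names in it (WeightStore, InMemoryWeightStore, SumScorer, the "weight" matrix) are hypothetical and chosen for illustration; the real interfaces in the patch may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical datastore view: named matrices and vectors of weights. */
interface WeightStore {
    /** Cell of a named matrix, addressed by row and column. */
    double getWeight(String matrixName, String row, String column);

    /** Cell of a named vector, addressed by index. */
    double getWeight(String vectorName, String index);
}

/** In-memory implementation backed by nested hash maps; missing cells read as 0.0. */
final class InMemoryWeightStore implements WeightStore {
    private final Map<String, Map<String, Map<String, Double>>> matrices = new HashMap<>();
    private final Map<String, Map<String, Double>> vectors = new HashMap<>();

    void putMatrixCell(String matrixName, String row, String column, double value) {
        matrices.computeIfAbsent(matrixName, k -> new HashMap<>())
                .computeIfAbsent(row, k -> new HashMap<>())
                .put(column, value);
    }

    void putVectorCell(String vectorName, String index, double value) {
        vectors.computeIfAbsent(vectorName, k -> new HashMap<>()).put(index, value);
    }

    @Override
    public double getWeight(String matrixName, String row, String column) {
        Map<String, Map<String, Double>> matrix = matrices.get(matrixName);
        if (matrix == null) {
            return 0.0;
        }
        Map<String, Double> cells = matrix.get(row);
        return cells == null ? 0.0 : cells.getOrDefault(column, 0.0);
    }

    @Override
    public double getWeight(String vectorName, String index) {
        Map<String, Double> vector = vectors.get(vectorName);
        return vector == null ? 0.0 : vector.getOrDefault(index, 0.0);
    }
}

/** Hypothetical algorithm that only talks to the store through the interface. */
final class SumScorer {
    /** Sums the stored weights of the given features for one label. */
    static double score(WeightStore store, String label, Iterable<String> features) {
        double sum = 0.0;
        for (String feature : features) {
            sum += store.getWeight("weight", feature, label);
        }
        return sum;
    }
}
```

Because the scorer depends only on the interface, an HBase-backed implementation could be swapped in without touching the algorithm code.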

The existing tests have therefore become useless, so I am writing new tests now.

Have tried to clean up the code wherever possible.

In reply to stack's comments (in order):

Removed all unused imports.

HbayesModelReader is removed; the refactor above made it useless.

LRUCache: the class just uses LinkedHashMap in the backend. I had tried EHCache (LFU as well as LRU) earlier (see above). It was slower than HashMaps.

In the BayesTfIdfReducer, I have to write either to the filesystem or to HBase depending on the configuration. Will TableOutputFormat be faster in terms of HBase inserts? If so, I might create another set of Map/Reduces specifically for HBase instead of using the same class.
HTable is used only if the output is set to hbase, not hdfs.

About setInMemory and LZO support: these were done specifically for 20 newsgroups and for my test setup. I will remove them in the final patch.
About Bloom filters: are they still not implemented?

Yeah, sure thing, I will replace those with an hba.tableExists(output) check.
About compaction: is it just this, hba.compact(".META.")?

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726206#action_12726206 ] 

Robin Anil commented on MAHOUT-124:
-----------------------------------

The numbers above are per-document classification times, i.e. the time spent in Classifier.classify().

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-August-2.patch

Added parallel classification from both HDFS and HBase.

The usage is given below.

{noformat}
hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.classifier.bayes.TestClassifier -m model -d <INPUT> -ng 1 -type cbayes -source <hdfs|hbase> -method <sequential|mapreduce>
{noformat}

The patch is against the latest trunk. I hope this gets committed soon. It has become too difficult to manage files across MAHOUT-124 and the FPGrowth algo.

I am waiting for this commit before attempting to change all Text Writables to Vector.

Package summary will be added as a part of another issue

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744155#action_12744155 ] 

Robin Anil commented on MAHOUT-124:
-----------------------------------

Ran FindBugs through the code.  Everything looks fine.

The in-memory Datastore reads the whole model from HDFS into memory. Had the model been read directly from HDFS, we could have called it a Datastore. Maybe a two-level (memory + HDFS) storage could be called an HDFS datastore in the future. Does that sound sane?

Could you try this new patch? Also try with 0.20 RC1 of HBase: http://people.apache.org/~stack/hbase-0.20.0-candidate-1/

Scaling tests need to be done on Amazon EC2. 

Well GSOC ends today, but mahout-ing continues.


> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Isabel Drost
>             Fix For: 0.2
>
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-August17.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729039#action_12729039 ] 

stack commented on MAHOUT-124:
------------------------------

@Robin

On "tried EHCache", pardon me.  I just read up in issue and see you tried it.  Pardon me.

.bq Will TableOutputFormat be faster interms of Hbase Inserts

The new mapreduce package hbase classes in TRUNK have been all rejiggered.  Should be better but IIUC, you are running all on your laptop, virtual machines too?  If so, you probably won't notice any difference.

.bq ....bloomfilters...

Not implemented.  Current thinking is that there is too little benefit for the amount of RAM consumed.  Related: in-memory was just implemented, so update and you should catch the benefits.

.bq ....about compaction....

You want majorCompact, not compact.  Here is API from 0.19: http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/client/HBaseAdmin.html#majorCompact(byte[])

Good stuff Robin





> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744078#action_12744078 ] 

Grant Ingersoll commented on MAHOUT-124:
----------------------------------------

A few comments after a quick scan:

1. DataStore should likely be an abstract class.  Probably true for Algorithm, too.  It's generally easier to support back compatibility that way, although maybe we don't need to worry about it so much yet.

2. I'm confused as to why HDFS/Filesystem isn't modeled as a DataStore

3. I'm getting build errors trying to find hbase.

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Isabel Drost
>             Fix For: 0.2
>
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742031#action_12742031 ] 

Isabel Drost commented on MAHOUT-124:
-------------------------------------


Altogether really nice changes. The patch now applies to trunk without problems and builds (except for the missing hbase dependency). As this will be one of the last reviews, I tried to be a little more picky, also about minor changes like added System.out.println calls and missing documentation...

The ant config file (build.xml) contains changes that I see nowhere explained. Are they supposed to remain for the final patch?

In the examples concerning the TestClassifier
  - it has imports for java.io.* and java.util.* - for the final patch could you please revert those to the specific imports?
  - could you please try to avoid reformatting the code as much as possible? It makes reading patches a whole lot easier.
  - in line 129 there is quite a bit of code commented out - better throw it out entirely? If needed later, the snippet is still in JIRA.
  - line 224 - have the timing statistics been left in intentionally?

utils/nlp/NGrams
  - The class is missing documentation. I guess your intention was to generate nGrams from a line of text, not the whole document? Otherwise holding document and nGrams both in memory seems a little bit much. There also seems to be no unit test for it?

The classes implementing the caching algorithms are missing documentation. At least some /** {@inheritDoc} */ and a short comment on top that explains the purpose of the implementation would be nice. (The same applies to Pair and Parameters.)

CBayesNormalizerReducer still has HBase Dependencies - is it possible to factor them out?

BayesThetaNormalizerDriver - setting the number of map tasks was commented away compared with trunk. Intentional?

BayesClassifierMapper - lines 106, 110 and following: Shouldn't the log message be something like "Using ..." instead of "Testing ..."?

classifier/bayes/interfaces/algorithm/Algorithm - you still give a pointer to the datastore with every method call to the Algorithm. Wouldn't the interface look cleaner if the Algorithm would hold a reference to an initialized datastore and use that for further requests? I don't think it is very likely that users will go to HBase for the first document to classify and to an InMemoryStore for the next document.

bayes/algorithm/CBayesAlgorithm, BayesAlgorithm, bayes/common/ClassifierPriorityQueue - are missing some basic javadoc.

BayesTfIdfDriver, BayesTfIdfReducer, BayesWeightSummerReducer - I assume the dependency on HBase cannot be factored out?

BayesFeatureMapper - there is a System.out.println in there...

One last question: You reference hbase-0.20.0 which is not released yet. I guess we should include a prebuilt version in our lib directory and ship that until hbase has an official release to use?

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Isabel Drost
>             Fix For: 0.2
>
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Fix Version/s: 0.2

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>             Fix For: 0.2
>
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Assigned: (MAHOUT-124) Online Classification using HBase

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Isabel Drost reassigned MAHOUT-124:
-----------------------------------

    Assignee: Isabel Drost

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Isabel Drost
>             Fix For: 0.2
>
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-July-6.patch

* Added command line option dataSource to choose between hdfs and hbase+hdfs model storage
* Replaced command line option -p (path) with -m (model location), which takes either a sequence file or an HBase table depending on the above
* First-level refactor: build just the HBase model or just the sequence file model. Further revisions will streamline the code to remove redundancies.
* HBase gets are encapsulated in a hybrid cache (LFU + LRU)


> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725258#action_12725258 ] 

Robin Anil commented on MAHOUT-124:
-----------------------------------

I played around with HBase-0.20.0. It was fast, but not fast enough.

I had trouble running HBase-0.20.0-alpha. The reducers were crashing because ZooKeeper connections kept crossing the limit (default 10). I had to manually kill some network connections to ZooKeeper to finish the program. I tried overriding the Closeable.close() method in the Reducer to close the table object, but things were still problematic. I have not received a reply about it on IRC. I will post this to the HBase list soon. With some luck I finished the map/reduces and loaded the model into HBase.

Here are the highlights:

My model had 196K rows with 20+1 columns of 8 bytes each for the 20-newsgroups data; that means 180 byte records, with LZO compression enabled. I will post a blog entry soon which details how to enable LZO for version 0.20.0.

* Inserts were in around the same range: 4K cells/s (feature, label => value)
* Read speed had definitely improved. I got something like 800 cells/s, up from the 150/s I had earlier. That translates to 40 rows/s == 25 ms/row, which is still far slower than the 1.42 ms which I believe (correct me) is for a single 1000 byte record in the following slides: http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBasePresentations/attachments/HBase_Goes_Realtime.pdf
* On a suggestion by Isabel I used an LRU cache based on LinkedHashMap. I got a 98%+ cache hit rate, but ran into trouble when it tried to classify a spam document which had around 30K junk words.
* So I increased the cache to 100K rows. The entire 20-newsgroups set (24 MB) was classified in just above 5 mins, compared to 1 min load time + 1 min classification time for the non-HBase version (in-memory HashMap), with a 99.2% hit rate and 200859 total HBase lookups, which is around 5K above the 196K lookups actually necessary. Maybe a better caching mechanism can take care of this.
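
A minimal sketch of an LRU cache built on LinkedHashMap with a hit-rate counter, along the lines described in the list above. The class name CountingLruCache and its methods are invented for illustration; this is not the code from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU cache: LinkedHashMap in access order, plus hit/miss counters. */
final class CountingLruCache<K, V> {
    private final Map<K, V> map;
    private long hits;
    private long misses;

    CountingLruCache(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently accessed
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each insertion; evict the least recently used entry when over capacity
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value or null, and records a hit or a miss. */
    V get(K key) {
        V value = map.get(key);
        if (value == null) {
            misses++;
        } else {
            hits++;
        }
        return value;
    }

    void put(K key, V value) {
        map.put(key, value);
    }

    int size() {
        return map.size();
    }

    /** Percentage of get() calls that were answered from the cache. */
    double hitRatePercent() {
        long total = hits + misses;
        return total == 0 ? 0.0 : 100.0 * hits / total;
    }
}
```

On a miss the caller would fetch the row from HBase and put() it into the cache, so the number of misses equals the number of HBase lookups.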


Looking ahead

* Try to fix the ZooKeeper bug or close the HTable properly.
* Implement a good caching mechanism.  Currently I have zeroed in on EHCache based on this performance test 
{noformat}
                      http://javalandscape.blogspot.com/2009/01/cachingcaching-algorithms-and-caching.html 
                      http://javalandscape.blogspot.com/2009/05/intro-to-cachingcaching-algorithms-and.html
{noformat}
Currently classifying a single document with 1000 features from a cold start takes around 10-20 seconds. Once the cache is full the lookups should be rather fast.
Maybe I could try LFU instead of LRU. I will post the results once I get a go-ahead from the Mahout community for EHCache or something equivalent. EHCache is also available through Maven.
* I have read that ARC (http://en.wikipedia.org/wiki/Adaptive_Replacement_Cache) improves over LFU and LRU. If time permits, I may be able to code it up (EHCache currently doesn't support ARC), subject to what the Mahout developers suggest.
* I need a way to let the client use a specified amount of heap, or the maximum possible, for caching. This should be better than keeping a fixed number of row entries, since we have no control over the number of columns (it varies with the data).
* Next step: refactor the code and submit a patch.

Any thoughts?

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-June-23.patch

Added HBase Support for CBayes Classifier

The HBase data model is

{noformat} 
{
   "feature1":
    {
              "info:label1":  "score1",
              "info:label2":  "score2",
              "info:Sigma_j":  "sum" //sum of weights
    },
    "feature2":
    {
              "info:label1":  "score1",
              "info:Sigma_j":  "sum" //sum of weights
    },
    "feature3":
    {
              "info:label2":  "score2",
              "info:Sigma_j":  "sum" //sum of weights
    }
}
{noformat} 
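
The layout above can be mimicked in plain Java with nested maps: one row per feature, one "info:<label>" column per label that has a score, and an "info:Sigma_j" column holding the sum. The sketch below is only an in-memory illustration. The class SparseModelSketch is hypothetical, it stores doubles rather than the string values shown above, and it assumes Sigma_j is the running sum of the label scores for that feature.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative in-memory mirror of the sparse row layout: feature -> column -> value. */
final class SparseModelSketch {
    private static final String FAMILY = "info:";
    private static final String SIGMA_J = FAMILY + "Sigma_j";

    // Row key (feature) -> column name -> value. Only non-empty cells are stored.
    private final Map<String, Map<String, Double>> rows = new HashMap<>();

    /** Stores the score for (feature, label) and updates the row's Sigma_j sum. */
    void put(String feature, String label, double score) {
        Map<String, Double> row = rows.computeIfAbsent(feature, k -> new HashMap<>());
        Double previous = row.put(FAMILY + label, score);
        double delta = previous == null ? score : score - previous;
        row.merge(SIGMA_J, delta, Double::sum);
    }

    /** Returns the cell value, or null for an empty cell or a missing row. */
    Double get(String feature, String column) {
        Map<String, Double> row = rows.get(feature);
        return row == null ? null : row.get(column);
    }

    /** Number of stored cells in a row, including Sigma_j; 0 for a missing row. */
    int cellCount(String feature) {
        Map<String, Double> row = rows.get(feature);
        return row == null ? 0 : row.size();
    }
}
```

The point of the sparse layout is that a feature seen with only one label stores two cells, not one cell per label, which is why most (feature, label) lookups during classification hit empty cells.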

Here are some links to get you started on Hbase
{noformat} 
  http://wiki.apache.org/hadoop/Hbase/MapReduce
  http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
  http://jimbojw.com/wiki/index.php?title=Understanding_HBase_column-family_performance_options
{noformat} 

* I have disabled the get-enwiki maven task, which ran by default while compiling the examples. It should be kept as an option, not as the default. I don't like downloading 2.4 gigs just to run Mahout.

* Put the hbase-0.19.3 JAR file in the core/lib directory. The maven build files pick it up from there (thanks, Isabel, for pointing this out to me).

* I have also commented out a line in the ant jar which was causing the mahout-example job file to double in size because multiple copies of dependent jar files were getting zipped up.
This was causing the map reduce jobs to take a couple of seconds extra to start.

* This patch breaks the Bayes code: an HBase server is necessary to run the Bayes examples if you apply this patch. It uses HBase to get the sparse weight matrix while loading the label sums from hdfs as it was doing previously.
        More work is needed to refactor the Bayes code so that users can independently use either hdfs or hbase to store the classification model.

* Added meaningful job names and reporter status to monitor the job.

* Removed the map task number setting from the code. It now uses the default number of map tasks as specified in your hadoop conf.

* HBase inserts take place at around 4000/s on a 2.4GHz Core 2 Duo, 1GB VMware machine running Ubuntu Karmic Koala.

* The HBase reads are very slow at the moment, at around 150/s. I had enabled inMemory and BloomFilters on both the HBase table and column. More investigation is needed to improve the speed. It seems most of the time is spent searching for non-existent columns. When you classify a document, it goes through all the features in the document for all the labels. Suppose a document has 1000 words: that takes 1000x20 lookups (in the 20 newsgroups example). A majority of these are empty cells. HBase talks about enabling bloom filters to improve efficiency, but as of 0.19.3 I believe they are not part of the code. At least I can't perceive any benefits.

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726923#action_12726923 ] 

Robin Anil commented on MAHOUT-124:
-----------------------------------

Added

LRUCache<K,V> 
LFUCache<K,V> 
HybridCache<K,V>

The LFU was implemented using two maps: a TreeMap for eviction of keys and a HashMap for the data.
LRU uses LinkedHashMap directly.
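
One possible way to build an LFU cache from those two maps is sketched below: a HashMap holds the values with their access counts, and a TreeMap from access count to keys finds the eviction victim. This is an illustration only. The class LfuCacheSketch and the tie-breaking rule (oldest key within the lowest count) are assumptions, not the implementation in the patch.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative LFU cache: HashMap for the data, TreeMap (count -> keys) for eviction. */
final class LfuCacheSketch<K, V> {
    private static final class Entry<V> {
        V value;
        int count = 1;
        Entry(V value) { this.value = value; }
    }

    private final int capacity;
    private final Map<K, Entry<V>> data = new HashMap<>();
    // Sorted by access count, so firstEntry() is the least frequently used bucket
    private final TreeMap<Integer, Set<K>> byCount = new TreeMap<>();

    LfuCacheSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Returns the cached value or null; a hit increments the key's access count. */
    V get(K key) {
        Entry<V> e = data.get(key);
        if (e == null) {
            return null;
        }
        touch(key, e);
        return e.value;
    }

    /** Inserts or updates a value, evicting the least frequently used key when full. */
    void put(K key, V value) {
        Entry<V> e = data.get(key);
        if (e != null) {
            e.value = value;
            touch(key, e);
            return;
        }
        if (data.size() >= capacity) {
            evict();
        }
        data.put(key, new Entry<>(value));
        byCount.computeIfAbsent(1, c -> new LinkedHashSet<>()).add(key);
    }

    int size() {
        return data.size();
    }

    /** Moves the key from its current count bucket to the next one. */
    private void touch(K key, Entry<V> e) {
        removeFromBucket(e.count, key);
        e.count++;
        byCount.computeIfAbsent(e.count, c -> new LinkedHashSet<>()).add(key);
    }

    /** Removes the oldest key in the lowest-count bucket. */
    private void evict() {
        Map.Entry<Integer, Set<K>> lowest = byCount.firstEntry();
        K victim = lowest.getValue().iterator().next();
        removeFromBucket(lowest.getKey(), victim);
        data.remove(victim);
    }

    private void removeFromBucket(int count, K key) {
        Set<K> bucket = byCount.get(count);
        bucket.remove(key);
        if (bucket.isEmpty()) {
            byCount.remove(count);
        }
    }
}
```

Every get() and put() costs a TreeMap operation here, which is logarithmic in the number of distinct access counts rather than constant time.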

PS: the numbers given below differ drastically from those above for two reasons
* I am playing an mp3 and a big download is going on
* I have encapsulated the LinkedHashMap LRU inside the LRUCache<K,V> template class. The additional overhead is due to the extra call

So I ran the tests twice, once with only LRU and once with both LFU and LRU. It is safe to say the numbers are stable, given that the two steady background processes were running during both.

Given below are the numbers.
* The number of HBase row lookups has gone down from 200859 to 194228 after adding an LFUCache, without inflicting additional overhead.
* The max classification time and std deviation have gone down with the inclusion of an LFU cache.

{noformat}
LFU Capacity = 0
LRU Capacity  = 100000
09/07/03 16:49:21 INFO cbayes.CBayesModel: 48000000 47799141 200859 99.58154 100000
nCalls = 18828;
sumTime = 431.78778s;
minTime = 0.0ms;
maxTime = 13197.114ms;
meanTime = 22.933277ms;
stdDevTime = 128.23552ms;

LFU Capacity = 20000
LRU Capacity  = 100000
09/07/03 17:01:17 INFO cbayes.CBayesModel: 48000000 47805772 194228 99.59536 120000
nCalls = 18828;
sumTime = 428.96442s;
minTime = 0.0ms;
maxTime = 10064.428ms;
meanTime = 22.783323ms;
stdDevTime = 110.68307ms;
{noformat}
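
Reading the five numbers on each CBayesModel log line as total lookups, cache hits, HBase lookups (misses), hit rate in percent and total cache capacity is an assumption, since the log format is not spelled out in this thread. Under that reading the figures are internally consistent, which the small check below reproduces. The class and method names are made up for illustration.

```java
/** Illustrative arithmetic check of the logged cache and timing figures. */
final class CacheLogCheck {
    /** Hit rate in percent, given the number of hits and the total number of lookups. */
    static double hitRatePercent(long hits, long total) {
        return 100.0 * hits / total;
    }

    /** Mean time per call in milliseconds, given the summed time in seconds. */
    static double meanMillis(double sumSeconds, long calls) {
        return 1000.0 * sumSeconds / calls;
    }

    public static void main(String[] args) {
        // LRU only: 48000000 lookups, 47799141 hits, 200859 misses
        System.out.println(hitRatePercent(47799141L, 48000000L));
        System.out.println(meanMillis(431.78778, 18828));
        // LFU + LRU: 48000000 lookups, 47805772 hits, 194228 misses
        System.out.println(hitRatePercent(47805772L, 48000000L));
        System.out.println(meanMillis(428.96442, 18828));
    }
}
```

In both runs hits plus misses add up to 48000000, and sumTime divided by nCalls matches the reported meanTime to within rounding.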

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Updated: (MAHOUT-124) Online Classification using HBase

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-124:
------------------------------

    Attachment: MAHOUT-124-July-13.patch

* Changes in CLI switches for TestClassifier
** -m or --model for ModelName of BasePath
** -source or --dataSource = (hdfs|hbase) for specifying the dataSource
** -d or --testDir (base directory of test data)

* Deleted BayesModel and CBayesModel (kept the Model abstract class since it contains reference implementation code; it can be removed later)
* Usage: ClassifierContext classifier = new ClassifierContext(new Algorithm, new Datastore)
* Choice of Bayes/CBayes Algorithm and InMemoryBayes/HBaseBayes Datastore
* Added documentation to the public interfaces and classes
* Earlier tests modified to work with the new API. All in-memory tests pass. Will have to figure out a way to mimic an HBase server to test the HbaseDatastore class.
* Verified: same classification results for Bayes and CBayes over either the in-memory or the HBase Datastore
* Removed the HBase LZO compression/InMemory option in this patch. Will have to provide external configuration methods so users can choose these optimizations if their cluster supports them
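
The Algorithm/Datastore split in the usage line above can be sketched as follows. Only the names ClassifierContext, Algorithm and Datastore come from the comment; every interface method, the "Sketch" class names, and the toy sum-of-weights scoring are hypothetical stand-ins and not Mahout's actual API or its Bayes/CBayes logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Datastore: where the model weights live.
interface DatastoreSketch {
  double weight(String label, String feature);
  Iterable<String> labels();
}

// Hypothetical stand-in for an Algorithm: how a document is scored.
interface AlgorithmSketch {
  String classify(String[] document, DatastoreSketch store);
}

// In-memory store; an HBase-backed store would implement the same interface.
class InMemoryStoreSketch implements DatastoreSketch {
  private final Map<String, Map<String, Double>> weights = new HashMap<>();

  void put(String label, String feature, double w) {
    weights.computeIfAbsent(label, k -> new HashMap<>()).put(feature, w);
  }

  public double weight(String label, String feature) {
    Map<String, Double> perLabel = weights.get(label);
    Double w = perLabel == null ? null : perLabel.get(feature);
    return w == null ? 0.0 : w;
  }

  public Iterable<String> labels() { return weights.keySet(); }
}

// Toy scoring (sum of feature weights per label); NOT Bayes or CBayes.
class SumOfWeightsSketch implements AlgorithmSketch {
  public String classify(String[] document, DatastoreSketch store) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String label : store.labels()) {
      double score = 0.0;
      for (String feature : document) score += store.weight(label, feature);
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

// The context only wires an algorithm to a datastore, so either can be swapped.
class ClassifierContextSketch {
  private final AlgorithmSketch algorithm;
  private final DatastoreSketch datastore;

  ClassifierContextSketch(AlgorithmSketch algorithm, DatastoreSketch datastore) {
    this.algorithm = algorithm;
    this.datastore = datastore;
  }

  String classify(String[] document) { return algorithm.classify(document, datastore); }
}
```

The point of the design is that the same algorithm should give the same result over either datastore, which is what the "Verified" item in the list above reports.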



> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-July-13.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS


[jira] Commented: (MAHOUT-124) Online Classification using HBase

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728325#action_12728325 ] 

stack commented on MAHOUT-124:
------------------------------

Here are a few small comments, Robin:

In HybridCache, you import

+import org.apache.hadoop.hbase.client.Result;

Is this intentional?

Similarly, in BayesModel and HbaseModelReader there are HBase imports that do not look as though they are being used.

Do you have to write your own LRUCache?  Can you not rob one from elsewhere?

In BayesTfIdfReducer, if you write to HBase, you also write to the collector?  You might consider using TableOutputFormat when writing to HBase.

In the same class, in configure, you make an HTable whether it's being used or not.

Regarding:

+      hcd.setBloomfilter(true);
+      hcd.setInMemory(true);
+      hcd.setBlockCacheEnabled(true);
+      hcd.setCompressionType(Algorithm.LZO);

FYI, the bloom filter setting has no effect.  Have you done the work to put in the LZO support?

I would suggest you not suppress these exceptions:

{code}
+      try {
+        hba.disableTable(output);
+        hba.deleteTable(output);
+      } catch (TableNotFoundException ex) {
+      } catch (TableNotDisabledException ex) {
+      }
+      hba.createTable(ht);

{code}

If the table is not fully disabled, then it's going to give you bother when you try to recreate it.  It would be good to know why.  You should also add a compaction on .META. between the delete and the create, as the shell command does, to avoid known issues with removing big tables.
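
A sketch of the shape this could take, with failures propagating instead of being swallowed. To keep it self-contained, TableAdminSketch is a hypothetical stand-in interface rather than the real HBase admin class, and its method names are assumptions; the .META. compaction step is left out because its exact call is not shown in this thread.

```java
import java.io.IOException;

// Hypothetical stand-in for the admin object; NOT the real HBase API.
interface TableAdminSketch {
  boolean tableExists(String name) throws IOException;
  void disableTable(String name) throws IOException;
  void deleteTable(String name) throws IOException;
  void createTable(String name) throws IOException;
}

class TableRecreator {
  // Drops the table if present, then creates it again.
  static void recreate(TableAdminSketch admin, String name) throws IOException {
    if (admin.tableExists(name)) {
      // No empty catch blocks: if disable or delete fails, the caller
      // sees the exception and the create is never attempted.
      admin.disableTable(name);
      admin.deleteTable(name);
    }
    admin.createTable(name);
  }
}
```

Checking for existence first removes the need to catch a "not found" case at all, and any failure to disable surfaces with its cause before the create is attempted.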

> Online Classification using HBase
> ---------------------------------
>
>                 Key: MAHOUT-124
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-124
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>         Attachments: MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS
