Posted to user@mahout.apache.org by Sam Cunningham <sa...@yahoo.com> on 2011/11/14 04:47:30 UTC

Classifying documents in database

I have a database of documents: each tuple contains a document that needs
to be classified. Does the Mahout API provide a way to connect to the DB,
fetch each document, classify it, and write the label back to the database?

I am aware that I can connect to the DB separately, loop through the tuples,
convert each tuple to a document, use the Mahout API to classify it, and
write the labels back to the database at the end. Is this the way to go?

To be more specific, does BayesFileFormatter in the Mahout API come with a
readerToDatabase method? Or is there a way to use the readerToDocument method
with a database tuple instead of Files.newReader()?

What is the best practice for reading from and writing to a DB from a Mahout
classifier?



--
View this message in context: http://lucene.472066.n3.nabble.com/Classifying-documents-in-database-tp3505846p3505846.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Classifying documents in database

Posted by Yuval Feinstein <yu...@citypath.com>.
Here's one way - albeit indirect:

a. Index your DB into Solr, using the DataImportHandler:
http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS

b. Now you will have a Lucene index, which you can import into Mahout:
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html#CreatingVectorsfromText-FromLucene

c. Train your classifier inside Mahout.

d. Run the classifier on the needed records, and get an output file in the
format:
<record id> <label>

e. Use a script to insert the results into the database.
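
Step (e) can be a very small script. The following is a minimal sketch in
Python, assuming the classifier output has one whitespace-separated
"<record id> <label>" pair per line. The `documents` table and its `id` and
`label` columns are hypothetical names, and SQLite stands in for whatever
database is actually in use.

```python
import sqlite3


def parse_labels(lines):
    """Yield (record_id, label) pairs from '<record id> <label>' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so labels may contain spaces.
        record_id, label = line.split(None, 1)
        yield record_id, label


def write_labels(conn, pairs):
    """Write each label to its row; return the number of rows updated."""
    cur = conn.cursor()
    updated = 0
    for record_id, label in pairs:
        # `documents`, `id`, and `label` are assumed names, not from the thread.
        cur.execute(
            "UPDATE documents SET label = ? WHERE id = ?",
            (label, record_id),
        )
        updated += cur.rowcount  # 0 when no row has this id
    conn.commit()
    return updated
```

With a real output file this could be called as
`write_labels(conn, parse_labels(open(path)))`, where `path` is wherever
step (d) wrote its results.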

I am a Mahout newbie, so there might be more efficient ways.

Cheers,
Yuval



Re: Classifying documents in database

Posted by Ted Dunning <te...@gmail.com>.
No.  There are no specific examples describing classification against a
database.

On Mon, Nov 14, 2011 at 8:03 AM, Sam Cunningham <sa...@yahoo.com> wrote:

> Thanks for your answer, Ted. Any examples on this? I have the "Mahout In Action"
> book and I don't see any sample scenario covering that. By the way, I am using
> the Bayes, CBayes algorithms. Greatly appreciated if you can point a sample
> implementation on this.
>

Re: Classifying documents in database

Posted by Sam Cunningham <sa...@yahoo.com>.
Thanks for your answer, Ted. Are there any examples of this? I have the
"Mahout in Action" book and I don't see a sample scenario covering it. By the
way, I am using the Bayes and CBayes algorithms. I would greatly appreciate a
pointer to a sample implementation.



Re: Classifying documents in database

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Nov 13, 2011 at 7:47 PM, Sam Cunningham <sa...@yahoo.com> wrote:

> I have a database of documents. In other words, each tuple contains a
> document that needs to be classified. Does Mahout API provide such
> capability that I connect to DB, get the document, classify and write the
> label back to database?
>

Yes.  You can do this, particularly with the SGD classifiers.  The API for
the Naive Bayes classifiers is a bit more complex, but it also supports
this scenario.


> I am aware I can connect to DB separately, loop through tuples, convert each
> tuple to a document, then use Mahout API to classify, and write back to the
> database, at the end. Is this the way to go?
>

More or less, yes.  Document encoding is inherently application specific.
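
The shape of that loop might look like the sketch below. It is written in
Python with SQLite only to keep it self-contained. The `documents` table, its
`id`, `body`, and `label` columns, and the `classify` callable are all
assumptions: in a real application `classify` would wrap the
application-specific document encoding plus a call into a trained model,
neither of which is shown here.

```python
import sqlite3


def classify_table(conn, classify, batch_size=500):
    """Label every unlabeled row in batches; return the number of rows labeled."""
    total = 0
    while True:
        # Fetch the next batch of rows that still have no label.
        rows = conn.execute(
            "SELECT id, body FROM documents WHERE label IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        # classify() is a placeholder for encoding plus the model call.
        updates = [(classify(body), doc_id) for doc_id, body in rows]
        if any(label is None for label, _ in updates):
            # A NULL label would be fetched again forever, so fail loudly.
            raise ValueError("classify() must return a non-NULL label")
        conn.executemany("UPDATE documents SET label = ? WHERE id = ?", updates)
        conn.commit()
        total += len(rows)
```

Working in batches keeps memory bounded and commits progress as it goes, so
an interrupted run can be restarted and will pick up only the rows that are
still unlabeled.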

You can also parallelize this for higher throughput, but you have to watch
out for the fact that a large number of parallel tasks can slam your
database pretty easily.
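
One way to get parallelism without multiplying database connections is to run
only the classification step in a bounded worker pool and keep all database
I/O on a single connection. The sketch below illustrates that idea in Python
with SQLite. The `documents` table, its columns, and the `classify` callable
are hypothetical, and this is just one possible arrangement, not something
Mahout provides.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def classify_parallel(conn, classify, workers=4, batch_size=500):
    """Classify with a worker pool while one connection does all database I/O."""
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            rows = conn.execute(
                "SELECT id, body FROM documents WHERE label IS NULL LIMIT ?",
                (batch_size,),
            ).fetchall()
            if not rows:
                return total
            ids = [doc_id for doc_id, _ in rows]
            bodies = [body for _, body in rows]
            # Only the placeholder classify() calls run in parallel;
            # pool.map returns results in the same order as its input.
            labels = list(pool.map(classify, bodies))
            if any(label is None for label in labels):
                raise ValueError("classify() must return a non-NULL label")
            # A single connection performs every read and write.
            conn.executemany(
                "UPDATE documents SET label = ? WHERE id = ?",
                list(zip(labels, ids)),
            )
            conn.commit()
            total += len(rows)
```

The database sees one reader/writer no matter how large `workers` is, so
raising the worker count increases classification throughput without
increasing the number of concurrent queries.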



> To be more specific, does BayesFileFormatter in Mahout API come with
> readerToDatabase method? or is there a way to use readerToDocument method
> along with a database tuple instead of Files.newReader()?
>

No.


> What is the best practice to connect and read/write from/to DB from Mahout
> classifier?
>

I think you described it right off the bat.

Re: Classifying documents in database

Posted by Lance Norskog <go...@gmail.com>.
It might be less work to turn the SGD classifier into a Pig UDF library.
Then you don't have to write any database code.

On Sun, Nov 13, 2011 at 7:47 PM, Sam Cunningham <sa...@yahoo.com> wrote:

> I am aware I can connect to DB separately, loop through tuples, convert each
> tuple to a document, then use Mahout API to classify, and write back to the
> database, at the end. Is this the way to go?
>



-- 
Lance Norskog
goksron@gmail.com