Posted to user@mahout.apache.org by Ryan Josal <ry...@josal.com> on 2013/03/21 18:37:03 UTC

Building a Mahout pipeline with Solr

Hi,

  So I've read MiA, Taming Text, and lots of blog posts.  I've played around with Mahout on a (test) CDH4.2 cluster of 16 nodes.  What I'm still trying to figure out is how to put this puzzle together with Solr, and I'm hoping someone with more experience might have some tips and best practices.

I have:
  I've indexed text documents into a Lucene index (Solr 3.6), pre-tagged with the industry each doc relates to.  There are ~3.3 million docs with k=177 unique industries, and each doc can have multiple industries.  In production there will be untagged docs mixed in.

I want to:
  Tag new/existing documents that haven't already been tagged with industries.  Bonus if it works "online".  After this, I'd like to use a similar pipeline to create a topic hierarchy with no known topics.

  So to start it off, I should use lucene.vector to prepare the data.  One blog post Grant wrote suggested hooking into Solr's event system, say on optimize, to kick off a Mahout job.  I've been using both Solr and Lucene for a while, but I'm new to Mahout and Hadoop.  So my first question with this approach is: how do I end up with the vectors in HDFS for further processing?  Does the index itself need to be in HDFS?  Is the right architecture for this pipeline to have the Solr Master oversee everything?
  Another option might be a brand new Java component that periodically runs this batch process.  A dedicated Java component appeals to me because it could encapsulate the whole pipeline and load its own configuration.  But I'm not quite clear on how to embed Mahout in Java: where that component would live, where the Mahout libraries would live, or how to invoke the jobs.
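  To make that Java component idea concrete, here is roughly what I'm picturing: call the lucene.vector driver programmatically (equivalent to running bin/mahout lucene.vector from the shell) against the local Solr index, then copy the output into HDFS.  This is just a sketch: the paths and field names are made up, I'm assuming the Driver class lives in org.apache.mahout.utils.vectors.lucene like ClusterLabels does, and the option names may differ a bit between Mahout versions.

    // Sketch only: vectorize the local Solr/Lucene index with Mahout's
    // lucene.vector driver, then copy the vectors into HDFS for the
    // downstream clustering/LDA jobs.  Paths and field names are made up.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.utils.vectors.lucene.Driver;

    public class VectorizeIndex {
      public static void main(String[] args) throws Exception {
        // Same options as "bin/mahout lucene.vector"; option names may vary
        // by Mahout version, so check the driver's --help output.
        Driver.main(new String[] {
            "--dir",     "/var/solr/data/index",        // local Lucene index
            "--output",  "/tmp/vectors/vectors.seq",    // SequenceFile of VectorWritables
            "--field",   "body",                        // analyzed text field
            "--idField", "id",                          // unique key field
            "--dictOut", "/tmp/vectors/dictionary.txt", // term dictionary
            "--weight",  "TFIDF"
        });

        // Push the vectors into HDFS so the MapReduce jobs can read them.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path("/tmp/vectors"), new Path("/user/ryan/vectors"));
      }
    }

Does something like that look sane, or is there a more standard way to wire this up?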
  So let's say I've got my vectors.  I've tried running fkmeans on the Hadoop cluster with k=177, but the map tasks always run out of heap space.  I'm guessing the required heap is roughly 8 bytes * k * numDocs * 2, and my map tasks are limited to 2G of heap, so my max k is ~40.  That works, but seems small.  If I wanted fine-grained topic clustering, would I have to cluster 40 at a time and then re-cluster each of those 40?  If clustering does work out, I suppose I could label my clusters with the pre-tags using mahout.utils.vectors.lucene.ClusterLabels.
  So now I'm looking at CVB LDA (with TF-weighted vectors this time); if I can reduce dimensions enough, this should work with any number of docs.  I was following the wiki on dimensionality reduction, but (as another thread on this list found) the output of lucene.vector has LongWritable keys and LDA takes IntWritable.  I was able to run SVD on the vectors, but the transform job runs into the same IntWritable problem.  That thread suggested modifying the rowid job to convert LongWritable to IntWritable, which makes me lean further towards a Java component to manage things.  Assuming I successfully run LDA and have my topic models and doc/topic distribution output, I should be able to label my topics with something like ClusterLabels.
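  For the key conversion, this is the sort of one-off step I have in mind instead of patching the rowid job itself.  Again just a sketch with placeholder paths, assuming the lucene.vector output is a SequenceFile of LongWritable/VectorWritable pairs:

    // Sketch: rewrite SequenceFile<LongWritable, VectorWritable> as
    // SequenceFile<IntWritable, VectorWritable> so CVB/LDA will accept it.
    // Input and output paths are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.VectorWritable;

    public class LongToIntKeys {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path("/user/ryan/vectors/vectors.seq"), conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/user/ryan/vectors-int/matrix"),
            IntWritable.class, VectorWritable.class);

        LongWritable oldKey = new LongWritable();
        VectorWritable vector = new VectorWritable();
        int row = 0;
        while (reader.next(oldKey, vector)) {
          // CVB just needs unique int row ids; if I need the original doc ids
          // later I'd have to write out a separate docIndex mapping as well.
          writer.append(new IntWritable(row++), vector);
        }
        reader.close();
        writer.close();
      }
    }

If that's reasonable, it reinforces the idea of one Java component that owns all this glue.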
  Now, if I've gone the clustering route, MiA talks about how I could use Canopy clustering for online clustering.  If I've gone the LDA route, I hear I can use the topic models for a classifier; how can topic models be used with a classifier?  I can write an UpdateRequestProcessor plugin for Solr to do these things, but I'm not sure how that should be architected... should there be some service I call that returns a topic?  Also, after the batch runs over data that includes docs without topics, I need to get the discovered topics into the index.  I can iterate over the results easily enough, but should I store them in a separate SolrCore with a foreign key to the id field of the main core, or is it OK to update the main core in place?
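  On the Solr side, the UpdateRequestProcessor I'm imagining would look something like the sketch below.  classify() is a placeholder for whatever ends up wrapping the Mahout output (nearest cluster centroid, or most probable LDA topics), and the field names come from my schema:

    // Sketch of an UpdateRequestProcessor that tags incoming docs with
    // industries/topics before they are indexed.  classify() and the
    // field names ("industry", "body") are placeholders.
    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class IndustryTaggerFactory extends UpdateRequestProcessorFactory {

      // Placeholder for whatever ends up wrapping the trained Mahout model.
      static List<String> classify(String body) {
        return Collections.emptyList();
      }

      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // Only tag documents that arrived without industries.
            if (doc.getFieldValue("industry") == null) {
              for (String industry : classify((String) doc.getFieldValue("body"))) {
                doc.addField("industry", industry);
              }
            }
            super.processAdd(cmd);  // continue the normal update chain
          }
        };
      }
    }

Whether that placeholder calls out to a separate service or loads the model in-process is exactly the architecture question I'm unsure about.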

The grand question of course is, how would you do this?

Ryan