Posted to user@mahout.apache.org by Shahid Shaikh <sh...@gmail.com> on 2014/09/23 11:48:40 UTC

Apache Mahout 0.9 LDA CVB Example

Hi,

I am currently working on a project that needs to categorize documents
(unstructured data) based on the content of each document. I am using
Apache Mahout clustering for this. So far we have explored K-means and
Canopy with K-means. We have also used Lucene analyzers with lowercase and
stop-word filters. We have experimented with custom distance measures and
different configurations, but unfortunately the resulting clusters were not
meaningful. We realized that a Lucene analyzer with a stop-word filter only
removes words from the documents, which is not a complete solution; we need
some NLP together with clustering. Referring to the Mahout in Action book,
I found that LDA is one possible solution, as it implements topic-modelling
based clustering.



We have tried that process:

  1. Generate sequence files from the text documents.
  2. Generate TF vectors only.
  3. Generate a matrix from the vectors.
  4. Run the cvb0_local Mahout job with the matrix as input. This job
     returns docOutputFile and topicOutputFile.



I need help interpreting the CVB output, as the reference material
available is confusing.

I was expecting that CVB would generate clusters that can be interpreted
and displayed as such. In some references I have found that vectordump is
used to dump these outputs as vectors, but how those vectors map to
clusters is not shown anywhere.
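
To make the question concrete, here is a minimal Python sketch of the
mapping I am guessing at. It assumes docOutputFile holds one topic-weight
vector per document and topicOutputFile holds one term-weight vector per
topic. The toy data, the document ids, and the helper names top_terms and
assign_clusters are made up for illustration; they are not Mahout output
or Mahout API. Each document goes into the cluster of its highest-weight
topic, and each cluster is labelled with that topic's top terms.

```python
# Hypothetical toy data, NOT real cvb0_local output.
# doc_topics[doc_id][k]: weight of topic k in the document
# (what I assume docOutputFile contains after vectordump).
doc_topics = {
    "doc1": [0.80, 0.15, 0.05],
    "doc2": [0.10, 0.70, 0.20],
    "doc3": [0.60, 0.30, 0.10],
}

# topic_terms[k][term]: weight of the term in topic k
# (what I assume topicOutputFile contains after vectordump).
topic_terms = [
    {"oil": 0.30, "price": 0.25, "barrel": 0.20, "bank": 0.01},
    {"bank": 0.35, "rate": 0.30, "loan": 0.15, "oil": 0.02},
    {"wheat": 0.40, "grain": 0.30, "crop": 0.10, "rate": 0.03},
]


def top_terms(term_weights, n=3):
    """Return the n highest-weighted terms of one topic."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]


def assign_clusters(doc_topics):
    """Group document ids by the index of their highest-weight topic."""
    clusters = {}
    for doc_id, weights in doc_topics.items():
        best_topic = max(range(len(weights)), key=weights.__getitem__)
        clusters.setdefault(best_topic, []).append(doc_id)
    return clusters


clusters = assign_clusters(doc_topics)
for topic_id, doc_ids in sorted(clusters.items()):
    print(topic_id, top_terms(topic_terms[topic_id]), doc_ids)
```

Is this the intended way to read the two files, or is there a recommended
approach?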

I also saw the example script "cluster-reuters.sh", which likewise runs
vectordump on the "reuters-lda-topics" output of the cvb job.



How do we actually use LDA/CVB clustering in real-time clustering
solutions? Any example with an interpretation of the CVB output would also
be helpful for us to proceed.





Thanks a lot for your help and support.


Regards,
Shaikh Shahid G .
+91 9503954781