Posted to user@mahout.apache.org by Shivani Rao <sg...@purdue.edu> on 2010/11/17 23:25:07 UTC

R and Mahout integration

Hello,
I am an R user and am now using Mahout for ML algorithms on big datasets that
are out of reach of R.
R has a hadoopstreaming package, and I was wondering whether an interface
between Mahout and R has been developed.

My question arises from the fact that the Lucene vectors/sparse matrices
created by Mahout are unintelligible unless there is a way to access them in R.

I have just tested Apache Mahout by building a Latent Dirichlet Allocation
model on a corpus of 30 documents. I did not have Hadoop installed on the
system, so Mahout ran locally to produce the resulting model. I would like to
access the model parameters, i.e. the estimated \alpha, \beta, \Phi, and
\Theta.

How can I access these?

<Mahout bin location>/mahout lda -i <tf-vectors location>/tf-vectors -o
<lda-out-dir> -k 4 -v 27

I can see that <lda-out-dir> has a folder <state-i> for each iteration (I
presume) of the learning algorithm. Each <state-i> has a single file,
part-r-0000, which I do not know how to access.

Do I need to use HBase to be able to access the data generated by Mahout?

I apologize if my naive questions annoy you; I am new to Mahout.

Regards,
Shivani


Re: R and Mahout integration

Posted by Ted Dunning <te...@gmail.com>.
There are two questions here.

a) how to read Mahout vectors from R.  My typical answer here is to emit
row, column, value triples in CSV form.  R can read these and populate a
sparse matrix of its own with those.

Here is a hint about how you convert triples into a sparse matrix in R:

> library(Matrix)
Loading required package: lattice

Attaching package: 'Matrix'

The following object(s) are masked from 'package:base':

    det

> sparseMatrix(x=c(1,1,1,1), i=c(1,2,3,3), j=c(1,1,2,1))
3 x 2 sparse Matrix of class "dgCMatrix"

[1,] 1 .
[2,] 1 .
[3,] 1 1
>
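
The same idea can be sketched starting from a CSV file of triples rather than
hand-typed vectors. This is only an illustration: the column names (row, col,
value) and the inline sample data are assumptions, not something Mahout emits
by default, and the indices are assumed to be 1-based.

```r
library(Matrix)

# Hypothetical CSV of (row, column, value) triples. The header names are an
# assumption; in practice this would be a file written by your own exporter.
csv_text <- "row,col,value
1,1,1
2,1,1
3,2,1
3,1,1"

# read.csv() accepts inline text; use read.csv("triples.csv") for a real file.
triples <- read.csv(text = csv_text)

# Build the sparse matrix. Indices are treated as 1-based here; if the
# exporter writes 0-based indices, pass index1 = FALSE to sparseMatrix().
m <- sparseMatrix(i = triples$row, j = triples$col, x = triples$value)
print(m)
```

If the matrix has trailing empty rows or columns, the dims argument of
sparseMatrix() can be given explicitly, since the triples alone only reveal
the largest index that actually occurs.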


b) how to get model parameters from LDA.  That I can't help you with without
digging in and I am a bit swamped for that right off the bat.
