Posted to user@mahout.apache.org by 戴睿 <ge...@gmail.com> on 2012/03/23 15:11:32 UTC

Mahout Clustering and HBase

Hello,
I'm new to Mahout, and I've read the Support of
HBase<http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/ajax/%3CCACpbbiJHP3JVmwU1GURL2nW9obw236Ju33Xt9%2BDtnLtzDyCVQg%40mail.gmail.com%3E>
thread before, but I still don't get it.
Mahout's input and output are stored in HDFS, and I'm wondering: is there
any way to cluster input data read directly from HBase and write the output
to an HTable instead of HDFS? That could save a lot of time converting data
between Hadoop and Mahout.

I really look forward to your answer. Thank you.

Re: Mahout Clustering and HBase

Posted by Ioan Eugen Stan <st...@gmail.com>.
2012/3/23 戴睿 <ge...@gmail.com>:
> Hello,
> I'm new to Mahout, and I've read the Support of
> HBase<http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/ajax/%3CCACpbbiJHP3JVmwU1GURL2nW9obw236Ju33Xt9%2BDtnLtzDyCVQg%40mail.gmail.com%3E>
> thread before, but I still don't get it.
> Mahout's input and output are stored in HDFS, and I'm wondering: is there
> any way to cluster input data read directly from HBase and write the output
> to an HTable instead of HDFS? That could save a lot of time converting data
> between Hadoop and Mahout.
>
> I really look forward to your answer. Thank you.

Hi genius33232,

The best approach is to use a custom vectorization step that transforms
your data into Vectors the way Mahout wants them. Take a look at the
DictionaryVectorizer source code
(https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/DictionaryVectorizer.html)
and SparseVectorsFromSequenceFiles
(http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java?view=markup).
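For text data, you don't even have to call SparseVectorsFromSequenceFiles from Java; it's exposed through the mahout CLI driver as seq2sparse. A hypothetical invocation (the HDFS paths are placeholders, not from this thread) would look like:

```shell
# Hypothetical example: run the stock SparseVectorsFromSequenceFiles driver.
# Input is a SequenceFile<Text,Text> of document id -> text on HDFS;
# both paths below are placeholders.
mahout seq2sparse \
  -i /user/me/text-seqfiles \
  -o /user/me/vectors \
  -wt tfidf \
  -ow
# The output directory then contains the tfidf-vectors and the
# dictionary.file-* sequence files that the clustering job needs.
```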

You can write your own MR job that reads data from HBase, creates vectors,
and then saves them to HDFS so they can be provided as input to the
clustering job. You will need to provide both the dictionary sequence file
and the vector sequence files for the next step.
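A minimal sketch of such a mapper, assuming HBase's TableMapper API. The table schema here is entirely made up (a column family "f" holding numeric columns, one vector dimension per qualifier); adapt it to your own data:

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical sketch: reads rows from an HBase table and emits
// Mahout vectors keyed by row key. The family name "f" and the
// hashing of qualifiers to dimensions are assumptions.
public class HBaseToVectorsMapper extends TableMapper<Text, VectorWritable> {

  private static final byte[] FAMILY = Bytes.toBytes("f"); // assumed family
  private static final int CARDINALITY = 10000;            // assumed size

  @Override
  protected void map(ImmutableBytesWritable row, Result result, Context ctx)
      throws IOException, InterruptedException {
    Vector vector = new RandomAccessSparseVector(CARDINALITY);
    // One dimension per column qualifier; values assumed to be doubles.
    for (Map.Entry<byte[], byte[]> e :
         result.getFamilyMap(FAMILY).entrySet()) {
      int dim = Math.abs(Bytes.toString(e.getKey()).hashCode()) % CARDINALITY;
      vector.set(dim, Bytes.toDouble(e.getValue()));
    }
    ctx.write(new Text(Bytes.toString(row.get())), new VectorWritable(vector));
  }
}
```

You would wire this up in the driver with TableMapReduceUtil.initTableMapperJob(...) and set SequenceFileOutputFormat as the output format, so the job writes a SequenceFile of Text/VectorWritable pairs that the clustering job can consume directly.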

You can import the clustered data back into HBase and delete it from HDFS
afterwards, or hack the clustering job to write its output to HBase instead
of HDFS.
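The import step could be a simple standalone program that reads the clusteredPoints output of k-means and issues Puts. A rough sketch, assuming Mahout's WeightedVectorWritable output format and a made-up target table/schema:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

// Hypothetical sketch: reads k-means clusteredPoints output
// (IntWritable cluster id -> WeightedVectorWritable point) and writes
// each point's cluster id into an assumed HBase table "clusters".
public class ClustersToHBase {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "clusters"); // assumed table name
    Path path = new Path(args[0]); // e.g. a clusteredPoints/part-* file
    SequenceFile.Reader reader =
        new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    while (reader.next(clusterId, point)) {
      // Deriving a row key from the vector is schema-specific; the
      // string form is used here purely as a placeholder.
      Put put = new Put(Bytes.toBytes(point.getVector().toString()));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("clusterId"),
              Bytes.toBytes(clusterId.get()));
      table.put(put);
    }
    reader.close();
    table.close();
  }
}
```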

I suggest you take the first approach until you are getting the results you
need, and then move on to the next step. You could also index the clustered
data with Solr, but that depends on your use case and data size.

From my experience with Mahout, it's not very easy to modify those jobs,
but the devs know this and it's on the wishlist (making Mahout more like a
library).

Happy hacking,

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/

Re: Mahout Clustering and HBase

Posted by 戴清灏 <ro...@gmail.com>.
Hi,
I think they serve different use cases.
HBase is for high-concurrency access, while Mahout is for data mining &
machine learning.
There probably aren't many people running Mahout jobs at the same time.

Regards,
Q



On 2012/3/23 at 10:11 PM, 戴睿 <ge...@gmail.com> wrote:

> Hello,
> I'm new to Mahout, and I've read the Support of
> HBase<http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/ajax/%3CCACpbbiJHP3JVmwU1GURL2nW9obw236Ju33Xt9%2BDtnLtzDyCVQg%40mail.gmail.com%3E>
> thread before, but I still don't get it.
> Mahout's input and output are stored in HDFS, and I'm wondering: is there
> any way to cluster input data read directly from HBase and write the output
> to an HTable instead of HDFS? That could save a lot of time converting data
> between Hadoop and Mahout.
>
> I really look forward to your answer. Thank you.
>