Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2012/01/01 02:15:19 UTC

Re: how to prepare data efficiently for mahout

Hector is a more industrial-strength client for Cassandra. I have not used it.

https://github.com/rantav/hector

On Sat, Dec 31, 2011 at 10:50 AM, Sean Owen <sr...@gmail.com> wrote:
> You might get some mileage out of this article I wrote about using
> Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:
>
> http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/
>
> On Sat, Dec 31, 2011 at 10:36 AM, Allen <an...@gmail.com> wrote:
>
>> Hello there,
>>
>> I am new to Mahout and trying to get Mahout running on our data
>> storage -- Cassandra. After poking around the LDA example on Reuters
>> data, I have several questions.
>>
>> 1) Where is the source code for seqdirectory and seq2sparse?
>>
>> 2) Before the algorithm can run, it looks like the raw text must be
>> converted and materialized into a sequence file that represents some
>> vectors. Is that true? If so, is there a more efficient way to handle
>> the conversion, such as streaming the data? In my project, all the data
>> is in Cassandra. To run a Mahout algorithm, it seems I need to get the
>> data out, put it into a temporary directory in HDFS, convert it into
>> sequence files, and finally turn those into tf-vectors format in HDFS.
>> Only then can I run the algorithm. The procedure above stores two
>> temporary copies of the data, which will make the run slow.
>>
>> Many thanks.
>>
>> --
>> Allen
>>
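
On the streaming question in (2): below is a minimal sketch of the idea,
not Mahout code. It turns documents into term-frequency vectors one at a
time as they are read, so no intermediate copy of the raw text is
written. Everything in it is an assumption made for illustration: the
function names are hypothetical, the rows are a stand-in for whatever
iterator a Cassandra client returns, and the dictionary is built
incrementally, whereas seq2sparse builds its dictionary in a separate
step. A real job would still have to write the result in the vector
format Mahout expects.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split raw text into lowercase tokens."""
    return TOKEN.findall(text.lower())


def stream_tf_vectors(rows, dictionary=None):
    """Yield (doc_id, sparse_tf_vector) for each (doc_id, text) row.

    rows       -- any iterable of (doc_id, text) pairs; a stand-in for
                  rows read from the data store.
    dictionary -- term -> integer index, filled in as new terms appear.

    Each document is processed and yielded immediately, so only one
    document's text is held in memory at a time.
    """
    if dictionary is None:
        dictionary = {}
    for doc_id, text in rows:
        counts = Counter(tokenize(text))
        vector = {}
        for term, n in counts.items():
            # Assign the next free index to a term seen for the first time.
            index = dictionary.setdefault(term, len(dictionary))
            vector[index] = n
        yield doc_id, vector


if __name__ == "__main__":
    # Made-up sample rows, only to show the call pattern.
    sample_rows = iter([
        ("doc1", "Mahout reads vectors"),
        ("doc2", "vectors, vectors everywhere"),
    ])
    terms = {}
    for doc_id, vector in stream_tf_vectors(sample_rows, terms):
        print(doc_id, vector)
```

The trade-off is that the term indices depend on the order in which
documents arrive, and weighting schemes that need corpus-wide counts
(tf-idf, for example) would still need a second pass or a stored
dictionary.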



-- 
Lance Norskog
goksron@gmail.com