Posted to user@mahout.apache.org by "Scott C. Cote" <sc...@gmail.com> on 2015/01/23 06:49:55 UTC

Re: streaming kmeans vs incremental canopy/solr/kmeans

Mahout Gurus,

I’m back at the text clustering game (after a hiatus of a year).  Not for
recommendation purposes - thanks for the book and for the idea of Solr for
recommendation ….  That’s cool (I found Ted at Data Days in Austin - nice to
see you again).

My question:
How do I apply streaming cluster technology to text when I don’t have
accurate vectors?  

Let me explain exactly what I mean.
I have a series of sentences coming at me over time.  A sentence may contain
words that are not yet in my “dictionary” when I receive it.  I need to group
the similar sentences together, so I want to cluster the sentences.
The streaming clustering library in Mahout assumes that the text has already
been vectorized.  So how do I vectorize a sentence that has words that are
not in the dictionary?

Do I save the intermediate pieces of the prior TF-IDF calculation and update
them incrementally?
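
To make that concrete, here is roughly the bookkeeping I have in mind - plain
Java, and the class and method names are mine, not anything out of Mahout:
keep a growing dictionary plus document-frequency counts, and hand out a fresh
index the first time a word shows up, so a sentence with unseen words can
still be vectorized.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of a dictionary that grows as new words arrive, with DF counts kept alongside. */
public class IncrementalDictionary {

  private final Map<String, Integer> termIndex = new HashMap<>();    // term -> column index
  private final Map<String, Integer> docFrequency = new HashMap<>(); // term -> # sentences containing it
  private long numDocs = 0;

  /** Turn one analyzed sentence into raw term counts, adding unseen terms to the dictionary. */
  public Map<Integer, Double> termCounts(Iterable<String> tokens) {
    Map<Integer, Double> counts = new HashMap<>();
    Set<String> distinct = new HashSet<>();
    for (String token : tokens) {
      int index = termIndex.computeIfAbsent(token, t -> termIndex.size()); // next free column
      counts.merge(index, 1.0, Double::sum);
      distinct.add(token);
    }
    for (String term : distinct) {          // bump DF once per distinct term in this sentence
      docFrequency.merge(term, 1, Integer::sum);
    }
    numDocs++;
    return counts;
  }

  /** Smoothed IDF for a term, based on whatever has been seen so far. */
  public double idf(String term) {
    int df = docFrequency.getOrDefault(term, 0);
    return Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0;
  }

  public int dimension() {
    return termIndex.size();
  }
}

The raw counts plus the DF table would be the “elements of the prior
calculation” I mean above - the actual TF-IDF weighting could be applied
later, once per batch.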

...

Ugh - I think I just figured out my source of confusion.

Please confirm my understanding:

Streaming does NOT imply an unbounded set of data ….

I will have a set of sentences that arrives in some period of time T.
Those that arrive in time T will be treated as a “batch” and vectorized in
the usual fashion (TF-IDF).
Then I feed the batched vector sets into the shiny new streaming methods
(instead of using the tired old canopy combined with straight k-means) to
arrive at my groupings.

- No time or CPU burned up discovering canopies.
- No intermediate disk consumed pushing canopy output into k-means.

Nice groups.
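
To check my own reading, here is a toy one-pass sketch of the streaming step
(this is not Mahout’s actual StreamingKMeans code - just my simplified
understanding, with made-up names): each incoming vector either merges into
the nearest sketch centroid or, if it is too far from everything, seeds a new
centroid, so there is no separate canopy pass and nothing intermediate hits
the disk.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy one-pass streaming sketch: merge into the nearest centroid or start a new one. */
public class StreamingSketch {

  /** A centroid is the running mean of the sparse vectors merged into it. */
  static class Centroid {
    final Map<Integer, Double> mean;
    double weight;

    Centroid(Map<Integer, Double> v) {
      mean = new HashMap<>(v);
      weight = 1.0;
    }

    void merge(Map<Integer, Double> v) {
      double newWeight = weight + 1.0;
      for (Map.Entry<Integer, Double> e : mean.entrySet()) {
        e.setValue(e.getValue() * weight / newWeight);                 // scale the old mean down
      }
      for (Map.Entry<Integer, Double> e : v.entrySet()) {
        mean.merge(e.getKey(), e.getValue() / newWeight, Double::sum); // add the new point's share
      }
      weight = newWeight;
    }
  }

  private final List<Centroid> centroids = new ArrayList<>();
  private final double distanceCutoff;   // how far a vector may be before it seeds a new centroid

  public StreamingSketch(double distanceCutoff) {
    this.distanceCutoff = distanceCutoff;
  }

  /** Feed one TF-IDF vector; returns the index of the centroid it landed in. */
  public int add(Map<Integer, Double> vector) {
    int best = -1;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = cosineDistance(vector, centroids.get(i).mean);
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    if (best < 0 || bestDist > distanceCutoff) {
      centroids.add(new Centroid(vector));   // too far from everything: start a new group
      return centroids.size() - 1;
    }
    centroids.get(best).merge(vector);
    return best;
  }

  static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      normA += e.getValue() * e.getValue();
      Double bv = b.get(e.getKey());
      if (bv != null) {
        dot += e.getValue() * bv;
      }
    }
    for (double v : b.values()) {
      normB += v * v;
    }
    if (normA == 0.0 || normB == 0.0) {
      return 1.0;
    }
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}

As I understand Ted’s note below, the real pipeline would still run ball
k-means over the resulting sketch centroids to boil them down to the final
clusters, but the one-pass part is where the canopy time and the intermediate
output go away.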

So all I have to do is keep updating the TF-IDF as new sentences arrive and
re-“ball” the sentences with the fast, shiny streaming clustering technology.

My big hurdle is coming up with an efficient way to update the TF-IDF (ideas
are welcome).
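
One idea I’m toying with for that update (again just a sketch on top of the
bookkeeping above, with my own made-up names): persist the raw per-sentence
term counts and the global DF table, bump the DF counts as each new batch of
sentences arrives, and then re-weight the stored count vectors with the fresh
IDF values, rather than re-running the whole dictionary/frequency job over
the corpus.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: re-weight stored raw term-count vectors with the current per-term IDF values. */
public class TfIdfRefresher {

  /**
   * counts:     one raw term-count vector per sentence (term index -> count), saved from earlier batches
   * idfByIndex: the IDF value for each term index, recomputed once per batch from the DF table
   */
  public static List<Map<Integer, Double>> reweight(List<Map<Integer, Double>> counts,
                                                    Map<Integer, Double> idfByIndex) {
    List<Map<Integer, Double>> weighted = new ArrayList<>(counts.size());
    for (Map<Integer, Double> doc : counts) {
      Map<Integer, Double> v = new HashMap<>();
      for (Map.Entry<Integer, Double> e : doc.entrySet()) {
        double idf = idfByIndex.getOrDefault(e.getKey(), 0.0);
        v.put(e.getKey(), e.getValue() * idf);   // tf * idf
      }
      weighted.add(v);
    }
    return weighted;
  }
}

The only pass over old data is then a cheap re-weighting, and the expensive
work per batch is the clustering itself.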


On a separate note - over the last year I have been using Markdown and
developing my documentation skills.  I held off on writing docs on canopy
because I saw that it is going to be deprecated (Suneel).  Does my use case
sound like a good example for streaming?  If yes, I’ll cook up my specifics
into a postable example.  Also - just checking - streaming isn’t going to
be deprecated, is it?


I know that I crammed a whole bunch of questions into this letter, so I
will truly appreciate y’all being patient and wading through.

Regards,

SCott


On 2/14/14, 12:55 PM, "Ted Dunning" <te...@gmail.com> wrote:

>In-memory ball k-means should solve your problem pretty well right now.
> In-memory streaming k-means followed by ball k-means will take you to
>well
>beyond your scaled case.
>
>At 1 million documents, you should be able to do your clustering in a few
>minutes, depending on whether some of the sparse matrix performance issues
>got fixed in the clustering code (I think they did).
>
>
>
>
>On Fri, Feb 14, 2014 at 10:50 AM, Scott C. Cote
><sc...@gmail.com>wrote:
>
>> Right now - I'm dealing with only 40,000 documents, but we will
>>eventually
>> grow more than 10x (put on the manager hat and say 1 mil docs) where a
>>doc
>> is usually no longer than 20 or 30 words.
>>
>> SCott
>>
>> On 2/14/14 12:46 PM, "Ted Dunning" <te...@gmail.com> wrote:
>>
>> >Scott,
>> >
>> >How much data do you have?
>> >
>> >How much do you plan to have?
>> >
>> >
>> >
>> >On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote <sc...@gmail.com>
>> >wrote:
>> >
>> >> Hello All,
>> >>
>> >> I have two questions (Q1, Q2).
>> >>
>> >> Q1: Am digging in to Text Analysis and am wrestling with competing
>> >>analyzed
>> >> data maintenance strategies.
>> >>
>> >> NOTE: my text comes from a very narrowly focused source.
>> >>
>> >> - Am currently crunching the data (batch) using the following scheme:
>> >> 1. Load source text as rows in a mysql database.
>> >> 2. Create named TFIDF vectors using a custom analyzer from source
>>text
>> >> (-stopwords, lowercase, std filter, …)
>> >> 3. Perform Canopy Cluster and then Kmeans Cluster using an enhanced
>> >>cosine
>> >> metric (derived from a custom metric found in MiA)
>> >> 4. Load references of Clusters into SOLR (core1) - cluster id, top
>>terms
>> >> along with full cluster data into Mongo (a cluster is a doc)
>> >> 5. Then load source text into SOLR(core2) using same custom analyzer
>> >>with
>> >> appropriate boost along with the reference cluster id
>> >> NOTE: in all cases, the id of the source text is preserved throughout
>> >>the
>> >> flow in the vector naming process, etc.
>> >>
>> >> So now I have a mysql table,  two SOLR cores, and a Mongo Document
>> >> Collection (all tied together with text id as the common name)
>> >>
>> >> - Now when  a new document enters the system after "batch" has been
>> >> performed, I use core2 to test the top  SOLR matches (custom analyzer
>> >> normalizes the new doc) to find best cluster within a tolerance.  If
>>a
>> >> cluster is found, then I place the text in that cluster - if not,
>>then I
>> >> start a new group (my word for a cluster not generated via kmeans).
>> >>Either
>> >> way, the doc makes its way into both (core1 and core2). I keep track
>>of
>> >>the
>> >> number of group creations/document placements so that if a threshold
>>is
>> >> crossed, then I can re-batch the data.
>> >>
>> >> MiA (I think ch. 11) suggests that a user could run the canopy
>> >>cluster
>> >> routine to assign new entries to the clusters (instead of what I am
>> >>doing).
>> >> Does he mean to regenerate a new dictionary, frequencies, etc for the
>> >> corpus
>> >> for every inbound document?  My observations have been that this has
>> >>been a
>> >> very speedy process, but I'm hoping that I'm just too much of a
>>novice
>> >>and
>> >> haven't thought of a way to simply update the dictionary/frequencies.
>> >>  (this
>> >> process also calls for the eventual rebatching of the clusters).
>> >>
>> >> While I was very early in my "implement what I have read" process,
>> >>Suneel
>> >> and Ted recommended that I examine the Streaming Kmeans process.
>>Would
>> >> that
>> >> process sidestep much of what I'm doing?
>> >>
>> >> Q2: I need to really understand the lexicon of my corpus.  How do I
>>see
>> >>the
>> >> list of terms that have been omitted due either to being in too many
>> >> documents or are not in enough documents for consideration?
>> >>
>> >> Please know that I know that I can look at the dictionary to see what
>> >>terms
>> >> are covered.  And since my custom analyzer is using the
>> >> StandardAnalyzer.stop words, those are obvious also.  If there isn't
>>an
>> >> option to emit the  omitted words, where would be the natural place
>>to
>> >> capture that data and save it into yet another data store (Sequence
>> >> file,etc)?
>> >>
>> >> Thanks in Advance for the Guidance,
>> >>
>> >> SCott
>> >>
>> >>
>> >>
>>
>>
>>