Posted to user@mahout.apache.org by Bogdan Vatkov <bo...@gmail.com> on 2010/05/01 13:27:42 UTC

Re: Clustering single doc as multiple docs

Thanks Ted! That was what I needed!

On Fri, Apr 30, 2010 at 10:21 PM, Ted Dunning <te...@gmail.com> wrote:

> Yes.  Splitting by paragraph should work fine (been there, done that).
>
> Splitting by sentence works well if you do something like SVD to smooth
> over the fact that you have few words per sentence.
>
> Splitting by paragraph is pretty easy, but corpus specific.  For plain text,
> try looking for blank lines.  For HTML, make a list of breaking markup and
> insert split points wherever you find those.  For other formats you will
> need to put on your thinking cap.
>
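A minimal sketch of the paragraph splitting described above, using only the Python standard library. The function names and the BREAKING_TAGS list are assumptions made for illustration, not anything from Mahout; the tag list in particular would need tuning per corpus.

```python
import re
from html.parser import HTMLParser

# Hypothetical list of "breaking" markup; adjust it for the corpus at hand.
BREAKING_TAGS = {"p", "br", "div", "li", "tr", "blockquote",
                 "h1", "h2", "h3", "h4", "h5", "h6"}


def split_plain_text(text):
    """Split plain text into paragraphs at runs of blank lines."""
    chunks = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace and drop empty chunks.
    return [" ".join(c.split()) for c in chunks if c.strip()]


class _ParagraphSplitter(HTMLParser):
    """Collects text and starts a new paragraph at every breaking tag."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAKING_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BREAKING_TAGS:
            self.flush()

    def handle_data(self, data):
        # Note: script/style contents are not filtered out in this sketch.
        self._buffer.append(data)


def split_html(html):
    """Split HTML into paragraphs at the tags listed in BREAKING_TAGS."""
    parser = _ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any text after the last breaking tag
    return parser.paragraphs
```

Each returned string can then be treated as its own document when building vectors for clustering.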
> Sentence splitting is easy to do 90% correctly, hard to do better than 99%
> especially in some domains.  For your purposes, 90% is probably fine.  Start
> with the simplest possible case and add a few special cases and you will be
> set.  There may be usable software to be found on the net, but your needs
> are very modest.
>
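A rough sketch of the "simplest possible case plus a few special cases" approach to sentence splitting described above. The split rule (sentence-ending punctuation followed by whitespace) and the short ABBREVIATIONS list are assumptions chosen for illustration; this aims at the "90% correct" end of the range, not at a robust splitter.

```python
import re

# Hypothetical, deliberately short abbreviation list; extend it per domain.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.",
                 "e.g.", "i.e.", "etc.", "vs."}

# Simplest case: a candidate boundary is '.', '!' or '?' followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, rejoining splits made after abbreviations."""
    sentences = []
    current = []
    for piece in _BOUNDARY.split(text.strip()):
        if not piece:
            continue
        current.append(piece)
        last_token = piece.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            # Special case: the period belongs to an abbreviation,
            # so keep accumulating instead of ending the sentence.
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

For example, split_sentences("Dr. Smith split the text. Did it work?") keeps "Dr. Smith split the text." together as one sentence. A sentence that genuinely ends in a listed abbreviation (such as "etc.") will be merged with the next one, which is the kind of error the 90% figure allows for.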
> Good luck!
>
> Let us know how it goes.
>
> On Fri, Apr 30, 2010 at 10:32 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
>
> > Btw, why do you think splitting and clustering won't work? Has anybody
> > tried this?
> > I am not sure it will be successful but I also do not have the arguments
> > that it should not lead to a meaningful result.
> > If I split a doc per sentence it might not get good results but if I use
> > larger pieces, e.g. paragraphs it might give some topics (sets of
> > keywords).
> > Anyone tried something like this?
> >
>



-- 
Best regards,
Bogdan