Posted to java-user@lucene.apache.org by Valerio Schiavoni <va...@gmail.com> on 2006/03/21 12:09:45 UTC

how to cluster documents

Hello,
I'm not sure 'cluster' is the correct term, but here is what I would
like to do:
I have a small set of categories, and I have manually defined some keywords
for each category.
For example:

-spielberg: ET, Munich, Indiana Jones;
-sport: football, basket, volley, etc.;

Then, I have a fairly large archive of documents (HTML, PDF, DOC; ~5000 and
still growing), and I want to 'assign' each document
to those categories, possibly using Lucene (if it can help!).

What approach could I adopt?

thanks,
valerio

--
To Iterate is Human, to Recurse, Divine
James O. Coplien, Bell Labs
(how good is to be human indeed)
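
The task described above (assigning documents to predefined categories by keyword) can be sketched in plain Java. This is only a minimal illustration and not Lucene's API or anything proposed in the thread: the class name KeywordCategorizer, its methods, and the scoring rule (count whole-word keyword matches, highest count wins) are all assumptions made for the example. It also assumes the text has already been extracted from the HTML, PDF, or DOC files.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: scores a text against keyword lists.
public class KeywordCategorizer {

    // Category -> keywords, copied from the example in the post (illustrative only).
    static final Map<String, List<String>> CATEGORIES = new LinkedHashMap<>();
    static {
        CATEGORIES.put("spielberg", List.of("ET", "munich", "indiana jones"));
        CATEGORIES.put("sport", List.of("football", "basket", "volley"));
    }

    // Counts case-insensitive whole-word (or whole-phrase) matches of keyword in text.
    public static int countMatches(String text, String keyword) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Sums the keyword matches for each category.
    public static Map<String, Integer> score(String text) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : CATEGORIES.entrySet()) {
            int total = 0;
            for (String keyword : e.getValue()) {
                total += countMatches(text, keyword);
            }
            scores.put(e.getKey(), total);
        }
        return scores;
    }

    // Returns the highest-scoring category, or empty if no keyword matched.
    // On a tie, the category defined first wins.
    public static Optional<String> categorize(String text) {
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Integer> e : score(text).entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        System.out.println(categorize("He plays football and volley every week"));
        System.out.println(categorize("Indiana Jones and ET are on tonight"));
    }
}
```

Whole-word matching is used so that a short keyword such as "ET" does not fire inside "basket". With Lucene, the same idea would presumably become one query per category run against an index of the extracted text; that part is not shown here.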

Re: how to cluster documents

Posted by jason <gi...@gmail.com>.
I guess you should use some text mining tools; you can find them with Google.
I remember UIUC recently released one such tool. It is very good.


Re: how to cluster documents

Posted by Valerio Schiavoni <va...@gmail.com>.
Hi Grant,
I think what is more relevant is what you wrote here:
http://www.cnlp.org/apachecon2005/

about domain specialization, but it wasn't very detailed (maybe because it is
only 4 slides).


On 3/21/06, Grant Ingersoll <gs...@syr.edu> wrote:
>
> You might want to look at the Carrot2 project
> (http://www.carrot2.org/website/xml/index.xml).
>
> It does clustering and has support for Lucene.



Re: how to cluster documents

Posted by Grant Ingersoll <gs...@syr.edu>.
You might want to look at the Carrot2 project 
(http://www.carrot2.org/website/xml/index.xml).

It does clustering and has support for Lucene.


-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org