You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2004/04/11 21:15:18 UTC
Carrot 2 (was: Re: clustering results)
Carrot (2):
http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en
Otis
--- kevin@ckhill.com wrote:
> I got all excited reading the subject line "clustering results" but
> this isn't
> really clustering is it? This is more sorting. Does anyone know of
> any work
> within Lucene (or another indexer) to do actual subject clustering
> (i.e. like
> Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)?
> It would
> be pretty awesome if Lucene had such ability, I know there aren't a
> whole lot
> of clustering options, and the commercial products are very
> expensive.
> Anyhow, just curious.
>
> A brief definition of clustering: automatically organizing search or
> database
> query results into meaningful hierarchical folders ... transforming
> long lists
> of search results into categorized information without any clumsy
> pre-
> processing of the source documents.
>
> I'm not sure how it would be done...? Based off of top Term
> Frequencies for a
> document?
>
> -K
>
> Quoting "Michael A. Schoen" <sc...@earthlink.net>:
>
> > So as Venu pointed out, sorting doesn't seem to help the problem.
> If we have
> > to walk the result set, access docs and dedupe using brute force,
> we're
> > better off w/ the standard order by relevance.
> >
> > If you've got an example of this type of clustering done in a more
> efficient
> > way, that'd be great.
> >
> > Any other ideas?
> >
> >
> > ----- Original Message -----
> > From: "Erik Hatcher" <er...@ehatchersolutions.com>
> > To: "Lucene Users List" <lu...@jakarta.apache.org>
> > Sent: Saturday, April 10, 2004 12:35 AM
> > Subject: Re: clustering results
> >
> >
> > > On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> > > > I have an index of urls, and need to display the top 10 results
> for a
> > > > given query, but want to display only 1 result per domain. It
> seems
> > > > that using either Hits or a HitCollector, I'll need to access
> the doc,
> > > > grab the domain field (I'll have it parse ahead of time) and
> only
> > > > take/display documents that are unique.
> > > >
> > > > A significant percentage of the time I expect I may have to
> access
> > > > thousands of results before I find 10 in unique domains. Is
> there a
> > > > faster approach that won't require accessing thousands of
> documents?
> > >
> > > I have examples of this that I can post when I have more time,
> but a
> > > quick pointer... check out the overloaded IndexSearcher.search()
> > > methods which accept a Sort. You can do really really
> interesting
> > > slicing and dicing, I think, using it. Try this one on for size:
> > >
> > > example.displayHits(allBooks,
> > > new Sort(new SortField[]{
> > > new SortField("category"),
> > > SortField.FIELD_SCORE,
> > > new SortField("pubmonth", SortField.INT, true)
> > > }));
> > >
> > > Be clever indexing the piece you want to group on - I think you
> may
> > > find this the solution you're looking for.
> > >
> > > Erik
> > >
> > >
> > >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> > >
> >
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org