Posted to user@nutch.apache.org by "Wilkerson, Cory" <cw...@cars.com> on 2005/09/12 18:09:38 UTC
Sequences of Terms
So...I've had good/great luck finding all terms in my index using the
Lucene API - life is good. Now - I'm trying to take things a step
further and find sequences of key words (maybe two/three/four word
combinations). It's great that I can find "new" and "orleans", but I'm
mostly interested in articles that contain "new orleans". I realize I
can *search* for these terms but I'm more interested in writing an
engine that says "Hey, these sequences seem to be fairly important
because they're occurring quite a bit across this index."
Any suggestions?
Cory Wilkerson
Re: [Nutch-general] Re: Sequences of Terms
Posted by Lars Aronsson <la...@aronsson.se>.
Andy Liu wrote:
> You can try indexing all 2-grams, 3-grams, and 4-grams in your corpus. Then
> you can examine all the terms in your index and see which n-grams are used
> the most.
Another idea would be to download a dump of the Wikipedia database
(from http://download.wikimedia.org/) and check each candidate
n-gram against the list of article titles. If the named concept has
an article there, such as the 2-gram "Henry Ford" or the 4-gram
"weapons of mass destruction", it is probably a meaningful phrase.
That's just a loose idea. I haven't tried it. Are there any good
introductory books on modern, post-Google information retrieval?
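
For what it's worth, the title-lookup idea could be sketched roughly as below. This is only an illustration, not something tested against a real dump: it assumes the titles arrive as one title per line with underscores in place of spaces, and the helper names (normalize_title, load_titles, known_concepts) are made up for the example.

```python
def normalize_title(title):
    """Lowercase, turn underscores into spaces, collapse whitespace."""
    return " ".join(title.replace("_", " ").lower().split())


def load_titles(lines):
    """Build a set of normalized titles from an iterable of lines.

    Assumption: one title per line, e.g. "Henry_Ford". In practice the
    lines would come from an open file handle on the titles dump.
    """
    return {normalize_title(line) for line in lines if line.strip()}


def known_concepts(candidates, titles):
    """Keep only the candidate n-grams that match an article title."""
    return [c for c in candidates if normalize_title(c) in titles]


# Tiny in-memory stand-in for the real title list (hypothetical data).
titles = load_titles([
    "Henry_Ford\n",
    "Weapons_of_mass_destruction\n",
    "New_Orleans\n",
])

candidates = ["henry ford", "of mass", "weapons of mass destruction", "ford said"]
matches = known_concepts(candidates, titles)
print(matches)
```

Keeping the titles in a set makes each lookup cheap, which matters once the candidate list holds every n-gram in the corpus.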
--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik - http://aronsson.se
Re: Sequences of Terms
Posted by Andy Liu <an...@gmail.com>.
You can try indexing all 2-grams, 3-grams, and 4-grams in your corpus. Then
you can examine all the terms in your index and see which n-grams are used
the most.
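
A rough sketch of the counting step, in plain Python rather than against the Lucene API. The tokenizer is a deliberately naive assumption (lowercase, alphanumeric runs only), and the function names are invented for illustration. It counts how many documents each n-gram appears in, which is one reasonable reading of "used the most".

```python
from collections import Counter
import re


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(documents, sizes=(2, 3, 4)):
    """Count, for each n-gram, the number of documents containing it."""
    doc_freq = Counter()
    for text in documents:
        tokens = tokenize(text)
        seen = set()  # count each n-gram at most once per document
        for n in sizes:
            seen.update(ngrams(tokens, n))
        doc_freq.update(seen)
    return doc_freq


# Hypothetical sample documents.
docs = [
    "Flights to New Orleans are cheap",
    "New Orleans jazz festival",
    "A new car in Orleans county",
]

counts = count_ngrams(docs)
print(counts.most_common(3))
```

On a real corpus the top of the list will likely be dominated by stop-word pairs such as "of the", so some filtering of common words is probably needed before the ranking is useful.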
--
Andy Liu
andyliu1227@gmail.com
(301) 873-8458