Posted to user@nutch.apache.org by "Wilkerson, Cory" <cw...@cars.com> on 2005/09/12 18:09:38 UTC

Sequences of Terms

So...I've had good/great luck finding all terms in my index using the
Lucene API - life is good.  Now - I'm trying to take things a step
further and find sequences of key words (maybe two/three/four word
combinations).  It's great that I can find "new" and "orleans", but I'm
mostly interested in articles that contain "new orleans".  I realize I
can *search* for these terms but I'm more interested in writing an
engine that says "Hey, these sequences seem to be fairly important
because they're occurring quite a bit across this index."  

Any suggestions?
Cory Wilkerson

Re: [Nutch-general] Re: Sequences of Terms

Posted by Lars Aronsson <la...@aronsson.se>.
Andy Liu wrote:

> You can try indexing all 2-grams, 3-grams, and 4-grams in your corpus. Then 
> you can examine all the terms in your index and see which n-grams are used 
> the most.

Another idea is to download a dump of the Wikipedia database 
(from http://download.wikimedia.org/) and check each candidate 
n-gram against the list of article titles to see whether the named 
concept has an article there, such as the 2-gram "Henry Ford" or 
the 4-gram "weapons of mass destruction".
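
A rough sketch of that title lookup, in Python for brevity. It assumes the
titles arrive one per line with underscores in place of spaces; that format
and the helper names (normalize_title, load_titles, known_concepts) are
illustrative assumptions, not something specified in this thread.

```python
def normalize_title(title):
    """Lowercase a title and turn underscores into spaces (assumed dump format)."""
    return title.strip().replace("_", " ").lower()


def load_titles(lines):
    """Build a set of normalized titles from an iterable of lines, skipping blanks."""
    return {normalize_title(line) for line in lines if line.strip()}


def known_concepts(candidates, titles):
    """Keep only the candidate n-grams that match a known article title."""
    return [c for c in candidates if c.lower() in titles]


if __name__ == "__main__":
    # Hypothetical sample; a real run would pass an open file of titles instead.
    titles = load_titles(["Henry_Ford\n", "Weapons_of_mass_destruction\n"])
    print(known_concepts(["henry ford", "ford said"], titles))
```

Keeping the titles in a set makes each lookup cheap, so the check can be run
over every candidate n-gram pulled from the index.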

That's just a loose idea.  I haven't tried it.  Are there any good 
introductory books on modern, post-Google information retrieval?


-- 
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se

Re: Sequences of Terms

Posted by Andy Liu <an...@gmail.com>.
You can try indexing all 2-grams, 3-grams, and 4-grams in your corpus. Then 
you can examine all the terms in your index and see which n-grams are used 
the most.
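
A minimal sketch of the counting step, in plain Python rather than against
the Lucene API. The tokenizer is a deliberately crude assumption (lowercase
alphanumeric runs), and the function names are made up for illustration. It
counts how many documents each n-gram appears in, which is one reasonable
reading of "occurring quite a bit across this index".

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_document_frequency(docs, sizes=(2, 3, 4)):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        tokens = tokenize(text)
        seen = set()  # each n-gram counts once per document
        for n in sizes:
            seen.update(ngrams(tokens, n))
        df.update(seen)
    return df


if __name__ == "__main__":
    # Hypothetical sample documents, not taken from any real index.
    docs = ["Flooding in New Orleans", "New Orleans jazz festival", "A new car"]
    for gram, count in ngram_document_frequency(docs).most_common(3):
        print(count, gram)
```

Raw counts favor sequences of very common words ("of the", "in a"), so some
stop-word filtering or a minimum-frequency cutoff would probably be needed
before the top of the list looks like "new orleans".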

On 9/12/05, Wilkerson, Cory <cw...@cars.com> wrote:
> 
> So...I've had good/great luck finding all terms in my index using the
> Lucene API - life is good. Now - I'm trying to take things a step
> further and find sequences of key words (maybe two/three/four word
> combinations). It's great that I can find "new" and "orleans", but I'm
> mostly interested in articles that contain "new orleans". I realize I
> can *search* for these terms but I'm more interested in writing an
> engine that says "Hey, these sequences seem to be fairly important
> because they're occurring quite a bit across this index."
> 
> Any suggestions?
> Cory Wilkerson
> 



-- 
Andy Liu
andyliu1227@gmail.com
(301) 873-8458