Posted to general@lucene.apache.org by Kaspar Fischer <ka...@dreizak.com> on 2009/12/18 15:10:37 UTC

Keywords indexing, "top words", and co-occurrence

Hi everybody,

I need to do some text analysis and am looking for a software library (in Java, preferably) to use for this. Lucene came to my mind first, but I actually hope that there is some library (based on Lucene, for example) that solves the problems directly.

What I want to do is the following:

1. In documents that get added to the system I need to find keywords from a predefined, fixed set of keywords. For example, the user will make a query for all documents containing the word "traffic" (this word need not be a keyword) and I want to show the number of keyword hits in all documents that contain "traffic":

- car, cars, automobile, automobiles (3)
- - Mercedes (2)
- - Ferrari (2)
- train, trains (4) // one doc contains "TGV", 3 contain "train" or "trains"
- - TGV (1)
- - ICE (0)
- plane, planes (5)
- - Boeing (4)
- - Airbus (1)

In short: I want to count keyword hits in the documents returned by some query. Notice that the keywords are hierarchically organized and may have synonyms ("car" = "cars" = "automobile").
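To make the counting rule concrete, here is a minimal plain-Java sketch (no Lucene involved) of what is being asked for. It assumes the matching documents are already available as strings. The class name KeywordHierarchyCounter and its methods are made up for illustration. A document counts once per keyword, and a hit on a child keyword ("TGV") also counts for its parent ("train"), as in the example list above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: counts documents per keyword,
// folding synonyms together and rolling child hits up to their parents.
final class KeywordHierarchyCounter {
    // lower-cased surface form -> canonical keyword
    private final Map<String, String> canonical = new HashMap<>();
    // canonical keyword -> parent keyword (no entry for top-level keywords)
    private final Map<String, String> parent = new HashMap<>();

    void addKeyword(String name, String parentName, String... synonyms) {
        canonical.put(name.toLowerCase(Locale.ROOT), name);
        for (String synonym : synonyms) {
            canonical.put(synonym.toLowerCase(Locale.ROOT), name);
        }
        if (parentName != null) {
            parent.put(name, parentName);
        }
    }

    // Number of documents that mention each keyword or one of its descendants.
    Map<String, Integer> countDocuments(List<String> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : documents) {
            Set<String> seen = new HashSet<>();
            // Naive tokenizer (assumption): split on anything that is not a letter or digit.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                String keyword = canonical.get(token);
                // Walk up the hierarchy; stop once a keyword was already recorded.
                while (keyword != null && seen.add(keyword)) {
                    keyword = parent.get(keyword);
                }
            }
            for (String keyword : seen) {
                counts.merge(keyword, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Keywords with no hits (like "ICE" above) are simply absent from the returned map. Because matching is case-insensitive here, an acronym such as "ICE" would also match the ordinary word "ice"; a real analyzer would need to handle that.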

2. If the user queries for a free-input word A ("hamburger", say), I want to find all keywords (from the above hierarchy) that are close to "hamburger" in some sense (word distance or some similar measure of distance in the text) and order them by number of occurrences.
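One possible reading of "close in some sense" is a fixed word window around each occurrence of the query word. The sketch below is plain Java, not a Lucene API. The class KeywordProximity, the window parameter and the tokenizer are assumptions made for illustration. It counts how often each keyword appears within the window and ranks keywords by that count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: window-based co-occurrence counting.
final class KeywordProximity {
    // Counts keyword occurrences at most `window` tokens away from the query word.
    // Keywords are expected in lower case.
    static Map<String, Integer> nearbyKeywordCounts(
            List<String> documents, String queryWord, Set<String> keywords, int window) {
        Map<String, Integer> counts = new HashMap<>();
        String query = queryWord.toLowerCase(Locale.ROOT);
        for (String text : documents) {
            String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(query)) {
                    continue;
                }
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i && keywords.contains(tokens[j])) {
                        counts.merge(tokens[j], 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    // Keywords ordered by descending count; ties are broken alphabetically.
    static List<String> rankByCount(Map<String, Integer> counts) {
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked;
    }
}
```

Note that this counts co-occurrences rather than documents: a keyword that sits near two occurrences of the query word in one document is counted twice. With at most 500 documents per query, a linear scan like this should be affordable.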

Can this be done in Lucene? Or do you know of any frameworks that achieve such results?

Regarding size, I expect the queries (for "traffic" in 1., or "hamburger" in 2.) to return at most 500 documents, and each document to contain at most 50 keywords.

Many thanks,
Kaspar

RE: Keywords indexing, "top words", and co-occurrence

Posted by "Rao, Vaijanath" <va...@corp.aol.com>.
 Hi,

You can write your own HitCollector and do the keyword counting inside
the collector.

Here is the pseudo code for doing this.

Create a class KeywordCountCollector that extends TopFieldDocCollector.

In the class, override the collect method, where you would do something
like this:

private final Map<String, Integer> keywordCountMap =
		new HashMap<String, Integer>();

public void collect(int doc, float score) {
	super.collect(doc, score); // keep the normal top-docs collection
	// searcher and getKeywordsFromDocument() are placeholders you supply
	Document document;
	try {
		document = searcher.doc(doc);
	} catch (IOException e) {
		throw new RuntimeException(e);
	}
	// tally one hit per keyword found in this document
	for (String keyword : getKeywordsFromDocument(document)) {
		Integer count = keywordCountMap.get(keyword);
		keywordCountMap.put(keyword, count == null ? 1 : count + 1);
	}
}

public Map<String, Integer> getKeywordCountMap() {
	return keywordCountMap;
}


Hope this helps.

--Thanks and Regards
Vaijanath N. Rao


-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Saturday, December 19, 2009 12:22 AM
To: general@lucene.apache.org
Subject: Re: Keywords indexing, "top words", and co-occurrence

Yes, Lucene will help you do this.  It won't do exactly what you want
without some effort on your part.

Sounds like what you want to do is

a) get a book on Lucene and SOLR

b) use standard indexers and a synonym lookup to produce multiple fields
based on the original text and the synonymed text

c) use SOLR's support for faceting to get the counts you are after.

--
Ted Dunning, CTO
DeepDyve

Re: Keywords indexing, "top words", and co-occurrence

Posted by Ted Dunning <te...@gmail.com>.
Yes, Lucene will help you do this.  It won't do exactly what you want
without some effort on your part.

Sounds like what you want to do is

a) get a book on Lucene and SOLR

b) use standard indexers and a synonym lookup to produce multiple fields
based on the original text and the synonymed text

c) use SOLR's support for faceting to get the counts you are after.
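Steps b) and c) can be pictured with a small plain-Java sketch. It does not use the Lucene or Solr APIs; the names FacetSketch and IndexedDoc and the two "fields" are hypothetical stand-ins. At index time each document gets a text field with the original tokens and a keywords field with the synonym-normalized keywords. At query time the documents are selected through the text field, and the values of the keywords field are counted, which is roughly what a facet count over such a field would report.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of "two fields plus faceting"; not Solr code.
final class FacetSketch {
    // One indexed document with two fields.
    static final class IndexedDoc {
        final Set<String> text = new HashSet<>();     // original tokens
        final Set<String> keywords = new HashSet<>(); // canonical keywords
    }

    // Index time: tokenize and map each token through the synonym lookup.
    static IndexedDoc index(String body, Map<String, String> synonymToKeyword) {
        IndexedDoc doc = new IndexedDoc();
        for (String token : body.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            doc.text.add(token);
            String keyword = synonymToKeyword.get(token);
            if (keyword != null) {
                doc.keywords.add(keyword);
            }
        }
        return doc;
    }

    // Query time: select on the text field, count values of the keywords field.
    static Map<String, Integer> facet(List<IndexedDoc> index, String queryTerm) {
        Map<String, Integer> counts = new TreeMap<>();
        String query = queryTerm.toLowerCase(Locale.ROOT);
        for (IndexedDoc doc : index) {
            if (!doc.text.contains(query)) {
                continue;
            }
            for (String keyword : doc.keywords) {
                counts.merge(keyword, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Doing the synonym mapping once at index time keeps the query-time work down to counting field values, which is the main attraction of the faceting approach.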

-- 
Ted Dunning, CTO
DeepDyve