Posted to java-user@lucene.apache.org by Dan Katz <dk...@cymfony.com> on 2006/01/17 20:52:33 UTC

Lucene Query Writing question

I apologize if I am sending this to the wrong place, but I need some help
writing Lucene queries.  I am not the Lucene manager at our company, just
an unsophisticated user who would appreciate any help that can be
provided.
 
Question 1)   Is there a way in Lucene to restrict matches based on term
count?  For example, "atleast5 Apple" to find items containing the word
apple only when it is mentioned at least 5 times.
 
Question 2) We use Lucene to index articles from Web sites. In these
documents I want to find mentions of a Web site, but not of email
addresses at that Web site.   I write something like "website.com NOT
\@website.com".  This works to a point.  However, it also excludes
documents in which both website.com and @website.com are mentioned.  I
want to eliminate the content that has only @website.com, but keep any
document where website.com appears without the @.  Does anyone know how
I would write this query?
 
Again, I apologize if I am sending this to the wrong place, and I would be
thankful for any help I can get.
 
Dan Katz
Cymfony
 

 


Re: Lucene Query Writing question

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 17 January 2006 20:52, Dan Katz wrote:
...
> Question 1)   Is there a way in Lucene to have some sort of limit based
> on term count.  For example,  "atleast5 Apple" to find items with the
> word apple only when it has at least 5 mentions.

This can be done, but you'll need to write your own TermQuery and
TermScorer for this. Just add the requirement of a minimum number
of term occurrences in your own TermScorer.
Have a look at the Java code of TermScorer; it should be straightforward
to do this.
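
To make the minimum-occurrence idea concrete, here is a rough sketch of the
check such a scorer would apply: keep a document only when the term frequency
reaches a threshold. It is plain Java and deliberately does not use Lucene's
TermQuery or TermScorer classes; the class MinTermFrequencySketch, its methods
and the naive tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only; this is NOT Lucene's TermQuery/TermScorer API.
final class MinTermFrequencySketch {

    // Counts occurrences of `term` in `text` after lowercasing and a naive
    // split on anything that is not a letter or digit (an assumed tokenization).
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Returns the ids of documents in which `term` occurs at least `minFreq` times.
    static List<String> docsWithAtLeast(Map<String, String> docs, String term, int minFreq) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            if (termFrequency(entry.getValue(), term) >= minFreq) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "Apple apple APPLE apple, apple pie");
        docs.put("doc2", "One apple a day");
        // Lists the ids of documents mentioning "apple" at least 5 times.
        System.out.println(docsWithAtLeast(docs, "apple", 5));
    }
}
```

In a real custom scorer the frequency would come from the index postings
rather than from re-reading the text, so the comparison against the minimum
is the only part of this sketch that carries over.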

>  
> Question 2) We use Lucene to index articles from Web sites. When I have
> these documents I want to find when a Web site is mentioned, but not the
> email addresses of a Web site.   I write something like "website.com NOT
> \@website.com".  This works to a point.  However, it also excludes the
> documents when the website.com AND the @website.com is mentioned.  I
> want to eliminate the content that only has @website.com but keep it
> whenever the @ is not present.  Does anyone know how I would write this
> query?

You'll need to make sure that the query term website.com does not match
@website.com, so that you can simply query for website.com.
I don't know how the StandardAnalyzer deals with this case.
You may need to use your own Analyzer to make sure that @website.com
is only indexed as @website.com and never as website.com.
If you need to know how some text is indexed, try Luke:
http://www.getopt.org/luke/
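
As a rough illustration of the tokenization idea above, the following sketch
splits text on whitespace only, so that @website.com (or user@website.com)
stays one token and never produces a separate website.com token. It is plain
Java, not an implementation of Lucene's Analyzer; the class WholeTokenSketch,
its methods and the punctuation handling are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this is NOT Lucene's Analyzer API.
final class WholeTokenSketch {

    // Splits on whitespace only, so "@website.com" and "user@website.com"
    // remain single tokens. Trailing sentence punctuation is stripped so that
    // "website.com." at the end of a sentence still yields "website.com".
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[.,;:!?]+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // True when `term` appears as a whole token, i.e. the site is mentioned
    // on its own and not only as part of an address.
    static boolean mentions(String text, String term) {
        return tokenize(text).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(mentions("Read it on website.com today", "website.com"));
        System.out.println(mentions("Mail us at info@website.com", "website.com"));
        System.out.println(mentions("Write to @website.com or visit website.com.", "website.com"));
    }
}
```

With tokens kept whole like this, a plain query for website.com needs no NOT
clause: a document containing both forms still matches, while one containing
only the address form does not.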

Regards,
Paul Elschot
