Posted to java-user@lucene.apache.org by Mailing Lists Account <ml...@imorph.com> on 2003/02/12 10:16:52 UTC

Phrase query and porter stemmer

Hi,

I use PorterStemmer with my analyzer for indexing the documents.
And I have been using the same analyzer for searching too.

When I search for a phrase like "security" AND database, I would like to
avoid matches for terms like "secure" or "securities". I observed that
Google and a couple of other search engines do not return such matches.

1) In other words, is it possible, within a single query, to skip the Porter
   stemmer for phrase queries and still use it for other queries (such as
   term queries)?

2) As an alternative, is it advisable to construct a PhraseQuery manually by
   adding terms without applying the Porter stemmer?

regards
Ramesh
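
The behaviour asked about in option 1 can be sketched without any Lucene
classes. This is only an illustration under stated assumptions: each document
keeps both its raw tokens and its stemmed tokens, quoted clauses are matched
against the raw tokens, and unquoted clauses are stemmed first. The STEMS table
is a hand-written stand-in for a real Porter stemmer, and the class and method
names are hypothetical.

```java
import java.util.*;

public class ExactOrStemmedSearch {
    // Hand-written stand-in for a stemmer; a real analyzer would compute stems.
    private static final Map<String, String> STEMS = Map.of(
        "security", "secur",
        "secure", "secur",
        "securities", "secur",
        "database", "databas",
        "databases", "databas");

    public static String stem(String token) {
        return STEMS.getOrDefault(token, token);
    }

    // doc id -> raw tokens, and doc id -> stemmed tokens
    private final Map<Integer, Set<String>> exact = new TreeMap<>();
    private final Map<Integer, Set<String>> stemmed = new TreeMap<>();

    public void add(int docId, String text) {
        Set<String> raw = new HashSet<>();
        Set<String> stems = new HashSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            raw.add(token);
            stems.add(stem(token));
        }
        exact.put(docId, raw);
        stemmed.put(docId, stems);
    }

    // All clauses are ANDed. A clause wrapped in double quotes must match a
    // raw token; any other clause is stemmed before matching.
    public List<Integer> search(String... clauses) {
        List<Integer> hits = new ArrayList<>();
        for (Integer docId : exact.keySet()) {
            boolean all = true;
            for (String clause : clauses) {
                String c = clause.toLowerCase();
                boolean quoted = c.length() >= 2 && c.startsWith("\"") && c.endsWith("\"");
                boolean match = quoted
                    ? exact.get(docId).contains(c.substring(1, c.length() - 1))
                    : stemmed.get(docId).contains(stem(c));
                if (!match) { all = false; break; }
            }
            if (all) hits.add(docId);
        }
        return hits;
    }

    public static void main(String[] args) {
        ExactOrStemmedSearch index = new ExactOrStemmedSearch();
        index.add(1, "Database security basics");
        index.add(2, "A secure database");
        index.add(3, "Securities databases");
        System.out.println(index.search("\"security\"", "database")); // [1]
        System.out.println(index.search("security", "database"));     // [1, 2, 3]
    }
}
```

The cost of this approach is that every token is stored twice, once raw and
once stemmed.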



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

HTMLParser.jj

Posted by Pinky Iyer <pi...@yahoo.com>.

What is the .jj file? I cannot get it to work in NetBeans. Is it some sort of precompiler source?
Any help appreciated.
Thanks!
P Iyer




JSP Parsing

Posted by Pinky Iyer <pi...@yahoo.com>.

Hi!
How do I parse JSP files with Lucene? Any examples or ideas would be appreciated. I am new to Lucene and Java.
Thanks!
Pinky Iyer




Re: multi-station netspread indexing

Posted by Doug Cutting <cu...@lucene.com>.

The RemoteSearchable class (in the latest CVS) will let you do this.  It 
uses Java RMI to let you search indexes on other machines.  With a 
MultiSearcher you can then search a number of independently maintained 
indexes on different machines.  MultiSearcher searches indexes serially, 
but it would be fairly simple to extend it to be able to search in 
parallel, with each search in a separate thread.  That should, in 
theory, scale fairly well, but I don't know if anyone has tried this. 
If you do, please send a message describing your experience.

Doug
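
The "each search in a separate thread" idea above can be sketched as follows.
This is not the MultiSearcher or RemoteSearchable API: SubSearcher and Hit are
hypothetical stand-ins for one independently maintained index and one result,
and the thread pool comes from java.util.concurrent. It only shows the shape of
the fan-out and merge.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelSearchSketch {
    // Hypothetical stand-in for one independently maintained index.
    public interface SubSearcher {
        List<Hit> search(String query) throws Exception;
    }

    // Hypothetical result type: a document id with a score.
    public static final class Hit {
        public final String docId;
        public final double score;
        public Hit(String docId, double score) { this.docId = docId; this.score = score; }
        @Override public String toString() { return docId + ":" + score; }
    }

    // Runs the query on every sub-searcher concurrently, then merges by score.
    public static List<Hit> searchAll(List<SubSearcher> searchers, String query, int topN)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, searchers.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (SubSearcher s : searchers) {
                tasks.add(() -> s.search(query));
            }
            List<Hit> merged = new ArrayList<>();
            // invokeAll blocks until every task has finished.
            for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        SubSearcher a = q -> Arrays.asList(new Hit("a1", 0.9), new Hit("a2", 0.4));
        SubSearcher b = q -> Arrays.asList(new Hit("b1", 0.7));
        System.out.println(searchAll(Arrays.asList(a, b), "lucene", 2)); // [a1:0.9, b1:0.7]
    }
}
```

One caveat with this sketch: it merges raw scores, which assumes that scores
computed by separate indexes are comparable, and that may not hold in practice.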

Piotr Martyniak wrote:
> Hi,
> 
> It's kind of a newbie question - I'm working on a small system that
> allocates independent tasks to different computers on a local network.
>
> I've been wondering whether it is possible to use Lucene with a system of
> this kind. I mean, if 2 client computers get 2 different URLs, is it
> possible for them to do the indexing job independently and then just send
> the results to the server, which will combine them with the other results?
> If it's possible at all, do you think it makes sense to do it on 10, 100,
> or 1000 client computers? What advantages and disadvantages do you see in
> a solution like this?
> 
> Thanks in advance for all your opinions.
> Peter

multi-station netspread indexing

Posted by Piotr Martyniak <lu...@o2.pl>.

Hi,

It's kind of a newbie question - I'm working on a small system that
allocates independent tasks to different computers on a local network.

I've been wondering whether it is possible to use Lucene with a system of
this kind. I mean, if 2 client computers get 2 different URLs, is it
possible for them to do the indexing job independently and then just send
the results to the server, which will combine them with the other results?
If it's possible at all, do you think it makes sense to do it on 10, 100,
or 1000 client computers? What advantages and disadvantages do you see in
a solution like this?

Thanks in advance for all your opinions.
Peter




Re: Phrase query and porter stemmer

Posted by Doug Cutting <cu...@lucene.com>.

Mailing Lists Account wrote:
> Doug Cutting wrote:
>>That's because Google and most internet search engines never do any
>>stemming.
> 
> Generally speaking, are there any advantages to not applying the stemmer?
> Except for certain keywords, I found the use of stemmers helpful.

Generally speaking, stemmers increase recall but decrease precision. 
Different stems of a word frequently have slightly different meanings, 
and are used in different contexts: hence the loss of precision. 
Internet search engines are not interested in increasing recall (there 
are usually plenty of matches); rather, their problem is increasing 
precision (finding the best matches).  For example, there are enough 
sites explicitly about "cars" that there's no need to conflate these 
with sites that are also about "car".

However, with smaller collections, recall can be a problem and stemming 
can be useful.  A higher percentage of false positives is returned with 
stemming (decreased precision) but in a small collection that's 
acceptable if it also finds a few more relevant documents (increased 
recall).

I suspect the reason that internet search engines do not permit 
wildcards is simply a performance issue: wildcarded terms can be *much* 
more expensive to process, and internet search engines cannot afford them.

Doug
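
The recall/precision trade-off described above can be made concrete with a toy
calculation. The document ids and result sets below are invented purely for
illustration; they are not measurements from any search engine. Precision is
the fraction of returned documents that are relevant, and recall is the
fraction of relevant documents that are returned.

```java
import java.util.*;

public class PrecisionRecallSketch {
    // Number of retrieved documents that are also relevant.
    static int overlap(Set<String> retrieved, Set<String> relevant) {
        Set<String> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }

    // precision = |retrieved and relevant| / |retrieved|
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // recall = |retrieved and relevant| / |relevant|
    public static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Invented toy data: four documents are truly relevant to the query.
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        // Without stemming: few results, mostly relevant.
        Set<String> exact = new HashSet<>(Arrays.asList("d1", "d2", "d5"));
        // With stemming: one more relevant document, but more false positives.
        Set<String> stemmed = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d5", "d6", "d7"));

        System.out.println(String.format(Locale.ROOT, "exact:   precision=%.2f recall=%.2f",
            precision(exact, relevant), recall(exact, relevant)));
        System.out.println(String.format(Locale.ROOT, "stemmed: precision=%.2f recall=%.2f",
            precision(stemmed, relevant), recall(stemmed, relevant)));
    }
}
```

With this toy data, stemming raises recall from 0.50 to 0.75 while precision
drops from 0.67 to 0.50, which is the pattern the message describes for small
collections.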


Re: Phrase query and porter stemmer

Posted by Tatu Saloranta <ta...@hypermall.net>.

On Thursday 13 February 2003 05:06, Mailing Lists Account wrote:
> Doug Cutting wrote:
> > Mailing Lists Account wrote:
..
> > That's because Google and most internet search engines never do any
> > stemming.
> >
> > Doug
>
> I didn't know that. Thanks.
>
> Generally speaking, are there any advantages to not applying the stemmer?

Yes, I suspect there are.

There are two ways to think about this. The first is that Google, arguably the
best current general-purpose search engine in the world, does not use it. This
in itself suggests that stemming is perhaps not very useful for general
indexing and searching, especially for phrase searches.

The second is that for internet search engines (or other search engines with
massive amounts of non-domain-specific data), stemming reduces the accuracy of
matching, and with huge data sets that is not a good thing. Instead of, say,
100 matches, you get 10000 matches, because stemming makes terms more general
so that they match more often. It becomes like trying to find a needle in a
haystack, if you will.

Stemming is probably more useful for reducing the size of the index and
improving performance that way. This used to be more important, when memory and
performance limitations were stricter than they are nowadays.
Also, if you want to do semantic mapping and correlation, stemming is very
useful (especially combined with an extensive list of stop words), as
minimizing the data sets used for correlation is essential for acceptable
performance.

I think the usefulness of stop words is closely related to the usefulness of
stemming (i.e. more useful in some cases than in others).

> Except for certain keywords, I found the use of stemmers helpful.

I suspect this depends a lot on the keywords in question. Unifying plurals and
singulars is often helpful, but unifying words like "useful" and "useless"
is, well, not very helpful (do they get stemmed to "use", as I would guess,
or not?). Similarly, dropping stop words like "with", "without", "no"/"not"
may result in a dramatic loss of accuracy (i.e. you get matches with pretty much
"opposite" phrases when "not" is dropped by the analyzer).

What do others think?

-+ Tatu +-


Re: Phrase query and porter stemmer

Posted by Mailing Lists Account <ml...@imorph.com>.

Doug Cutting wrote:
> Mailing Lists Account wrote:
>> I use PorterStemmer with my analyzer for indexing the documents.
>> And I have been using the same analyzer for searching too.
>> 
>> When I search for a phrase like "security" AND database, I would like to
>> avoid matches for terms like "secure" or "securities". I observed that
>> Google and a couple of other search engines do not return such matches.
> 
> That's because Google and most internet search engines never do any
> stemming.
> 
> Doug
> 
> 

I didn't know that. Thanks.

Generally speaking, are there any advantages to not applying the stemmer?
Except for certain keywords, I found the use of stemmers helpful.

regards
Ramesh




Re: Phrase query and porter stemmer

Posted by Doug Cutting <cu...@lucene.com>.

Mailing Lists Account wrote:
> I use PorterStemmer with my analyzer for indexing the documents.
> And I have been using the same analyzer for searching too.
> 
> When I search for a phrase like "security" AND database, I would like to
> avoid matches for terms like "secure" or "securities". I observed that
> Google and a couple of other search engines do not return such matches.

That's because Google and most internet search engines never do any 
stemming.

Doug

