You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Mariella Di Giacomo <ma...@lanl.gov> on 2005/01/18 23:01:48 UTC

Question about proximity searching and wildcards

Hello,

We are using Lucene to index scientific articles.
We are also using Luke to verify the fields and values we index.

One of the fields we index is the author field that consists of the authors 
that have written the scientific article (an example of such data is shown 
at the bottom of the email).

The most common search on the author field is the following:

"find all the authors whose last name starts with Cole and the first name 
starts with S"

We thought of a proximity search (we want to make sure we take the first 
name and not the middle name/initial) similar like that

"Author:cole* S*"~1
"Author:cole* AND Author:S*"~1

What we were expecting was: all the documents that contain authors whose 
last name starts with Cole and the first name start with S and those words 
are near (next to each other)

Unfortunately when we type that search through the Luke "search interface" 
we do not get the expansion of the
words when using the similarity at the same time

So my question:

1) Does Luke cannot deal with that ?
2) Is the query not properly structured to get what expected ? Which would 
be the correct one ?

If Luke cannot deal with that, when writing the query through the Java 
application, which would be the
query to be provided to get what expected ?
Do we need to use a query filter ?


Thanks a lot in advance for your help,


Mariella






_____________________________________________________________________________________________
E.g.

Below there are three examples of data we index and to be precise the 
information related to the Authors field.

The following is the information related to scientific articles that we index.

1)
The Authors field consists of two authors

Title: Using Document Dimensions for Enhanced Information Retrieval
Authors: Jayasooriya, Thimala(thimal@cs.york.ac.uk); Manandhar, 
Suresha(suresh@cs.york.ac.uk)
Affiliations: a. Department of Computer Science, University of York
Abstract (English): Conventional document search techniques are constrained 
by attempting to match individual keywords or phrases to source documents. 
Thus, these techniques miss out documents that contain semantically similar 
terms, thereby achieving a relatively low degree of recall. At the same 
time, processing capabilities and tools for syntactic and semantic analysis 
of language have advanced to the point where an index-time linguistic 
analysis of source documents is both feasible and realistic. In this paper, 
we introduce document dimensions, a means of classifying or grouping terms 
discovered in documents. Using an enhanced version of Jakarta Lucene[1], we 
demonstrate that supplementing keyword analysis with some syntactic and 
semantic information can indeed enhance the quality of information 
retrieval results.
Publisher: Springer-Verlag
Publication Type: Original Paper
ISSN: 0302-9743
ISBN: 3-540-23659-7
Book DOI: 10.1007/b101591

2)
The Authors field consists of six authors

Title: Multilingual Retrieval Experiments with MIMOR at the University of 
Hildesheim
Authors: Hackl, Renéa; Kölle, Ralpha; Mandl, 
Thomasa(mandl@uni-hildesheim.de); Ploedt, Alexandraa; Scheufen, 
Jan-Hendrika; Womser-Hacker, Christaa
Affiliations: a. University of Hildesheim, Information Science, 
Marienburger Platz 22, D-31141 Hildesheim
Abstract (English): Fusion and optimization based relevance judgements have 
proven to be successful strategies in information retrieval. In this years 
CLEF campaign we applied these strategies to multilingual retrieval with 
four languages. Our fusion experiments were carried out using freely 
available software. We used the snowball stemmers, internet translation 
services and the text retrieval tools in Lucene and the new MySQL.
Publisher: Springer-Verlag
Publication Type: Original Paper
ISSN: 0302-9743
ISBN: 3-540-24017-9
Book DOI: 10.1007/b102261

3)

The Authors field consists of one author and only middle and first initial 
are provided

Title: Letter to the editor
Author: Coleman, S.S.a
Affiliations: a. Department of Orthopaedics, The University of Utah School 
of Medicine, 50 North Medical Drive, Salt Lake City, UT 84132, USA US
Abstract: No Abstract
Publisher: Springer-Verlag
Item Identifier: 10.1007/s002640000113
Publication Type: Article
ISSN: 0341-2695

Re: Question about proximity searching and wildcards

Posted by Daniel Naber <da...@t-online.de>.

On Tuesday 18 January 2005 23:01, Mariella Di Giacomo wrote:

> "Author:cole* S*"~1

This is not supported by the query parser. You can use PhrasePrefixQuery to 
build such queries programmatically. BTW, this kind of question belongs to 
the user list, not the developer list.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Question about proximity searching and wildcards

Posted by Morus Walter <mo...@tanto.de>.

Mariella Di Giacomo writes:
> Hello,
> 
> We are using Lucene to index scientific articles.
> We are also using Luke to verify the fields and values we index.
> 
> One of the fields we index is the author field that consists of the authors 
> that have written the scientific article (an example of such data is shown 
> at the bottom of the email).
> 
> The most common search on the author field is the following:
> 
> "find all the authors whose last name starts with Cole and the first name 
> starts with S"
> 
> We thought of a proximity search (we want to make sure we take the first 
> name and not the middle name/initial) similar like that
> 
Query parser cannot do that.

> "Author:cole* S*"~1

In that case you cannot expand the wildcard terms.

> "Author:cole* AND Author:S*"~1

You cannot mix boolean queries and proximity queries.

What comes next to your query is phrase prefix query, but that's designed
for something like 'Cole S*' not 'Cole* S*'.

Searching for 'Cole* S*' means to search for all combinations of possible 
expansions of Cole and S. You can do that by expanding the terms yourself
but I'd expect that a) to be slow and b) to create trouble with the maximum
number of boolean terms (or memory usage).
Given that there are 10 expansions of Cole and 500 of S (that's not just
first names, that all names) you have to do 5000 proximity searches.

> If Luke cannot deal with that, when writing the query through the Java 
> application, which would be the
> query to be provided to get what expected ?
> Do we need to use a query filter ?
> 
I would use different fields for first and last name in this case.
And if it's relevant to search for the first character of the first name,
I'd index that additionally.

HTH
	Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org