You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "Paul J. Lucas" <pa...@lucasmail.org> on 2010/01/14 03:48:51 UTC

Hints on implementing XQuery full-text search

Hi -

I've used Lucene on a previous project, so I am somewhat familiar with the API. However, I've never had to do anything "fancy" (where "fancy" means things like using filters, different analyzers, boosting, payloads, etc).

I'm about to embark on implementing the full-text search feature of XQuery:

http://www.w3.org/TR/xpath-full-text-10/

Its query abilities are the most complicated I've seen. From my experience with Lucene, I know it can be used to implement some of the features; however, I'd like to walk through them all. For each feature, I only need a simple answer like, "Feature X is best implemented using a query filter," just so I start off in the right direction for each feature.

Note: I will be implementing my own query parser and construct queries "by hand" using various instantiations of Lucene classes.

-------------------------
3.3 Cardinality Selection

This allows you to say things like:

title ftcontains "usability" occurs at least 2 times

which means that the title field must contain the "usability" at least twice. How can this be done?

---------------------
3.4.4 Stemming Option

This allows you to match words that have been indexed against words in the query that have been stemmed like:

title ftcontains "improve" with stemming

which would match even if title contained "improving". Note that PorterStemFilter can not be used because the decision whether to use stemming or not is specified at query-time and not index-time.

In this case, would I have to add each word to the index twice? Once for the original word and once for the stemmed word (assuming the stemmed word is different from the original word)? Or is there a better way?

-----------------
3.4.5 Case Option

This allows you to specify -- at query-time -- one of "case insensitive", "case sensitive", "lowercase", "uppercase".

The last two I think can be implemented using a query filter since, for "lowercase", it matches only if the document text is all in lower-case (and same for "uppercase").

But how would you handle the case insensitive/sensitive specifications? One thought is to add every word twice: once in its original case and once in a normalized case (arbitrarily chosen to be, say, lowercase). Any better ideas?

-----------------------
3.4.6 Diacritics Option

This is similar to the Cast Option except its "diacritics insensitive" or "diacritics sensitive. How about implementing this?

----------------------
3.4.7 Stop Word Option

This allows you to specify -- qt query time -- "with stop words", e.g.:

abstract ftcontains "propagating of errors"
with stop words ("a", "the", "of")

would match a document with an abstract that contains "propagating few errors". It seems odd, I know. It's as if the stop words become wildcards, i.e.:

"propagating of errors" -> "propagating * errors"

where * will match any word in the document. How can this be implemented in Lucene?

------------------------
3.5.3 Mild-Not Selection

XQuery has two flavors of "not": (regular) not and mild-not. This allows you to have a query like:

body ftcontains "Mexico" not in "New Mexico"

which would only match documents that contain "Mexico" when it's not part of the phrase "New Mexico". How can this be implemented using Lucene?

-----------------------
3.6.1 Ordered Selection

This allows you to require that the order of the words in a query match the order of the words in a document, e.g.:

title ftcontains ("web site" ftand "usability") ordered

which would match only if the phrase "web site" and the word "usability" both occurred in the document and "usability" comes after "web site" in word order. My guess for implementing this would be to keep track of word positions and store this in a payload. Yes?

---------------------
3.6.4 Scope Selection

This allows you to require that words appear in the same "scope", e.g.:

abstract ftcontains "usability" ftand "web site" same sentence

You can also do any combination of {same|different} {sentence|paragraph}. My guess for this would also be to keep track of sentence/paragraph data in a payload. Yes?

-----------------
3.7 Ignore Option

Given the partial XQuery:

let $x := <book>
<title>Web Usability and Practice</title>
<author>Montana <annotation> this author is
an expert in Web Usability</annotation> Marigold
</author>
<editor>Vera Tudor-Medina on Web <annotation> best
editor on Web Usability</annotation> Usability
</editor>
</book>

if I were to have a query:

book ftcontains "Web Usability" without content $x//annotation

then it would not consider any text inside of <annotation></annotation> elements at all. "Web Usability" would be found twice: once in the title element and once in the editor element. Note that the latter <annotation> element comes smack in the middle of the phrase "Web Usability". My guess for this would also be to use payload data to store the element each word is inside of then use a filter based on that. Yes?

-----------------
I realize this is a lot, but any pointers appreciated. Thanks!

- Paul
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Hints on implementing XQuery full-text search

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm about to embark on implementing the full-text search feature of XQuery:

Good luck with that.

Here's some quick suggestions on how i'd try to tackle the things you 
asked about, w/o putting much thought into...

: 	title ftcontains "usability" occurs at least 2 times

assuming this is just term based (and not complex subclauses) i would 
write a custom subclass of TermQuery that enforces a minimum term frequency.

: 	title ftcontains "improve" with stemming

index two versions of every field - one with stemming and one w/o

: This allows you to specify -- at query-time -- one of "case 
: insensitive", "case sensitive", "lowercase", "uppercase".

I have no idea what it would mean to match something "uppercase" or 
"lowercase" -- unless that's just syntactic suger for "uppercase by input, 
and then look for a case sensitve match) but again: two fields for case 
sensitive/insensitive

: This is similar to the Cast Option except its "diacritics insensitive" 
: or "diacritics sensitive.  How about implementing this?

two fields, again.

...at this point, if you need to support all permutations of these options 
you are looking at 2*2*2 index fields per source field ... so you start 
getting into hte realm where i might consider keeping them all in one 
field, using Payloads to note the various attributes that each Term has.

: 	abstract ftcontains "propagating of errors"
: 	with stop words ("a", "the", "of")
: 
: would match a document with an abstract that contains "propagating few 
: errors". It seems odd, I know.  It's as if the stop words become 
: wildcards, i.e.:

are you serious? ... so if i query for "A of the B" with stop words ("of", 
"the") then that has to match "A totally ridiculous B" ? ...  that makes 
no sense what so ever.  why require so much verbosity just to get a "gap" 
that matches anything?

that seems like a straight query parsing problem ... if you see one of the 
terms in teh stop work list, strip it out, and increase the phrase slop on 
the PhraseQuery you are building.

: 	body ftcontains "Mexico" not in "New Mexico"

SpanNotQuery


: 	title ftcontains ("web site" ftand "usability") ordered

SpanNearQuery


: 	abstract ftcontains "usability" ftand "web site" same sentence
: 
: You can also do any combination of {same|different} 
: {sentence|paragraph}.  My guess for this would also be to keep track of 
: sentence/paragraph data in a payload.  Yes?

sounds right.


: 	book ftcontains "Web Usability" without content $x//annotation

depends on how you plan on indexing all of hte context stuff ... if the 
tags are Terms then a SpanNOtQuery would work ... if they are Payloads you 
just need some sort of SpanTermNotMatchingPayload query.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org