Posted to general@lucene.apache.org by Jason Guild <ja...@alaska.gov> on 2011/06/23 02:25:40 UTC

how to approach phrase queries and term grouping

Hi All:

I am new to Lucene and my project is to provide specialized search for a
set of booklets. I am using Lucene Java 3.1.

The basic idea is to run queries to find out which booklets and page
numbers match, in order to help people know where to look for information
in the (rather large and dry) booklets. Therefore each Document in my index
represents a particular page in one of the booklets.

So far I have been able to successfully scrape the raw text from the
booklets, insert it into an index, and query it just fine using
StandardAnalyzer on both ends.

So here's my general question:
Many queries on the index will involve searching for place names mentioned
in the booklets. Some place names use notational variants. For instance, in
the body text it will be called "Ship Creek" but in a diagram it might be
listed as "Ship Cr." or elsewhere as "Ship Ck.".

If I search for (Ship AND (Cr Ck Creek)), this does not give me what I
want, because other words may appear between [ship] and
[cr]/[ck]/[creek], leading to false positives.

What I need to know is how to approach treating the two consecutive words
as a single term and adding the notational variants as synonyms. So, in a
nutshell, I need the basic stuff provided by StandardAnalyzer, but with
term grouping to emit place names as complete terms and insert synonymous
terms to cover the variants.

For instance, the text "...allowed from the mouth of Ship Creek upstream
to ..." would result in tokens [allowed],[mouth],[ship creek],[upstream].
Perhaps via a TokenFilter along the way, the [ship creek] term would expand
into [ship creek][ship ck][ship cr].

As a bonus it would be nice to treat the trickier text "..except in Ship,
Bird, and Campbell creeks where the limit is..." as
[except],[ship creek],[bird creek],[campbell creek],[where],[limit].

Should the detection and merging be done in a TokenFilter?
Some of the term grouping can probably be done heuristically (for example,
[*],[creek] becomes [* creek]), but I also have an exhaustive list of
places mentioned in the text if that helps.

Thanks for any help you can provide.
Jason

Re: how to approach phrase queries and term grouping

Posted by Ian Lea <ia...@gmail.com>.

Have you read Lucene in Action, 2nd edition?  Highly recommended for
anyone new to Lucene; it includes info and code on synonyms and
position increments.  The code is available somewhere as a free
download.  You may also want to read up on slop and span queries.  See,
for example, http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
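
To make the synonym and position-increment idea concrete, here is a
minimal sketch in plain Java. It is not Lucene code: the class
PlaceTokenGrouper, its Tok type, and the variant table are made up for
illustration. It merges a known two-word place name into one token and
emits the other spellings with a position increment of 0, which is the
same convention a Lucene TokenFilter would express through
PositionIncrementAttribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PlaceTokenGrouper {
    /** Plain stand-in for an analyzed token: its text plus a position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "/+" + posInc; }
    }

    // Hypothetical variant table: any known spelling -> every spelling to emit.
    private final Map<String, List<String>> variants = new LinkedHashMap<>();

    /** Registers a group of equivalent two-word spellings (already lowercased). */
    void addPlace(String... spellings) {
        List<String> all = Arrays.asList(spellings);
        for (String s : all) {
            variants.put(s, all);
        }
    }

    /** Merges adjacent token pairs that form a known place and stacks its synonyms. */
    List<Tok> group(List<String> tokens) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            List<String> spellings = null;
            if (i + 1 < tokens.size()) {
                spellings = variants.get(tokens.get(i) + " " + tokens.get(i + 1));
            }
            if (spellings != null) {
                int inc = 1; // the first spelling advances the position
                for (String s : spellings) {
                    out.add(new Tok(s, inc));
                    inc = 0; // the remaining spellings share that position
                }
                i += 2;
            } else {
                out.add(new Tok(tokens.get(i), 1));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PlaceTokenGrouper g = new PlaceTokenGrouper();
        g.addPlace("ship creek", "ship ck", "ship cr");
        // prints [allowed/+1, mouth/+1, ship creek/+1, ship ck/+0, ship cr/+0, upstream/+1]
        System.out.println(g.group(Arrays.asList("allowed", "mouth", "ship", "creek", "upstream")));
    }
}
```

Note that this sketch drops the individual [ship] and [creek] tokens once
they are merged; a variant could keep them alongside the grouped term if
single-word searches still need to work.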


--
Ian.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: how to approach phrase queries and term grouping

Posted by "Granroth, Neal V." <ne...@thermofisher.com>.

That's an intriguing problem which you are trying to solve.

If you have the ability to programmatically alter the search,
one possibility would be to rewrite it as
("Ship Creek" OR "Ship Ck" OR "Ship Cr").
Because each alternative is a phrase, the words must be adjacent,
which avoids the problem of intervening words.
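
A minimal sketch of that rewrite in plain Java, assuming the variants come
from your own list of place names. The class and method names are
hypothetical, and the sketch only builds the query string; it assumes the
analyzer strips the trailing period from "Cr." and "Ck." so the phrases
line up with the indexed tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PlaceQueryRewriter {
    /**
     * Builds an OR of quoted phrases, one per suffix variant,
     * e.g. ("Ship Creek" OR "Ship Ck" OR "Ship Cr").
     * Names are assumed to come from a trusted list, so no escaping is done.
     */
    static String phraseAlternatives(String name, List<String> suffixVariants) {
        List<String> phrases = new ArrayList<>();
        for (String suffix : suffixVariants) {
            phrases.add("\"" + name + " " + suffix + "\"");
        }
        return "(" + String.join(" OR ", phrases) + ")";
    }

    public static void main(String[] args) {
        // prints ("Ship Creek" OR "Ship Ck" OR "Ship Cr")
        System.out.println(phraseAlternatives("Ship", Arrays.asList("Creek", "Ck", "Cr")));
    }
}
```

The resulting string can then be combined with the rest of the user's
query before it is handed to the query parser.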

Another thought:
Using a custom TokenFilter to emit [ship creek] and [bird creek] as single tokens might cause you trouble, because it would then no longer be possible to search by the words "ship", "bird", or "creek" individually.
An alternative might be to enhance the index entries for each booklet with a
"geographical location" field containing standardized phrases in place of the various synonyms of the locations from your list of place names.
When running a search you would then submit something like: (trout AND placename:(Ship_Creek OR Bird_Creek))
Doing it this way, the user would still have the flexibility to use the individual words "ship", "bird", or "creek" in content searches if needed.
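
As a rough sketch of how the standardized values for such a field could be
derived at index time, the plain-Java class below scans a page's raw text
for each place name followed directly by one of its suffix variants. The
class name, the regex approach, and the Ship_Creek style tokens are
assumptions for illustration; the field itself would need to be indexed so
that each standardized token stays a single term.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

final class PlaceNameExtractor {
    // Hypothetical gazetteer: standardized token -> regex covering its variants.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    /** Registers a place as a name followed by any of the given suffix spellings. */
    void addPlace(String canonicalToken, String name, String... suffixes) {
        StringBuilder alternatives = new StringBuilder();
        for (String suffix : suffixes) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(suffix));
        }
        // \b stops "Ship" from matching inside "Township"; \s+ allows only
        // whitespace between the two words, so intervening words do not match.
        String regex = "\\b" + Pattern.quote(name) + "\\s+(?:" + alternatives + ")\\b";
        patterns.put(canonicalToken, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    /** Returns the standardized tokens for every registered place found in the text. */
    Set<String> extract(String pageText) {
        Set<String> found = new LinkedHashSet<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            if (entry.getValue().matcher(pageText).find()) {
                found.add(entry.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        PlaceNameExtractor extractor = new PlaceNameExtractor();
        extractor.addPlace("Ship_Creek", "Ship", "Creek", "Ck", "Cr");
        extractor.addPlace("Bird_Creek", "Bird", "Creek", "Ck", "Cr");
        // prints [Ship_Creek, Bird_Creek]
        System.out.println(extractor.extract("Limits apply on Ship Cr. and near Bird Creek."));
    }
}
```

The returned set would be stored in the placename field of the page's
Document, next to the normally analyzed body text.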


- Neal
