Posted to java-user@lucene.apache.org by Jochen Hebbrecht <jo...@gmail.com> on 2012/08/20 13:59:05 UTC

TermRangeQuery with multiple words

Hi,

I have 5 documents. Each document has a field TEST. The overall structure
looks like this:

Doc 01: TEST: "test 1 string"
Doc 02: TEST: "test 2 string"
Doc 03: TEST: "test 3 string"
Doc 04: TEST: "test 4 string"
Doc 05: TEST: "test 5 string"

These fields are indexed as Field.Index.ANALYZED with the StandardAnalyzer.
In Luke I can see, for example:

Document: Doc 01
Field: TEST
Terms: test, 1, string

Now I want to run a range search like this:

<<
new TermRangeQuery("TEST", "test 1", "test 3", true, true);
>>

... to pick up the first 3 documents. Unfortunately, this doesn't seem to
work for multiple words.

Can somebody help me correct my TermRangeQuery?

Thanks!
Jochen

Re: TermRangeQuery with multiple words

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hehe Ian, our mails just crossed. I was thinking along the same lines! :-)
Thanks for your reply!

2012/8/20 Ian Lea <ia...@gmail.com>

> Jochen
>
>
> No, I don't think that Lucene can make a String range query on
> multiple terms.  For your Microsoft example you could build a query
> with Microsoft as required TermQuery and a required TermRangeQuery
> from Belgium to Spain but that would fall apart with multiword company
> or region names.
>
> It feels like there should be a moderately simple answer, but I can't
> spot it.  Maybe someone else can.
>
>
> --
> Ian.

Re: TermRangeQuery with multiple words

Posted by Ian Lea <ia...@gmail.com>.
Jochen


No, I don't think that Lucene can make a String range query on
multiple terms.  For your Microsoft example you could build a query
with Microsoft as a required TermQuery plus a required TermRangeQuery
from Belgium to Spain, but that would fall apart with multiword company
or region names.

It feels like there should be a moderately simple answer, but I can't
spot it.  Maybe someone else can.


--
Ian.


On Mon, Aug 20, 2012 at 2:13 PM, Jochen Hebbrecht
<jo...@gmail.com> wrote:
> Hi Ian,
>
> Thanks for your answer!
> Well, my example might not have been so clear. Here's a better example:
>
> Doc 01: TEST: "Microsoft Belgium"
> Doc 02: TEST: "Apple"
> Doc 03: TEST: "Microsoft France"
> Doc 04: TEST: "Evian"
> Doc 05: TEST: "Nokia"
> Doc 06: TEST: "Novotel"
> Doc 07: TEST: "Microsoft Germany"
> Doc 08: TEST: "Microsoft Spain"
>
>
> Now, I want to search for all documents which have the field TEST going
> from "Microsoft Belgium" to "Microsoft Spain".
> The problem is, I cannot search on multiple terms in a range :-( ...
>
> What I can do, is to search from "Microsoft" to "Microsoft", this one
> works. But not the one stated above ...
> So the question is: can Lucene make a String range query on multiple terms?
>
> Kind regards,
> Jochen

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: TermRangeQuery with multiple words

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Jack,

Thanks, your solution could work too, but I prefer Ian's solution. I
only have to index the data once, and the change fits in a convenient
place in the business logic.
But perhaps somebody else with this problem could use your solution.
I'm glad you posted it as well!

Kind regards,
Jochen



On 20/08/12 18:57, Jack Krupansky <ja...@basetechnology.com> wrote:

>You could index the values in both a "text" and a separate "string" field.
>Then you can query the text field by keyword as well as the string field by
>the full literal value, or as a wildcard or prefix query (e.g.,
>"Microsoft*"), or as a range query with the full literal string values.
>
>-- Jack Krupansky


Re: TermRangeQuery with multiple words

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could index the values in both a "text" and a separate "string" field. 
Then you can query the text field by keyword as well as the string field by 
the full literal value, or as a wildcard or prefix query (e.g., 
"Microsoft*"), or as a range query with the full literal string values.

-- Jack Krupansky
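
The range-over-literal-values idea can be sketched without Lucene. Below, a sorted set of the untokenized field values stands in for the terms of a non-analyzed "string" field (the class and method names are made up for this example, and this is only a model of the behaviour, not Lucene code). An inclusive range over that set behaves like a range query on the full strings:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class WholeValueRangeSketch {
    // Inclusive range over whole, untokenized values, ordered like String.compareTo.
    static NavigableSet<String> range(NavigableSet<String> values, String lower, String upper) {
        return values.subSet(lower, true, upper, true);
    }

    public static void main(String[] args) {
        // The eight TEST values from the thread, kept as single strings.
        NavigableSet<String> values = new TreeSet<>(Arrays.asList(
                "Microsoft Belgium", "Apple", "Microsoft France", "Evian",
                "Nokia", "Novotel", "Microsoft Germany", "Microsoft Spain"));

        // prints [Microsoft Belgium, Microsoft France, Microsoft Germany, Microsoft Spain]
        System.out.println(range(values, "Microsoft Belgium", "Microsoft Spain"));
    }
}
```

Because the values are not analyzed in this model, the comparison is case-sensitive and the bounds have to be given exactly as they were indexed.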

-----Original Message----- 
From: Jochen Hebbrecht
Sent: Monday, August 20, 2012 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: TermRangeQuery with multiple words

Hi Ian,

Thanks for your answer!
Well, my example might not have been so clear. Here's a better example:

Doc 01: TEST: "Microsoft Belgium"
Doc 02: TEST: "Apple"
Doc 03: TEST: "Microsoft France"
Doc 04: TEST: "Evian"
Doc 05: TEST: "Nokia"
Doc 06: TEST: "Novotel"
Doc 07: TEST: "Microsoft Germany"
Doc 08: TEST: "Microsoft Spain"


Now, I want to search for all documents which have the field TEST going
from "Microsoft Belgium" to "Microsoft Spain".
The problem is, I cannot search on multiple terms in a range :-( ...

What I can do, is to search from "Microsoft" to "Microsoft", this one
works. But not the one stated above ...
So the question is: can Lucene make a String range query on multiple terms?

Kind regards,
Jochen




Re: TermRangeQuery with multiple words

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hmm, just thinking. I could split the value on spaces.
Then I can say:

+TEST:Microsoft +TEST:[Belgium TO Spain]

I just tested it, and it seems to work :-) ...
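
A rough plain-Java model of what this query does, offered as a sketch rather than as Lucene's actual implementation. It assumes each required clause is checked against the document's individual lowercased tokens; the tokenize and matches helpers and the extra "Microsoft Zambia" document are hypothetical and not part of the original data:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class RequiredTermPlusRangeSketch {
    // Hypothetical stand-in for analysis: lowercase and split on whitespace.
    static List<String> tokenize(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Models "+TEST:term +TEST:[lower TO upper]": both clauses are required,
    // and each one is satisfied by ANY token of the field (possibly the same token).
    static boolean matches(String value, String term, String lower, String upper) {
        List<String> tokens = tokenize(value);
        boolean hasTerm = tokens.contains(term);
        boolean hasTokenInRange = false;
        for (String t : tokens) {
            if (t.compareTo(lower) >= 0 && t.compareTo(upper) <= 0) {
                hasTokenInRange = true;
            }
        }
        return hasTerm && hasTokenInRange;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Microsoft Belgium", "Apple", "Microsoft France", "Evian", "Nokia",
            "Novotel", "Microsoft Germany", "Microsoft Spain",
            "Microsoft Zambia" // made-up extra document, outside the intended range
        };
        for (String doc : docs) {
            System.out.println(doc + " -> " + matches(doc, "microsoft", "belgium", "spain"));
        }
    }
}
```

One thing the sketch suggests is worth checking on a real index: the token microsoft itself sorts between belgium and spain, so under this model a document such as "Microsoft Zambia" also satisfies both clauses.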


2012/8/20 Jochen Hebbrecht <jo...@gmail.com>

> Hi Ian,
>
> Thanks for your answer!
> Well, my example might not have been so clear. Here's a better example:
>
> Doc 01: TEST: "Microsoft Belgium"
> Doc 02: TEST: "Apple"
> Doc 03: TEST: "Microsoft France"
> Doc 04: TEST: "Evian"
> Doc 05: TEST: "Nokia"
> Doc 06: TEST: "Novotel"
> Doc 07: TEST: "Microsoft Germany"
> Doc 08: TEST: "Microsoft Spain"
>
>
> Now, I want to search for all documents which have the field TEST going
> from "Microsoft Belgium" to "Microsoft Spain".
> The problem is, I cannot search on multiple terms in a range :-( ...
>
> What I can do, is to search from "Microsoft" to "Microsoft", this one
> works. But not the one stated above ...
> So the question is: can Lucene make a String range query on multiple terms?
>
> Kind regards,
> Jochen

Re: TermRangeQuery with multiple words

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Ian,

Thanks for your answer!
Well, my example might not have been so clear. Here's a better example:

Doc 01: TEST: "Microsoft Belgium"
Doc 02: TEST: "Apple"
Doc 03: TEST: "Microsoft France"
Doc 04: TEST: "Evian"
Doc 05: TEST: "Nokia"
Doc 06: TEST: "Novotel"
Doc 07: TEST: "Microsoft Germany"
Doc 08: TEST: "Microsoft Spain"


Now I want to search for all documents whose TEST field lies in the range
from "Microsoft Belgium" to "Microsoft Spain".
The problem is that I cannot search on multiple terms in a range :-( ...

What I can do is search from "Microsoft" to "Microsoft"; that one
works, but not the one stated above ...
So the question is: can Lucene make a String range query on multiple terms?

Kind regards,
Jochen


2012/8/20 Ian Lea <ia...@gmail.com>

> This won't work with TermRangeQuery because neither "test 1" nor "test
> 3" is a term.  "test" will be a term, output by the analyzer.  You'll
> be able to see the indexed terms in Luke.
>
> It sounds very flaky anyway: you'd get "test 10 xxx" and "test 100 xxx"
> as well as "test 1" and "test 2".  If your TEST values are that
> predictable you could split them up and index the number separately,
> maybe using NumericField and build a query using NumericRangeQuery.
>
> RegexQuery in contrib-queries might also be worth a look.
>
>
> --
> Ian.

Re: TermRangeQuery with multiple words

Posted by Ian Lea <ia...@gmail.com>.
This won't work with TermRangeQuery because neither "test 1" nor "test
3" is a term.  "test" will be a term, output by the analyzer.  You'll
be able to see the indexed terms in Luke.
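
A minimal sketch of the point above, in plain Java rather than Lucene itself. The tokenize helper is a hypothetical stand-in for the analyzer (it only lowercases and splits on non-alphanumerics, which is far cruder than StandardAnalyzer), and matchesRange models a term range as "any single indexed term falls between the bounds":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnalyzedRangeSketch {
    // Hypothetical stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    static List<String> tokenize(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // A term range matches if ANY single indexed term lies in [lower, upper].
    static boolean matchesRange(List<String> terms, String lower, String upper) {
        for (String term : terms) {
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = tokenize("test 1 string");
        System.out.println(terms);                                   // prints [test, 1, string]
        System.out.println(matchesRange(terms, "test 1", "test 3")); // prints false
    }
}
```

Under these assumptions the field yields the terms test, 1 and string, and none of them sorts between "test 1" and "test 3", so the range finds nothing.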

It sounds very flaky anyway: you'd get "test 10 xxx" and "test 100 xxx"
as well as "test 1" and "test 2".  If your TEST values are that
predictable, you could split them up and index the number separately,
maybe using NumericField, and build a query using NumericRangeQuery.
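
To illustrate the ordering problem, here is a small plain-Java sketch (not Lucene code). It compares whole values as strings, the way an inclusive string range would, against a comparison on the extracted number. The extractNumber helper and the extra "test 10" and "test 100" values are assumptions made up for the example:

```java
class LexicographicVsNumeric {
    // Inclusive string range, using the same ordering as String.compareTo.
    static boolean inStringRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Hypothetical helper: pull the number out of values shaped like "test <n> string".
    static int extractNumber(String value) {
        return Integer.parseInt(value.split(" ")[1]);
    }

    // Inclusive numeric range on the extracted number.
    static boolean inNumericRange(String value, int min, int max) {
        int n = extractNumber(value);
        return n >= min && n <= max;
    }

    public static void main(String[] args) {
        String[] values = {
            "test 1 string", "test 2 string", "test 3 string",
            "test 4 string", "test 10 string", "test 100 string"
        };
        for (String v : values) {
            System.out.println(v
                    + "  string-range=" + inStringRange(v, "test 1", "test 3")
                    + "  numeric-range=" + inNumericRange(v, 1, 3));
        }
    }
}
```

In this model "test 10 string" and "test 100 string" fall inside the string range while "test 3 string" falls outside it (it sorts after the bound "test 3"), whereas the numeric comparison returns exactly 1 to 3. Indexing the number separately is what makes the second comparison possible.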

RegexQuery in contrib-queries might also be worth a look.


--
Ian.

On Mon, Aug 20, 2012 at 12:59 PM, Jochen Hebbrecht
<jo...@gmail.com> wrote:
> Hi,
>
> I have 5 documents. Each document has a field TEST. The overall structure
> looks like this:
>
> Doc 01: TEST: "test 1 string"
> Doc 02: TEST: "test 2 string"
> Doc 03: TEST: "test 3 string"
> Doc 04: TEST: "test 4 string"
> Doc 05: TEST: "test 5 string"
>
> These fields are indexed as Field.Index.ANALYZED with the StandardAnalyzer.
> In Luke I can see, for example:
>
> Document: Doc 01
> Field: TEST
> Terms: test, 1, string
>
> Now I want to run a range search like this:
>
> <<
> new TermRangeQuery("TEST", "test 1", "test 3", true, true);
> >>
>
> ... to pick up the first 3 documents. Unfortunately, this doesn't seem to
> work for multiple words.
>
> Can somebody help me correct my TermRangeQuery?
>
> Thanks!
> Jochen
