Posted to java-user@lucene.apache.org by Marc Hadfield <ma...@animarc.com> on 2005/10/10 22:35:42 UTC

query across fields?

hello -

I am looking to run queries efficiently across multiple fields whose 
token order is synchronized, i.e.:

Field_A[100] has some relationship to Field_B[100]

For example, consider two fields: one holds the full text of an article 
and the other holds the "type" of each token, where the type is drawn 
from { person, company, date, ... }

So that for a Document:
Field_A : "Fred Johnson worked for Johnson and Johnson in 2001"
Field_B : "name name other other company company company other date"

and we wish to perform a query:
Field_A:"Johnson" AND Field_B:"name"
which would be true for token number 2 but not for 5 and 7
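
To make the intended semantics concrete, here is a minimal plain-Java sketch (no Lucene involved) that treats the two fields as parallel token arrays and reports the 1-based token numbers where both conditions hold. The class and method names are made up for illustration, and this only models the desired result, not how Lucene would evaluate such a query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: models two position-aligned fields as
// parallel token arrays. This is not how Lucene evaluates queries.
class AlignedFieldMatch {

    // Returns the 1-based token numbers where fieldA holds `word`
    // and fieldB holds `type` at the same index.
    static List<Integer> matches(String[] fieldA, String[] fieldB,
                                 String word, String type) {
        if (fieldA.length != fieldB.length) {
            throw new IllegalArgumentException("fields are not aligned");
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < fieldA.length; i++) {
            if (fieldA[i].equals(word) && fieldB[i].equals(type)) {
                hits.add(i + 1); // report 1-based token numbers
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] a = "Fred Johnson worked for Johnson and Johnson in 2001".split(" ");
        String[] b = "name name other other company company company other date".split(" ");
        // "Johnson" occurs at tokens 2, 5 and 7, but only token 2 is typed "name".
        System.out.println(matches(a, b, "Johnson", "name"));
    }
}
```

The same call with the type "company" would report tokens 5 and 7 instead.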

I think Span Queries could be adapted to this purpose, but I wanted to 
get any thoughts from the list.

I would prefer not to mix the full text and "types" in the same field as 
it would make the term positions inconsistent which i depend on for 
other queries.

In principle I could store the full text in two fields with the second 
field containing the types without incrementing the token index.   Then, 
do a SpanQuery for "Johnson" and "name" with a distance of 0.  The 
resulting match would have a token position which would refer back to 
the matching position in the first field.  I don't know if this is a 
really good idea.

Any thoughts?

---Marc Hadfield


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: query across fields?

Posted by Doug Cutting <cu...@apache.org>.

Marc Hadfield wrote:
> I'll give Span Queries a try as they can handle the 0 increment issue.

Note that PhraseQuery can now handle this too.

Doug


Re: query across fields?

Posted by Marc Hadfield <ma...@animarc.com>.

Thanks Doug -

I'll give Span Queries a try as they can handle the 0 increment issue.

My original desire to have more than one field comes from my document 
representation, which includes multiple fields containing the same 
document text processed by different stemmers: depending on the type of 
query, I may need results from a different stemmer. This is necessary 
because we index text containing complex biological names, and some 
otherwise useful stemmers choke on those names.

So, although my immediate need is met, I would still be interested in 
considering how cross-field queries might work.

----Marc





Re: query across fields?

Posted by Marc Hadfield <ma...@animarc.com>.

thanks again!







Re: query across fields?

Posted by Doug Cutting <cu...@apache.org>.

Marc Hadfield wrote:
> In the SpanNear (or for that matter PhraseQuery), one can set a slop 
> value where 0 (zero) means one following after the other.
> 
> How can one differentiate between Terms at the **same** position vs. one 
> after the other?

The following queries only match "x" and "y" at the same position:

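import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.*;

// Declared as PhraseQuery (not Query) so that add(Term, int) is visible.
// Both terms are pinned to relative position 0.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("f", "x"), 0);
pq.add(new Term("f", "y"), 0);

// Unordered span query with slop 0. Worth verifying on your index:
// a slop-0 unordered match may also accept adjacent terms.
Query sq =
   new SpanNearQuery(new SpanQuery[]
                       { new SpanTermQuery(new Term("f", "x")),
                         new SpanTermQuery(new Term("f", "y")) },
                     0, false);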

Doug


Re: query across fields?

Posted by Marc Hadfield <ma...@animarc.com>.

Hello -

a quick follow-up to my previous post.

In the SpanNear (or for that matter PhraseQuery), one can set a slop 
value where 0 (zero) means one following after the other.

How can one differentiate between Terms at the **same** position vs. one 
after the other?

ie:
(Token)/Position

(A)/0 (B)/1 (C)/2 ....
vs
( A B )/0 (C)/1 (D)/2 ...

How can a SpanNear (or anything) query for A,B tell these two cases apart?
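
To spell out the two cases, here is a small plain-Java model of term positions (not Lucene code; the class and its methods are hypothetical). It derives absolute positions from position increments and then asks two separate questions: do A and B share a position, and does B directly follow A?

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of a positional index for a single field; not Lucene code.
class PositionModel {

    private final Map<String, Set<Integer>> positions = new HashMap<>();
    private int current = -1; // a first token with increment 1 lands at position 0

    // Adds a token; an increment of 0 stacks it on the previous token's position.
    PositionModel add(String term, int positionIncrement) {
        current += positionIncrement;
        positions.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(current);
        return this;
    }

    // True if a and b share at least one position.
    boolean samePosition(String a, String b) {
        for (int p : get(a)) {
            if (get(b).contains(p)) return true;
        }
        return false;
    }

    // True if b occurs at the position immediately after some occurrence of a.
    boolean follows(String a, String b) {
        for (int p : get(a)) {
            if (get(b).contains(p + 1)) return true;
        }
        return false;
    }

    private Set<Integer> get(String term) {
        return positions.getOrDefault(term, new TreeSet<Integer>());
    }

    public static void main(String[] args) {
        // (A)/0 (B)/1 (C)/2
        PositionModel sequential =
            new PositionModel().add("A", 1).add("B", 1).add("C", 1);
        // ( A B )/0 (C)/1 (D)/2
        PositionModel stacked =
            new PositionModel().add("A", 1).add("B", 0).add("C", 1).add("D", 1);

        System.out.println("sequential same position: " + sequential.samePosition("A", "B"));
        System.out.println("stacked same position:    " + stacked.samePosition("A", "B"));
    }
}
```

In this model the distinction comes down to comparing positions for equality rather than for a difference of one, which is the behaviour I would need from the query.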

---Marc






Re: query across fields?

Posted by Doug Cutting <cu...@apache.org>.

Marc Hadfield wrote:
> I actually mention your option in my email:
> 
>> In principle I could store the full text in two fields with the second 
>> field containing the types without incrementing the token index.   
>> Then, do a SpanQuery for "Johnson" and "name" with a distance of 0.  
>> The resulting match would have a token position which would refer back 
>> to the matching position in the first field.  I don't know if this is 
>> a really good idea.
> 
> ie Field_B = full text interlaced with "types" following each full text 
> token with positionIncrement=0

Sorry, you confused me when you spoke of this as "two fields" when only 
one field is required.

> However, as far as I understand, the standard TermQuery won't let me 
> check if "Johnson" and "__name__" occur at the **same** position.  
> Perhaps, as I ask above, a SpanQuery will allow multiple terms with a 
> distance of zero (0) , that is they were indexed with 
> positionIncrement=0 and SpanQuery can handle 0 distance terms?

TermQuery certainly won't, since it only concerns a single term.  But 
PhraseQuery now has an add(Term, position) method that should do the 
trick.  And SpanNearQuery should work.

Doug


Re: query across fields?

Posted by Marc Hadfield <ma...@animarc.com>.

Doug Cutting wrote:

> Why not store them in the same field using positionIncrement=0 for the 
> types?  Then they won't change positions of non-type tokens.  You 
> should distinguish the types syntactically, e.g., prefix them with a 
> space or other character that does not occur within words.  That way 
> queries on this field for the term "name" won't match a type token.
>
> Doug


Hi Doug, thanks for your reply,

I actually mention your option in my email:

> I would prefer not to mix the full text and "types" in the same field 
> as it would make the term positions inconsistent which i depend on for 
> other queries.
>
> In principle I could store the full text in two fields with the second 
> field containing the types without incrementing the token index.   
> Then, do a SpanQuery for "Johnson" and "name" with a distance of 0.  
> The resulting match would have a token position which would refer back 
> to the matching position in the first field.  I don't know if this is 
> a really good idea.

ie Field_B = full text interlaced with "types" following each full text 
token with positionIncrement=0

However, as far as I understand, the standard TermQuery won't let me 
check whether "Johnson" and "__name__" occur at the **same** position.  
Perhaps, as I ask above, a SpanQuery will allow multiple terms with a 
distance of zero (0), that is, terms that were indexed with 
positionIncrement=0?





---Marc






Re: query across fields?

Posted by Doug Cutting <cu...@apache.org>.

Marc Hadfield wrote:
> I would prefer not to mix the full text and "types" in the same field as 
> it would make the term positions inconsistent which i depend on for 
> other queries.

Why not store them in the same field using positionIncrement=0 for the 
types?  Then they won't change positions of non-type tokens.  You should 
distinguish the types syntactically, e.g., prefix them with a space or 
other character that does not occur within words.  That way queries on 
this field for the term "name" won't match a type token.
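
A minimal sketch of that single-field layout, in plain Java rather than Lucene's analysis API. The class, the Token type and the "_" prefix are assumptions for illustration; any marker that cannot occur inside a word would serve. Each word advances the position by one, and its type token is stacked on it with an increment of zero.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the single-field layout: each word is followed by
// its type token at the same position. Not Lucene code.
class InterleavedTokens {

    // Assumed marker that keeps type tokens distinct from ordinary words.
    static final String TYPE_PREFIX = "_";

    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits word, type, word, type, ... with the type stacked on its word.
    static List<Token> interleave(String[] words, String[] types) {
        if (words.length != types.length) {
            throw new IllegalArgumentException("words and types are not aligned");
        }
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], 1));               // advances the position
            out.add(new Token(TYPE_PREFIX + types[i], 0)); // same position as the word
        }
        return out;
    }

    // Absolute position of each token, starting at 0.
    static int[] positions(List<Token> tokens) {
        int[] result = new int[tokens.size()];
        int pos = -1;
        for (int i = 0; i < tokens.size(); i++) {
            pos += tokens.get(i).positionIncrement;
            result[i] = pos;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = "Fred Johnson worked for Johnson and Johnson in 2001".split(" ");
        String[] types = "name name other other company company company other date".split(" ");
        List<Token> tokens = interleave(words, types);
        int[] pos = positions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(tokens.get(i).term + " @ " + pos[i]);
        }
    }
}
```

Because the type tokens never advance the position, each word keeps the position it would have had in a field without types, and a plain term such as "name" cannot collide with the prefixed type token.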

Doug
