Posted to solr-user@lucene.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/06/18 14:02:59 UTC

Why are not query keywords treated as a set?

q=past past

1.0 = (MATCH) sum of:
  0.5 = (MATCH) fieldWeight(content:past in 0), product of:
    1.0 = tf(termFreq(content:past)=1)
    1.0 = idf(docFreq=1, maxDocs=2)
    0.5 = fieldNorm(field=content, doc=0)
  0.5 = (MATCH) fieldWeight(content:past in 0), product of:
    1.0 = tf(termFreq(content:past)=1)
    1.0 = idf(docFreq=1, maxDocs=2)
    0.5 = fieldNorm(field=content, doc=0)

Is there a way to treat the query keywords as a set, so that a repeated term
is scored only once?
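
To make the arithmetic concrete, here is a minimal Python sketch of what the
explain output adds up to. It is not Lucene code: the function names are made
up for illustration, and it only models the three factors listed under each
fieldWeight line, ignoring any factor that does not appear in the output.

```python
def field_weight(tf, idf, field_norm):
    """Product of the three factors listed under each fieldWeight line."""
    return tf * idf * field_norm


def score(query_terms):
    """Sum one fieldWeight per query clause, as in the 'sum of' line."""
    # Factor values copied from the explain output for content:past in doc 0.
    return sum(field_weight(1.0, 1.0, 0.5) for _ in query_terms)


as_typed = "past past".split()
as_set = list(dict.fromkeys(as_typed))  # order-preserving de-duplication

print(score(as_typed))  # two clauses for the same term
print(score(as_set))    # one clause
```

Each clause is scored independently, so the repeated term contributes twice
unless the terms are de-duplicated before the query is built.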

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: Why are not query keywords treated as a set?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
Part of the query is 'injected' by my application, which is unaware of the
user query. If I knew that 'paste past' would end up together as the query
'past past', I would not inject anything, since the duplicate distorts the
score calculation. I could inject after it, but it is not easy.


So, trying to solve it right in the RequestHandler, I have difficulties with
queries that contain phrases ("") or the 'must be present' + operator. For
example, I would not want to touch a user query such as
+"zusammen essen" +"alein essen", where 'essen' is the duplicate term.

My 'good enough' solution is thus not to remove the duplicate in clauses
prefixed by + or ".

C := list of clauses in which the duplicated term t occurs.
for each clause c in a copy of C:
do
  // keep clauses that start with " or +, and always keep the last one left
  if (!c.toString().startsWith("\"") &&
      !c.toString().startsWith("+") &&
      |C| > 1) {
    C.remove(c);
  }
end
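
The pseudocode above can be sketched in Python roughly as follows. This is
only an illustration under assumptions: it works on the raw query string
rather than on Lucene clause objects, the regular expression is a simplified
stand-in for the real query parser, and all names are hypothetical.

```python
import re

# Hypothetical clause splitter: an optionally +-prefixed quoted phrase,
# or else a run of non-space characters.
CLAUSE = re.compile(r'\+?"[^"]*"|\S+')


def terms_of(clause):
    """Bare lower-cased terms of a clause, without the + and quote marks."""
    return set(clause.strip('+"').lower().split())


def is_protected(clause):
    """Clauses starting with + or a double quote are never removed."""
    return clause.startswith(('+', '"'))


def dedupe(query):
    clauses = CLAUSE.findall(query)
    alive = [True] * len(clauses)
    for term in sorted(set().union(*map(terms_of, clauses))):
        # C: indices of the remaining clauses in which this term occurs
        c = [i for i, cl in enumerate(clauses)
             if alive[i] and term in terms_of(cl)]
        remaining = len(c)
        for i in c:
            # Same test as the pseudocode: unprotected and |C| > 1
            if not is_protected(clauses[i]) and remaining > 1:
                alive[i] = False
                remaining -= 1
    return " ".join(cl for i, cl in enumerate(clauses) if alive[i])


print(dedupe("past past"))
print(dedupe('+"zusammen essen" +"alein essen"'))
```

Like the pseudocode, this drops earlier unprotected duplicates first, so the
clause that survives is the last one, and a query whose duplicates all sit in
protected clauses is returned unchanged.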

What do you think? Are there better solutions or algorithms to make sure the
same term occurs only once in a query, or at least is weighted only once in
the score calculation?


On Mon, Jun 20, 2011 at 11:15 AM, Markus Jelsma
<ma...@openindex.io> wrote:

> That only removes tokens at the same position, as the wiki explains.
>
> Gabriele, why would you expect that? You input two tokens, so you query for
> two tokens. Why would it be a `set`?




Re: Why are not query keywords treated as a set?

Posted by Markus Jelsma <ma...@openindex.io>.
That only removes tokens at the same position, as the wiki explains.

Gabriele, why would you expect that? You input two tokens, so you query for
two tokens. Why would it be a `set`?
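
A rough Python model of that behaviour, for illustration only: it is not the
Lucene implementation, and tokens are modelled as hypothetical
(term, position_increment) pairs, where an increment of 0 means the token is
stacked on the previous token's position.

```python
def remove_duplicates_same_position(tokens):
    """Drop a token only if the same term was already seen at its position."""
    out = []
    seen_here = set()          # terms seen at the current position
    for term, pos_inc in tokens:
        if pos_inc > 0:
            seen_here = set()  # moved on to a new position: start over
        if term in seen_here:
            continue           # duplicate at the same position
        seen_here.add(term)
        out.append((term, pos_inc))
    return out


# "past past": the second token sits at the next position, so both survive.
print(remove_duplicates_same_position([("past", 1), ("past", 1)]))
# A stacked duplicate (position increment 0) is dropped.
print(remove_duplicates_same_position([("past", 1), ("past", 0)]))
```

Under this model the query `past past` is left untouched, because its two
tokens are at different positions.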

> This might help in your analysis chain:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory

Re: Why are not query keywords treated as a set?

Posted by lee carroll <le...@googlemail.com>.
This might help in your analysis chain:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory



On 20 June 2011 04:21, Gabriele Kahlout <ga...@mysimpatico.com> wrote:
> <str name="rawquerystring">past past</str>
> <str name="querystring">past past</str>
> <str name="parsedquery">content:past content:past</str>
>
> I was expecting the query to get parsed into content:past only and not
> content:past content:past.

Re: Why are not query keywords treated as a set?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
<str name="rawquerystring">past past</str>
<str name="querystring">past past</str>
<str name="parsedquery">content:past content:past</str>

I was expecting the query to get parsed into content:past only and not
content:past content:past.

On Mon, Jun 20, 2011 at 12:12 AM, lee carroll
<le...@googlemail.com> wrote:

> Do you mean a phrase query, "past past"?
> Can you give some more detail?




Re: Why are not query keywords treated as a set?

Posted by lee carroll <le...@googlemail.com>.
Do you mean a phrase query, "past past"?
Can you give some more detail?

On 18 June 2011 13:02, Gabriele Kahlout <ga...@mysimpatico.com> wrote:
> q=past past