You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Jan <fa...@informatik.hu-berlin.de> on 2010/11/17 17:47:49 UTC

uncorrect results

Hi,
i have an assignment in my Text Analytics class. I am supposed to create
an index and search it. The corpus is a PubMed-like XML file. it is
possible to query terms (programcall a few terms) and phrases
(programcall "a phrase"). 
When a phrase is queried the program should answer how often the phrase
occured.
The problem is, on certain queries the IndexSearcher returns some
documents that do not have that particular query in its fields.
I'd be delighted if someone could tell me what i am doing wrong.
See the source code at my github repo
https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/

Thanks in advance
jan

PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
1.8.2) on an 64 bit Linux machine.

Re: uncorrect results

Posted by Jan <fa...@informatik.hu-berlin.de>.

hmm ok i tried it but to no avail. 
It would have confused me even more to be honest.
actually i would not have used a Document Collector at all, because I
was supposed to give all results even when queried "the". What i mean is
that i would not need the score at all. I just didn't know how ;)
Anyway a scoring system should not "invent" token i think.

but thanks
jan

Am Donnerstag, den 18.11.2010, 13:05 -0500 schrieb Pulkit Singhal:
> Briefly looked at your code and there is no way that I'm right about
> this but I'll say it anyway:
> Every single field you index doesn't have any NORMS so how will the
> scoring happen?
> It probably happens based on the matches at query time but its not
> like you are specifying any boosts in you query.
> Lucene has a complex scoring formula that I don't claim to fully
> understand ... but what if somehow (stay with me, don't shoot the
> messenger) due to the fact that you have no NORMS at all, the results
> being collected somehow give a score to the document that doesn't have
> a match at all and therefore present it in the results?
> 
> Just a theory (a bad one perhaps) ... but one which can be easily
> blown away by using ANALYZED in your indexer and then trying again.
> 
> - Pulkit
> 
> On Thu, Nov 18, 2010 at 12:55 PM, Pulkit Singhal
> <pu...@gmail.com> wrote:
> > Wow, you live in a really great country and attend an awesome
> > university where they have classes like "Text Analytics" I'm gonna
> > send my kid there to study :)
> >
> > In all seriousness I think the problem may be with how you are
> > collecting your results.
> >
> > I find this very amusing:
> >> 80. 896889 phrase occurs 0 times
> >
> > How can it claim there are zero hits and still be returning you a result? Weird.
> >
> > Have you tried removing all other docs and then only leaving the one
> > problem child in there indexing just that and seeing what comes back?
> >
> > On Wed, Nov 17, 2010 at 1:19 PM, Jan <fa...@informatik.hu-berlin.de> wrote:
> >> thats what i figured...i can't find out what i'm doing wrong though ;)
> >>
> >> so the query is "experiment" (i know not really a phrase...but the
> >> assignment requested precisely so). The program constructs the following
> >> query
> >>
> >> +(AbstractText:"experiment" ArticleTitle:"experiment")
> >>
> >> which looks good to me. the results look like this:
> >>
> >> Found 95 hits.
> >> 1. 19810 phrase occurs 3 times
> >> 2. 587340 phrase occurs once
> >> ...
> >> 80. 896889 phrase occurs 0 times
> >> ...
> >> 95. 900325 phrase occurs once
> >>
> >> so here is the document 896889
> >> PMID
> >>        896889
> >> ArticleTitle
> >>        Estrogen-induced sexual receptivity and localization of 3H-estradiol in
> >> brains of female mice: effects of 5 alpha-reduced androgens, progestins
> >> and cyproterone acetate.
> >> AbstractText
> >>        Sexual receptivity induced in ovariectomized CD-1 mice with chronic
> >> daily administration of estradiol benzoate (E2 B) was blocked by
> >> concurrent administration of the 5 alpha-reduced androgen,
> >> dihydrotestosterone (DHT). Receptivity was restored in these females
> >> with progesterone-, but not with dihydroprogesterone-priming 6 hr prior
> >> to testing. Delaying the DHT injections until 12 hr after the E2 B
> >> injections greatly reduced its inhibitory properties. Receptivity in E2
> >> B-primed females was also blocked by concurrent treatment with
> >> cyproterone acetate and 3 alpha-, but not 3 beta-adrostanediol.
> >> Pretreatment with DHT, or 3 alpha- or 3 beta-androstanediol failed to
> >> consistently affects 3H-estradiol accumulation in crude nuclear and
> >> supernatant fractions from brain and pituitary
> >>
> >> so apart from doing something wrong while indexing/analyzing (the text
> >> above is from the xml, but i double checked...it is put in teh index
> >> with these textfragments) or so, the token "experiment" does not even
> >> occur. thats what baffles me.
> >>
> >> thanks for the very quick reaction
> >> jan
> >>
> >> Am Mittwoch, den 17.11.2010, 12:57 -0500 schrieb Donna L Gresh:
> >>> As it is probably more likely that you're doing something incorrect than
> >>> that Lucene is reporting incorrect results :), it might help if you
> >>> reported the exact query that is being submitted to the IndexSearcher, and
> >>> then showing us the document that was incorrectly returned. My guess is
> >>> that either looking at the query itself will immediately reveal the
> >>> problem to you, or that the query in combination with the document and
> >>> knowledge of which analyzers you are using will reveal the problem-
> >>>
> >>> Donna
> >>>
> >>>
> >>> Jan <fa...@informatik.hu-berlin.de> wrote on 11/17/2010 11:47:49 AM:
> >>>
> >>> > [image removed]
> >>> >
> >>> > uncorrect results
> >>> >
> >>> > Jan
> >>> >
> >>> > to:
> >>> >
> >>> > java-user
> >>> >
> >>> > 11/17/2010 11:51 AM
> >>> >
> >>> > Please respond to java-user
> >>> >
> >>> > Hi,
> >>> > i have an assignment in my Text Analytics class. I am supposed to create
> >>> > an index and search it. The corpus is a PubMed-like XML file. it is
> >>> > possible to query terms (programcall a few terms) and phrases
> >>> > (programcall "a phrase").
> >>> > When a phrase is queried the program should answer how often the phrase
> >>> > occured.
> >>> > The problem is, on certain queries the IndexSearcher returns some
> >>> > documents that do not have that particular query in its fields.
> >>> > I'd be delighted if someone could tell me what i am doing wrong.
> >>> > See the source code at my github repo
> >>> >
> >>> https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/
> >>>
> >>> >
> >>> > Thanks in advance
> >>> > jan
> >>> >
> >>> > PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
> >>> > 1.8.2) on an 64 bit Linux machine.
> >>> > [attachment "signature.asc" deleted by Donna L Gresh/Watson/IBM]
> >>
> >>
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

Re: uncorrect results

Posted by Pulkit Singhal <pu...@gmail.com>.

Briefly looked at your code and there is no way that I'm right about
this but I'll say it anyway:
Every single field you index doesn't have any NORMS so how will the
scoring happen?
It probably happens based on the matches at query time but its not
like you are specifying any boosts in you query.
Lucene has a complex scoring formula that I don't claim to fully
understand ... but what if somehow (stay with me, don't shoot the
messenger) due to the fact that you have no NORMS at all, the results
being collected somehow give a score to the document that doesn't have
a match at all and therefore present it in the results?

Just a theory (a bad one perhaps) ... but one which can be easily
blown away by using ANALYZED in your indexer and then trying again.

- Pulkit

On Thu, Nov 18, 2010 at 12:55 PM, Pulkit Singhal
<pu...@gmail.com> wrote:
> Wow, you live in a really great country and attend an awesome
> university where they have classes like "Text Analytics" I'm gonna
> send my kid there to study :)
>
> In all seriousness I think the problem may be with how you are
> collecting your results.
>
> I find this very amusing:
>> 80. 896889 phrase occurs 0 times
>
> How can it claim there are zero hits and still be returning you a result? Weird.
>
> Have you tried removing all other docs and then only leaving the one
> problem child in there indexing just that and seeing what comes back?
>
> On Wed, Nov 17, 2010 at 1:19 PM, Jan <fa...@informatik.hu-berlin.de> wrote:
>> thats what i figured...i can't find out what i'm doing wrong though ;)
>>
>> so the query is "experiment" (i know not really a phrase...but the
>> assignment requested precisely so). The program constructs the following
>> query
>>
>> +(AbstractText:"experiment" ArticleTitle:"experiment")
>>
>> which looks good to me. the results look like this:
>>
>> Found 95 hits.
>> 1. 19810 phrase occurs 3 times
>> 2. 587340 phrase occurs once
>> ...
>> 80. 896889 phrase occurs 0 times
>> ...
>> 95. 900325 phrase occurs once
>>
>> so here is the document 896889
>> PMID
>>        896889
>> ArticleTitle
>>        Estrogen-induced sexual receptivity and localization of 3H-estradiol in
>> brains of female mice: effects of 5 alpha-reduced androgens, progestins
>> and cyproterone acetate.
>> AbstractText
>>        Sexual receptivity induced in ovariectomized CD-1 mice with chronic
>> daily administration of estradiol benzoate (E2 B) was blocked by
>> concurrent administration of the 5 alpha-reduced androgen,
>> dihydrotestosterone (DHT). Receptivity was restored in these females
>> with progesterone-, but not with dihydroprogesterone-priming 6 hr prior
>> to testing. Delaying the DHT injections until 12 hr after the E2 B
>> injections greatly reduced its inhibitory properties. Receptivity in E2
>> B-primed females was also blocked by concurrent treatment with
>> cyproterone acetate and 3 alpha-, but not 3 beta-adrostanediol.
>> Pretreatment with DHT, or 3 alpha- or 3 beta-androstanediol failed to
>> consistently affects 3H-estradiol accumulation in crude nuclear and
>> supernatant fractions from brain and pituitary
>>
>> so apart from doing something wrong while indexing/analyzing (the text
>> above is from the xml, but i double checked...it is put in teh index
>> with these textfragments) or so, the token "experiment" does not even
>> occur. thats what baffles me.
>>
>> thanks for the very quick reaction
>> jan
>>
>> Am Mittwoch, den 17.11.2010, 12:57 -0500 schrieb Donna L Gresh:
>>> As it is probably more likely that you're doing something incorrect than
>>> that Lucene is reporting incorrect results :), it might help if you
>>> reported the exact query that is being submitted to the IndexSearcher, and
>>> then showing us the document that was incorrectly returned. My guess is
>>> that either looking at the query itself will immediately reveal the
>>> problem to you, or that the query in combination with the document and
>>> knowledge of which analyzers you are using will reveal the problem-
>>>
>>> Donna
>>>
>>>
>>> Jan <fa...@informatik.hu-berlin.de> wrote on 11/17/2010 11:47:49 AM:
>>>
>>> > [image removed]
>>> >
>>> > uncorrect results
>>> >
>>> > Jan
>>> >
>>> > to:
>>> >
>>> > java-user
>>> >
>>> > 11/17/2010 11:51 AM
>>> >
>>> > Please respond to java-user
>>> >
>>> > Hi,
>>> > i have an assignment in my Text Analytics class. I am supposed to create
>>> > an index and search it. The corpus is a PubMed-like XML file. it is
>>> > possible to query terms (programcall a few terms) and phrases
>>> > (programcall "a phrase").
>>> > When a phrase is queried the program should answer how often the phrase
>>> > occured.
>>> > The problem is, on certain queries the IndexSearcher returns some
>>> > documents that do not have that particular query in its fields.
>>> > I'd be delighted if someone could tell me what i am doing wrong.
>>> > See the source code at my github repo
>>> >
>>> https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/
>>>
>>> >
>>> > Thanks in advance
>>> > jan
>>> >
>>> > PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
>>> > 1.8.2) on an 64 bit Linux machine.
>>> > [attachment "signature.asc" deleted by Donna L Gresh/Watson/IBM]
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: uncorrect results

Posted by Pulkit Singhal <pu...@gmail.com>.

Wow, you live in a really great country and attend an awesome
university where they have classes like "Text Analytics" I'm gonna
send my kid there to study :)

In all seriousness I think the problem may be with how you are
collecting your results.

I find this very amusing:
> 80. 896889 phrase occurs 0 times

How can it claim there are zero hits and still be returning you a result? Weird.

Have you tried removing all other docs and then only leaving the one
problem child in there indexing just that and seeing what comes back?

On Wed, Nov 17, 2010 at 1:19 PM, Jan <fa...@informatik.hu-berlin.de> wrote:
> thats what i figured...i can't find out what i'm doing wrong though ;)
>
> so the query is "experiment" (i know not really a phrase...but the
> assignment requested precisely so). The program constructs the following
> query
>
> +(AbstractText:"experiment" ArticleTitle:"experiment")
>
> which looks good to me. the results look like this:
>
> Found 95 hits.
> 1. 19810 phrase occurs 3 times
> 2. 587340 phrase occurs once
> ...
> 80. 896889 phrase occurs 0 times
> ...
> 95. 900325 phrase occurs once
>
> so here is the document 896889
> PMID
>        896889
> ArticleTitle
>        Estrogen-induced sexual receptivity and localization of 3H-estradiol in
> brains of female mice: effects of 5 alpha-reduced androgens, progestins
> and cyproterone acetate.
> AbstractText
>        Sexual receptivity induced in ovariectomized CD-1 mice with chronic
> daily administration of estradiol benzoate (E2 B) was blocked by
> concurrent administration of the 5 alpha-reduced androgen,
> dihydrotestosterone (DHT). Receptivity was restored in these females
> with progesterone-, but not with dihydroprogesterone-priming 6 hr prior
> to testing. Delaying the DHT injections until 12 hr after the E2 B
> injections greatly reduced its inhibitory properties. Receptivity in E2
> B-primed females was also blocked by concurrent treatment with
> cyproterone acetate and 3 alpha-, but not 3 beta-adrostanediol.
> Pretreatment with DHT, or 3 alpha- or 3 beta-androstanediol failed to
> consistently affects 3H-estradiol accumulation in crude nuclear and
> supernatant fractions from brain and pituitary
>
> so apart from doing something wrong while indexing/analyzing (the text
> above is from the xml, but i double checked...it is put in teh index
> with these textfragments) or so, the token "experiment" does not even
> occur. thats what baffles me.
>
> thanks for the very quick reaction
> jan
>
> Am Mittwoch, den 17.11.2010, 12:57 -0500 schrieb Donna L Gresh:
>> As it is probably more likely that you're doing something incorrect than
>> that Lucene is reporting incorrect results :), it might help if you
>> reported the exact query that is being submitted to the IndexSearcher, and
>> then showing us the document that was incorrectly returned. My guess is
>> that either looking at the query itself will immediately reveal the
>> problem to you, or that the query in combination with the document and
>> knowledge of which analyzers you are using will reveal the problem-
>>
>> Donna
>>
>>
>> Jan <fa...@informatik.hu-berlin.de> wrote on 11/17/2010 11:47:49 AM:
>>
>> > [image removed]
>> >
>> > uncorrect results
>> >
>> > Jan
>> >
>> > to:
>> >
>> > java-user
>> >
>> > 11/17/2010 11:51 AM
>> >
>> > Please respond to java-user
>> >
>> > Hi,
>> > i have an assignment in my Text Analytics class. I am supposed to create
>> > an index and search it. The corpus is a PubMed-like XML file. it is
>> > possible to query terms (programcall a few terms) and phrases
>> > (programcall "a phrase").
>> > When a phrase is queried the program should answer how often the phrase
>> > occured.
>> > The problem is, on certain queries the IndexSearcher returns some
>> > documents that do not have that particular query in its fields.
>> > I'd be delighted if someone could tell me what i am doing wrong.
>> > See the source code at my github repo
>> >
>> https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/
>>
>> >
>> > Thanks in advance
>> > jan
>> >
>> > PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
>> > 1.8.2) on an 64 bit Linux machine.
>> > [attachment "signature.asc" deleted by Donna L Gresh/Watson/IBM]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: uncorrect results

Posted by Jan <fa...@informatik.hu-berlin.de>.

thats what i figured...i can't find out what i'm doing wrong though ;)

so the query is "experiment" (i know not really a phrase...but the
assignment requested precisely so). The program constructs the following
query

+(AbstractText:"experiment" ArticleTitle:"experiment")

which looks good to me. the results look like this:

Found 95 hits.
1. 19810 phrase occurs 3 times 
2. 587340 phrase occurs once 
...
80. 896889 phrase occurs 0 times 
...
95. 900325 phrase occurs once

so here is the document 896889
PMID
	896889
ArticleTitle
	Estrogen-induced sexual receptivity and localization of 3H-estradiol in
brains of female mice: effects of 5 alpha-reduced androgens, progestins
and cyproterone acetate.
AbstractText
	Sexual receptivity induced in ovariectomized CD-1 mice with chronic
daily administration of estradiol benzoate (E2 B) was blocked by
concurrent administration of the 5 alpha-reduced androgen,
dihydrotestosterone (DHT). Receptivity was restored in these females
with progesterone-, but not with dihydroprogesterone-priming 6 hr prior
to testing. Delaying the DHT injections until 12 hr after the E2 B
injections greatly reduced its inhibitory properties. Receptivity in E2
B-primed females was also blocked by concurrent treatment with
cyproterone acetate and 3 alpha-, but not 3 beta-adrostanediol.
Pretreatment with DHT, or 3 alpha- or 3 beta-androstanediol failed to
consistently affects 3H-estradiol accumulation in crude nuclear and
supernatant fractions from brain and pituitary

so apart from doing something wrong while indexing/analyzing (the text
above is from the xml, but i double checked...it is put in teh index
with these textfragments) or so, the token "experiment" does not even
occur. thats what baffles me.

thanks for the very quick reaction
jan

Am Mittwoch, den 17.11.2010, 12:57 -0500 schrieb Donna L Gresh:
> As it is probably more likely that you're doing something incorrect than 
> that Lucene is reporting incorrect results :), it might help if you 
> reported the exact query that is being submitted to the IndexSearcher, and 
> then showing us the document that was incorrectly returned. My guess is 
> that either looking at the query itself will immediately reveal the 
> problem to you, or that the query in combination with the document and 
> knowledge of which analyzers you are using will reveal the problem-
> 
> Donna 
> 
> 
> Jan <fa...@informatik.hu-berlin.de> wrote on 11/17/2010 11:47:49 AM:
> 
> > [image removed] 
> > 
> > uncorrect results
> > 
> > Jan 
> > 
> > to:
> > 
> > java-user
> > 
> > 11/17/2010 11:51 AM
> > 
> > Please respond to java-user
> > 
> > Hi,
> > i have an assignment in my Text Analytics class. I am supposed to create
> > an index and search it. The corpus is a PubMed-like XML file. it is
> > possible to query terms (programcall a few terms) and phrases
> > (programcall "a phrase"). 
> > When a phrase is queried the program should answer how often the phrase
> > occured.
> > The problem is, on certain queries the IndexSearcher returns some
> > documents that do not have that particular query in its fields.
> > I'd be delighted if someone could tell me what i am doing wrong.
> > See the source code at my github repo
> > 
> https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/
> 
> > 
> > Thanks in advance
> > jan
> > 
> > PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
> > 1.8.2) on an 64 bit Linux machine.
> > [attachment "signature.asc" deleted by Donna L Gresh/Watson/IBM]

Re: uncorrect results

Posted by Donna L Gresh <gr...@us.ibm.com>.

As it is probably more likely that you're doing something incorrect than 
that Lucene is reporting incorrect results :), it might help if you 
reported the exact query that is being submitted to the IndexSearcher, and 
then showing us the document that was incorrectly returned. My guess is 
that either looking at the query itself will immediately reveal the 
problem to you, or that the query in combination with the document and 
knowledge of which analyzers you are using will reveal the problem-

Donna 


Jan <fa...@informatik.hu-berlin.de> wrote on 11/17/2010 11:47:49 AM:

> [image removed] 
> 
> uncorrect results
> 
> Jan 
> 
> to:
> 
> java-user
> 
> 11/17/2010 11:51 AM
> 
> Please respond to java-user
> 
> Hi,
> i have an assignment in my Text Analytics class. I am supposed to create
> an index and search it. The corpus is a PubMed-like XML file. it is
> possible to query terms (programcall a few terms) and phrases
> (programcall "a phrase"). 
> When a phrase is queried the program should answer how often the phrase
> occured.
> The problem is, on certain queries the IndexSearcher returns some
> documents that do not have that particular query in its fields.
> I'd be delighted if someone could tell me what i am doing wrong.
> See the source code at my github repo
> 
https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/

> 
> Thanks in advance
> jan
> 
> PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
> 1.8.2) on an 64 bit Linux machine.
> [attachment "signature.asc" deleted by Donna L Gresh/Watson/IBM]

Re: uncorrect results

Posted by Simon Willnauer <si...@googlemail.com>.

Jan, can you elaborate the problem a little more. I see you do
indextime analysis with lowercasing (look at LowerCaseTokenizer btw.)
but you don't do lowercaseing at query time.
You could also use the QueryParser to create phrase query automatically though.

Could you give us an idea what the "wrong" documents look like an what
it matched?

One more unrelated question, what course are you doing - just out of curiosity.

simon

On Wed, Nov 17, 2010 at 5:47 PM, Jan <fa...@informatik.hu-berlin.de> wrote:
> Hi,
> i have an assignment in my Text Analytics class. I am supposed to create
> an index and search it. The corpus is a PubMed-like XML file. it is
> possible to query terms (programcall a few terms) and phrases
> (programcall "a phrase").
> When a phrase is queried the program should answer how often the phrase
> occured.
> The problem is, on certain queries the IndexSearcher returns some
> documents that do not have that particular query in its fields.
> I'd be delighted if someone could tell me what i am doing wrong.
> See the source code at my github repo
> https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/
>
> Thanks in advance
> jan
>
> PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
> 1.8.2) on an 64 bit Linux machine.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org