Posted to java-user@lucene.apache.org by A Z <4a...@gmail.com> on 2012/02/14 19:16:31 UTC

Top matched data should be on Top

Hi,

When I add three documents, the best-matching document is not ranked first,
but when I have only two documents the ranking is correct, as shown below.

I am using the default Similarity and Lucene 3.1.

I add the following documents:

    writer.addDocument(createDocument("Doc1",
        "pt carrefour indonesia temp price reduct advertising promotion disc reg"));
    writer.addDocument(createDocument("Doc2",
        "pt carrefour indonesia temp price reduct advertising promotion reg disc april"));
    // writer.addDocument(createDocument("Doc3", "qrst opq april"));  // document 3

If I uncomment Doc3 and run the same search, Doc1 comes out on top; with
Doc3 commented out, Doc2 comes out on top. What I want is for the
best-matching document to rank first regardless of how many documents are
in the index. Doc2 matches the most query text (it contains "april", which
Doc1 does not), so Doc2 should always be on top.

I search with the following text:

"pt carrefour indonesia temp price reduct advertising promotion anchr reg disc april"

With only two documents indexed [Doc1, Doc2], the output is:

Query (content:pt content:carrefour content:indonesia content:temp
content:price content:reduct content:advertising content:promotion
content:anchr content:reg content:disc content:april)
title  ->Doc2:::
content -> pt carrefour indonesia temp price reduct advertising promotion
reg disc april::: Score ->0.381982
title  ->Doc1:::
content -> pt carrefour indonesia temp price reduct advertising promotion
disc reg::: Score ->0.33834878

With all three documents indexed [Doc1, Doc2, Doc3], the output is:

Query (content:pt content:carrefour content:indonesia content:temp
content:price content:reduct content:advertising content:promotion
content:anchr content:reg content:disc content:april)
title  ->Doc1:::
content -> pt carrefour indonesia temp price reduct advertising promotion
disc reg::: Score ->0.6635133
title  ->Doc2:::
content -> pt carrefour indonesia temp price reduct advertising promotion
reg disc april::: Score ->0.6422809
title  ->Doc3:::
content -> qrst opq april::: Score ->0.010616212



Thanks

Re: Top matched data should be on Top

Posted by Ian Lea <ia...@gmail.com>.

Your example is hard to follow: there are too many words in the query and
the docs.  Have you looked at the output from IndexSearcher.explain()?  If
you don't like how Lucene is scoring things, you can write your own
implementation of Similarity.


--
Ian.
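
For readers without the index at hand, the kind of breakdown that explain()
reports can be approximated by hand. The following is a minimal sketch in
plain Java, not Lucene code. It assumes the classic TF-IDF formula of the
default Similarity in Lucene 3.x (score = coord * queryNorm * sum of
tf * idf^2 * fieldNorm, with idf = 1 + ln(numDocs / (docFreq + 1))), and it
assumes stored field norms of 0.3125 for the 10-term Doc1, 0.25 for the
11-term Doc2 and 0.5 for the 3-term Doc3. The class and method names are
made up for illustration.

```java
import java.util.Arrays;

class ClassicScoreSketch {
    static final String DOC1 =
        "pt carrefour indonesia temp price reduct advertising promotion disc reg";
    static final String DOC2 =
        "pt carrefour indonesia temp price reduct advertising promotion reg disc april";
    static final String DOC3 = "qrst opq april";
    static final String QUERY =
        "pt carrefour indonesia temp price reduct advertising promotion anchr reg disc april";

    static boolean contains(String doc, String term) {
        return Arrays.asList(doc.split(" ")).contains(term);
    }

    // Assumed idf of the default Similarity: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Assumed formula: coord * queryNorm * sum(tf * idf^2 * fieldNorm).
    // Every term occurs at most once per document here, so tf = 1.
    static double score(String[] query, String doc, double fieldNorm, String[] index) {
        double sumOfSquaredWeights = 0.0;
        double matchedWeight = 0.0;
        int matched = 0;
        for (String term : query) {
            int docFreq = 0;
            for (String d : index) {
                if (contains(d, term)) docFreq++;
            }
            double w = idf(docFreq, index.length);
            sumOfSquaredWeights += w * w;
            if (contains(doc, term)) {
                matchedWeight += w * w;
                matched++;
            }
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        double coord = (double) matched / query.length;
        return coord * queryNorm * matchedWeight * fieldNorm;
    }

    public static void main(String[] args) {
        String[] q = QUERY.split(" ");
        String[] two = {DOC1, DOC2};
        String[] three = {DOC1, DOC2, DOC3};
        // Field norms below are assumptions, not values read from an index.
        System.out.printf("2 docs: Doc1=%.5f Doc2=%.5f%n",
            score(q, DOC1, 0.3125, two), score(q, DOC2, 0.25, two));
        System.out.printf("3 docs: Doc1=%.5f Doc2=%.5f Doc3=%.5f%n",
            score(q, DOC1, 0.3125, three), score(q, DOC2, 0.25, three),
            score(q, DOC3, 0.5, three));
    }
}
```

Under these assumptions the sketch lands close to the scores posted in the
thread: about 0.338 and 0.382 with two documents, and about 0.664, 0.642 and
0.011 with three. Adding Doc3 raises the document frequency of "april" to
that of the other matching terms, so the extra match no longer outweighs
Doc2's lower field norm.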


On Sun, Feb 19, 2012 at 5:08 AM, A Z <4a...@gmail.com> wrote:
> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Top matched data should be on Top

Posted by A Z <4a...@gmail.com>.

Hi,

Thanks for your reply.

However, if I add one extra word [abc] to all three documents and to the
search string, the best-matching document does come out on top. That is not
the case once I remove abc from the documents and the search string.

So with abc added, I get Doc2, which has the most matched words, on top:

Query (content:pt content:carrefour content:indonesia content:temp
content:price content:reduct content:advertising content:promotion
content:anchr content:reg content:disc content:april content:abc)

title  ->Doc2:::
content -> pt carrefour indonesia temp price reduct advertising promotion
reg disc april  abc::: Score ->0.6657306
title  ->Doc1:::
content -> pt carrefour indonesia temp price reduct advertising promotion
disc reg abc::: Score ->0.55722165
title  ->Doc3:::
content -> qrst opq april  abc::: Score ->0.029068843

So my requirement is that the document with the most matched words should
be on top. When two documents have the same number of matched words, the
shorter document should rank first; otherwise the document with more
matched words should rank first.
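
The ordering described above (most matched words first, shorter document
first on a tie) can be sketched as a plain comparator. This is not Lucene
code and says nothing about how to get the same effect from a custom
Similarity; the class and method names are hypothetical, and terms are
assumed to be separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MatchCountRanking {
    // Number of distinct query terms that occur in the document.
    static int matchedTerms(String query, String doc) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(doc.trim().split("\\s+")));
        Set<String> queryTerms = new HashSet<>(Arrays.asList(query.trim().split("\\s+")));
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) matched++;
        }
        return matched;
    }

    // Document length in whitespace-separated terms.
    static int termCount(String doc) {
        return doc.trim().split("\\s+").length;
    }

    // More matched terms first; on a tie, the shorter document first.
    static List<String> rank(String query, List<String> docs) {
        List<String> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator
            .comparingInt((String d) -> matchedTerms(query, d)).reversed()
            .thenComparingInt(MatchCountRanking::termCount));
        return sorted;
    }

    public static void main(String[] args) {
        String query =
            "pt carrefour indonesia temp price reduct advertising promotion anchr reg disc april";
        List<String> docs = Arrays.asList(
            "pt carrefour indonesia temp price reduct advertising promotion disc reg",
            "pt carrefour indonesia temp price reduct advertising promotion reg disc april",
            "qrst opq april");
        for (String doc : rank(query, docs)) {
            System.out.println(matchedTerms(query, doc) + " matched: " + doc);
        }
    }
}
```

With the three documents from the original post, this ordering puts Doc2
(11 matched words) ahead of Doc1 (10) and Doc3 (1), whether or not Doc3 is
present.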


On Tue, Feb 14, 2012 at 11:54 PM, Erick Erickson <er...@gmail.com> wrote:
> [...]

Re: Top matched data should be on Top

Posted by Erick Erickson <er...@gmail.com>.

You cannot simply count words like this and expect the docs to be ordered
as you imply. The problem is that the lengths of the fields are encoded
in a byte (or perhaps an int, I forget). Thus, some loss of precision
is inherent in the process. You have to encode values from 1 to 2^31
or so in something that's not a long.

So try attaching &debugQuery=on and examining the output; you'll probably
see that the scores are identical, in which case Solr breaks the ties by
document insertion order (roughly). And looking closely at the debug
information, I suspect you'll see that the length normalization is
the same.

Best
Erick
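
The precision loss described above can be illustrated with a short sketch.
It assumes that the default Similarity in Lucene 3.x computes the length
norm as 1/sqrt(number of terms) with no boost, and stores it in one byte
using the SmallFloat "315" encoding. The two conversion methods below are
transcribed from memory and should be checked against the Lucene source;
the class name and the storedNorm helper are hypothetical.

```java
class NormQuantization {
    // Assumed equivalent of SmallFloat.floatToByte315: keeps only the top
    // bits of the float, so nearby values collapse to the same byte.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow maps to the largest value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Assumed equivalent of SmallFloat.byte315ToFloat.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Length norm 1/sqrt(numTerms) after a round trip through one byte.
    static float storedNorm(int numTerms) {
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(norm));
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {3, 4, 10, 11, 12}) {
            System.out.println(numTerms + " terms -> exact "
                + (float) (1.0 / Math.sqrt(numTerms))
                + ", stored " + storedNorm(numTerms));
        }
    }
}
```

If these assumptions hold, a 10-term field is stored as 0.3125 while
11-term and 12-term fields are both stored as 0.25. That would give Doc1
and Doc2 different norms in the original example (so their scores differ
rather than tie), and the same norm once "abc" is added to both.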

On Tue, Feb 14, 2012 at 1:16 PM, A Z <4a...@gmail.com> wrote:
> [...]
