Posted to java-user@lucene.apache.org by andy <yh...@sohu.com> on 2014/01/15 09:39:01 UTC

Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

Hi guys,

As the subject says, it seems that the length of the field does not accurately
affect the doc score with the Chinese analyzer in my source code.

Index source code:

    private static Directory DIRECTORY;

    @BeforeClass
    public static void before() throws IOException {
        DIRECTORY = new RAMDirectory();
        Analyzer chineseanalyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
        IndexWriterConfig indexWriterConfig =
                new IndexWriterConfig(Version.LUCENE_40, chineseanalyzer);
        FieldType nameType = new FieldType();
        nameType.setIndexed(true);
        nameType.setStored(true);
        nameType.setOmitNorms(false);
        try {
            IndexWriter indexWriter = new IndexWriter(DIRECTORY, indexWriterConfig);

            List<String> nameList = new ArrayList<String>();
            nameList.add("咨询公司");
            nameList.add("飞鹰咨询管理咨询公司");
            nameList.add("北京中标咨询公司");
            nameList.add("重庆咨询公司");
            nameList.add("商务咨询服务公司");
            nameList.add("法律咨询公司");
            for (int i = 0; i < nameList.size(); i++) {
                Document document = new Document();
                document.add(new Field("name", nameList.get(i), nameType));
                document.add(new Field("id", String.valueOf(i + 1), nameType));
                indexWriter.addDocument(document);
            }
            indexWriter.commit();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

Search snippet:

    @Test
    public void testChinese() throws IOException, ParseException {
        String keyword = "咨询公司";
        System.out.println("Searching for:" + keyword);
        System.out.println();
        IndexReader indexReader = DirectoryReader.open(DIRECTORY);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Query query = new QueryParser(Version.LUCENE_40, "name",
                new SmartChineseAnalyzer(Version.LUCENE_40)).parse(keyword);
        TopDocs topDocs = indexSearcher.search(query, 15);
        System.out.println("Search Result:");
        if (null != topDocs && 0 < topDocs.totalHits) {
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                System.out.println("doc id:" + indexSearcher.doc(scoreDoc.doc).get("id"));
                String name = indexSearcher.doc(scoreDoc.doc).get("name");
                System.out.println("content of Field:" + name);
                dumpCNTokens(name);
                System.out.println("score:" + scoreDoc.score);
                System.out.println("-------------------------------------------");
            }
        } else {
            System.out.println("no results");
        }
    }


And the search results are as follows:
Searching for:咨询公司

Search Result:
doc id:1
content of Field:咨询公司
Terms:咨询	公司	
score:0.74763227
-------------------------------------------
doc id:2
content of Field:飞鹰咨询管理咨询公司
Terms:飞鹰	咨询	管理	咨询	公司	
score:0.6317303
-------------------------------------------
doc id:3
content of Field:北京中标咨询公司
Terms:北京	中标	咨询	公司	
score:0.5981058
-------------------------------------------
doc id:4
content of Field:重庆咨询公司
Terms:重庆	咨询	公司	
score:0.5981058
-------------------------------------------
doc id:5
content of Field:商务咨询服务公司
Terms:商务	咨询	服务	公司	
score:0.5981058
-------------------------------------------
doc id:6
content of Field:法律咨询公司
Terms:法律	咨询	公司	
score:0.5981058
-------------------------------------------

Docs 3, 4, 5 and 6 have the same score, but I think docs 4 and 6 should
have a higher score than docs 3 and 5, because docs 4 and 6 have three
terms while docs 3 and 5 have four terms.
Am I right? Can anyone give me an explanation? And how do I get the expected
result?



--
View this message in context: http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

Posted by andy <yh...@sohu.com>.

Hi Uwe, 

Thanks a lot, I will try that.



RE: Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi andy,

Unfortunately, that is not easy to show with one simple code example. You have to change the Similarity used.

Before you start doing this, you should be sure that this affects your users. The example you gave shows very short documents. Lucene is optimized to handle larger documents; for short documents, the document statistics do not behave in an ideal way, and that is the main issue here. Instead of trying to change the very basic Lucene statistics, you should first verify that this affects a large part of your user queries and documents, not just this example, which looks like a special case. Otherwise it is not a sensible option.

Please read the Lucene documentation on how to change the similarity, specifically the length norm, while indexing/searching: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/package-summary.html#changingScoring
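
To make the idea of a more precise length norm concrete, here is a minimal, Lucene-free sketch. It assumes the length norm is 1/sqrt(number of terms) and stores the term count itself in one byte instead of a lossy float. The names ExactLengthNormDemo, encodeLength and decodeNorm are hypothetical and are not part of Lucene; wiring such an encoding into an index would still require a custom Similarity as described in the linked documentation, which is not shown here.

```java
// Illustrative arithmetic only; hypothetical names, not Lucene API.
class ExactLengthNormDemo {

    // Store the term count itself in one byte, clamped to the range 1..255.
    static byte encodeLength(int numTerms) {
        int clamped = Math.max(1, Math.min(255, numTerms));
        return (byte) clamped;
    }

    // Recover the norm from the stored count (assumed formula: 1/sqrt(length)).
    static float decodeNorm(byte stored) {
        int length = stored & 0xff; // read the byte as an unsigned value
        return (float) (1.0 / Math.sqrt(length));
    }

    public static void main(String[] args) {
        // Term counts taken from the six example documents in this thread.
        int[] termCounts = {2, 5, 4, 3, 4, 3};
        for (int i = 0; i < termCounts.length; i++) {
            float norm = decodeNorm(encodeLength(termCounts[i]));
            System.out.println("doc " + (i + 1) + ": terms=" + termCounts[i] + " norm=" + norm);
        }
    }
}
```

With this kind of encoding, fields of 3 and 4 terms keep distinct norms, at the cost of treating every field longer than 255 terms alike. Whether that trade-off is acceptable depends on the documents being indexed.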

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: andy [mailto:yhlweb@sohu.com]
> Sent: Wednesday, February 12, 2014 10:53 AM
> To: java-user@lucene.apache.org
> Subject: RE: Length of the field does not affect the doc score accurately for
> Chinese analyzer (SmartChineseAnalyzer)
> 
> Thanks Uwe, could you please give me a more detailed example of how to
> change the Lucene behavior?
> 
> 



RE: Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

Posted by andy <yh...@sohu.com>.

Thanks Uwe, could you please give me a more detailed example of how to change
the Lucene behavior?



RE: Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Erick,

A statement like "Adding &debug=all to the query will show you if this is the case" will not help a Lucene user, because that parameter is only available in the Solr server, and Andy uses Lucene directly. In his case he should use IndexSearcher's explain functionality, which returns a structured breakdown of how a document is scored for this query, for debugging:

http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)

But yes, the length norm is encoded with loss of precision in Lucene (it is a float value encoded to only 1 byte). With Lucene 4 there are ways to change that behavior, but that involves changing the similarity implementation and using a different DocValues type for encoding the norms. In most cases this is not needed, because users won't notice.
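
To make the precision loss concrete, here is a minimal sketch in plain Java (no Lucene dependency) of a one-byte float encoding modelled on what Lucene 4's default similarity applies to the length norm 1/sqrt(numTerms). The class name, the method names and the exact bit layout are assumptions made for illustration; treat it as an approximation of the real encoder, not a copy of it.

```java
public class NormPrecisionDemo {
    // Assumed offset between the float's exponent field and the byte's smaller one.
    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Squeezes a float into one byte by keeping only its highest-order bits. */
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);           // drop the 21 low-order bits
        if (small <= ZERO_POINT) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;   // zero, negative or underflow
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1;                   // overflow: largest representable value
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expands the byte back into a float; the dropped bits come back as zeros. */
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** Length norm for a field with numTerms terms after an encode/decode round trip. */
    public static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return decode(encode(exact));
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 5; n++) {
            float exact = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " terms: exact=" + exact + " stored=" + storedNorm(n));
        }
    }
}
```

Run as written, fields with 3 and 4 terms both end up with a stored norm of 0.5, while 2 terms gives 0.625 and 5 terms gives 0.4375, which is consistent with docs 3 to 6 receiving the same score.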

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Wednesday, January 15, 2014 1:30 PM
> To: java-user
> Subject: Re: Length of the filed does not affect the doc score accurately for
> chinese analyzer(SmartChineseAnalyzer)
> 
> the lengths of fields are encoded and lose some precision. So I suspect the
> length of the field calculated for the two documents are the same after
> encoding.
> 
> Adding &debug=all to the query will show you if this is the case.
> 
> Best
> Erick
> 


Re: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

Posted by andy <yh...@sohu.com>.

Thanks for your reply, Erick. This is indeed the case. But how can I keep the
precision of the fields' lengths?



--
View this message in context: http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390p4116832.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


Re: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

Posted by Erick Erickson <er...@gmail.com>.

The lengths of fields are encoded and lose some precision. So
I suspect the lengths of the field calculated for the two documents
are the same after encoding.
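
A quick arithmetic check of that suspicion, as a sketch. It assumes classic TF-IDF scoring in which each query term contributes sqrt(tf) times the stored norm, that both query terms occur in all six documents (so idf, queryNorm and coord are the same for every document and cancel out of a ratio), and that the one-byte encoding leaves norms of 0.625, 0.5 and 0.4375 for fields of 2, 3 to 4, and 5 terms. The class and method names are illustrative only.

```java
public class ScoreRatioCheck {
    /**
     * Document-dependent part of the score under the assumptions above:
     * (sqrt(tf of first term) + sqrt(tf of second term)) * stored norm.
     */
    public static double relativeScore(int tfFirstTerm, int tfSecondTerm, double storedNorm) {
        return (Math.sqrt(tfFirstTerm) + Math.sqrt(tfSecondTerm)) * storedNorm;
    }

    public static void main(String[] args) {
        double doc1 = relativeScore(1, 1, 0.625);   // 2 terms
        double doc2 = relativeScore(2, 1, 0.4375);  // 5 terms, first query term occurs twice
        double doc3 = relativeScore(1, 1, 0.5);     // 4 terms
        double doc4 = relativeScore(1, 1, 0.5);     // 3 terms, same stored norm as doc 3

        // Scores reported in the original post.
        double reported1 = 0.74763227;
        double reported2 = 0.6317303;
        double reported3 = 0.5981058;

        System.out.println("doc1/doc3 predicted=" + (doc1 / doc3)
                + " reported=" + (reported1 / reported3));
        System.out.println("doc2/doc3 predicted=" + (doc2 / doc3)
                + " reported=" + (reported2 / reported3));
        System.out.println("doc4/doc3 predicted=" + (doc4 / doc3));
    }
}
```

The predicted ratios (1.25 for doc 1 against doc 3, about 1.0562 for doc 2 against doc 3) agree with the reported scores to within float rounding, and docs 3 to 6 come out identical because their stored norms are identical.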

Adding &debug=all to the query will show you if this is the case.

Best
Erick

