Posted to java-user@lucene.apache.org by manjula wijewickrema <ma...@gmail.com> on 2010/07/09 09:20:50 UTC

scoring and index size

Hi,

I ran a simple programme to see how Lucene scores a single indexed
document. The explain() method gave me the following results.
*******************

Searching for 'metaphysics'

Number of hits: 1

0.030706111

0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of:

10.246951 = tf(termFreq(contents:metaphys)=105)

0.30685282 = idf(docFreq=1, maxDocs=1)

0.009765625 = fieldNorm(field=contents, doc=0)

*****************

But I encountered the following problems:

1) In this case, I did not change any boost values, so should fieldNorm =
1/sqrt(terms in field)? (I noticed in the Lucene email archive that the
default boost value is 1.)

2) Even when I calculate fieldNorm manually (as 1/sqrt(terms in field)), it
only approximately matches the value given by the system. Can this be due
to encode/decode precision loss of the norm?

3) My indexed document consisted of 19078 words in total, including 125
occurrences of the word 'metaphysics' (i.e. my query; I input a single-term
query). But as you can see in the output above, the system reports only 105
occurrences of 'metaphysics'. However, once I removed part of the indexed
document, counted the occurrences of 'metaphysics' by hand, and checked
against the system results, I noticed that with the reduced text the system
counts them correctly. Why this kind of behaviour? Is there any limit on
the size of indexed documents?

If somebody could please help me to solve these problems.
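[Editor's note: the three factors in the explain() output above multiply out
under Lucene's classic similarity, where tf = sqrt(termFreq), idf =
1 + ln(maxDocs/(docFreq + 1)), and fieldNorm is the decoded single-byte
length norm. A quick arithmetic check in plain Java, taking the decoded norm
value 0.009765625 straight from the output:]

```java
public class ScoreCheck {
    public static void main(String[] args) {
        double tf = Math.sqrt(105);               // 10.246951, termFreq = 105
        double idf = 1 + Math.log(1.0 / (1 + 1)); // 0.30685282, docFreq = 1, maxDocs = 1
        double fieldNorm = 0.009765625;           // decoded one-byte norm from explain()
        double score = tf * idf * fieldNorm;
        System.out.println(score);                // ~0.030706111, matching the output above
    }
}
```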

Thanks!

Manjula.

Re: scoring and index size

Posted by manjula wijewickrema <ma...@gmail.com>.
Hi Koji,

Thanks for your information

Manjula



On Fri, Jul 9, 2010 at 5:04 PM, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:


Re: scoring and index size

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(10/07/09 19:30), manjula wijewickrema wrote:
> Uwe, thanx for your comments. Following is the code I used in this case.
> Could you pls. let me know where I have to insert UNLIMITED field length?
> and how?
> Tanx again!
> Manjula
>
>    
Manjula,

You can set UNLIMITED field length to IW constructor:

http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/index/IndexWriter.html#IndexWriter%28org.apache.lucene.store.Directory,%20org.apache.lucene.analysis.Analyzer,%20boolean,%20org.apache.lucene.index.IndexWriter.MaxFieldLength%29
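[Editor's note: in code, that means using the four-argument constructor
linked above; a sketch against the Lucene 2.9 API, borrowing the variable
names from the code in the later message. Per the 2.9 javadoc, the LIMITED
default indexes only the first 10,000 terms of a field.]

```java
IndexWriter indexWriter = new IndexWriter(
        FSDirectory.open(new File(INDEX_DIRECTORY)),
        analyzer,
        recreateIndexIfExists,
        IndexWriter.MaxFieldLength.UNLIMITED); // LIMITED caps a field at 10,000 terms
```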

Koji

-- 
http://www.rondhuit.com/en/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: scoring and index size

Posted by manjula wijewickrema <ma...@gmail.com>.
Uwe, thanks for your comments. Following is the code I used in this case.
Could you please let me know where I have to insert the UNLIMITED field
length, and how?
Thanks again!
Manjula

code--

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";
    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        createIndex();
        //searchIndex("rice AND milk");
        searchIndex("metaphysics");
        //searchIndex("banana");
        //searchIndex("foo");
    }

    public static void createIndex() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        boolean recreateIndexIfExists = true;
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
                recreateIndexIfExists);

        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();
            //contents#setOmitNorms(true);
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES,
                    Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));
            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    public static void searchIndex(String searchString)
            throws IOException, ParseException {
        System.out.println("Searching for '" + searchString + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);

        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());

        TopDocs results = indexSearcher.search(query, 10);
        ScoreDoc[] hits1 = results.scoreDocs;
        for (ScoreDoc hit : hits1) {
            Document doc = indexSearcher.doc(hit.doc);
            //System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
            System.out.println(hit.score);
            //Searcher.explain("rice", 0);
            //System.out.println(indexSearcher.explain(query, 0));
        }

        System.out.println(indexSearcher.explain(query, 0));
        //System.out.println(indexSearcher.explain(query, 1));
        //System.out.println(indexSearcher.explain(query, 2));
        //System.out.println(indexSearcher.explain(query, 3));

        Iterator<Hit> it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }
}






On Fri, Jul 9, 2010 at 1:06 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Maybe you have MaxFieldLength.LIMITED instead of UNLIMITED? Then the
> number of terms per document is limited.
>
> The calculation precision is limited by the float norm encoding, but also
> if your analyzer removed stop words, the norm may not be what you expect.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>
>

RE: scoring and index size

Posted by Uwe Schindler <uw...@thetaphi.de>.
Maybe you have MaxFieldLength.LIMITED instead of UNLIMITED? Then the number
of terms per document is limited.

The calculation precision is limited by the float norm encoding, but also if
your analyzer removed stop words, the norm may not be what you expect.
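[Editor's note: both effects would account for the numbers in the explain()
output above. With the documented LIMITED default, only the first 10,000
terms of the field are indexed, which would explain the truncated term count
(105 rather than 125) as well as the norm: 1/sqrt(10000) = 0.01, and the
single-byte norm encoding (a 3-bit mantissa, Lucene's SmallFloat format)
truncates 0.01 down to 0.009765625, exactly the fieldNorm reported. A sketch
of that rounding in plain Java:]

```java
public class NormPrecision {
    public static void main(String[] args) {
        double exact = 1.0 / Math.sqrt(10000);   // length norm for a 10,000-term field: 0.01
        // A 3-bit mantissa cannot represent 0.01; it is truncated to the
        // nearest representable value below it, 1.25 * 2^-7:
        double decoded = 1.25 * Math.pow(2, -7); // 0.009765625, the fieldNorm in explain()
        System.out.println(exact + " -> " + decoded);
    }
}
```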

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



