Posted to java-user@lucene.apache.org by dziadgba <dz...@googlemail.com> on 2007/03/10 14:33:03 UTC

HITS and termDoc give different results

Hi,
I want to extract the documents that contain a specific term.
I tried to do it in two different ways:

1 Using the 'iterator' termdocs = reader.termDocs(term);
2 Using search and examining Hits

It turns out that the results are sometimes equal, sometimes the first is a
subset of the second, and sometimes there is no connection between the two
results.

Can somebody give me a hint?

bye


    public void addDocumentsToTerm(int debug, String myterm) throws Exception
     { TermEnum terms=reader.terms();

       MyDocument doc;
       Document docLucene;
       int count=0;
       boolean b=false;
       if(debug==1) System.out.print("Checking (MyTerm) "+myterm+" ... ");

       while (terms.next())
       { Term t = terms.term();
         if(t.text().compareTo(myterm)==0)
         { TermDocs termDocs = reader.termDocs(t);
           while(termDocs.next())
           {
             if(debug==1 && count==0) System.out.println("equal to (Term) "+t.text());
             count++;
             docLucene = reader.document(termDocs.doc());
             if(debug==1) System.out.println("  docLucene: ["+termDocs.doc()+"-"+docLucene.getField("Code").stringValue()+"]");
             b=true;
           }
           System.out.println();

           QueryParser pars = new QueryParser("Text",new StandardAnalyzer());
           Query q= pars.parse(t.text());
           Hits hits = searcher.search(q);

           System.out.println("Found "+hits.length()+" matches for query "+q);
           for(int i=0;i<hits.length();i++)
           {  Document d = hits.doc(i);
              System.out.println("        doc: ["+d.+"-"+d.getField("Code").stringValue()+"]");
           }

           System.out.println();
           if(b==false)
             System.out.println("No Document found for term: "+myterm.getTerm());
           if(debug==1)if(count < myterm.getDocFreq()&&count>0)
             System.out.println("        Error term: "+myterm.getTerm()+", documents found: "+count+", docFreq: "+myterm.getDocFreq());

           return;
         }
       }
     }

OUTPUT

Checking (Myterm) zucca ... equal to (Term) zucca
   docLucene: [9963-356 U.S. 256]

Found 8 matches for query Text:zucca
         doc: [0-356 U.S. 256]
         doc: [1-365 U.S. 290]
         doc: [2-351 U.S. 91]
         doc: [3-356 U.S. 660]
         doc: [4-365 U.S. 265]
         doc: [5-377 U.S. 235]
         doc: [6-435 U.S. 519]
         doc: [7-441 U.S. 281]

         Error term: zucca, documents found: 1, docFreq: 8

Checking (Myterm) zimroth ... equal to (Term) zimroth
         doc: [16478-476 U.S. 467]
         doc: [17142-492 U.S. 257]
         doc: [17208-488 U.S. 235]
         doc: [17911-484 U.S. 1]
         doc: [17920-487 U.S. 1]
         doc: [18010-484 U.S. 301]

Found 8 matches for query Text:zimroth
         doc: [0-389 U.S. 143]
         doc: [1-468 U.S. 981]
         doc: [2-489 U.S. 688]
         doc: [3-491 U.S. 781]
         doc: [4-436 U.S. 412]
         doc: [5-445 U.S. 573]
         doc: [6-462 U.S. 213]
         doc: [7-468 U.S. 897]

         Error term: zimroth, documents found: 6, docFreq: 8




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: HITS and termDoc give different results

Posted by dziadgba dziadgba <dz...@googlemail.com>.
You were right.
Thanks for the help.
dziadgba

2007/3/11, Doron Cohen <DO...@il.ibm.com>:
>
> Is "Text" the only field in the index?
>
> Note that the search only looks at the field "Text", while the terms()
> iteration as it appears in that code might bump into a term with the same
> text but in another field. A better comparison would be to create a
> Term("Text", <your-word>) and compare TermQuery(thatTerm) to
> termDocs(thatTerm). Btw, if iterating all terms is ever a must, note
> TermEnum.skipTo(Term).
>
> Hope this helps,
> Doron
>

Re: HITS and termDoc give different results

Posted by Doron Cohen <DO...@il.ibm.com>.
Is "Text" the only field in the index?

Note that the search only looks at the field "Text", while the terms()
iteration as it appears in that code might bump into a term with the same
text but in another field. A better comparison would be to create a
Term("Text", <your-word>) and compare TermQuery(thatTerm) to
termDocs(thatTerm). Btw, if iterating all terms is ever a must, note
TermEnum.skipTo(Term).
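
To make the field issue concrete, here is a minimal sketch in plain Java. It
is a toy model, not the Lucene API: the class FieldTermDemo, its methods, the
field name "Parties" and the document ids are all invented for illustration.
It only assumes what is described above, namely that a term is a (field, text)
pair and that terms are enumerated by field first, then by text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of an inverted index; none of these names come from Lucene.
class FieldTermDemo {
    // field name -> term text -> ids of the documents containing that term
    static final TreeMap<String, TreeMap<String, List<Integer>>> index = new TreeMap<>();

    static void add(String field, String text, Integer... docIds) {
        index.computeIfAbsent(field, f -> new TreeMap<>()).put(text, Arrays.asList(docIds));
    }

    // Mimics the loop in the original post: fields are visited in sorted
    // order and the first term whose text matches wins, whatever its field.
    static List<Integer> firstTermWithText(String text) {
        for (Map.Entry<String, TreeMap<String, List<Integer>>> byField : index.entrySet()) {
            List<Integer> docs = byField.getValue().get(text);
            if (docs != null) {
                return docs;
            }
        }
        return Collections.<Integer>emptyList();
    }

    // Mimics looking up one exact (field, text) term, which is what a
    // single-field search and a lookup on the same term have in common.
    static List<Integer> exactTerm(String field, String text) {
        TreeMap<String, List<Integer>> terms = index.get(field);
        List<Integer> docs = (terms == null) ? null : terms.get(text);
        return (docs == null) ? Collections.<Integer>emptyList() : docs;
    }

    public static void main(String[] args) {
        // Invented data: the same word indexed in two fields.
        add("Text", "zucca", 0, 1, 2);
        add("Parties", "zucca", 0);

        System.out.println(firstTermWithText("zucca"));  // [0], taken from "Parties"
        System.out.println(exactTerm("Text", "zucca"));  // [0, 1, 2]
    }
}
```

In this model "Parties" sorts before "Text", so matching on the text alone
returns the postings of the wrong field, while the exact (field, text) lookup
returns the set a search restricted to "Text" would be expected to see.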

Hope this helps,
Doron


