
Posted to java-user@lucene.apache.org by Maisnam Ns <ma...@gmail.com> on 2015/02/22 12:47:15 UTC

Getting most occurring words in lucene

Hi,

I am trying to get the most frequently occurring words by building an
in-memory index with Lucene, using the code below, but I am not getting the
desired results. The text contains 'freedom' three times, but the reported
count is only 1. Where am I making a mistake? Is there a way to fix this?
Please help.

RAMDirectory idx = new RAMDirectory(); // create RAM directory

// create the index
IndexWriter writer = new IndexWriter(idx,
    new StandardAnalyzer(Version.LUCENE_30),
    IndexWriter.MaxFieldLength.LIMITED);

// add text to document
writer.addDocument(createDocument("key1",
    "It behooves every man to freedom freedom freedom remember "
    + "that the work of the "));

try {
    computeTopTermQuery(idx); // compute the top terms
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

The computeTopTermQuery method is taken from Sujit Pal's blog:
http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html

      private static Query computeTopTermQuery(Directory ramdir) throws Exception {
        final Map<String,Integer> frequencyMap =
          new HashMap<String,Integer>();
        List<String> termlist = new ArrayList<String>();
        IndexReader reader = IndexReader.open(ramdir);
        TermEnum terms = reader.terms();
        while (terms.next()) {
          Term term = terms.term();
          String termText = term.text();
          int frequency = reader.docFreq(term);
          frequencyMap.put(termText, frequency);
          termlist.add(termText);
        }
        reader.close();
        // sort the term map by frequency descending
        Collections.sort(termlist, new ReverseComparator<String>(
          new ByValueComparator<String,Integer>(frequencyMap)));
        // retrieve the top terms based on topTermCutoff
        List<String> topTerms = new ArrayList<String>();
        float topFreq = -1.0F;
        for (String term : termlist) {
          if (topFreq < 0.0F) {
            // first term, capture the value
            topFreq = (float) frequencyMap.get(term);
            topTerms.add(term);
          } else {
            // not the first term, compute the ratio and discard if below
            // topTermCutoff score
            float ratio = (float) ((float) frequencyMap.get(term) / topFreq);
            if (ratio >= topTermCutoff) {
              topTerms.add(term);
            } else {
              break;
            }
          }
        }
        StringBuilder termBuf = new StringBuilder();
        BooleanQuery q = new BooleanQuery();
        for (String topTerm : topTerms) {
          termBuf.append(topTerm).
            append("(").
            append(frequencyMap.get(topTerm)).
            append(");");
          q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
        }
        System.out.println(">>> top terms: " + termBuf.toString());
        System.out.println(">>> query: " + q.toString());
        return q;
      }


Surprisingly, 'freedom' is reported as (1) and not (3), where 3 is the
number of times it occurs in the text.

top terms:
accomplished(1);altogether(1);behooves(1);critic(1);does(1);end(1);
every(1);freedom(1);importance(1);man(1);progress(1);remember(1);
secondary(1);things(1);who(1);work(1);

Thanks

Re: Getting most occurring words in lucene

Posted by Michael McCandless <lu...@mikemccandless.com>.

Use TermsEnum.totalTermFreq(), which is the total number of
occurrences of the term, not TermsEnum.docFreq(), which is the number
of documents that contain at least one occurrence of the term.
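
To make the distinction above concrete, here is a minimal plain-Java sketch
that does not use Lucene at all. The class TermCounts and its methods
tokenize, docFreq and totalTermFreq are hypothetical names chosen only to
mirror the two statistics; the whitespace tokenizer is an assumption and is
much cruder than StandardAnalyzer. Note also that TermsEnum belongs to the
Lucene 4.x API, while the code in the question uses the older TermEnum
class, so an upgrade or an equivalent per-document count would likely be
needed to apply the advice directly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Illustration only: plain Java, not Lucene's implementation.
final class TermCounts {

    // Hypothetical tokenizer: lowercase, then split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Number of documents that contain the term at least once.
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (new HashSet<>(tokenize(doc)).contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Total number of occurrences of the term across all documents.
    static long totalTermFreq(List<String> docs, String term) {
        long count = 0;
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A single document, as in the question.
        List<String> docs = Arrays.asList(
            "It behooves every man to freedom freedom freedom remember that the work of the");
        System.out.println("docFreq=" + docFreq(docs, "freedom"));             // prints docFreq=1
        System.out.println("totalTermFreq=" + totalTermFreq(docs, "freedom")); // prints totalTermFreq=3
    }
}
```

With only one document in the index, every term's document frequency is at
most 1, which matches the output in the question where all terms show (1).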

Mike McCandless

http://blog.mikemccandless.com


