You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Bianca Pereira <ai...@gmail.com> on 2014/08/07 13:28:38 UTC

EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Hi,

  I am new in the list and I have been working on a problem for some time
already. I would like to know if someone has any idea of how I can solve it.

 Given a term, I want to get the term frequency in a lucene document. When
I use the WhiteSpaceAnalyzer my code works properly but when I use the
EnglishAnalyzer it returns 0 as frequency for any term.

  In order to get the term appearing both as "term" or "term," in the text
the EnglishAnalyzer is the best one to be used (I suppose).

  Any help is more than welcome.

  Best Regards,
  Bianca

----------------------------
  Here is my code:

TO INDEX

public class LuceneDescriptionIndexer implements Closeable {

private IndexWriter descWriter;


 public LuceneDescriptionIndexer(Directory luceneDirectory, Analyzer
analyzer)

 throws IOException {

  openIndex(luceneDirectory, analyzer);

}

 private void openIndex(Directory directory, Analyzer analyzer) throws
IOException {

  IndexWriterConfig descIwc = new IndexWriterConfig(LuceneConfig.
INDEX_VERSION, analyzer);

  descWriter = new IndexWriter(directory, descIwc);

}

 public void indexDocument(String id, String text) throws IOException {

    IndexableField idField = new StringField("id",id,Field.Store.YES);

     FieldType fieldType = new FieldType();

    fieldType.setStoreTermVectors(true);

    fieldType.setStoreTermVectorPositions(true);

    fieldType.setIndexed(true);

    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);

    fieldType.setStored(true);



    Document doc = new Document();

    doc.add(idField);

    doc.add(new Field("description", text, fieldType));



    descWriter.addDocument(doc);

}

 @Override

public void close() throws IOException {

  descWriter.commit();

  descWriter.close();

}

}


TO QUERY

public class LuceneTermStatistics implements TermKBStatistics {


private IndexReader luceneIndexReader;

private Analyzer analyzer;

private IndexSearcher searcher;


public LuceneTermStatistics(IndexReader reader, Analyzer analyzer) {

  this.luceneIndexReader = reader;

  this.analyzer = analyzer;

  this.searcher = new IndexSearcher(reader);

}

 /**

 * Create an instance of LuceneTermStatistics from the Config options.

 */

public static LuceneTermStatistics configureInstance(String indexPath,
Analyzer analyzer)

  throws IOException {

  FSDirectory index = FSDirectory.open(new File(indexPath));

  DirectoryReader indexReader = DirectoryReader.open(index);

  return new LuceneTermStatistics(indexReader, analyzer);

}

 @Override

public int getTermFrequency(String term, String id)

 throws Exception {

   int docId = getDocId(id);

   // Get the vector with the frequency for the term in all documents

  DocsEnum de = MultiFields.getTermDocsEnum(

       luceneIndexReader, MultiFields.getLiveDocs(luceneIndexReader),
"description",

       new BytesRef(term));

   // Get the frequency for the document of interest

  if (de != null) {

      int docNo;

      while((docNo = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {

         if(docNo == docId)

           return de.freq();

       }

  }

  return 0;

}


private int getDocId(String id) throws IOException {

  BooleanQuery idQuery = new BooleanQuery();

  idQuery.add(new TermQuery(new Term("id", id)), Occur.MUST);


  TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);

  searcher.search(idQuery, collector);

   TopDocs topDocs = collector.topDocs();

  if (topDocs.totalHits == 0)

    return -1;

   return topDocs.scoreDocs[0].doc;

 }

}

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Posted by Bianca Pereira <ai...@gmail.com>.

Hi Uwe, Jack,

  Thank you very much for your answers. I will work on it.

  Regards,
  Bianca


2014-08-07 14:04 GMT+01:00 Jack Krupansky <ja...@basetechnology.com>:

> Also, usually  query-time analysis is done by a "query parser", so if you
> aren't going through a quwery parser, you have to call the aalyzer
> yourself. The stemming is very likely the culprit here.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Uwe Schindler
> Sent: Thursday, August 7, 2014 9:00 AM
> To: java-user@lucene.apache.org
> Subject: RE: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term
> Frequency
>
>
> Hi,
>
> if you create the term yourself, it is not going through the analyzer:
> public int getTermFrequency(String term, String id)
> (you create a BytesRef out of it). So you have to also let the term go
> through the analyzer. The stemming analyzers change the terms, so you won't
> find them without also stemming the term before you
>
> StandardAnalyzer does not do stemming, so terms (mostly) stay as they are.
> But also for this analyzer, you theoretically has to pass the term through
> the analyzer before you can do a term frequency lookup. Just think about
> that the term was not lowercased, in that case you would also not find it
> in the termdictionary. If it goes through the analyzer, you would find it
> because the analyzer lowercases it.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>  -----Original Message-----
>> From: Bianca Pereira [mailto:aivykarter@gmail.com]
>> Sent: Thursday, August 07, 2014 2:47 PM
>> To: java-user
>> Subject: Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term
>> Frequency
>>
>> Hi Jack,
>>
>>   Thank you very much. I just changed for the StandardAnalyzer and it is
>> working as I would like. But there is something I still cannot understand.
>> If I use the same analyzer for indexing and for searching, the same term
>> should be parsed in the same way in both moments, shouldn't it? It is why
>> I
>> still don't understand why the EnglishAnalyzer was not working. Any idea
>> on
>> that?
>>
>>   Best Regards,
>>   Bianca
>>
>>
>> 2014-08-07 12:40 GMT+01:00 Jack Krupansky <ja...@basetechnology.com>:
>>
>> > Generally, the standard analyzer will be a better choice, unless you
>> > have some special need.
>> >
>> > A language-specific analyzer will include stemming. The English
>> > analyzer includes the Porter stemmer.
>> >
>> > Generally, you need to apply a compatible analyzer to query terms to
>> > match the index, or you need to manually filter your query terms.
>> > Sounds like maybe a term got stemmed.
>> >
>> > -- Jack Krupansky
>> >
>> > -----Original Message----- From: Bianca Pereira
>> > Sent: Thursday, August 7, 2014 7:28 AM
>> > To: java-user@lucene.apache.org
>> > Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term
>> > Frequency
>> >
>> >
>> > Hi,
>> >
>> >  I am new in the list and I have been working on a problem for some
>> > time already. I would like to know if someone has any idea of how I
>> > can solve it.
>> >
>> > Given a term, I want to get the term frequency in a lucene document.
>> > When I use the WhiteSpaceAnalyzer my code works properly but when I
>> > use the EnglishAnalyzer it returns 0 as frequency for any term.
>> >
>> >  In order to get the term appearing both as "term" or "term," in the
>> > text the EnglishAnalyzer is the best one to be used (I suppose).
>> >
>> >  Any help is more than welcome.
>> >
>> >  Best Regards,
>> >  Bianca
>> >
>> > ----------------------------
>> >  Here is my code:
>> >
>> > TO INDEX
>> >
>> > public class LuceneDescriptionIndexer implements Closeable {
>> >
>> > private IndexWriter descWriter;
>> >
>> >
>> > public LuceneDescriptionIndexer(Directory luceneDirectory, Analyzer
>> > analyzer)
>> >
>> > throws IOException {
>> >
>> >  openIndex(luceneDirectory, analyzer);
>> >
>> > }
>> >
>> > private void openIndex(Directory directory, Analyzer analyzer) throws
>> > IOException {
>> >
>> >  IndexWriterConfig descIwc = new IndexWriterConfig(LuceneConfig.
>> > INDEX_VERSION, analyzer);
>> >
>> >  descWriter = new IndexWriter(directory, descIwc);
>> >
>> > }
>> >
>> > public void indexDocument(String id, String text) throws IOException {
>> >
>> >    IndexableField idField = new StringField("id",id,Field.Store.YES);
>> >
>> >     FieldType fieldType = new FieldType();
>> >
>> >    fieldType.setStoreTermVectors(true);
>> >
>> >    fieldType.setStoreTermVectorPositions(true);
>> >
>> >    fieldType.setIndexed(true);
>> >
>> >    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
>> >
>> >    fieldType.setStored(true);
>> >
>> >
>> >
>> >    Document doc = new Document();
>> >
>> >    doc.add(idField);
>> >
>> >    doc.add(new Field("description", text, fieldType));
>> >
>> >
>> >
>> >    descWriter.addDocument(doc);
>> >
>> > }
>> >
>> > @Override
>> >
>> > public void close() throws IOException {
>> >
>> >  descWriter.commit();
>> >
>> >  descWriter.close();
>> >
>> > }
>> >
>> > }
>> >
>> >
>> > TO QUERY
>> >
>> > public class LuceneTermStatistics implements TermKBStatistics {
>> >
>> >
>> > private IndexReader luceneIndexReader;
>> >
>> > private Analyzer analyzer;
>> >
>> > private IndexSearcher searcher;
>> >
>> >
>> > public LuceneTermStatistics(IndexReader reader, Analyzer analyzer) {
>> >
>> >  this.luceneIndexReader = reader;
>> >
>> >  this.analyzer = analyzer;
>> >
>> >  this.searcher = new IndexSearcher(reader);
>> >
>> > }
>> >
>> > /**
>> >
>> > * Create an instance of LuceneTermStatistics from the Config options.
>> >
>> > */
>> >
>> > public static LuceneTermStatistics configureInstance(String indexPath,
>> > Analyzer analyzer)
>> >
>> >  throws IOException {
>> >
>> >  FSDirectory index = FSDirectory.open(new File(indexPath));
>> >
>> >  DirectoryReader indexReader = DirectoryReader.open(index);
>> >
>> >  return new LuceneTermStatistics(indexReader, analyzer);
>> >
>> > }
>> >
>> > @Override
>> >
>> > public int getTermFrequency(String term, String id)
>> >
>> > throws Exception {
>> >
>> >   int docId = getDocId(id);
>> >
>> >   // Get the vector with the frequency for the term in all documents
>> >
>> >  DocsEnum de = MultiFields.getTermDocsEnum(
>> >
>> >       luceneIndexReader, MultiFields.getLiveDocs(luceneIndexReader),
>> > "description",
>> >
>> >       new BytesRef(term));
>> >
>> >   // Get the frequency for the document of interest
>> >
>> >  if (de != null) {
>> >
>> >      int docNo;
>> >
>> >      while((docNo = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
>> >
>> >         if(docNo == docId)
>> >
>> >           return de.freq();
>> >
>> >       }
>> >
>> >  }
>> >
>> >  return 0;
>> >
>> > }
>> >
>> >
>> > private int getDocId(String id) throws IOException {
>> >
>> >  BooleanQuery idQuery = new BooleanQuery();
>> >
>> >  idQuery.add(new TermQuery(new Term("id", id)), Occur.MUST);
>> >
>> >
>> >  TopScoreDocCollector collector = TopScoreDocCollector.create(1,
>> > false);
>> >
>> >  searcher.search(idQuery, collector);
>> >
>> >   TopDocs topDocs = collector.topDocs();
>> >
>> >  if (topDocs.totalHits == 0)
>> >
>> >    return -1;
>> >
>> >   return topDocs.scoreDocs[0].doc;
>> >
>> > }
>> >
>> > }
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Posted by Jack Krupansky <ja...@basetechnology.com>.

Also, usually  query-time analysis is done by a "query parser", so if you 
aren't going through a quwery parser, you have to call the aalyzer yourself. 
The stemming is very likely the culprit here.

-- Jack Krupansky

-----Original Message----- 
From: Uwe Schindler
Sent: Thursday, August 7, 2014 9:00 AM
To: java-user@lucene.apache.org
Subject: RE: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Hi,

if you create the term yourself, it is not going through the analyzer: 
public int getTermFrequency(String term, String id)
(you create a BytesRef out of it). So you have to also let the term go 
through the analyzer. The stemming analyzers change the terms, so you won't 
find them without also stemming the term before you

StandardAnalyzer does not do stemming, so terms (mostly) stay as they are. 
But also for this analyzer, you theoretically has to pass the term through 
the analyzer before you can do a term frequency lookup. Just think about 
that the term was not lowercased, in that case you would also not find it in 
the termdictionary. If it goes through the analyzer, you would find it 
because the analyzer lowercases it.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Bianca Pereira [mailto:aivykarter@gmail.com]
> Sent: Thursday, August 07, 2014 2:47 PM
> To: java-user
> Subject: Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term
> Frequency
>
> Hi Jack,
>
>   Thank you very much. I just changed for the StandardAnalyzer and it is
> working as I would like. But there is something I still cannot understand.
> If I use the same analyzer for indexing and for searching, the same term
> should be parsed in the same way in both moments, shouldn't it? It is why 
> I
> still don't understand why the EnglishAnalyzer was not working. Any idea 
> on
> that?
>
>   Best Regards,
>   Bianca
>
>
> 2014-08-07 12:40 GMT+01:00 Jack Krupansky <ja...@basetechnology.com>:
>
> > Generally, the standard analyzer will be a better choice, unless you
> > have some special need.
> >
> > A language-specific analyzer will include stemming. The English
> > analyzer includes the Porter stemmer.
> >
> > Generally, you need to apply a compatible analyzer to query terms to
> > match the index, or you need to manually filter your query terms.
> > Sounds like maybe a term got stemmed.
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Bianca Pereira
> > Sent: Thursday, August 7, 2014 7:28 AM
> > To: java-user@lucene.apache.org
> > Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term
> > Frequency
> >
> >
> > Hi,
> >
> >  I am new in the list and I have been working on a problem for some
> > time already. I would like to know if someone has any idea of how I
> > can solve it.
> >
> > Given a term, I want to get the term frequency in a lucene document.
> > When I use the WhiteSpaceAnalyzer my code works properly but when I
> > use the EnglishAnalyzer it returns 0 as frequency for any term.
> >
> >  In order to get the term appearing both as "term" or "term," in the
> > text the EnglishAnalyzer is the best one to be used (I suppose).
> >
> >  Any help is more than welcome.
> >
> >  Best Regards,
> >  Bianca
> >
> > ----------------------------
> >  Here is my code:
> >
> > TO INDEX
> >
> > public class LuceneDescriptionIndexer implements Closeable {
> >
> > private IndexWriter descWriter;
> >
> >
> > public LuceneDescriptionIndexer(Directory luceneDirectory, Analyzer
> > analyzer)
> >
> > throws IOException {
> >
> >  openIndex(luceneDirectory, analyzer);
> >
> > }
> >
> > private void openIndex(Directory directory, Analyzer analyzer) throws
> > IOException {
> >
> >  IndexWriterConfig descIwc = new IndexWriterConfig(LuceneConfig.
> > INDEX_VERSION, analyzer);
> >
> >  descWriter = new IndexWriter(directory, descIwc);
> >
> > }
> >
> > public void indexDocument(String id, String text) throws IOException {
> >
> >    IndexableField idField = new StringField("id",id,Field.Store.YES);
> >
> >     FieldType fieldType = new FieldType();
> >
> >    fieldType.setStoreTermVectors(true);
> >
> >    fieldType.setStoreTermVectorPositions(true);
> >
> >    fieldType.setIndexed(true);
> >
> >    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
> >
> >    fieldType.setStored(true);
> >
> >
> >
> >    Document doc = new Document();
> >
> >    doc.add(idField);
> >
> >    doc.add(new Field("description", text, fieldType));
> >
> >
> >
> >    descWriter.addDocument(doc);
> >
> > }
> >
> > @Override
> >
> > public void close() throws IOException {
> >
> >  descWriter.commit();
> >
> >  descWriter.close();
> >
> > }
> >
> > }
> >
> >
> > TO QUERY
> >
> > public class LuceneTermStatistics implements TermKBStatistics {
> >
> >
> > private IndexReader luceneIndexReader;
> >
> > private Analyzer analyzer;
> >
> > private IndexSearcher searcher;
> >
> >
> > public LuceneTermStatistics(IndexReader reader, Analyzer analyzer) {
> >
> >  this.luceneIndexReader = reader;
> >
> >  this.analyzer = analyzer;
> >
> >  this.searcher = new IndexSearcher(reader);
> >
> > }
> >
> > /**
> >
> > * Create an instance of LuceneTermStatistics from the Config options.
> >
> > */
> >
> > public static LuceneTermStatistics configureInstance(String indexPath,
> > Analyzer analyzer)
> >
> >  throws IOException {
> >
> >  FSDirectory index = FSDirectory.open(new File(indexPath));
> >
> >  DirectoryReader indexReader = DirectoryReader.open(index);
> >
> >  return new LuceneTermStatistics(indexReader, analyzer);
> >
> > }
> >
> > @Override
> >
> > public int getTermFrequency(String term, String id)
> >
> > throws Exception {
> >
> >   int docId = getDocId(id);
> >
> >   // Get the vector with the frequency for the term in all documents
> >
> >  DocsEnum de = MultiFields.getTermDocsEnum(
> >
> >       luceneIndexReader, MultiFields.getLiveDocs(luceneIndexReader),
> > "description",
> >
> >       new BytesRef(term));
> >
> >   // Get the frequency for the document of interest
> >
> >  if (de != null) {
> >
> >      int docNo;
> >
> >      while((docNo = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
> >
> >         if(docNo == docId)
> >
> >           return de.freq();
> >
> >       }
> >
> >  }
> >
> >  return 0;
> >
> > }
> >
> >
> > private int getDocId(String id) throws IOException {
> >
> >  BooleanQuery idQuery = new BooleanQuery();
> >
> >  idQuery.add(new TermQuery(new Term("id", id)), Occur.MUST);
> >
> >
> >  TopScoreDocCollector collector = TopScoreDocCollector.create(1,
> > false);
> >
> >  searcher.search(idQuery, collector);
> >
> >   TopDocs topDocs = collector.topDocs();
> >
> >  if (topDocs.totalHits == 0)
> >
> >    return -1;
> >
> >   return topDocs.scoreDocs[0].doc;
> >
> > }
> >
> > }
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

if you create the term yourself, it is not going through the analyzer: public int getTermFrequency(String term, String id)
(you create a BytesRef out of it). So you have to also let the term go through the analyzer. The stemming analyzers change the terms, so you won't find them without also stemming the term before you

StandardAnalyzer does not do stemming, so terms (mostly) stay as they are. But also for this analyzer, you theoretically has to pass the term through the analyzer before you can do a term frequency lookup. Just think about that the term was not lowercased, in that case you would also not find it in the termdictionary. If it goes through the analyzer, you would find it because the analyzer lowercases it.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Bianca Pereira [mailto:aivykarter@gmail.com]
> Sent: Thursday, August 07, 2014 2:47 PM
> To: java-user
> Subject: Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term
> Frequency
> 
> Hi Jack,
> 
>   Thank you very much. I just changed for the StandardAnalyzer and it is
> working as I would like. But there is something I still cannot understand.
> If I use the same analyzer for indexing and for searching, the same term
> should be parsed in the same way in both moments, shouldn't it? It is why I
> still don't understand why the EnglishAnalyzer was not working. Any idea on
> that?
> 
>   Best Regards,
>   Bianca
> 
> 
> 2014-08-07 12:40 GMT+01:00 Jack Krupansky <ja...@basetechnology.com>:
> 
> > Generally, the standard analyzer will be a better choice, unless you
> > have some special need.
> >
> > A language-specific analyzer will include stemming. The English
> > analyzer includes the Porter stemmer.
> >
> > Generally, you need to apply a compatible analyzer to query terms to
> > match the index, or you need to manually filter your query terms.
> > Sounds like maybe a term got stemmed.
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Bianca Pereira
> > Sent: Thursday, August 7, 2014 7:28 AM
> > To: java-user@lucene.apache.org
> > Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term
> > Frequency
> >
> >
> > Hi,
> >
> >  I am new in the list and I have been working on a problem for some
> > time already. I would like to know if someone has any idea of how I
> > can solve it.
> >
> > Given a term, I want to get the term frequency in a lucene document.
> > When I use the WhiteSpaceAnalyzer my code works properly but when I
> > use the EnglishAnalyzer it returns 0 as frequency for any term.
> >
> >  In order to get the term appearing both as "term" or "term," in the
> > text the EnglishAnalyzer is the best one to be used (I suppose).
> >
> >  Any help is more than welcome.
> >
> >  Best Regards,
> >  Bianca
> >
> > ----------------------------
> >  Here is my code:
> >
> > TO INDEX
> >
> > public class LuceneDescriptionIndexer implements Closeable {
> >
> > private IndexWriter descWriter;
> >
> >
> > public LuceneDescriptionIndexer(Directory luceneDirectory, Analyzer
> > analyzer)
> >
> > throws IOException {
> >
> >  openIndex(luceneDirectory, analyzer);
> >
> > }
> >
> > private void openIndex(Directory directory, Analyzer analyzer) throws
> > IOException {
> >
> >  IndexWriterConfig descIwc = new IndexWriterConfig(LuceneConfig.
> > INDEX_VERSION, analyzer);
> >
> >  descWriter = new IndexWriter(directory, descIwc);
> >
> > }
> >
> > public void indexDocument(String id, String text) throws IOException {
> >
> >    IndexableField idField = new StringField("id",id,Field.Store.YES);
> >
> >     FieldType fieldType = new FieldType();
> >
> >    fieldType.setStoreTermVectors(true);
> >
> >    fieldType.setStoreTermVectorPositions(true);
> >
> >    fieldType.setIndexed(true);
> >
> >    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
> >
> >    fieldType.setStored(true);
> >
> >
> >
> >    Document doc = new Document();
> >
> >    doc.add(idField);
> >
> >    doc.add(new Field("description", text, fieldType));
> >
> >
> >
> >    descWriter.addDocument(doc);
> >
> > }
> >
> > @Override
> >
> > public void close() throws IOException {
> >
> >  descWriter.commit();
> >
> >  descWriter.close();
> >
> > }
> >
> > }
> >
> >
> > TO QUERY
> >
> > public class LuceneTermStatistics implements TermKBStatistics {
> >
> >
> > private IndexReader luceneIndexReader;
> >
> > private Analyzer analyzer;
> >
> > private IndexSearcher searcher;
> >
> >
> > public LuceneTermStatistics(IndexReader reader, Analyzer analyzer) {
> >
> >  this.luceneIndexReader = reader;
> >
> >  this.analyzer = analyzer;
> >
> >  this.searcher = new IndexSearcher(reader);
> >
> > }
> >
> > /**
> >
> > * Create an instance of LuceneTermStatistics from the Config options.
> >
> > */
> >
> > public static LuceneTermStatistics configureInstance(String indexPath,
> > Analyzer analyzer)
> >
> >  throws IOException {
> >
> >  FSDirectory index = FSDirectory.open(new File(indexPath));
> >
> >  DirectoryReader indexReader = DirectoryReader.open(index);
> >
> >  return new LuceneTermStatistics(indexReader, analyzer);
> >
> > }
> >
> > @Override
> >
> > public int getTermFrequency(String term, String id)
> >
> > throws Exception {
> >
> >   int docId = getDocId(id);
> >
> >   // Get the vector with the frequency for the term in all documents
> >
> >  DocsEnum de = MultiFields.getTermDocsEnum(
> >
> >       luceneIndexReader, MultiFields.getLiveDocs(luceneIndexReader),
> > "description",
> >
> >       new BytesRef(term));
> >
> >   // Get the frequency for the document of interest
> >
> >  if (de != null) {
> >
> >      int docNo;
> >
> >      while((docNo = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
> >
> >         if(docNo == docId)
> >
> >           return de.freq();
> >
> >       }
> >
> >  }
> >
> >  return 0;
> >
> > }
> >
> >
> > private int getDocId(String id) throws IOException {
> >
> >  BooleanQuery idQuery = new BooleanQuery();
> >
> >  idQuery.add(new TermQuery(new Term("id", id)), Occur.MUST);
> >
> >
> >  TopScoreDocCollector collector = TopScoreDocCollector.create(1,
> > false);
> >
> >  searcher.search(idQuery, collector);
> >
> >   TopDocs topDocs = collector.topDocs();
> >
> >  if (topDocs.totalHits == 0)
> >
> >    return -1;
> >
> >   return topDocs.scoreDocs[0].doc;
> >
> > }
> >
> > }
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Posted by Bianca Pereira <ai...@gmail.com>.

Hi Jack,

  Thank you very much. I just changed for the StandardAnalyzer and it is
working as I would like. But there is something I still cannot understand.
If I use the same analyzer for indexing and for searching, the same term
should be parsed in the same way in both moments, shouldn't it? It is why I
still don't understand why the EnglishAnalyzer was not working. Any idea on
that?

  Best Regards,
  Bianca


2014-08-07 12:40 GMT+01:00 Jack Krupansky <ja...@basetechnology.com>:

> Generally, the standard analyzer will be a better choice, unless you have
> some special need.
>
> A language-specific analyzer will include stemming. The English analyzer
> includes the Porter stemmer.
>
> Generally, you need to apply a compatible analyzer to query terms to match
> the index, or you need to manually filter your query terms. Sounds like
> maybe a term got stemmed.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Bianca Pereira
> Sent: Thursday, August 7, 2014 7:28 AM
> To: java-user@lucene.apache.org
> Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency
>
>
> Hi,
>
>  I am new in the list and I have been working on a problem for some time
> already. I would like to know if someone has any idea of how I can solve
> it.
>
> Given a term, I want to get the term frequency in a lucene document. When
> I use the WhiteSpaceAnalyzer my code works properly but when I use the
> EnglishAnalyzer it returns 0 as frequency for any term.
>
>  In order to get the term appearing both as "term" or "term," in the text
> the EnglishAnalyzer is the best one to be used (I suppose).
>
>  Any help is more than welcome.
>
>  Best Regards,
>  Bianca
>
> ----------------------------
>  Here is my code:
>
> TO INDEX
>
> public class LuceneDescriptionIndexer implements Closeable {
>
> private IndexWriter descWriter;
>
>
> public LuceneDescriptionIndexer(Directory luceneDirectory, Analyzer
> analyzer)
>
> throws IOException {
>
>  openIndex(luceneDirectory, analyzer);
>
> }
>
> private void openIndex(Directory directory, Analyzer analyzer) throws
> IOException {
>
>  IndexWriterConfig descIwc = new IndexWriterConfig(LuceneConfig.
> INDEX_VERSION, analyzer);
>
>  descWriter = new IndexWriter(directory, descIwc);
>
> }
>
> public void indexDocument(String id, String text) throws IOException {
>
>    IndexableField idField = new StringField("id",id,Field.Store.YES);
>
>     FieldType fieldType = new FieldType();
>
>    fieldType.setStoreTermVectors(true);
>
>    fieldType.setStoreTermVectorPositions(true);
>
>    fieldType.setIndexed(true);
>
>    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
>
>    fieldType.setStored(true);
>
>
>
>    Document doc = new Document();
>
>    doc.add(idField);
>
>    doc.add(new Field("description", text, fieldType));
>
>
>
>    descWriter.addDocument(doc);
>
> }
>
> @Override
>
> public void close() throws IOException {
>
>  descWriter.commit();
>
>  descWriter.close();
>
> }
>
> }
>
>
> TO QUERY
>
> public class LuceneTermStatistics implements TermKBStatistics {
>
>
> private IndexReader luceneIndexReader;
>
> private Analyzer analyzer;
>
> private IndexSearcher searcher;
>
>
> public LuceneTermStatistics(IndexReader reader, Analyzer analyzer) {
>
>  this.luceneIndexReader = reader;
>
>  this.analyzer = analyzer;
>
>  this.searcher = new IndexSearcher(reader);
>
> }
>
> /**
>
> * Create an instance of LuceneTermStatistics from the Config options.
>
> */
>
> public static LuceneTermStatistics configureInstance(String indexPath,
> Analyzer analyzer)
>
>  throws IOException {
>
>  FSDirectory index = FSDirectory.open(new File(indexPath));
>
>  DirectoryReader indexReader = DirectoryReader.open(index);
>
>  return new LuceneTermStatistics(indexReader, analyzer);
>
> }
>
> @Override
>
> public int getTermFrequency(String term, String id)
>
> throws Exception {
>
>   int docId = getDocId(id);
>
>   // Get the vector with the frequency for the term in all documents
>
>  DocsEnum de = MultiFields.getTermDocsEnum(
>
>       luceneIndexReader, MultiFields.getLiveDocs(luceneIndexReader),
> "description",
>
>       new BytesRef(term));
>
>   // Get the frequency for the document of interest
>
>  if (de != null) {
>
>      int docNo;
>
>      while((docNo = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
>
>         if(docNo == docId)
>
>           return de.freq();
>
>       }
>
>  }
>
>  return 0;
>
> }
>
>
> private int getDocId(String id) throws IOException {
>
>  BooleanQuery idQuery = new BooleanQuery();
>
>  idQuery.add(new TermQuery(new Term("id", id)), Occur.MUST);
>
>
>  TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>
>  searcher.search(idQuery, collector);
>
>   TopDocs topDocs = collector.topDocs();
>
>  if (topDocs.totalHits == 0)
>
>    return -1;
>
>   return topDocs.scoreDocs[0].doc;
>
> }
>
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Posted by Jack Krupansky <ja...@basetechnology.com>.

Generally, the standard analyzer will be a better choice, unless you have 
some special need.

A language-specific analyzer will include stemming. The English analyzer 
includes the Porter stemmer.

Generally, you need to apply a compatible analyzer to query terms to match 
the index, or you need to manually filter your query terms. Sounds like 
maybe a term got stemmed.

-- Jack Krupansky

-----Original Message----- 
From: Bianca Pereira
Sent: Thursday, August 7, 2014 7:28 AM
To: java-user@lucene.apache.org
Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

Hi,

  I am new in the list and I have been working on a problem for some time
already. I would like to know if someone has any idea of how I can solve it.

Given a term, I want to get the term frequency in a lucene document. When
I use the WhiteSpaceAnalyzer my code works properly but when I use the
EnglishAnalyzer it returns 0 as frequency for any term.

  In order to get the term appearing both as "term" or "term," in the text
the EnglishAnalyzer is the best one to be used (I suppose).

  Any help is more than welcome.

  Best Regards,
  Bianca

----------------------------
  Here is my code:

TO INDEX

public class LuceneDescriptionIndexer implements Closeable {

private IndexWriter descWriter;


public LuceneDescriptionIndexer(Directory luceneDirectory, Analyzer
analyzer)

throws IOException {

  openIndex(luceneDirectory, analyzer);

}

private void openIndex(Directory directory, Analyzer analyzer) throws
IOException {

  IndexWriterConfig descIwc = new IndexWriterConfig(LuceneConfig.
INDEX_VERSION, analyzer);

  descWriter = new IndexWriter(directory, descIwc);

}

public void indexDocument(String id, String text) throws IOException {

    IndexableField idField = new StringField("id",id,Field.Store.YES);

     FieldType fieldType = new FieldType();

    fieldType.setStoreTermVectors(true);

    fieldType.setStoreTermVectorPositions(true);

    fieldType.setIndexed(true);

    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);

    fieldType.setStored(true);



    Document doc = new Document();

    doc.add(idField);

    doc.add(new Field("description", text, fieldType));



    descWriter.addDocument(doc);

}

@Override

public void close() throws IOException {

  descWriter.commit();

  descWriter.close();

}

}


TO QUERY

public class LuceneTermStatistics implements TermKBStatistics {


private IndexReader luceneIndexReader;

private Analyzer analyzer;

private IndexSearcher searcher;


public LuceneTermStatistics(IndexReader reader, Analyzer analyzer) {

  this.luceneIndexReader = reader;

  this.analyzer = analyzer;

  this.searcher = new IndexSearcher(reader);

}

/**

* Create an instance of LuceneTermStatistics from the Config options.

*/

public static LuceneTermStatistics configureInstance(String indexPath,
Analyzer analyzer)

  throws IOException {

  FSDirectory index = FSDirectory.open(new File(indexPath));

  DirectoryReader indexReader = DirectoryReader.open(index);

  return new LuceneTermStatistics(indexReader, analyzer);

}

@Override

public int getTermFrequency(String term, String id)

throws Exception {

   int docId = getDocId(id);

   // Get the vector with the frequency for the term in all documents

  DocsEnum de = MultiFields.getTermDocsEnum(

       luceneIndexReader, MultiFields.getLiveDocs(luceneIndexReader),
"description",

       new BytesRef(term));

   // Get the frequency for the document of interest

  if (de != null) {

      int docNo;

      while((docNo = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {

         if(docNo == docId)

           return de.freq();

       }

  }

  return 0;

}


private int getDocId(String id) throws IOException {

  BooleanQuery idQuery = new BooleanQuery();

  idQuery.add(new TermQuery(new Term("id", id)), Occur.MUST);


  TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);

  searcher.search(idQuery, collector);

   TopDocs topDocs = collector.topDocs();

  if (topDocs.totalHits == 0)

    return -1;

   return topDocs.scoreDocs[0].doc;

}

} 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org