Posted to solr-user@lucene.apache.org by Yewint Ko <ye...@gmail.com> on 2015/10/10 21:27:52 UTC

Using SimpleNaiveBayesClassifier in solr

Hi

I am trying to use SimpleNaiveBayesClassifier in my Solr project. I am
currently looking at its test case, ClassificationTestBase.java.

The code below seems to make the classifier read the whole index to train
the model every time an input document is classified. Or am I
misunderstanding something here? If I have a large index, will this hurt
performance?

protected void checkCorrectClassification(Classifier<T> classifier, String inputDoc,
    T expectedResult, Analyzer analyzer, String textFieldName, String classFieldName,
    Query query) throws Exception {
  AtomicReader atomicReader = null;
  try {
    populateSampleIndex(analyzer);
    atomicReader = SlowCompositeReaderWrapper.wrap(indexWriter.getReader());
    classifier.train(atomicReader, textFieldName, classFieldName, analyzer, query);
    ClassificationResult<T> classificationResult = classifier.assignClass(inputDoc);
    assertNotNull(classificationResult.getAssignedClass());
    assertEquals("got an assigned class of " + classificationResult.getAssignedClass(),
        expectedResult, classificationResult.getAssignedClass());
    assertTrue("got a not positive score " + classificationResult.getScore(),
        classificationResult.getScore() > 0);
  } finally {
    if (atomicReader != null)
      atomicReader.close();
  }
}

Re: Using SimpleNaiveBayesClassifier in solr

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Yewint,

the SNB classifier is not an online one, so you have to retrain it every
time you want to update it.
What you pass to the Classifier is a reader, so you should make sure that it
stays accessible (don't close it) for classification to work.
Regarding performance, SNB becomes slower as the number of classes (labels)
increases, since the naive Bayes algorithm scans through all the classes
and chooses the one with the highest probability.
Depending on how big your index is, you might want to make the classifier
use an index that's not accessed by other Lucene / Solr threads, to avoid
impacting those other processes (e.g. indexing / search).
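
To illustrate the two points above (batch retraining, and a cost that grows
with the number of classes), here is a minimal standalone sketch in plain
Java. It is not Lucene's SimpleNaiveBayesClassifier and uses no Lucene API:
the class NaiveBayesSketch, its train/classify methods, the whitespace
tokenizer and the add-one smoothing are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial naive Bayes (hypothetical; not the Lucene implementation). */
class NaiveBayesSketch {
    // class label -> (term -> number of occurrences in that class)
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Batch training: drops the previous model and rebuilds it from all examples.
     *  Each example is a pair {label, text}. */
    void train(String[][] labelledDocs) {
        termCounts.clear();
        docsPerClass.clear();
        tokensPerClass.clear();
        vocabulary.clear();
        totalDocs = 0;
        for (String[] example : labelledDocs) {
            String label = example[0];
            docsPerClass.merge(label, 1, Integer::sum);
            totalDocs++;
            Map<String, Integer> counts =
                termCounts.computeIfAbsent(label, k -> new HashMap<>());
            for (String term : tokenize(example[1])) {
                counts.merge(term, 1, Integer::sum);
                tokensPerClass.merge(label, 1, Integer::sum);
                vocabulary.add(term);
            }
        }
    }

    /** Scores every known class and returns the most probable one, so the
     *  work per call grows with the number of classes. */
    String classify(String text) {
        String[] terms = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // log prior of the class
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            double denominator = tokensPerClass.getOrDefault(label, 0) + vocabulary.size();
            for (String term : terms) {
                // add-one (Laplace) smoothed log likelihood of the term
                score += Math.log((counts.getOrDefault(term, 0) + 1) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    public static void main(String[] args) {
        NaiveBayesSketch classifier = new NaiveBayesSketch();
        // Train once, then classify as many documents as needed.
        classifier.train(new String[][] {
            {"sports", "the team won the match"},
            {"sports", "great goal in the football match"},
            {"tech", "new phone with fast processor"},
            {"tech", "software update for the laptop"},
        });
        System.out.println(classifier.classify("football team"));
        System.out.println(classifier.classify("fast laptop"));
    }
}
```

The design point is that train() is called once up front and classify() is
then called per document; new labelled data only takes effect after train()
is run again over the full set.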

Hope this helps, if you have any further questions just ask.

Regards,
Tommaso


