
Posted to solr-user@lucene.apache.org by "Davis, Daniel (NIH/NLM) [C]" <da...@nih.gov> on 2016/11/15 15:31:23 UTC

Measuring the entropy of a field

Does Lucene/Solr include any tools for measuring the entropy (information content) of a field? My intuition is that this would only work if the field were single-valued and the analysis identified characters rather than tokens. Also, Unicode does throw a wrench in it: I suppose such a thing would also need a set of expected symbols, since by entropy I mean entropy measured against ASCII or Latin-1.
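
For concreteness, here is a minimal Python sketch of the kind of measurement meant above: the empirical Shannon entropy of one field value, counted per character. It is not a Lucene/Solr API. The function names and the alphabet_size parameter are made up for illustration, and normalizing against a fixed alphabet (128 for ASCII, 256 for Latin-1) is just one assumed way to handle the "expected symbols" question.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Empirical Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    # Counter over a str counts Unicode code points, not bytes or graphemes.
    counts = Counter(text)
    total = len(text)
    # Sum of p * log2(1/p) over the observed characters.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def normalized_entropy(text: str, alphabet_size: int = 128) -> float:
    """Entropy as a fraction of the maximum for an assumed alphabet.

    alphabet_size=128 assumes ASCII; pass 256 for Latin-1 (hypothetical
    parameter, chosen here only to illustrate the idea).
    """
    return char_entropy(text) / math.log2(alphabet_size)


if __name__ == "__main__":
    print(char_entropy("aabb"))
    print(normalized_entropy("aabb"))
```

Because the counting is per code point, a value containing characters outside the assumed alphabet can push the normalized figure above what that alphabet would allow, which is one way the Unicode problem shows up.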

Just curious here. I have no problem to solve, and you guys are experts at doing this sort of thing in Java, so if there are other libraries or corners of OpenNLP that address this, let me know. I know more at this point about tackling this stuff from Python.

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH

Re: Measuring the entropy of a field

Posted by Joel Bernstein <jo...@gmail.com>.

You may be interested in:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/IGainTermsQParserPlugin.java

This iterates through a training set and scores terms in a text field using
Information Gain. You'll see entropy calculations in the implementation.
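
As a rough illustration of the idea, the textbook form of information gain for one term over a two-class training set can be sketched in Python as below. This is a generic formulation, not the code from IGainTermsQParserPlugin; the function and parameter names are hypothetical, and the four counts are assumed to be document counts per class with and without the term.

```python
import math


def binary_entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a two-class split with `pos` and `neg` counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for n in (pos, neg):
        if n:  # skip empty classes; p * log2(1/p) tends to 0 as p tends to 0
            p = n / total
            h += p * math.log2(1 / p)
    return h


def information_gain(pos_with: int, neg_with: int,
                     pos_without: int, neg_without: int) -> float:
    """IG = H(class) - H(class | term present or absent)."""
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    total = n_with + n_without
    if total == 0:
        return 0.0
    # Entropy of the class labels before looking at the term.
    h_class = binary_entropy(pos_with + pos_without, neg_with + neg_without)
    # Weighted entropy of the class labels after splitting on the term.
    h_cond = ((n_with / total) * binary_entropy(pos_with, neg_with)
              + (n_without / total) * binary_entropy(pos_without, neg_without))
    return h_class - h_cond


if __name__ == "__main__":
    # A term that appears only in positive documents separates the classes.
    print(information_gain(5, 0, 0, 5))
```

A term that splits the classes perfectly scores the full class entropy, while a term that is equally common in both classes scores zero.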

It was developed as part of this ticket:

https://issues.apache.org/jira/browse/SOLR-9252

This may not be your use case, but it can provide an example of how to plug
an algorithm into Solr.

Also if you can provide details about your use case perhaps we can add the
feature. We are looking to add more useful algorithms.



Joel Bernstein
http://joelsolr.blogspot.com/
