Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2016/07/09 23:46:11 UTC

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

     [ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-2899:
-------------------------------
    Attachment: LUCENE-2899.patch

Patch.  I took [~lancenorskog]'s latest patch and did the following (among other modernization/cleanup stuff):

* Upgraded to the latest OpenNLP release version (1.6.0)
* Moved the analysis factories, along with their tests and test data, into the Lucene module at {{lucene/analysis/opennlp/}}
* Removed the Solr contrib, since it only contained the analysis factories
* Added SPI files for the analysis factories
* Extended {{BaseTokenStreamTestCase.assertTokenStreamContents()}} to test payload contents.
* Converted the analysis tests to use the above method instead of the home-grown assertions used by the tests in the previous patch.
* Converted the test model creation shell script into an Ant target named {{train-test-models}}.  I've run that target and included the binary models it produced in this patch.
* Added IntelliJ and Maven config
* Added license and checksum files for the two OpenNLP dependencies
* Included a dependency on the Lucene {{opennlp}} module in the Solr {{analysis-extras}} contrib, so that the module's jar and its dependencies are shipped with the distribution.

All the module's tests pass for me.

I manually tested the Solr integration:

* built the distribution and unpacked it 
* cloned the data driven configs and modified {{solrconfig.xml}} to add {{<lib>}} elements for the two directories containing the necessary jars
* downloaded English binary models from OpenNLP's sourceforge site
* via the schema api, added a field type that performs sentence splitting, tokenization and POS tagging, and a field using it
* added a simple doc with the opennlp-invoking field via curl and the {{update/json/docs}} handler
* searched using the admin UI
* pasted random text into the Admin UI's analysis pane, which shows payloads with POS tags (as hex bytes...)
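
For anyone repeating the last step: the hex bytes in the analysis pane can be turned back into readable tags with a few lines of code. This is only a minimal sketch. It assumes the payload holds the UTF-8 bytes of the tag text and that the pane prints them as space-separated hex pairs, optionally wrapped in square brackets. The helper name {{decode_hex_payload}} is made up for illustration and is not part of the patch.

```python
def decode_hex_payload(hex_dump: str) -> str:
    """Decode a hex dump such as '4e 4e 50' or '[4e 4e 50]' into text.

    Assumption: the payload is the UTF-8 encoding of the POS tag.
    """
    # Drop optional surrounding brackets; bytes.fromhex() ignores the spaces.
    cleaned = hex_dump.strip().strip("[]")
    return bytes.fromhex(cleaned).decode("utf-8")


if __name__ == "__main__":
    for dump in ("4e 4e 50", "[56 42 5a]"):
        print(dump, "->", decode_hex_payload(dump))
```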

Left to do prior to committability, IMHO, in no particular order:

* I think {{OpenNLPFilter}} should be broken up into a separate component for each of the things it can do.
* I agree with [~rcmuir] that the tagging functionality here should be converted to use token type instead of payloads.  Then the included {{FilterPayloadsFilter}} won't be necessary (since the {{TypeTokenFilter}} has the same functionality for token types), and probably the included {{StripPayloadsFilter}} won't be necessary either, since populating payloads would likely only be done as a final step in the analysis chain (e.g. via {{TypeAsPayloadTokenFilter}}).
* Convert the NER support to be a Solr update processor.
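
To make the token-type item above concrete, here is a toy model of the proposed chain. It is not Lucene code: {{Token}} and the three generator functions are hypothetical stand-ins for a POS tagging filter, {{TypeTokenFilter}} and {{TypeAsPayloadTokenFilter}}, and the tags are assigned by hand rather than by an OpenNLP model. It only illustrates the ordering argument: tag into the type, filter on the type, and copy the type into a payload as the last step if payloads are wanted at all.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass
class Token:
    term: str
    type: str = "word"
    payload: Optional[bytes] = None


def pos_tag_as_type(tokens: Iterable[Token], tags: Iterable[str]) -> Iterator[Token]:
    """Store each (hand-assigned) POS tag in the token type."""
    for token, tag in zip(tokens, tags):
        yield Token(token.term, tag, token.payload)


def filter_by_type(tokens: Iterable[Token], types: Set[str],
                   use_whitelist: bool = False) -> Iterator[Token]:
    """Drop tokens whose type is listed, or keep only those if use_whitelist."""
    for token in tokens:
        if (token.type in types) == use_whitelist:
            yield token


def type_as_payload(tokens: Iterable[Token]) -> Iterator[Token]:
    """Final step: copy the type into the payload as UTF-8 bytes."""
    for token in tokens:
        yield Token(token.term, token.type, token.type.encode("utf-8"))


if __name__ == "__main__":
    words = [Token("Lucene"), Token("indexes"), Token("documents")]
    tagged = pos_tag_as_type(words, ["NNP", "VBZ", "NNS"])
    nouns = filter_by_type(tagged, {"NNP", "NNS"}, use_whitelist=True)
    for token in type_as_payload(nouns):
        print(token)
```

In this arrangement nothing upstream of the last step ever reads or strips a payload, which is why the payload-specific filters would become redundant.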

Not sure it should prevent the current state from being committed, but: incorporating {{SegmentingTokenizerBase}} (extended by {{ThaiTokenizer}} and {{HMMChineseTokenizer}}) might be a useful improvement to the sentence-breaking/tokenization strategy currently used by {{OpenNLPTokenizer}}.
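
For readers unfamiliar with that base class, the general two-stage idea (break text into sentences first, then split each sentence into words while keeping offsets relative to the whole text) can be sketched as follows. This is a toy illustration only: the regular expressions are crude stand-ins for a real sentence breaker and word tokenizer, the function names are invented, and none of this reflects how {{OpenNLPTokenizer}} in the patch is implemented.

```python
import re
from typing import Iterator, Tuple

# Crude stand-ins (assumptions): a sentence runs up to and including
# its terminators plus trailing whitespace; a word is a run of \w.
SENTENCE = re.compile(r"[^.!?]+[.!?]*\s*")
WORD = re.compile(r"\w+")


def sentences(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (start offset, sentence text) pairs."""
    for match in SENTENCE.finditer(text):
        yield match.start(), match.group()


def tokenize(text: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (term, start, end) with offsets relative to the full text."""
    for sentence_start, sentence in sentences(text):
        for match in WORD.finditer(sentence):
            yield (match.group(),
                   sentence_start + match.start(),
                   sentence_start + match.end())


if __name__ == "__main__":
    text = "Solr ships it. Lucene hosts it."
    for term, start, end in tokenize(text):
        print(term, start, end)
```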

> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
