You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jena.apache.org by xristy <gi...@git.apache.org> on 2017/04/23 16:28:21 UTC

[GitHub] jena pull request #246: Generic text analyzers

GitHub user xristy opened a pull request:

    https://github.com/apache/jena/pull/246

    Generic text analyzers

    This PR implements JENA-1326 and includes some tests.
    
    Updates to [text-query.mdtext.zip](https://github.com/apache/jena/files/949831/text-query.mdtext.zip) are attached.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BuddhistDigitalResourceCenter/jena generic-text-analyzers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #246
    
----
commit 1440e81d75ee01baf874c407d5f0017bc59c6787
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-17T19:53:41Z

    initial commit for generic analyzers

commit 8b3757bae52d08d4b308bd0f996ff452c60cc7c9
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-19T19:43:04Z

    initial documentation

commit 27ea30b73855d7a3cf0cd9561d2089295ec03353
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-20T20:37:00Z

    implement GenericAnalyzerAssembler. TO DO: Tests

commit c429c1d8e6fa7a7473001289985a9265fcc4ff37
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-21T13:56:48Z

    Merge remote-tracking branch 'apache/master' into generic-text-analyzers
    grabbing updates to jena-text

commit 8f1fa7ccbf2cb05f2eed121831c39e07260ec18b
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-21T21:02:20Z

    adding GenericAnalyzer tests

commit d2f0561b99c957658261b3693e4a89892369a65a
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-22T17:29:04Z

    added parameters of type org.apache.lucene.analysis.Analyzer

commit 94b41be7553a4f955c0e41c868d94662bdd7236e
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-22T17:29:47Z

    added more tests

commit 57ded6a9c1f7d275de4f8e6294611a869407534d
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-22T21:15:58Z

    ignore: organize imports

commit a3bb8e41aeaf9be3540cf0a6be84cd9dc9b43b28
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-22T21:31:54Z

    added analyzer definitions: 1) DefinedAnalyzers for use in text:map; 2)
    add analyzers to Multilingual support based on BCP47 codes

commit 311efab2fd26a58406b29b64d74b41039292d080
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-23T14:18:35Z

    represent parameter types as resources like text:TypeSet instead of
    literal string

commit 5edb6c8758124fe8dd5a96d7b92949fc3ac1f61f
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-23T15:13:09Z

    factor DefinedAnalyzerAssembler and DefineAnalyzersAssembler into
    separate classes; move defined analyzer cache to Utils along side the
    language tagged analyzers since both caches have the same lifetime and
    similar uses.

commit cc579790631d92499ef44717e540ceaefcfe1d89
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-23T16:17:20Z

    Merge remote-tracking branch 'apache/master' into generic-text-analyzers
    merging jena-tex-es changes

commit fef4d22faeda09159cc2523e477571d1d23a85e7
Author: Chris Tomlinson <ct...@moonvine.org>
Date:   2017-04-23T16:20:53Z

    ignore extras

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #246: Generic text analyzers

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/246#discussion_r124493779
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/assembler/TestGenericAnalyzerAssembler.java ---
    @@ -0,0 +1,198 @@
    +package org.apache.jena.query.text.assembler;
    --- End diff --
    
    Missing an license header.  Please add it - it is better if the contributor does this than someone else, although the act of making a contribution is enough especially as other files have the header.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #246: Generic text analyzers

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/246#discussion_r124493150
  
    --- Diff: jena-text/.gitignore ---
    @@ -0,0 +1 @@
    +/text-query.mdtext
    --- End diff --
    
    Not needed. Add .gitignore to .gitignore.
    
    Could you attach text-query.mdtext to the JIRA please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/246
  
    I second what @osma said and I would add that there is a possible further learning challenge with this new functionality in that the two selections for test indexing, Lucene and ElasticSearch, would (after this) have even more disparate abilities. That having been said, thanks so much for the contribution! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #246: Generic text analyzers

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/246#discussion_r124560328
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/assembler/TestGenericAnalyzerAssembler.java ---
    @@ -0,0 +1,198 @@
    +package org.apache.jena.query.text.assembler;
    --- End diff --
    
    No problem - It is not an issue specific to this class. I noticed it and found it was common in the text tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/246
  
    Hopefully, putting in 3.4.0 will not be too long - one reason the project wants to have a more frequent release cycles is so that new features can come in with less wait.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/246
  
    @xristy Thanks for the excellent explanation about the background for this PR! As you said, this came a bit out of nowhere and it would perhaps had been easier to get oriented if you had proposed it a bit beforehand on the dev list, or filed the JIRA issue a little bit earlier, not just before the PR. But clearly this is a well prepared PR and you have done an excellent job of following the contribution guidelines, no fault with that.
    
    As I see it, if there is a need (and you clearly have one!) for functionality like this, the code looks good, and it doesn't make ongoing maintenance more difficult, I don't see why we couldn't take it in. 
    
    Your suggestion of revising the jena-text documentation would be extremely welcome. I've been concerned that it has been growing organically over time when new features have been added, but the structure has not really kept up. Also some Lucene features, such as wildcard queries, are not mentioned at all, which may have caused misunderstandings about what the text index really can do. At the minimum, linking to Lucene documentation about query syntax would help, but revising the documentation to be more tutorial-like and perhaps orienting it more around potential uses instead of individual features would help a lot. If you can give a helping hand in that, I'd appreciate it a lot!
    
    With a rather complex extension of existing functionality like this, we would perhaps like to have some time for the code to get settled and work out emerging issues. An additional complication currently is that we are just trying to put together a release that already includes major changes to jena-text (Lucene upgrade by two major versions, partial reimplementation of the multilingual analyzer due to Lucene API changes, dropping Solr support and introducing the Elasticsearch backend), it would perhaps be safer to wait until after the release before merging this. On the other hand, I can see from the commit stats that this is basically not touching existing code at all (zero lines deleted), just adding to it, so it could be argued that this would be unlikely to break existing functionality.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/246
  
    Documentation updated: 
    
    http://jena.staging.apache.org/documentation/query/text-query.html


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/246
  
    Thanks for the contribution!
    
    I think this will take a while to digest. I'm currently on a business trip to California and don't have time to look at this very closely, but I will do so when I have the chance.
    
    I've wanted to have something like this in jena-text some time ago, when there was no support for any other analyzers than StandardAnalyzer. However, instead of implementing a generic configuration system like this, I ended up adding support for just the easiest analyzers (e.g. SimpleAnalyzer) plus a couple I had a special need for. Others have so far done the same (e.g. MultilingualAnalyzer by Alexis Miara).
    
    It's very good to have tests and documentation included in the PR. I'm still a bit worried about the usability aspect of this whole new configuration language. The jena-text documentation is already difficult to understand, and I've had to explain how to configure it many times to people who are new to it. Adding a new layer that allows specifying any combination of analyzers with specific parameters is very powerful, but it also adds a whole new meta-layer over an already complex configuration system.
    
    @xristy Could you elaborate a bit on why you ended up doing this (including what is the original problem you were solving), and whether you considered any alternative approaches?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/246
  
    I read through the text-query.mdtext @xristy attached on 8 June and it looks really excellent! All the features are still there, but the overall story is now much clearer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by xristy <gi...@git.apache.org>.
Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/246
  
    I've resolved the conflict in TextVocab.java and add further updates to the text-querty.mdtext a couple of weeks ago. What are the next steps that I need to take to get this PR accepted?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by xristy <gi...@git.apache.org>.
Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/246
  
    I've made considerable updates to text-query.mdtext which are attached. I hope these updates are moving in a useful direction.
    
    [text-query.mdtext.zip](https://github.com/apache/jena/files/1062372/text-query.mdtext.zip)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by xristy <gi...@git.apache.org>.
Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/246
  
    @osma I'm happy to give some background. We are developing a multilingual cultural heritage system to handle in particular Tibetan, Pali, and Sanskrit which are not handled currently among the various Lucene analyzers. There will also be instances of Chinese and other languages that are well covered among the analyzers bundled with Lucene but that are not completely exposed via the current jena-text facility.
    
    We currently have a [Tibetan centered system](https://www.tbrc.org) (the largest repository of scanned and input Tibetan texts in the world with accompanying cultural contextual metadata) built on [eXist-db](http://exist-db.org) and are migrating to a combination of couchdb and jena. The eXist-db has a [Lucene integration](http://exist-db.org/exist/apps/doc/lucene.xml) and implements an analyzer extension feature similar to that which this PR implements for jena-text. Our current development seeks to extend the initial Tibetan analyzers to deal with a variety of issues that are peculiar to Tibetan, and to include appropriate analyzers for the additional languages that will be covered. It was thus natural for us to consider recapitulating an extension feature in the context of jena-text.
    
    One point of interest is that much of the Tibetan data is in a variety of transliteration schemes as well as native Unicode, which is true as well of Pali for which there is really no native script but rather a historical succession of transliterations in scripts such as Thai, Burmese and so on. We currently tag various bits of content via BCP-47 conforming tags that reflect transliteration and such. This affects how analyzers are used. Further, there are differences in how we expect to treat title and name fields versus running text such as from colophons and so on.
    
    So we are wanting to minimize the intrusion on jena (and couchdb-lucene for which we have contributed a [similar extension feature](https://github.com/rnewson/couchdb-lucene/blob/master/CONFIGURING_ANALYZERS.md)) as we develop and test a variety of analyzers for various needs.
    
    We naturally want to be able to benefit from future developments of jena without having to incur an ongoing maintenance cost of retrofitting our extension feature or, without such feature, having to update and redeploy jena as we develop analyzers and such.
    
    The extension feature offered in this PR seeks to decouple the use of jena from the development and configuration of analyzers. Further, having such a facility should make it easier for others to incorporate full text in their jena applications without needing to modify jena.
    
    If PR #245 is adopted that leaves jena-text/Lucene as the default full text support and should make the case for effective configuration-based extension more compelling.
    
    I agree that the [current documentation](https://jena.apache.org/documentation/query/text-query.html) for jena-text is somewhat difficult in places to appreciate and it seems to me that some improvements there can be most helpful. I included updates to the current documentation to cover the extension and if there is interest I certainly could lend a hand to revise and extend the overall documentation to clarify the configuration and use of jena-text.
    
    You mention the MultilingualAnalyzer by Alexis Miara and that is in fact one of the attractive features of jena that we are most eager to leverage. RDF and jena in particular via the MultilingualAnalyzer thoroughly embrace multiple languages in metadata. This PR provides a natural extension of the  MultilingualAnalyzer to include new languages as well as override the default support provided. To be clear, the eXist-db+Lucene integration provides no equivalent to the RDF convention of carrying language tagged strings throughout the framework and consequently it is difficult to handle multilingual data since it implies per language indexing of a field.
    
    I don't know whether I've dug a deeper hole or made the case plainer. I hope the latter. I realize that this PR and the associated [JENA-1326](https://issues.apache.org/jira/browse/JENA-1326) popped up out of no where and I've tried to follow the [guidelines for contributing](https://github.com/apache/jena/blob/master/CONTRIBUTING.md). I understand that there may be issues surrounding 3.3.0 and ES and such so I hoped to at least get this work a timely hearing in that regard. I think this PR is not in conflict with or competing with the java-text-es, but provides a useful extension to Lucene-based jena-text.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by xristy <gi...@git.apache.org>.
Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/246
  
    I understand that there's a lot on the plate for 3.3.0 now and it makes sense to wait until after the code and docs have settled regarding [PR # 245](https://github.com/apache/jena/pull/245). I will add some more tests and work on the jena-text docs in general starting from 3.3.0 and anticipate that this PR will make it into 3.4.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/246
  
    @xristy  - Thank you very much.
    
    Now merged into Jena master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/246
  
    The new features introduced by this PR seem useful and worthwhile, and the implementation looks solid. This clearly doesn't conflict with existing code. 
    
    However, there is some overlap in functionality to earlier attempts at making the analyzer configuration more generic. For example, I think that the equivalent of LowerCaseKeywordAnalyzer can now be specified in three different ways: using LowerCaseKeywordAnalyzer directly; using ConfigurableAnalyzer to build it from KeywordTokenizer and two filters; or with the new GenericAnalyzer. If we had had functionality like this earlier, then there wouldn't have been any need to create ConfigurableAnalyzer or indeed the pre-packaged LowerCaseKeywordAnalyzer. 
    
    But now I think we must simply support all the different configuration styles. If we want to drop one or more of those options, it should at least be clearly deprecated well in advance. It is not the fault of this PR, just something that has happened as the feature set has grown organically over time.
    
    +1 for merging after the minor issues identified by @afs have been sorted out.
    
    I must say that without the excellent documentation updates that come along with this PR, I would be a lot more hesitant to merge this because of the added complexity in index configuration. But with the very clear examples and thorough explanations (also for older jena-text features!), I think this adds a lot of value, even for jena-text users who don't need the full power of this meta-configuration language.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #246: Generic text analyzers

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/jena/pull/246


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #246: Generic text analyzers

Posted by xristy <gi...@git.apache.org>.
Github user xristy commented on a diff in the pull request:

    https://github.com/apache/jena/pull/246#discussion_r124540115
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/assembler/TestGenericAnalyzerAssembler.java ---
    @@ -0,0 +1,198 @@
    +package org.apache.jena.query.text.assembler;
    --- End diff --
    
    I've made these changes. I only added the `JenaSystem.init()` to `TestGenericAnalyzerAssembler.java` but am happy to add to the other tests in jena-text.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by xristy <gi...@git.apache.org>.
Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/246
  
    I've added `JenaSystem.init()` to the jena-text tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #246: Generic text analyzers

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/246#discussion_r124495200
  
    --- Diff: jena-text/src/test/java/org/apache/jena/query/text/assembler/TestGenericAnalyzerAssembler.java ---
    @@ -0,0 +1,198 @@
    +package org.apache.jena.query.text.assembler;
    --- End diff --
    
    More FYI than comment:
    
    If you want to run these test standalone, it needs to cause initialization with `static { JenaSystem.init() ; }` These test don't touch anything soon enough to cause that to happen implicitly.  These are not the only text tests that tis applies to.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #246: Generic text analyzers

Posted by xristy <gi...@git.apache.org>.
Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/246
  
    @afs Thank you for your patience and help.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---