Posted to dev@nutch.apache.org by "tballison (via GitHub)" <gi...@apache.org> on 2023/02/24 19:56:51 UTC

[GitHub] [nutch] tballison opened a new pull request, #761: NUTCH-2920 -- first working attempt at an OpenSearchIndexWriter

tballison opened a new pull request, #761:
URL: https://github.com/apache/nutch/pull/761

   …iter to OpenSearch
   
   Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-2920`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square brackets (`[NUTCH-2920] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean runtime test`
   * there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch.
   * if new dependencies are added,
     - are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](https://www.apache.org/legal/resolved.html#category-a)?
     - are `LICENSE-binary` and `NOTICE-binary` updated accordingly?
   
   We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](https://nutch.apache.org/mailing_lists.html). Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@nutch.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nutch] tballison commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1445065255

   Needs more work on TLS vs. basic auth, etc. Converting to draft. Will update on Monday.




[GitHub] [nutch] sebastian-nagel commented on a diff in pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "sebastian-nagel (via GitHub)" <gi...@apache.org>.
sebastian-nagel commented on code in PR #761:
URL: https://github.com/apache/nutch/pull/761#discussion_r1124754751


##########
src/plugin/build.xml:
##########
@@ -54,6 +54,7 @@
     <ant dir="indexer-dummy" target="deploy"/>
     <ant dir="indexer-elastic" target="deploy"/>
     <ant dir="indexer-kafka" target="deploy"/>
+    <ant dir="indexer-opensearch-1x" target="deploy"/>

Review Comment:
   (if `ant clean` is called in the folder `src/plugin`)
   
   In addition, the plugin should be added to the build.xml (targets javadoc, eclipse and release), same as other plugins.





[GitHub] [nutch] tballison commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1454110886

   Sorry, what I intended by my question was: are you all OK with version numbers in the module name? The current code is deprecated for 3x, so I think we'll need to have a 3x at some point. Or do we want to target, say, 2.x as 'indexer-opensearch' and hope it supports 1.x, etc.?
   




[GitHub] [nutch] tballison commented on pull request #761: NUTCH-2920 -- first working attempt at an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1444377846

   The fiddly part (for me) was setting up the rest client to deal with a trust store.
   
   I followed https://opensearch.org/blog/connecting-java-high-level-rest-client-with-opensearch-over-https/ 
   
   I had to update the `keytool` command in that blog post like so: `keytool -importcert -file root_ca.der -keystore my-store.jks -storepass mystorepass -alias opensearch`
   
   If there's a better way of handling this, let me know.
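
The trust store step described above can be sketched with the JDK alone. This is a minimal illustration, not the code from the pull request: the class name `TrustStoreSketch` and the method `sslContextFrom` are hypothetical, and the sketch assumes the store is the JKS file produced by the `keytool` command above. How the resulting `SSLContext` is handed to the OpenSearch REST client is left out, since that wiring depends on the client version.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper, not part of Nutch or the OpenSearch client.
final class TrustStoreSketch {

    // Builds an SSLContext that trusts only the certificates in the given JKS store.
    static SSLContext sslContextFrom(Path trustStorePath, char[] password) throws Exception {
        KeyStore trustStore = KeyStore.getInstance("JKS");
        try (InputStream in = Files.newInputStream(trustStorePath)) {
            // Throws an IOException if the file is corrupt or the password is wrong.
            trustStore.load(in, password);
        }

        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        SSLContext context = SSLContext.getInstance("TLS");
        // No client key material (null) and the default SecureRandom (null).
        context.init(null, tmf.getTrustManagers(), null);
        return context;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TrustStoreSketch <store.jks> <storepass>");
            return;
        }
        SSLContext context = sslContextFrom(Paths.get(args[0]), args[1].toCharArray());
        System.out.println("Loaded trust store, protocol: " + context.getProtocol());
    }
}
```

Running it with `my-store.jks` and `mystorepass` as arguments is a quick way to check that the store and password are readable before configuring the index writer.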




[GitHub] [nutch] tballison commented on a diff in pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on code in PR #761:
URL: https://github.com/apache/nutch/pull/761#discussion_r1124942828


##########
NOTICE-binary:
##########
@@ -1021,6 +1021,10 @@ mapdb (http://www.mapdb.org)
 webarchive-commons (https://github.com/iipc/webarchive-commons)
 - license: The Apache Software License, Version 2.0
 
+# org.opensearch.client:opensearch-rest-high-level-client

Review Comment:
   +1. Nothing to do on this PR, though, right?





[GitHub] [nutch] tballison commented on a diff in pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on code in PR #761:
URL: https://github.com/apache/nutch/pull/761#discussion_r1124942563


##########
src/plugin/build.xml:
##########
@@ -54,6 +54,7 @@
     <ant dir="indexer-dummy" target="deploy"/>
     <ant dir="indexer-elastic" target="deploy"/>
     <ant dir="indexer-kafka" target="deploy"/>
+    <ant dir="indexer-opensearch-1x" target="deploy"/>

Review Comment:
   Done.





[GitHub] [nutch] tballison commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1454107912

   I've run this with both the Docker "getting started" OpenSearch example (e.g. `docker run -d -p 127.0.0.1:9200:9200 -p 127.0.0.1:9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.3.8` with username=admin and pw=admin) and the example for running OpenSearch locally with the trust store in the [post above](https://opensearch.org/blog/connecting-java-high-level-rest-client-with-opensearch-over-https/).
   
   The more tests the merrier! 
   
   We're using testcontainers over on Tika to test integration with OpenSearch and Solr. Those tests are resource- and time-intensive, but really valuable to have. I can try to draft some testcontainers tests in a separate PR.
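
For the basic-auth case mentioned above (username=admin, pw=admin against the Docker example), the HTTP `Authorization` header value can be derived as in the following sketch. The class name `BasicAuthSketch` and the method `basicAuthHeader` are hypothetical and only illustrate the encoding. The index writer in the pull request may configure credentials differently, for instance through the REST client's own credentials provider.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Nutch or the OpenSearch client.
final class BasicAuthSketch {

    // HTTP Basic authentication: "Basic " + base64("user:password").
    static String basicAuthHeader(String user, String password) {
        String token = user + ":" + password;
        return "Basic "
            + Base64.getEncoder().encodeToString(token.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Demo credentials from the Docker "getting started" example only.
        System.out.println("Authorization: " + basicAuthHeader("admin", "admin"));
    }
}
```

Base64 is an encoding, not encryption, so this header protects the credentials only when the connection itself uses TLS.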




[GitHub] [nutch] sebastian-nagel commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "sebastian-nagel (via GitHub)" <gi...@apache.org>.
sebastian-nagel commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1455893629

   Merged. After reading about the [OpenSearch releases](https://opensearch.org/releases.html), I agree with including the version number in the plugin's name. It would also allow us to start working on a 2x plugin now. But maybe we keep naming the plugin that supports the latest version without a specific version number?




[GitHub] [nutch] tballison commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1454059036

   To confirm, you're all ok with the *-1x design?  Less than ideal, but I think it is useful.




[GitHub] [nutch] sebastian-nagel merged pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "sebastian-nagel (via GitHub)" <gi...@apache.org>.
sebastian-nagel merged PR #761:
URL: https://github.com/apache/nutch/pull/761




[GitHub] [nutch] sebastian-nagel commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "sebastian-nagel (via GitHub)" <gi...@apache.org>.
sebastian-nagel commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1454100003

   > ok with the *-1x design?
   
   Yes. As said, I haven't tested it with a running OpenSearch instance. If you can confirm that docs/fields appear in the index as expected, that's fine. Otherwise, I can try to run an OpenSearch instance (it would be my first time) and index a batch of docs using various index filter plugins (adding `index-(basic|more|anchor|geoip)` to the property `plugin.includes`). This would be a more realistic test scenario.
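
To see which plugins a pattern such as `index-(basic|more|anchor|geoip)` would select, here is a minimal sketch. It assumes that `plugin.includes` is a regular expression matched against each whole plugin id. The class `PluginIncludesSketch`, the method `isIncluded`, and the sample property value are hypothetical and do not come from the pull request.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; it mimics whole-id regex matching, assumed here to be how plugin.includes is applied.
final class PluginIncludesSketch {

    // True if the plugin id as a whole matches the plugin.includes expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Sample value for illustration only; a real configuration lists more plugins.
        String pluginIncludes =
            "protocol-http|indexer-opensearch-1x|index-(basic|more|anchor|geoip)";
        List<String> ids = List.of(
            "index-basic", "index-geoip", "index-metadata", "indexer-opensearch-1x");
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(pluginIncludes, id));
        }
    }
}
```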




[GitHub] [nutch] tballison commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1450693830

   K, I think this is ready for review.  I'm happy for any and all input!




[GitHub] [nutch] sebastian-nagel commented on a diff in pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "sebastian-nagel (via GitHub)" <gi...@apache.org>.
sebastian-nagel commented on code in PR #761:
URL: https://github.com/apache/nutch/pull/761#discussion_r1124705690


##########
src/plugin/build.xml:
##########
@@ -54,6 +54,7 @@
     <ant dir="indexer-dummy" target="deploy"/>
     <ant dir="indexer-elastic" target="deploy"/>
     <ant dir="indexer-kafka" target="deploy"/>
+    <ant dir="indexer-opensearch-1x" target="deploy"/>

Review Comment:
   Should also add the new plugin to the target "clean".



##########
src/plugin/build.xml:
##########
@@ -54,6 +54,7 @@
     <ant dir="indexer-dummy" target="deploy"/>
     <ant dir="indexer-elastic" target="deploy"/>
     <ant dir="indexer-kafka" target="deploy"/>
+    <ant dir="indexer-opensearch-1x" target="deploy"/>

Review Comment:
   (if `ant clean` is called in the folder `src/plugin`)
   
   In addition, the plugin should be added to the build.xml (targets javadoc, eclipse and release), same as other plugins.



##########
NOTICE-binary:
##########
@@ -1021,6 +1021,10 @@ mapdb (http://www.mapdb.org)
 webarchive-commons (https://github.com/iipc/webarchive-commons)
 - license: The Apache Software License, Version 2.0
 
+# org.opensearch.client:opensearch-rest-high-level-client

Review Comment:
   Excellent! But we should seriously automate keeping the licenses up to date; see NUTCH-2981.





[GitHub] [nutch] sebastian-nagel commented on pull request #761: NUTCH-2920 -- add an OpenSearchIndexWriter

Posted by "sebastian-nagel (via GitHub)" <gi...@apache.org>.
sebastian-nagel commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1454129142

   Well, I don't really know. It depends on how long we want to maintain it, which means upgrading and testing (manually, as of now), and also on how long users may want to keep it running.
   
   For the Solr and Elastic indexers we always supported only one version (a recent one, but rarely the latest).
   
   > using testcontainers over on Tika to test integration with OpenSearch and Solr
   
   Same for [StormCrawler](https://github.com/Digitalpebble/storm-crawler/). It might be worth having a look at the OpenSearch (2.5) module over there.
   




[GitHub] [nutch] tballison commented on pull request #761: NUTCH-2920 -- first working attempt at an OpenSearchIndexWriter

Posted by "tballison (via GitHub)" <gi...@apache.org>.
tballison commented on PR #761:
URL: https://github.com/apache/nutch/pull/761#issuecomment-1444379112

   I'm less than entirely thrilled with using stored strings for credentials, but that's where we were with Elasticsearch.  Again, if there's a better way, please let me know.

