You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by "prodriguezdefino (via GitHub)" <gi...@apache.org> on 2024/04/02 06:55:04 UTC

[PR] Elasticsearch enable Point In Time based searches [beam]

prodriguezdefino opened a new pull request, #30824:
URL: https://github.com/apache/beam/pull/30824

   This change adds a BoundedReader implementation based on the newer Point In Time search API (see [doc](https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html)). This API is the recommended mode when reading data from large indexes that imply retrieving documents using deep iteration (more than 10000 hits per slice). 
   
   This mode should be available for Elasticsearch clusters with version 8 and up. 
   
   [ADD PERFORMANCE IMPROVEMENT COMPARISON WITH SCROLL BASED SEARCHES]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #30824:
URL: https://github.com/apache/beam/pull/30824#issuecomment-2046632425

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @Abacn for label java.
   R: @johnjcasey for label io.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "prodriguezdefino (via GitHub)" <gi...@apache.org>.
prodriguezdefino commented on PR #30824:
URL: https://github.com/apache/beam/pull/30824#issuecomment-2033191290

   Run Java_ElasticSearch_IO_Direct PreCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "prodriguezdefino (via GitHub)" <gi...@apache.org>.
prodriguezdefino commented on PR #30824:
URL: https://github.com/apache/beam/pull/30824#issuecomment-2044211304

   Run Java_ElasticSearch_IO_Direct PreCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "prodriguezdefino (via GitHub)" <gi...@apache.org>.
prodriguezdefino commented on PR #30824:
URL: https://github.com/apache/beam/pull/30824#issuecomment-2044220747

   Run Java_ElasticSearch_IO_Direct PreCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "prodriguezdefino (via GitHub)" <gi...@apache.org>.
prodriguezdefino commented on code in PR #30824:
URL: https://github.com/apache/beam/pull/30824#discussion_r1580452732


##########
sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java:
##########
@@ -834,13 +859,27 @@ public Read withMetadata() {
      * href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-request-scroll.html">scroll
      * API</a> Default is "5m". Change this only if you get "No search context found" errors.
      *
+     * @deprecated use {@link Read#withIteratorKeepalive} instead

Review Comment:
   removed the new iterator configuration and reverted the changes on the scroll keep alive. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn merged PR #30824:
URL: https://github.com/apache/beam/pull/30824


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "prodriguezdefino (via GitHub)" <gi...@apache.org>.
prodriguezdefino commented on code in PR #30824:
URL: https://github.com/apache/beam/pull/30824#discussion_r1585325343


##########
sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java:
##########
@@ -832,15 +857,17 @@ public Read withMetadata() {
     /**
      * Provide a scroll keepalive. See <a
      * href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-request-scroll.html">scroll
-     * API</a> Default is "5m". Change this only if you get "No search context found" errors.
+     * API</a> Default is "5m". Change this only if you get "No search context found" errors. When
+     * configuring the read to use Point In Time (PIT) search this configuration is used to set the
+     * PIT keep alive.
      *
      * @param scrollKeepalive keepalive duration of the scroll
      * @return a {@link PTransform} reading data from Elasticsearch.
      */
     public Read withScrollKeepalive(String scrollKeepalive) {
-      checkArgument(scrollKeepalive != null, "scrollKeepalive can not be null");
-      checkArgument(!"0m".equals(scrollKeepalive), "scrollKeepalive can not be 0m");
-      return builder().setScrollKeepalive(scrollKeepalive).build();
+      checkArgument(scrollKeepalive != null, "iteratorKeepalive can not be null");

Review Comment:
   good catch, reverted those.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #30824:
URL: https://github.com/apache/beam/pull/30824#issuecomment-2061119274

   Reminder, please take a look at this pr: @Abacn @johnjcasey 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on code in PR #30824:
URL: https://github.com/apache/beam/pull/30824#discussion_r1571119296


##########
sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java:
##########
@@ -834,13 +859,27 @@ public Read withMetadata() {
      * href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-request-scroll.html">scroll
      * API</a> Default is "5m". Change this only if you get "No search context found" errors.
      *
+     * @deprecated use {@link Read#withIteratorKeepalive} instead

Review Comment:
   I see when setUsePITSearch is not enabled, iteratorKeepalive has the same effect as scrollKeepalive before: https://github.com/apache/beam/pull/30824/files#diff-e119923661d41dfc1911da8f6e29cc0e2e9a0a50129718c1c20c7d1aa32119bfR1269
   
   it still sets `scroll` parameter taken the scrollKeepalive value.
   
   My understanding is that `scrollKeepalive` isn't really deprecated. It is now the config `iteratorKeepalive` goes to different param depending on whether PIT is used or not.
   
   Can we keep the old naming `withScrollKeepalive` and document that this parameter now goes to set iterator alive time when PIT is enabled? So no Beam API change or deprecation needed



##########
sdks/java/io/elasticsearch-tests/elasticsearch-tests-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java:
##########
@@ -228,24 +228,42 @@ void testRead() throws Exception {
             ElasticsearchIO.read()
                 .withConnectionConfiguration(connectionConfiguration)
                 // set to default value, useful just to test parameter passing.
-                .withScrollKeepalive("5m")
+                .withIteratorKeepalive("5m")
                 // set to default value, useful just to test parameter passing.
                 .withBatchSize(100L));
     PAssert.thatSingleton(output.apply("Count", Count.globally())).isEqualTo(numDocs);
     pipeline.run();
   }
 
+  void testReadPIT() throws Exception {

Review Comment:
   Can we add a comment that PIT is available since elasticsearch version 8



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on PR #30824:
URL: https://github.com/apache/beam/pull/30824#issuecomment-2061283049

   sorry for delay, taking a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Elasticsearch enable Point In Time based searches [beam]

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on code in PR #30824:
URL: https://github.com/apache/beam/pull/30824#discussion_r1585079959


##########
sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java:
##########
@@ -832,15 +857,17 @@ public Read withMetadata() {
     /**
      * Provide a scroll keepalive. See <a
      * href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-request-scroll.html">scroll
-     * API</a> Default is "5m". Change this only if you get "No search context found" errors.
+     * API</a> Default is "5m". Change this only if you get "No search context found" errors. When
+     * configuring the read to use Point In Time (PIT) search this configuration is used to set the
+     * PIT keep alive.
      *
      * @param scrollKeepalive keepalive duration of the scroll
      * @return a {@link PTransform} reading data from Elasticsearch.
      */
     public Read withScrollKeepalive(String scrollKeepalive) {
-      checkArgument(scrollKeepalive != null, "scrollKeepalive can not be null");
-      checkArgument(!"0m".equals(scrollKeepalive), "scrollKeepalive can not be 0m");
-      return builder().setScrollKeepalive(scrollKeepalive).build();
+      checkArgument(scrollKeepalive != null, "iteratorKeepalive can not be null");

Review Comment:
   please fix these error messages also (revert change)



##########
sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java:
##########
@@ -773,10 +792,16 @@ abstract static class Builder {
 
       abstract Builder setWithMetadata(boolean withMetadata);
 
-      abstract Builder setScrollKeepalive(String scrollKeepalive);
+      abstract Builder setIteratorKeepalive(String iteratorKeepalive);

Review Comment:
   and here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org