Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/11/11 13:59:56 UTC

[GitHub] [beam] JeffBolle opened a new issue, #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

JeffBolle opened a new issue, #24117:
URL: https://github.com/apache/beam/issues/24117

   ### What happened?
   
   In the process of investigating the issue reported here:
   https://stackoverflow.com/questions/74390325/how-to-enable-elasticsearchio-parallel-reads-in-apache-beam
   
   it appears that the method the ElasticsearchIO connector uses to estimate the size of the data in the response does not account for the case where the configured index is an alias, a data stream, or an index pattern, any of which can point to multiple indices.
   
   The original issue was that a query returning over 100 million documents for processing in the pipeline was unable to scale and was only processing at a rate of 40 / second.
   
   As discussed in the stackoverflow thread, the code here: https://github.com/apache/beam/blob/c7f2cab6ea30a63e04847dc45047a8193abc9552/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L871
   
   is not properly accounting for a number of scenarios where the index name returned by Elasticsearch differs from `connectionConfiguration.getIndex()`.
   
   Elasticsearch should be relied upon to return the proper indices for a given stats query, and as such the `_all` object should be used instead of the `indices` top-level object.  If there are other cases where the `_all` object isn't appropriate, then the code should iterate through all of the indices returned under the `indices` field and sum the total store size, rather than simply trying to match the configured index.
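
A minimal sketch of the fallback described above (summing over every entry under `indices` instead of matching one name). It assumes the stats response has already been parsed into nested maps and that `primaries.store.size_in_bytes` is the field of interest; the class and method names are hypothetical and are not part of ElasticsearchIO.

```java
import java.util.Map;

// Hypothetical helper, not the connector's actual code.
public class StatsSizeSketch {

  /** Sums primaries.store.size_in_bytes over every entry under "indices". */
  public static long sumStoreSize(Map<String, Object> stats) {
    Object indices = stats.get("indices");
    if (!(indices instanceof Map)) {
      return 0L; // no per-index section in the response
    }
    long total = 0L;
    for (Object perIndex : ((Map<?, ?>) indices).values()) {
      total += sizeInBytes(perIndex);
    }
    return total;
  }

  /** Walks primaries -> store -> size_in_bytes, returning 0 if any level is missing. */
  public static long sizeInBytes(Object node) {
    Object current = node;
    for (String key : new String[] {"primaries", "store", "size_in_bytes"}) {
      if (!(current instanceof Map)) {
        return 0L;
      }
      current = ((Map<?, ?>) current).get(key);
    }
    return current instanceof Number ? ((Number) current).longValue() : 0L;
  }

  public static void main(String[] args) {
    // An alias that resolves to two backing indices (made-up names and sizes).
    Map<String, Object> stats = Map.of("indices", Map.of(
        "logs-000001", Map.of("primaries", Map.of("store", Map.of("size_in_bytes", 100L))),
        "logs-000002", Map.of("primaries", Map.of("store", Map.of("size_in_bytes", 250L)))));
    System.out.println(sumStoreSize(stats)); // prints 350
  }
}
```

Because no index name is matched, the sum is the same whether the configured value is a concrete index, an alias, or a pattern.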
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: io-java-elasticsearch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] egalpin commented on issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

Posted by GitBox <gi...@apache.org>.
egalpin commented on issue #24117:
URL: https://github.com/apache/beam/issues/24117#issuecomment-1315889559

   Thanks @JeffBolle for raising this and for proposing the patch.  I can have a look at the patch, but definitely would not want to rob you of the chance to contribute your change directly if you're interested in doing so!




[GitHub] [beam] robertwb commented on issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

Posted by GitBox <gi...@apache.org>.
robertwb commented on issue #24117:
URL: https://github.com/apache/beam/issues/24117#issuecomment-1314114472

   At the very least we should provide a way for the end user to specify the size or degree of splitting until this is resolved.




[GitHub] [beam] aromanenko-dev closed issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

Posted by GitBox <gi...@apache.org>.
aromanenko-dev closed issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size
URL: https://github.com/apache/beam/issues/24117




[GitHub] [beam] aromanenko-dev commented on issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

Posted by GitBox <gi...@apache.org>.
aromanenko-dev commented on issue #24117:
URL: https://github.com/apache/beam/issues/24117#issuecomment-1315187751

   CC: @egalpin 




[GitHub] [beam] egalpin commented on issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

Posted by GitBox <gi...@apache.org>.
egalpin commented on issue #24117:
URL: https://github.com/apache/beam/issues/24117#issuecomment-1315898115

   OK, I think the patch is close but not quite right. Using `_all` would take the size of all indices in the cluster, but that may not reflect the size of the data that one will be reading.  Instead, if we use the stats API at the index level (`GET /my-index-name/_stats`) and then use `_all`, it contains data for all aliases of `my-index-name`.
   
   I believe this would fulfill the original goal.  Thoughts @JeffBolle ?
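
A rough sketch of that index-scoped approach: build the `/<index>/_stats` path from the configured index and read the `_all` aggregate from the response. It assumes the response is already parsed into nested maps and that `primaries.store.size_in_bytes` is the value wanted; `IndexStatsSketch` and its methods are hypothetical names, and URL-encoding of the index name is left out.

```java
import java.util.Map;

// Hypothetical helper, not the connector's actual code.
public class IndexStatsSketch {

  /** Builds the index-scoped stats path, e.g. "/my-index-name/_stats". */
  public static String statsEndpoint(String index) {
    return "/" + index + "/_stats";
  }

  /** Reads _all.primaries.store.size_in_bytes, returning 0 if any level is missing. */
  public static long allPrimariesStoreSize(Map<String, Object> stats) {
    Object current = stats;
    for (String key : new String[] {"_all", "primaries", "store", "size_in_bytes"}) {
      if (!(current instanceof Map)) {
        return 0L;
      }
      current = ((Map<?, ?>) current).get(key);
    }
    return current instanceof Number ? ((Number) current).longValue() : 0L;
  }

  public static void main(String[] args) {
    // Made-up response fragment for an index-scoped stats request.
    Map<String, Object> stats = Map.of("_all",
        Map.of("primaries", Map.of("store", Map.of("size_in_bytes", 4096L))));
    System.out.println(statsEndpoint("my-index-name")); // prints /my-index-name/_stats
    System.out.println(allPrimariesStoreSize(stats));   // prints 4096
  }
}
```

Scoping the request to the configured index is what keeps `_all` from covering unrelated indices in the cluster.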




[GitHub] [beam] JeffBolle commented on issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

Posted by GitBox <gi...@apache.org>.
JeffBolle commented on issue #24117:
URL: https://github.com/apache/beam/issues/24117#issuecomment-1315503075

   I'm working on a PR for this, but it's a trivial change. I've attached the patch if anyone wants to have a look.
   [0001-Read-from-_all-instead-of-_indices-for-index-stats.patch.txt](https://github.com/apache/beam/files/10014103/0001-Read-from-_all-instead-of-_indices-for-index-stats.patch.txt)
   




[GitHub] [beam] JeffBolle commented on issue #24117: [Bug]: ElasticsearchIO connector does not properly estimate index size

Posted by GitBox <gi...@apache.org>.
JeffBolle commented on issue #24117:
URL: https://github.com/apache/beam/issues/24117#issuecomment-1316225170

   > Thanks @JeffBolle for raising this and for proposing the patch. I can have a look at the patch, but definitely would not want to rob you of the chance to contribute your change directly if you're interested in doing so!
   
   I appreciate that, but for now I think this is the best path. Better to get the issue resolved quickly. If the next one involves more work to address properly I might want credit for that.

