You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/08/24 09:29:36 UTC

[GitHub] [beam] sheepdreamofandroids opened a new issue, #22840: [Feature Request]: ElasticSearchIO.Write to support custom input type

sheepdreamofandroids opened a new issue, #22840:
URL: https://github.com/apache/beam/issues/22840

### What would you like to happen?

ElasticSearchIO.Write accepts a PCollection<String>. This is the exact JSON that will be written to ElasticSearch. Consequently functions like IndexFn have only (the parsed version of) this JSON as input.

I have a problem where, among others, the exact same JSON get's written to 2 different indexes in different ways. This means I have to create multiple instances of ElasticSearchIO.Write, one for each permutation of settings. It's not a major problem because there are only 7 permutations in my case but it could be many more. It does lead to a big process graph in dataflow and also it leads many small batches because there is a latency constraint,

The reason is that there is no way of adding some context or meta data to that JSON string. I could add an extra field to that json but I have no way of removing it before it gets written to the index.

A possible solution would be to have a type parameter for the input and a settable jsonFn that extracts the json string to be written.

### Issue Priority

Priority: 3

### Issue Component

Component: io-java-elasticsearch

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] sheepdreamofandroids commented on issue #22840: [Feature Request]: ElasticSearchIO.Write to allow adding context to input json

Posted by GitBox <gi...@apache.org>.

sheepdreamofandroids commented on issue #22840:
URL: https://github.com/apache/beam/issues/22840#issuecomment-1225571146

   Actually I would also need a getUpsertFn().


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] sheepdreamofandroids commented on issue #22840: [Feature Request]: ElasticSearchIO.Write to allow adding context to input json

Posted by GitBox <gi...@apache.org>.

sheepdreamofandroids commented on issue #22840:
URL: https://github.com/apache/beam/issues/22840#issuecomment-1236635253

We're processing JSON events for an online marketplace like creation, modification, publishing, (de-)activating features etc. Lots of private info so I can't share actual info. Depending on type these events go to different indexes and most go to 2 different indexes: one for full history and one where we update the latest version using a scripted upsert. This is to facilitate different styles of searching.
While it is possible to determine what the type is from the json itself, it's easier to do it from metadata sent in pubsub. When sending the same json to different indexes though, that is obviously not possible.
It is solved now by using the type from the metadata and dispatching each json to one or more subsequent pipelines, one for each permutation of index and upsert.
Much nicer would be to simply attach the index and upsert to the json and keep everything in one simple pipeline. The most flexible way would be to be able to send my own type into Write and define a jsonFn and upsertFn to extract from that type.
Maybe a GenericWrite (with type parameter) could be defined as superclass of Write, which now uses String for that parameter and defines a default jsonFn.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] sheepdreamofandroids commented on issue #22840: [Feature Request]: ElasticSearchIO.Write to allow adding context to input json

Posted by GitBox <gi...@apache.org>.

sheepdreamofandroids commented on issue #22840:
URL: https://github.com/apache/beam/issues/22840#issuecomment-1236924189

   Now that I'm thinking about it, it would be very nice to have the same type for the output success/failure pipeline stages. That way we could measure the time taken to write each item. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] egalpin commented on issue #22840: [Feature Request]: ElasticSearchIO.Write to allow adding context to input json

Posted by GitBox <gi...@apache.org>.

egalpin commented on issue #22840:
URL: https://github.com/apache/beam/issues/22840#issuecomment-1233191174

   @sheepdreamofandroids would you be able to share some concrete examples of input data and desired capabilities? I definitely want the ES connector to be flexible to serve many use cases, and also want to guard separation of concern at the same time 😊 
   
   It seems like there may be an opportunity to create an interface within DocToBulk, with a default implementation providing the existing functionality, to allow user code to alter the behaviour.  But I'd like to understand more by way of concrete example if you're able to share something 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] egalpin commented on issue #22840: [Feature Request]: ElasticSearchIO.Write to allow adding context to input json

Posted by GitBox <gi...@apache.org>.

egalpin commented on issue #22840:
URL: https://github.com/apache/beam/issues/22840#issuecomment-1225707383

   @sheepdreamofandroids Something that exists today which might work for you would be to split apart your usage of the building blocks of `ElasticsearchIO#Write`:  `DocToBulk`[1] and `BulkIO`[2].  DocToBulk is responsible for taking JSON-serialized inputs and converting them to a representation that the ES Bulk API can work with. BulkIO strictly deals with batching and sending data to an ES cluster.
   
   In your use case, it sounds like you have a singular input PCollection of inputs which then need to fanout in order to be processed in multiple ways for inclusion in different indices in ES.  You could fanout to each DocToBulk to process as needed, then flatten and use a single BulkIO operation.  This would allow for larger Bulk API payloads/larger batches because outputs from all DocToBulk could be combined in a single BulkIO output to ES (depending on buffering time, of course).
   
   ```
   
   
   
   
   
                                                                         ┌──────────────────────┐
                                                                         │                      │
                                                                         │ Input PCollection    │
                     ┌─────────────────────────────┬─────────────────────┴──────────────────────┴─────────────────────────────┐
                     │                             │                                                                          │
                     │                             │                                                                          │
                     │                             │                                                                          │
                     │                             │                                                                          │
                     │                             │                                                                          │
          ┌──────────▼──────────┐         ┌────────▼──────┐                                                          ┌────────▼────────┐
          │  DocToBulk1         │         │  DocToBulk2   │                                                          │ DocToBulk_n     │
          └────────┬────────────┘         └───────────────┴───────────────────┐                                      └────────┬────────┘
                   │                                                          │                                               │
                   │                                                          │                                               │
                   │                                                          │                                               │
                   │                                                          │                                               │
                   │                                                          │                                               │
                   │                                                          │                                               │
                   │                                            ┌─────────────▼─────────────┐                                 │
                   └───────────────────────────────────────────►│    Flatten                ◄─────────────────────────────────┘
                                                                │                           │
                                                                └─────────────┬─────────────┘
                                                                              │
                                                                              │
                                                                              │
                                                                ┌─────────────▼─────────────┐
                                                                │       BulkIO              │
                                                                └───────────────────────────┘
   
   
   
   ```
   
   [1] https://beam.apache.org/releases/javadoc/2.40.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.DocToBulk.html
   [2] https://beam.apache.org/releases/javadoc/2.40.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.BulkIO.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] sheepdreamofandroids commented on issue #22840: [Feature Request]: ElasticSearchIO.Write to allow adding context to input json

Posted by GitBox <gi...@apache.org>.

sheepdreamofandroids commented on issue #22840:
URL: https://github.com/apache/beam/issues/22840#issuecomment-1226095863

   Hmmm, I didn't think about flattening into a single BulkIO. That would indeed solve the multiple batches problem. It still makes the code more complicated than necessary because what could be a simple switch statement now has to be a dispatch to multiple DocToBulks.
   
   My idea was to write my own DocToBulk with the necessary logic. It just means that I would have to duplicate `createBulkApiEntity` and `getDocumentMetadata` which isn't too complicated.
   
   Would you accept a PR with a type parameter on DocToBulk and additional jsonFn and upsertFn?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org