Posted to commits@beam.apache.org by "Joshua Fox (JIRA)" <ji...@apache.org> on 2016/11/19 16:07:58 UTC

[jira] [Commented] (BEAM-991) DatastoreIO Write should flush early for large batches

    [ https://issues.apache.org/jira/browse/BEAM-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679482#comment-15679482 ] 

Joshua Fox commented on BEAM-991:
---------------------------------

The maximum request size is 10 MB; the maximum Item size is 1 MB. The implementation _must_ support all legal Items. Possible solutions:

- Set the maximum batch size to 10. Ten Items of at most 1 MB each can never exceed the 10 MB request limit, so every request completes, though throughput obviously suffers.
- Make users set a constant batch size between 10 and the Datastore API maximum of 500. This is problematic, since we do not always know how big our Items are, particularly when developing generic solutions.
- Start with a batch size of 500. If a _put_ fails with a "too large" error, recursively halve the batch size and retry until the _put_ succeeds, then keep the new value for a while. On the assumption that Entities are grouped into similar sizes, occasionally ramp the batch size back up to probe whether the Entities have become smaller, again reverting to the smaller batch size on failure. Perhaps track the batch size, and ramp it up and down, on a per-Kind basis.
- Measure _getSerializedSize()_ of _all_ Items on _every put_, and adjust the batch size accordingly. This may be slow.
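The halve-and-retry scheme in the third bullet can be sketched as follows. This is an illustrative model of the sizing logic only, not Beam code: the class name, the ramp-up interval of 100 successes, and the doubling probe are assumptions, and the actual Datastore commit and error handling are omitted.

```java
// Sketch of an adaptive batch sizer: halve on "request too large",
// occasionally probe upward after a run of successful commits.
public class AdaptiveBatchSizer {
    static final int MAX_BATCH = 500; // Datastore per-request entity limit
    static final int MIN_BATCH = 10;  // 10 x 1 MB max Item always fits in 10 MB

    private static final int RAMP_UP_AFTER = 100; // hypothetical probe interval

    private int batchSize = MAX_BATCH;
    private int successesSinceRampUp = 0;

    /** Number of Items to put in the next commit. */
    public int currentBatchSize() {
        return batchSize;
    }

    /** Called when a commit fails with a "too large" error; caller then retries. */
    public void onTooLarge() {
        batchSize = Math.max(MIN_BATCH, batchSize / 2);
        successesSinceRampUp = 0;
    }

    /** Called after a successful commit; after enough successes, probe a larger size. */
    public void onSuccess() {
        successesSinceRampUp++;
        if (successesSinceRampUp >= RAMP_UP_AFTER && batchSize < MAX_BATCH) {
            batchSize = Math.min(MAX_BATCH, batchSize * 2);
            successesSinceRampUp = 0;
        }
    }
}
```

A per-Kind variant would keep one such sizer per Kind instead of a single shared one.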


> DatastoreIO Write should flush early for large batches
> ------------------------------------------------------
>
>                 Key: BEAM-991
>                 URL: https://issues.apache.org/jira/browse/BEAM-991
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-gcp
>            Reporter: Vikas Kedigehalli
>            Assignee: Vikas Kedigehalli
>
> If entities are large (avg size > 20KB) then a single batched write (500 entities) would exceed the Datastore size limit of a single request (10MB) from https://cloud.google.com/datastore/docs/concepts/limits.
> First reported in: http://stackoverflow.com/questions/40156400/why-does-dataflow-erratically-fail-in-datastore-access
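The flush-early behavior named in the issue title, combined with the _getSerializedSize()_ idea from the comment above, could look roughly like this. Everything here is a hypothetical sketch: the class name, the 9 MB safety margin, and the byte-array stand-in for serialized entities are illustrative assumptions, and the real Datastore commit call is stubbed out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: buffer entities and flush whenever adding the next one would
// exceed either the 500-entity count limit or a byte budget kept safely
// under the 10 MB per-request cap.
public class SizeAwareBuffer {
    static final int MAX_ENTITIES = 500;
    static final long MAX_BYTES = 9_000_000L; // margin under the 10 MB limit

    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    public int flushCount = 0; // commits issued, tracked here for illustration

    /** Add one serialized entity, flushing first if it would not fit. */
    public void add(byte[] entity) {
        long size = entity.length; // proxy for Entity.getSerializedSize()
        if (!buffer.isEmpty()
                && (buffer.size() >= MAX_ENTITIES || bufferedBytes + size > MAX_BYTES)) {
            flush();
        }
        buffer.add(entity);
        bufferedBytes += size;
    }

    /** Commit whatever is buffered; a real implementation would call Datastore here. */
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // datastore.commit(buffer) would go here
        flushCount++;
        buffer.clear();
        bufferedBytes = 0;
    }
}
```

With 1 MB entities this flushes every 9 entities instead of failing at 500, at the cost of measuring each entity's size on the way in.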



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)