You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@camel.apache.org by "Andrea Cosentino (Jira)" <ji...@apache.org> on 2021/09/23 11:52:00 UTC

[jira] [Resolved] (CAMEL-16594) DynamoDB stream updates are missed when there are more than one active shards

     [ https://issues.apache.org/jira/browse/CAMEL-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrea Cosentino resolved CAMEL-16594.
--------------------------------------
    Resolution: Fixed

> DynamoDB stream updates are missed when there are more than one active shards
> -----------------------------------------------------------------------------
>
>                 Key: CAMEL-16594
>                 URL: https://issues.apache.org/jira/browse/CAMEL-16594
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-aws
>            Reporter: Pierre-Yves Bigourdan
>            Assignee: Andrea Cosentino
>            Priority: Major
>             Fix For: 3.12.0
>
>         Attachments: shards.json
>
>
> The current Camel ddbstream implementation seems to incorrectly apply the concept of {{ShardIteratorType}} to the list of shards forming a DynamoDB stream rather than each shard individually.
> According to the [AWS documentation|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetShardIterator.html#DDB-streams_GetShardIterator-request-ShardIteratorType]:
> {noformat}
> ShardIteratorType determines how the shard iterator is used to start reading stream records from the shard.
> {noformat}
> For example, for a given shard, when {{ShardIteratorType}} equal to {{LATEST}}, the AWS SDK will read the most recent data in that particular shard. However, when {{ShardIteratorType}} equal to {{LATEST}}, Camel will additionally use {{ShardIteratorType}} to determine which shard it considers amongst all the available ones in the stream: https://github.com/apache/camel/blob/6119fdc379db343030bd25b191ab88bbec34d6b6/components/camel-aws/camel-aws2-ddb/src/main/java/org/apache/camel/component/aws2/ddbstream/ShardIteratorHandler.java#L132
> If my understanding is correct, shards in DynamoDB are modelled as a tree, with the child leaf nodes being the shards that are still active, i.e. the ones where new stream data will appear. These child shards will have a {{StartingSequenceNumber}}, but no {{EndingSequenceNumber}}.
> The most common case is to have a single shard, or a single branch of parent and child nodes:
> {noformat}
> Shard0
>    |
> Shard1
> {noformat}
> In the above case, new data will be added to {{Shard1}}, and the Camel implementation which  looks only at the last shard when {{ShardIteratorType}} is equal to {{LATEST}}, will be correct.
> However, the tree can also look like this (see related example in the attached JSON output from the AWS CLI, where the shard number matches the index in the JSON list):
> {noformat}
>              Shard0
>             /      \
>      Shard1          Shard2
>     /      \        /      \ 
> Shard3   Shard4  Shard5   Shard6
> {noformat}
> In this case, Camel will only consider Shard6, even though new data may be added to any of Shard3, Shard4, Shard5 or Shard6. This leads to updates being missed.
> As far as I can tell, DynamoDB will split into multiple shards depending on the number of table partitions, which will either grow for a table with huge amounts of data, or when an exiting table with provisioned capacity is migrated to on-demand provisioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)