Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/06/02 04:13:08 UTC

[GitHub] [druid] maytasm opened a new pull request #9965: API to verify a datasource has the latest ingested data

maytasm opened a new pull request #9965:
URL: https://github.com/apache/druid/pull/9965

API to verify a datasource has the latest ingested data

### Description

This PR addresses https://github.com/apache/druid/issues/5721

The existing loadstatus API reads segments from the Coordinator's SqlSegmentsMetadataManager, which caches segments in memory and updates them periodically. This can cause a race condition, because the API implementation compares segment metadata from that cache with the segments published by historicals. In particular, when there is a new ingestion after the initial load of the datasource, the cache still contains only the metadata of the old segments. The API then compares the list of old segments with what is published by historicals and reports that everything is available when the new segments are not actually available yet.
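The race described above can be illustrated with a minimal sketch. This is not Druid code: the segment names, the sets, and the `all_loaded` helper are hypothetical and only model the comparison between "segments we expect" and "segments historicals report as loaded".

```python
def all_loaded(expected_segments: set, loaded_segments: set) -> bool:
    """Return True if every expected segment is reported as loaded."""
    return expected_segments <= loaded_segments


# Hypothetical state right after a new ingestion task has published a segment.
cached_metadata = {"seg_old_1", "seg_old_2"}        # Coordinator cache, not yet refreshed
metadata_store = cached_metadata | {"seg_new_1"}    # what is actually published
loaded_on_historicals = {"seg_old_1", "seg_old_2"}  # new segment not loaded yet

# Comparing against the stale cache wrongly reports that everything is available.
stale_answer = all_loaded(cached_metadata, loaded_on_historicals)

# Comparing against freshly polled metadata gives the correct answer.
fresh_answer = all_loaded(metadata_store, loaded_on_historicals)
```

Here `stale_answer` is `True` while `fresh_answer` is `False`, which is the discrepancy this PR aims to remove.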

This new API will fix this problem. The new API will be able to do the following:
- The new API takes a datasource. It returns false if any used segment (of the past 2 weeks) of the given datasource is not available to be queried (i.e. not loaded onto a historical yet), and true otherwise. The 2-week interval is not finalized yet; we can decide later what a good default is.

- The (same) new API takes a datasource and a time interval (start + end). It returns false if any used segment (between the given start and end times) of the given datasource is not available to be queried (i.e. not loaded onto a historical yet), and true otherwise.

Note that both of the above are the same API; the time interval is an optional parameter. The time interval refers to the timestamps of the data in the segment (it has nothing to do with when the segment was ingested). It can be the same interval the user wants to query data from. For example, if the user wants to query from x to y, they can call this new API with the datasource and the interval x to y. This ensures that all segments of the datasource covering timestamps from x to y are ready to be queried (loaded onto historicals).
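A rough sketch of the interval check described above, under stated assumptions: the `Segment` class and `is_interval_loaded` function are hypothetical names, and treating "between start and end" as "the segment's data interval overlaps the half-open range [start, end)" is an assumption, not something this PR specifies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Set


@dataclass(frozen=True)
class Segment:
    segment_id: str
    start: datetime  # start of the data timestamps covered (inclusive)
    end: datetime    # end of the data timestamps covered (exclusive)


def is_interval_loaded(used_segments: Iterable[Segment], loaded_ids: Set[str],
                       start: datetime, end: datetime) -> bool:
    """Return False if any used segment overlapping [start, end) is not loaded."""
    for seg in used_segments:
        overlaps = seg.start < end and start < seg.end
        if overlaps and seg.segment_id not in loaded_ids:
            return False
    return True


used = [
    Segment("day1", datetime(2020, 6, 1), datetime(2020, 6, 2)),
    Segment("day2", datetime(2020, 6, 2), datetime(2020, 6, 3)),
]
loaded = {"day1"}  # "day2" has been published but is not on a historical yet

day1_ready = is_interval_loaded(used, loaded, datetime(2020, 6, 1), datetime(2020, 6, 2))
both_ready = is_interval_loaded(used, loaded, datetime(2020, 6, 1), datetime(2020, 6, 3))
```

With this data, `day1_ready` is `True` and `both_ready` is `False`: narrowing the interval to what the user actually intends to query lets the check succeed before unrelated segments finish loading.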

Important differences between this API and the existing coordinator loadstatus API:
- Takes a datasource (required) to be able to check faster (iterates over a smaller number of segments)
- Takes an interval (optional) to be able to check faster (iterates over a smaller number of segments)
- **IMPORTANT**. Takes a boolean firstCheck. If this is true, the API force-polls the metadata source to get the latest published segment information.

The workflow will be:

1) Submit the ingestion task.

2) Poll the task API until the task succeeds.

3) Poll the new API once with datasource, interval, and firstCheck=true. If it returns false, go to step 4; otherwise the data is available and the user can query.

4) Poll the new API with datasource, interval, and firstCheck=false until it returns true. Once it returns true, the data is available and the user can query.
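Steps 3 and 4 of the workflow can be sketched on the client side as follows. This is only an illustration: `wait_until_queryable` and the `check_loaded` callable are hypothetical, the poll interval and timeout are arbitrary, and the actual HTTP call to the new API is deliberately left to the caller because the endpoint is not described here.

```python
import time
from typing import Callable


def wait_until_queryable(check_loaded: Callable[[bool], bool],
                         poll_interval_s: float = 5.0,
                         timeout_s: float = 600.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the datasource is fully loaded or the timeout expires.

    check_loaded(first_check) is supplied by the caller and is assumed to call
    the new API with the given firstCheck value, returning its boolean result.
    """
    # Step 3: a single call with firstCheck=true forces a metadata poll.
    if check_loaded(True):
        return True

    # Step 4: subsequent calls use firstCheck=false until the API returns true.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        sleep(poll_interval_s)
        if check_loaded(False):
            return True
    return False


# Usage with a fake API that reports "loaded" on the third call.
calls = []
answers = iter([False, False, True])


def fake_check(first_check: bool) -> bool:
    calls.append(first_check)
    return next(answers)


ready = wait_until_queryable(fake_check, sleep=lambda _seconds: None)
```

In this run `ready` is `True` and `calls` is `[True, False, False]`, so the forced metadata poll happens exactly once and the cheaper check is used for all later polls.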

This PR has:
- [x] been self-reviewed.
- [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
