Posted to dev@druid.apache.org by Samarth Jain <sa...@gmail.com> on 2020/05/16 00:03:03 UTC

PRs awaiting review

Hi Druid Devs,

I wanted to bring the community's attention to a few PRs that are awaiting
review and what I believe are worthwhile features and fixes to have in OSS.

Add new round robin strategy for loading segments:
https://github.com/apache/druid/pull/9603/

This PR adds a new strategy that the Druid coordinator can use when deciding
which segment to load next. The current (and only) strategy is to prefer
loading newer segments first. For data ingested through a streaming indexing
service, loading newer segments onto the historicals first makes sense: it
takes pressure off the middle manager nodes by expediting the segment handoff
process. It also makes sense for batch ingestion, since users most likely want
to query the newest data first. However, this approach can cause pain in
certain cases. For example, if two datasources are ingested and one has newer
data than the other, segments of the second datasource may not get loaded for
a long time. To make things "fair", the approach added in the PR instead picks
segments by selecting datasources in round robin fashion. Within each
datasource, though, the strategy still makes sure that newer segments are
loaded first. We have been running this strategy in our clusters for a while
now, and it has served our large (on the order of a few TBs) ingest use cases
quite well.
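
A minimal sketch of the idea (illustrative only, not the actual PR code; the
Segment record below is a stand-in for Druid's real segment metadata):

    import java.util.ArrayDeque;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;

    // Stand-in for a real segment; real segments carry full intervals.
    record Segment(String datasource, long intervalStartMillis) {}

    class RoundRobinSegmentPicker
    {
      // One newest-first queue per datasource, visited in round robin order.
      private final ArrayDeque<PriorityQueue<Segment>> queues = new ArrayDeque<>();

      RoundRobinSegmentPicker(Map<String, List<Segment>> segmentsByDatasource)
      {
        for (List<Segment> segments : segmentsByDatasource.values()) {
          PriorityQueue<Segment> queue = new PriorityQueue<>(
              Comparator.comparingLong(Segment::intervalStartMillis).reversed());
          queue.addAll(segments);
          if (!queue.isEmpty()) {
            queues.add(queue);
          }
        }
      }

      // Next segment to load: the newest segment of the next datasource in line.
      Segment next()
      {
        PriorityQueue<Segment> queue = queues.poll();
        if (queue == null) {
          return null; // everything has been handed out
        }
        Segment segment = queue.poll();
        if (!queue.isEmpty()) {
          queues.add(queue); // rotate this datasource to the back of the line
        }
        return segment;
      }
    }

This keeps the "newest first" property per datasource while guaranteeing that
no datasource starves behind another one with fresher data.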

The second PR is for handling unknown complex types:
https://github.com/apache/druid/pull/9422

Recently, while upgrading our cluster, we ran into an issue where Druid SQL
broke because an incompatible change had been made in an aggregator extension.
While we obviously shouldn't be making incompatible changes in the first
place, it doesn't hurt to guard against them (especially for folks building
in-house Druid extensions), and in particular to prevent them from taking down
major functionality like Druid SQL, as happened in this case.
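
The gist of the guard, sketched with hypothetical names (the real change
lives in Druid's SQL schema code):

    import java.util.Set;

    // Hypothetical sketch: map an unrecognized complex type name to a
    // catch-all UNKNOWN instead of failing SQL schema initialization.
    enum SqlColumnType { LONG, DOUBLE, STRING, COMPLEX, UNKNOWN }

    class ColumnTypeResolver
    {
      private final Set<String> knownComplexTypeNames;

      ColumnTypeResolver(Set<String> knownComplexTypeNames)
      {
        this.knownComplexTypeNames = knownComplexTypeNames;
      }

      SqlColumnType resolveComplex(String complexTypeName)
      {
        // Without the guard: an unrecognized name (e.g. one renamed by an
        // extension upgrade) throws and breaks Druid SQL entirely.
        // With the guard: it degrades to UNKNOWN and the rest of the
        // table remains queryable.
        return knownComplexTypeNames.contains(complexTypeName)
               ? SqlColumnType.COMPLEX
               : SqlColumnType.UNKNOWN;
      }
    }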

I raised the third PR only today, but it is worth bringing to the community's
attention as I believe it addresses a long-standing issue.
https://github.com/apache/druid/pull/9877
Internally (and I would be surprised if this isn't common elsewhere), we have
lots of Hive Parquet tables whose timestamp column is of type int, storing the
time in the format yyyyMMdd. To ingest such a column as the Druid timestamp,
one would expect that specifying a date time format like "yyyyMMdd" would
suffice. Unfortunately, Druid's timestamp parser ignores the format when it
sees that the column is numeric, and instead interprets the value as a
timestamp in millis. So 20200521 in yyyyMMdd format ends up being interpreted
as 20200521 milliseconds since the epoch, which corresponds to the incorrect
datetime value of "Thu Jan 01 1970 05:36:40".
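
To make the two interpretations concrete (using java.time here rather than
Druid's actual parser):

    import java.time.Instant;
    import java.time.LocalDate;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    public class TimestampDemo
    {
      public static void main(String[] args)
      {
        long raw = 20200521L;

        // Current behavior for numeric columns: treat as epoch millis.
        System.out.println(Instant.ofEpochMilli(raw));
        // -> 1970-01-01T05:36:40.521Z

        // What the user intended with timestamp format "yyyyMMdd".
        LocalDate date = LocalDate.parse(
            Long.toString(raw), DateTimeFormatter.ofPattern("yyyyMMdd"));
        System.out.println(date.atStartOfDay(ZoneOffset.UTC).toInstant());
        // -> 2020-05-21T00:00:00Z
      }
    }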

Thanks,
Samarth

Re: PRs awaiting review

Posted by Gian Merlino <gi...@apache.org>.
Hey Samarth,

It looks like the last PR has been merged already — great!

I just wrote up a review for your first PR, about round robin segment loading.

I haven't had a chance to check out the unknown-complex-types PR yet;
apologies.

I'm now subscribed to them all, though.
