Posted to dev@druid.apache.org by Samarth Jain <sa...@gmail.com> on 2018/07/19 19:28:05 UTC

Issue with segments not loading/taking a long time

I am working on upgrading our internal cluster to the 0.12.1 release and am
seeing that a few data sources fail to load. Looking at the coordinator logs,
I am seeing messages like this for the datasource:

@400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
[Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon -
Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
segment[*datasource*
_2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-04-23T21:24:04.910Z]



@400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
[Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
available [_default_tier] servers or node capacity to assign primary
segment[*datasource*_2015-08-10T00:00:00.000Z_2015-08-11T00:00:00.000Z_2018-04-23T21:24:04.910Z]!
Expected Replicants[1]


The datasource failed to load for a long time before eventually loading
successfully. Has anyone else seen this? I see a few fixes around segment
loading and coordination in 0.12.2 (which I am hoping will be out soon),
but I am not sure whether they address this issue.
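
(Aside for anyone hitting the same WARN: one rough way to check whether the
tier really is out of capacity is to compare each historical's used and max
sizes as reported by the coordinator. A minimal sketch in Python, assuming
the requests library, a coordinator at http://coordinator:8081 (placeholder
host/port), and the /druid/coordinator/v1/servers?simple endpoint; field
names may vary slightly by version:

    import requests  # any HTTP client works; requests is just an example

    COORDINATOR = "http://coordinator:8081"  # placeholder host/port

    # ?simple returns one summary object per data server (host, tier,
    # currSize, maxSize, ...) instead of the full segment listings.
    servers = requests.get(COORDINATOR + "/druid/coordinator/v1/servers?simple").json()

    for s in servers:
        if s.get("tier") != "_default_tier":
            continue
        used, cap = s.get("currSize", 0), s.get("maxSize", 0)
        pct = 100.0 * used / cap if cap else 0.0
        print("%s: %d/%d bytes used (%.1f%%)" % (s.get("host"), used, cap, pct))

If every historical in the tier is close to its maxSize, the coordinator has
nowhere to assign the primary replica and will keep logging that WARN.)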

Re: Issue with segments not loading/taking a long time

Posted by Jihoon Son <ji...@apache.org>.
Hi Samarth,

IIRC, nothing has changed around loading local segments when historicals
start up.
The above log suggests that you have 4816 segments to load.
How long does it take to load all of them? How slow is it compared to
before?

Jihoon
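
(Separately from startup cache loading, the coordinator-assigned load queues
can be inspected directly, which helps tell the two apart. A minimal sketch
in Python, assuming the requests library, a coordinator at
http://coordinator:8081 (placeholder), and the
/druid/coordinator/v1/loadqueue?simple endpoint; exact field names in the
response may differ by version:

    import time
    import requests  # any HTTP client works

    COORDINATOR = "http://coordinator:8081"  # placeholder host/port

    def print_load_queues():
        # ?simple returns per-server counts rather than full segment lists.
        queues = requests.get(COORDINATOR + "/druid/coordinator/v1/loadqueue?simple").json()
        for server, q in queues.items():
            print(server,
                  "segmentsToLoad:", q.get("segmentsToLoad"),
                  "segmentsToDrop:", q.get("segmentsToDrop"))

    if __name__ == "__main__":
        while True:
            print_load_queues()
            time.sleep(60)

Segments a historical re-reads from its local cache at startup never go
through this queue, so a persistently deep queue here points at coordinator
assignment rather than startup loading.)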

On Thu, Jul 19, 2018 at 1:37 PM Samarth Jain <sa...@gmail.com> wrote:

> Thanks for the reply, Clint. It does look related.
>
> We also noticed that historicals are taking a long time to download the
> segments after a restart. At least in 0.10.1, restart of a historical
> wouldn't be a big deal as the segments it is responsible for serving were
> still available on the local disk and so after a restart, the segment load
> process would finish up fast.
> With 0.12.1, I am seeing that a restarted historical is spending a lot of
> time loading segments.
>
> Sample log line:
>
> *@400000005b50ee941f5ead84 2018-07-19T20:03:22,526 INFO
> [Segment-Load-Startup-5]
> io.druid.server.coordination.SegmentLoadDropHandler - Loading
>
> segment[3681/4816][xyz_druid_raw_2015-10-11T00:00:00.000Z_2015-10-12T00:00:00.000Z_2018-03-14T23:58:34.091Z]*
>
>
> Has something changed around how segment assignment works since 0.10.1
> that could cause this change in behavior? I could always dig into the code
> and try to figure it out, but since you have actively worked in this area
> recently, I thought you might have an idea.
>
> On Thu, Jul 19, 2018 at 1:26 PM, Clint Wylie <cl...@imply.io> wrote:
>
> > You might be running into something related to these issues
> > https://github.com/apache/incubator-druid/issues/5531 and
> > https://github.com/apache/incubator-druid/issues/5882, the former of which
> > should be fixed in 0.12.2. The effects of these issues can be at least
> > partially mitigated by setting maxSegmentsInNodeLoadingQueue and
> > maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html)
> > to limit how deep the load queues get and to minimize the number of bad
> > decisions the coordinator makes when a historical disappears due to a zk
> > blip, an upgrade, or anything else.
> >
> > On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <sa...@gmail.com>
> > wrote:
> >
> > > Hi Jihoon,
> > >
> > > I have a 6 node historical test cluster. 3 nodes are at ~80% and the
> > other
> > > two at ~60 and ~50% disk utilization.
> > >
> > > The interesting thing is that the 6th node ended up getting into zk
> > timeout
> > > (because of large GC pause) and is no longer part of the cluster (which
> > is
> > > a separate issue I am trying to figure out).
> > > On this 6th node, I see that it is busy loading segments. However, once
> > it
> > > is done downloading, I am not sure if it will report back to zk as
> being
> > > available.
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <gh...@gmail.com>
> wrote:
> > >
> > > > Hi Samarth,
> > > >
> > > > have you had a chance to check the segment balancing status of your
> > > > cluster?
> > > > Do you see any significant imbalance between historicals?
> > > >
> > > > Jihoon
> > > >
> > > > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <
> samarth.jain@gmail.com>
> > > > wrote:
> > > >
> > > > > I am working on upgrading our internal cluster to 0.12.1 release
> and
> > > > seeing
> > > > > that a few data sources fail to load. Looking at coordinator logs,
> I
> > am
> > > > > seeing messages like this for the datasource:
> > > > >
> > > > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > > > > [Coordinator-Exec--0] io.druid.server.coordinator.
> > CuratorLoadQueuePeon
> > > -
> > > > > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to
> > drop
> > > > > segment[*datasource*
> > > > >
> > > > > _2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-
> > > > 04-23T21:24:04.910Z]
> > > > >
> > > > >
> > > > >
> > > > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > > > > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule -
> > No
> > > > > available [_default_tier] servers or node capacity to assign
> primary
> > > > >
> > > > > segment[*datasource*-08-10T00:00:00.000Z_2015-08-11T00:00:
> > > > 00.000Z_2018-04-23T21:24:04.910Z]!
> > > > > Expected Replicants[1]
> > > > >
> > > > >
> > > > > The datasource failed to load for a long time and then eventually
> was
> > > > > loaded successfully. Has anyone else seen this? I see a few fixes
> > > around
> > > > > segment loading and coordination in 0.12.2 (which I am hoping will
> be
> > > out
> > > > > soon) but I am not sure if they address this issue.
> > > > >
> > > >
> > >
> >
>

Re: Issue with segments not loading/taking a long time

Posted by Samarth Jain <sa...@gmail.com>.
Thanks for the reply, Clint. It does look related.

We also noticed that historicals are taking a long time to download
segments after a restart. At least in 0.10.1, restarting a historical
wasn't a big deal: the segments it was responsible for serving were still
available on local disk, so the segment load process after a restart
finished quickly.
With 0.12.1, I am seeing that a restarted historical spends a lot of time
loading segments.

Sample log line:

*@400000005b50ee941f5ead84 2018-07-19T20:03:22,526 INFO
[Segment-Load-Startup-5]
io.druid.server.coordination.SegmentLoadDropHandler - Loading
segment[3681/4816][xyz_druid_raw_2015-10-11T00:00:00.000Z_2015-10-12T00:00:00.000Z_2018-03-14T23:58:34.091Z]*


Has something changed around how segment assignment works since 0.10.1
that could cause this change in behavior? I could always dig into the code
and try to figure it out, but since you have actively worked in this area
recently, I thought you might have an idea.
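
(One way to tell how far along a restarted historical is, without grepping
its logs, is to ask the node itself. A minimal sketch in Python, assuming
the requests library, a historical answering on http://historical-host:8083
(placeholder host/port), and that the /druid/historical/v1/loadstatus
endpoint is available in this version:

    import time
    import requests  # any HTTP client works

    HISTORICAL = "http://historical-host:8083"  # placeholder host/port

    # Poll until the historical reports that all segments in its local
    # cache have been loaded.
    while True:
        status = requests.get(HISTORICAL + "/druid/historical/v1/loadstatus").json()
        print("cacheInitialized:", status.get("cacheInitialized"))
        if status.get("cacheInitialized"):
            break
        time.sleep(30)

Timing how long this takes across restarts would give a concrete before/after
comparison between 0.10.1 and 0.12.1.)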

On Thu, Jul 19, 2018 at 1:26 PM, Clint Wylie <cl...@imply.io> wrote:

> You might be running into something related to these issues
> https://github.com/apache/incubator-druid/issues/5531 and
> https://github.com/apache/incubator-druid/issues/5882, the former of which
> should be fixed in 0.12.2. The effects of these issues can be at least
> partially mitigated by setting maxSegmentsInNodeLoadingQueue and
> maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html)
> to limit how deep the load queues get and to minimize the number of bad
> decisions the coordinator makes when a historical disappears due to a zk
> blip, an upgrade, or anything else.
>
> On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <sa...@gmail.com>
> wrote:
>
> > Hi Jihoon,
> >
> > I have a 6 node historical test cluster. 3 nodes are at ~80% and the
> other
> > two at ~60 and ~50% disk utilization.
> >
> > The interesting thing is that the 6th node ended up getting into zk
> timeout
> > (because of large GC pause) and is no longer part of the cluster (which
> is
> > a separate issue I am trying to figure out).
> > On this 6th node, I see that it is busy loading segments. However, once
> it
> > is done downloading, I am not sure if it will report back to zk as being
> > available.
> >
> >
> >
> >
> >
> > On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <gh...@gmail.com> wrote:
> >
> > > Hi Samarth,
> > >
> > > have you had a chance to check the segment balancing status of your
> > > cluster?
> > > Do you see any significant imbalance between historicals?
> > >
> > > Jihoon
> > >
> > > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <sa...@gmail.com>
> > > wrote:
> > >
> > > > I am working on upgrading our internal cluster to 0.12.1 release and
> > > seeing
> > > > that a few data sources fail to load. Looking at coordinator logs, I
> am
> > > > seeing messages like this for the datasource:
> > > >
> > > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > > > [Coordinator-Exec--0] io.druid.server.coordinator.
> CuratorLoadQueuePeon
> > -
> > > > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to
> drop
> > > > segment[*datasource*
> > > >
> > > > _2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-
> > > 04-23T21:24:04.910Z]
> > > >
> > > >
> > > >
> > > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > > > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule -
> No
> > > > available [_default_tier] servers or node capacity to assign primary
> > > >
> > > > segment[*datasource*-08-10T00:00:00.000Z_2015-08-11T00:00:
> > > 00.000Z_2018-04-23T21:24:04.910Z]!
> > > > Expected Replicants[1]
> > > >
> > > >
> > > > The datasource failed to load for a long time and then eventually was
> > > > loaded successfully. Has anyone else seen this? I see a few fixes
> > around
> > > > segment loading and coordination in 0.12.2 (which I am hoping will be
> > out
> > > > soon) but I am not sure if they address this issue.
> > > >
> > >
> >
>

Re: Issue with segments not loading/taking a long time

Posted by Clint Wylie <cl...@imply.io>.
You might be running into something related to these issues
https://github.com/apache/incubator-druid/issues/5531 and
https://github.com/apache/incubator-druid/issues/5882, the former of which
should be fixed in 0.12.2. The effects of these issues can be at least
partially mitigated by setting maxSegmentsInNodeLoadingQueue and
maxSegmentsToMove (http://druid.io/docs/latest/configuration/coordinator.html)
to limit how deep the load queues get and to minimize the number of bad
decisions the coordinator makes when a historical disappears due to a zk
blip, an upgrade, or anything else.
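
(To make that concrete: both settings live in the coordinator dynamic
configuration, which can be changed at runtime by POSTing the full config
document back to the coordinator. A hedged sketch in Python, assuming the
requests library and a coordinator at http://coordinator:8081; the values
below are placeholders, not recommendations:

    import requests  # any HTTP client works

    COORDINATOR = "http://coordinator:8081"  # placeholder host/port

    # Fetch the current dynamic config so only the two fields of interest change.
    cfg = requests.get(COORDINATOR + "/druid/coordinator/v1/config").json()

    # Example values only; tune for your cluster and segment sizes.
    cfg["maxSegmentsInNodeLoadingQueue"] = 100  # cap load queue depth per historical
    cfg["maxSegmentsToMove"] = 50               # limit balancing moves per coordinator run

    resp = requests.post(COORDINATOR + "/druid/coordinator/v1/config", json=cfg)
    resp.raise_for_status()
    print("dynamic config updated, HTTP", resp.status_code)

IIRC maxSegmentsInNodeLoadingQueue defaults to 0, i.e. unbounded, so setting
any finite value already limits how much work can pile up on one historical.)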

On Thu, Jul 19, 2018 at 1:10 PM, Samarth Jain <sa...@gmail.com>
wrote:

> Hi Jihoon,
>
> I have a 6 node historical test cluster. 3 nodes are at ~80% and the other
> two at ~60 and ~50% disk utilization.
>
> The interesting thing is that the 6th node ended up getting into zk timeout
> (because of large GC pause) and is no longer part of the cluster (which is
> a separate issue I am trying to figure out).
> On this 6th node, I see that it is busy loading segments. However, once it
> is done downloading, I am not sure if it will report back to zk as being
> available.
>
>
>
>
>
> On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <gh...@gmail.com> wrote:
>
> > Hi Samarth,
> >
> > have you had a chance to check the segment balancing status of your
> > cluster?
> > Do you see any significant imbalance between historicals?
> >
> > Jihoon
> >
> > On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <sa...@gmail.com>
> > wrote:
> >
> > > I am working on upgrading our internal cluster to 0.12.1 release and
> > seeing
> > > that a few data sources fail to load. Looking at coordinator logs, I am
> > > seeing messages like this for the datasource:
> > >
> > > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > > [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon
> -
> > > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
> > > segment[*datasource*
> > >
> > > _2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-
> > 04-23T21:24:04.910Z]
> > >
> > >
> > >
> > > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
> > > available [_default_tier] servers or node capacity to assign primary
> > >
> > > segment[*datasource*-08-10T00:00:00.000Z_2015-08-11T00:00:
> > 00.000Z_2018-04-23T21:24:04.910Z]!
> > > Expected Replicants[1]
> > >
> > >
> > > The datasource failed to load for a long time and then eventually was
> > > loaded successfully. Has anyone else seen this? I see a few fixes
> around
> > > segment loading and coordination in 0.12.2 (which I am hoping will be
> out
> > > soon) but I am not sure if they address this issue.
> > >
> >
>

Re: Issue with segments not loading/taking a long time

Posted by Samarth Jain <sa...@gmail.com>.
Hi Jihoon,

I have a 6 node historical test cluster. 3 nodes are at ~80% and the other
two at ~60 and ~50% disk utilization.

The interesting thing is that the 6th node ended up getting into zk timeout
(because of large GC pause) and is no longer part of the cluster (which is
a separate issue I am trying to figure out).
On this 6th node, I see that it is busy loading segments. However, once it
is done downloading, I am not sure if it will report back to zk as being
available.





On Thu, Jul 19, 2018 at 12:58 PM, Jihoon Son <gh...@gmail.com> wrote:

> Hi Samarth,
>
> have you had a chance to check the segment balancing status of your
> cluster?
> Do you see any significant imbalance between historicals?
>
> Jihoon
>
> On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <sa...@gmail.com>
> wrote:
>
> > I am working on upgrading our internal cluster to 0.12.1 release and
> seeing
> > that a few data sources fail to load. Looking at coordinator logs, I am
> > seeing messages like this for the datasource:
> >
> > @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> > [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon -
> > Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
> > segment[*datasource*
> >
> > _2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-
> 04-23T21:24:04.910Z]
> >
> >
> >
> > @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> > [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
> > available [_default_tier] servers or node capacity to assign primary
> >
> > segment[*datasource*-08-10T00:00:00.000Z_2015-08-11T00:00:
> 00.000Z_2018-04-23T21:24:04.910Z]!
> > Expected Replicants[1]
> >
> >
> > The datasource failed to load for a long time and then eventually was
> > loaded successfully. Has anyone else seen this? I see a few fixes around
> > segment loading and coordination in 0.12.2 (which I am hoping will be out
> > soon) but I am not sure if they address this issue.
> >
>

Re: Issue with segments not loading/taking a long time

Posted by Jihoon Son <gh...@gmail.com>.
Hi Samarth,

have you had a chance to check the segment balancing status of your cluster?
Do you see any significant imbalance between historicals?

Jihoon

On Thu, Jul 19, 2018 at 12:28 PM Samarth Jain <sa...@gmail.com>
wrote:

> I am working on upgrading our internal cluster to 0.12.1 release and seeing
> that a few data sources fail to load. Looking at coordinator logs, I am
> seeing messages like this for the datasource:
>
> @400000005b50dbc637061cec 2018-07-19T18:43:08,923 INFO
> [Coordinator-Exec--0] io.druid.server.coordinator.CuratorLoadQueuePeon -
> Asking server peon[/druid-test--001/loadQueue/127.0.0.1:7103] to drop
> segment[*datasource*
>
> _2015-09-03T00:00:00.000Z_2015-09-04T00:00:00.000Z_2018-04-23T21:24:04.910Z]
>
>
>
> @400000005b50dbc637391f84 2018-07-19T18:43:08,926 WARN
> [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - No
> available [_default_tier] servers or node capacity to assign primary
>
> segment[*datasource*-08-10T00:00:00.000Z_2015-08-11T00:00:00.000Z_2018-04-23T21:24:04.910Z]!
> Expected Replicants[1]
>
>
> The datasource failed to load for a long time and then eventually was
> loaded successfully. Has anyone else seen this? I see a few fixes around
> segment loading and coordination in 0.12.2 (which I am hoping will be out
> soon) but I am not sure if they address this issue.
>