Posted to dev@druid.apache.org by David Glasser <gl...@apollographql.com> on 2019/03/01 08:49:30 UTC

Namespacing segments, or preventing unknown segments from wreaking havoc

(I sent this message to druid-user last week and got no response. Since it
is proposing making improvements to Druid, I thought maybe it would be
appropriate to resend here. Hope that's OK.)

We had a big outage in our Druid cluster last week.  We run our Druid
servers in Kubernetes, and our historicals use machine-local SSDs for their
segment caches.  We made the unfortunate choice to have our production and
staging historicals share the same pool of machines, and last week we got
bit by this for the first time.

A production historical started up on a machine whose segment cache
contained segments from our staging cluster.  Our prod and staging clusters
use the same names for data sources.

This meant that these segments overshadowed production segments which
happened to have lower versions.  Worse, when
DruidCoordinatorCleanupOvershadowed kicked in, all of the overshadowed
production segments were marked used=false and quickly got dropped from
historicals. This ended up being the majority of our data.  We eventually
figured out what was going on and did a bunch of manual steps to clean up
(turning off and clearing the caches of the two historicals that had
staging segments on them, manually setting used=true for all entries in
druid_segments, and waiting a long, long time for data to re-download), but
figuring out what was going on was subtle. (I was very lucky that I had
randomly decided, just a few days before, to read a lot of the code about
how the `used` column works and how versioned timelines are calculated!)
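
To make the overshadowing mechanics concrete, here is a toy sketch (not
Druid's actual timeline code; the data source name and version strings are
made up) of how version comparison decides which segment wins for a given
data source and interval, and why the losers end up marked used=false:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: for the same data source and interval, the segment with the
// greatest version string shadows the rest. A stray staging segment with a
// newer version can therefore overshadow the production segment for that
// interval.
public class OvershadowDemo
{
  record Segment(String dataSource, String interval, String version) {}

  public static void main(String[] args)
  {
    List<Segment> segments = List.of(
        new Segment("events", "2019-02-01/2019-02-02", "2019-02-10T00:00:00.000Z"), // prod
        new Segment("events", "2019-02-01/2019-02-02", "2019-02-25T00:00:00.000Z")  // stray staging copy
    );

    // Pick the winning version per (dataSource, interval).
    Map<String, Segment> winners = new HashMap<>();
    for (Segment s : segments) {
      winners.merge(
          s.dataSource() + "|" + s.interval(),
          s,
          (oldVal, newVal) -> oldVal.version().compareTo(newVal.version()) >= 0 ? oldVal : newVal
      );
    }

    // Everything that is not a winner is overshadowed; a cleanup duty like
    // DruidCoordinatorCleanupOvershadowed would mark such segments used=false,
    // and historicals would then drop them.
    for (Segment s : segments) {
      boolean overshadowed = winners.get(s.dataSource() + "|" + s.interval()) != s;
      System.out.println(s + " overshadowed=" + overshadowed);
    }
  }
}
```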

(We were also lucky that we had turned off coordinator automatic killing
literally that morning!)

I feel like Druid should have been able to protect me from this to some
degree. (Yes, we are going to address the root cause by making it
impossible for prod and staging to reuse each other's disks.) Some thoughts
on changes that could have helped:

- Is it standard Druid practice to prepend the "cluster" name to the data
source name, so that conflicts like this are never possible?  We are
certainly tempted to do this now, but nobody ever told us to. If that's the
standard, should it be documented?

- Should clusters have an optional name/namespace, with DataSegments
recording that namespace and clusters refusing to handle segments they
find that are from a different namespace? This would be like the common
database setup where a single server/cluster has a set of databases which
each have a set of tables.

- Should historicals refuse to announce segments that don't exist in the
druid_segments table, or should coordinators/brokers/etc refuse to pay
attention to segments announced *by historicals* that don't exist in the
druid_segments table?  I'm going to guess this is difficult to do in the
historical because the historical probably doesn't actually talk to the SQL
DB at all? But maybe it could be done by the coordinator and broker?

--dave

Re: Namespacing segments, or preventing unknown segments from wreaking havoc

Posted by David Glasser <gl...@apollographql.com>.
Thanks, I've opened a proposal and taken Gian's suggestion into account.
https://github.com/apache/incubator-druid/issues/7180

Re: Namespacing segments, or preventing unknown segments from wreaking havoc

Posted by Gian Merlino <gi...@apache.org>.
To me this seems like a lot of effort to go through just to detect cases
where servers from two different clusters are misconfigured to read each
other's files or talk to each other by accident. I wonder if there's an
easier way to do it. Maybe keep the cluster name idea, but write it to a
marker file in any local storage directories that servers read on
bootstrap, and don't load up from them if the name is wrong?
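
A minimal sketch of that marker-file check, assuming a hypothetical
`cluster.name` marker file and helper names (none of this is existing Druid
code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical bootstrap-time guard: refuse to load a local segment cache
// directory whose marker file names a different cluster.
public class SegmentCacheMarker
{
  private static final String MARKER_FILE = "cluster.name";

  // Returns true if the directory may be used by a server configured with clusterName.
  static boolean mayUseCacheDir(Path cacheDir, String clusterName) throws IOException
  {
    Path marker = cacheDir.resolve(MARKER_FILE);
    if (!Files.exists(marker)) {
      // First use: claim the directory for this cluster.
      Files.writeString(marker, clusterName, StandardCharsets.UTF_8);
      return true;
    }
    String owner = Files.readString(marker, StandardCharsets.UTF_8).trim();
    // If another cluster's data lives here, skip bootstrap-loading it entirely.
    return owner.equals(clusterName);
  }
}
```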

Re: Namespacing segments, or preventing unknown segments from wreaking havoc

Posted by Jihoon Son <gh...@gmail.com>.
Thanks for the additional details.
It sounds pretty straightforward, and maybe it's better than poking the
database every time. It would be worth starting a discussion on GitHub by
raising a proposal if you think it's valuable.

Jihoon

Re: Namespacing segments, or preventing unknown segments from wreaking havoc

Posted by David Glasser <gl...@apollographql.com>.
Makes sense.

To elaborate a bit more on my "cluster name" concept, I actually think it
would be pretty straightforward:

- Add something like `druid.cluster.name=staging`.
- To be compatible with existing data, also add something like
`druid.cluster.allowSegmentsFromClusters=["", "dev"]`. Note that the empty
string is explicitly recognized here.
- Add a `clusterName` field to DataSegment. When creating a new segment,
set its clusterName field to the value of druid.cluster.name.
- Make various places that see DataSegments ignore and warn when presented
with segments whose cluster does not match druid.cluster.name or a value in
druid.cluster.allowSegmentsFromClusters. This would include
SegmentLoadDropHandler (which is what looks at the local cache in
historicals etc), operations that publish new segments, etc.

This might actually be simpler and more efficient than going to the
database each time, though the database approach could handle other related
issues I suppose.
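
A rough sketch of the filtering step described above; the `clusterName`
field and the config names are part of this proposal, not existing Druid
code, and the class and method names here are made up:

```java
import java.util.Set;

// Hypothetical guard applied wherever a DataSegment is seen (local-cache
// bootstrap, segment publishing, etc.): ignore and warn on segments whose
// cluster name is neither ours nor explicitly allowed.
public class ClusterNameFilter
{
  private final String clusterName;          // druid.cluster.name
  private final Set<String> allowedClusters; // druid.cluster.allowSegmentsFromClusters

  public ClusterNameFilter(String clusterName, Set<String> allowedClusters)
  {
    this.clusterName = clusterName;
    this.allowedClusters = allowedClusters;
  }

  // segmentClusterName would come from the new clusterName field on DataSegment;
  // segments written before the feature existed would carry "" (the empty string).
  public boolean accepts(String segmentClusterName)
  {
    return clusterName.equals(segmentClusterName)
           || allowedClusters.contains(segmentClusterName);
  }
}
```

SegmentLoadDropHandler (or whatever walks the local cache) would call
accepts() before announcing a cached segment, logging a warning and skipping
the segment when it returns false.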

Re: Namespacing segments, or preventing unknown segments from wreaking havoc

Posted by Jihoon Son <gh...@gmail.com>.
The broker learns about segments from historicals and tasks, even though a
PR was recently merged to keep published segments in memory in brokers
(https://github.com/apache/incubator-druid/pull/6901).
It probably makes sense to filter out segments in brokers too if they come
from historicals and are not in the metadata store.
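
A minimal sketch of that broker-side check, assuming a hypothetical set of
published segment ids kept in sync with the metadata store (the names below
are illustrative, not the actual classes from that PR):

```java
import java.util.Set;

// Hypothetical broker-side filter: only use a segment announced by a
// historical if it also appears in the broker's snapshot of published
// segments, refreshed periodically from the metadata store / coordinator.
public class BrokerSegmentFilter
{
  private final Set<String> publishedSegmentIds;

  public BrokerSegmentFilter(Set<String> publishedSegmentIds)
  {
    this.publishedSegmentIds = publishedSegmentIds;
  }

  public boolean shouldUse(String segmentId, boolean announcedByHistorical)
  {
    // Segments served by ingestion tasks may legitimately not be published
    // yet, so only apply the check to historical announcements.
    return !announcedByHistorical || publishedSegmentIds.contains(segmentId);
  }
}
```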

Jihoon

Re: Namespacing segments, or preventing unknown segments from wreaking havoc

Posted by David Glasser <gl...@apollographql.com>.
That makes sense. Do the coordinator's decisions about which segments are
'used' affect the broker's choices for routing queries, or does the broker
just learn about things directly from historicals/ingestion tasks (via...
zookeeper?)

--dave

Re: Namespacing segments, or preventing unknown segments from wreaking havoc

Posted by Jihoon Son <gh...@gmail.com>.
Hi Dave,

I think the third option sounds most reasonable for fixing this issue,
though the second option sounds useful in general.
And yes, it wouldn't be easy to refuse to announce unknown segments in
historicals.
I think it makes more sense to check only in the coordinator, because it's
the only node that directly accesses the metadata store (except the
overlord).
So, the coordinator could skip updating the "used" flag if the overshadowing
segments are not in the metadata store.
In stream ingestion, segments might not be in the metadata store until they
are published. However, this shouldn't be a problem because segments are
always appended in stream ingestion.
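
A minimal sketch of that coordinator-side check, assuming a hypothetical
view of which segment ids exist in druid_segments (the class and method
names are made up, not the actual cleanup duty's API):

```java
import java.util.Map;
import java.util.Set;

// Hypothetical version of the overshadowed-cleanup decision: only mark a
// segment unused if every segment overshadowing it is itself known to the
// metadata store.
public class CleanupOvershadowedCheck
{
  private final Set<String> segmentIdsInMetadataStore; // contents of druid_segments

  public CleanupOvershadowedCheck(Set<String> segmentIdsInMetadataStore)
  {
    this.segmentIdsInMetadataStore = segmentIdsInMetadataStore;
  }

  // overshadowedBy maps an overshadowed segment id to the ids of the segments
  // shadowing it (as computed from the versioned timeline).
  public boolean mayMarkUnused(String segmentId, Map<String, Set<String>> overshadowedBy)
  {
    Set<String> shadows = overshadowedBy.getOrDefault(segmentId, Set.of());
    // If any overshadowing segment is unknown to druid_segments (e.g. a stray
    // segment from another cluster's cache), leave the "used" flag alone.
    return shadows.stream().allMatch(segmentIdsInMetadataStore::contains);
  }
}
```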

Jihoon
