Posted to dev@metron.apache.org by Casey Stella <ce...@gmail.com> on 2017/09/25 16:37:33 UTC

[DISCUSS] Splitting up the Indexing Topology

One of the lessons that has bubbled up from some performance analysis is
that having the ES and HDFS writers share the same indexing topology can
be problematic from a tuning perspective.  Specifically, it's hard to
square that circle and make both perform fast enough to avoid significant
back-pressure in Kafka (and often Commit Exceptions in the Kafka spout).

I wanted to get the community's opinion about the possibility of separating
the two current writers into separate topologies which could be tuned
separately.

Pros:

   - Practically speaking, tuning separately is often a lot easier than
   trying to tune together
   - This gives us the beginnings of an abstraction that may be
   reusable to expose new indexers to Metron

Cons:

   - It has the potential to mask a problem.  We may want to ensure that
   the writers write at the same rate and don't get far ahead of one another.
   In the current setup, this is inherent in the design.  If we separate them,
   they may be reading at different rates and one index may get ahead of the
   other.
   - The management pack section around indexing would need to be
   reconsidered if we split them up
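
To make the first con concrete, here is a minimal sketch, assuming each
writer topology commits its Kafka offsets under its own consumer group.
The function name and the offset values are hypothetical; in practice the
committed offsets would come from whatever tooling reports them per group.

```python
from typing import Dict


def writer_drift(es_committed: Dict[int, int],
                 hdfs_committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition offset difference between the two writers.

    A positive value means the ES writer is ahead of the HDFS writer on
    that partition; a negative value means it is behind.
    Only partitions reported by both groups are compared.
    """
    shared = es_committed.keys() & hdfs_committed.keys()
    return {p: es_committed[p] - hdfs_committed[p] for p in sorted(shared)}


# Hypothetical committed offsets, keyed by partition number.
es_offsets = {0: 1500, 1: 1480, 2: 1510}
hdfs_offsets = {0: 1200, 1: 1475, 2: 1510}

for partition, drift in writer_drift(es_offsets, hdfs_offsets).items():
    print(f"partition {partition}: ES ahead of HDFS by {drift} messages")
```

In a single shared topology this difference stays near zero by design;
with separate topologies it is something that would have to be watched.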

Personally, I'm strongly in favor of splitting them up, but I want to make
sure that we don't miss an important nuance here.  The first con is
concerning to me, but I'd argue that another lesson from performance tuning
is that we need to monitor the average partition lag over time in the
management UI for the various consumer groups and ensure that writing keeps
up with reading.  If we insist on this assertion being true for all healthy
Metron installations, the primary con goes away in my mind.
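
As a rough illustration of the check described above, the sketch below
computes the average partition lag for one consumer group and reports
whether it stayed under a threshold across a window of samples.  The
function names, the threshold, and the sample offsets are all assumptions
made for illustration; they are not Metron or Kafka APIs.

```python
from statistics import mean
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int],
                  committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: log end offset minus the group's committed offset.

    Assumption: a partition with no committed offset is treated as
    committed at 0, so its whole log counts as lag.
    """
    return {p: max(end - committed.get(p, 0), 0)
            for p, end in end_offsets.items()}


def average_lag(end_offsets: Dict[int, int],
                committed: Dict[int, int]) -> float:
    """Average lag across all partitions of the topic."""
    return mean(list(partition_lag(end_offsets, committed).values()))


def is_keeping_up(avg_lag_samples: List[float], max_avg_lag: float) -> bool:
    """True if every sampled average lag is at or below the threshold."""
    return all(sample <= max_avg_lag for sample in avg_lag_samples)


# Hypothetical samples taken at three points in time for one group.
samples = [
    average_lag({0: 100, 1: 200}, {0: 90, 1: 170}),
    average_lag({0: 150, 1: 260}, {0: 130, 1: 230}),
    average_lag({0: 210, 1: 330}, {0: 200, 1: 310}),
]
print(samples, is_keeping_up(samples, max_avg_lag=50))
```

Running the same check once per consumer group would cover both writers
after a split.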

Anyway, I'm sure I've missed some pros and cons, so it'd be great to hear
community feedback here.  Thoughts?

Re: [DISCUSS] Splitting up the Indexing Topology

Posted by Michael Miklavcic <mi...@gmail.com>.
+1 to the split. I also feel it's much easier to dissect problems when
these actions are separated. It's also easier to fine-tune each
independently, which may have additional performance benefits.

M

On Mon, Sep 25, 2017 at 5:31 PM, James Sirota <js...@apache.org> wrote:

> I have experienced issues with ES and HDFS indexing in production and have
> previously split out the topologies into two separate topologies.  As you
> state, the benefits of this approach are (a) tuning each topology
> separately, (b) ability to attribute problems to a specific topology (why
> is something slow?) and (c) graceful degradation.  When ES, for example,
> fails partially or catastrophically and your ES topology goes all kinds of
> crazy, the HDFS topology keeps humming along unaffected.  Once METRON-1205
> is in you will be able to re-index into ES (or potentially other sources)
> from HDFS at will.  The major con for this architecture is that there is a
> greater chance that all your data sources will get out of sync because they
> index/store data at different rates.  But even given that, I would vote +1
> on splitting out the topologies.

Re: [DISCUSS] Splitting up the Indexing Topology

Posted by James Sirota <js...@apache.org>.
I have experienced issues with ES and HDFS indexing in production and have previously split out the topologies into two separate topologies.  As you state, the benefits of this approach are (a) tuning each topology separately, (b) ability to attribute problems to a specific topology (why is something slow?) and (c) graceful degradation.  When ES, for example, fails partially or catastrophically and your ES topology goes all kinds of crazy, the HDFS topology keeps humming along unaffected.  Once METRON-1205 is in you will be able to re-index into ES (or potentially other sources) from HDFS at will.  The major con for this architecture is that there is a greater chance that all your data sources will get out of sync because they index/store data at different rates.  But even given that, I would vote +1 on splitting out the topologies.


------------------- 
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org