Posted to dev@htrace.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/02/26 21:57:16 UTC

HTrace for Nutch 2.x Search Stack

Hi Folks,
Right now we are able to enable TRACE-level logging through the Log4j
configuration file in Nutch 2.X [0].
Nutch 2.X offers a storage abstraction for Webpage and Host data
through the use of Apache Gora.
Gora backend support includes:

   - Apache Avro 1.7.6
   - Apache Hadoop 1.2.1 and 2.5.2
   - Apache HBase 0.98.8-hadoop2
   - Apache Cassandra 2.0.2
   - Apache Solr 4.10.3
   - MongoDB 2.6.X
   - Apache Accumulo 1.5.1

The Nutch 2.X search stack consists of Nutch, Gora, optionally Solr
and/or Elasticsearch for indexing, and optionally Hadoop for
running it all on.

As a developer of Nutch and Gora, I am very keen to see a more verbose tracing
mechanism (potentially with some visualization) for debugging our search
stack deployments. Can I please kick off a conversation on what the
differences and benefits of HTrace are over log4j TRACE-level logging?

Once I understand the above, I would like to obtain advice on the
integration points for having HTrace as the tracing mechanism for the stack
I describe above.
Thanks
Lewis

[0] https://github.com/apache/nutch/blob/2.x/conf/log4j.properties
-- 
*Lewis*

Re: HTrace for Nutch 2.x Search Stack

Posted by Abraham Elmahrek <ab...@cloudera.com>.
Lewis,

Thought I'd add my two cents as well :).

As Colin stated, we're working on presenting a graph for visualizing the
parent/child relationships. We're going to continue refining the search
page and are still looking for the right look and feel.

A GSoC project would be cool. Adding new visualizations (such as a
histogram, or charts of spans by description, start time, end
time, etc.) might work.

If you're interested, feel free to share some ideas/thoughts on
visualizations that would be useful to you.

-Abe

On Fri, Feb 27, 2015 at 1:31 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Thanks all for the info here.
> I am going to press forward with investigating HTrace for the Nutch 2.X
> Search Stack.
> Lewis

Re: HTrace for Nutch 2.x Search Stack

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Thanks all for the info here.
I am going to press forward with investigating HTrace for the Nutch 2.X
Search Stack.
Lewis


On Fri, Feb 27, 2015 at 11:58 AM, Colin P. McCabe <cm...@apache.org>
wrote:

> Hi Lewis,
>
> Good questions.  I would say HTrace differs from TRACE logging (or
> other single-node metrics, JMX, audit logs, etc.) in that it pulls
> together information from across the cluster.  This is something that
> is a major pain point when using a distributed system such as HDFS.
> Just to diagnose a slow write, you might have to match up a client
> log with the logs of 3 different datanodes.  The big idea behind
> HTrace is twofold: integrating those logging sources, and using
> sampling to instrument performance in production.  The main thing
> HTrace deals with is "spans", which are lengths of time.
>
> We're working on a web UI that will allow people to search for spans
> by time, duration, and name (among other things).  It's not quite
> finished now (hoping to have something usable in HTrace 3.2.0 or maybe
> 3.3.0... but Abe can comment more on that.)
>
> Here's an early screenshot (probably way out of date now):
>
> https://issues.apache.org/jira/secure/attachment/12689757/Search%20page%20skeleton%20-%200.png
>
> There is also a plan to create a visualization of parent/child
> relationships on the web UI, by using the d3 library (which can draw
> graphs, and do many other things besides.)
>
> In the meantime, there's an option to produce a graphviz file from a
> file containing span JSON.  That way you can draw a graph of
> parent/child relationships with the "dot" tool, available on Linux.
> Uh... unfortunately it's broken right now... let me file a JIRA for
> that :P  This is a very new feature, got added earlier this week.
>
> The web UI is a great place to get involved right now... there is a
> lot of work going on there and we've been adding new contributors.
>
> Colin



-- 
*Lewis*

Re: HTrace for Nutch 2.x Search Stack

Posted by "Colin P. McCabe" <cm...@apache.org>.
Hi Lewis,

Good questions.  I would say HTrace differs from TRACE logging (or
other single-node metrics, JMX, audit logs, etc.) in that it pulls
together information from across the cluster.  This is something that
is a major pain point when using a distributed system such as HDFS.
Just to diagnose a slow write, you might have to match up a client
log with the logs of 3 different datanodes.  The big idea behind
HTrace is twofold: integrating those logging sources, and using
sampling to instrument performance in production.  The main thing
HTrace deals with is "spans", which are lengths of time.
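
As a rough illustration of those two ideas, here is a minimal Python
sketch. It does not use the HTrace API: should_sample, group_by_trace,
and the span field names (trace_id, node, name, begin_ms, end_ms) are
hypothetical and chosen only for this example, as are the sample
timings. The first function makes a probabilistic sampling decision per
request; the second pulls spans reported by separate nodes into one
time-ordered list per trace.

```python
import random
from collections import defaultdict


def should_sample(probability, rng=random):
    """Return True for roughly `probability` of all requests (hypothetical sampler)."""
    return rng.random() < probability


def group_by_trace(spans):
    """Group spans from different nodes by trace ID, ordered by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    for trace_spans in traces.values():
        trace_spans.sort(key=lambda s: s["begin_ms"])
    return dict(traces)


# Invented spans for one slow write: a client plus three datanodes.
reported = [
    {"trace_id": 7, "node": "datanode-2", "name": "receiveBlock", "begin_ms": 130, "end_ms": 410},
    {"trace_id": 7, "node": "client", "name": "write", "begin_ms": 100, "end_ms": 450},
    {"trace_id": 7, "node": "datanode-3", "name": "receiveBlock", "begin_ms": 140, "end_ms": 180},
    {"trace_id": 7, "node": "datanode-1", "name": "receiveBlock", "begin_ms": 110, "end_ms": 170},
]

timeline = group_by_trace(reported)[7]
# Ignore the enclosing client span and find the slowest datanode.
slowest = max(
    (s for s in timeline if s["node"] != "client"),
    key=lambda s: s["end_ms"] - s["begin_ms"],
)
print([s["node"] for s in timeline], "slowest:", slowest["node"])
```

With plain log files, the same answer would mean reading four separate
logs and lining up timestamps by hand.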

We're working on a web UI that will allow people to search for spans
by time, duration, and name (among other things).  It's not quite
finished now (hoping to have something usable in HTrace 3.2.0 or maybe
3.3.0... but Abe can comment more on that.)

Here's an early screenshot (probably way out of date now):
https://issues.apache.org/jira/secure/attachment/12689757/Search%20page%20skeleton%20-%200.png

There is also a plan to create a visualization of parent/child
relationships on the web UI, by using the d3 library (which can draw
graphs, and do many other things besides.)

In the meantime, there's an option to produce a graphviz file from a
file containing span JSON.  That way you can draw a graph of
parent/child relationships with the "dot" tool, available on Linux.
Uh... unfortunately it's broken right now... let me file a JIRA for
that :P  This is a very new feature, got added earlier this week.
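
For readers who want a picture of what such a conversion involves, here
is a small Python sketch of turning span JSON into Graphviz input. It is
not the HTrace tool itself, and the input layout is an assumption: one
JSON object per line with hypothetical fields span_id, description, and
parents. The real HTrace span format may use different keys.

```python
import json


def dot_quote(text):
    """Quote a string for use as a DOT identifier or label."""
    return '"' + str(text).replace("\\", "\\\\").replace('"', '\\"') + '"'


def spans_to_dot(json_lines):
    """Return DOT text with one node per span and one edge per parent link."""
    out = ["digraph spans {"]
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        span = json.loads(raw)
        node = dot_quote(span["span_id"])
        out.append("  %s [label=%s];" % (node, dot_quote(span["description"])))
        for parent in span.get("parents", []):
            # Edge points from the parent span to the child span.
            out.append("  %s -> %s;" % (dot_quote(parent), node))
    out.append("}")
    return "\n".join(out)


# Invented sample input; the field names are assumptions, not the HTrace format.
sample = [
    '{"span_id": "a1", "description": "write", "parents": []}',
    '{"span_id": "b2", "description": "receiveBlock", "parents": ["a1"]}',
]
dot_text = spans_to_dot(sample)
print(dot_text)
```

Saving the output as spans.dot and running "dot -Tpng spans.dot -o
spans.png" should render the parent/child graph.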

The web UI is a great place to get involved right now... there is a
lot of work going on there and we've been adding new contributors.

Colin

On Thu, Feb 26, 2015 at 1:46 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> Hi Nick,
>
> Grand. Thank you
>
> What does visualization look like right now? Is there currently a
> mechanism for visualizing HTrace structures?
> Is it worth considering posting something like this as a GSoC project if
> one does not currently exist?
> Thanks
> Lewis

Re: HTrace for Nutch 2.x Search Stack

Posted by Nick Dimiduk <nd...@gmail.com>.
Visualization is a work in progress, though I'm not the guy to ask. Abe is
the man with the plan on this front.

The GSoC application deadline has passed, hasn't it?

On Thu, Feb 26, 2015 at 1:46 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Nick,
>
> Grand. Thank you
>
> What does visualization look like right now? Is there currently a
> mechanism for visualizing HTrace structures?
> Is it worth considering posting something like this as a GSoC project if
> one does not currently exist?
> Thanks
> Lewis

Re: HTrace for Nutch 2.x Search Stack

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Nick,

Grand. Thank you

What does visualization look like right now? Is there currently a
mechanism for visualizing HTrace structures?
Is it worth considering posting something like this as a GSoC project if
one does not currently exist?
Thanks
Lewis


On Thu, Feb 26, 2015 at 1:31 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Hi Lewis,
>
>

Re: HTrace for Nutch 2.x Search Stack

Posted by Nick Dimiduk <nd...@gmail.com>.
Hi Lewis,

The results of TRACE level logging will give you messages from the various
system components according to whatever those developers thought relevant
for exposing from the application. Assembling the HTrace spans from a
single trace will give you something similar, but structured according to
the major boundaries within the system. The original intention is for
tracing a single request through multiple RPCs across multiple systems. At
the very least, you'd see these boundary points and the amount of time
spent in each component. Each component's developers are free to use the
HTrace API to attach "annotations" to their trace spans. These annotations
are arbitrary messages, accompanied by timestamps, included at the whim of
the projects' developers.

I guess the big difference is that trace spans collected by HTrace are a
bit more structured than simple log messages. There's a parent-child
relationship between spans, and this hierarchy exists across threads within
a process and across processes within a distributed application.
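
To make the structure concrete, here is a minimal Python sketch of spans
with parent-child links and timestamped annotations. It models the idea
only and is not the HTrace API; the Span class, render_tree, and the
Nutch/Gora/HBase span names and timings are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: int
    description: str
    process: str                      # which component recorded the span
    begin_ms: int
    end_ms: int
    parent_id: Optional[int] = None   # None marks the root of a trace
    annotations: list = field(default_factory=list)  # (timestamp_ms, message)

    def annotate(self, timestamp_ms, message):
        """Attach an arbitrary timestamped message to this span."""
        self.annotations.append((timestamp_ms, message))

    @property
    def duration_ms(self):
        return self.end_ms - self.begin_ms


def render_tree(spans):
    """Return indented lines showing the span hierarchy and time per span."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    out = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda s: s.begin_ms):
            out.append("%s%s [%s] %d ms" % ("  " * depth, s.description, s.process, s.duration_ms))
            for ts, msg in s.annotations:
                out.append("%s@%d %s" % ("  " * (depth + 1), ts, msg))
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return out


# Invented example: one request crossing three components.
root = Span(1, "fetch", "nutch", 0, 300)
store = Span(2, "DataStore.put", "gora", 50, 250, parent_id=1)
rpc = Span(3, "put RPC", "hbase-regionserver", 80, 220, parent_id=2)
rpc.annotate(120, "WAL sync started")

tree = render_tree([rpc, root, store])
print("\n".join(tree))
```

The spans can arrive in any order and from any process; the parent IDs
alone are enough to rebuild the hierarchy, which is what a flat stream of
log messages lacks.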

Maybe someone else can do a better job explaining...

-n

On Thu, Feb 26, 2015 at 12:57 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Folks,
> Right now we are able to enable TRACE-level logging through the Log4j
> configuration file in Nutch 2.X [0].
> Nutch 2.X offers a storage abstraction for Webpage and Host data
> through the use of Apache Gora.
> Gora backend support includes:
>
>    - Apache Avro 1.7.6
>    - Apache Hadoop 1.2.1 and 2.5.2
>    - Apache HBase 0.98.8-hadoop2
>    - Apache Cassandra 2.0.2
>    - Apache Solr 4.10.3
>    - MongoDB 2.6.X
>    - Apache Accumulo 1.5.1
>
> The Nutch 2.X search stack consists of Nutch, Gora, optionally Solr
> and/or Elasticsearch for indexing, and optionally Hadoop for
> running it all on.
>
> As a developer of Nutch and Gora, I am very keen to see a more verbose tracing
> mechanism (potentially with some visualization) for debugging our search
> stack deployments. Can I please kick off a conversation on what the
> differences and benefits of HTrace are over log4j TRACE-level logging?
>
> Once I understand the above, I would like to obtain advice on the
> integration points for having HTrace as the tracing mechanism for the stack
> I describe above.
> Thanks
> Lewis
>
> [0] https://github.com/apache/nutch/blob/2.x/conf/log4j.properties
> --
> *Lewis*
>