Posted to dev@aurora.apache.org by Ahmed Aley <ah...@cs.umu.se> on 2014/05/22 08:15:09 UTC

GSoC work for Aurora and some updates

Hi,

I am Ahmed Ali-Eldin <https://www8.cs.umu.se/~ahmeda/>, a PhD student at
Umeå University, Sweden (it is up north
<http://tools.wmflabs.org/geohack/geohack.php?pagename=Ume%C3%A5&params=63_49_30_N_20_15_50_E_type:city%2879594%29_region:SE> :) ).
I am working with @MarkCC on integrating a distributed logging
framework with Aurora and building an analytics framework on top to analyze
the logged data.
We started off by looking into different logging frameworks
(Kafka <http://kafka.apache.org/>,
Scribe <https://github.com/facebook/scribe>,
Chukwa <https://chukwa.apache.org/>,
Suro <http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html>,
Calligraphus <http://www-conf.slac.stanford.edu/xldb2011/talks/xldb2011_tue_0940_facebookrealtimeanalytics.pdf> and
Flume <http://flume.apache.org/>). Out of these, we chose Suro coupled with
Kafka for several reasons:
i- It has been built to scale up and down (elastic).
ii- It is quite flexible, with a Kafka sink that gives us access to all
Kafka sinks.
iii- It has an S3 sink, making it a suitable solution for more scenarios.
iv- I got a tip from someone I know at Netflix on Suro benchmarking results.
v- It is an active project.

Based on the above, I have started some experiments with Suro and will be
looking at its integration with Aurora this weekend. I cannot yet say
whether Suro (coupled with Kafka) is "the best" solution for distributed
logging, but it looks very promising so far. I will hopefully send some
results/updates late next week.

Best,
--Ahmed

Re: GSoC work for Aurora and some updates

Posted by Ahmed Aley <ah...@cs.umu.se>.

Hi Bill,

Sorry for the slow response. I am in the Swedish High
Coast <http://en.wikipedia.org/wiki/The_High_Coast> this
week with a weak connection. This is the proposal I had for GSoC.
*Proposed idea:*
I would like to propose an idea based on Mark's (@MarkCC) ideas described
in AURORA-256 <https://issues.apache.org/jira/browse/AURORA-256> and
AURORA-257 <https://issues.apache.org/jira/browse/AURORA-257>, with
a few additions.

1- First, we identify the requirements for Aurora's logging system.
We study existing solutions such as Flume <http://flume.apache.org/>,
Scribe <https://github.com/facebook/scribe/wiki>,
Chukwa <https://chukwa.apache.org/> and
Kafka <http://kafka.apache.org/>. The results shall be put in a report
similar to the Wikimedia Foundation's report on their choice of a logging
solution
<https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation>.
The report should be ready for submission by the end of the bonding
period.
We will also identify any missing functionality in the chosen system.

2- We start working on implementing any missing functionality and
integrating the chosen system with Aurora. Any functionality added to
the logging system shall be pushed to the respective open-source project.

3- Once we have the logging system in place, we will design and build the
analytics module. This system will support both simple queries such as the
example given by Mark, "Show all of the update commands that resulted in a
rollback between 12:00 and 2pm." and more complex ones like "show the
correlation between failures and number of jobs" or "Detect anomalies in
the logged data for the past 10 days" or "What is the distribution of job
execution times". The analytics tool(s) will be written mainly in Python
and R, bound together by RPy2, while leveraging the power of MapReduce (when
needed). The tool will be modular to allow for future extensions and
updates. The analysis reports will be in both textual and visual formats,
e.g., histograms, box plots, CDFs and so on, to help Aurora users and
cluster managers make informed decisions.
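
As a rough illustration of the simple-query case, here is a minimal Python
sketch of the rollback query quoted above. It is only a sketch: the event
fields (command, job, outcome, timestamp), the sample records, and the
function name rolled_back_updates are all hypothetical, since neither the log
schema nor the query interface has been decided yet.

```python
from datetime import datetime

# Hypothetical log events. The field names and values are assumptions made
# for illustration; they are not an actual Aurora or Suro log schema.
EVENTS = [
    {"command": "update", "job": "www/prod/frontend", "outcome": "rollback",
     "timestamp": datetime(2014, 5, 20, 12, 30)},
    {"command": "update", "job": "www/prod/backend", "outcome": "success",
     "timestamp": datetime(2014, 5, 20, 13, 0)},
    {"command": "update", "job": "www/prod/cache", "outcome": "rollback",
     "timestamp": datetime(2014, 5, 20, 15, 45)},
    {"command": "kill", "job": "www/prod/frontend", "outcome": "success",
     "timestamp": datetime(2014, 5, 20, 12, 45)},
]


def rolled_back_updates(events, start, end):
    """Return update events that ended in a rollback with start <= timestamp < end."""
    return [
        event for event in events
        if event["command"] == "update"
        and event["outcome"] == "rollback"
        and start <= event["timestamp"] < end
    ]


if __name__ == "__main__":
    # "Between 12:00 and 2pm" on an arbitrary example day.
    window_start = datetime(2014, 5, 20, 12, 0)
    window_end = datetime(2014, 5, 20, 14, 0)
    for event in rolled_back_updates(EVENTS, window_start, window_end):
        print(event["job"], event["timestamp"].isoformat())
```

In the real module the events would come from the logging system rather than
an in-memory list, and the filter would likely be pushed down to wherever the
logs are stored.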

Best,
--Ahmed

On Sat, May 24, 2014 at 7:56 AM, Bill Farner <wf...@apache.org> wrote:

> Welcome, Ahmed!  Cool stuff!
>
> Is there a design doc or mission statement that you and Mark are working
> off?
>
> -=Bill

Re: GSoC work for Aurora and some updates

Posted by Dave Lester <da...@gmail.com>.

Welcome, Ahmed! I look forward to following your work this summer.

Dave


On Fri, May 23, 2014 at 10:56 PM, Bill Farner <wf...@apache.org> wrote:

> Welcome, Ahmed!  Cool stuff!
>
> Is there a design doc or mission statement that you and Mark are working
> off?
>
> -=Bill
>

Re: GSoC work for Aurora and some updates

Posted by Bill Farner <wf...@apache.org>.

Welcome, Ahmed!  Cool stuff!

Is there a design doc or mission statement that you and Mark are working
off?

-=Bill

