Posted to user@accumulo.apache.org by Peter Coetzee <pe...@coetzee.org> on 2014/10/20 10:00:51 UTC

New research using Accumulo: Unified Secure On-/Off-line Analytics

New open-access research published in the journal Parallel Computing
demonstrates a novel approach to engineering analytics for deployment in
streaming and batch contexts.

Increasing numbers of users are extracting real value from their data using
tools like IBM InfoSphere Streams for near-real-time analysis and Apache
Spark across their historical data in Accumulo.

Until now, there hasn't been an approach which permits the use of these
tools from a single shared codebase, with deployment considerations
reserved until deployment time. Furthermore, it has been even harder to
permit this unified analysis while maintaining cell-level traces of the
security heritage for each datum an analytic produces.
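
As a rough illustration of what a cell-level security trace could mean in
practice, the sketch below derives a label for an output from the labels of
the cells it was computed from. It assumes the simple rule "an output is
readable only by someone who can read every input", expressed with Accumulo's
visibility syntax (& for AND, | for OR, parentheses for grouping). The
function name and the example labels are hypothetical, and this is not taken
from CRUCIBLE itself.

```python
def combine_visibility(*labels: str) -> str:
    """Return a label that is satisfied only when every input label is."""
    unique = []
    for label in labels:
        # An empty label means "visible to everyone", so it adds no restriction.
        if label and label not in unique:
            unique.append(label)
    if not unique:
        return ""
    if len(unique) == 1:
        return unique[0]
    # Parenthesise compound expressions so & and | are never mixed at one level.
    wrapped = [f"({l})" if ("|" in l or "&" in l) else l for l in unique]
    return "&".join(wrapped)


# Hypothetical labels: one input needs "secret", the other "admin" or "audit".
derived = combine_visibility("secret", "admin|audit")
```

Here `derived` holds a single conjunction of both input labels, so the
provenance of the output's restriction stays visible in the label itself.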

Some highlights of the paper include:
  - A domain-specific language (CRUCIBLE) and runtime models for on- and
off-line data analytics.
  - Detailed analysis of CRUCIBLE’s runtime performance in state-of-the-art
environments.
  - Development and detailed analysis of a set of runtime models for new
environments.
  - Performance comparison with native implementations and discussion of
optimisation steps.
  - Formulation of a primitive in the DSL that permits an analytic to be
run over multiple data sources.

The paper, Towards Unified Secure On- and Off-line Analytics at Scale, is
available free of charge from Elsevier:

http://www.sciencedirect.com/science/article/pii/S0167819114000842


I am one of the lead authors of the work, and would be more than happy to
discuss any aspects which catch your attention!

Peter

--
Peter Coetzee
Performance Computing and Visualisation PhD Candidate
Department of Computer Science
University of Warwick

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Peter Coetzee <pe...@coetzee.org>.

That's correct, yes (and hopefully the body text agrees with your reading
of the data?). The 10x slowdown is one of the reasons I suggest that
complex networks of Iterators are probably not a sound approach to
implementing CRUCIBLE-style analytics, although it can be made to work.
This motivated the implementation of the Spark runtime for CRUCIBLE, which
gives a couple of orders of magnitude better performance (the gap between
Accumulo v1 and Spark-Accumulo is around 480x, I believe).
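
For anyone reading factors like these off an execution-time chart, the
arithmetic is just a ratio of two timings. The sketch below is a minimal
helper for that; the function names are hypothetical, and the timings in the
usage lines are normalised placeholders chosen only to reproduce the factors
mentioned in this thread, not measurements from the paper.

```python
def slowdown(candidate_seconds: float, baseline_seconds: float) -> float:
    """How many times longer the candidate takes than the baseline."""
    if candidate_seconds <= 0 or baseline_seconds <= 0:
        raise ValueError("execution times must be positive")
    return candidate_seconds / baseline_seconds


def describe(candidate_seconds: float, baseline_seconds: float) -> str:
    """Phrase the ratio the way it is usually quoted in prose."""
    factor = slowdown(candidate_seconds, baseline_seconds)
    if factor >= 1.0:
        return f"{factor:.1f}x slower"
    return f"{1.0 / factor:.1f}x faster"


# Placeholder timings (baseline normalised to 1.0), NOT data from the paper.
iterators_vs_native = describe(10.0, 1.0)
spark_hdfs_vs_native = describe(1.0, 1.2)
```

Because execution time is on the Y axis, a larger value always means slower,
which is why the ratio is taken as candidate over baseline.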

There's something to be said for using the right tool for the job :)

All the best,
Peter


On 21 October 2014 15:28, Jeremy Kepner <ke...@ll.mit.edu> wrote:

> Hi Peter,
>   So the Y axis is labeled "Execution Time (s)" which would imply
> "Accumulo v2" using CRUCIBLE is 10 times slower than the "Native Accumulo"
> which doesn't use CRUCIBLE.  Is this correct?
>
> Regards.  -Jeremy

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Jeremy Kepner <ke...@ll.mit.edu>.

Hi Peter,
  So the Y axis is labeled "Execution Time (s)", which would imply that
"Accumulo v2" using CRUCIBLE is 10 times slower than "Native Accumulo",
which doesn't use CRUCIBLE.  Is this correct?

Regards.  -Jeremy

On Tue, Oct 21, 2014 at 03:28:54PM +0100, Peter Coetzee wrote:
> Accumulo v1, Accumulo v2, Spark-Accumulo, Spark-HDFS were implemented with
> CRUCIBLE, each being the same CRUCIBLE code, but executed against a
> different runtime configuration.
> 
> Accumulo v1 represents the pre-optimisation Accumulo Iterator based runtime
> Accumulo v2 represents the post-optimisation Accumulo Iterator based runtime
> Spark-Accumulo makes use of a Standalone Spark cluster, backed by Accumulo
> on HDFS (uses Spark's hadoopRDD with AccumuloInputFormat)
> Spark-HDFS uses the same Standalone Spark cluster, but is operating over
> files in HDFS directly

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Peter Coetzee <pe...@coetzee.org>.

Accumulo v1, Accumulo v2, Spark-Accumulo and Spark-HDFS were implemented with
CRUCIBLE, each being the same CRUCIBLE code, but executed against a
different runtime configuration.

  - Accumulo v1 represents the pre-optimisation Accumulo Iterator-based runtime.
  - Accumulo v2 represents the post-optimisation Accumulo Iterator-based runtime.
  - Spark-Accumulo makes use of a Standalone Spark cluster, backed by Accumulo
on HDFS (uses Spark's hadoopRDD with AccumuloInputFormat).
  - Spark-HDFS uses the same Standalone Spark cluster, but operates over
files in HDFS directly.



On 21 October 2014 15:07, Jeremy Kepner <ke...@ll.mit.edu> wrote:

> So of the six lines on the graph (Accumulo v1, Accumulo v2,
> Spark-Accumulo, Spark-HDFS, Native Accumulo, Native Spark),
> which were implemented with CRUCIBLE?

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Jeremy Kepner <ke...@ll.mit.edu>.

So of the six lines on the graph (Accumulo v1, Accumulo v2, Spark-Accumulo, Spark-HDFS, Native Accumulo, Native Spark),
which were implemented with CRUCIBLE?

On Tue, Oct 21, 2014 at 09:23:12AM +0100, Peter Coetzee wrote:
> Hi Jeremy,
> 
> If you're viewing the PDF form of the paper (Elsevier's HTML rendering has
> some odd artefacts), there's a short explanation of the figure appearing
> just after it:
> 
> > At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen to
> > outperform a native implementation making use of the more expressive Spark
> > builtins. Performing bulk analysis through the use of Accumulo Iterators
> > with CRUCIBLE was approximately 10x slower than the equivalent native
> > implementation; with Spark on HDFS files, this is now almost 1.2x faster
> > than the native implementation used.
> 
> 
> The "native" implementations (i.e. hand-written by an engineer using the
> tools offered by the standard platform) are shown as dashed series on the
> chart, while the other series represent a single CRUCIBLE topology,
> compiled once and executed on a collection of runtimes (each of which is
> discussed in more detail earlier in the paper).
> 
> By way of clarification: are you curious as to what the figure shows, or
> why those results are demonstrated?
> 
> Hope this helps somewhat.
> 
> Best regards,
> Peter

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Peter Coetzee <pe...@coetzee.org>.

Hi Jeremy,

If you're viewing the PDF form of the paper (Elsevier's HTML rendering has
some odd artefacts), there's a short explanation of the figure appearing
just after it:

> At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen to
> outperform a native implementation making use of the more expressive Spark
> builtins. Performing bulk analysis through the use of Accumulo Iterators
> with CRUCIBLE was approximately 10x slower than the equivalent native
> implementation; with Spark on HDFS files, this is now almost 1.2x faster
> than the native implementation used.


The "native" implementations (i.e. hand-written by an engineer using the
tools offered by the standard platform) are shown as dashed series on the
chart, while the other series represent a single CRUCIBLE topology,
compiled once and executed on a collection of runtimes (each of which is
discussed in more detail earlier in the paper).
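
To make the "one topology, several runtimes" idea concrete, here is a minimal
sketch in plain Python. It is not CRUCIBLE's DSL or API: the step functions,
the TOPOLOGY list and the two runner functions are all hypothetical names
invented for illustration. It only shows the shape of the approach, in which
the analytic is defined once and the choice between a batch and a streaming
runtime is left to whoever deploys it.

```python
from typing import Callable, Iterable, Iterator, List

# A "topology" here is just an ordered list of per-record steps (hypothetical model).
Step = Callable[[str], Iterable[str]]


def split_words(line: str) -> Iterable[str]:
    """Emit one lower-cased record per whitespace-separated word."""
    return line.lower().split()


def keep_long(word: str) -> Iterable[str]:
    """Emit the word only if it has more than three characters."""
    return [word] if len(word) > 3 else []


TOPOLOGY: List[Step] = [split_words, keep_long]


def apply_topology(record: str, steps: List[Step]) -> List[str]:
    """Push one record through every step; each step may emit 0..n records."""
    current = [record]
    for step in steps:
        current = [out for item in current for out in step(item)]
    return current


def run_batch(records: List[str], steps: List[Step]) -> List[str]:
    """Batch runtime: the whole data set is available up front."""
    results: List[str] = []
    for record in records:
        results.extend(apply_topology(record, steps))
    return results


def run_streaming(source: Iterator[str], steps: List[Step]) -> Iterator[str]:
    """Streaming runtime: records arrive one at a time, results are emitted lazily."""
    for record in source:
        yield from apply_topology(record, steps)


data = ["on and off line", "Accumulo Iterators"]
batch_result = run_batch(data, TOPOLOGY)
stream_result = list(run_streaming(iter(data), TOPOLOGY))
```

Both runners call the same apply_topology function, so any difference in
execution time between them would come from the runtime rather than from the
analytic's own code.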

By way of clarification: are you curious as to what the figure shows, or
why those results are demonstrated?

Hope this helps somewhat.

Best regards,
Peter



On 21 October 2014 00:19, Jeremy Kepner <ke...@ll.mit.edu> wrote:

> Hi Peter,
>   Thanks.  Can you clarify Figure 12 in the paper?  I think I understand
> what it is saying, but I am not 100% sure.
>
> Regards.  -Jeremy

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Jeremy Kepner <ke...@ll.mit.edu>.

Hi Peter,
  Thanks.  Can you clarify Figure 12 in the paper?  I think I understand
what it is saying, but I am not 100% sure.

Regards.  -Jeremy


Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Peter Coetzee <pe...@coetzee.org>.

David, All,

I've managed to get together a version of the code I can publish. There's a
basic quickstart guide to getting it installed and running in Eclipse at
http://www2.warwick.ac.uk/fac/sci/dcs/people/research/csrmab/crucible/quickstart,
as well as links to the source and binaries. I've not had much opportunity
to test this outside of my own environment, so all the usual "untested
research code" caveats apply. Hopefully it'll be a useful illustration of
the approach and concepts put forward in the paper, at least!

All the best,
Peter


On 21 October 2014 15:54, David Medinets <da...@gmail.com> wrote:

> The picture of the DSL Eclipse Integration looks nice. Looking forward
> to seeing the Iterator stack and cell-level security handling.

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by David Medinets <da...@gmail.com>.

The picture of the DSL Eclipse Integration looks nice. Looking forward
to seeing the Iterator stack and cell-level security handling.

On Tue, Oct 21, 2014 at 10:37 AM, Peter Coetzee <pe...@coetzee.org> wrote:
> There's the skeleton of a website at http://go.warwick.ac.uk/crucible,
> although at present it's a thin pointer to the journal paper.
>
> I'm currently working on getting the code up to release standard and pushed
> through my employer's release process before I can put it up, and hopefully
> I'll get some more complete documentation and examples together to go with
> that. I'm sure you're familiar with the juggling act of splitting time
> between getting research to a technical level that's increasingly useful vs
> documenting and thus making it usable for others!
>
>
> Peter.
>
> On 21 October 2014 15:27, David Medinets <da...@gmail.com> wrote:
>>
>> Thanks for letting us know about this research. Is there a website
>> exploring the DSL?

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by Peter Coetzee <pe...@coetzee.org>.

There's the skeleton of a website at http://go.warwick.ac.uk/crucible,
although at present it's a thin pointer to the journal paper.

I'm currently working on getting the code up to release standard and pushed
through my employer's release process before I can put it up, and hopefully
I'll get some more complete documentation and examples together to go with
that. I'm sure you're familiar with the juggling act of splitting time
between getting research to a technical level that's increasingly useful vs
documenting and thus making it usable for others!


Peter.

On 21 October 2014 15:27, David Medinets <da...@gmail.com> wrote:

> Thanks for letting us know about this research. Is there a website
> exploring the DSL?

Re: New research using Accumulo: Unified Secure On-/Off-line Analytics

Posted by David Medinets <da...@gmail.com>.

Thanks for letting us know about this research. Is there a website
exploring the DSL?
