Posted to user@accumulo.apache.org by "roman.drapeko@baesystems.com" <ro...@baesystems.com> on 2015/06/09 17:28:53 UTC

micro compaction

Hi guys,

While doing pre-analytics we generate hundreds of millions of mutations that result in 1-100 megabytes of useful data after major compaction. We ingest into Accumulo from the mapper of a MapReduce job. We found that performance degrades significantly as the number of mutations increases.

The obvious improvement is to do some calculations in-memory before sending mutations to Accumulo.
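
To make the idea concrete, here is a minimal sketch, written in Python for brevity and not against the Accumulo client API. The names `PreAggregatingWriter`, `sink`, and `max_keys` are hypothetical, and the sketch assumes the aggregation is a plain sum per key. It only illustrates buffering counts in memory and forwarding one total per key instead of one update per input record.

```python
from collections import defaultdict


class PreAggregatingWriter:
    """Sums values per key in memory and forwards only the totals.

    `sink` is a hypothetical callable taking (key, value); it stands in
    for whatever actually writes to the database.
    """

    def __init__(self, sink, max_keys=100_000):
        self.sink = sink
        self.max_keys = max_keys      # flush once this many distinct keys are buffered
        self.buffer = defaultdict(int)
        self.received = 0             # updates handed to add()
        self.sent = 0                 # updates forwarded to the sink

    def add(self, key, value):
        self.received += 1
        self.buffer[key] += value
        if len(self.buffer) >= self.max_keys:
            self.flush()

    def flush(self):
        for key, total in self.buffer.items():
            self.sink(key, total)
            self.sent += 1
        self.buffer.clear()


# 10,000 increments spread over 10 keys collapse into 10 forwarded updates.
out = []
writer = PreAggregatingWriter(lambda k, v: out.append((k, v)), max_keys=1000)
for i in range(10_000):
    writer.add(("row%d" % (i % 10), "count"), 1)
writer.flush()
```

If the buffer fills and flushes mid-stream, the same key can be forwarded more than once with partial totals, so something on the server side (for example a summing combiner) would still be needed to merge them.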

Of course, at the same time we are looking for a solution to minimize development effort.

I guess I am asking about micro compaction/ingest-time iterators on the client side (before data is sent to Accumulo).

As far as I understand, Accumulo does not support them; is that correct? And if so, are there any plans to support this functionality in the future?

Thanks
Roman


Please consider the environment before printing this email. This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately. Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies under the control of BAE Systems Applied Intelligence Limited, details of which can be found at http://www.baesystems.com/Businesses/index.htm.

Re: micro compaction

Posted by Christopher <ct...@apache.org>.

The starting point would be to look at where mutations are added to the writer.

There'd be a couple of tricky parts. For one, mutations aren't sorted
on the client side, which is a prerequisite for iterators. Another
pitfall is getting the API right, so that it provides users enough
control over how much data is buffered for processing before being
sent along to the writer (which queues it for an RPC thread). Another
problem is that we pre-serialize Mutations as they are created, to
make the RPC faster. A client-side iterator from a bucket of mutations
would have to de-serialize these (with performance considerations).
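
A rough sketch of the first two points, in Python rather than the Accumulo client API: buffer unsorted entries up to a caller-chosen limit, sort the buffer, and only then reduce each run of equal keys, which is the ordering an iterator would expect. `SortingBuffer`, `combine_sorted`, `reducer`, `sink`, and `max_entries` are hypothetical names, and the sketch works on plain (key, value) tuples, so it sidesteps the pre-serialization issue entirely.

```python
from itertools import groupby
from operator import itemgetter


def combine_sorted(entries, reducer):
    """Sort buffered (key, value) pairs, then reduce each run of equal keys."""
    ordered = sorted(entries, key=itemgetter(0))
    for key, run in groupby(ordered, key=itemgetter(0)):
        # The reducer consumes the run before the next group is requested.
        yield key, reducer(value for _, value in run)


class SortingBuffer:
    """Buffers unsorted entries and combines them on flush."""

    def __init__(self, reducer, sink, max_entries):
        self.reducer = reducer          # e.g. sum, min, max
        self.sink = sink                # hypothetical stand-in for the real writer
        self.max_entries = max_entries  # how much to buffer before combining
        self.entries = []

    def add(self, key, value):
        self.entries.append((key, value))
        if len(self.entries) >= self.max_entries:
            self.flush()

    def flush(self):
        for key, value in combine_sorted(self.entries, self.reducer):
            self.sink(key, value)
        self.entries = []


# Keys are (row, family, qualifier) tuples; arrival order is unsorted.
sent = []
buf = SortingBuffer(sum, lambda k, v: sent.append((k, v)), max_entries=4)
for row in ["b", "a", "b", "a", "a", "c"]:
    buf.add((row, "stats", "count"), 1)
buf.flush()
```

Here `max_entries` is the knob for how much data is held back: a larger buffer combines more but delays the hand-off to the writer and uses more client memory.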

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com> wrote:
> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
> -Russ
>
> On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com
> <ro...@baesystems.com> wrote:
>>
>> Aggregated output is tiny, so if I do the same calculations in memory
>> (instead of sending mutations to Accumulo), I can reduce the overall number
>> of mutations by 1000x or so.
>>
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com]
>> Sent: 09 June 2015 16:54
>> To: user@accumulo.apache.org
>> Subject: Re: micro compaction
>>
>> Well, you win the prize for new terminology. I haven't ever heard the term
>> "micro compaction" before.
>>
>> Can you clarify, though? You say hundreds of millions of mutations
>> result in megabytes of data. Is that an increase or a decrease in size?
>> Comparing apples to oranges :)

Re: micro compaction

Posted by Chris Bennight <ch...@slowcar.net>.

We have this same use case in geowave for statistics/aggregates (with the
standard commutative/associative requirement). We use it to generate and
store statistics (min, max, avg, bounding box, cardinality, distributions, etc.).

We ended up solving it with another layer of abstraction: a mergeable
interface that actually defines the operations, and a thin shim iterator
that calls the merge method. It is actually a set of iterators, because we
have different logic for scan time vs. compaction time: at compaction time we
keep statistics per unique visibility value, and at scan time we aggregate
those results. It's pretty easy to (optionally) run the same code client side
in local Java, Storm, etc.; we do this for ingest operations.

The mergeable interface is about as basic as you can get:
https://github.com/ngageoint/geowave/blob/master/core/index/src/main/java/mil/nga/giat/geowave/core/index/Mergeable.java

Here's a basic implementation that keeps track of the envelope for a set of
geospatial features (basically a bounding box):
https://github.com/ngageoint/geowave/blob/master/core/geotime/src/main/java/mil/nga/giat/geowave/core/geotime/store/statistics/BoundingBoxDataStatistics.java

Here's the iterator shim for compaction time
https://github.com/ngageoint/geowave/blob/master/extensions/datastores/accumulo/src/main/java/mil/nga/giat/geowave/datastore/accumulo/MergingCombiner.java

Here's the iterator shim for scan time
https://github.com/ngageoint/geowave/blob/master/extensions/datastores/accumulo/src/main/java/mil/nga/giat/geowave/datastore/accumulo/MergingVisibilityCombiner.java
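
For readers who don't want to follow the links, the pattern can be sketched roughly as below. This is an illustrative Python sketch, not the geowave code: `Mergeable`, `BoundingBox`, and `merge_all` are names made up for the example. The point is that once the operation lives behind a single merge method, the wrapper that applies it (iterator, mapper, local loop) needs almost no logic of its own.

```python
from abc import ABC, abstractmethod


class Mergeable(ABC):
    """Anything that can absorb another instance of the same statistic."""

    @abstractmethod
    def merge(self, other):
        """Fold `other` into this object in place."""


class BoundingBox(Mergeable):
    """Tracks the envelope (min/max in x and y) of a set of points."""

    def __init__(self, min_x, min_y, max_x, max_y):
        self.min_x, self.min_y = min_x, min_y
        self.max_x, self.max_y = max_x, max_y

    @classmethod
    def of_point(cls, x, y):
        return cls(x, y, x, y)

    def merge(self, other):
        self.min_x = min(self.min_x, other.min_x)
        self.min_y = min(self.min_y, other.min_y)
        self.max_x = max(self.max_x, other.max_x)
        self.max_y = max(self.max_y, other.max_y)

    def as_tuple(self):
        return (self.min_x, self.min_y, self.max_x, self.max_y)


def merge_all(items):
    """Thin shim: merge everything into the first item and return it.

    Note that the first item is modified in place; input must be non-empty.
    """
    items = iter(items)
    result = next(items)
    for item in items:
        result.merge(item)
    return result


points = [(1, 5), (3, 2), (-2, 4)]
forward = merge_all(BoundingBox.of_point(x, y) for x, y in points)
backward = merge_all(BoundingBox.of_point(x, y) for x, y in reversed(points))
```

Because min and max are commutative and associative, `forward` and `backward` describe the same envelope, which is what makes it safe to run the merge on the client, at compaction time, or at scan time in any order.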


On Tue, Jun 9, 2015 at 3:53 PM, Russ Weeks <rw...@newbrightidea.com> wrote:

> For consistency and ease of implementation. Say I've written a stack of
> combiners that do statistical aggregation, sampling etc. on my table.
> Rather than port that logic to a Storm topology or to the DStream API I'd
> just like to turn that stack on in my BatchWriter.
>
> On Tue, Jun 9, 2015 at 12:47 PM David Medinets <da...@gmail.com>
> wrote:
>
>> Consider using Storm, Pig, Spark, or your own framework to handle the
>> in-memory aggregation before giving the data to the BatchWriter. Why would
>> any part of Accumulo code be responsible for this kind of
>> application-specific data handling?
>>
>> On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com <
>> roman.drapeko@baesystems.com> wrote:
>>
>>> Just to clarify the origin of my question.
>>>
>>> I had to do some performance tests to compare different storage types of
>>> “raw” data against each other.
>>>
>>> Hopefully, the picture below is visible in the mailing list. If not, I will
>>> put it somewhere else.
>>>
>>> 6 million “original” records, 1.3GB of data, 233 bytes per record
>>>
>>> Each original record is 40 tab-delimited fields, of which 19 on average
>>> are not null
>>>
>>> BatchWriter, single Java program
>>>
>>> The first three bars represent a single “heavy” mutation that inserts the
>>> whole tabular line / serialized object.
>>>
>>> Bars 4, 5, 6, 7: composite mutation (all qualifiers for the same rowid in
>>> one mutation)
>>>
>>> Bars 8, 9, 10, 11: individual mutations (all qualifiers for the same rowid
>>> in separate mutations), ~19 mutations per original record
>>>
>>> On average, single “heavy” mutations are 7-10 times faster than anything
>>> else; composite mutations are 10%-35% faster than individual ones.
>>>
>>> I am not an expert on how Accumulo is implemented internally; however, it
>>> looks like a composite mutation is treated more or less the same way as a
>>> set of individual mutations. Probably the largest overhead is added by
>>> the WAL.
>>>
>>> Data utilization before and after manual compaction of the test table and
>>> all system tables:
>>>
>>> It’s not clear why “accumulo du” shows half as much data used as
>>> “hdfs du”.
>>>
>>> All these tests made us think that we can improve performance by doing
>>> some calculations in memory (our use case fits very well) and reducing the
>>> number of mutations. Now I am trying to understand whether there is a
>>> relatively easy way to do this with Accumulo or whether it’s time to look
>>> closer into something like Spark.
>>>
>>> Thanks
>>>
>>> Roman
>>>
>>> *From:* Adam Fuchs [mailto:afuchs@apache.org]
>>> *Sent:* 09 June 2015 19:08
>>>
>>> *To:* user@accumulo.apache.org
>>> *Subject:* Re: micro compaction
>>>
>>>
>>>
>>> I think this might be the same concept as in-mapper combining, but
>>> applied to data being sent to a BatchWriter rather than an OutputCollector.
>>> See [1], section 3.1.1. A similar performance analysis and probably a lot
>>> of the same code should apply here.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Adam
>>>
>>>
>>>
>>> [1]
>>> http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

Re: micro compaction

Posted by Josh Elser <jo...@gmail.com>.

Good point! Another reason why we need to start writing this stuff down 
now :)

John Vines wrote:
> Don't forget that the client may not have the same iterators in memory
> as the server JVM so that would have to be worked around.
>
> On Wed, Jun 10, 2015 at 12:28 PM Josh Elser <josh.elser@gmail.com> wrote:
>
>     I think re-using Iterators in the client-write path makes sense
>     architecturally and is a logical progression for the reasons pointed out
>     by Roman and Russ.
>
>     The big concern, as Keith pointed out, is that it's hard to directly
>     apply iterators on the client-write side because we're not dealing in
>     sorted key-values at this point. I think there could be ways to work
>     around this.
>
>     I'd say if we have people who are interested in pursuing this, let's
>     start a new discussion on dev@ where we can start laying some groundwork
>     for the scope and implementation of what this solution would look like.
>
>     roman.drapeko@baesystems.com wrote:
>      > My view is that the introduction of ingest-time iterators would be
>      > quite a useful feature. Anyway. :)
>      >
>      > Also, could anyone explain exactly why a composite mutation performs
>      > pretty much the same as a set of individual mutations?
>      >
>      > One large composite mutation with 19 qualifiers inside is just 10-30%
>      > faster than 19 individual mutations.

Re: micro compaction

Posted by John Vines <vi...@apache.org>.

Don't forget that the client may not have the same iterators in memory as
the server JVM so that would have to be worked around.

On Wed, Jun 10, 2015 at 12:28 PM Josh Elser <jo...@gmail.com> wrote:

> I think re-using Iterators in the client-write path makes sense
> architecturally and is a logical progression for the reasons pointed out
> by Roman and Russ.
>
> The big concern, as Keith pointed out, is that it's hard to directly apply
> iterators on the client-write side because we're not dealing in sorted
> key-values at this point. I think there could be ways to work around this.
>
> I'd say if we have people who are interested in pursuing this, let's
> start a new discussion on dev@ where we can start laying some groundwork
> for the scope and implementation of what this solution would look like.
> >         correct?
> >          > And if so, are there any plans to support this functionality
> >         in the future?
> >          >
> >          > Thanks
> >          >
> >          > Roman
> >          >
>

Re: micro compaction

Posted by Josh Elser <jo...@gmail.com>.

I think re-using Iterators in the client-write path makes sense 
architecturally and is a logical progression for the reasons pointed out 
by Roman and Russ.

The big concern, as Keith pointed out, is that it's hard to directly apply 
iterators on the client-write side because we're not dealing with sorted 
key-values at this point. I think there could be ways to work around this.

I'd say if we have people who are interested in pursuing this, let's 
start a new discussion on dev@ where we can start laying some groundwork 
for the scope and implementation of what this solution would look like.
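
As a rough illustration of one possible workaround for the sorting concern,
here is a minimal sketch that buffers updates in a sorted map on the client
and combines values for the same key before anything is written. It is not
Accumulo code: SortedCombineBuffer, its string keys and the summing rule are
all assumptions made up for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Accumulo: keeps pending updates sorted by key.
class SortedCombineBuffer {
    // TreeMap keeps keys in sorted order, which is what iterators expect.
    private final TreeMap<String, Long> buffer = new TreeMap<>();

    // Combine on insert: values for the same key are summed (an assumed combiner).
    void add(String key, long value) {
        buffer.merge(key, value, Long::sum);
    }

    // Return the combined entries in key order and empty the buffer.
    List<Map.Entry<String, Long>> drain() {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        buffer.clear();
        return out;
    }

    public static void main(String[] args) {
        SortedCombineBuffer b = new SortedCombineBuffer();
        b.add("row2 count", 1);
        b.add("row1 count", 5);
        b.add("row2 count", 2);
        // Entries come back sorted by key, with duplicates already summed.
        System.out.println(b.drain());
    }
}
```

How much to buffer before draining is exactly the API question raised
earlier in the thread; this sketch leaves that decision to the caller.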

roman.drapeko@baesystems.com wrote:
> My view is that introduction of ingest-time iterators would be quite a
> useful feature. Anyway. :)
>
> Also, could anyone explain exactly why a composite mutation performs pretty
> much the same way as a set of individual mutations?
>
> One large composite mutation with 19 qualifiers inside is just 10-30%
> faster than 19 individual mutations.
>
> *From:*Russ Weeks [mailto:rweeks@newbrightidea.com]
> *Sent:* 09 June 2015 20:54
> *To:* accumulo-user
> *Subject:* Re: micro compaction
>
> For consistency and ease of implementation. Say I've written a stack of
> combiners that do statistical aggregation, sampling etc. on my table.
> Rather than port that logic to a Storm topology or to the DStream API
> I'd just like to turn that stack on in my BatchWriter.
>
> On Tue, Jun 9, 2015 at 12:47 PM David Medinets <david.medinets@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Consider using Storm, Pig, Spark, or your own framework to handle
>     the in-memory aggregation before giving the data to the BatchWriter.
>     Why would any part of Accumulo code be responsible for this kind of
>     application-specific data handling?
>
>     On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com
>     <ma...@baesystems.com> <roman.drapeko@baesystems.com
>     <ma...@baesystems.com>> wrote:
>
>     Just to clarify the origin of my question.
>
>     I had to do some performance tests to compare different storage
>     types of “raw” data against each other.
>
>     Hopefully, picture below is visible in the mailing list. If not, I
>     will put it somewhere else.
>
>     6 million “original” records, 1.3GB data, 233 bytes per record
>
>     Each original record is 40 fields delimited by tab, on average 19 –
>     not null
>
>     Batchwriter, single java program
>
>     First three bars represent single “heavy” mutation to insert the
>     whole tabular line / serialized object.
>
>     4,5,6,7 bars – composite mutation (all qualifiers for the same rowid
>     in one mutation)
>
>     8, 9, 10, 11 – individual mutations (all qualifiers for the same
>     rowid in separate mutations) - ~19 mutations per original record
>
>     On average, single “heavy” mutations are 7-10 times faster than
>     anything else, composite are 10%-35% faster than individual.
>
>     I am not an expert how Accumulo is implemented internally, however
>     it looks like composite mutation is treated more or less in the same
>     way as a set of individual mutations. Probably, largest overhead is
>     added by WAL.
>
>     Data utilization before and after manual compaction of test table
>     and all system tables:
>
>     It’s not clear why “accumulo du” shows twice less data used
>     comparing to “hdfs du”.
>
>     All these tests made us think that we can improve performance by
>     doing some calculations in-memory (and our use-case fits very well)
>     and reducing number of mutations. Now I am trying to understand
>     whether there is a relatively easy way to do this with Accumulo or
>     whether it’s time to look closer into something like Spark.
>
>     Thanks
>
>     Roman
>
>     *From:*Adam Fuchs [mailto:afuchs@apache.org <ma...@apache.org>]
>     *Sent:* 09 June 2015 19:08
>
>
>     *To:* user@accumulo.apache.org <ma...@accumulo.apache.org>
>     *Subject:* Re: micro compaction
>
>     I think this might be the same concept as in-mapper combining, but
>     applied to data being sent to a BatchWriter rather than an
>     OutputCollector. See [1], section 3.1.1. A similar performance
>     analysis and probably a lot of the same code should apply here.
>
>     Cheers,
>
>     Adam
>
>     [1]
>     http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>
>     On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rweeks@newbrightidea.com
>     <ma...@newbrightidea.com>> wrote:
>
>     Having a combiner stack (more generally an iterator stack) run on
>     the client-side seems to be the second most popular request on this
>     list. The most popular being, "How do I write to Accumulo from
>     inside an iterator?"
>
>     Such a thing would be very useful for me, too. I have some cycles to
>     help out, if somebody can give me an idea of where to get started
>     and where the potential land-mines are.
>
>     -Russ
>
>     On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com
>     <ma...@baesystems.com> <roman.drapeko@baesystems.com
>     <ma...@baesystems.com>> wrote:
>
>         Aggregated output is tiny, so if I do same calculations in
>         memory (instead of sending mutations to Accumulo) , I can reduce
>         overall number of mutations by 1000x or so
>
>
>
>         -----Original Message-----
>         From: Josh Elser [mailto:josh.elser@gmail.com
>         <ma...@gmail.com>]
>         Sent: 09 June 2015 16:54
>         To: user@accumulo.apache.org <ma...@accumulo.apache.org>
>         Subject: Re: micro compaction
>
>         Well, you win the prize for new terminology. I haven't ever
>         heard the term "micro compaction" before.
>
>         Can you clarify though, you say hundreds of millions of
>         mutations that result in megabytes of data. Is that an increase
>         or decrease in size.
>         Comparing apples to oranges :)
>
>         roman.drapeko@baesystems.com
>         <ma...@baesystems.com> wrote:
>          > Hi guys,
>          >
>          > While doing pre-analytics we generate hundreds of millions of
>          > mutations that result in 1-100 megabytes of useful data after
>         major
>          > compaction. We ingest into Accumulo using MR from Mapper job. We
>          > identified that performance really degrades while increasing
>         a number of mutations.
>          >
>          > The obvious improvement is to do some calculations in-memory
>         before
>          > sending mutations to Accumulo.
>          >
>          > Of course, at the same time we are looking for a solution to
>         minimize
>          > development effort.
>          >
>          > I guess I am asking about micro compaction/ingest-time
>         iterators on
>          > the client side (before data is sent to Accumulo).
>          >
>          > To my understanding, Accumulo does not support them, is it
>         correct?
>          > And if so, are there any plans to support this functionality
>         in the future?
>          >
>          > Thanks
>          >
>          > Roman
>          >

RE: micro compaction

Posted by "roman.drapeko@baesystems.com" <ro...@baesystems.com>.

Thanks a lot, will give it a try!

From: Keith Turner [mailto:keith@deenlo.com]
Sent: 09 June 2015 22:28
To: user@accumulo.apache.org
Subject: Re: micro compaction

On Tue, Jun 9, 2015 at 5:10 PM, roman.drapeko@baesystems.com<ma...@baesystems.com> <ro...@baesystems.com>> wrote:
I am using:

Mutation m = new Mutation(rowId);
m.put(f1, q1, v1);
m.put(f2, q2, v2);
m.put(f3, q3, v3);

I guess it’s a native one? If not, what should I use?

The native map I am referring to is server-side C++ code that can optionally be used to hold recently written mutations.  It's generally faster than the Java code that does the same thing.  To use it, the following two conditions must be met.
  * The native library is built and present on all tservers.  There is a script to help build it: $ACCUMULO_HOME/bin/build_native_library.sh
  * Accumulo is configured to use it.  Set tserver.memory.maps.native.enabled to true in accumulo-site.xml on each node.

Thanks
Roman

From: Keith Turner [mailto:keith@deenlo.com<ma...@deenlo.com>]
Sent: 09 June 2015 22:04

To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: micro compaction

On Tue, Jun 9, 2015 at 4:06 PM, roman.drapeko@baesystems.com<ma...@baesystems.com> <ro...@baesystems.com>> wrote:
My view is that introduction of ingest-time iterators would be quite a useful feature. Anyway. ☺

Also, could anyone explain exactly why a composite mutation performs pretty much the same way as a set of individual mutations?

One large composite mutation with 19 qualifiers inside is just 10-30% faster than 19 individual mutations.

One difference is that the row has to be sent over RPC 19 times vs. once.  So the size of the row will impact this.
Are you using native maps?  The structure of the native map is Map<row, Map<col, val>>.  For a mutation with 19 cols, the row is looked up once to find the column map.   For the non-native map the structure is Map<Key, Value>.  Conceptually, for this you keep looking up the row (or do multiple compares of the row for each column in the mutation).
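
The difference between the two layouts can be sketched roughly as follows.
This is only a conceptual model in plain Java, not the actual tserver code:
MapLayouts, putNested and putFlat are hypothetical names, and the returned
counts simply tally how many times the row takes part in a map lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Conceptual model only; names and key encoding are assumptions for illustration.
class MapLayouts {
    // Nested layout (native-map style): row -> (column -> value).
    // The row is looked up once, then all columns go into its column map.
    static int putNested(TreeMap<String, TreeMap<String, String>> map,
                         String row, Map<String, String> cols) {
        TreeMap<String, String> colMap = map.computeIfAbsent(row, r -> new TreeMap<>());
        colMap.putAll(cols);
        return 1; // one row lookup for the whole mutation
    }

    // Flat layout (non-native style): a single sorted map keyed by row + column.
    // Every put compares a full key, so the row is re-examined per column.
    static int putFlat(TreeMap<String, String> map,
                       String row, Map<String, String> cols) {
        int rowLookups = 0;
        for (Map.Entry<String, String> e : cols.entrySet()) {
            map.put(row + '\u0000' + e.getKey(), e.getValue());
            rowLookups++;
        }
        return rowLookups;
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        for (int i = 0; i < 19; i++) {
            cols.put("q" + i, "v" + i);
        }
        System.out.println("nested row lookups: "
                + putNested(new TreeMap<>(), "row1", cols));
        System.out.println("flat row lookups: "
                + putFlat(new TreeMap<>(), "row1", cols));
    }
}
```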

From: Russ Weeks [mailto:rweeks@newbrightidea.com<ma...@newbrightidea.com>]
Sent: 09 June 2015 20:54
To: accumulo-user
Subject: Re: micro compaction

For consistency and ease of implementation. Say I've written a stack of combiners that do statistical aggregation, sampling etc. on my table. Rather than port that logic to a Storm topology or to the DStream API I'd just like to turn that stack on in my BatchWriter.

On Tue, Jun 9, 2015 at 12:47 PM David Medinets <da...@gmail.com>> wrote:
Consider using Storm, Pig, Spark, or your own framework to handle the in-memory aggregation before giving the data to the BatchWriter. Why would any part of Accumulo code be responsible for this kind of application-specific data handling?

On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com<ma...@baesystems.com> <ro...@baesystems.com>> wrote:
Just to clarify the origin of my question.

I had to do some performance tests to compare different storage types of “raw” data against each other.

Hopefully, picture below is visible in the mailing list. If not, I will put it somewhere else.

6 million “original” records, 1.3GB data, 233 bytes per record
Each original record is 40 fields delimited by tab, on average 19 – not null
Batchwriter, single java program

First three bars represent single “heavy” mutation to insert the whole tabular line / serialized object.
4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in one mutation)
8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in separate mutations) - ~19 mutations per original record

On average, single “heavy” mutations are 7-10 times faster than anything else, composite are 10%-35% faster than individual.

I am not an expert on how Accumulo is implemented internally; however, it looks like a composite mutation is treated more or less the same way as a set of individual mutations. Probably the largest overhead is added by the WAL.

[image001.png: chart of the ingest performance results described above; not preserved in this plain-text archive]

Data utilization before and after manual compaction of test table and all system tables:

[image002.png: data utilization before and after manual compaction; not preserved in this plain-text archive]

It’s not clear why “accumulo du” shows half as much data used compared to “hdfs du”.

All these tests made us think that we can improve performance by doing some calculations in-memory (and our use-case fits very well) and reducing number of mutations. Now I am trying to understand whether there is a relatively easy way to do this with Accumulo or whether it’s time to look closer into something like Spark.

Thanks
Roman

From: Adam Fuchs [mailto:afuchs@apache.org<ma...@apache.org>]
Sent: 09 June 2015 19:08

To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: micro compaction

I think this might be the same concept as in-mapper combining, but applied to data being sent to a BatchWriter rather than an OutputCollector. See [1], section 3.1.1. A similar performance analysis and probably a lot of the same code should apply here.

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
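
A minimal sketch of the in-mapper combining idea applied to a writer,
assuming the values can be merged by summing. CombiningWriter, its maxKeys
threshold and the BiConsumer sink are hypothetical names for this example;
in a real job the sink would presumably build a Mutation per key and hand it
to BatchWriter.addMutation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical client-side combiner; not an Accumulo class.
class CombiningWriter {
    private final Map<String, Long> pending = new HashMap<>();
    private final int maxKeys;                    // flush once this many distinct keys are buffered
    private final BiConsumer<String, Long> sink;  // stands in for the real writer

    CombiningWriter(int maxKeys, BiConsumer<String, Long> sink) {
        this.maxKeys = maxKeys;
        this.sink = sink;
    }

    // Merge the update into the in-memory buffer instead of writing it immediately.
    void add(String key, long delta) {
        pending.merge(key, delta, Long::sum);
        if (pending.size() >= maxKeys) {
            flush();
        }
    }

    // Emit one combined value per key, then start over with an empty buffer.
    void flush() {
        pending.forEach(sink);
        pending.clear();
    }

    public static void main(String[] args) {
        int[] writes = {0};
        CombiningWriter w = new CombiningWriter(1000, (k, v) -> writes[0]++);
        for (int i = 0; i < 1_000_000; i++) {
            w.add("key" + (i % 10), 1L);
        }
        w.flush(); // always flush at the end, e.g. in the mapper's cleanup step
        System.out.println("updates: 1000000, writes: " + writes[0]);
    }
}
```

The threshold bounds memory use; the fewer distinct keys there are relative
to the number of updates, the larger the reduction in writes.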

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com>> wrote:
Having a combiner stack (more generally an iterator stack) run on the client-side seems to be the second most popular request on this list. The most popular being, "How do I write to Accumulo from inside an iterator?"

Such a thing would be very useful for me, too. I have some cycles to help out, if somebody can give me an idea of where to get started and where the potential land-mines are.

-Russ

On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com<ma...@baesystems.com> <ro...@baesystems.com>> wrote:
Aggregated output is tiny, so if I do the same calculations in memory (instead of sending mutations to Accumulo), I can reduce the overall number of mutations by 1000x or so.

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com<ma...@gmail.com>]
Sent: 09 June 2015 16:54
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: micro compaction

Well, you win the prize for new terminology. I haven't ever heard the term "micro compaction" before.

Can you clarify though, you say hundreds of millions of mutations that result in megabytes of data. Is that an increase or decrease in size.
Comparing apples to oranges :)

roman.drapeko@baesystems.com<ma...@baesystems.com> wrote:
> Hi guys,
>
> While doing pre-analytics we generate hundreds of millions of
> mutations that result in 1-100 megabytes of useful data after major
> compaction. We ingest into Accumulo using MR from Mapper job. We
> identified that performance really degrades while increasing a number of mutations.
>
> The obvious improvement is to do some calculations in-memory before
> sending mutations to Accumulo.
>
> Of course, at the same time we are looking for a solution to minimize
> development effort.
>
> I guess I am asking about micro compaction/ingest-time iterators on
> the client side (before data is sent to Accumulo).
>
> To my understanding, Accumulo does not support them, is it correct?
> And if so, are there any plans to support this functionality in the future?
>
> Thanks
>
> Roman
>

Re: micro compaction

Posted by Keith Turner <ke...@deenlo.com>.

On Tue, Jun 9, 2015 at 5:10 PM, roman.drapeko@baesystems.com <
roman.drapeko@baesystems.com> wrote:

>  I am using:
>
>
>
> Mutation m = new Mutation(rowId);
>
> m.put(f1, q1, v1);
>
> m.put(f2, q2, v2);
>
> m.put(f3, q3, v3);
>
>
>
> I guess it’s a native one? If not, what should I use?
>

The native map I am referring to is server-side C++ code that can optionally
be used to hold recently written mutations.  It's generally faster than the
Java code that does the same thing.  To use it, the following two
conditions must be met.

  * The native library is built and present on all tservers.  There is a
script to help build it: $ACCUMULO_HOME/bin/build_native_library.sh
  * Accumulo is configured to use it.  Set
tserver.memory.maps.native.enabled to true in accumulo-site.xml on each
node.


>
>
> Thanks
>
> Roman
>
>
>
> *From:* Keith Turner [mailto:keith@deenlo.com]
> *Sent:* 09 June 2015 22:04
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: micro compaction
>
>
>
>
>
>
>
> On Tue, Jun 9, 2015 at 4:06 PM, roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
> My view is that introduction of ingest-time iterators would be quite a
> useful feature. Anyway. :)
>
>
>
> Also, could anyone explain exactly why a composite mutation performs pretty
> much the same way as a set of individual mutations?
>
>
>
> One large composite mutation with 19 qualifiers inside is just 10-30%
> faster than 19 individual mutations.
>
>
>
> One difference is that the row has to be sent over RPC 19 times vs. once.
> So the size of the row will impact this.
>
> Are you using native maps?  The structure of the native map is Map<row,
> Map<col, val>>.  For a mutation with 19 cols, the row is looked up once to
> find the column map.   For the non-native map the structure is Map<Key,
> Value>.  Conceptually, for this you keep looking up the row (or do multiple
> compares of the row for each column in the mutation).
>
>
>
>
>
>
>
> *From:* Russ Weeks [mailto:rweeks@newbrightidea.com]
> *Sent:* 09 June 2015 20:54
> *To:* accumulo-user
> *Subject:* Re: micro compaction
>
>
>
> For consistency and ease of implementation. Say I've written a stack of
> combiners that do statistical aggregation, sampling etc. on my table.
> Rather than port that logic to a Storm topology or to the DStream API I'd
> just like to turn that stack on in my BatchWriter.
>
>
>
> On Tue, Jun 9, 2015 at 12:47 PM David Medinets <da...@gmail.com>
> wrote:
>
>  Consider using Storm, Pig, Spark, or your own framework to handle the
> in-memory aggregation before giving the data to the BatchWriter. Why would
> any part of Accumulo code be responsible for this kind of
> application-specific data handling?
>
>
>
> On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
> Just to clarify the origin of my question.
>
>
>
> I had to do some performance tests to compare different storage types of
> “raw” data against each other.
>
>
>
> Hopefully, picture below is visible in the mailing list. If not, I will
> put it somewhere else.
>
>
>
> 6 million “original” records, 1.3GB data, 233 bytes per record
>
> Each original record is 40 fields delimited by tab, on average 19 – not
> null
>
> Batchwriter, single java program
>
>
>
> First three bars represent single “heavy” mutation to insert the whole
> tabular line / serialized object.
>
> 4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in
> one mutation)
>
> 8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in
> separate mutations) - ~19 mutations per original record
>
>
>
> On average, single “heavy” mutations are 7-10 times faster than anything
> else, composite are 10%-35% faster than individual.
>
>
>
> I am not an expert how Accumulo is implemented internally, however it
> looks like composite mutation is treated more or less in the same way as a
> set of individual mutations. Probably, largest overhead is added by WAL.
>
>
>
>
>
> Data utilization before and after manual compaction of test table and all
> system tables:
>
>
>
>
>
> It’s not clear why “accumulo du” shows twice less data used comparing to
> “hdfs du”.
>
>
>
> All these tests made us think that we can improve performance by doing
> some calculations in-memory (and our use-case fits very well) and reducing
> number of mutations. Now I am trying to understand whether there is a
> relatively easy way to do this with Accumulo or whether it’s time to look
> closer into something like Spark.
>
>
>
> Thanks
>
> Roman
>
>
>
>
>
>
>
>
>
> *From:* Adam Fuchs [mailto:afuchs@apache.org]
> *Sent:* 09 June 2015 19:08
>
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: micro compaction
>
>
>
> I think this might be the same concept as in-mapper combining, but applied
> to data being sent to a BatchWriter rather than an OutputCollector. See
> [1], section 3.1.1. A similar performance analysis and probably a lot of
> the same code should apply here.
>
>
>
> Cheers,
>
> Adam
>
>
>
> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>
>
>
> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com>
> wrote:
>
> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
>
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
>
>
> -Russ
>
>
>
> On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
> Aggregated output is tiny,  so if I do same calculations in memory
> (instead of sending mutations to Accumulo) , I can reduce overall number of
> mutations by 1000x or so
>
>
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: 09 June 2015 16:54
> To: user@accumulo.apache.org
> Subject: Re: micro compaction
>
> Well, you win the prize for new terminology. I haven't ever heard the term
> "micro compaction" before.
>
> Can you clarify though, you say hundreds of millions of mutations that
> result in megabytes of data. Is that an increase or decrease in size.
> Comparing apples to oranges :)
>
> roman.drapeko@baesystems.com wrote:
> > Hi guys,
> >
> > While doing pre-analytics we generate hundreds of millions of
> > mutations that result in 1-100 megabytes of useful data after major
> > compaction. We ingest into Accumulo using MR from Mapper job. We
> > identified that performance really degrades while increasing a number of
> mutations.
> >
> > The obvious improvement is to do some calculations in-memory before
> > sending mutations to Accumulo.
> >
> > Of course, at the same time we are looking for a solution to minimize
> > development effort.
> >
> > I guess I am asking about micro compaction/ingest-time iterators on
> > the client side (before data is sent to Accumulo).
> >
> > To my understanding, Accumulo does not support them, is it correct?
> > And if so, are there any plans to support this functionality in the
> future?
> >
> > Thanks
> >
> > Roman
> >

RE: micro compaction

Posted by "roman.drapeko@baesystems.com" <ro...@baesystems.com>.

I am using:

Mutation m = new Mutation(rowId);
m.put(f1, q1, v1);
m.put(f2, q2, v2);
m.put(f3, q3, v3);

I guess it’s a native one? If not, what should I use?

Thanks
Roman

From: Keith Turner [mailto:keith@deenlo.com]
Sent: 09 June 2015 22:04
To: user@accumulo.apache.org
Subject: Re: micro compaction

On Tue, Jun 9, 2015 at 4:06 PM, roman.drapeko@baesystems.com <ro...@baesystems.com> wrote:
My view is that the introduction of ingest-time iterators would be quite a useful feature. Anyway. ☺

Also, could anyone explain exactly why a composite mutation performs pretty much the same as a set of individual mutations?

One large composite mutation with 19 qualifiers inside is just 10-30% faster than 19 individual mutations.

One difference is that the row has to be sent over RPC 19 times vs. once, so the size of the row will impact this.
Are you using native maps? The structure of the native map is Map<row, Map<col, val>>. For a mutation with 19 cols, the row is looked up once to find the column map. For the non-native map the structure is Map<Key, Value>. Conceptually, for this you keep looking up the row (or do multiple comparisons of the row for each column in the mutation).

From: Russ Weeks [mailto:rweeks@newbrightidea.com]
Sent: 09 June 2015 20:54
To: accumulo-user
Subject: Re: micro compaction

For consistency and ease of implementation. Say I've written a stack of combiners that do statistical aggregation, sampling etc. on my table. Rather than port that logic to a Storm topology or to the DStream API I'd just like to turn that stack on in my BatchWriter.

On Tue, Jun 9, 2015 at 12:47 PM David Medinets <da...@gmail.com> wrote:
Consider using Storm, Pig, Spark, or your own framework to handle the in-memory aggregation before giving the data to the BatchWriter. Why would any part of Accumulo code be responsible for this kind of application-specific data handling?

On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com <ro...@baesystems.com> wrote:
Just to clarify the origin of my question.

I had to do some performance tests to compare different storage types of “raw” data against each other.

Hopefully, picture below is visible in the mailing list. If not, I will put it somewhere else.

6 million “original” records, 1.3GB data, 233 bytes per record
Each original record is 40 fields delimited by tab, on average 19 – not null
Batchwriter, single java program

First three bars represent single “heavy” mutation to insert the whole tabular line / serialized object.
4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in one mutation)
8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in separate mutations) - ~19 mutations per original record

On average, single “heavy” mutations are 7-10 times faster than anything else, composite are 10%-35% faster than individual.

I am not an expert how Accumulo is implemented internally, however it looks like composite mutation is treated more or less in the same way as a set of individual mutations. Probably, largest overhead is added by WAL.

[cid:image001.png@01D0A301.1722B630]

Data utilization before and after manual compaction of test table and all system tables:

[cid:image002.png@01D0A301.1722B630]

It’s not clear why “accumulo du” shows twice less data used comparing to “hdfs du”.

All these tests made us think that we can improve performance by doing some calculations in-memory (and our use-case fits very well) and reducing number of mutations. Now I am trying to understand whether there is a relatively easy way to do this with Accumulo or whether it’s time to look closer into something like Spark.

Thanks
Roman

From: Adam Fuchs [mailto:afuchs@apache.org]
Sent: 09 June 2015 19:08
To: user@accumulo.apache.org
Subject: Re: micro compaction

I think this might be the same concept as in-mapper combining, but applied to data being sent to a BatchWriter rather than an OutputCollector. See [1], section 3.1.1. A similar performance analysis and probably a lot of the same code should apply here.

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com> wrote:
Having a combiner stack (more generally an iterator stack) run on the client-side seems to be the second most popular request on this list. The most popular being, "How do I write to Accumulo from inside an iterator?"

Such a thing would be very useful for me, too. I have some cycles to help out, if somebody can give me an idea of where to get started and where the potential land-mines are.

-Russ

On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <ro...@baesystems.com> wrote:
Aggregated output is tiny,  so if I do same calculations in memory (instead of sending mutations to Accumulo) , I can reduce overall number of mutations by 1000x or so

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: 09 June 2015 16:54
To: user@accumulo.apache.org
Subject: Re: micro compaction

Well, you win the prize for new terminology. I haven't ever heard the term "micro compaction" before.

Can you clarify though, you say hundreds of millions of mutations that result in megabytes of data. Is that an increase or decrease in size.
Comparing apples to oranges :)

roman.drapeko@baesystems.com wrote:
> Hi guys,
>
> While doing pre-analytics we generate hundreds of millions of
> mutations that result in 1-100 megabytes of useful data after major
> compaction. We ingest into Accumulo using MR from Mapper job. We
> identified that performance really degrades while increasing a number of mutations.
>
> The obvious improvement is to do some calculations in-memory before
> sending mutations to Accumulo.
>
> Of course, at the same time we are looking for a solution to minimize
> development effort.
>
> I guess I am asking about micro compaction/ingest-time iterators on
> the client side (before data is sent to Accumulo).
>
> To my understanding, Accumulo does not support them, is it correct?
> And if so, are there any plans to support this functionality in the future?
>
> Thanks
>
> Roman
>

Re: micro compaction

Posted by Keith Turner <ke...@deenlo.com>.

On Tue, Jun 9, 2015 at 4:06 PM, roman.drapeko@baesystems.com <
roman.drapeko@baesystems.com> wrote:

> My view is that the introduction of ingest-time iterators would be quite a
> useful feature. Anyway. :)
>
>
>
> Also, could anyone explain exactly why a composite mutation performs pretty
> much the same as a set of individual mutations?
>
>
>
> One large composite mutation with 19 qualifiers inside is just 10-30%
> faster than 19 individual mutations.
>


One difference is that the row has to be sent over RPC 19 times vs. once, so the
size of the row will impact this.

Are you using native maps? The structure of the native map is Map<row,
Map<col, val>>. For a mutation with 19 cols, the row is looked up once to
find the column map. For the non-native map the structure is Map<Key,
Value>. Conceptually, for this you keep looking up the row (or do multiple
comparisons of the row for each column in the mutation).
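
To illustrate the difference between the two layouts, here is a minimal
sketch in plain Java. It is not Accumulo's actual in-memory map code: the
class RowLookupSketch, the FlatKey type and the counting comparators are all
hypothetical, and TreeMap merely stands in for both structures. It counts how
often the row is compared while the columns of one new row are inserted.

```java
import java.util.Comparator;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/** Illustration only: compares a nested Map<row, Map<col, val>> with a flat Map<Key, Value>. */
final class RowLookupSketch {

    /** Key of the flat layout: the row is repeated inside every key. */
    static final class FlatKey {
        final String row;
        final String col;

        FlatKey(String row, String col) {
            this.row = row;
            this.col = col;
        }
    }

    static String rowName(int r) {
        return String.format("row-%06d", r);
    }

    /** Row comparisons needed to add `cols` columns of one new row to the nested layout. */
    static long nestedRowCompares(int existingRows, int cols) {
        AtomicLong compares = new AtomicLong();
        Comparator<String> byRow = (a, b) -> {
            compares.incrementAndGet();
            return a.compareTo(b);
        };
        TreeMap<String, TreeMap<String, String>> map = new TreeMap<>(byRow);
        for (int r = 0; r < existingRows; r++) {
            map.computeIfAbsent(rowName(r), k -> new TreeMap<>()).put("c0", "v");
        }
        compares.set(0); // only count the work done for the new row
        // The row is located once; every column then goes into that row's column map.
        TreeMap<String, String> columns = map.computeIfAbsent("row-new", k -> new TreeMap<>());
        for (int c = 0; c < cols; c++) {
            columns.put("c" + c, "v");
        }
        return compares.get();
    }

    /** Row comparisons needed to add the same columns to the flat layout. */
    static long flatRowCompares(int existingRows, int cols) {
        AtomicLong compares = new AtomicLong();
        Comparator<FlatKey> byKey = (a, b) -> {
            compares.incrementAndGet(); // the row is compared on every key comparison
            int cmp = a.row.compareTo(b.row);
            return cmp != 0 ? cmp : a.col.compareTo(b.col);
        };
        TreeMap<FlatKey, String> map = new TreeMap<>(byKey);
        for (int r = 0; r < existingRows; r++) {
            map.put(new FlatKey(rowName(r), "c0"), "v");
        }
        compares.set(0); // only count the work done for the new row
        for (int c = 0; c < cols; c++) {
            map.put(new FlatKey("row-new", "c" + c), "v");
        }
        return compares.get();
    }

    public static void main(String[] args) {
        System.out.println("nested layout row compares: " + nestedRowCompares(1000, 19));
        System.out.println("flat layout row compares:   " + flatRowCompares(1000, 19));
    }
}
```

With many rows already in the map, the nested layout should report far fewer
row comparisons, because the row is found once and the remaining inserts only
compare columns.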


>
>
>
>
> *From:* Russ Weeks [mailto:rweeks@newbrightidea.com]
> *Sent:* 09 June 2015 20:54
> *To:* accumulo-user
> *Subject:* Re: micro compaction
>
>
>
> For consistency and ease of implementation. Say I've written a stack of
> combiners that do statistical aggregation, sampling etc. on my table.
> Rather than port that logic to a Storm topology or to the DStream API I'd
> just like to turn that stack on in my BatchWriter.
>
>
>
> On Tue, Jun 9, 2015 at 12:47 PM David Medinets <da...@gmail.com>
> wrote:
>
>  Consider using Storm, Pig, Spark, or your own framework to handle the
> in-memory aggregation before giving the data to the BatchWriter. Why would
> any part of Accumulo code be responsible for this kind of
> application-specific data handling?
>
>
>
> On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
> Just to clarify the origin of my question.
>
>
>
> I had to do some performance tests to compare different storage types of
> “raw” data against each other.
>
>
>
> Hopefully, picture below is visible in the mailing list. If not, I will
> put it somewhere else.
>
>
>
> 6 million “original” records, 1.3GB data, 233 bytes per record
>
> Each original record is 40 fields delimited by tab, on average 19 – not
> null
>
> Batchwriter, single java program
>
>
>
> First three bars represent single “heavy” mutation to insert the whole
> tabular line / serialized object.
>
> 4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in
> one mutation)
>
> 8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in
> separate mutations) - ~19 mutations per original record
>
>
>
> On average, single “heavy” mutations are 7-10 times faster than anything
> else, composite are 10%-35% faster than individual.
>
>
>
> I am not an expert how Accumulo is implemented internally, however it
> looks like composite mutation is treated more or less in the same way as a
> set of individual mutations. Probably, largest overhead is added by WAL.
>
>
>
>
>
> Data utilization before and after manual compaction of test table and all
> system tables:
>
>
>
>
>
> It’s not clear why “accumulo du” shows twice less data used comparing to
> “hdfs du”.
>
>
>
> All these tests made us think that we can improve performance by doing
> some calculations in-memory (and our use-case fits very well) and reducing
> number of mutations. Now I am trying to understand whether there is a
> relatively easy way to do this with Accumulo or whether it’s time to look
> closer into something like Spark.
>
>
>
> Thanks
>
> Roman
>
>
>
>
>
>
>
>
>
> *From:* Adam Fuchs [mailto:afuchs@apache.org]
> *Sent:* 09 June 2015 19:08
>
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: micro compaction
>
>
>
> I think this might be the same concept as in-mapper combining, but applied
> to data being sent to a BatchWriter rather than an OutputCollector. See
> [1], section 3.1.1. A similar performance analysis and probably a lot of
> the same code should apply here.
>
>
>
> Cheers,
>
> Adam
>
>
>
> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>
>
>
> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com>
> wrote:
>
> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
>
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
>
>
> -Russ
>
>
>
> On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
> Aggregated output is tiny,  so if I do same calculations in memory
> (instead of sending mutations to Accumulo) , I can reduce overall number of
> mutations by 1000x or so
>
>
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: 09 June 2015 16:54
> To: user@accumulo.apache.org
> Subject: Re: micro compaction
>
> Well, you win the prize for new terminology. I haven't ever heard the term
> "micro compaction" before.
>
> Can you clarify though, you say hundreds of millions of mutations that
> result in megabytes of data. Is that an increase or decrease in size.
> Comparing apples to oranges :)
>
> roman.drapeko@baesystems.com wrote:
> > Hi guys,
> >
> > While doing pre-analytics we generate hundreds of millions of
> > mutations that result in 1-100 megabytes of useful data after major
> > compaction. We ingest into Accumulo using MR from Mapper job. We
> > identified that performance really degrades while increasing a number of
> mutations.
> >
> > The obvious improvement is to do some calculations in-memory before
> > sending mutations to Accumulo.
> >
> > Of course, at the same time we are looking for a solution to minimize
> > development effort.
> >
> > I guess I am asking about micro compaction/ingest-time iterators on
> > the client side (before data is sent to Accumulo).
> >
> > To my understanding, Accumulo does not support them, is it correct?
> > And if so, are there any plans to support this functionality in the
> future?
> >
> > Thanks
> >
> > Roman
> >

RE: micro compaction

Posted by "roman.drapeko@baesystems.com" <ro...@baesystems.com>.

My view is that the introduction of ingest-time iterators would be quite a useful feature. Anyway. ☺

Also, could anyone explain exactly why a composite mutation performs pretty much the same as a set of individual mutations?

One large composite mutation with 19 qualifiers inside is just 10-30% faster than 19 individual mutations.

From: Russ Weeks [mailto:rweeks@newbrightidea.com]
Sent: 09 June 2015 20:54
To: accumulo-user
Subject: Re: micro compaction

For consistency and ease of implementation. Say I've written a stack of combiners that do statistical aggregation, sampling etc. on my table. Rather than port that logic to a Storm topology or to the DStream API I'd just like to turn that stack on in my BatchWriter.

On Tue, Jun 9, 2015 at 12:47 PM David Medinets <da...@gmail.com> wrote:
Consider using Storm, Pig, Spark, or your own framework to handle the in-memory aggregation before giving the data to the BatchWriter. Why would any part of Accumulo code be responsible for this kind of application-specific data handling?

On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com <ro...@baesystems.com> wrote:
Just to clarify the origin of my question.

I had to do some performance tests to compare different storage types of “raw” data against each other.

Hopefully, picture below is visible in the mailing list. If not, I will put it somewhere else.

6 million “original” records, 1.3GB data, 233 bytes per record
Each original record is 40 fields delimited by tab, on average 19 – not null
Batchwriter, single java program

First three bars represent single “heavy” mutation to insert the whole tabular line / serialized object.
4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in one mutation)
8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in separate mutations) - ~19 mutations per original record

On average, single “heavy” mutations are 7-10 times faster than anything else, composite are 10%-35% faster than individual.

I am not an expert how Accumulo is implemented internally, however it looks like composite mutation is treated more or less in the same way as a set of individual mutations. Probably, largest overhead is added by WAL.

[cid:image001.png@01D0A2F7.DABDAD70]

Data utilization before and after manual compaction of test table and all system tables:

[cid:image002.png@01D0A2F7.DABDAD70]

It’s not clear why “accumulo du” shows twice less data used comparing to “hdfs du”.

All these tests made us think that we can improve performance by doing some calculations in-memory (and our use-case fits very well) and reducing number of mutations. Now I am trying to understand whether there is a relatively easy way to do this with Accumulo or whether it’s time to look closer into something like Spark.

Thanks
Roman

From: Adam Fuchs [mailto:afuchs@apache.org]
Sent: 09 June 2015 19:08
To: user@accumulo.apache.org
Subject: Re: micro compaction

I think this might be the same concept as in-mapper combining, but applied to data being sent to a BatchWriter rather than an OutputCollector. See [1], section 3.1.1. A similar performance analysis and probably a lot of the same code should apply here.

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com> wrote:
Having a combiner stack (more generally an iterator stack) run on the client-side seems to be the second most popular request on this list. The most popular being, "How do I write to Accumulo from inside an iterator?"

Such a thing would be very useful for me, too. I have some cycles to help out, if somebody can give me an idea of where to get started and where the potential land-mines are.

-Russ

On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <ro...@baesystems.com> wrote:
Aggregated output is tiny,  so if I do same calculations in memory (instead of sending mutations to Accumulo) , I can reduce overall number of mutations by 1000x or so

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: 09 June 2015 16:54
To: user@accumulo.apache.org
Subject: Re: micro compaction

Well, you win the prize for new terminology. I haven't ever heard the term "micro compaction" before.

Can you clarify though, you say hundreds of millions of mutations that result in megabytes of data. Is that an increase or decrease in size.
Comparing apples to oranges :)

roman.drapeko@baesystems.com wrote:
> Hi guys,
>
> While doing pre-analytics we generate hundreds of millions of
> mutations that result in 1-100 megabytes of useful data after major
> compaction. We ingest into Accumulo using MR from Mapper job. We
> identified that performance really degrades while increasing a number of mutations.
>
> The obvious improvement is to do some calculations in-memory before
> sending mutations to Accumulo.
>
> Of course, at the same time we are looking for a solution to minimize
> development effort.
>
> I guess I am asking about micro compaction/ingest-time iterators on
> the client side (before data is sent to Accumulo).
>
> To my understanding, Accumulo does not support them, is it correct?
> And if so, are there any plans to support this functionality in the future?
>
> Thanks
>
> Roman
>

Re: micro compaction

Posted by Russ Weeks <rw...@newbrightidea.com>.

For consistency and ease of implementation. Say I've written a stack of
combiners that do statistical aggregation, sampling etc. on my table.
Rather than port that logic to a Storm topology or to the DStream API I'd
just like to turn that stack on in my BatchWriter.
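
As a rough sketch of the idea (in-memory aggregation in front of the writer),
the following plain-Java buffer sums values per cell and only hands the
totals downstream on flush. Everything in it is hypothetical: CombiningBuffer,
its sink callback and the NUL-separated key are for illustration only, and
none of it is Accumulo API. In a real client the sink would presumably build
a Mutation and pass it to BatchWriter.addMutation.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Hypothetical client-side summing buffer; not part of Accumulo. */
final class CombiningBuffer {
    // TreeMap keeps cells sorted, so the sink sees them in key order on flush.
    private final Map<String, Long> sums = new TreeMap<>();
    private final int maxEntries;
    private final BiConsumer<String, Long> sink;

    /**
     * @param maxEntries number of distinct cells to buffer before an automatic flush
     * @param sink       receives one (cell key, summed value) pair per cell on flush
     */
    CombiningBuffer(int maxEntries, BiConsumer<String, Long> sink) {
        this.maxEntries = maxEntries;
        this.sink = sink;
    }

    /** Adds a value to the running sum of its cell; assumes no field contains NUL. */
    void add(String row, String family, String qualifier, long value) {
        String key = row + '\u0000' + family + '\u0000' + qualifier;
        sums.merge(key, value, Long::sum);
        if (sums.size() >= maxEntries) {
            flush(); // bound memory use by the number of distinct cells
        }
    }

    /** Emits every buffered sum to the sink and empties the buffer. */
    void flush() {
        sums.forEach(sink);
        sums.clear();
    }

    public static void main(String[] args) {
        CombiningBuffer buffer = new CombiningBuffer(
                10_000, (key, sum) -> System.out.println(key.replace('\u0000', '/') + " = " + sum));
        for (int i = 0; i < 1_000_000; i++) {
            buffer.add("row" + (i % 3), "stats", "count", 1L);
        }
        buffer.flush(); // three writes instead of a million
    }
}
```

The buffer size is the knob that matters: a larger buffer combines more
before writing but holds data longer on the client, so anything buffered is
lost if the client dies before flush.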

On Tue, Jun 9, 2015 at 12:47 PM David Medinets <da...@gmail.com>
wrote:

> Consider using Storm, Pig, Spark, or your own framework to handle the
> in-memory aggregation before giving the data to the BatchWriter. Why would
> any part of Accumulo code be responsible for this kind of
> application-specific data handling?
>
> On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
>>  Just to clarify the origin of my question.
>>
>>
>>
>> I had to do some performance tests to compare different storage types of
>> “raw” data against each other.
>>
>>
>>
>> Hopefully, picture below is visible in the mailing list. If not, I will
>> put it somewhere else.
>>
>>
>>
>> 6 million “original” records, 1.3GB data, 233 bytes per record
>>
>> Each original record is 40 fields delimited by tab, on average 19 – not
>> null
>>
>> Batchwriter, single java program
>>
>>
>>
>> First three bars represent single “heavy” mutation to insert the whole
>> tabular line / serialized object.
>>
>> 4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in
>> one mutation)
>>
>> 8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in
>> separate mutations) - ~19 mutations per original record
>>
>>
>>
>> On average, single “heavy” mutations are 7-10 times faster than anything
>> else, composite are 10%-35% faster than individual.
>>
>>
>>
>> I am not an expert how Accumulo is implemented internally, however it
>> looks like composite mutation is treated more or less in the same way as a
>> set of individual mutations. Probably, largest overhead is added by WAL.
>>
>>
>>
>>
>>
>> Data utilization before and after manual compaction of test table and all
>> system tables:
>>
>>
>>
>>
>>
>> It’s not clear why “accumulo du” shows twice less data used comparing to
>> “hdfs du”.
>>
>>
>>
>> All these tests made us think that we can improve performance by doing
>> some calculations in-memory (and our use-case fits very well) and reducing
>> number of mutations. Now I am trying to understand whether there is a
>> relatively easy way to do this with Accumulo or whether it’s time to look
>> closer into something like Spark.
>>
>>
>>
>> Thanks
>>
>> Roman
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Adam Fuchs [mailto:afuchs@apache.org]
>> *Sent:* 09 June 2015 19:08
>>
>> *To:* user@accumulo.apache.org
>> *Subject:* Re: micro compaction
>>
>>
>>
>> I think this might be the same concept as in-mapper combining, but
>> applied to data being sent to a BatchWriter rather than an OutputCollector.
>> See [1], section 3.1.1. A similar performance analysis and probably a lot
>> of the same code should apply here.
>>
>>
>>
>> Cheers,
>>
>> Adam
>>
>>
>>
>> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>>
>>
>>
>> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com>
>> wrote:
>>
>> Having a combiner stack (more generally an iterator stack) run on the
>> client-side seems to be the second most popular request on this list. The
>> most popular being, "How do I write to Accumulo from inside an iterator?"
>>
>>
>>
>> Such a thing would be very useful for me, too. I have some cycles to help
>> out, if somebody can give me an idea of where to get started and where the
>> potential land-mines are.
>>
>>
>>
>> -Russ
>>
>>
>>
>> On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <
>> roman.drapeko@baesystems.com> wrote:
>>
>> Aggregated output is tiny,  so if I do same calculations in memory
>> (instead of sending mutations to Accumulo) , I can reduce overall number of
>> mutations by 1000x or so
>>
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com]
>> Sent: 09 June 2015 16:54
>> To: user@accumulo.apache.org
>> Subject: Re: micro compaction
>>
>> Well, you win the prize for new terminology. I haven't ever heard the
>> term "micro compaction" before.
>>
>> Can you clarify though, you say hundreds of millions of mutations that
>> result in megabytes of data. Is that an increase or decrease in size.
>> Comparing apples to oranges :)
>>
>> roman.drapeko@baesystems.com wrote:
>> > Hi guys,
>> >
>> > While doing pre-analytics we generate hundreds of millions of
>> > mutations that result in 1-100 megabytes of useful data after major
>> > compaction. We ingest into Accumulo using MR from Mapper job. We
>> > identified that performance really degrades while increasing a number
>> > of mutations.
>> >
>> > The obvious improvement is to do some calculations in-memory before
>> > sending mutations to Accumulo.
>> >
>> > Of course, at the same time we are looking for a solution to minimize
>> > development effort.
>> >
>> > I guess I am asking about micro compaction/ingest-time iterators on
>> > the client side (before data is sent to Accumulo).
>> >
>> > To my understanding, Accumulo does not support them, is it correct?
>> > And if so, are there any plans to support this functionality in the
>> > future?
>> >
>> > Thanks
>> >
>> > Roman
>> >
>>
>
>

Re: micro compaction

Posted by David Medinets <da...@gmail.com>.

Consider using Storm, Pig, Spark, or your own framework to handle the
in-memory aggregation before giving the data to the BatchWriter. Why would
any part of Accumulo code be responsible for this kind of
application-specific data handling?

On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com <
roman.drapeko@baesystems.com> wrote:

>  Just to clarify the origin of my question.
>
>
>
> I had to do some performance tests to compare different storage types of
> “raw” data against each other.
>
>
>
> Hopefully, picture below is visible in the mailing list. If not, I will
> put it somewhere else.
>
>
>
> 6 million “original” records, 1.3GB data, 233 bytes per record
>
> Each original record is 40 fields delimited by tab, on average 19 – not
> null
>
> Batchwriter, single java program
>
>
>
> First three bars represent single “heavy” mutation to insert the whole
> tabular line / serialized object.
>
> 4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in
> one mutation)
>
> 8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in
> separate mutations) - ~19 mutations per original record
>
>
>
> On average, single “heavy” mutations are 7-10 times faster than anything
> else, composite are 10%-35% faster than individual.
>
>
>
> I am not an expert how Accumulo is implemented internally, however it
> looks like composite mutation is treated more or less in the same way as a
> set of individual mutations. Probably, largest overhead is added by WAL.
>
>
>
>
>
> Data utilization before and after manual compaction of test table and all
> system tables:
>
>
>
>
>
> It’s not clear why “accumulo du” shows twice less data used comparing to
> “hdfs du”.
>
>
>
> All these tests made us think that we can improve performance by doing
> some calculations in-memory (and our use-case fits very well) and reducing
> number of mutations. Now I am trying to understand whether there is a
> relatively easy way to do this with Accumulo or whether it’s time to look
> closer into something like Spark.
>
>
>
> Thanks
>
> Roman
>
> *From:* Adam Fuchs [mailto:afuchs@apache.org]
> *Sent:* 09 June 2015 19:08
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: micro compaction
>
>
>
> I think this might be the same concept as in-mapper combining, but applied
> to data being sent to a BatchWriter rather than an OutputCollector. See
> [1], section 3.1.1. A similar performance analysis and probably a lot of
> the same code should apply here.
>
>
>
> Cheers,
>
> Adam
>
>
>
> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>
>
>
> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com>
> wrote:
>
> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
>
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
>
>
> -Russ
>
>
>
> On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
> Aggregated output is tiny,  so if I do same calculations in memory
> (instead of sending mutations to Accumulo) , I can reduce overall number of
> mutations by 1000x or so
>
>
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: 09 June 2015 16:54
> To: user@accumulo.apache.org
> Subject: Re: micro compaction
>
> Well, you win the prize for new terminology. I haven't ever heard the term
> "micro compaction" before.
>
> Can you clarify though, you say hundreds of millions of mutations that
> result in megabytes of data. Is that an increase or decrease in size.
> Comparing apples to oranges :)
>
> roman.drapeko@baesystems.com wrote:
> > Hi guys,
> >
> > While doing pre-analytics we generate hundreds of millions of
> > mutations that result in 1-100 megabytes of useful data after major
> > compaction. We ingest into Accumulo using MR from Mapper job. We
> > identified that performance really degrades while increasing a number of
> > mutations.
> >
> > The obvious improvement is to do some calculations in-memory before
> > sending mutations to Accumulo.
> >
> > Of course, at the same time we are looking for a solution to minimize
> > development effort.
> >
> > I guess I am asking about micro compaction/ingest-time iterators on
> > the client side (before data is sent to Accumulo).
> >
> > To my understanding, Accumulo does not support them, is it correct?
> > And if so, are there any plans to support this functionality in the
> > future?
> >
> > Thanks
> >
> > Roman
> >
>

RE: micro compaction

Posted by "roman.drapeko@baesystems.com" <ro...@baesystems.com>.

Just to clarify the origin of my question.

I had to run some performance tests comparing different ways of storing “raw” data.

Hopefully the picture below is visible on the mailing list. If not, I will put it somewhere else.

6 million “original” records, 1.3GB of data, 233 bytes per record
Each original record has 40 tab-delimited fields, of which 19 on average are not null
BatchWriter, single Java program

The first three bars represent a single “heavy” mutation that inserts the whole tabular line / serialized object.
Bars 4, 5, 6, 7: composite mutation (all qualifiers for the same rowid in one mutation)
Bars 8, 9, 10, 11: individual mutations (all qualifiers for the same rowid in separate mutations), ~19 mutations per original record

On average, single “heavy” mutations are 7-10 times faster than anything else, and composite mutations are 10%-35% faster than individual ones.
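
To make the three layouts concrete, here is a rough sketch in Python. It does not use the Accumulo client API: the `heavy`, `composite`, and `individual` helpers, the `f0`, `f1`, ... qualifier names, and the `(row, [(qualifier, value), ...])` tuple shape are all hypothetical stand-ins for a mutation. It only shows how many mutations and column updates each layout produces for one tab-delimited record, with empty fields treated as null and skipped.

```python
from typing import List, Tuple

# A stand-in for a mutation: (row id, list of (qualifier, value) updates).
# This is NOT the Accumulo Mutation class, only an illustration.
Mutation = Tuple[str, List[Tuple[str, str]]]


def non_null_fields(line: str) -> List[Tuple[str, str]]:
    """Split a tab-delimited record and keep (qualifier, value) for non-empty fields."""
    return [(f"f{i}", v) for i, v in enumerate(line.split("\t")) if v != ""]


def heavy(row: str, line: str) -> List[Mutation]:
    """One mutation holding the whole line as a single value."""
    return [(row, [("line", line)])]


def composite(row: str, line: str) -> List[Mutation]:
    """One mutation holding every non-null field as its own qualifier."""
    return [(row, non_null_fields(line))]


def individual(row: str, line: str) -> List[Mutation]:
    """One mutation per non-null field."""
    return [(row, [update]) for update in non_null_fields(line)]


# Example record: 5 fields, 3 of them non-null (empty string means null here).
record = "a\t\tb\t\tc"
for layout in (heavy, composite, individual):
    muts = layout("row1", record)
    updates = sum(len(m[1]) for m in muts)
    print(layout.__name__, len(muts), "mutation(s),", updates, "update(s)")

# Scaling the individual layout to the test above: 6 million records with
# about 19 non-null fields each.
print(6_000_000 * 19, "individual mutations in total")
```

With the figures from the test, the individual layout comes to roughly 114 million mutations, while the heavy and composite layouts each send 6 million.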

I am not an expert on how Accumulo is implemented internally, but it looks like a composite mutation is treated more or less the same way as a set of individual mutations. The largest overhead is probably added by the WAL.

[Image: bar chart of the ingest performance tests described above; not preserved in this plain text archive]

Data utilization before and after manual compaction of the test table and all system tables:

[Image: data utilization before and after compaction; not preserved in this plain text archive]

It’s not clear why “accumulo du” shows half as much data used as “hdfs du”.

All these tests made us think that we can improve performance by doing some calculations in memory (our use case fits this very well) and reducing the number of mutations. Now I am trying to understand whether there is a relatively easy way to do this with Accumulo, or whether it’s time to look more closely at something like Spark.

Thanks
Roman

From: Adam Fuchs [mailto:afuchs@apache.org]
Sent: 09 June 2015 19:08
To: user@accumulo.apache.org
Subject: Re: micro compaction

I think this might be the same concept as in-mapper combining, but applied to data being sent to a BatchWriter rather than an OutputCollector. See [1], section 3.1.1. A similar performance analysis and probably a lot of the same code should apply here.

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com> wrote:
Having a combiner stack (more generally an iterator stack) run on the client-side seems to be the second most popular request on this list. The most popular being, "How do I write to Accumulo from inside an iterator?"

Such a thing would be very useful for me, too. I have some cycles to help out, if somebody can give me an idea of where to get started and where the potential land-mines are.

-Russ

On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <ro...@baesystems.com> wrote:
Aggregated output is tiny,  so if I do same calculations in memory (instead of sending mutations to Accumulo) , I can reduce overall number of mutations by 1000x or so

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: 09 June 2015 16:54
To: user@accumulo.apache.org
Subject: Re: micro compaction

Well, you win the prize for new terminology. I haven't ever heard the term "micro compaction" before.

Can you clarify though, you say hundreds of millions of mutations that result in megabytes of data. Is that an increase or decrease in size.
Comparing apples to oranges :)

roman.drapeko@baesystems.com wrote:
> Hi guys,
>
> While doing pre-analytics we generate hundreds of millions of
> mutations that result in 1-100 megabytes of useful data after major
> compaction. We ingest into Accumulo using MR from Mapper job. We
> identified that performance really degrades while increasing a number of mutations.
>
> The obvious improvement is to do some calculations in-memory before
> sending mutations to Accumulo.
>
> Of course, at the same time we are looking for a solution to minimize
> development effort.
>
> I guess I am asking about micro compaction/ingest-time iterators on
> the client side (before data is sent to Accumulo).
>
> To my understanding, Accumulo does not support them, is it correct?
> And if so, are there any plans to support this functionality in the future?
>
> Thanks
>
> Roman
>

Re: micro compaction

Posted by Adam Fuchs <af...@apache.org>.

I think this might be the same concept as in-mapper combining, but applied
to data being sent to a BatchWriter rather than an OutputCollector. See
[1], section 3.1.1. A similar performance analysis and probably a lot of
the same code should apply here.

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
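
As a rough illustration of the in-mapper combining idea applied to a writer, the sketch below buffers partial sums in memory and emits one combined update per key when the buffer is flushed. It is written in Python rather than against the Accumulo API: `CombiningBuffer`, `max_keys`, and the `sink` callback are hypothetical names, with `sink` standing in for whatever finally hands data to a BatchWriter. It assumes the aggregation is a sum, which is associative and commutative, so partial results can be flushed at any point and combined again later.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (row, column): a simplified stand-in for an Accumulo key


class CombiningBuffer:
    """Sum values per key in memory and flush combined totals to a sink."""

    def __init__(self, sink: Callable[[Key, int], None], max_keys: int = 100_000):
        self.sink = sink          # called once per buffered key on flush
        self.max_keys = max_keys  # bound on distinct keys held in memory
        self.totals: Dict[Key, int] = {}

    def add(self, key: Key, value: int) -> None:
        # Combine with any partial sum already buffered for this key.
        self.totals[key] = self.totals.get(key, 0) + value
        if len(self.totals) >= self.max_keys:
            self.flush()

    def flush(self) -> None:
        # Emit in sorted key order, then start over with an empty buffer.
        for key in sorted(self.totals):
            self.sink(key, self.totals[key])
        self.totals.clear()


# Usage: 3,000 raw updates over 3 keys collapse into 3 emitted updates.
emitted: List[Tuple[Key, int]] = []
buf = CombiningBuffer(lambda k, v: emitted.append((k, v)), max_keys=1000)
for i in range(3000):
    buf.add(("row%d" % (i % 3), "count"), 1)
buf.flush()  # must be called at the end, e.g. in a mapper's cleanup step
print(len(emitted), emitted)
```

Because the buffer is bounded, the same key can be emitted more than once across flushes, so a server-side combiner on the table would still be needed to merge those partial sums.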

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rw...@newbrightidea.com> wrote:

> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
> -Russ
>
> On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
>> Aggregated output is tiny,  so if I do same calculations in memory
>> (instead of sending mutations to Accumulo) , I can reduce overall number of
>> mutations by 1000x or so
>>
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com]
>> Sent: 09 June 2015 16:54
>> To: user@accumulo.apache.org
>> Subject: Re: micro compaction
>>
>> Well, you win the prize for new terminology. I haven't ever heard the
>> term "micro compaction" before.
>>
>> Can you clarify though, you say hundreds of millions of mutations that
>> result in megabytes of data. Is that an increase or decrease in size.
>> Comparing apples to oranges :)
>>
>> roman.drapeko@baesystems.com wrote:
>> > Hi guys,
>> >
>> > While doing pre-analytics we generate hundreds of millions of
>> > mutations that result in 1-100 megabytes of useful data after major
>> > compaction. We ingest into Accumulo using MR from Mapper job. We
>> > identified that performance really degrades while increasing a number
>> > of mutations.
>> >
>> > The obvious improvement is to do some calculations in-memory before
>> > sending mutations to Accumulo.
>> >
>> > Of course, at the same time we are looking for a solution to minimize
>> > development effort.
>> >
>> > I guess I am asking about micro compaction/ingest-time iterators on
>> > the client side (before data is sent to Accumulo).
>> >
>> > To my understanding, Accumulo does not support them, is it correct?
>> > And if so, are there any plans to support this functionality in the
>> > future?
>> >
>> > Thanks
>> >
>> > Roman
>> >
>>
>

Re: micro compaction

Posted by Russ Weeks <rw...@newbrightidea.com>.

Having a combiner stack (more generally an iterator stack) run on the
client-side seems to be the second most popular request on this list. The
most popular being, "How do I write to Accumulo from inside an iterator?"

Such a thing would be very useful for me, too. I have some cycles to help
out, if somebody can give me an idea of where to get started and where the
potential land-mines are.

-Russ

On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <
roman.drapeko@baesystems.com> wrote:

> Aggregated output is tiny,  so if I do same calculations in memory
> (instead of sending mutations to Accumulo) , I can reduce overall number of
> mutations by 1000x or so
>
>
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: 09 June 2015 16:54
> To: user@accumulo.apache.org
> Subject: Re: micro compaction
>
> Well, you win the prize for new terminology. I haven't ever heard the term
> "micro compaction" before.
>
> Can you clarify though, you say hundreds of millions of mutations that
> result in megabytes of data. Is that an increase or decrease in size.
> Comparing apples to oranges :)
>
> roman.drapeko@baesystems.com wrote:
> > Hi guys,
> >
> > While doing pre-analytics we generate hundreds of millions of
> > mutations that result in 1-100 megabytes of useful data after major
> > compaction. We ingest into Accumulo using MR from Mapper job. We
> > identified that performance really degrades while increasing a number of
> > mutations.
> >
> > The obvious improvement is to do some calculations in-memory before
> > sending mutations to Accumulo.
> >
> > Of course, at the same time we are looking for a solution to minimize
> > development effort.
> >
> > I guess I am asking about micro compaction/ingest-time iterators on
> > the client side (before data is sent to Accumulo).
> >
> > To my understanding, Accumulo does not support them, is it correct?
> > And if so, are there any plans to support this functionality in the
> > future?
> >
> > Thanks
> >
> > Roman
> >
>

RE: micro compaction

Posted by "roman.drapeko@baesystems.com" <ro...@baesystems.com>.

The aggregated output is tiny, so if I do the same calculations in memory (instead of sending mutations to Accumulo), I can reduce the overall number of mutations by 1000x or so.



-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: 09 June 2015 16:54
To: user@accumulo.apache.org
Subject: Re: micro compaction

Well, you win the prize for new terminology. I haven't ever heard the term "micro compaction" before.

Can you clarify though, you say hundreds of millions of mutations that result in megabytes of data. Is that an increase or decrease in size.
Comparing apples to oranges :)

roman.drapeko@baesystems.com wrote:
> Hi guys,
>
> While doing pre-analytics we generate hundreds of millions of
> mutations that result in 1-100 megabytes of useful data after major
> compaction. We ingest into Accumulo using MR from Mapper job. We
> identified that performance really degrades while increasing a number of mutations.
>
> The obvious improvement is to do some calculations in-memory before
> sending mutations to Accumulo.
>
> Of course, at the same time we are looking for a solution to minimize
> development effort.
>
> I guess I am asking about micro compaction/ingest-time iterators on
> the client side (before data is sent to Accumulo).
>
> To my understanding, Accumulo does not support them, is it correct?
> And if so, are there any plans to support this functionality in the future?
>
> Thanks
>
> Roman
>

Re: micro compaction

Posted by Josh Elser <jo...@gmail.com>.

Well, you win the prize for new terminology. I haven't ever heard the 
term "micro compaction" before.

Can you clarify, though: you say hundreds of millions of mutations that
result in megabytes of data. Is that an increase or a decrease in size?
Comparing apples to oranges :)

roman.drapeko@baesystems.com wrote:
> Hi guys,
>
> While doing pre-analytics we generate hundreds of millions of mutations
> that result in 1-100 megabytes of useful data after major compaction. We
> ingest into Accumulo using MR from Mapper job. We identified that
> performance really degrades while increasing a number of mutations.
>
> The obvious improvement is to do some calculations in-memory before
> sending mutations to Accumulo.
>
> Of course, at the same time we are looking for a solution to minimize
> development effort.
>
> I guess I am asking about micro compaction/ingest-time iterators on the
> client side (before data is sent to Accumulo).
>
> To my understanding, Accumulo does not support them, is it correct? And
> if so, are there any plans to support this functionality in the future?
>
> Thanks
>
> Roman
>