Posted to dev@accumulo.apache.org by Bradley Barber <bb...@phemi.com> on 2021/03/25 16:43:32 UTC

Nuances of Major Compactions

Hi all!

I'm looking for details on major compaction. Some of my colleagues and I
have been working on an iterator which we are attaching at major compaction
scope. The logic of this iterator requires that it always see entire rows,
i.e. that it iterate over all KV entries which make up all versions of a given row.
From the Accumulo documentation, we had assumed this was guaranteed for
major compactions since tablets are partitioned at row boundaries.

However, we are seeing some intermittent (and fairly rare) occurrences of
incorrect behaviour from our iterator. Having reviewed and tested the
iterator logic, we are quite confident it works as intended. Were we
incorrect in thinking that only entire rows will take part in major
compactions? Are there instances where a major compaction within a tablet
will see only partial rows? On reviewing the documentation, it seems this *may*
be possible when a major compaction is called to merge a subset of RFiles
in a given tablet, but it's not very clear. Would anyone be able to clarify
this for us?

Issues with our iterator logic may also occur if reseeks are performed
during a major compaction. However, from our reading of the available
documentation, we got the impression that reseeks do not occur during major
compaction and we can't see why they would. Is this guaranteed or are
there cases where a reseek may in fact be called during major compaction?

Sorry for the long, involved questions but any clarification would help us
greatly and be very appreciated :)

Hope you all are having a good week,
Bradley Barber

Re: Nuances of Major Compactions

Posted by Bradley Barber <bb...@phemi.com>.
Hi Billie,

Thanks for the quick reply! This confirms some of the concerns we had when
reviewing the documentation more closely. We've been logging the
isFullMajorCompaction value thinking this was likely the issue, so now we
just need to design around it.

Thanks again for the reply and for taking the time to clarify this, much
appreciated!

All the best,
Bradley Barber

On Thu, Mar 25, 2021 at 10:37 AM Billie Rinaldi <bi...@apache.org> wrote:

> Yes, it is definitely possible for a major compaction to see only part of a
> row. Only during a full major compaction will an iterator see all of the
> tablet's files. Even then, the iterator would not see any k/v entries for
> the row that were still in memory in the tablet server. Ingest would need
> to be paused and the table would need to be flushed for a full major
> compaction to be guaranteed to see entire rows.
>
> The IteratorEnvironment passed into the iterator initialization has a
> method isFullMajorCompaction that allows an iterator to tell if a full
> major compaction is happening or not. Here is an example of its use:
>
> https://github.com/apache/accumulo/blob/1dc72fce2c781dee597c8c11876a3bc6c321c199/core/src/main/java/org/apache/accumulo/core/iterators/user/RowDeletingIterator.java#L98
>
> It seems like you are correct about reseeks not occurring during major
> compaction, but I would need to double check that.
>
> Billie
>

Re: Nuances of Major Compactions

Posted by Billie Rinaldi <bi...@apache.org>.
Yes, it is definitely possible for a major compaction to see only part of a
row. Only during a full major compaction will an iterator see all of the
tablet's files. Even then, the iterator would not see any k/v entries for
the row that were still in memory in the tablet server. Ingest would need
to be paused and the table would need to be flushed for a full major
compaction to be guaranteed to see entire rows.
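
To make the point above concrete, here is a toy model in plain Java. It does
not use any Accumulo classes: the lists standing in for RFiles and for the
in-memory map, the "row:cf:cq" key strings, and the class and method names are
all made up for illustration. It only shows which keys of one row a compaction
would merge, depending on which files take part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: plain lists stand in for RFiles and the in-memory map.
class PartialRowDemo {

    /** Merges the chosen "files" and returns, in sorted order, the keys of one row the compaction would see. */
    static List<String> keysSeenForRow(List<List<String>> filesInCompaction, String row) {
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> file : filesInCompaction) {
            for (String key : file) {
                if (key.startsWith(row + ":")) {
                    merged.add(key);
                }
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> fileA = List.of("row1:cf:a", "row2:cf:a");
        List<String> fileB = List.of("row1:cf:b");
        List<String> fileC = List.of("row1:cf:c", "row3:cf:a");
        List<String> inMemory = List.of("row1:cf:d"); // written but not yet flushed

        // Partial major compaction: only two of the three files are merged.
        System.out.println(keysSeenForRow(List.of(fileA, fileB), "row1"));

        // Full major compaction: all files, but the unflushed entry is still missing.
        System.out.println(keysSeenForRow(List.of(fileA, fileB, fileC), "row1"));

        // Ingest paused and table flushed first: the in-memory data is now a file too.
        System.out.println(keysSeenForRow(List.of(fileA, fileB, fileC, inMemory), "row1"));
    }
}
```

In this model only the last call sees all four entries of row1, which matches
the "pause ingest, flush, then run a full major compaction" condition.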

The IteratorEnvironment passed into the iterator initialization has a
method isFullMajorCompaction that allows an iterator to tell if a full
major compaction is happening or not. Here is an example of its use:
https://github.com/apache/accumulo/blob/1dc72fce2c781dee597c8c11876a3bc6c321c199/core/src/main/java/org/apache/accumulo/core/iterators/user/RowDeletingIterator.java#L98
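
For readers who just want the shape of the check, a minimal sketch follows.
So that it compiles without Accumulo on the classpath, Env and Scope below are
hypothetical stand-ins for IteratorEnvironment and its iterator scope, and
fakeEnv and seesAllFiles are made-up helper names. The stand-in throws when
isFullMajorCompaction is asked outside of a major compaction; that behaviour is
an assumption of this sketch, which is why the scope is tested first.

```java
// Sketch only: Env and Scope are stand-ins, not the real Accumulo types.
class CompactionGuardDemo {

    enum Scope { scan, minc, majc }

    /** Hypothetical stand-in for the two environment methods used here. */
    interface Env {
        Scope getIteratorScope();
        boolean isFullMajorCompaction();
    }

    /** Builds a fake environment for the demo. */
    static Env fakeEnv(Scope scope, boolean full) {
        return new Env() {
            @Override
            public Scope getIteratorScope() {
                return scope;
            }

            @Override
            public boolean isFullMajorCompaction() {
                // Assumption of this sketch: the question is invalid outside majc scope.
                if (scope != Scope.majc) {
                    throw new IllegalStateException("not a major compaction");
                }
                return full;
            }
        };
    }

    /** True only when the iterator runs in a major compaction that covers all of the tablet's files. */
    static boolean seesAllFiles(Env env) {
        // Short-circuit: isFullMajorCompaction is only called at majc scope.
        return env.getIteratorScope() == Scope.majc && env.isFullMajorCompaction();
    }

    public static void main(String[] args) {
        System.out.println(seesAllFiles(fakeEnv(Scope.majc, true)));  // full major compaction
        System.out.println(seesAllFiles(fakeEnv(Scope.majc, false))); // partial major compaction
        System.out.println(seesAllFiles(fakeEnv(Scope.scan, false))); // scan time
    }
}
```

An iterator would typically compute such a flag once in init and fall back to
passing data through unchanged when it is false. Even when it is true, entries
still in memory are not visible, as noted above.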

It seems like you are correct about reseeks not occurring during major
compaction, but I would need to double check that.

Billie
