Posted to dev@accumulo.apache.org by Vincent Russell <vi...@gmail.com> on 2022/04/18 19:00:51 UTC

Major compactions during map reduce

Hello All,

Could major compactions that occur while a map reduce job is running cause
the map reduce job to miss records because rows have been moved to a
different tablet?

How does this work?

I'm using Accumulo 2.0.1.

Thank you,
Vincent

Re: Major compactions during map reduce

Posted by Vincent Russell <vi...@gmail.com>.
Thank you Christopher.

No, we aren't seeing any errors. The MapReduce job just produces fewer
results than we expected, so it seemed possible that records were being
skipped by the mapper, even though we aren't passing in any ranges at all.

On Tue, Apr 19, 2022 at 5:33 AM Christopher <ct...@apache.org> wrote:

> Isolation should only give you consistency within a row, to ensure you're
> not scanning over partial changes from a mutation that is currently being
> written to a row. It shouldn't have anything to do with compactions or
> missing data that has already been written before the MapReduce scan has
> started.
>
> Splits shouldn't cause you to miss data either. It's been a while since I
> looked, but I believe the MapReduce APIs simply break up a table into
> separate ranges to scan based on current tablet boundaries. If there are
> splits, then all that means is that some of the ranges will span across
> more than one tablet, but that's fine... a scan is a scan... scans don't
> need to be limited to a single tablet.
>
> Compactions could cause missed data if they transform the data in some way,
> but otherwise, I wouldn't expect them to.
>
> Are you seeing any error messages anywhere?

Re: Major compactions during map reduce

Posted by Dave Marion <dm...@gmail.com>.
I was initially thinking about the case where the tablet splits change
between the job setup and the map execution, but after more thought I think
I went down the wrong path. Tablet splitting should not affect the overall
range of keys for the MR job. If a tablet splits after the job computes its
input splits, but before the map task runs, then that map task will just
scan multiple tablets.

On Tue, Apr 19, 2022 at 5:33 AM Christopher <ct...@apache.org> wrote:

> Isolation should only give you consistency within a row, to ensure you're
> not scanning over partial changes from a mutation that is currently being
> written to a row. It shouldn't have anything to do with compactions or
> missing data that has already been written before the MapReduce scan has
> started.
>
> Splits shouldn't cause you to miss data either. It's been a while since I
> looked, but I believe the MapReduce APIs simply break up a table into
> separate ranges to scan based on current tablet boundaries. If there are
> splits, then all that means is that some of the ranges will span across
> more than one tablet, but that's fine... a scan is a scan... scans don't
> need to be limited to a single tablet.
>
> Compactions could cause missed data if they transform the data in some way,
> but otherwise, I wouldn't expect them to.
>
> Are you seeing any error messages anywhere?

Re: Major compactions during map reduce

Posted by Christopher <ct...@apache.org>.
Isolation should only give you consistency within a row, to ensure you're
not scanning over partial changes from a mutation that is currently being
written to a row. It shouldn't have anything to do with compactions or
missing data that has already been written before the MapReduce scan has
started.

Splits shouldn't cause you to miss data either. It's been a while since I
looked, but I believe the MapReduce APIs simply break up a table into
separate ranges to scan based on current tablet boundaries. If there are
splits, then all that means is that some of the ranges will span across
more than one tablet, but that's fine... a scan is a scan... scans don't
need to be limited to a single tablet.
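
To make the point about splits concrete, here is a minimal sketch in plain
Python. It is a toy model, not Accumulo code: the helper names
(ranges_from_split_points, tablet_for, scan) and the sample rows are
hypothetical, and the only assumption taken from the discussion is that
input ranges are derived from the tablet boundaries seen at job setup. It
shows that a tablet split happening afterwards changes which tablets a
range touches, but not which rows the ranges cover.

```python
from bisect import bisect_left


def ranges_from_split_points(split_points):
    """Build (start_exclusive, end_inclusive) row ranges from split points.

    None stands for "unbounded" at either end of the table.
    """
    bounds = [None] + sorted(split_points) + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def in_range(row, rng):
    """True if row falls in the half-open range (start, end]."""
    start, end = rng
    return (start is None or row > start) and (end is None or row <= end)


def tablet_for(row, split_points):
    """Index of the tablet hosting row, given the current split points."""
    return bisect_left(sorted(split_points), row)


def scan(rows, rng):
    """Return the rows that a scan over rng would read."""
    return [r for r in rows if in_range(r, rng)]


rows = ["a", "c", "f", "k", "m", "q", "t", "z"]

# Tablet boundaries when the job computes its input ranges.
splits_at_setup = ["f", "q"]
input_ranges = ranges_from_split_points(splits_at_setup)

# Before the mappers run, the middle tablet splits at "k".
splits_at_run = ["f", "k", "q"]

# Every row is still read exactly once across all input ranges.
scanned = [r for rng in input_ranges for r in scan(rows, rng)]

# The middle range now simply spans two tablets.
tablets_touched = [
    sorted({tablet_for(r, splits_at_run) for r in scan(rows, rng)})
    for rng in input_ranges
]

print(scanned)
print(tablets_touched)
```

In this model the ranges are fixed at setup time, so the set of rows they
cover cannot shrink when a tablet splits later; only the number of tablets
behind each range changes.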

Compactions could cause missed data if they transform the data in some way,
but otherwise, I wouldn't expect them to.
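
As a rough illustration of that distinction, the following plain-Python
sketch models a tablet's files as lists of (row, timestamp, value) entries.
It is not Accumulo code; major_compact, scan, and the keep filter are
hypothetical stand-ins, with the filter playing the role of something like
an age-off rule applied during compaction. A plain compaction only merges
files, while a transforming one permanently changes what later scans see.

```python
def major_compact(files, keep=lambda entry: True):
    """Merge all files into one sorted file, dropping entries keep rejects."""
    merged = sorted(e for f in files for e in f if keep(e))
    return [merged]


def scan(files):
    """Return every entry visible across the given files, in sorted order."""
    return sorted(e for f in files for e in f)


# Two files belonging to the same tablet.
files = [
    [("row1", 100, "v1"), ("row3", 300, "v3")],
    [("row2", 200, "v2"), ("row4", 400, "v4")],
]

before = scan(files)

# A compaction that only merges files leaves scan results unchanged.
plain = scan(major_compact(files))

# A compaction with a hypothetical age-off style filter drops old entries,
# so a scan that runs afterwards returns fewer records.
aged = scan(major_compact(files, keep=lambda e: e[1] >= 250))

print(len(before), len(plain), len(aged))
```

If a job returns fewer records than expected, checking whether any
filtering or transforming iterators are configured for compactions on the
table is one thing this model suggests looking at.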

Are you seeing any error messages anywhere?

On Mon, Apr 18, 2022, 15:23 Vincent Russell <vi...@gmail.com>
wrote:

> Hi Dave,
>
> Yes we are using the new MapReduce API, but we are not setting any
> settings for isolated scan so we are using whatever the default is.
>
> Thanks,
> Vincent

Re: Major compactions during map reduce

Posted by Vincent Russell <vi...@gmail.com>.
Hi Dave,

Yes, we are using the new MapReduce API, but we are not configuring
isolated scans, so we are using whatever the default is.

Thanks,
Vincent

On Mon, Apr 18, 2022 at 3:12 PM Dave Marion <dm...@gmail.com> wrote:

> Major compactions should not move rows to new tablets, but a tablet split
> could. Are you using the new MapReduce API introduced in 2.0? Are you
> setting it to use an isolated scan?

Re: Major compactions during map reduce

Posted by Dave Marion <dm...@gmail.com>.
Major compactions should not move rows to new tablets, but a tablet split
could. Are you using the new MapReduce API introduced in 2.0? Are you
setting it to use an isolated scan?

On Mon, Apr 18, 2022 at 3:01 PM Vincent Russell <vi...@gmail.com>
wrote:

> Hello All,
>
> Could major compactions that occur while a map reduce job is running cause
> the map reduce job to miss records because rows have been moved to a
> different tablet?
>
> How does this work?
>
> I'm using accumulo 2.0.1
>
> Thank you,
> Vincent
>