Posted to dev@accumulo.apache.org by Christopher <ct...@apache.org> on 2015/02/19 00:42:53 UTC

Re: Values go to a wrong table during recovery.

Hi Denis,

This doesn't sound like a known bug to me. Your hypothesis is reasonable,
since WALs use a surrogate ID, which maps to table ID/tablet information,
when read back. It is possible that it incorrectly interprets this mapping
and replays data into the wrong table. Given the amount of testing we do,
my instinct is to think this is unlikely, but if we can confirm this bug,
it would definitely be a very critical one.
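
The hypothesis above can be illustrated with a toy model. This is only a sketch under stated assumptions: the entry types, field names, and replay function below are hypothetical and are not Accumulo's actual WAL format or recovery code. It shows a log in which writes name their destination only through a surrogate ID, so the replay has to resolve that ID back to a table.

```scala
// Toy model only: NOT Accumulo's real WAL format or recovery code.
// All names here are hypothetical and exist purely for illustration.
sealed trait LogEntry

// Binds a short surrogate ID to a table ID and tablet (end row).
final case class DefineTablet(surrogateId: Int, tableId: String, endRow: String) extends LogEntry

// A write that identifies its destination only by surrogate ID.
final case class Mutation(surrogateId: Int, row: String, value: Array[Byte]) extends LogEntry

object WalReplaySketch {
  // Replays the log in order and returns the recovered values per table ID.
  def replay(log: Seq[LogEntry]): Map[String, Vector[Array[Byte]]] = {
    val start = (Map.empty[Int, String], Map.empty[String, Vector[Array[Byte]]])
    val (_, recovered) = log.foldLeft(start) {
      case ((ids, out), DefineTablet(sid, tableId, _)) =>
        // Remember which table this surrogate ID stands for.
        (ids.updated(sid, tableId), out)
      case ((ids, out), Mutation(sid, _, value)) =>
        // Resolve the surrogate ID; a wrong mapping here misroutes the value.
        val tableId = ids.getOrElse(sid, sys.error(s"undefined surrogate ID $sid"))
        (ids, out.updated(tableId, out.getOrElse(tableId, Vector.empty) :+ value))
    }
    recovered
  }
}
```

In this model, a replay that resolved a surrogate ID to the wrong table would place well-formed values from one table into another without any error, which is the kind of symptom described in the original report.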

To rule out some scenarios, is it possible that your clients are writing to
the wrong tables? Have you ever seen a failure affecting a table which does
not exist (like what might happen if there's an off-by-one error in the WAL
code)? Or affecting the metadata tables?

Can you reproduce this error reliably, or can you share the relevant ingest
code which can reproduce this failure? Also, what kind of tablet server
failures are you experiencing when this happens?

If you could file a bug report at https://issues.apache.org/browse/ACCUMULO
with any details and/or attachments to help us address the issue, we would
greatly appreciate it. This seems like something we'd want to fix pretty
quickly.

Thanks!

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Wed, Feb 18, 2015 at 6:26 PM, Denis <de...@camfex.cz> wrote:

> Hello.
>
> A few times I noticed that some tables have values they cannot have, and
> those entries have timestamps close to a tabletserver failure time.
> (I mean the wrong format: one table has msgpack values at least 10 bytes
> long and another table has 1-byte values, and after a failure I read
> one or two 1-byte values in the table where I expect to read msgpack.)
>
> I suspect that during the recovery process, when the WAL is being read,
> some entries are inserted into the wrong table.
>
> Maybe it is a known bug, as I am still using Accumulo 1.6.1.
>

Re: Values go to a wrong table during recovery.

Posted by Christopher <ct...@apache.org>.

Sorry, that link should be: https://issues.apache.org/jira/browse/ACCUMULO


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

Re: Values go to a wrong table during recovery.

Posted by Denis <de...@camfex.cz>.

I forgot to say that HDFS DataNodes and Accumulo tablet servers share
the same machines.
When a machine powers off, one tablet server and one DataNode become
unavailable.

On 2/19/15, Eric Newton <er...@gmail.com> wrote:
> https://issues.apache.org/jira/browse/ACCUMULO-3603
>
> -Eric

Re: Values go to a wrong table during recovery.

Posted by Eric Newton <er...@gmail.com>.

Denis,

After our tests, it's highly unlikely this problem is in the server.  If
you can provide a test client that has ever replicated the problem, please
attach it to the ticket.  Otherwise, we'll close the ticket unless someone
else can reproduce the problem.

-Eric


On Fri, Feb 20, 2015 at 1:46 PM, Keith Turner <ke...@deenlo.com> wrote:

> I updated ACCUMULO-3603 w/ details about an experiment I ran.

Re: Values go to a wrong table during recovery.

Posted by Denis <de...@camfex.cz>.

I think it was ACCUMULO-1364

On 2/20/15, Denis <de...@camfex.cz> wrote:
> Hm, I think, that bug is much older, which has been fixed in 1.5 or
> next minor 1.4.x. Unfortunately, I did not put the bug number in code
> comment (only
>
> conn.tableOperations.setProperty(tableName, "table.walog.enabled",
> "false") // avoid Accumulo 1.4 bug with recovery from empty logs
> (fixed in 1.5)
>
> ), so it was long time before ACCUMULO-3182's Sep-2014 and I have seen
> here in the mailing list or in the bugtracker that it was fixed in
> 1.5)

Re: Values go to a wrong table during recovery.

Posted by Josh Elser <jo...@gmail.com>.

Well, regardless, there was an outstanding issue where an empty WAL 
could have been made and would have blocked recovery. It's possible what 
I fixed was a regression, but it still sounds like what you saw :)

Denis wrote:
> Hm, I think, that bug is much older, which has been fixed in 1.5 or
> next minor 1.4.x. Unfortunately, I did not put the bug number in code
> comment (only
>
> conn.tableOperations.setProperty(tableName, "table.walog.enabled",
> "false") // avoid Accumulo 1.4 bug with recovery from empty logs
> (fixed in 1.5)
>
> ), so it was long time before ACCUMULO-3182's Sep-2014 and I have seen
> here in the mailing list or in the bugtracker that it was fixed in
> 1.5)

Re: Values go to a wrong table during recovery.

Posted by Denis <de...@camfex.cz>.

Hm, I think that bug is much older and was fixed in 1.5 or in the next
minor 1.4.x release. Unfortunately, I did not put the bug number in the
code comment, only this:

conn.tableOperations.setProperty(tableName, "table.walog.enabled", "false") // avoid Accumulo 1.4 bug with recovery from empty logs (fixed in 1.5)

So it was a long time before ACCUMULO-3182 (Sep 2014), and I saw on the
mailing list or in the bug tracker that it was fixed in 1.5.

On 2/20/15, Josh Elser <el...@apache.org> wrote:
> FYI, this was fixed in 1.6.2
> https://issues.apache.org/jira/browse/ACCUMULO-3182

Re: Values go to a wrong table during recovery.

Posted by Josh Elser <el...@apache.org>.

FYI, this was fixed in 1.6.2 
https://issues.apache.org/jira/browse/ACCUMULO-3182

Denis wrote:
> There was a bug in 1.4, if a tablet had empty walog there were some
> startup issues (tablet remains offline or something like this), and it
> happened often with index tables (hmm, the same tables I have this
> problem).

Re: Values go to a wrong table during recovery.

Posted by Denis <de...@camfex.cz>.

It is always used from a single thread.

On 2/20/15, Denis <de...@camfex.cz> wrote:
>> So just to confirm, you have seen this issue appear while walogs are
>> enabled for all tables, yes?
>
> Yes.
>
>> Any chance you can share your code that writes across several
>> BatchWriters?
>
> The writer is:
>
> // way faster than Connector.createMultiTableBatchWriter
> class MyMultiTableBatchWriter(maxMemory: Long, maxLatency: Long,
> maxWriteThreads: Int)(implicit val conn: Connector) {
>   private[this] val writers = new
> scala.collection.mutable.HashMap[String, BatchWriter]
>   def getBatchWriter(tableName: String): BatchWriter = synchronized {
>     writers.getOrElseUpdate(tableName,
> conn.createBatchWriter(tableName, maxMemory, maxLatency,
> maxWriteThreads))
>   }
>   def flush() = writers.foreach(_._2.flush())
>   def close() = writers.foreach(_._2.close())
> }
>
> But the full code which creates/uses the indexes is more than 1Mb of
> scala code...

Re: Values go to a wrong table during recovery.

Posted by Denis <de...@camfex.cz>.

> So just to confirm, you have seen this issue appear while walogs are enabled for all tables, yes?

Yes.

> Any chance you can share your code that writes across several BatchWriters?

The writer is:

// way faster than Connector.createMultiTableBatchWriter
class MyMultiTableBatchWriter(maxMemory: Long, maxLatency: Long, maxWriteThreads: Int)(implicit val conn: Connector) {
  private[this] val writers = new scala.collection.mutable.HashMap[String, BatchWriter]
  def getBatchWriter(tableName: String): BatchWriter = synchronized {
    writers.getOrElseUpdate(tableName, conn.createBatchWriter(tableName, maxMemory, maxLatency, maxWriteThreads))
  }
  def flush() = writers.foreach(_._2.flush())
  def close() = writers.foreach(_._2.close())
}

But the full code which creates/uses the indexes is more than 1 MB of
Scala code...
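
For readers without an Accumulo cluster at hand, the same pattern (one cached writer per table, flushed one after another) can be sketched in isolation. This is a minimal sketch, not the code from the report: `TableWriter` and `WriterCache` are hypothetical stand-ins and are not part of the Accumulo client API. Unlike the class above, it also synchronizes `flush` and `close`, which only matters if the cache is ever shared between threads.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a per-table writer; NOT the Accumulo BatchWriter interface.
trait TableWriter {
  def flush(): Unit
  def close(): Unit
}

// Caches one writer per table name and flushes them sequentially,
// in the order in which the tables were first requested.
class WriterCache(open: String => TableWriter) {
  private val writers = mutable.LinkedHashMap.empty[String, TableWriter]

  // Returns the cached writer for the table, opening it on first use.
  def writerFor(table: String): TableWriter = synchronized {
    writers.getOrElseUpdate(table, open(table))
  }

  // Flushes every cached writer, one after another.
  def flush(): Unit = synchronized {
    writers.values.foreach(_.flush())
  }

  // Closes every cached writer and forgets them.
  def close(): Unit = synchronized {
    writers.values.foreach(_.close())
    writers.clear()
  }
}
```

A test client built on such a stand-in could record which table each write was addressed to, which may help separate client-side misrouting from anything that happens during recovery.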

On 2/20/15, Sean Busbey <bu...@cloudera.com> wrote:
> So just to confirm, you have seen this issue appear while walogs are
> enabled for all tables, yes?
>
> Any chance you can share your code that writes across several BatchWriters?
>
> --
> Sean
>

Re: Values go to a wrong table during recovery.

Posted by Sean Busbey <bu...@cloudera.com>.

On Fri, Feb 20, 2015 at 2:33 PM, Denis <de...@camfex.cz> wrote:

> > Did you have walogs laying around when you upgraded?
>
> In 1.4-cluster (and first time in 1.6-cluster) I had walogs enabled
> for data tables and disabled for index tables.
>
> There was a bug in 1.4, if a tablet had empty walog there were some
> startup issues (tablet remains offline or something like this), and it
> happened often with index tables (hmm, the same tables I have this
> problem).
>
> So, in 1.4-cluster I disabled walog and ran full reindex periodically.
> After running 1.6-cluster some time I enabled walogs for all tables as
> the new cluster have less reliable hardware, which reboots from time
> to time.
>
>
So just to confirm, you have seen this issue appear while walogs are
enabled for all tables, yes?

Any chance you can share your code that writes across several BatchWriters?

-- 
Sean

Re: Values go to a wrong table during recovery.

Posted by Denis <de...@camfex.cz>.

>  If you can provide a test client that has ever replicated the problem, please attach it to the ticket.

I have seen it 3 times within a month, so I do not know how
to reproduce it reliably.
Perhaps I should back up the walogs next time and look into them.

> Is this the exact same cluster or is it just the same code you were using?

Same code, another cluster.

> Did you have walogs laying around when you upgraded?

In the 1.4 cluster (and at first in the 1.6 cluster) I had walogs enabled
for data tables and disabled for index tables.

There was a bug in 1.4: if a tablet had an empty walog there were some
startup issues (the tablet remained offline or something like this), and it
happened often with index tables (hmm, the same tables where I have this
problem).

So, in the 1.4 cluster I disabled walogs and ran a full reindex periodically.
After running the 1.6 cluster for some time I enabled walogs for all tables,
as the new cluster has less reliable hardware, which reboots from time
to time.

> Did you upgrade through 1.5 or straight from 1.4 to 1.6?

From 1.4 to 1.6. But it was not an upgrade; it was a copy of the .rf files
to a new cluster followed by importdirectory.

On 2/20/15, John Vines <vi...@apache.org> wrote:
> You said that you were operating this on 1.4. Is this the exact same
> cluster or is it just the same code you were using? Did you have walogs
> laying around when you upgraded? Did you upgrade through 1.5 or straight
> from 1.4 to 1.6?

Re: Values go to a wrong table during recovery.

Posted by John Vines <vi...@apache.org>.

You said that you were operating this on 1.4. Is this the exact same
cluster or is it just the same code you were using? Did you have walogs
laying around when you upgraded? Did you upgrade through 1.5 or straight
from 1.4 to 1.6?

On Fri, Feb 20, 2015 at 1:46 PM, Keith Turner <ke...@deenlo.com> wrote:

> I updated ACCUMULO-3603 w/ details about an experiment I ran.

Re: Values go to a wrong table during recovery.

Posted by Keith Turner <ke...@deenlo.com>.

I updated ACCUMULO-3603 w/ details about an experiment I ran.

On Wed, Feb 18, 2015 at 9:44 PM, Eric Newton <er...@gmail.com> wrote:

> https://issues.apache.org/jira/browse/ACCUMULO-3603
>
> -Eric

Re: Values go to a wrong table during recovery.

Posted by Eric Newton <er...@gmail.com>.

https://issues.apache.org/jira/browse/ACCUMULO-3603

-Eric


On Wed, Feb 18, 2015 at 7:12 PM, Denis <de...@camfex.cz> wrote:

> On 2/18/15, Christopher <ct...@apache.org> wrote:
>
> > To rule out some scenarios, is it possible that your clients are writing
> to
> > the wrong tables?
> That was the first idea, so I added assert()'s to the code of the
> writers few days ago. No assert was triggered, but some invalid values
> appear after new tserver failure.
>
> > Have you ever seen a failure affecting a table which does
> > not exist (like what might happen if there's an off-by-one error in the
> WAL
> > code)? Or affecting the metadata tables?
> No.
> Also, no tables were created or deleted during last two months.
>
> > Can you reproduce this error reliably, or can you share the relevant
> ingest
> > code which can reproduce this failure?
>
> I will think how to reproduce it.
> What could be special about the code: inserts are performed to few
> (5..8) tables at once (one data table + few index tables) but no
> MultiTableBatchWriter is used. Few BatchWriter`s (one per table) are
> created and flushed consequentially, in the same thread. For Accumulo
> 1.4 it was a performance optimization, if worked faster than
> MultiTableBatchWriter. Not sure if it is so for 1.6.1, this code was
> not changed after migration to 1.6.1.
> In all cases with invalid values the index tables were affected (one
> of the index table had values typical for another of the index
> tables).
>
> > Also, what kind of tablet server failures are you experiencing when this
> happens?
> Spontaneous power-offs. There is something wrong with the power units
> so every 2-3 days one of the servers suddenly turns off and reboots.
>

Re: Values go to a wrong table during recovery.

Posted by Denis <de...@camfex.cz>.

On 2/18/15, Christopher <ct...@apache.org> wrote:

> To rule out some scenarios, is it possible that your clients are writing to
> the wrong tables?
That was my first idea, so I added assert()s to the writer code a
few days ago. No assert was triggered, but some invalid values
appeared after a new tserver failure.
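
A client-side check of this kind might look like the sketch below. It is an illustration only: the table names are hypothetical (the thread does not name the real tables), and the two size rules are taken from the original report (msgpack values of at least 10 bytes in one table, 1-byte values in another). It uses `require` rather than `assert` because Scala's `assert` can be compiled out.

```scala
object ValueShapeCheck {
  // Hypothetical table names; the real ones are not given in the thread.
  val MsgpackTable = "index_msgpack"
  val FlagTable = "index_flag"

  // Returns true when the value has the size expected for the given table.
  def looksValid(table: String, value: Array[Byte]): Boolean = table match {
    case MsgpackTable => value.length >= 10 // msgpack values are at least 10 bytes
    case FlagTable    => value.length == 1  // this table only holds 1-byte values
    case _            => true               // no known constraint for other tables
  }

  // Fails fast with IllegalArgumentException before a misrouted value is written.
  def checkedValue(table: String, value: Array[Byte]): Array[Byte] = {
    require(looksValid(table, value),
      s"value of ${value.length} bytes does not fit table $table")
    value
  }
}
```

The same `looksValid` predicate could also be run over a scan of each table after a tablet server failure to find entries that do not belong there.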

> Have you ever seen a failure affecting a table which does
> not exist (like what might happen if there's an off-by-one error in the WAL
> code)? Or affecting the metadata tables?
No.
Also, no tables were created or deleted during the last two months.

> Can you reproduce this error reliably, or can you share the relevant ingest
> code which can reproduce this failure?

I will think about how to reproduce it.
What could be special about the code: inserts are performed to a few
(5..8) tables at once (one data table + a few index tables) but no
MultiTableBatchWriter is used. A few BatchWriters (one per table) are
created and flushed consecutively, in the same thread. For Accumulo
1.4 it was a performance optimization; it worked faster than
MultiTableBatchWriter. Not sure if that is still so for 1.6.1; this code
was not changed after the migration to 1.6.1.
In all cases with invalid values the index tables were affected (one
of the index tables had values typical for another of the index
tables).

> Also, what kind of tablet server failures are you experiencing when this happens?
Spontaneous power-offs. There is something wrong with the power units
so every 2-3 days one of the servers suddenly turns off and reboots.