You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@accumulo.apache.org by "Terry P." <te...@gmail.com> on 2013/12/05 01:29:29 UTC

Re: Bloom filter thread failure errors

Hi Eric,
Thanks for your reply, I'm just now getting back to this as I had more of
these the past two days. No tserver failures or master halts. With previous
errors we were still experiencing network issues that were indeed taking
tabletservers down, but now that they fixed a bad line card in a switch
that had been rebooting itself (but not failing over), those issues are all
gone (finally, knock on wood).

Now that I see them again in isolation with no other errors, in the main
tserver log these bloom-loader thread failures appear to happen out of the
blue with no other issues surrounding them.

However, I just checked the debug log and see they are occurring right at
the time of a Major Compaction.  E.g. from one of the tservers debug log:

2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
0.00 secs, wait 0.00 secs
2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
(NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
[/t-0000aa9/C0000zn4.rf_tmp
2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
"bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
closed

The rest of the stack looks like what I posted earlier. The very next debug
log message after the bloom loader exception is shows that the Compaction
completed successfully in 0.112 seconds.

So it looks like the bloom loader is trying to open an rfile 41ms after a
compaction started, and the file was likely just compacted during that gap
between the calls. If that's the case, can this error be safely ignored?

Thanks,
Terry

On Mon, Nov 18, 2013 at 8:56 PM, Eric Newton <er...@gmail.com> wrote:

> This is an educated guess...
>
> When a process dies "gracefully" there's a shutdown hook that closes the
> FileSystem.  That can result in messages like this.  It's likely there's an
> error before this about a zookeeper session being lost, or a halt issued by
> the master.  See if this tserver died shortly after this message. If so,
> ignore the message.
>
> -Eric
>
>
>
> On Fri, Nov 15, 2013 at 4:31 PM, Terry P. <te...@gmail.com> wrote:
>
>> Greetings folks,
>> In my Accumulo 1.4.2 cluster I am seeing ERRORS about bloom loader
>> threads dying due to an rfile being closed.  I can't copy/paste the error
>> as it's on an air-gapped system, but it starts with:
>>
>> ERROR Thread "bloom-loader-2147" died File
>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>   java.lang.IllegalStateException: File
>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>     at
>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.getBCFile(CacheableBlockFile.java:244)
>>     at
>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.access$000(CacheableBlockFile.java:142)
>> (10 more java files ... ends with java.lang.Thread.run(UnknownSource) )
>>
>> No real rhyme or reason as to when they occur; we are predominantly
>> ingest heavy with light reads by rowkey with ~10 entries per rowkey.  I
>> don't really know if client programs are getting errors when these occur or
>> not.
>>
>> I didn't find any JIRAs related to these.  Should I be concerned about
>> these?
>>
>
>

Re: Bloom filter thread failure errors

Posted by Keith Turner <ke...@deenlo.com>.

On Thu, Dec 5, 2013 at 3:37 PM, Terry P. <te...@gmail.com> wrote:

> Hi Keith,
> Here is the stack trace in the tserver DEBUG log for this most recent
> exception. The exception section is the same as what's in the main tserver
> log, but of course the MajC bits don't appear in the main log. This is
> hand-typed, but I'm pretty sure it's right.
>
>
> 2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
> 0.00 secs, wait 0.00 secs
> 2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
> (NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
> [/t-0000aa9/C0000zn4.rf_tmp
> 2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
> "bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
> closed
> java.lang.IllegalStateException: File
> /accumulo/tables/2/t-0000aa9/C0000zmf.rf is closed
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:244)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$000(CachableBlockFile.java:142)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:211)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:307)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:357)
>   at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:142)
>   at
> org.apache.accumulo.core.file.rfile.RFile$Reader.getMetaStore(Rfile.java:927)
>   at
> org.apache.accumulo.core.file.BloomFilterLayer$BloomFilterLoader$1.run(BloomFilterLayer.java:210)
>   at
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>   at java.lang.Thread.run(Unknown Source)
> 2013-12-03 11:48:14,900 [tabletserver.Compactor] DEBUG: Compaction 2;f;d
> 280 read | 280 written |  2,500 entries/sec | 0.112 secs
> 2013-12-03 11:48:14,924 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00
> secs
>
> The biggest bummer here is just that it appears on the Monitor GUI as an
> error, and we all know how Operators don't like "errors" on their screens
> ;-)  But if this is one that can be safely ignored, we'll just have to
> write that up in a procedure somewhere.
>

The code in BloomFilterLayer$BloomFilterLoader$1.run() logs IOExceptions at
debug when the file is closed.  Because this is an IllegalStateException
its not being ignored.   Would you like to open a bug for this?


>
>
>
> On Thu, Dec 5, 2013 at 9:50 AM, Keith Turner <ke...@deenlo.com> wrote:
>
>>
>>
>>
>> On Wed, Dec 4, 2013 at 7:29 PM, Terry P. <te...@gmail.com> wrote:
>>
>>> Hi Eric,
>>> Thanks for your reply, I'm just now getting back to this as I had more
>>> of these the past two days. No tserver failures or master halts. With
>>> previous errors we were still experiencing network issues that were indeed
>>> taking tabletservers down, but now that they fixed a bad line card in a
>>> switch that had been rebooting itself (but not failing over), those issues
>>> are all gone (finally, knock on wood).
>>>
>>> Now that I see them again in isolation with no other errors, in the main
>>> tserver log these bloom-loader thread failures appear to happen out of the
>>> blue with no other issues surrounding them.
>>>
>>> However, I just checked the debug log and see they are occurring right
>>> at the time of a Major Compaction.  E.g. from one of the tservers debug log:
>>>
>>> 2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
>>> 0.00 secs, wait 0.00 secs
>>> 2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
>>> (NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
>>> [/t-0000aa9/C0000zn4.rf_tmp
>>> 2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
>>> "bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
>>> closed
>>>
>>> The rest of the stack looks like what I posted earlier. The very next
>>> debug log message after the bloom loader exception is shows that the
>>> Compaction completed successfully in 0.112 seconds.
>>>
>>> So it looks like the bloom loader is trying to open an rfile 41ms after
>>> a compaction started, and the file was likely just compacted during that
>>> gap between the calls. If that's the case, can this error be safely ignored?
>>>
>>
>> Its probably safe to ignore.   Bloomfilters are loaded lazily by a
>> background thread and its possible the file will be closed by the time the
>> background thread gets around to loading it.  However it should log a debug
>> in this case, so I am curious why an ERROR is logged.  Is there a stack
>> trace associated with the message 'Thread "bloom-loader-41" ...' ?
>>
>>
>>>
>>> Thanks,
>>> Terry
>>>
>>>
>>>
>>> On Mon, Nov 18, 2013 at 8:56 PM, Eric Newton <er...@gmail.com>wrote:
>>>
>>>> This is an educated guess...
>>>>
>>>> When a process dies "gracefully" there's a shutdown hook that closes
>>>> the FileSystem.  That can result in messages like this.  It's likely
>>>> there's an error before this about a zookeeper session being lost, or a
>>>> halt issued by the master.  See if this tserver died shortly after this
>>>> message. If so, ignore the message.
>>>>
>>>> -Eric
>>>>
>>>>
>>>>
>>>> On Fri, Nov 15, 2013 at 4:31 PM, Terry P. <te...@gmail.com> wrote:
>>>>
>>>>> Greetings folks,
>>>>> In my Accumulo 1.4.2 cluster I am seeing ERRORS about bloom loader
>>>>> threads dying due to an rfile being closed.  I can't copy/paste the error
>>>>> as it's on an air-gapped system, but it starts with:
>>>>>
>>>>> ERROR Thread "bloom-loader-2147" died File
>>>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>>>   java.lang.IllegalStateException: File
>>>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>>>     at
>>>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.getBCFile(CacheableBlockFile.java:244)
>>>>>     at
>>>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.access$000(CacheableBlockFile.java:142)
>>>>> (10 more java files ... ends with java.lang.Thread.run(UnknownSource) )
>>>>>
>>>>> No real rhyme or reason as to when they occur; we are predominantly
>>>>> ingest heavy with light reads by rowkey with ~10 entries per rowkey.  I
>>>>> don't really know if client programs are getting errors when these occur or
>>>>> not.
>>>>>
>>>>> I didn't find any JIRAs related to these.  Should I be concerned about
>>>>> these?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Bloom filter thread failure errors

Posted by "Terry P." <te...@gmail.com>.

Hi Keith,
Here is the stack trace in the tserver DEBUG log for this most recent
exception. The exception section is the same as what's in the main tserver
log, but of course the MajC bits don't appear in the main log. This is
hand-typed, but I'm pretty sure it's right.

2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
0.00 secs, wait 0.00 secs
2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
(NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
[/t-0000aa9/C0000zn4.rf_tmp
2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
"bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
closed
java.lang.IllegalStateException: File
/accumulo/tables/2/t-0000aa9/C0000zmf.rf is closed
  at
org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:244)
  at
org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$000(CachableBlockFile.java:142)
  at
org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:211)
  at
org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:307)
  at
org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:357)
  at
org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:142)
  at
org.apache.accumulo.core.file.rfile.RFile$Reader.getMetaStore(Rfile.java:927)
  at
org.apache.accumulo.core.file.BloomFilterLayer$BloomFilterLoader$1.run(BloomFilterLayer.java:210)
  at
org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)
2013-12-03 11:48:14,900 [tabletserver.Compactor] DEBUG: Compaction 2;f;d
280 read | 280 written |  2,500 entries/sec | 0.112 secs
2013-12-03 11:48:14,924 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00
secs

The biggest bummer here is just that it appears on the Monitor GUI as an
error, and we all know how Operators don't like "errors" on their screens
;-)  But if this is one that can be safely ignored, we'll just have to
write that up in a procedure somewhere.



On Thu, Dec 5, 2013 at 9:50 AM, Keith Turner <ke...@deenlo.com> wrote:

>
>
>
> On Wed, Dec 4, 2013 at 7:29 PM, Terry P. <te...@gmail.com> wrote:
>
>> Hi Eric,
>> Thanks for your reply, I'm just now getting back to this as I had more of
>> these the past two days. No tserver failures or master halts. With previous
>> errors we were still experiencing network issues that were indeed taking
>> tabletservers down, but now that they fixed a bad line card in a switch
>> that had been rebooting itself (but not failing over), those issues are all
>> gone (finally, knock on wood).
>>
>> Now that I see them again in isolation with no other errors, in the main
>> tserver log these bloom-loader thread failures appear to happen out of the
>> blue with no other issues surrounding them.
>>
>> However, I just checked the debug log and see they are occurring right at
>> the time of a Major Compaction.  E.g. from one of the tservers debug log:
>>
>> 2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
>> 0.00 secs, wait 0.00 secs
>> 2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
>> (NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
>> [/t-0000aa9/C0000zn4.rf_tmp
>> 2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
>> "bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
>> closed
>>
>> The rest of the stack looks like what I posted earlier. The very next
>> debug log message after the bloom loader exception is shows that the
>> Compaction completed successfully in 0.112 seconds.
>>
>> So it looks like the bloom loader is trying to open an rfile 41ms after a
>> compaction started, and the file was likely just compacted during that gap
>> between the calls. If that's the case, can this error be safely ignored?
>>
>
> Its probably safe to ignore.   Bloomfilters are loaded lazily by a
> background thread and its possible the file will be closed by the time the
> background thread gets around to loading it.  However it should log a debug
> in this case, so I am curious why an ERROR is logged.  Is there a stack
> trace associated with the message 'Thread "bloom-loader-41" ...' ?
>
>
>>
>> Thanks,
>> Terry
>>
>>
>>
>> On Mon, Nov 18, 2013 at 8:56 PM, Eric Newton <er...@gmail.com>wrote:
>>
>>> This is an educated guess...
>>>
>>> When a process dies "gracefully" there's a shutdown hook that closes the
>>> FileSystem.  That can result in messages like this.  It's likely there's an
>>> error before this about a zookeeper session being lost, or a halt issued by
>>> the master.  See if this tserver died shortly after this message. If so,
>>> ignore the message.
>>>
>>> -Eric
>>>
>>>
>>>
>>> On Fri, Nov 15, 2013 at 4:31 PM, Terry P. <te...@gmail.com> wrote:
>>>
>>>> Greetings folks,
>>>> In my Accumulo 1.4.2 cluster I am seeing ERRORS about bloom loader
>>>> threads dying due to an rfile being closed.  I can't copy/paste the error
>>>> as it's on an air-gapped system, but it starts with:
>>>>
>>>> ERROR Thread "bloom-loader-2147" died File
>>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>>   java.lang.IllegalStateException: File
>>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>>     at
>>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.getBCFile(CacheableBlockFile.java:244)
>>>>     at
>>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.access$000(CacheableBlockFile.java:142)
>>>> (10 more java files ... ends with java.lang.Thread.run(UnknownSource) )
>>>>
>>>> No real rhyme or reason as to when they occur; we are predominantly
>>>> ingest heavy with light reads by rowkey with ~10 entries per rowkey.  I
>>>> don't really know if client programs are getting errors when these occur or
>>>> not.
>>>>
>>>> I didn't find any JIRAs related to these.  Should I be concerned about
>>>> these?
>>>>
>>>
>>>
>>
>

Re: Bloom filter thread failure errors

Posted by Keith Turner <ke...@deenlo.com>.

On Wed, Dec 4, 2013 at 7:29 PM, Terry P. <te...@gmail.com> wrote:

> Hi Eric,
> Thanks for your reply, I'm just now getting back to this as I had more of
> these the past two days. No tserver failures or master halts. With previous
> errors we were still experiencing network issues that were indeed taking
> tabletservers down, but now that they fixed a bad line card in a switch
> that had been rebooting itself (but not failing over), those issues are all
> gone (finally, knock on wood).
>
> Now that I see them again in isolation with no other errors, in the main
> tserver log these bloom-loader thread failures appear to happen out of the
> blue with no other issues surrounding them.
>
> However, I just checked the debug log and see they are occurring right at
> the time of a Major Compaction.  E.g. from one of the tservers debug log:
>
> 2013-12-03 11:48:14,738 [tabletserver.Tablet] DEBUG: MajC initiate lock
> 0.00 secs, wait 0.00 secs
> 2013-12-03 11:48:14,739 [tabletserver.Tablet] DEBUG: Starting MajC 2;f;d
> (NORMAL) [/t-0000aa9/C0000zmf.rf, <several more rfiles listed> ] -->
> [/t-0000aa9/C0000zn4.rf_tmp
> 2013-12-03 11:48:14,780 [file.BloomFilterLayer] ERROR: Thread
> "bloom-loader-41" died File /accumulo/tables/2/t-0000aa9/C0000zmf.rf is
> closed
>
> The rest of the stack looks like what I posted earlier. The very next
> debug log message after the bloom loader exception is shows that the
> Compaction completed successfully in 0.112 seconds.
>
> So it looks like the bloom loader is trying to open an rfile 41ms after a
> compaction started, and the file was likely just compacted during that gap
> between the calls. If that's the case, can this error be safely ignored?
>

Its probably safe to ignore.   Bloomfilters are loaded lazily by a
background thread and its possible the file will be closed by the time the
background thread gets around to loading it.  However it should log a debug
in this case, so I am curious why an ERROR is logged.  Is there a stack
trace associated with the message 'Thread "bloom-loader-41" ...' ?


>
> Thanks,
> Terry
>
>
>
> On Mon, Nov 18, 2013 at 8:56 PM, Eric Newton <er...@gmail.com>wrote:
>
>> This is an educated guess...
>>
>> When a process dies "gracefully" there's a shutdown hook that closes the
>> FileSystem.  That can result in messages like this.  It's likely there's an
>> error before this about a zookeeper session being lost, or a halt issued by
>> the master.  See if this tserver died shortly after this message. If so,
>> ignore the message.
>>
>> -Eric
>>
>>
>>
>> On Fri, Nov 15, 2013 at 4:31 PM, Terry P. <te...@gmail.com> wrote:
>>
>>> Greetings folks,
>>> In my Accumulo 1.4.2 cluster I am seeing ERRORS about bloom loader
>>> threads dying due to an rfile being closed.  I can't copy/paste the error
>>> as it's on an air-gapped system, but it starts with:
>>>
>>> ERROR Thread "bloom-loader-2147" died File
>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>   java.lang.IllegalStateException: File
>>> /accumulo/tables/2/t-0000aa4/F0000q3g.rf is closed
>>>     at
>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.getBCFile(CacheableBlockFile.java:244)
>>>     at
>>> org.apache.accumulo.core.file.blockfile.impl.CacheableBlockFile$Reader.access$000(CacheableBlockFile.java:142)
>>> (10 more java files ... ends with java.lang.Thread.run(UnknownSource) )
>>>
>>> No real rhyme or reason as to when they occur; we are predominantly
>>> ingest heavy with light reads by rowkey with ~10 entries per rowkey.  I
>>> don't really know if client programs are getting errors when these occur or
>>> not.
>>>
>>> I didn't find any JIRAs related to these.  Should I be concerned about
>>> these?
>>>
>>
>>
>