Posted to dev@accumulo.apache.org by Corey Nolet <cj...@gmail.com> on 2014/08/22 22:35:18 UTC

Tablet server thrift issue

Eric & Keith, Chris mentioned to me that you guys have seen this issue
before. Any ideas from anyone else are much appreciated as well.

I recently updated a project's dependencies to Accumulo 1.6.0 built with
Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
component which is running all the time with a batch writer using many
threads to push mutations into Accumulo.

The issue I'm having is a show stopper. At different intervals of time,
sometimes an hour, sometimes 30 minutes, I'm getting
MutationsRejectedExceptions (server errors) from the
TabletServerBatchWriter. Once they start, I need to restart the ingester to
get them to stop. They always come back within 30 minutes to an hour...
rinse, repeat.

The exception always happens on different tablet servers. It's a thrift
error saying a message was received out of sequence. In the TabletServer
logs, I see an "Invalid session id" exception which happens only once
before the client-side batch writer starts spitting out the MREs.

I'm running some heavyweight processing in Storm alongside the tablet
servers. I shut that processing off in hopes that it was the culprit,
but that hasn't fixed the issue.

I'm surprised I haven't seen any other posts on the topic.

Thanks!

Re: Tablet server thrift issue

Posted by Corey Nolet <cj...@gmail.com>.

As an update,

I raised the tablet server memory and I have not seen this error thrown
since. I'd like to say that raising the memory alone was the solution, but
it appears that I may also be having some performance issues with the
switches connecting the racks together. I'll post more as I dig in further.


On Fri, Aug 22, 2014 at 11:41 PM, Corey Nolet <cj...@gmail.com> wrote:

> Josh,
>
> Your advice is definitely useful- I also thought about catching the
> exception and retrying with a fresh batch writer but the fact that the
> batch writer failure doesn't go away without being re-instantiated is
> really only a nuisance. The TabletServerBatchWriter could be designed much
> better, I agree, but that is not the root of the problem.
>
> The Thrift exception that is causing the issue is what I'd like to get to
> the bottom of. It's throwing the following:
>
> *TApplicationException: applyUpdates failed: out of sequence response *
>
> I've never seen this exception before in regular use of the client API-
> but I also just updated to 1.6.0. Google isn't showing anything useful for
> how exactly this exception could come about other than using a bad
> threading model- and I don't see any drastic changes or other user
> complaints on the mailing list that would validate that line of thought.
> Quite frankly, I'm stumped. This could be a Thrift exception related to a
> Thrift bug or something bad on my system and have nothing to do with
> Accumulo.
>
> Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
> seen the exception before and may remember what it was/how they fixed it.

Re: Tablet server thrift issue

Posted by Corey Nolet <cj...@gmail.com>.

Josh,

Your advice is definitely useful. I also thought about catching the
exception and retrying with a fresh batch writer, but the fact that the
batch writer failure doesn't go away without being re-instantiated is
really only a nuisance. The TabletServerBatchWriter could be designed much
better, I agree, but that is not the root of the problem.

The Thrift exception that is causing the issue is what I'd like to get to
the bottom of. It's throwing the following:

*TApplicationException: applyUpdates failed: out of sequence response*

I've never seen this exception before in regular use of the client API, but
I also just updated to 1.6.0. Google isn't showing anything useful for how
exactly this exception could come about, other than using a bad threading
model, and I don't see any drastic changes or other user complaints on the
mailing list that would validate that line of thought. Quite frankly, I'm
stumped. This could be a Thrift exception related to a Thrift bug or
something bad on my system, and have nothing to do with Accumulo.

Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
seen the exception before and may remember what it was/how they fixed it.
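
For readers unfamiliar with the "bad threading model" mentioned above, the
sketch below is a toy model of how an "out of sequence response" can arise
when one Thrift-style client is used for two overlapping calls. It assumes
the client stamps each request with an incrementing sequence id and rejects
a response whose id differs from the latest one it sent. ToyConnection and
ToyClient are hypothetical names, not Thrift or Accumulo classes, and
nothing here shows that this is what happened in the ingester above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of one connection: the "server" answers requests in arrival
// order and echoes each request's sequence id back in its response.
final class ToyConnection {
    private final Deque<Integer> responses = new ArrayDeque<>();

    void send(int seqid) { responses.addLast(seqid); }

    int receive() { return responses.removeFirst(); }
}

// Toy client: remembers the sequence id of the most recent request it sent
// and rejects a response that carries any other id.
final class ToyClient {
    private final ToyConnection conn;
    private int seqid = 0;

    ToyClient(ToyConnection conn) { this.conn = conn; }

    void send() { conn.send(++seqid); }

    void receive(String method) {
        int got = conn.receive();
        if (got != seqid) {
            throw new IllegalStateException(method + " failed: out of sequence response");
        }
    }
}
```

Calling send() then receive() in strict alternation succeeds. Calling
send() twice before the first receive(), which is what two threads sharing
one client can do, makes the first receive() see id 1 while the client
expects id 2, so it throws.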


On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser <jo...@gmail.com> wrote:

> Don't mean to tell you that I don't think there might be a bug/otherwise,
> that's pretty much just the limit of what I know about the server-side
> sessions :)
>
> If you have concrete "this worked in 1.4.4" and "this happens instead with
> 1.6.0", that'd make a great ticket :D
>
> The BatchWriter failure case is pretty rough, actually. Eric has made some
> changes to help already (in 1.6.1, I think), but it needs an overhaul that
> I haven't been able to make time to fix properly, either. IIRC, the only
> guarantee you have is that all mutations added before the last flush()
> happened are durable on the server. Anything else is a guess. I don't know
> the specifics, but that should be enough to work with (and saving off
> mutations shouldn't be too costly since they're stored serialized).

Re: Tablet server thrift issue

Posted by Josh Elser <jo...@gmail.com>.

I don't mean to say there isn't a bug or some other problem; that's
pretty much just the limit of what I know about the server-side
sessions :)

If you have concrete "this worked in 1.4.4" and "this happens instead 
with 1.6.0", that'd make a great ticket :D

The BatchWriter failure case is pretty rough, actually. Eric has made 
some changes to help already (in 1.6.1, I think), but it needs an 
overhaul that I haven't been able to make time to fix properly, either. 
IIRC, the only guarantee you have is that all mutations added before the 
last flush() happened are durable on the server. Anything else is a 
guess. I don't know the specifics, but that should be enough to work 
with (and saving off mutations shouldn't be too costly since they're 
stored serialized).
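
The "save off mutations" idea above could look roughly like the sketch
below: keep a client-side copy of everything added since the last
successful flush(), and forget it only once flush() returns normally.
MutationSink and UnflushedLog are hypothetical names used for illustration.
In real code the sink would presumably wrap an Accumulo BatchWriter, and
the failure would surface as a MutationsRejectedException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the small part of a batch writer used here.
interface MutationSink<M> {
    void add(M mutation) throws Exception;
    void flush() throws Exception;
}

// Remembers every mutation handed to the sink since the last successful flush().
final class UnflushedLog<M> {
    private final List<M> pending = new ArrayList<>();

    // Record first, then hand off, so a mutation whose add() fails is still remembered.
    synchronized void add(MutationSink<M> sink, M mutation) throws Exception {
        pending.add(mutation);
        sink.add(mutation);
    }

    // Only after flush() returns normally are the pending mutations forgotten.
    synchronized void flush(MutationSink<M> sink) throws Exception {
        sink.flush();
        pending.clear();
    }

    // Copy of what would need to be replayed into a replacement writer.
    synchronized List<M> unflushed() {
        return new ArrayList<>(pending);
    }
}
```

After a failed flush(), unflushed() returns the mutations to replay. The
cost is holding each mutation twice until the next successful flush, which
is the double storage asked about earlier in the thread.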

On 8/22/14, 5:44 PM, Corey Nolet wrote:
> Thanks Josh,
>
> I understand about the session ID completely but the problem I have is that
> the exact same client code worked, line for line, just fine in 1.4.4 and
> it's acting up in 1.6.0. I also seem to remember the BatchWriter
> automatically creating a new session when one expired without an exception
> causing it to fail on the client.
>
> I know we've made changes since 1.4.4 but I'd like to troubleshoot the
> actual issue of the BatchWriter failing due to the thrift exception rather
> than just catching the exception and trying mutations again. The other
> issue is that I've already submitted a bunch of mutations to the batch
> writer from different threads. Does that mean I need to be storing them off
> twice? (once in the BatchWriter's cache and once in my own)
>
> The BatchWriter in my ingester is constantly sending data and the tablet
> servers have been given more than enough memory to be able to keep up.
> There's no swap being used and the network isn't experiencing any errors.

Re: Tablet server thrift issue

Posted by Corey Nolet <cj...@gmail.com>.

Thanks Josh,

I understand about the session ID completely but the problem I have is that
the exact same client code worked, line for line, just fine in 1.4.4 and
it's acting up in 1.6.0. I also seem to remember the BatchWriter
automatically creating a new session when one expired without an exception
causing it to fail on the client.

I know we've made changes since 1.4.4 but I'd like to troubleshoot the
actual issue of the BatchWriter failing due to the thrift exception rather
than just catching the exception and trying mutations again. The other
issue is that I've already submitted a bunch of mutations to the batch
writer from different threads. Does that mean I need to be storing them off
twice? (once in the BatchWriter's cache and once in my own)

The BatchWriter in my ingester is constantly sending data and the tablet
servers have been given more than enough memory to be able to keep up.
There's no swap being used and the network isn't experiencing any errors.

On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser <jo...@gmail.com> wrote:

> If you get an error from a BatchWriter, you pretty much have to throw away
> that instance of the BatchWriter and make a new one. See ACCUMULO-2990. If
> you want, you should be able to catch/recover from this without having to
> restart the ingester.
>
> If the session ID is invalid, my guess is that it hasn't been used
> recently and the tserver cleaned it up. The exception logic isn't the
> greatest (as it just is presented to you as a RTE).
>
> https://issues.apache.org/jira/browse/ACCUMULO-2990

Re: Tablet server thrift issue

Posted by Josh Elser <jo...@gmail.com>.

If you get an error from a BatchWriter, you pretty much have to throw 
away that instance of the BatchWriter and make a new one. See 
ACCUMULO-2990. If you want, you should be able to catch/recover from 
this without having to restart the ingester.

If the session ID is invalid, my guess is that it hasn't been used
recently and the tserver cleaned it up. The exception logic isn't the
greatest (it is just presented to you as a RuntimeException).

https://issues.apache.org/jira/browse/ACCUMULO-2990
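
A minimal sketch of the catch/recover approach described above, assuming
the caller still holds the batch it was trying to write: on any failure the
current writer is discarded, a new one is created, and the batch is sent
again. MutationWriter and RecreatingWriter are hypothetical names. In real
code the factory would presumably call Connector.createBatchWriter(...) and
the caught exception would be MutationsRejectedException. Resending is only
reasonable if writing the same mutation twice is acceptable for the table.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a batch writer; not the Accumulo interface.
interface MutationWriter<M> {
    void add(M mutation) throws Exception;
    void flush() throws Exception;
    void close() throws Exception;
}

final class RecreatingWriter<M> {
    private final Supplier<MutationWriter<M>> factory;
    private final int maxAttempts;
    private MutationWriter<M> current;

    RecreatingWriter(Supplier<MutationWriter<M>> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.current = factory.get();
    }

    // Writes the batch and flushes it. On any failure the writer is thrown
    // away, a new one is created, and the whole batch is sent again.
    synchronized void writeBatch(List<M> batch) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                for (M m : batch) {
                    current.add(m);
                }
                current.flush();
                return;
            } catch (Exception e) {
                last = e;
                try {
                    current.close();
                } catch (Exception ignored) {
                    // The failed writer is being discarded anyway.
                }
                current = factory.get();
            }
        }
        throw last;
    }
}
```

The attempt limit keeps a persistent server-side problem from turning into
an endless retry loop; the last exception is rethrown once it is reached.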

On 8/22/14, 4:35 PM, Corey Nolet wrote:
> Eric & Keith, Chris mentioned to me that you guys have seen this issue
> before. Any ideas from anyone else are much appreciated as well.
>
> I recently updated a project's dependencies to Accumulo 1.6.0 built with
> Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
> component which is running all the time with a batch writer using many
> threads to push mutations into Accumulo.
>
> The issue I'm having is a show stopper. At different intervals of time,
> sometimes an hour, sometimes 30 minutes, I'm getting
> MutationsRejectedExceptions (server errors) from the
> TabletServerBatchWriter. Once they start, I need to restart the ingester to
> get them to stop. They always come back within 30 minutes to an hour...
> rinse, repeat.
>
> The exception always happens on different tablet servers. It's a thrift
> error saying a message was received out of sequence. In the TabletServer
> logs, I see an "Invalid session id" exception which happens only once
> before the client-side batch writer starts spitting out the MREs.
>
> I'm running some heavyweight processing in Storm along side the tablet
> servers. I shut that processing off in hopes that maybe it was the culprit
> but that hasn't fixed the issue.
>
> I'm surprised I haven't seen any other posts on the topic.
>
> Thanks!
>