You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mesos.apache.org by meghdoot bhattacharya <me...@yahoo.com.INVALID> on 2018/06/05 07:30:10 UTC

Mesos replicated log dual writes

Recently investigation of logs of aurora around snapshot creations in replicated log come up with this
I0601 21:57:27.322444   144 log.cpp:577] Attempting to append 524304 bytes to the logI0601 21:57:27.322501   144 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 6506690I0601 21:57:27.323122   139 replica.cpp:539] Replica received write request for position 6506690 from __req_res__(3685)@x.x.x.x:yyyyI0601 21:57:27.443131   139 leveldb.cpp:341] Persisting action (524331 bytes) to leveldb took 119.847993msI0601 21:57:27.443208   139 replica.cpp:710] Persisted action APPEND at position 6506690I0601 21:57:27.443437   139 replica.cpp:693] Replica received learned notice for position 6506690 from @0.0.0.0:0I0601 21:57:27.478754   139 leveldb.cpp:341] Persisting action (524333 bytes) to leveldb took 35.278735msI0601 21:57:27.478818   139 replica.cpp:710] Persisted action APPEND at position 6506690


It seems there are dual writes to leveldb for the same record (with 2 byte diff). Becomes expensive if that is the case for large records. Hoping to get some insights.
Thx

Re: Mesos replicated log dual writes

Posted by Jie Yu <yu...@gmail.com>.

I am not aware any ticket related to this, so creating a ticket for
tracking the issue would be great.

- Jie

On Tue, Jun 5, 2018 at 9:30 AM, Meghdoot bhattacharya <
meghdoot_b@yahoo.com.invalid> wrote:

> Thx Joseph. Writing back the entire blob is costly instead of writing back
> only the 2 bytes. In this case this record is just a chunk of large
> snapshot that is getting written. So it multiplies.
>
> Should I open a jira to track and resolve this or something like this is
> already open?
>
>
> Thx
>
> > On Jun 5, 2018, at 9:10 AM, Joseph Wu <jo...@mesosphere.io> wrote:
> >
> > The two-byte difference is most likely coming from the "learned" field:
> > https://github.com/apache/mesos/blob/master/src/messages/log.proto#L49
> >
> > The first time an entry is recorded, the entry is "unlearned", basically
> meaning that the entry has not been written to a quorum of masters (yet).
> Once a quorum of masters indicate to each other that they have written an
> entry, they then flip the bit to "learned".
> >
> > This process is mandatory for Paxos to work, but I can see how
> overwriting an entry could be expensive.  There's no particular restriction
> on why we must overwrite the entire entry, except that we currently use the
> position number (i.e. 6506690 in your example) as the key in the leveldb
> implementation.
> >
> >> On Tue, Jun 5, 2018 at 12:30 AM, meghdoot bhattacharya
> <me...@yahoo.com.invalid> wrote:
> >> Recently investigation of logs of aurora around snapshot creations in
> replicated log come up with this
> >> I0601 21:57:27.322444   144 log.cpp:577] Attempting to append 524304
> bytes to the logI0601 21:57:27.322501   144 coordinator.cpp:348]
> Coordinator attempting to write APPEND action at position 6506690I0601
> 21:57:27.323122   139 replica.cpp:539] Replica received write request for
> position 6506690 from __req_res__(3685)@x.x.x.x:yyyyI0601
> 21:57:27.443131   139 leveldb.cpp:341] Persisting action (524331 bytes) to
> leveldb took 119.847993msI0601 21:57:27.443208   139 replica.cpp:710]
> Persisted action APPEND at position 6506690I0601 21:57:27.443437   139
> replica.cpp:693] Replica received learned notice for position 6506690 from
> @0.0.0.0:0I0601 21:57:27.478754   139 leveldb.cpp:341] Persisting action
> (524333 bytes) to leveldb took 35.278735msI0601 21:57:27.478818   139
> replica.cpp:710] Persisted action APPEND at position 6506690
> >>
> >>
> >> It seems there are dual writes to leveldb for the same record (with 2
> byte diff). Becomes expensive if that is the case for large records. Hoping
> to get some insights.
> >> Thx
> >
>

Re: Mesos replicated log dual writes

Posted by Jie Yu <yu...@gmail.com>.

I am not aware any ticket related to this, so creating a ticket for
tracking the issue would be great.

- Jie

On Tue, Jun 5, 2018 at 9:30 AM, Meghdoot bhattacharya <
meghdoot_b@yahoo.com.invalid> wrote:

> Thx Joseph. Writing back the entire blob is costly instead of writing back
> only the 2 bytes. In this case this record is just a chunk of large
> snapshot that is getting written. So it multiplies.
>
> Should I open a jira to track and resolve this or something like this is
> already open?
>
>
> Thx
>
> > On Jun 5, 2018, at 9:10 AM, Joseph Wu <jo...@mesosphere.io> wrote:
> >
> > The two-byte difference is most likely coming from the "learned" field:
> > https://github.com/apache/mesos/blob/master/src/messages/log.proto#L49
> >
> > The first time an entry is recorded, the entry is "unlearned", basically
> meaning that the entry has not been written to a quorum of masters (yet).
> Once a quorum of masters indicate to each other that they have written an
> entry, they then flip the bit to "learned".
> >
> > This process is mandatory for Paxos to work, but I can see how
> overwriting an entry could be expensive.  There's no particular restriction
> on why we must overwrite the entire entry, except that we currently use the
> position number (i.e. 6506690 in your example) as the key in the leveldb
> implementation.
> >
> >> On Tue, Jun 5, 2018 at 12:30 AM, meghdoot bhattacharya
> <me...@yahoo.com.invalid> wrote:
> >> Recently investigation of logs of aurora around snapshot creations in
> replicated log come up with this
> >> I0601 21:57:27.322444   144 log.cpp:577] Attempting to append 524304
> bytes to the logI0601 21:57:27.322501   144 coordinator.cpp:348]
> Coordinator attempting to write APPEND action at position 6506690I0601
> 21:57:27.323122   139 replica.cpp:539] Replica received write request for
> position 6506690 from __req_res__(3685)@x.x.x.x:yyyyI0601
> 21:57:27.443131   139 leveldb.cpp:341] Persisting action (524331 bytes) to
> leveldb took 119.847993msI0601 21:57:27.443208   139 replica.cpp:710]
> Persisted action APPEND at position 6506690I0601 21:57:27.443437   139
> replica.cpp:693] Replica received learned notice for position 6506690 from
> @0.0.0.0:0I0601 21:57:27.478754   139 leveldb.cpp:341] Persisting action
> (524333 bytes) to leveldb took 35.278735msI0601 21:57:27.478818   139
> replica.cpp:710] Persisted action APPEND at position 6506690
> >>
> >>
> >> It seems there are dual writes to leveldb for the same record (with 2
> byte diff). Becomes expensive if that is the case for large records. Hoping
> to get some insights.
> >> Thx
> >
>

Re: Mesos replicated log dual writes

Posted by Meghdoot bhattacharya <me...@yahoo.com.INVALID>.

Thx Joseph. Writing back the entire blob is costly instead of writing back only the 2 bytes. In this case this record is just a chunk of large snapshot that is getting written. So it multiplies.

Should I open a jira to track and resolve this or something like this is already open?


Thx

> On Jun 5, 2018, at 9:10 AM, Joseph Wu <jo...@mesosphere.io> wrote:
> 
> The two-byte difference is most likely coming from the "learned" field:
> https://github.com/apache/mesos/blob/master/src/messages/log.proto#L49
> 
> The first time an entry is recorded, the entry is "unlearned", basically meaning that the entry has not been written to a quorum of masters (yet).  Once a quorum of masters indicate to each other that they have written an entry, they then flip the bit to "learned".
> 
> This process is mandatory for Paxos to work, but I can see how overwriting an entry could be expensive.  There's no particular restriction on why we must overwrite the entire entry, except that we currently use the position number (i.e. 6506690 in your example) as the key in the leveldb implementation.  
> 
>> On Tue, Jun 5, 2018 at 12:30 AM, meghdoot bhattacharya <me...@yahoo.com.invalid> wrote:
>> Recently investigation of logs of aurora around snapshot creations in replicated log come up with this
>> I0601 21:57:27.322444   144 log.cpp:577] Attempting to append 524304 bytes to the logI0601 21:57:27.322501   144 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 6506690I0601 21:57:27.323122   139 replica.cpp:539] Replica received write request for position 6506690 from __req_res__(3685)@x.x.x.x:yyyyI0601 21:57:27.443131   139 leveldb.cpp:341] Persisting action (524331 bytes) to leveldb took 119.847993msI0601 21:57:27.443208   139 replica.cpp:710] Persisted action APPEND at position 6506690I0601 21:57:27.443437   139 replica.cpp:693] Replica received learned notice for position 6506690 from @0.0.0.0:0I0601 21:57:27.478754   139 leveldb.cpp:341] Persisting action (524333 bytes) to leveldb took 35.278735msI0601 21:57:27.478818   139 replica.cpp:710] Persisted action APPEND at position 6506690
>> 
>> 
>> It seems there are dual writes to leveldb for the same record (with 2 byte diff). Becomes expensive if that is the case for large records. Hoping to get some insights.
>> Thx
>

Re: Mesos replicated log dual writes

Posted by Meghdoot bhattacharya <me...@yahoo.com.INVALID>.

Thx Joseph. Writing back the entire blob is costly instead of writing back only the 2 bytes. In this case this record is just a chunk of large snapshot that is getting written. So it multiplies.

Should I open a jira to track and resolve this or something like this is already open?


Thx

> On Jun 5, 2018, at 9:10 AM, Joseph Wu <jo...@mesosphere.io> wrote:
> 
> The two-byte difference is most likely coming from the "learned" field:
> https://github.com/apache/mesos/blob/master/src/messages/log.proto#L49
> 
> The first time an entry is recorded, the entry is "unlearned", basically meaning that the entry has not been written to a quorum of masters (yet).  Once a quorum of masters indicate to each other that they have written an entry, they then flip the bit to "learned".
> 
> This process is mandatory for Paxos to work, but I can see how overwriting an entry could be expensive.  There's no particular restriction on why we must overwrite the entire entry, except that we currently use the position number (i.e. 6506690 in your example) as the key in the leveldb implementation.  
> 
>> On Tue, Jun 5, 2018 at 12:30 AM, meghdoot bhattacharya <me...@yahoo.com.invalid> wrote:
>> Recently investigation of logs of aurora around snapshot creations in replicated log come up with this
>> I0601 21:57:27.322444   144 log.cpp:577] Attempting to append 524304 bytes to the logI0601 21:57:27.322501   144 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 6506690I0601 21:57:27.323122   139 replica.cpp:539] Replica received write request for position 6506690 from __req_res__(3685)@x.x.x.x:yyyyI0601 21:57:27.443131   139 leveldb.cpp:341] Persisting action (524331 bytes) to leveldb took 119.847993msI0601 21:57:27.443208   139 replica.cpp:710] Persisted action APPEND at position 6506690I0601 21:57:27.443437   139 replica.cpp:693] Replica received learned notice for position 6506690 from @0.0.0.0:0I0601 21:57:27.478754   139 leveldb.cpp:341] Persisting action (524333 bytes) to leveldb took 35.278735msI0601 21:57:27.478818   139 replica.cpp:710] Persisted action APPEND at position 6506690
>> 
>> 
>> It seems there are dual writes to leveldb for the same record (with 2 byte diff). Becomes expensive if that is the case for large records. Hoping to get some insights.
>> Thx
>

Re: Mesos replicated log dual writes

Posted by Joseph Wu <jo...@mesosphere.io>.

The two-byte difference is most likely coming from the "learned" field:
https://github.com/apache/mesos/blob/master/src/messages/log.proto#L49

The first time an entry is recorded, the entry is "unlearned", basically
meaning that the entry has not been written to a quorum of masters (yet).
Once a quorum of masters indicate to each other that they have written an
entry, they then flip the bit to "learned".

This process is mandatory for Paxos to work, but I can see how overwriting
an entry could be expensive.  There's no particular restriction on why we
must overwrite the entire entry, except that we currently use the position
number (i.e. 6506690 in your example) as the key in the leveldb
implementation.

On Tue, Jun 5, 2018 at 12:30 AM, meghdoot bhattacharya <
meghdoot_b@yahoo.com.invalid> wrote:

> Recently investigation of logs of aurora around snapshot creations in
> replicated log come up with this
> I0601 21:57:27.322444   144 log.cpp:577] Attempting to append 524304
> bytes to the logI0601 21:57:27.322501   144 coordinator.cpp:348]
> Coordinator attempting to write APPEND action at position 6506690I0601
> 21:57:27.323122   139 replica.cpp:539] Replica received write request for
> position 6506690 from __req_res__(3685)@x.x.x.x:yyyyI0601 21:57:27.443131
>   139 leveldb.cpp:341] Persisting action (524331 bytes) to
> leveldb took 119.847993msI0601 21:57:27.443208   139 replica.cpp:710]
> Persisted action APPEND at position 6506690I0601 21:57:27.443437   139
> replica.cpp:693] Replica received learned notice for position 6506690 from
> @0.0.0.0:0I0601 21:57:27.478754   139 leveldb.cpp:341] Persisting action
> (524333 bytes) to leveldb took 35.278735msI0601 21:57:27.478818   139
> replica.cpp:710] Persisted action APPEND at position 6506690
>
>
> It seems there are dual writes to leveldb for the same record (with 2 byte
> diff). Becomes expensive if that is the case for large records. Hoping to
> get some insights.
> Thx
>