Posted to dev@bookkeeper.apache.org by Sijie Guo <gu...@gmail.com> on 2012/10/09 09:04:53 UTC

Re: High latencies observed at the bookkeeper client while reading entries

> I was talking about BookkeeperAdmin and how it's used from
> BookkeeperTools. I believe that we currently only support replicating a
> bookie onto another bookie.

BookKeeperAdmin doesn't access the data files directly to replicate
entries. It talks to the live bookie servers to replicate the entries that
belonged to the dead bookie, so most of the time only the bookie server
itself touches the data files.

The only exception is when you run the bookie shell to look into the
details of entry log files or journals; it runs as a separate process in
read-only mode.
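
For reference, driving that replication from the admin client looks roughly
like the sketch below. The hostnames are made up, and the exact
recoverBookieData signature may vary by release, so treat this as an
illustration rather than the definitive API:

    import java.net.InetSocketAddress;
    import org.apache.bookkeeper.client.BookKeeperAdmin;

    public class RecoverDeadBookie {
        public static void main(String[] args) throws Exception {
            // The admin client reads the dead bookie's entries from the
            // surviving replicas and writes them to the target bookie;
            // it never opens the dead bookie's data files itself.
            BookKeeperAdmin admin = new BookKeeperAdmin("zk1.example.com:2181");
            InetSocketAddress dead =
                    new InetSocketAddress("bookie1.example.com", 3181);
            InetSocketAddress target =
                    new InetSocketAddress("bookie2.example.com", 3181);
            admin.recoverBookieData(dead, target);
            admin.close();
        }
    }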


On Fri, Sep 28, 2012 at 12:54 AM, Aniruddha Laud
<tr...@gmail.com> wrote:

> On Thu, Sep 27, 2012 at 3:21 AM, Sijie Guo <gu...@gmail.com> wrote:
>
> > > I took a look at the leveldb homepage and it says - "Only a single
> > > process (possibly multi-threaded) can access a particular database at a
> > > time." This is bad because it means we can't run the console or any
> > > recovery related operations while the bookies are running.
> >
> > Yes, leveldb is single-process; it prevents misuse by acquiring a lock on
> > the filesystem. We indeed can't run the console to look at its data while
> > the bookie is running.
> >
> > I am confused about the 'recovery' operations you mentioned. What kind of
> > recovery?
> >
> I was talking about BookkeeperAdmin and how it's used from BookkeeperTools.
> I believe that we currently only support replicating a bookie onto another
> bookie. We might want to support more operations of this nature in the
> future. These might need simultaneous access to the log files of a live
> bookie.
>
> >
> > > When we flush to the log file, we
> > > should simply flush entries sorted by the ledger id as key.
> >
> > If we want to sort when flushing, we need to buffer all edits in memory,
> > then flush. I assume this approach would act the same as an LSM tree
> > (which is what leveldb does).
> >
> Yes, quite similar. I've only glanced over LSM trees; I'll take a more
> detailed look soon.
>
> >
> > On Thu, Sep 27, 2012 at 9:39 AM, Aniruddha Laud <
> > trojan.of.troy@gmail.com> wrote:
> >
> > > On Wed, Sep 26, 2012 at 5:06 PM, Sijie Guo <gu...@gmail.com> wrote:
> > >
> > > > Sounds good to have a thread pool (and make it configurable) for
> > > > reading from entry log files.
> > > >
> > > > > We can have a write threadpool (and we should always keep this
> > > > lower than the number of processors) to process the add requests.
> > > >
> > > > One more point: we only have one active entry log file accepting
> > > > entries written to it. I don't think multiple write threads would
> > > > help now, since adding an entry to the entry log file is a
> > > > synchronized method.
> > > >
> > > I was talking about writing to the journal file. From what I
> > > understand, log file entries are flushed periodically by one thread and
> > > that is okay. The publish latencies on hedwig are dependent on the
> > > journal writes, though.
> > >
> > > >
> > > > In order to utilize the disk bandwidth more efficiently, we might
> > > > need to have one active entry log file accepting entries per ledger
> > > > disk. But that might require some directory layout changes (like the
> > > > logId; currently we use an incrementing log id for the whole bookie)
> > > > and logic changes. It would be a separate task if we did that.
> > > >
> > > > > Another thing we could possibly look at is reordering our writes to
> > > > > the log file to try and maintain locality for ledger entries.
> > > >
> > > > We have an internal prototype that uses leveldb to store 1) the data
> > > > of small entries (less than a few hundred bytes) and 2) the ledger
> > > > index for large entries (acting as the ledger cache for index
> > > > entries). Benefiting from leveldb, 1) we could have a more efficient
> > > > cache when there is a large number of ledgers and size skew between
> > > > ledgers, and 2) we could have data belonging to the same ledger
> > > > clustered when writing to disk, which somewhat achieves the
> > > > 'reordering writes' you mentioned.
> > > >
> > > I took a look at the leveldb homepage and it says - "Only a single
> > > process (possibly multi-threaded) can access a particular database at a
> > > time." This is bad because it means we can't run the console or any
> > > recovery related operations while the bookies are running. I may be
> > > wrong, though.
> > > What I had in mind was pretty simple: when we flush to the log file, we
> > > should simply flush entries sorted by the ledger id as key, roughly
> > > like the sketch below. Some changes might be needed to the ledger index
> > > cache, but I'm not very sure what those changes would be. What do you
> > > think?
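> > >
> > > (A minimal sketch of that flush; PendingEntry and the index update are
> > > stand-ins, not the real EntryLogger/ledger cache code.)
> > >
> > >     import java.io.IOException;
> > >     import java.nio.ByteBuffer;
> > >     import java.nio.channels.FileChannel;
> > >     import java.util.*;
> > >
> > >     class SortedFlush {
> > >         static class PendingEntry {
> > >             final long ledgerId, entryId;
> > >             final ByteBuffer data;
> > >             PendingEntry(long lid, long eid, ByteBuffer d) {
> > >                 ledgerId = lid; entryId = eid; data = d;
> > >             }
> > >         }
> > >
> > >         static void flushSorted(List<PendingEntry> pending,
> > >                                 FileChannel entryLog) throws IOException {
> > >             // Group entries of the same ledger together before writing.
> > >             Collections.sort(pending, new Comparator<PendingEntry>() {
> > >                 public int compare(PendingEntry a, PendingEntry b) {
> > >                     if (a.ledgerId != b.ledgerId) {
> > >                         return a.ledgerId < b.ledgerId ? -1 : 1;
> > >                     }
> > >                     return a.entryId < b.entryId ? -1
> > >                             : (a.entryId > b.entryId ? 1 : 0);
> > >                 }
> > >             });
> > >             for (PendingEntry e : pending) {
> > >                 long offset = entryLog.position(); // entry's new offset
> > >                 entryLog.write(e.data);            // sequential write
> > >                 // ...update ledger index: (ledgerId, entryId) -> offset
> > >             }
> > >             entryLog.force(false); // one sync for the whole batch
> > >         }
> > >     }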
> > >
> > > >
> > > > -Sijie
> > > >
> > > > On Thu, Sep 27, 2012 at 12:52 AM, Aniruddha Laud
> > > > <tr...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Those stats I pasted might be a little misleading as they show the
> > > > > average over a couple of minutes. Whenever there are reads to the
> > > > > ledger disks, the queue size on them is sometimes as high as 100.
> > > > > Also, the CPU utilization has been lower than 10% throughout, and
> > > > > the process will continue to remain I/O bound even if we introduce
> > > > > more threads (as the CPU remains idle while doing I/O).
> > > > >
> > > > > A couple of observations about the write path. We currently have a
> > > > > hard-coded buffer size for journal writes (I believe it's 512KB) and
> > > > > we flush to the disk when this fills up or when there is no entry to
> > > > > process (which is highly unlikely with a high-throughput application
> > > > > running on top). We should make this buffer size configurable,
> > > > > something like the sketch below. Now, with more threads, we can
> > > > > process more packets in parallel and this buffer can be filled up
> > > > > faster. We can have a write threadpool (and we should always keep it
> > > > > lower than the number of processors) to process the add requests.
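> > > > >
> > > > > (A minimal sketch of that knob; "journalWriteBufferSizeKB" is a
> > > > > made-up key and conf stands for the server configuration, so this is
> > > > > an illustration, not the existing code.)
> > > > >
> > > > >     // Default to today's hard-coded 512KB if the key is unset.
> > > > >     int bufKb = conf.getInt("journalWriteBufferSizeKB", 512);
> > > > >     ByteBuffer journalWriteBuf =
> > > > >             ByteBuffer.allocateDirect(bufKb * 1024);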
> > > > >
> > > > > For read requests, a configurable number of worker threads would be
> > > > > ideal, and we could let the user tune it depending on the kind of
> > > > > read patterns they expect (see the sketch below). Given that ledgers
> > > > > are interleaved ATM, I would expect the performance to increase
> > > > > linearly with the number of threads up to a certain point and then
> > > > > level out.
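> > > > >
> > > > > (A minimal sketch; "numReadWorkerThreads" and ReadRequest are
> > > > > made-up names, not the current bookie code.)
> > > > >
> > > > >     // Hand reads off to a fixed-size pool; adds keep going through
> > > > >     // the single journal thread untouched.
> > > > >     final ExecutorService readPool = Executors.newFixedThreadPool(
> > > > >             conf.getInt("numReadWorkerThreads", 4));
> > > > >
> > > > >     void processRead(final ReadRequest req) {
> > > > >         readPool.submit(new Runnable() {
> > > > >             public void run() {
> > > > >                 // read the entry from the entry log / ledger cache
> > > > >                 // and send the response on the request's channel
> > > > >             }
> > > > >         });
> > > > >     }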
> > > > >
> > > > > Another thing we could possibly look at is reordering our writes to
> > > > > the log file to try and maintain locality for ledger entries. This
> > > > > might reduce the number of random seeks we do in case only a small
> > > > > number of ledgers are lagging.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Regards,
> > > > > Aniruddha.
> > > > >
> > > > > On Wed, Sep 26, 2012 at 2:55 AM, Rakesh R <ra...@huawei.com>
> > wrote:
> > > > >
> > > > > > >>>One question: what is multi-ledgers?
> > > > > > Multiple ledger directories (multiple disks).
> > > > > >
> > > > > > >>>CPU utilization might not be largely affected if the threads are
> > > > > > sitting there waiting on IO
> > > > > > Ok, it seems I got it.
> > > > > > If one thread spends most of its time waiting for I/O completion
> > > > > > instead of using the CPU, that does not mean that "we've hit the
> > > > > > system I/O bandwidth limit", so IMHO having multiple threads (or
> > > > > > asynchronous I/O) might improve performance (by enabling more than
> > > > > > one concurrent I/O operation).
> > > > > >
> > > > > > -Rakesh
> > > > > > ________________________________________
> > > > > > From: Flavio Junqueira [fpj@yahoo-inc.com]
> > > > > > Sent: Wednesday, September 26, 2012 2:17 PM
> > > > > > To: bookkeeper-dev@zookeeper.apache.org
> > > > > > Subject: Re: High latencies observed at the bookkeeper client
> while
> > > > > > reading entries
> > > > > >
> > > > > > CPU utilization might not be largely affected if the threads are
> > > > > > sitting there waiting on IO. In my understanding of the proposal
> > > > > > so far, the idea is to have multiple threads only to perform IO.
> > > > > >
> > > > > > One question: what is multi-ledgers?
> > > > > >
> > > > > > -Flavio
> > > > > >
> > > > > >
> > > > > > On Sep 26, 2012, at 7:52 AM, Rakesh R wrote:
> > > > > >
> > > > > > > I'll just add one more point:
> > > > > > >
> > > > > > > Increasing the number of threads can hit the CPU utilization
> > > > > > > too, so we should consider this; it is good to observe whether
> > > > > > > the process is more I/O bound than CPU bound. However, it
> > > > > > > depends in great detail on the disks and how much CPU work other
> > > > > > > threads are doing before they, too, end up waiting on those
> > > > > > > disks.
> > > > > > >
> > > > > > > I'm also thinking in line with Flavio's suggestion to have one
> > > > > > > thread per ledger/journal device. Multithreading can help us
> > > > > > > with I/O-bound problems if the I/O is performed against
> > > > > > > different disks.
> > > > > > >
> > > > > > > From the iostat report (waiting time of the ledger directories),
> > > > > > > it shows we have headroom to more fully utilize the disk
> > > > > > > bandwidth.
> > > > > > >
> > > > > > > multi-ledgers disk usage:
> > > > > > > avgqu-sz
> > > > > > > 1.10
> > > > > > > 0.12
> > > > > > > 0.54
> > > > > > > 0.13
> > > > > > >
> > > > > > > -Rakesh
> > > > > > > ________________________________________
> > > > > > > From: Sijie Guo [guosijie@gmail.com]
> > > > > > > Sent: Wednesday, September 26, 2012 5:58 AM
> > > > > > > To: bookkeeper-dev@zookeeper.apache.org
> > > > > > > Subject: Re: High latencies observed at the bookkeeper client
> > while
> > > > > > reading entries
> > > > > > >
> > > > > > > One more point is that each write/read request to the entry log
> > > > > > > files gets converted into writing/reading an 8K blob of data,
> > > > > > > since you use BufferedChannel. For write requests, a larger
> > > > > > > write size is OK. But read requests are almost random: even if
> > > > > > > you read a larger blob, it might be useless when the next read
> > > > > > > goes somewhere else. Moreover, I don't think we need to maintain
> > > > > > > another fixed-length readBuffer in BufferedChannel; it hardly
> > > > > > > helps for random reads, and we could leverage the OS cache
> > > > > > > instead (see the sketch below).
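> > > > > > >
> > > > > > > (A minimal sketch of what that would look like; entrySize and
> > > > > > > offsetInLog are placeholders.)
> > > > > > >
> > > > > > >     // Positional read straight off the FileChannel (a pread
> > > > > > >     // under the hood): no shared read buffer, so repeated hits
> > > > > > >     // are served by the OS page cache.
> > > > > > >     ByteBuffer dst = ByteBuffer.allocate(entrySize);
> > > > > > >     int n = fileChannel.read(dst, offsetInLog);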
> > > > > > >
> > > > > > > On Wed, Sep 26, 2012 at 8:06 AM, Sijie Guo <guosijie@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> For serving requests, we either queue the requests in the
> > > > > > >> bookie server per channel (write/read are blocking operations),
> > > > > > >> or queue them in the OS kernel and let the block device queue
> > > > > > >> and schedule those I/O requests. I think Stu's point is to
> > > > > > >> leverage the block device's scheduling algorithm by issuing I/O
> > > > > > >> requests from multiple threads to fully utilize the disk
> > > > > > >> bandwidth.
> > > > > > >>
> > > > > > >> From the iostat reports provided by Aniruddha, the average
> > > > > > >> queue length and utilization percentage are not high, which
> > > > > > >> means the disks are idle most of the time. It makes sense to
> > > > > > >> use multiple threads to issue read requests; one write thread
> > > > > > >> and several read threads might work for each device, along the
> > > > > > >> lines of the sketch below.
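> > > > > > >>
> > > > > > >> (A rough sketch of that per-device wiring; the names are made
> > > > > > >> up.)
> > > > > > >>
> > > > > > >>     // One single-thread writer plus a small read pool per disk.
> > > > > > >>     Map<File, ExecutorService> writers =
> > > > > > >>             new HashMap<File, ExecutorService>();
> > > > > > >>     Map<File, ExecutorService> readers =
> > > > > > >>             new HashMap<File, ExecutorService>();
> > > > > > >>     for (File dir : ledgerDirs) {
> > > > > > >>         writers.put(dir, Executors.newSingleThreadExecutor());
> > > > > > >>         readers.put(dir,
> > > > > > >>                 Executors.newFixedThreadPool(readThreadsPerDisk));
> > > > > > >>     }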
> > > > > > >>
> > > > > > >> On Wed, Sep 26, 2012 at 5:06 AM, Flavio Junqueira <
> > > > > > >> fpj@yahoo-inc.com> wrote:
> > > > > > >>
> > > > > > >>> Hi Stu, I'm not sure I understand your point. If with one
> > > > > > >>> thread we are getting pretty high latency (the case Aniruddha
> > > > > > >>> described), doesn't it mean we have a number of requests
> > > > > > >>> queued up? Adding more threads might only make the problem
> > > > > > >>> worse by queueing up even more requests. I'm possibly missing
> > > > > > >>> your point...
> > > > > > >>>
> > > > > > >>> -Flavio
> > > > > > >>>
> > > > > > >>> On Sep 25, 2012, at 9:37 PM, Stu Hood wrote:
> > > > > > >>>
> > > > > > >>>> Separating by device would help, but will not allow the
> > > > > > >>>> devices to be fully utilized: in order to buffer enough io
> > > > > > >>>> commands into a disk's queue for the elevator algorithms to
> > > > > > >>>> kick in, you either need to use multiple threads per disk, or
> > > > > > >>>> native async IO (not trivially available within the JVM).
> > > > > > >>>>
> > > > > > >>>> On Tue, Sep 25, 2012 at 2:23 AM, Flavio Junqueira <
> > > > > > >>>> fpj@yahoo-inc.com> wrote:
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>> On Sep 25, 2012, at 10:55 AM, Aniruddha Laud wrote:
> > > > > > >>>>>
> > > > > > >>>>>> On Tue, Sep 25, 2012 at 1:35 AM, Flavio Junqueira <
> > > > > > >>>>>> fpj@yahoo-inc.com> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> Just to add a couple of comments to the discussion,
> > > > > > >>>>>>> separating reads and writes into different threads should
> > > > > > >>>>>>> only help with queuing latency. It wouldn't help with IO
> > > > > > >>>>>>> latency.
> > > > > > >>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>> Yes, but with the current implementation, publish latencies
> > > > > > >>>>>> in hedwig suffer because of lagging subscribers. By
> > > > > > >>>>>> separating read and write queues, we can at least guarantee
> > > > > > >>>>>> that the write SLA is maintained (a separate journal disk +
> > > > > > >>>>>> a separate thread would ensure that writes are not affected
> > > > > > >>>>>> by read-related seeks).
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>> Agreed, and based on my comment below, I was wondering if it
> > > > > > >>>>> wouldn't be best to separate traffic across threads by
> > > > > > >>>>> device instead of by operation type.
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> Also, it sounds like a good idea to have at least one
> > > > > > >>>>>>> thread per ledger device. In the case of multiple ledger
> > > > > > >>>>>>> devices, if we use one single thread, then the performance
> > > > > > >>>>>>> of the bookie will be driven by the slowest disk, no?
> > > > > > >>>>>>>
> > > > > > >>>>>> yup, makes sense.
> > > > > > >>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> -Flavio
> > > > > > >>>>>>>
> > > > > > >>>>>>> On Sep 25, 2012, at 10:24 AM, Ivan Kelly wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>>> Could you give some information on what those
> > > > > > >>>>>>>>> shortcomings are? Also, do let me know if you need any
> > > > > > >>>>>>>>> more information from our end.
> > > > > > >>>>>>>> Off the top of my head:
> > > > > > >>>>>>>> - reads and writes are handled in the same thread (as you
> > > > > > >>>>>>>>   have observed)
> > > > > > >>>>>>>> - each entry read requires a single RPC
> > > > > > >>>>>>>> - entries are read in parallel
> > > > > > >>>>>>>
> > > > > > >>>>>> By parallel, you mean the BufferedChannel wrapper on top of
> > > > > > >>>>>> FileChannel, right?
> > > > > > >>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Not all of these could result in the high latency you
> > > > > > >>>>>>>> see, but if each entry is being read separately, a sync
> > > > > > >>>>>>>> on the ledger disk in between will make a mess of the
> > > > > > >>>>>>>> disk head scheduling.
> > > > > > >>>>>>>
> > > > > > >>>>>> Increasing the time interval between log file flushes might
> > > > > > >>>>>> possibly help in this case then?
> > > > > > >>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> -Ivan
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>> Thanks for the help :)
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: High latencies observed at the bookkeeper client while reading entries

Posted by Aniruddha Laud <tr...@gmail.com>.
I've created https://issues.apache.org/jira/browse/BOOKKEEPER-429 for
tracking the multiple threads patch without any entry reordering. I'm going
to open another ticket that addresses entry reordering. I'll upload the
patch by EOD today.

Regards,
Aniruddha.
