You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@bookkeeper.apache.org by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org> on 2012/10/13 02:57:02 UTC

[jira] [Created] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Yixue (Andrew) Zhu created BOOKKEEPER-432:
---------------------------------------------

             Summary: Improve performance of entry log range read per ledger entries 
                 Key: BOOKKEEPER-432
                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
             Project: Bookkeeper
          Issue Type: Improvement
          Components: bookkeeper-server
    Affects Versions: 4.1.0
         Environment: Linux
            Reporter: Yixue (Andrew) Zhu


We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.

Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.

Some possible improvements:
1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 

The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.

Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.

2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by yixue zhu <yx...@gmail.com>.
I am thinking about changing it.
I am going to send draft plan to you. It would be great for you to validate
and change notes.
Yixue

On Thu, Oct 18, 2012 at 12:22 AM, Ivan Kelly (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478750#comment-13478750]
>
> Ivan Kelly commented on BOOKKEEPER-432:
> ---------------------------------------
>
> We'd have to change to point at which entries are added to the index cache
> also. It does add some complexity to the change, but it still keeps all the
> changes in the entrylogger for now. Are you working on this at the moment?
> I was thinking of picking this up next week.
>
> > Improve performance of entry log range read per ledger entries
> > ---------------------------------------------------------------
> >
> >                 Key: BOOKKEEPER-432
> >                 URL:
> https://issues.apache.org/jira/browse/BOOKKEEPER-432
> >             Project: Bookkeeper
> >          Issue Type: Improvement
> >          Components: bookkeeper-server
> >    Affects Versions: 4.2.0
> >         Environment: Linux
> >            Reporter: Yixue (Andrew) Zhu
> >              Labels: patch
> >
> > We observed random I/O reads when some subscribers fall behind (on some
> topics), as delivery needs to scan the entry logs (thru ledger index),
> which are interleaved with ledger entries across all ledgers being served.
> > Essentially, the ledger index is a non-clustered index. It is not
> effective when a large number of ledger entries need to be served, which
> tend to be scattered around due to interleaving.
> > Some possible improvements:
> > 1. Change the ledger entries buffer to use a SkipList (or other
> suitable), sorted on (ledger, entry sequence). When the buffer is flushed,
> the entry log is written out in the already-sorted order.
> > The "active" ledger index can point to the entries buffer (SkipList),
> and fixed up with entry-log position once latter is persisted.
> > Or, the ledger index can be just rebuilt on demand. The entry log file
> tail can have index attached (light-weight b-tree, similar with big-table).
> We need to track per ledger which log files contribute entries to it, so
> that in-memory index can be rebuilt from the tails of corresponding log
> files.
> > 2. Use affinity concept to make ensembles of ledgers (belonging to same
> topic) as identical as possible. This will help above 1. be more effective.
> >
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>



-- 
Best Regards,
yixue

[jira] [Comment Edited] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488199#comment-13488199 ] 

Ivan Kelly edited comment on BOOKKEEPER-432 at 10/31/12 8:30 PM:
-----------------------------------------------------------------

Are the intermediatory levels of the btree between the root and the leaves in the btree stored on disk, or is it only the leaves? Do you maintain separate index files per ledger also? Sorry for all the questions, I'm trying to get the full picture in my head :)

{quote}
If you mean Ledger Log by write-ahead-log, yes, we will not read them unless during recovery. I am not proposing caching it.

By entry-log, I mean entry data (messages), which is stored in data blocks (aka pages, chunks). These data blocks can be cached on demand, using LRU replacement policy. 
{quote}
By write ahead log, I mean that the usecase BK, as a whole is designed for, is as a write ahead log, which, by it's nature should be seldom read. I'm not refering to the bookie's own journal. I could imagine one usecases where multiple reads could be necessary (such as hedwig, with many subscribers, consuming at different rates) but these should be handled at a higher level (such as the read ahead cache in hedwig).
                
      was (Author: ikelly):
    Are the intermediatory levels of the btree between the root and the leaves in the btree stored on disk, or is it only the leaves? Do you maintain separate index files per ledger also? Sorry for all the questions, I'm trying to get the full pivture in my head :)

{quote}
If you mean Ledger Log by write-ahead-log, yes, we will not read them unless during recovery. I am not proposing caching it.

By entry-log, I mean entry data (messages), which is stored in data blocks (aka pages, chunks). These data blocks can be cached on demand, using LRU replacement policy. 
{quote}
By write ahead log, I mean that the usecase BK, as a whole is designed for, is as a write ahead log, which, by it's nature should be seldom read. I'm not refering to the bookie's own journal. I could imagine one usecases where multiple reads could be necessary (such as hedwig, with many subscribers, consuming at different rates) but these should be handled at a higher level (such as the read ahead cache in hedwig).
                  
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481279#comment-13481279 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

[~fpj] with 1000 ledgers, flushing can take up a couple of minutes. but even if it took only one second, at 50k entries per second, it would mean reading 50 entries per seek rather than 1.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: PortSkipListLedgerStorage.patch

Updated patch to resolve conflict
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStorage.patch, SkipList.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487959#comment-13487959 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

Thanks Ivan for the review.
The SkipList is for initial staging of caching newly added entries to transform random I/O to sequential, it will not be used afterwards.

The B-tree uses sparse index, is tailored for sequential data block read.

The index file is to track which B-tree to cache per-ledger entry log starting position to start block read with.

To be more concrete, the B-tree data blocks will be cached on demand (using a pool of blocks).  

                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Aniruddha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480386#comment-13480386 ] 

Aniruddha commented on BOOKKEEPER-432:
--------------------------------------

[~ikelly] I had a similar approach in mind to what you have suggested with some auto-tuning. I can do this over the weekend while [~yx3zhu@gmail.com] works on a more formal skip list based design. Is that okay? 
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510216#comment-13510216 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

Review board
https://reviews.apache.org/r/8350/
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490497#comment-13490497 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

Ivan, Flavio,
I have changed the design so that it can be backward compatible.
Yet, it keep the entries sorted in the entry log file.
Let me know your thoughts.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504555#comment-13504555 ] 

Hadoop QA commented on BOOKKEEPER-432:
--------------------------------------

Testing JIRA BOOKKEEPER-432

WARNING: Running test-patch on a dirty local svn workspace

Patch <a href="/jira/secure/attachment/12554292/SkipList.patch">/jira/secure/attachment/12554292/SkipList.patch</a> downloaded at Tue Nov 27 11:50:12 UTC 2012

----------------------------

{color:red}-1{color} Patch failed to apply to head of branch

----------------------------
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, SkipList.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489363#comment-13489363 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

I've posted some results we had benchmarking against hbase's hregion format which you guys might be interested in.
https://github.com/ivankelly/bkvhbase/wiki/Performance-Graphs

{quote}
The per-ledger index file just holds information which btree-file the entries are stored. We could just look up the b-tree using its index block and data blocks, which are all LRU cached
{quote}
So all entries for a single ledger will be stored sequentially on disk in a single file. Instead of storing, in the index file, which btree file contains the ledger entries, couldn't you store the file and the offset to the entries? It would save you having to go to the file index at all.

{quote}
we can enhance HedWig to batch read entries from Bookies.
{quote}
This is something I think the bookkeeper client has needed for quite a while.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488290#comment-13488290 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

Only the leave level of btree are data blocks (for bookkeeper usage, we probably ends up just one level of index block, if sparse index is used).
All data blocks and index blocks are stored on disk.

The Bookie's ReadEntryProcessor uses LedgerStorage::getEntry to retrieve individual entries. The LRU block cache should help next calls from Hedwig to read more entries.

The per-ledger index file just holds information which btree-file the entries are stored. We could just look up the b-tree using its index block and data blocks, which are all LRU cached.

To speed up entry look up, we can maintain in-memory-only per-ledger index cache, which retrieve information from b-tree on demand, with each retrieval caches starting offset of a batch of entries.

Once this is in place, we can enhance HedWig to batch read entries from Bookies, which can handle it well using the design.


 

                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: BookieLedgerStorageProposal.pdf)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487015#comment-13487015 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

It would be good to get a more detailed description of both the btree & skiplist disk formats.
For btree, I'm not sure the approach makes sense. The workload is append only, so with a tree, you're going to be constantly skewing to the right, and having to rebalance.
For skiplist, how many levels of skiplists do you envisage? The index file format you describe is effectively the top level.

                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491539#comment-13491539 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

The new log entry you propose is for the journal? If so, im not sure it's needed,as we already mark the journal after each flush, to allow persisted entries to be cleaned up.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492271#comment-13492271 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

After each flush, we call Journal.LastLogMark.rollLog(), which should have the same effect. We read the last mark during recovery to skip to the point in the journal from which entries have not been persisted. 

In any case, I think this is outside the scope of this jira, as this is bookie recovery and this jira should focus on read performance.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478196#comment-13478196 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

Thanks Ivan,
The index file depends on the log offset of entry log. If we just do part 1 or pool of blocks, the index cache need to be filled/fixed after entry log is sorted (and flushed).
One possibly choice is to chain the flushing of entry block to the filling/correcting index cache, adding some complexity.
Thoughts?


                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal.pdf
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: SkipList.patch)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStorage.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509593#comment-13509593 ] 

Hadoop QA commented on BOOKKEEPER-432:
--------------------------------------

Testing JIRA BOOKKEEPER-432


Patch [PortSkipListLedgerStore.patch|https://issues.apache.org/jira/secure/attachment/12555895/PortSkipListLedgerStore.patch] downloaded at Tue Dec  4 08:18:20 UTC 2012

----------------------------

{color:green}+1 PATCH_APPLIES{color}
{color:green}+1 CLEAN{color}
{color:red}-1 RAW_PATCH_ANALYSIS{color}
.    {color:green}+1{color} the patch does not introduce any @author tags
.    {color:red}-1{color} the patch contains 1 line(s) with tabs
.    {color:green}+1{color} the patch does not introduce any trailing spaces
.    {color:green}+1{color} the patch does not introduce any line longer than 120
.    {color:green}+1{color} the patch does adds/modifies 7 testcase(s)
{color:green}+1 RAT{color}
.    {color:green}+1{color} the patch does not seem to introduce new RAT warnings
{color:red}-1 JAVADOC{color}
.    {color:red}-1{color} the patch seems to introduce 3 new Javadoc warning(s)
{color:green}+1 COMPILE{color}
.    {color:green}+1{color} HEAD compiles
.    {color:green}+1{color} patch compiles
.    {color:green}+1{color} the patch does not seem to introduce new javac warnings
{color:green}+1 FINDBUGS{color}
.    {color:green}+1{color} the patch does not seem to introduce new Findbugs warnings
{color:green}+1 TESTS{color}
.    Tests run: 393
{color:green}+1 DISTRO{color}
.    {color:green}+1{color} distro tarball builds with the patch 

----------------------------
{color:red}*-1 Overall result, please check the reported -1(s)*{color}


The full output of the test-patch run is available at

.   https://builds.apache.org/job/bookkeeper-trunk-precommit-build/73/
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477989#comment-13477989 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

I think 1. can be split into two parts, sorting before flushing, and maintaining skiplists rather than index files. Sorting before flushing should give us a really easy win and shouldn't be very difficult to implement.

For example, lets say we have 1k entries and are writing at 50k entries per second to 1000 ledgers. Lets assume flushing takes 1 second (in reality for 1000 ledgers it'll be more like 2 minutes).

This means the amount of data flushed will be 50MB. As it currently is now, entries will be evenly distributed across this, so for 1000 ledgers, there will be 999KB (999 other 1KB entries) between any 2 entries of the same ledger. This is well outside of the size of the read ahead (128KB), so on average, we have to seek for every single entry. 

By contrast, with sorting, a single seek would get us 50 entries.

I don't think we explicitly need to sort either. We can have a block pool of 50k blocks. Each ledger on the server side has chain of blocks from the pool. When we flush, we flush the blocks from each ledger in order. If we run out of blocks, we flush (without a disk force) a ledger, and return the blocks to the pool.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487351#comment-13487351 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

The B-tree will be read-only once it is written out from snapshotted Skip-List (like bulk import), i.e. we do not need to rebalance it.
SkipList can use ConcurrentSkipListSet to start as simple.
The index file format is on-disk format. We can maintain an in-memory only data structure to speedup per-ledger read initial seek, using current ledger cache format, just that is it cached on demand (from the index file and B-tree/skipList).

Uploaded revised document.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: BookieLedgerStorageProposal.pdf)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal (1).pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480733#comment-13480733 ] 

Flavio Junqueira commented on BOOKKEEPER-432:
---------------------------------------------

I like the idea of using skip lists, it might be particularly useful when Hedwig subscribers are catching up. I also like the idea of sorting before flushing, but I'm not so sure how much it will improve performance. It mostly depends on the number of entries we actually have upon a flush.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: BookieLedgerStorageProposal.pdf)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal.pdf
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu reassigned BOOKKEEPER-432:
---------------------------------------------

    Assignee: Yixue (Andrew) Zhu
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478750#comment-13478750 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

We'd have to change to point at which entries are added to the index cache also. It does add some complexity to the change, but it still keeps all the changes in the entrylogger for now. Are you working on this at the moment? I was thinking of picking this up next week.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491618#comment-13491618 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

The new log entry is for the journal. It is designed to speedup replay - as the entry log file content will be identical, replay can skip replaying entries, if the crash happened before entry log flushed (but accumulated a lot of entries already). The marker is like database checkpoint record, it just reduce the amount of log to replay (since last flush).

We do not need to implement it as first cut. 
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: BookieLedgerStorageProposal.pdf)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508944#comment-13508944 ] 

Hadoop QA commented on BOOKKEEPER-432:
--------------------------------------

Testing JIRA BOOKKEEPER-432

WARNING: Running test-patch on a dirty local svn workspace

Patch <a href="/jira/secure/attachment/12555797/PortSkipListLedgerStorage.patch">/jira/secure/attachment/12555797/PortSkipListLedgerStorage.patch</a> downloaded at Mon Dec  3 18:31:38 UTC 2012

----------------------------

{color:green}+1 PATCH_APPLIES{color}
{color:green}+1 CLEAN{color}
{color:red}-1 RAW_PATCH_ANALYSIS{color}
.    {color:green}+1{color} the patch does not introduce any @author tags
.    {color:red}-1{color} the patch contains 1 line(s) with tabs
.    {color:green}+1{color} the patch does not introduce any trailing spaces
.    {color:green}+1{color} the patch does not introduce any line longer than 120
.    {color:green}+1{color} the patch does adds/modifies 6 testcase(s)
{color:green}+1 RAT{color}
.    {color:green}+1{color} the patch does not seem to introduce new RAT warnings
{color:red}-1 JAVADOC{color}
.    {color:red}-1{color} the patch seems to introduce 3 new Javadoc warning(s)
{color:green}+1 COMPILE{color}
.    {color:green}+1{color} HEAD compiles
.    {color:green}+1{color} patch compiles
.    {color:green}+1{color} the patch does not seem to introduce new javac warnings
{color:red}-1 FINDBUGS{color}
.    {color:red}-1{color} the patch seems to introduce 8 new Findbugs warning(s) in module(s) [bookkeeper-server]
{color:red}-1 TESTS{color}
.    Tests run: 393
.    Tests failed: 2
.    Tests errors: 0

.    The patch failed the following testcases:

.      testLedgerDelete[0](org.apache.bookkeeper.test.LedgerDeleteTest)
.      testLedgerDelete[1](org.apache.bookkeeper.test.LedgerDeleteTest)

{color:green}+1 DISTRO{color}
.    {color:green}+1{color} distro tarball builds with the patch 

----------------------------
{color:red}*-1 Overall result, please check the reported -1(s)*{color}


The full output of the test-patch run is available at

.   https://builds.apache.org/job/bookkeeper-trunk-precommit-build/70/
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487659#comment-13487659 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

Thanks for clarifying Yixue. It would be good if the doc separated the btree and skiplist approaches completely, but I this I understand the overall jist.

Both the designs seem to be motivated by quick access to individual entries. I think this motivation isn't fully correct. Really what we need is quick access to a suffix of entries. 

Lets say a flush takes 60 seconds, for 1000 ledgers, 1KB entries, writing at 50k per second. For an individual ledger, there will be ((60*50k)/1000) = 3000 entries, or roughly 3MB of data. The offset of the block has to written to the index. A 7.2k rpm disk can read sequentially at 100MB/s, and can do about 120 seeks a second. We need one seek to get to the block in any case. Now, if want to get to the precise entry, we need to seek again. Or we can just read the whole block. Which is quicker? I would say they're roughly the same, but one requires much more complicated data structures.

Now, if a flush takes longer, the tradeoff changes, and I have seen flushes taking longer before, but if we change the index from storing the every entry, to just the start of a sequence of entries, flushes should get faster. And even then, if we want to add skips, we should add extra entries in the index rather than in the entrylog, as that is why we have an index. Having an auxiliary index negates the need for complex data structures in the entrylog itself.


                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Sijie Guo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487391#comment-13487391 ] 

Sijie Guo commented on BOOKKEEPER-432:
--------------------------------------

[~yx3zhu@gmail.com] After took a look at your proposal, it was quite similar as a LSM implementation. (If I am not right, please correct me.) If that, why not leverage a mature LSM implementation, like leveldb, which also maintain in-memory data in skip lists, have data snapshotted on disk in a btree-like structures, and compacted data to improve locality of entries.

I did some prototype works on using leveldb last year. I could try to pick up it again to bench its performance using Ivan's scripts.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: BookieLedgerStorageProposal.pdf)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Affects Version/s:     (was: 4.1.0)
                       4.2.0
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: PortSkipListLedgerStore.patch

Address findbugs
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482479#comment-13482479 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

I've added a tool to github which i've been using to bench the ledgerstorage performance. It should be useful when developing this patch.
https://github.com/ivankelly/bkvhbase
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488013#comment-13488013 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

Ah, got it. The skiplist is inmemory, preflush. The btree is on disk post flush. The skiplist makes sense to me, the btree not so much. Trees are O(logn). If the means logn seeks, then the approach aleady loses against an approach that flushes the skiplist directly, and stores the per ledger offsets in the index, which would be O(1).

Also, I don't think caching full blocks makes much sense for BKs usecases. BK provides a writeahead log, which means, in the normal case, the average number of times a block will be read is 0. Rarely, it will be read once. It would be extremely rare that it would be read more than once.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal (1).pdf
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal (1).pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487400#comment-13487400 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

The level-db stack is not suitable for bookkeeper usage, as it is tailored to reduce compaction impact (using levels), in the same time increases write I/O overhead (twice for write-intensive workload).

Here, we do not need frequent compaction to make sure read not going thru multiple levels. We just need compaction if the entry log is mostly empty (i.e. most entries already consumed).


  
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482196#comment-13482196 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

Even without the data write time, for 1000 ledgers, there are 1000 seeks, so the lower bound is 8 seconds (1 seek = 8ms)
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: PortSkipListLedgerStore.patch)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489557#comment-13489557 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

Thanks Flavio, they are good questions.
Yes, we will use more than one skiplist (created on demand), while one is being flushed, another will be filled in. (level-db uses at most two skiplists).

Unlike level-db or HBase, we do not need aggressive compaction to reduce multiple-update-version impact to readers, where union iterator is used. We can disable current minor compaction in bookie.

LRU-data-block-caching should help disk contention among readers w/ different read entry id.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal.pdf
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488131#comment-13488131 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

The B-Tree is block based, and are consecutively stored in the (ledger-id, entry-id) order. The index blocks are at tail of the file, will not interfere with read-ahead. 

If you mean Ledger Log by write-ahead-log, yes, we will not read them unless during recovery. I am not proposing caching it.

By entry-log, I mean entry data (messages), which is stored in data blocks (aka pages, chunks). These data blocks can be cached on demand, using LRU replacement policy. 

We will keep per-ledger offsets in memory, which is cached on demand from B-tree, to speed up locating the first entry, followed by chunk read from consecutive data blocks. 

The idea of using sparse index is that we are tuning for bulk read. 
We could put dense index blocks in the B-tree if performance measurement show otherwise.  

                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489536#comment-13489536 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

Thanks Ivan for the graph. W.r.t. the write throughput, does the graph reflect compaction in HBase? Are the regions hosted in one machine? It reinforces my argument of not using HBase or LevelDb as it is. 

The proposal will not use HBase's compaction or multi-region design, so the write performance cannot be reflected in the graph.

Re - "So all entries for a single ledger will be stored sequentially on disk in a single file."
Not all entries. Some clustered entries store in one file, the next clustered entries in another file.

The idea of not storing index entries offset is that the offset is not finalized until the SkipList is flushed. Using on-demand index entry page cache should get the read-performance on-par.

Of course, we can pin the index entries in-memory, until the SkipList is flushed.
I can go with this approach. Does it address your concern, Ivan?



 

  
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal.pdf

Added staging to the proposal.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480384#comment-13480384 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

I am going to send draft node, including a couple of choices.
I can work on it.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487398#comment-13487398 ] 

Stu Hood commented on BOOKKEEPER-432:
-------------------------------------

[~hustlmsp]: Please do post that code. Seeing a prototype involving an existing LSM tree would be very useful before we go too far toward implementing our own.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489645#comment-13489645 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

{quote}
Thanks Ivan for the graph. W.r.t. the write throughput, does the graph reflect compaction in HBase? Are the regions hosted in one machine? It reinforces my argument of not using HBase or LevelDb as it is. {quote}
Yes, it's single disk even, with compaction. What really kills the throughput though, is that each region flushes to a separate file, so its seeking all over the place. Same will go for leveldb.

{quote}
Re - "So all entries for a single ledger will be stored sequentially on disk in a single file."
Not all entries. Some clustered entries store in one file, the next clustered entries in another file
{quote}
yup, thats what I meant, but articulated badly. All entries for a ledger within one file are sequential, though the ledger is across many files.

{quote}
The idea of not storing index entries offset is that the offset is not finalized until the SkipList is flushed. Using on-demand index entry page cache should get the read-performance on-par.

Of course, we can pin the index entries in-memory, until the SkipList is flushed.
{quote}
Ah, i understand your rationale now. You only need to store one offset per ledger while the skiplist is being flushed. With a million ledgers, this is only 4 megs, and i think it simplifies the design (and removes one seek, though this isnt so important i guess). plus this same index format can then be used with the sorted entrylog aniruddha is implementing in BOOKKEEPER-451.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: BookieLedgerStorageProposal.pdf)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: PortSkipListLedgerStorage.patch)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: PortSkipListLedgerStore.patch
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489523#comment-13489523 ] 

Flavio Junqueira commented on BOOKKEEPER-432:
---------------------------------------------

I'm really glad you guys are working on this and it sounds like it will be a nice addition to the project. I have a few comments about the proposal, and hopefully they haven't been asked before. I skimmed through the comments and haven't seen anything so here they are.

The design says that once a B-tree is created and complete, it will not be updated. I'm wondering if this means that we accumulate entries in a skiplist and once we fill up the buffer with entries, we organize the entries in a B-tree and flush it. If so, I suppose that you're thinking of having two buffers so that we can flush one while filling up the next. In general, my concern here is impacting write throughput. The current design has this nice feature that the throughput is actually limited by the speed of the journal and not the speed of the ledger store (in the absence of reads).

I'm also wondering how compaction will affect overall performance. My current intuition is that it shouldn't be much more expensive than the current compaction we have, but I'm interested in your opinion.   

                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487402#comment-13487402 ] 

Yixue (Andrew) Zhu commented on BOOKKEEPER-432:
-----------------------------------------------

General K/V store implementation (including level-db) deals with update and introduce version/timestamp, which we do not need. 
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal.pdf
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481495#comment-13481495 ] 

Flavio Junqueira commented on BOOKKEEPER-432:
---------------------------------------------

[~ikelly] are you talking about flushing ledger entries or ledger index pages?
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508958#comment-13508958 ] 

Hadoop QA commented on BOOKKEEPER-432:
--------------------------------------

Testing JIRA BOOKKEEPER-432

WARNING: Running test-patch on a dirty local svn workspace

Patch <a href="/jira/secure/attachment/12555803/PortSkipListLedgerStore.patch">/jira/secure/attachment/12555803/PortSkipListLedgerStore.patch</a> downloaded at Mon Dec  3 18:58:51 UTC 2012

----------------------------

{color:green}+1 PATCH_APPLIES{color}
{color:green}+1 CLEAN{color}
{color:red}-1 RAW_PATCH_ANALYSIS{color}
.    {color:green}+1{color} the patch does not introduce any @author tags
.    {color:red}-1{color} the patch contains 1 line(s) with tabs
.    {color:green}+1{color} the patch does not introduce any trailing spaces
.    {color:green}+1{color} the patch does not introduce any line longer than 120
.    {color:green}+1{color} the patch does adds/modifies 7 testcase(s)
{color:green}+1 RAT{color}
.    {color:green}+1{color} the patch does not seem to introduce new RAT warnings
{color:red}-1 JAVADOC{color}
.    {color:red}-1{color} the patch seems to introduce 3 new Javadoc warning(s)
{color:green}+1 COMPILE{color}
.    {color:green}+1{color} HEAD compiles
.    {color:green}+1{color} patch compiles
.    {color:green}+1{color} the patch does not seem to introduce new javac warnings
{color:red}-1 FINDBUGS{color}
.    {color:red}-1{color} the patch seems to introduce 8 new Findbugs warning(s) in module(s) [bookkeeper-server]
{color:green}+1 TESTS{color}
.    Tests run: 393
{color:green}+1 DISTRO{color}
.    {color:green}+1{color} distro tarball builds with the patch 

----------------------------
{color:red}*-1 Overall result, please check the reported -1(s)*{color}


The full output of the test-patch run is available at

.   https://builds.apache.org/job/bookkeeper-trunk-precommit-build/71/
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, PortSkipListLedgerStore.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal.pdf
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488199#comment-13488199 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

Are the intermediatory levels of the btree between the root and the leaves in the btree stored on disk, or is it only the leaves? Do you maintain separate index files per ledger also? Sorry for all the questions, I'm trying to get the full pivture in my head :)

{quote}
If you mean Ledger Log by write-ahead-log, yes, we will not read them unless during recovery. I am not proposing caching it.

By entry-log, I mean entry data (messages), which is stored in data blocks (aka pages, chunks). These data blocks can be cached on demand, using LRU replacement policy. 
{quote}
By write ahead log, I mean that the usecase BK, as a whole is designed for, is as a write ahead log, which, by it's nature should be seldom read. I'm not refering to the bookie's own journal. I could imagine one usecases where multiple reads could be necessary (such as hedwig, with many subscribers, consuming at different rates) but these should be handled at a higher level (such as the read ahead cache in hedwig).
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Sijie Guo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487600#comment-13487600 ] 

Sijie Guo commented on BOOKKEEPER-432:
--------------------------------------

[~stuhood] I would try to generate a patch for it.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment:     (was: BookieLedgerStorageProposal (1).pdf)
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482195#comment-13482195 ] 

Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------

Both, both are flushed in the same operation, LedgerStorage#flush() called from the sync thread. its the index files which are taking the time, as flush time increases as number of ledgers does.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>              Labels: patch
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: SkipList.patch

   Uses skip list to sort entries before adding them to entry log file, to improve ledger read performance. Memory arena is used to allocate skip list entries, to avoid GC impact.
    A single-threaded scheduler is used to flush skip list to buffered entry log file channel, once configured data size limit is reached. Sync thread is notified as well to flush file buffers.
    Compaction uses Skip list, to merge entries together as well as remove duplicate entries.
    This change also fix an existing issue of old entry logs being removed w/o forcing new entry logs flushed.
    
    It is the first cut of BOOKKEEPER-432 implementation.
                
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf, SkipList.patch
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (BOOKKEEPER-432) Improve performance of entry log range read per ledger entries

Posted by "Yixue (Andrew) Zhu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yixue (Andrew) Zhu updated BOOKKEEPER-432:
------------------------------------------

    Attachment: BookieLedgerStorageProposal.pdf
    
> Improve performance of entry log range read per ledger entries 
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-432
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.2.0
>         Environment: Linux
>            Reporter: Yixue (Andrew) Zhu
>            Assignee: Yixue (Andrew) Zhu
>              Labels: patch
>         Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some topics), as delivery needs to scan the entry logs (thru ledger index), which are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective when a large number of ledger entries need to be served, which tend to be scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable), sorted on (ledger, entry sequence). When the buffer is flushed, the entry log is written out in the already-sorted order. 
> The "active" ledger index can point to the entries buffer (SkipList), and fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail can have index attached (light-weight b-tree, similar with big-table). We need to track per ledger which log files contribute entries to it, so that in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same topic) as identical as possible. This will help above 1. be more effective.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira