You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Erik Holstad (JIRA)" <ji...@apache.org> on 2009/04/25 01:19:30 UTC

[jira] Commented: (HBASE-1261) Improvements to flushing, compacting, and splitting

    [ https://issues.apache.org/jira/browse/HBASE-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702616#action_12702616 ] 

Erik Holstad commented on HBASE-1261:
-------------------------------------

wo types of compactions, minor and major.
Tasks that can be performed during a compaction:
1. Sorting the KeyValue from the different file
2. Checking TTL
3. Checking number of versions to keep
4. Removing KeyValues that are effected by deletes
5. Removing the deletes

For a major compaction all of the above is done, so the question is what we
should do for a minor. 

The thing is that when merging the different files into one list we need to sort
the smallest entry in all the files to be able to tell which one to put in first
,which means that we already are doing most of the work. So adding 2,3 and 4
will only add 3 extra checks, but will probably give better performance in the
system in total, but needs to be tested. The deletes can't be removes for a non
major compaction, so they will be kept in the new file, but will only apply to 
older files, which will speed up reading.

Today when doing the merge we open up a scanner for all the n files that we want
to merge and for every value we put into the new file we do n compares, so if
the total number of inserts are m we get something like m*n compares since we
are not using any of the information from the last compare, so we end up redoing
a lot of work. 

I propose a TreeSet or a Heap structure so we get something like m*log(n). 
After that we just need to do some testing to see how much the extra compares
will add and how big the read performance gain will be by enforcing the extra
checks.

> Improvements to flushing, compacting, and splitting
> ---------------------------------------------------
>
>                 Key: HBASE-1261
>                 URL: https://issues.apache.org/jira/browse/HBASE-1261
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: documentation, master, regionserver
>    Affects Versions: 0.19.0
>            Reporter: Evgeny Ryabitskiy
>            Assignee: Evgeny Ryabitskiy
>
> This issue concludes my task for GSoC 2009 
> The idea is to make all futures from Bigtable paper and other ideas in this processes of to flushing, compacting and splitting.
> Some talk in IRC related to this issue creation:
> [22:37] <jgray2> bigtable has an optimization which, when flushing the memcache, will actually write out a new storefile that merges the existing storefile with the memcache
> [22:37] <jgray2> that way you don't have more than one storefile
> [22:37] <jgray2> which is a huge inefficiency

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.