Posted to issues@hbase.apache.org by "nkeywal (JIRA)" <ji...@apache.org> on 2012/11/21 18:09:58 UTC

[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

    [ https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502132#comment-13502132 ] 

nkeywal commented on HBASE-6774:
--------------------------------

After thinking again about this one, here is another possible solution:
- put the memstore state in ZooKeeper
- when we create a new memstore, we asynchronously write the state in ZK (region with empty memstore & region server name)
- when the first put is written in the WAL, we synchronously write to ZK that this region now has a non-empty memstore.
- then the other puts don't need any ZK writes or synchronisation
- on memstore flush, we asynchronously update the state in ZK to empty memstore region.
- on crash, the master checks the region memstore states. If a region is assigned but its memstore is empty, we can reassign the region immediately. If there is no data in ZK, or this data says the memstore is not empty, the master does nothing special: the region waits for the log split, as today.
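
A rough sketch of the flow above, in Python for brevity. A plain dict stands in for ZooKeeper, and every name here (MemstoreStateStore, RegionServer, regions_to_reassign_immediately, MemstoreState) is hypothetical; none of it is HBase or ZooKeeper API. The asynchronous/synchronous distinction is only noted in comments, and the multiple-memstore case and error handling are left out.

```python
from enum import Enum


class MemstoreState(Enum):
    EMPTY = "empty"
    NON_EMPTY = "non_empty"


class MemstoreStateStore:
    """Stand-in for ZooKeeper: region name -> (server name, memstore state)."""

    def __init__(self):
        self.znodes = {}

    def write(self, region, server, state):
        self.znodes[region] = (server, state)

    def read(self, region):
        return self.znodes.get(region)  # None if the znode is missing


class RegionServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.dirty = set()  # regions whose memstore already holds data

    def open_region(self, region):
        # Memstore creation: would be an asynchronous ZK write.
        self.store.write(region, self.name, MemstoreState.EMPTY)

    def put(self, region):
        if region not in self.dirty:
            # First put since creation or flush: would be a synchronous ZK write.
            self.store.write(region, self.name, MemstoreState.NON_EMPTY)
            self.dirty.add(region)
        # Later puts need no ZK write.

    def flush(self, region):
        # Memstore flush: would be an asynchronous ZK write.
        self.dirty.discard(region)
        self.store.write(region, self.name, MemstoreState.EMPTY)


def regions_to_reassign_immediately(store, crashed_server, regions):
    """Master side: only regions known to have an empty memstore skip the log split."""
    ready = []
    for region in regions:
        # A missing znode (None) or a NON_EMPTY state falls through to the normal path.
        if store.read(region) == (crashed_server, MemstoreState.EMPTY):
            ready.append(region)
    return ready
```

In this sketch a region with no znode is treated exactly like a region with a non-empty memstore, which is what makes the ZK content safe to lose.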

This is high level; I obviously need to tune it for the multiple-memstore case and study all the error cases. But it seems doable.

So we would have a maximum of 100K znodes (1 per region) in ZK, with one viewer (the master), and one writer (the region server).
These objects would be written on memstore creation & flush, so not very often.
If we don't have the znode in ZK, we split as today. We could lose the whole ZK data without any impact.
This can be made optional, and maybe even activated per table: it could be activated only for reference tables and meta, while heavily written tables would not use it. This lowers the number of znodes to write into ZK.
Region servers are already connected to zookeeper, we don't add any ZK connection.

Pros:
- does the job: regions that were not written to will be reassigned immediately
- adds a safety net if we can't split the logs: tables that were not written to can be made available immediately
- optional, and configurable per table
- should not decrease write performance; only the first put is impacted (by about 10-15ms). With a block size of 128 MB or more, it's acceptable imho.
- doesn't add workload (read or write) on HDFS
- no dependency on ZK content: we continue to work if the ZK content 'disappears'.

Cons:
- adds workload on ZooKeeper: but it's configurable per table, so we can limit it to whatever we want. We can even imagine heuristics (for example, wait before creating the znode, and don't create it if a put occurs within 10 seconds)
- as always, any new feature adds complexity to the whole thing... Could nearly be done with coprocessors (likely not the master part however).
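
The delay heuristic mentioned in the cons could look something like the following. This is only an illustration: the function name, its parameters, and the use of plain timestamps in seconds are assumptions, and the 10-second default simply mirrors the example value above.

```python
def should_create_empty_znode(opened_at, first_put_at, now, delay_seconds=10.0):
    """Decide whether the 'empty memstore' znode should exist for a region.

    opened_at: time the memstore was created (seconds)
    first_put_at: time of the first put, or None if no put happened yet
    now: current time (seconds)
    """
    if first_put_at is not None and first_put_at - opened_at < delay_seconds:
        # Written to early: treat the region as write-heavy and skip ZK entirely.
        return False
    # Otherwise create the znode once the region has stayed quiet long enough.
    return now - opened_at >= delay_seconds
```

With this rule, a region that receives a put shortly after opening never costs a ZK write, at the price of always going through the normal log split on a crash.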

                
> Immediate assignment of regions that don't have entries in HLog
> ---------------------------------------------------------------
>
>                 Key: HBASE-6774
>                 URL: https://issues.apache.org/jira/browse/HBASE-6774
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>
> The algo is today, after a failure detection:
> - split the logs
> - when all the logs are split, assign the regions
> But some regions can have no entries at all in the HLog. There are many reasons for this:
> - kind of reference or historical tables. Bulk written sometimes then read only.
> - sequential rowkeys. In this case, most of the regions will be read only. But they can be in a regionserver with a lot of writes.
> - tables flushed often for safety reasons. I'm thinking about meta here.
> For meta, we can imagine flushing very often. Hence, the recovery time for meta will, in many cases, be the failure detection time.
> There are different possible algos:
> Option 1)
>  A new task is added, in parallel with the split. This task reads all the HLog. If there is no entry for a region, this region is assigned.
>  Pro: simple
>  Cons: We will need to read all the files. Adds a read.
> Option 2)
>  The master writes in ZK the number of log files, per region.
>  When the regionserver starts the split, it reads the full block (64M) and decreases the log file counter of the region. If it reaches 0, the assignment starts. At the end of its split, the region server decreases the counter as well. This allows the assignment to start even if not all the HLogs are finished. It would make some regions available even if we have an issue in one of the log files.
>  Pro: parallel
>  Cons: adds work for the region server. Requires reading the whole file before starting to write.
> Option 3)
>  Add some metadata at the end of the log file. The last log file won't have metadata: if we are recovering, it's because the server crashed. But the others will. And the last log file should be smaller (half a block on average).
> Option 4) Still some metadata, but in a different file. Cons: writes are increased (but not that much, we just need to write the region once). Pros: if we lose the HLog files (major failure, no replica available) we can still continue with the regions that were not written at this stage.
> I think it should be done, even if none of the algorithms above is totally convincing yet. It's linked as well to locality and short circuit reads: with these two points, reading the file twice becomes much less of an issue, for example. My current preference would be to open the file twice in the region server, once for splitting as of today, once for a quick read looking for unused regions. Who knows, maybe it would even be faster this way: the quick read thread would warm up the different caches for the splitting thread.
