Posted to dev@kafka.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2012/07/18 20:03:34 UTC

[jira] [Commented] (KAFKA-188) Support multiple data directories

    [ https://issues.apache.org/jira/browse/KAFKA-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417322#comment-13417322 ] 

Jay Kreps commented on KAFKA-188:
---------------------------------

Here are a few additional thoughts on this:
1. This is actually a lot more valuable after Kafka 0.8 is out, since replication will already be available at a higher level and RAID will be less desirable. The patch should definitely be on the 0.8 branch, though it will likely be the same for 0.7.
2. It is worth deciding if we want to support unbalanced disk sizes and speeds. E.g. if you have a 7.2k RPM drive and a 10k RPM drive, will we allow you to balance over these? I recommend we skip this for now; we can always do it later. So, as with RAID, we will treat all drives equally.
3. I think the only change will be in LogManager. Instead of a single config.logDir parameter we will need logDirs, an array of directories. In createLog() we will need a policy that chooses the best disk on which to place the new log-partition.
4. There may be a few other places that assume a single log directory; we may have to grep around and check for that. I don't think there is too much else, as everything else should interact through LogManager, and once the Log instance is created it doesn't care where its home directory is.
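The LogManager change in points 3-4 could look roughly like the following sketch. All names here (LogManagerSketch, PlacementPolicy, chooseDir) are illustrative, not the actual Kafka code:

```java
import java.io.File;
import java.util.List;

// Hypothetical placement policy: given the configured directories,
// choose one for a new log-partition.
interface PlacementPolicy {
    File chooseDir(List<File> logDirs);
}

class LogManagerSketch {
    private final List<File> logDirs;   // was: a single config.logDir
    private final PlacementPolicy policy;

    LogManagerSketch(List<File> logDirs, PlacementPolicy policy) {
        this.logDirs = logDirs;
        this.policy = policy;
    }

    // createLog() places the new log-partition in whichever directory
    // the policy selects; everything downstream just sees the File.
    File createLog(String topic, int partition) {
        return new File(policy.chooseDir(logDirs), topic + "-" + partition);
    }
}
```

The policy interface keeps the "which disk" decision in one place, so swapping least-loaded for round-robin is a one-line config change rather than a LogManager rewrite.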

One approach to placement would be to always create new logs on the "least loaded" directory. The definition of "least loaded" is likely to be a heuristic. There are two things we want to balance: (1) data size, and (2) i/o throughput. If the retention policy is based on time, then size is a good proxy for throughput. However, you could imagine having one log with a very small retention size but very high throughput. Another problem is that usage may change over time, and migration is not feasible. For example, a new feature going through a ramped rollout might produce almost no data at first and then later produce gobs of data. Furthermore, you might get odd results in the case where you manually pre-create many topics all at once, as they would all end up on whichever directory had the least data.
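A minimal sketch of the least-loaded heuristic, using total data size per directory as the proxy for load (the directory-size map and method names are hypothetical; in practice the sizes would come from walking each log directory):

```java
import java.util.Map;

class LeastLoadedPolicy {
    // Pick the directory with the smallest total log size. As noted
    // above, size is only a proxy for i/o load and can mislead when a
    // small-retention log has very high throughput.
    static String pickLeastLoaded(Map<String, Long> dirSizesBytes) {
        return dirSizesBytes.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("no data directories configured"));
    }
}
```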

I think a better strategy would be to not try to estimate the least-loaded partition and instead just do round-robin assignment (e.g. logDirs(counter.getAndIncrement() % logDirs.length)). The assumption would be that the number of partitions is large enough that each topic has one partition on each disk.
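The round-robin one-liner above expands to something like this sketch (class name hypothetical); floorMod guards against the counter eventually overflowing to a negative value:

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinPolicy {
    private final AtomicInteger counter = new AtomicInteger(0);

    // AtomicInteger makes concurrent createLog() calls safe; floorMod
    // (rather than %) keeps the index non-negative after int overflow.
    File chooseDir(List<File> logDirs) {
        return logDirs.get(Math.floorMod(counter.getAndIncrement(), logDirs.size()));
    }
}
```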

Either of these strategies has a corner case if all data goes to a single topic (or one topic dominates the load distribution) and that topic has (say) 5 local partitions spread over 4 data directories. In this case one directory will get 2x the load of the others. However, this corner case can be worked around by carefully aligning the partition count with the number of data directories, so I don't think we need to handle it here.
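The arithmetic behind that corner case can be checked directly: placing 5 partitions of one dominant topic round-robin across 4 directories leaves one directory holding two of them.

```java
class SkewDemo {
    // Counts how many partitions of a single topic each directory
    // receives under plain round-robin assignment.
    static int[] partitionsPerDir(int numPartitions, int numDirs) {
        int[] perDir = new int[numDirs];
        for (int p = 0; p < numPartitions; p++) {
            perDir[p % numDirs]++;
        }
        return perDir;
    }
}
```

With 5 partitions and 4 directories the counts come out {2, 1, 1, 1}: directory 0 carries double the load, which disappears if the partition count is a multiple of the directory count.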
                
> Support multiple data directories
> ---------------------------------
>
>                 Key: KAFKA-188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-188
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>
> Currently we allow only a single data directory. This means that a multi-disk configuration needs to be a RAID array or LVM volume or something like that to be mounted as a single directory.
> For a high-throughput low-reliability configuration this would mean RAID0 striping. Common wisdom in Hadoop land has it that a JBOD setup, which just mounts each disk as a separate directory and does application-level balancing over these, yields about a 30% write improvement. For example see this claim here:
>   http://old.nabble.com/Re%3A-RAID-vs.-JBOD-p21466110.html
> It is not clear to me why this would be the case--it seems the RAID controller should be able to balance writes as well as the application so it may depend on the details of the setup.
> Nonetheless this would be really easy to implement: all you need to do is add multiple data directories and balance partition creation over these disks.
> One problem this might cause is that if a particular topic is much larger than the others it might unbalance the load across the disks. The partition->disk assignment policy should probably attempt to spread each topic evenly to avoid this, rather than just trying to keep the number of partitions balanced between disks.
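The topic-aware spreading suggested in that last point could be sketched as follows (all names hypothetical, not actual Kafka code): when placing a new partition, prefer the directory holding the fewest partitions of that same topic, breaking ties by the fewest partitions overall.

```java
import java.util.List;
import java.util.Map;

class TopicAwarePlacement {
    // perDirTopicCounts.get(d) maps topic -> number of that topic's
    // partitions already living in directory d.
    static int chooseDir(List<Map<String, Integer>> perDirTopicCounts, String topic) {
        int best = 0;
        for (int d = 1; d < perDirTopicCounts.size(); d++) {
            int dTopic = perDirTopicCounts.get(d).getOrDefault(topic, 0);
            int bTopic = perDirTopicCounts.get(best).getOrDefault(topic, 0);
            // Primary key: fewest partitions of this topic (spreads each
            // topic evenly). Tie-breaker: fewest partitions overall.
            if (dTopic < bTopic
                    || (dTopic == bTopic && total(perDirTopicCounts.get(d)) < total(perDirTopicCounts.get(best)))) {
                best = d;
            }
        }
        return best;
    }

    private static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) sum += c;
        return sum;
    }
}
```

This keeps a dominant topic's partitions spread across all disks even when the overall partition counts per disk are already balanced.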

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira