Posted to user@hbase.apache.org by "Marcelo Valle (BLOOMBERG/ LONDON)" <mv...@bloomberg.net> on 2015/04/07 12:56:35 UTC

pre-splitting or not, that's the question

Hello, 

I am still taking my first steps with HBase; I used Cassandra for a while before.

For several years, I thought that storing data in Cassandra ordered across nodes was something evil, as its OrderedPartitioner is neither supported nor recommended in production.

In the HBase/Hadoop world, this is the default though. When trying to optimize for writes, I was told people usually pre-split tables in HBase, sometimes using salted keys. This seems to make HBase behave like Cassandra's random partitioner, losing data order across nodes (because of the salting) but gaining better write throughput.
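To make the trade-off concrete, here is a minimal sketch of what I understand by salting. It is plain Python and does not use the HBase client at all; the names salt_key, scan_range and NUM_BUCKETS, the "NN-" prefix format and the choice of MD5 are my own assumptions for illustration, not anything HBase prescribes. It shows that writes get spread over a fixed number of buckets, and that a range read then needs one scan per bucket plus a merge.

```python
import hashlib

NUM_BUCKETS = 4  # hypothetical: one bucket per pre-split region


def salt_key(row_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the row key with a bucket derived from a stable hash of the key."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{row_key}"


def scan_range(store: dict, start: str, stop: str, buckets: int = NUM_BUCKETS) -> list:
    """Emulate a range scan [start, stop) over salted keys.

    Because the salt destroys global ordering, the range has to be
    scanned once per bucket and the partial results merged afterwards.
    """
    hits = []
    for b in range(buckets):
        prefix = f"{b:02d}-"
        lo, hi = prefix + start, prefix + stop
        # sorted(store) stands in for the sorted key space of the table
        hits.extend(k[len(prefix):] for k in sorted(store) if lo <= k < hi)
    return sorted(hits)


# Sequential keys (e.g. time-ordered events) end up scattered over the buckets.
keys = [f"event-{i:04d}" for i in range(20)]
store = {salt_key(k): None for k in keys}

for salted in sorted(store):
    print(salted)

print(scan_range(store, "event-0005", "event-0010"))
```

Without the salt, the same range would be a single contiguous slice of the sorted key space; with it, every range read fans out to all buckets.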

Because of these differences, I started to question what the real advantage of having ordered data across nodes is. For most applications, wouldn't pre-splitting be better? For a large number of applications, designing data without relying on order across nodes seems better, as 1 - it might be possible and 2 - when it's not possible, you can either use another table as an index or index the data in Solr/ES/Lucene and read from there in more complex scenarios. Maybe ordering has some advantage in specific cases where you want low latency between the time you write data and the time you read it, and you read much more than you write, maybe...

Since acting as a sorted map was a core design decision of HBase, I think there must be reasons behind this decision, and it seems I am not able to figure them out... Could you please point them out?

I am asking this to improve my architectural understanding of HBase, as I might be getting the wrong impression that there is no advantage in using a post-splitting solution, when maybe it's just my lack of knowledge of the technology.

Best regards,
Marcelo.