Posted to user@hbase.apache.org by Aleks Laz <al...@none.at> on 2014/12/01 15:41:21 UTC

Re: Newbie Question about 37TB binary storage on HBase

Dear Michael.

Am 29-11-2014 23:49, schrieb Michael Segel:
> Guys, KISS.
> 
> You can use a sequence file to store the images since the images are 
> static.

Sorry, but what do you mean by this sentence?

> Use HBase to index the images.
> 
> If you want… you could use ES or SOLR to take the HBase index and put
> it in memory.

This statement is related to the log issue, isn't it?

> Thrift/Stargate HBase API? Really?  Sorry unless its vastly improved
> over the years… ice.

OK. What's your suggestion for talking to hadoop/HBase from non-Java 
programs?

> Note this simple pattern works really well in the IoT scheme of things.
> 
> Also… depending on the index(es),
> Going SOLR and Sequence file may actually yield better i/o performance
> and  scale better.

Can you please explain this a little bit more? Thank you.

BR
Aleks

> On Nov 28, 2014, at 5:37 PM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
> 
>> Hi,
>> 
>> On Fri, Nov 28, 2014 at 5:08 AM, Wilm Schumacher 
>> <wilm.schumacher@cawoom.com
>>> wrote:
>> 
>>> Hi Otis,
>>> 
>>> thx for the interesting insight. This is very interesting. I never 
>>> had
>>> ES really on scale. But we plan to do that, with hbase as primary db 
>>> (of
>>> course ;) ).
>>> 
>>> I just had the opinion that ES and hbase would scale side by side.
>>> 
>> 
>> Sure, they can both be *scaled*, but in *our* use case HBase was more
>> efficient with the same amount of data and same hardware.  It could 
>> handle
>> the same volume of data with the same hardware better (lower CPU, GC, 
>> etc.)
>> than ES.  Please note that this was our use case.  I'm not saying it's
>> universally true.  The two tools have different features, do different 
>> sort
>> of work under the hood, so this difference makes sense to me.
>> 
>> 
>>> Could you please give us some details on what you mean by "more 
>>> scalable"?
>>> 
>> 
>> Please see above.
>> 
>> 
>>> What was the ES backend?
>>> 
>> 
>> We used it to store metrics from SPM <http://sematext.com/spm/>.  We 
>> use
>> HBase for that in SPM Cloud version, but we don't use HBase in the On
>> Premises version of SPM due to the operational complexity of HBase.
>> 
>> Otis
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>> 
>> 
>> 
>> 
>>> Am 28.11.2014 um 06:37 schrieb Otis Gospodnetic:
>>>> Hi,
>>>> 
>>>> There was a mention of Elasticsearch here that caught my attention.
>>>> We use both HBase and Elasticsearch at Sematext.  SPM
>>>> <http://sematext.com/spm/>, which monitors things like Hadoop, 
>>>> Spark,
>>> etc.
>>>> etc. including HBase and ES, can actually use either HBase or
>>> Elasticsearch
>>>> as the data store.  We experimented with both, and a few-years-old
>>>> version of HBase was more scalable than the latest ES, at least in
>>>> our use case.
>>>> 
>>>> Otis
>>>> --
>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log 
>>>> Management
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>> 
>>>> 
>>>> On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <al...@none.at> 
>>>> wrote:
>>>> 
>>>>> Dear wilm and ted.
>>>>> 
>>>>> Thanks for your input and ideas.
>>>>> 
>>>>> I will now step back and learn more about big data and big storage 
>>>>> to
>>>>> be able to talk further.
>>>>> 
>>>>> Cheers Aleks
>>>>> 
>>>>> Am 28-11-2014 01:20, schrieb Wilm Schumacher:
>>>>> 
>>>>> Am 28.11.2014 um 00:32 schrieb Aleks Laz:
>>>>>> 
>>>>>>> What's the plan about the "MOB-extension"?
>>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/HBASE-11339
>>>>>> 
>>>>>> From development point of view I can build HBase with the
>>> "MOB-extension"
>>>>>>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, 
>>>>>>> ...)
>>> is
>>>>>>> much
>>>>>>> easier to maintain.
>>>>>>> 
>>>>>> that's true :/
>>>>>> 
>>>>>> We need to make some "accesslog" analyzing like piwik or awffull.
>>>>>>> 
>>>>>> I see. Well, this is of course possible, too.
>>>>>> 
>>>>>> Maybe elasticsearch is a better tool for that?
>>>>>>> 
>>>>>> I used elastic search for full text search. Works veeery well :D. 
>>>>>> Loved
>>>>>> it. But I never used it as primary database. And I wouldn't see an
>>>>>> advantage for using ES here.
>>>>>> 
>>>>>> As far as I have understood hadoop client see a 'Filesystem' with 
>>>>>> 37
>>> TB
>>>>>>> or
>>>>>>> 120 TB but from the server point of view how should I plan the
>>>>>>> storage/server
>>>>>>> setup for the datanodes.
>>>>>>> 
>>>>>> now I get your question. If you have a replication factor of 3 (so
>>>>>> every piece of data is held three times by the cluster), then the
>>>>>> aggregated storage has to be at least 3 times the 120 TB (+ buffer +
>>>>>> operating system etc.). So you could use 360 nodes with 1 TB each,
>>>>>> or 3 nodes with 120 TB each.
>>>>>> 
>>>>>> What happen when a datanode have 20TB but the whole hadoop/HBase 2
>>> node
>>>>>>> cluster have 40?
>>>>>>> 
>>>>>> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
>>>>>> distributes the data over the nodes.
>>>>>> 
>>>>>> ?! why "40 million rows", do you mean the file tables?
>>>>>>> In the DB is only some Data like, User account, id for a 
>>>>>>> directory and
>>>>>>> so on.
>>>>>>> 
>>>>>> If you use hbase as primary storage, every file would be a row. Think
>>>>>> of a "blob" in RDBMS. 40 million files => 40 million rows.
>>>>>> 
>>>>>> Assume you create an access log for the 40 million files, and assume
>>>>>> every file is accessed 100 times and every access is a row in another
>>>>>> "access log" table => 4 billion rows ;).
>>>>>> 
>>>>>> Currently, yes php is the main language.
>>>>>>> I don't know a good solution for php similar like hadoop, anyone 
>>>>>>> else
>>>>>>> know one?
>>>>>>> 
>>>>>> well, the basic stuff could be done by thrift/rest with a native 
>>>>>> php
>>>>>> binding. It depends on what you are trying to do. If it's just 
>>>>>> CRUD and
>>>>>> some scanning and filtering, thrift/rest should be enough. But as 
>>>>>> you
>>>>>> said ... who knows what the future brings. If you want to do the 
>>>>>> fancy
>>>>>> stuff, you should use java and deliver the data to your php
>>> application-
>>>>>> 
>>>>>> Just for completeness: There is HiveQL, too. This is kind of "SQL 
>>>>>> for
>>>>>> hadoop". There is a hive client for php (as it is delivered by 
>>>>>> thrift)
>>>>>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>>>>> 
>>>>>> Another fitting option for your access log could be cassandra.
>>> Cassandra
>>>>>> is good at write performance, thus it is used for logging. 
>>>>>> Cassandra
>>> has
>>>>>> a "sql like" language, called cql. This works from php almost like 
>>>>>> a
>>>>>> normal RDBMS. Prepared statements and all this stuff.
>>>>>> 
>>>>>> But I think this is done the wrong way around. You should select a
>>>>>> technology and then choose the language/interfaces etc. And if you
>>>>>> choose hbase, and java is a good choice, and you use nginx and php 
>>>>>> is a
>>>>>> good choice, the only task is to deliver data from A to B and 
>>>>>> back.
>>>>>> 
>>>>>> Best wishes,
>>>>>> 
>>>>>> Wilm
>>>>>> 
>>>>> 
>>>> 
>>> 

Re: Newbie Question about 37TB binary storage on HBase

Posted by Michael Segel <mi...@hotmail.com>.
You receive images; you can store the images in sequence files. (Since HDFS is a WORM file system, you will have to do some work here: land the individual images in a folder on HDFS, sweep them into a single sequence file, and then use HBase to track the location of each image, i.e. an index over the series of sequence files.) Once you build a sequence file, you won’t be touching it unless you’re doing some file maintenance and want to combine images into larger sequence files, or sort them into new ones.
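To make the sweep concrete, here is a rough sketch of that write path (an illustration only, not anyone's production code): it appends each incoming image to one SequenceFile on HDFS and records the file path plus record offset in an HBase table. The table name "image_index", the column family "loc", and all paths are made-up examples, and it uses the newer HBase 1.x client API rather than the one current when this thread was written.

    // Sketch: sweep a local folder of images into one HDFS SequenceFile and
    // index each image's (sequence file, offset) in HBase. Table and paths
    // are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    import java.io.File;
    import java.nio.file.Files;

    public class ImageSweep {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path seqPath = new Path("hdfs:///images/batch-00001.seq");

        try (Connection hbase = ConnectionFactory.createConnection(conf);
             Table index = hbase.getTable(TableName.valueOf("image_index"));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                 SequenceFile.Writer.file(seqPath),
                 SequenceFile.Writer.keyClass(Text.class),
                 SequenceFile.Writer.valueClass(BytesWritable.class))) {

          // Assumes /incoming/images exists and holds the files to be swept.
          for (File f : new File("/incoming/images").listFiles()) {
            long offset = writer.getLength();   // position the reader will seek to
            byte[] bytes = Files.readAllBytes(f.toPath());
            writer.append(new Text(f.getName()), new BytesWritable(bytes));

            // One HBase row per image: which sequence file, and at what offset.
            Put put = new Put(Bytes.toBytes(f.getName()));
            put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("file"),
                          Bytes.toBytes(seqPath.toString()));
            put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("offset"),
                          Bytes.toBytes(offset));
            index.put(put);
          }
        }
      }
    }

The offset captured before each append is a valid seek target for SequenceFile.Reader later, which is what makes the cheap point lookups described below possible.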

So now, if you want a specific image, you perform a lookup in HBase to find the URL of the sequence file and the offset into it, then read the image from that position.
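
A matching read-path sketch, under the same assumptions as above (hypothetical table "image_index", family "loc"): get the location row from HBase, open the sequence file, seek to the stored offset, and read exactly one record.

    // Sketch: look up where an image lives, then seek straight to it.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ImageLookup {
      public static byte[] fetch(String imageId) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection hbase = ConnectionFactory.createConnection(conf);
             Table index = hbase.getTable(TableName.valueOf("image_index"))) {

          // 1. HBase lookup: which sequence file, and at what offset?
          Result row = index.get(new Get(Bytes.toBytes(imageId)));
          String file = Bytes.toString(
              row.getValue(Bytes.toBytes("loc"), Bytes.toBytes("file")));
          long offset = Bytes.toLong(
              row.getValue(Bytes.toBytes("loc"), Bytes.toBytes("offset")));

          // 2. Seek into the sequence file and read the single record.
          try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                   SequenceFile.Reader.file(new Path(file)))) {
            reader.seek(offset);
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            reader.next(key, value);
            return value.copyBytes();
          }
        }
      }
    }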
If you want to add more to the index (user, timestamp, image metadata, etc.) you will end up with multiple indexes. Here you could then use an in-memory index (SOLR), which will let you combine attributes to determine the image or set of images you want to retrieve.
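
As a hedged illustration of that attribute search: a small SolrJ sketch in which the collection name "images" and the fields user, ts and rowkey are all assumptions, and the HttpSolrClient shown comes from a SolrJ release newer than this thread. Solr narrows the query down to rowkeys; the HBase/sequence-file lookup above then fetches the actual bytes.

    // Sketch: combine attributes in Solr, get back HBase rowkeys.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class ImageSearch {
      public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/images").build()) {
          SolrQuery q = new SolrQuery("user:alice AND ts:[2014-11-01T00:00:00Z TO *]");
          q.setFields("rowkey");   // we only need the pointer back into HBase
          for (SolrDocument doc : solr.query(q).getResults()) {
            String rowkey = (String) doc.getFieldValue("rowkey");
            // feed rowkey into the HBase lookup from the previous sketch
            System.out.println(rowkey);
          }
        }
      }
    }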

The downside is that you can’t, out of the box, persist the SOLR index in HBase… (although I may be somewhat dated here).

To be honest, we looked at Stargate… pre-0.89, around the 0.23 release… It was a mess, so we never looked back. And of course the client was/is a Java shop, so Java is the first choice.


Just my $0.02.
