Posted to user@hbase.apache.org by Aleks Laz <al...@none.at> on 2014/11/27 22:27:47 UTC

Newbie Question about 37TB binary storage on HBase

Dear All.

Hi Wilm ;-)

I have started this question on hadoop-user list.

https://mail-archives.apache.org/mod_mbox/hadoop-user/201411.mbox/%3C0dacebda87d76ce0b72f7c53f02464cb@none.at%3E

I hope you can help me.

We have been collecting a lot of binary data (JPEGs) since ~2012.
The size per file is currently ~1-5 MB, but this could change.

There are more than 41,055,670 files (the count is still running) in
~680 <ID> dirs, with the following storage hierarchy:

                      <YEAR>/<MONTH>/<DAY>
<MOUNT_ROOT>/cams/<ID>/2014/11/19/

The binary data sit in the directories below <DAY>, ~1000 files per
directory, on an xfs mount.

The pictures are more or less static.
Means: once saved to disk the images are seldom or never changed.

Because the platform is now growing, we need to create a more
scalable setup.

I haven't read much about HBase yet, because I hadn't seen it as an
option. Please accept my apologies for this; I will now start to dig
deeper into HBase.

Since there are more experienced Hadoop, HDFS and HBase users on this
list than me, I hope you can answer some basic questions.

Our application is an nginx/php-fpm/postgresql setup.
The target design is nginx + proxy features / php-fpm / $DB / $Storage.

.) Can I mix HDFS /HBase for binary data storage and data analyzing?

.) What is the preferred way to use HBase with PHP?
.) How difficult is it to use HBase with PHP?

.) What's a good solution for distributing the 37 TB, or the upcoming
~120 TB?
    [ ] N servers with one 37 TB mountpoint per server?
    [ ] N servers with x TB mountpoints per server?
    [ ] other:

.) Is HBase a good value for $Storage?
.) Is HBase a good value for $DB?
     DB-Size is smaller than 1 GB; I would use HBase just for the HA features
     of Hadoop.

.) Due to the fact that HBase is a file-system I could use
       /cams , for binary data
       /DB   , for DB storage
       /logs , for log storage
     but is this wise? On 'disk' they are on different RAIDs.

.) Should I plan a dedicated network+card for the 'cluster
    communication', as for most other cluster software?
    From what I have read it does not look necessary, but from a security
    point of view, yes.

.) Could the communication with the components (hadoop, zk, ...) perhaps
    be set up with TLS?

Thank you very much for reading this mail all the way to this line ;-)

Thanks also for any feedback, which is very welcome and appreciated.

Best Regards
Aleks

Re: Newbie Question about 37TB binary storage on HBase

Posted by Ted Yu <yu...@gmail.com>.
For MOB, please take a look at HBASE-11339

Cheers

On Nov 27, 2014, at 3:32 PM, Aleks Laz <al...@none.at> wrote:

> Hi Wilm.
> 
> Am 27-11-2014 23:41, schrieb Wilm Schumacher:
>> Hi Aleks ;),
>> Am 27.11.2014 um 22:27 schrieb Aleks Laz:
>>> Our application is a nginx/php-fpm/postgresql Setup.
>>> The target design is nginx + proxy features / php-fpm / $DB / $Storage.
>>> .) Can I mix HDFS /HBase for binary data storage and data analyzing?
>> yes. hbase is perfect for that. For storage it will work (with the
>> "MOB-extension") and with map reduce you can do whatever data analyzing
>> you want. I assume you do some image processing with the data?!?!
> 
> What's the plan about the "MOB-extension"?
> 
> From development point of view I can build HBase with the "MOB-extension"
> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...) is much
> easier to maintain.
> 
> Currently there are no plans to analyse the images, but who knows what the
> future brings.
> 
> We need to make some "accesslog" analyzing like piwik or awffull.
> Maybe elasticsearch is a better tool for that?
> 
>>> .) What is the preferred way to us HBase  with PHP?
>> The native client lib is in java. This is the best way to go. But if you
>> need only basic access from the php application, then thrift or rest
>> would be a good choice.
>> http://wiki.apache.org/hadoop/Hbase/ThriftApi
>> http://wiki.apache.org/hadoop/Hbase/Stargate
> 
> Stargate is a cool name ;-)
> 
>> There are language bindings for both
>>> .) How difficult is it to use HBase with PHP?
>> Depending on what you are trying to do. If you just do a little
>> fetching, updating, inserting etc. it's pretty easy. More complicate
>> stuff I would do in java and expose it by a custom api by a java service.
>>> .) What's a good solution for the 37 TB or the upcoming ~120 TB to
>>> distribute?
>>>   [ ] N Servers with 1 37 TB mountpoints per server?
>>>   [ ] N Servers with x TB mountpoints pers server?
>>>   [ ] other:
>> that's "not your business". hbase/hadoop does the trick for you. hbase
>> distributes the data, replicates it etc.. You will only talk to the master.
> 
> Well but at the end of the day I will need a physical storage distributed over
> x servers.
> 
> My question is do I need to care that all servers have enough storage for the
> whole data?
> 
> As far as I have understood hadoop client see a 'Filesystem' with 37 TB or
> 120 TB but from the server point of view how should I plan the storage/server
> setup for the datanodes.
> 
> As from the link below hadoophbase-capacity-planning and
> 
> http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/
> 
> #####
> ....
> Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
> 
>    12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
> ...
> #####
> 
> What happen when a datanode have 20TB but the whole hadoop/HBase 2 node cluster have 40?
> 
> I see I'm still new to hadoop/HBase concept.
> 
>>> .) Is HBase a good value for $Storage?
>> yes ;)
>>> .) Is HBase a good value for $DB?
>>>    DB-Size is smaller then 1 GB, I would use HBase just for HA features
>>>    of Hadoop.
>> well, the official documentation says:
>> »First, make sure you have enough data. If you have hundreds of millions
>> or billions of rows, then HBase is a good candidate. If you only have a
>> few thousand/million rows, then using a traditional RDBMS might be a
>> better choice ...«
> 
> Okay so I will stay for this on postgresql with pgbouncer.
> 
>> In my experience at around 1-10 million rows RDBMS are not really
>> useable anymore. But I only used small/cheap hardware ... and don't like
>> RDBMS ;).
> 
> ;-)
> 
>> Well, you will have at least 40 million rows ... and the plattform is
>> growing. I think SQL isn't a choice anymore. And as you have heavy read
>> and only a few writes hbase is a good fit.
> 
> ?! why "40 million rows", do you mean the file tables?
> In the DB is only some Data like, User account, id for a directory and so on.
> 
>>> .) Due to the fact that HBase is a file-system I could use
>>>      /cams , for binary data
>>>      /DB   , for DB storage
>>>      /logs , for log storage
>>>    but is this wise. On the 'disk' they are different RAIDs.
>> hbase is a data store. This was probably copy pasted from the original
>> hadoop question ;).
> 
> ;-)
> 
>>> .) Should I plan a dedicated Network+Card for the 'cluster
>>>   communication' as for the most other cluster software?
>>>   From what I have read it looks not necessary but from security point
>>>   of view, yes.
>> http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
>> Cloudera employees says that it wouldn't harm if you have to push a lot
>> of data to the cluster.
> 
> Okay, so it is like other cluster setups.
> 
>>> .) Maybe the communication with the componnents (hadoop, zk, ...) could
>>>   be setup ed with TLS?
>> hbase is build on top of hadoop/hdfs. This in the "hadoop domain".
>> hadoop can encrypt the transported data by TLS, can encrypt the data on
>> the disc, you can use kerberos auth (but this stuff I never did) etc.
>> etc.. So the answer is yes.
> 
> Thanks.
> 
>> Last remark: You seem kind of bound to PHP. The hadoop world is written
>> in java. Of course there are a lot of ways to do stuff in other
>> languages, over interfaces etc. But the java api is the most powerful
>> and sometimes there are no other ways then to use it directly.
> 
> Currently, yes php is the main language.
> I don't know a good solution for php similar like hadoop, anyone else know one?
> 
> I will take a look on
> 
> https://wiki.apache.org/hadoop/PoweredBy
> 
> to get some Ideas for a working solution.
> 
>> Best wishes,
>> Wilm
> 
> Thanks for your feedbak.
> I will dig deeper into this topic and start to setup the components step by step.
> 
> BR Aleks

Re: Newbie Question about 37TB binary storage on HBase

Posted by Michael Segel <mi...@hotmail.com>.
You receive images, and you can store them in sequence files. Since HDFS is a WORM file system, you will have to do some work here: land the individual images in a folder on HDFS, periodically sweep them into a single sequence file, and use HBase to track the location of each image (an index over the series of sequence files). Once you build a sequence file, you won't be touching it unless you're doing some file maintenance and want to combine files into larger sequence files, or sort images into new ones.

So if you want a specific image, you perform a lookup in HBase to find the URL of the sequence file and the offset into it, and read the image from there.
If you want to add more to the index (user, timestamp, image metadata, etc.) you will end up with multiple indexes. Here you could use an in-memory index (SOLR), which lets you combine attributes to determine the image or set of images you want to retrieve.
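
For illustration, a minimal sketch of the sweep-and-index step, assuming a hypothetical index table "img_index" with column family "loc", hypothetical paths and row key, and the Hadoop SequenceFile API plus the newer (1.x) HBase client API:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImageSweep {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path seqPath = new Path("/cams/42/2014/11/19/images-0001.seq");     // hypothetical target file

        try (Connection hbase = ConnectionFactory.createConnection(conf);
             Table index = hbase.getTable(TableName.valueOf("img_index"));  // hypothetical index table
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(seqPath),
                     SequenceFile.Writer.keyClass(Text.class),
                     SequenceFile.Writer.valueClass(BytesWritable.class))) {

            String imageId = "42/2014/11/19/img-0001.jpg";                  // hypothetical image id / row key
            byte[] jpeg = Files.readAllBytes(Paths.get("/mnt/cams/" + imageId));

            // Remember where this record starts, then append it to the sequence file.
            long offset = writer.getLength();
            writer.append(new Text(imageId), new BytesWritable(jpeg));

            // Index the image: row key = image id, columns = sequence file location + offset.
            Put put = new Put(Bytes.toBytes(imageId));
            put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("file"), Bytes.toBytes(seqPath.toString()));
            put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("offset"), Bytes.toBytes(offset));
            index.put(put);
        }
    }
}

Reading an image back would then be a Get against img_index followed by SequenceFile.Reader.seek(offset) and reader.next(key, value) on the referenced file.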

The downside is that you can’t out of the box persist the SOLR index in HBase… (although I may be somewhat dated here.) 

To be honest, we looked at Stargate… pre 0.89 release like 0.23 release… It was a mess so we never looked back.  And of course the client was/is a java shop. So Java is the first choice. 


Just my $0.02.

On Dec 1, 2014, at 2:41 PM, Aleks Laz <al...@none.at> wrote:

> Dear Michael.
> 
> Am 29-11-2014 23:49, schrieb Michael Segel:
>> Guys, KISS.
>> You can use a sequence file to store the images since the images are static.
> 
> Sorry but what do you mean with this sentence?
> 
>> Use HBase to index the images.
>> If you want… you could use ES or SOLR to take the HBase index and put
>> it in memory.
> 
> This statement is related to the log issue, isn't it?
> 
>> Thrift/Stargate HBase API? Really?  Sorry unless its vastly improved
>> over the years… ice.
> 
> Ok. What's your suggestion to talk wit hadoop/HBase with none Java Programs?
> 
>> Note this simple pattern works really well in the IoT scheme of things.
>> Also… depending on the index(es),
>> Going SOLR and Sequence file may actually yield better i/o performance
>> and  scale better.
> 
> Please can you explain this a little bit more, thank you.
> 
> BR
> Aleks
> 
>> On Nov 28, 2014, at 5:37 PM, Otis Gospodnetic
>> <ot...@gmail.com> wrote:
>>> Hi,
>>> On Fri, Nov 28, 2014 at 5:08 AM, Wilm Schumacher <wilm.schumacher@cawoom.com
>>>> wrote:
>>>> Hi Otis,
>>>> thx for the interesting insight. This is very interesting. I never had
>>>> ES really on scale. But we plan to do that, with hbase as primary db (of
>>>> course ;) ).
>>>> I just had the opinion that ES and hbase would scale side by side.
>>> Sure, they can both be *scaled*, but in *our* use case HBase was more
>>> efficient with the same amount of data and same hardware.  It could handle
>>> the same volume of data with the same hardware better (lower CPU, GC, etc.)
>>> than ES.  Please note that this was our use case.  I'm not saying it's
>>> universally true.  The two tools have different features, do different sort
>>> of work under the hood, so this difference makes sense to me.
>>>> Could you please give us some details on what you mean by "more scalable"?
>>> Please see above.
>>>> What was the ES backend?
>>> We used it to store metrics from SPM <http://sematext.com/spm/>.  We use
>>> HBase for that in SPM Cloud version, but we don't use HBase in the On
>>> Premises version of SPM due to the operational complexity of HBase.
>>> Otis
>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>> Am 28.11.2014 um 06:37 schrieb Otis Gospodnetic:
>>>>> Hi,
>>>>> There was a mention of Elasticsearch here that caught my attention.
>>>>> We use both HBase and Elasticsearch at Sematext.  SPM
>>>>> <http://sematext.com/spm/>, which monitors things like Hadoop, Spark,
>>>> etc.
>>>>> etc. including HBase and ES, can actually use either HBase or
>>>> Elasticsearch
>>>>> as the data store.  We experimented with both and an a few years old
>>>>> version of HBase was more scalable than the latest ES, at least in our
>>>> use
>>>>> case.
>>>>> Otis
>>>>> --
>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>> On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <al...@none.at> wrote:
>>>>>> Dear wilm and ted.
>>>>>> Thanks for your input and ideas.
>>>>>> I will now step back and learn more about big data and big storage to
>>>>>> be able to talk further.
>>>>>> Cheers Aleks
>>>>>> Am 28-11-2014 01:20, schrieb Wilm Schumacher:
>>>>>> Am 28.11.2014 um 00:32 schrieb Aleks Laz:
>>>>>>>> What's the plan about the "MOB-extension"?
>>>>>>> https://issues.apache.org/jira/browse/HBASE-11339
>>>>>>> From development point of view I can build HBase with the
>>>> "MOB-extension"
>>>>>>>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...)
>>>> is
>>>>>>>> much
>>>>>>>> easier to maintain.
>>>>>>> that's true :/
>>>>>>> We need to make some "accesslog" analyzing like piwik or awffull.
>>>>>>> I see. Well, this is of course possible, too.
>>>>>>> Maybe elasticsearch is a better tool for that?
>>>>>>> I used elastic search for full text search. Works veeery well :D. Loved
>>>>>>> it. But I never used it as primary database. And I wouldn't see an
>>>>>>> advantage for using ES here.
>>>>>>> As far as I have understood hadoop client see a 'Filesystem' with 37
>>>> TB
>>>>>>>> or
>>>>>>>> 120 TB but from the server point of view how should I plan the
>>>>>>>> storage/server
>>>>>>>> setup for the datanodes.
>>>>>>> now I get your question. If you have a replication factor of 3 (so
>>>> every
>>>>>>> data is hold three times by the cluster), then the aggregated storage
>>>>>>> has to be at least 3 times the 120 TB (+ buffer + operating system
>>>>>>> etc.). So you could use 360 1TB nodes. Or 3 120 TB nodes.
>>>>>>> What happen when a datanode have 20TB but the whole hadoop/HBase 2
>>>> node
>>>>>>>> cluster have 40?
>>>>>>> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
>>>>>>> distributes the data over the nodes.
>>>>>>> ?! why "40 million rows", do you mean the file tables?
>>>>>>>> In the DB is only some Data like, User account, id for a directory and
>>>>>>>> so on.
>>>>>>> If you use hbase as primary storage, every file would be a row. Think
>>>> of
>>>>>>> a "blob" in RDBMS. 40 millions files => 40 million rows.
>>>>>>> Assume you create an access log for the 40 millions files and assume
>>>>>>> every file is accessed 100 times and every access is a row in another
>>>>>>> "access log" table => 4 billion rows ;).
>>>>>>> Currently, yes php is the main language.
>>>>>>>> I don't know a good solution for php similar like hadoop, anyone else
>>>>>>>> know one?
>>>>>>> well, the basic stuff could be done by thrift/rest with a native php
>>>>>>> binding. It depends on what you are trying to do. If it's just CRUD and
>>>>>>> some scanning and filtering, thrift/rest should be enough. But as you
>>>>>>> said ... who knows what the future brings. If you want to do the fancy
>>>>>>> stuff, you should use java and deliver the data to your php
>>>> application-
>>>>>>> Just for completeness: There is HiveQL, too. This is kind of "SQL for
>>>>>>> hadoop". There is a hive client for php (as it is delivered by thrift)
>>>>>>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>>>>>> Another fitting option for your access log could be cassandra.
>>>> Cassandra
>>>>>>> is good at write performance, thus it is used for logging. Cassandra
>>>> has
>>>>>>> a "sql like" language, called cql. This works from php almost like a
>>>>>>> normal RDBMS. Prepared statements and all this stuff.
>>>>>>> But I think this is done the wrong way around. You should select a
>>>>>>> technology and then choose the language/interfaces etc. And if you
>>>>>>> choose hbase, and java is a good choice, and you use nginx and php is a
>>>>>>> good choice, the only task is to deliver data from A to B and back.
>>>>>>> Best wishes,
>>>>>>> Wilm
> 


Re: Newbie Question about 37TB binary storage on HBase

Posted by Aleks Laz <al...@none.at>.
Dear Michael.

Am 29-11-2014 23:49, schrieb Michael Segel:
> Guys, KISS.
> 
> You can use a sequence file to store the images since the images are 
> static.

Sorry, but what do you mean by this sentence?

> Use HBase to index the images.
> 
> If you want… you could use ES or SOLR to take the HBase index and put
> it in memory.

This statement is related to the log issue, isn't it?

> Thrift/Stargate HBase API? Really?  Sorry unless its vastly improved
> over the years… ice.

OK. What's your suggestion for talking to hadoop/HBase from non-Java
programs?

> Note this simple pattern works really well in the IoT scheme of things.
> 
> Also… depending on the index(es),
> Going SOLR and Sequence file may actually yield better i/o performance
> and  scale better.

Please can you explain this a little bit more, thank you.

BR
Aleks

> On Nov 28, 2014, at 5:37 PM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
> 
>> Hi,
>> 
>> On Fri, Nov 28, 2014 at 5:08 AM, Wilm Schumacher 
>> <wilm.schumacher@cawoom.com
>>> wrote:
>> 
>>> Hi Otis,
>>> 
>>> thx for the interesting insight. This is very interesting. I never 
>>> had
>>> ES really on scale. But we plan to do that, with hbase as primary db 
>>> (of
>>> course ;) ).
>>> 
>>> I just had the opinion that ES and hbase would scale side by side.
>>> 
>> 
>> Sure, they can both be *scaled*, but in *our* use case HBase was more
>> efficient with the same amount of data and same hardware.  It could 
>> handle
>> the same volume of data with the same hardware better (lower CPU, GC, 
>> etc.)
>> than ES.  Please note that this was our use case.  I'm not saying it's
>> universally true.  The two tools have different features, do different 
>> sort
>> of work under the hood, so this difference makes sense to me.
>> 
>> 
>>> Could you please give us some details on what you mean by "more 
>>> scalable"?
>>> 
>> 
>> Please see above.
>> 
>> 
>>> What was the ES backend?
>>> 
>> 
>> We used it to store metrics from SPM <http://sematext.com/spm/>.  We 
>> use
>> HBase for that in SPM Cloud version, but we don't use HBase in the On
>> Premises version of SPM due to the operational complexity of HBase.
>> 
>> Otis
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>> 
>> 
>> 
>> 
>>> Am 28.11.2014 um 06:37 schrieb Otis Gospodnetic:
>>>> Hi,
>>>> 
>>>> There was a mention of Elasticsearch here that caught my attention.
>>>> We use both HBase and Elasticsearch at Sematext.  SPM
>>>> <http://sematext.com/spm/>, which monitors things like Hadoop, 
>>>> Spark,
>>> etc.
>>>> etc. including HBase and ES, can actually use either HBase or
>>> Elasticsearch
>>>> as the data store.  We experimented with both and an a few years old
>>>> version of HBase was more scalable than the latest ES, at least in 
>>>> our
>>> use
>>>> case.
>>>> 
>>>> Otis
>>>> --
>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log 
>>>> Management
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>> 
>>>> 
>>>> On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <al...@none.at> 
>>>> wrote:
>>>> 
>>>>> Dear wilm and ted.
>>>>> 
>>>>> Thanks for your input and ideas.
>>>>> 
>>>>> I will now step back and learn more about big data and big storage 
>>>>> to
>>>>> be able to talk further.
>>>>> 
>>>>> Cheers Aleks
>>>>> 
>>>>> Am 28-11-2014 01:20, schrieb Wilm Schumacher:
>>>>> 
>>>>> Am 28.11.2014 um 00:32 schrieb Aleks Laz:
>>>>>> 
>>>>>>> What's the plan about the "MOB-extension"?
>>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/HBASE-11339
>>>>>> 
>>>>>> From development point of view I can build HBase with the
>>> "MOB-extension"
>>>>>>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, 
>>>>>>> ...)
>>> is
>>>>>>> much
>>>>>>> easier to maintain.
>>>>>>> 
>>>>>> that's true :/
>>>>>> 
>>>>>> We need to make some "accesslog" analyzing like piwik or awffull.
>>>>>>> 
>>>>>> I see. Well, this is of course possible, too.
>>>>>> 
>>>>>> Maybe elasticsearch is a better tool for that?
>>>>>>> 
>>>>>> I used elastic search for full text search. Works veeery well :D. 
>>>>>> Loved
>>>>>> it. But I never used it as primary database. And I wouldn't see an
>>>>>> advantage for using ES here.
>>>>>> 
>>>>>> As far as I have understood hadoop client see a 'Filesystem' with 
>>>>>> 37
>>> TB
>>>>>>> or
>>>>>>> 120 TB but from the server point of view how should I plan the
>>>>>>> storage/server
>>>>>>> setup for the datanodes.
>>>>>>> 
>>>>>> now I get your question. If you have a replication factor of 3 (so
>>> every
>>>>>> data is hold three times by the cluster), then the aggregated 
>>>>>> storage
>>>>>> has to be at least 3 times the 120 TB (+ buffer + operating system
>>>>>> etc.). So you could use 360 1TB nodes. Or 3 120 TB nodes.
>>>>>> 
>>>>>> What happen when a datanode have 20TB but the whole hadoop/HBase 2
>>> node
>>>>>>> cluster have 40?
>>>>>>> 
>>>>>> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
>>>>>> distributes the data over the nodes.
>>>>>> 
>>>>>> ?! why "40 million rows", do you mean the file tables?
>>>>>>> In the DB is only some Data like, User account, id for a 
>>>>>>> directory and
>>>>>>> so on.
>>>>>>> 
>>>>>> If you use hbase as primary storage, every file would be a row. 
>>>>>> Think
>>> of
>>>>>> a "blob" in RDBMS. 40 millions files => 40 million rows.
>>>>>> 
>>>>>> Assume you create an access log for the 40 millions files and 
>>>>>> assume
>>>>>> every file is accessed 100 times and every access is a row in 
>>>>>> another
>>>>>> "access log" table => 4 billion rows ;).
>>>>>> 
>>>>>> Currently, yes php is the main language.
>>>>>>> I don't know a good solution for php similar like hadoop, anyone 
>>>>>>> else
>>>>>>> know one?
>>>>>>> 
>>>>>> well, the basic stuff could be done by thrift/rest with a native 
>>>>>> php
>>>>>> binding. It depends on what you are trying to do. If it's just 
>>>>>> CRUD and
>>>>>> some scanning and filtering, thrift/rest should be enough. But as 
>>>>>> you
>>>>>> said ... who knows what the future brings. If you want to do the 
>>>>>> fancy
>>>>>> stuff, you should use java and deliver the data to your php
>>> application-
>>>>>> 
>>>>>> Just for completeness: There is HiveQL, too. This is kind of "SQL 
>>>>>> for
>>>>>> hadoop". There is a hive client for php (as it is delivered by 
>>>>>> thrift)
>>>>>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>>>>> 
>>>>>> Another fitting option for your access log could be cassandra.
>>> Cassandra
>>>>>> is good at write performance, thus it is used for logging. 
>>>>>> Cassandra
>>> has
>>>>>> a "sql like" language, called cql. This works from php almost like 
>>>>>> a
>>>>>> normal RDBMS. Prepared statements and all this stuff.
>>>>>> 
>>>>>> But I think this is done the wrong way around. You should select a
>>>>>> technology and then choose the language/interfaces etc. And if you
>>>>>> choose hbase, and java is a good choice, and you use nginx and php 
>>>>>> is a
>>>>>> good choice, the only task is to deliver data from A to B and 
>>>>>> back.
>>>>>> 
>>>>>> Best wishes,
>>>>>> 
>>>>>> Wilm
>>>>>> 
>>>>> 
>>>> 
>>> 

Re: Newbie Question about 37TB binary storage on HBase

Posted by Michael Segel <mi...@hotmail.com>.
Guys, KISS.

You can use a sequence file to store the images since the images are static. 
Use HBase to index the images. 

If you want… you could use ES or SOLR to take the HBase index and put it in memory. 

Thrift/Stargate HBase API? Really?  Sorry unless its vastly improved over the years… ice. 

Note this simple pattern works really well in the IoT scheme of things. 

Also… depending on the index(es), going with SOLR and sequence files may
actually yield better I/O performance and scale better.



On Nov 28, 2014, at 5:37 PM, Otis Gospodnetic <ot...@gmail.com> wrote:

> Hi,
> 
> On Fri, Nov 28, 2014 at 5:08 AM, Wilm Schumacher <wilm.schumacher@cawoom.com
>> wrote:
> 
>> Hi Otis,
>> 
>> thx for the interesting insight. This is very interesting. I never had
>> ES really on scale. But we plan to do that, with hbase as primary db (of
>> course ;) ).
>> 
>> I just had the opinion that ES and hbase would scale side by side.
>> 
> 
> Sure, they can both be *scaled*, but in *our* use case HBase was more
> efficient with the same amount of data and same hardware.  It could handle
> the same volume of data with the same hardware better (lower CPU, GC, etc.)
> than ES.  Please note that this was our use case.  I'm not saying it's
> universally true.  The two tools have different features, do different sort
> of work under the hood, so this difference makes sense to me.
> 
> 
>> Could you please give us some details on what you mean by "more scalable"?
>> 
> 
> Please see above.
> 
> 
>> What was the ES backend?
>> 
> 
> We used it to store metrics from SPM <http://sematext.com/spm/>.  We use
> HBase for that in SPM Cloud version, but we don't use HBase in the On
> Premises version of SPM due to the operational complexity of HBase.
> 
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> 
> 
>> Am 28.11.2014 um 06:37 schrieb Otis Gospodnetic:
>>> Hi,
>>> 
>>> There was a mention of Elasticsearch here that caught my attention.
>>> We use both HBase and Elasticsearch at Sematext.  SPM
>>> <http://sematext.com/spm/>, which monitors things like Hadoop, Spark,
>> etc.
>>> etc. including HBase and ES, can actually use either HBase or
>> Elasticsearch
>>> as the data store.  We experimented with both and an a few years old
>>> version of HBase was more scalable than the latest ES, at least in our
>> use
>>> case.
>>> 
>>> Otis
>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>> 
>>> 
>>> On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <al...@none.at> wrote:
>>> 
>>>> Dear wilm and ted.
>>>> 
>>>> Thanks for your input and ideas.
>>>> 
>>>> I will now step back and learn more about big data and big storage to
>>>> be able to talk further.
>>>> 
>>>> Cheers Aleks
>>>> 
>>>> Am 28-11-2014 01:20, schrieb Wilm Schumacher:
>>>> 
>>>> Am 28.11.2014 um 00:32 schrieb Aleks Laz:
>>>>> 
>>>>>> What's the plan about the "MOB-extension"?
>>>>>> 
>>>>> https://issues.apache.org/jira/browse/HBASE-11339
>>>>> 
>>>>> From development point of view I can build HBase with the
>> "MOB-extension"
>>>>>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...)
>> is
>>>>>> much
>>>>>> easier to maintain.
>>>>>> 
>>>>> that's true :/
>>>>> 
>>>>> We need to make some "accesslog" analyzing like piwik or awffull.
>>>>>> 
>>>>> I see. Well, this is of course possible, too.
>>>>> 
>>>>> Maybe elasticsearch is a better tool for that?
>>>>>> 
>>>>> I used elastic search for full text search. Works veeery well :D. Loved
>>>>> it. But I never used it as primary database. And I wouldn't see an
>>>>> advantage for using ES here.
>>>>> 
>>>>> As far as I have understood hadoop client see a 'Filesystem' with 37
>> TB
>>>>>> or
>>>>>> 120 TB but from the server point of view how should I plan the
>>>>>> storage/server
>>>>>> setup for the datanodes.
>>>>>> 
>>>>> now I get your question. If you have a replication factor of 3 (so
>> every
>>>>> data is hold three times by the cluster), then the aggregated storage
>>>>> has to be at least 3 times the 120 TB (+ buffer + operating system
>>>>> etc.). So you could use 360 1TB nodes. Or 3 120 TB nodes.
>>>>> 
>>>>> What happen when a datanode have 20TB but the whole hadoop/HBase 2
>> node
>>>>>> cluster have 40?
>>>>>> 
>>>>> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
>>>>> distributes the data over the nodes.
>>>>> 
>>>>> ?! why "40 million rows", do you mean the file tables?
>>>>>> In the DB is only some Data like, User account, id for a directory and
>>>>>> so on.
>>>>>> 
>>>>> If you use hbase as primary storage, every file would be a row. Think
>> of
>>>>> a "blob" in RDBMS. 40 millions files => 40 million rows.
>>>>> 
>>>>> Assume you create an access log for the 40 millions files and assume
>>>>> every file is accessed 100 times and every access is a row in another
>>>>> "access log" table => 4 billion rows ;).
>>>>> 
>>>>> Currently, yes php is the main language.
>>>>>> I don't know a good solution for php similar like hadoop, anyone else
>>>>>> know one?
>>>>>> 
>>>>> well, the basic stuff could be done by thrift/rest with a native php
>>>>> binding. It depends on what you are trying to do. If it's just CRUD and
>>>>> some scanning and filtering, thrift/rest should be enough. But as you
>>>>> said ... who knows what the future brings. If you want to do the fancy
>>>>> stuff, you should use java and deliver the data to your php
>> application-
>>>>> 
>>>>> Just for completeness: There is HiveQL, too. This is kind of "SQL for
>>>>> hadoop". There is a hive client for php (as it is delivered by thrift)
>>>>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>>>> 
>>>>> Another fitting option for your access log could be cassandra.
>> Cassandra
>>>>> is good at write performance, thus it is used for logging. Cassandra
>> has
>>>>> a "sql like" language, called cql. This works from php almost like a
>>>>> normal RDBMS. Prepared statements and all this stuff.
>>>>> 
>>>>> But I think this is done the wrong way around. You should select a
>>>>> technology and then choose the language/interfaces etc. And if you
>>>>> choose hbase, and java is a good choice, and you use nginx and php is a
>>>>> good choice, the only task is to deliver data from A to B and back.
>>>>> 
>>>>> Best wishes,
>>>>> 
>>>>> Wilm
>>>>> 
>>>> 
>>> 
>> 


Re: Newbie Question about 37TB binary storage on HBase

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

On Fri, Nov 28, 2014 at 5:08 AM, Wilm Schumacher <wilm.schumacher@cawoom.com
> wrote:

> Hi Otis,
>
> thx for the interesting insight. This is very interesting. I never had
> ES really on scale. But we plan to do that, with hbase as primary db (of
> course ;) ).
>
> I just had the opinion that ES and hbase would scale side by side.
>

Sure, they can both be *scaled*, but in *our* use case HBase was more
efficient with the same amount of data and same hardware.  It could handle
the same volume of data with the same hardware better (lower CPU, GC, etc.)
than ES.  Please note that this was our use case.  I'm not saying it's
universally true.  The two tools have different features, do different sort
of work under the hood, so this difference makes sense to me.


> Could you please give us some details on what you mean by "more scalable"?
>

Please see above.


> What was the ES backend?
>

We used it to store metrics from SPM <http://sematext.com/spm/>.  We use
HBase for that in SPM Cloud version, but we don't use HBase in the On
Premises version of SPM due to the operational complexity of HBase.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




> Am 28.11.2014 um 06:37 schrieb Otis Gospodnetic:
> > Hi,
> >
> > There was a mention of Elasticsearch here that caught my attention.
> > We use both HBase and Elasticsearch at Sematext.  SPM
> > <http://sematext.com/spm/>, which monitors things like Hadoop, Spark,
> etc.
> > etc. including HBase and ES, can actually use either HBase or
> Elasticsearch
> > as the data store.  We experimented with both and an a few years old
> > version of HBase was more scalable than the latest ES, at least in our
> use
> > case.
> >
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <al...@none.at> wrote:
> >
> >> Dear wilm and ted.
> >>
> >> Thanks for your input and ideas.
> >>
> >> I will now step back and learn more about big data and big storage to
> >> be able to talk further.
> >>
> >> Cheers Aleks
> >>
> >> Am 28-11-2014 01:20, schrieb Wilm Schumacher:
> >>
> >>  Am 28.11.2014 um 00:32 schrieb Aleks Laz:
> >>>
> >>>> What's the plan about the "MOB-extension"?
> >>>>
> >>> https://issues.apache.org/jira/browse/HBASE-11339
> >>>
> >>>  From development point of view I can build HBase with the
> "MOB-extension"
> >>>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...)
> is
> >>>> much
> >>>> easier to maintain.
> >>>>
> >>> that's true :/
> >>>
> >>>  We need to make some "accesslog" analyzing like piwik or awffull.
> >>>>
> >>> I see. Well, this is of course possible, too.
> >>>
> >>>  Maybe elasticsearch is a better tool for that?
> >>>>
> >>> I used elastic search for full text search. Works veeery well :D. Loved
> >>> it. But I never used it as primary database. And I wouldn't see an
> >>> advantage for using ES here.
> >>>
> >>>  As far as I have understood hadoop client see a 'Filesystem' with 37
> TB
> >>>> or
> >>>> 120 TB but from the server point of view how should I plan the
> >>>> storage/server
> >>>> setup for the datanodes.
> >>>>
> >>> now I get your question. If you have a replication factor of 3 (so
> every
> >>> data is hold three times by the cluster), then the aggregated storage
> >>> has to be at least 3 times the 120 TB (+ buffer + operating system
> >>> etc.). So you could use 360 1TB nodes. Or 3 120 TB nodes.
> >>>
> >>>  What happen when a datanode have 20TB but the whole hadoop/HBase 2
> node
> >>>> cluster have 40?
> >>>>
> >>> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
> >>> distributes the data over the nodes.
> >>>
> >>>  ?! why "40 million rows", do you mean the file tables?
> >>>> In the DB is only some Data like, User account, id for a directory and
> >>>> so on.
> >>>>
> >>> If you use hbase as primary storage, every file would be a row. Think
> of
> >>> a "blob" in RDBMS. 40 millions files => 40 million rows.
> >>>
> >>> Assume you create an access log for the 40 millions files and assume
> >>> every file is accessed 100 times and every access is a row in another
> >>> "access log" table => 4 billion rows ;).
> >>>
> >>>  Currently, yes php is the main language.
> >>>> I don't know a good solution for php similar like hadoop, anyone else
> >>>> know one?
> >>>>
> >>> well, the basic stuff could be done by thrift/rest with a native php
> >>> binding. It depends on what you are trying to do. If it's just CRUD and
> >>> some scanning and filtering, thrift/rest should be enough. But as you
> >>> said ... who knows what the future brings. If you want to do the fancy
> >>> stuff, you should use java and deliver the data to your php
> application-
> >>>
> >>> Just for completeness: There is HiveQL, too. This is kind of "SQL for
> >>> hadoop". There is a hive client for php (as it is delivered by thrift)
> >>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
> >>>
> >>> Another fitting option for your access log could be cassandra.
> Cassandra
> >>> is good at write performance, thus it is used for logging. Cassandra
> has
> >>> a "sql like" language, called cql. This works from php almost like a
> >>> normal RDBMS. Prepared statements and all this stuff.
> >>>
> >>> But I think this is done the wrong way around. You should select a
> >>> technology and then choose the language/interfaces etc. And if you
> >>> choose hbase, and java is a good choice, and you use nginx and php is a
> >>> good choice, the only task is to deliver data from A to B and back.
> >>>
> >>> Best wishes,
> >>>
> >>> Wilm
> >>>
> >>
> >
>

Re: Newbie Question about 37TB binary storage on HBase

Posted by Wilm Schumacher <wi...@cawoom.com>.
Hi Otis,

thx for the interesting insight. This is very interesting. I have never
really run ES at scale. But we plan to do that, with hbase as the primary
db (of course ;) ).

I just had the impression that ES and hbase would scale side by side.

Could you please give us some details on what you mean by "more scalable"?

What was the ES backend?

Best regards

Wilm

Am 28.11.2014 um 06:37 schrieb Otis Gospodnetic:
> Hi,
> 
> There was a mention of Elasticsearch here that caught my attention.
> We use both HBase and Elasticsearch at Sematext.  SPM
> <http://sematext.com/spm/>, which monitors things like Hadoop, Spark, etc.
> etc. including HBase and ES, can actually use either HBase or Elasticsearch
> as the data store.  We experimented with both and an a few years old
> version of HBase was more scalable than the latest ES, at least in our use
> case.
> 
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <al...@none.at> wrote:
> 
>> Dear wilm and ted.
>>
>> Thanks for your input and ideas.
>>
>> I will now step back and learn more about big data and big storage to
>> be able to talk further.
>>
>> Cheers Aleks
>>
>> Am 28-11-2014 01:20, schrieb Wilm Schumacher:
>>
>>  Am 28.11.2014 um 00:32 schrieb Aleks Laz:
>>>
>>>> What's the plan about the "MOB-extension"?
>>>>
>>> https://issues.apache.org/jira/browse/HBASE-11339
>>>
>>>  From development point of view I can build HBase with the "MOB-extension"
>>>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...) is
>>>> much
>>>> easier to maintain.
>>>>
>>> that's true :/
>>>
>>>  We need to make some "accesslog" analyzing like piwik or awffull.
>>>>
>>> I see. Well, this is of course possible, too.
>>>
>>>  Maybe elasticsearch is a better tool for that?
>>>>
>>> I used elastic search for full text search. Works veeery well :D. Loved
>>> it. But I never used it as primary database. And I wouldn't see an
>>> advantage for using ES here.
>>>
>>>  As far as I have understood hadoop client see a 'Filesystem' with 37 TB
>>>> or
>>>> 120 TB but from the server point of view how should I plan the
>>>> storage/server
>>>> setup for the datanodes.
>>>>
>>> now I get your question. If you have a replication factor of 3 (so every
>>> data is hold three times by the cluster), then the aggregated storage
>>> has to be at least 3 times the 120 TB (+ buffer + operating system
>>> etc.). So you could use 360 1TB nodes. Or 3 120 TB nodes.
>>>
>>>  What happen when a datanode have 20TB but the whole hadoop/HBase 2 node
>>>> cluster have 40?
>>>>
>>> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
>>> distributes the data over the nodes.
>>>
>>>  ?! why "40 million rows", do you mean the file tables?
>>>> In the DB is only some Data like, User account, id for a directory and
>>>> so on.
>>>>
>>> If you use hbase as primary storage, every file would be a row. Think of
>>> a "blob" in RDBMS. 40 millions files => 40 million rows.
>>>
>>> Assume you create an access log for the 40 millions files and assume
>>> every file is accessed 100 times and every access is a row in another
>>> "access log" table => 4 billion rows ;).
>>>
>>>  Currently, yes php is the main language.
>>>> I don't know a good solution for php similar like hadoop, anyone else
>>>> know one?
>>>>
>>> well, the basic stuff could be done by thrift/rest with a native php
>>> binding. It depends on what you are trying to do. If it's just CRUD and
>>> some scanning and filtering, thrift/rest should be enough. But as you
>>> said ... who knows what the future brings. If you want to do the fancy
>>> stuff, you should use java and deliver the data to your php application-
>>>
>>> Just for completeness: There is HiveQL, too. This is kind of "SQL for
>>> hadoop". There is a hive client for php (as it is delivered by thrift)
>>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>>
>>> Another fitting option for your access log could be cassandra. Cassandra
>>> is good at write performance, thus it is used for logging. Cassandra has
>>> a "sql like" language, called cql. This works from php almost like a
>>> normal RDBMS. Prepared statements and all this stuff.
>>>
>>> But I think this is done the wrong way around. You should select a
>>> technology and then choose the language/interfaces etc. And if you
>>> choose hbase, and java is a good choice, and you use nginx and php is a
>>> good choice, the only task is to deliver data from A to B and back.
>>>
>>> Best wishes,
>>>
>>> Wilm
>>>
>>
> 

Re: Newbie Question about 37TB binary storage on HBase

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

There was a mention of Elasticsearch here that caught my attention.
We use both HBase and Elasticsearch at Sematext.  SPM
<http://sematext.com/spm/>, which monitors things like Hadoop, Spark, etc.
etc. including HBase and ES, can actually use either HBase or Elasticsearch
as the data store.  We experimented with both, and a few-years-old
version of HBase was more scalable than the latest ES, at least in our use
case.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <al...@none.at> wrote:

> Dear wilm and ted.
>
> Thanks for your input and ideas.
>
> I will now step back and learn more about big data and big storage to
> be able to talk further.
>
> Cheers Aleks
>
> Am 28-11-2014 01:20, schrieb Wilm Schumacher:
>
>  Am 28.11.2014 um 00:32 schrieb Aleks Laz:
>>
>>> What's the plan about the "MOB-extension"?
>>>
>> https://issues.apache.org/jira/browse/HBASE-11339
>>
>>  From development point of view I can build HBase with the "MOB-extension"
>>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...) is
>>> much
>>> easier to maintain.
>>>
>> that's true :/
>>
>>  We need to make some "accesslog" analyzing like piwik or awffull.
>>>
>> I see. Well, this is of course possible, too.
>>
>>  Maybe elasticsearch is a better tool for that?
>>>
>> I used elastic search for full text search. Works veeery well :D. Loved
>> it. But I never used it as primary database. And I wouldn't see an
>> advantage for using ES here.
>>
>>  As far as I have understood hadoop client see a 'Filesystem' with 37 TB
>>> or
>>> 120 TB but from the server point of view how should I plan the
>>> storage/server
>>> setup for the datanodes.
>>>
>> now I get your question. If you have a replication factor of 3 (so every
>> data is hold three times by the cluster), then the aggregated storage
>> has to be at least 3 times the 120 TB (+ buffer + operating system
>> etc.). So you could use 360 1TB nodes. Or 3 120 TB nodes.
>>
>>  What happen when a datanode have 20TB but the whole hadoop/HBase 2 node
>>> cluster have 40?
>>>
>> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
>> distributes the data over the nodes.
>>
>>  ?! why "40 million rows", do you mean the file tables?
>>> In the DB is only some Data like, User account, id for a directory and
>>> so on.
>>>
>> If you use hbase as primary storage, every file would be a row. Think of
>> a "blob" in RDBMS. 40 millions files => 40 million rows.
>>
>> Assume you create an access log for the 40 millions files and assume
>> every file is accessed 100 times and every access is a row in another
>> "access log" table => 4 billion rows ;).
>>
>>  Currently, yes php is the main language.
>>> I don't know a good solution for php similar like hadoop, anyone else
>>> know one?
>>>
>> well, the basic stuff could be done by thrift/rest with a native php
>> binding. It depends on what you are trying to do. If it's just CRUD and
>> some scanning and filtering, thrift/rest should be enough. But as you
>> said ... who knows what the future brings. If you want to do the fancy
>> stuff, you should use java and deliver the data to your php application-
>>
>> Just for completeness: There is HiveQL, too. This is kind of "SQL for
>> hadoop". There is a hive client for php (as it is delivered by thrift)
>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>
>> Another fitting option for your access log could be cassandra. Cassandra
>> is good at write performance, thus it is used for logging. Cassandra has
>> a "sql like" language, called cql. This works from php almost like a
>> normal RDBMS. Prepared statements and all this stuff.
>>
>> But I think this is done the wrong way around. You should select a
>> technology and then choose the language/interfaces etc. And if you
>> choose hbase, and java is a good choice, and you use nginx and php is a
>> good choice, the only task is to deliver data from A to B and back.
>>
>> Best wishes,
>>
>> Wilm
>>
>

Re: Newbie Question about 37TB binary storage on HBase

Posted by Aleks Laz <al...@none.at>.
Dear wilm and ted.

Thanks for your input and ideas.

I will now step back and learn more about big data and big storage to
be able to talk further.

Cheers Aleks

Am 28-11-2014 01:20, schrieb Wilm Schumacher:
> Am 28.11.2014 um 00:32 schrieb Aleks Laz:
>> What's the plan about the "MOB-extension"?
> https://issues.apache.org/jira/browse/HBASE-11339
> 
>> From development point of view I can build HBase with the 
>> "MOB-extension"
>> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...) 
>> is
>> much
>> easier to maintain.
> that's true :/
> 
>> We need to make some "accesslog" analyzing like piwik or awffull.
> I see. Well, this is of course possible, too.
> 
>> Maybe elasticsearch is a better tool for that?
> I used elastic search for full text search. Works veeery well :D. Loved
> it. But I never used it as primary database. And I wouldn't see an
> advantage for using ES here.
> 
>> As far as I have understood hadoop client see a 'Filesystem' with 37 
>> TB or
>> 120 TB but from the server point of view how should I plan the
>> storage/server
>> setup for the datanodes.
> now I get your question. If you have a replication factor of 3 (so 
> every
> data is hold three times by the cluster), then the aggregated storage
> has to be at least 3 times the 120 TB (+ buffer + operating system
> etc.). So you could use 360 1TB nodes. Or 3 120 TB nodes.
> 
>> What happen when a datanode have 20TB but the whole hadoop/HBase 2 
>> node
>> cluster have 40?
> well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
> distributes the data over the nodes.
> 
>> ?! why "40 million rows", do you mean the file tables?
>> In the DB is only some Data like, User account, id for a directory and
>> so on.
> If you use hbase as primary storage, every file would be a row. Think 
> of
> a "blob" in RDBMS. 40 millions files => 40 million rows.
> 
> Assume you create an access log for the 40 millions files and assume
> every file is accessed 100 times and every access is a row in another
> "access log" table => 4 billion rows ;).
> 
>> Currently, yes php is the main language.
>> I don't know a good solution for php similar like hadoop, anyone else
>> know one?
> well, the basic stuff could be done by thrift/rest with a native php
> binding. It depends on what you are trying to do. If it's just CRUD and
> some scanning and filtering, thrift/rest should be enough. But as you
> said ... who knows what the future brings. If you want to do the fancy
> stuff, you should use java and deliver the data to your php 
> application-
> 
> Just for completeness: There is HiveQL, too. This is kind of "SQL for
> hadoop". There is a hive client for php (as it is delivered by thrift)
> https://cwiki.apache.org/confluence/display/Hive/HiveClient
> 
> Another fitting option for your access log could be cassandra. 
> Cassandra
> is good at write performance, thus it is used for logging. Cassandra 
> has
> a "sql like" language, called cql. This works from php almost like a
> normal RDBMS. Prepared statements and all this stuff.
> 
> But I think this is done the wrong way around. You should select a
> technology and then choose the language/interfaces etc. And if you
> choose hbase, and java is a good choice, and you use nginx and php is a
> good choice, the only task is to deliver data from A to B and back.
> 
> Best wishes,
> 
> Wilm

Re: Newbie Question about 37TB binary storage on HBase

Posted by Wilm Schumacher <wi...@cawoom.com>.
Am 28.11.2014 um 00:32 schrieb Aleks Laz:
> What's the plan about the "MOB-extension"?
https://issues.apache.org/jira/browse/HBASE-11339

> From development point of view I can build HBase with the "MOB-extension"
> but from sysadmin point of view a 'package' (jar,zip, dep, rpm, ...) is
> much
> easier to maintain.
that's true :/

> We need to make some "accesslog" analyzing like piwik or awffull.
I see. Well, this is of course possible, too.

> Maybe elasticsearch is a better tool for that?
I used elastic search for full text search. Works veeery well :D. Loved
it. But I never used it as primary database. And I wouldn't see an
advantage for using ES here.

> As far as I have understood hadoop client see a 'Filesystem' with 37 TB or
> 120 TB but from the server point of view how should I plan the
> storage/server
> setup for the datanodes.
now I get your question. If you have a replication factor of 3 (so every
piece of data is held three times by the cluster), then the aggregated
storage has to be at least 3 times the 120 TB (+ buffer + operating system
etc.). So you could use 360 nodes with 1 TB each, or 3 nodes with 120 TB each.
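
To put rough numbers on that (just as an illustration, not a sizing recommendation): 120 TB of raw data x replication factor 3 = 360 TB, plus headroom for temporary files, compactions and the OS, so very roughly 400-450 TB of aggregate disk across the cluster. With the "12-24 hard disks of 1-4 TB per node" guideline quoted elsewhere in this thread, that would be on the order of a dozen datanodes with ~36 TB (e.g. 12 x 3 TB) of JBOD disk each, rather than a handful of huge nodes.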

> What happen when a datanode have 20TB but the whole hadoop/HBase 2 node
> cluster have 40?
well, if it is in a cluster of enough 20 TB nodes, nothing. hbase
distributes the data over the nodes.

> ?! why "40 million rows", do you mean the file tables?
> In the DB is only some Data like, User account, id for a directory and
> so on.
If you use hbase as primary storage, every file would be a row. Think of
a "blob" in an RDBMS. 40 million files => 40 million rows.

Assume you create an access log for the 40 million files, assume every
file is accessed 100 times, and every access is a row in another
"access log" table => 4 billion rows ;).

> Currently, yes php is the main language.
> I don't know a good solution for php similar like hadoop, anyone else
> know one?
well, the basic stuff could be done via thrift/rest with a native php
binding. It depends on what you are trying to do. If it's just CRUD and
some scanning and filtering, thrift/rest should be enough. But as you
said ... who knows what the future brings. If you want to do the fancy
stuff, you should use java and deliver the data to your php application.

Just for completeness: There is HiveQL, too. This is kind of "SQL for
hadoop". There is a hive client for php (as it is delivered by thrift)
https://cwiki.apache.org/confluence/display/Hive/HiveClient

Another fitting option for your access log could be cassandra. Cassandra
is good at write performance, thus it is used for logging. Cassandra has
a "sql like" language, called cql. This works from php almost like a
normal RDBMS. Prepared statements and all this stuff.

But I think this is being done the wrong way around. You should select a
technology first and then choose the language/interfaces etc. And if you
choose hbase (where java is a good choice) and you use nginx with php
(also a good choice), the only remaining task is to deliver data from A to
B and back.

Best wishes,

Wilm

Re: Newbie Question about 37TB binary storage on HBase

Posted by Aleks Laz <al...@none.at>.
Hi Wilm.

Am 27-11-2014 23:41, schrieb Wilm Schumacher:
> Hi Aleks ;),
> 
> Am 27.11.2014 um 22:27 schrieb Aleks Laz:
>> Our application is a nginx/php-fpm/postgresql Setup.
>> The target design is nginx + proxy features / php-fpm / $DB / 
>> $Storage.
>> 
>> .) Can I mix HDFS /HBase for binary data storage and data analyzing?
> 
> yes. hbase is perfect for that. For storage it will work (with the
> "MOB-extension") and with map reduce you can do whatever data analyzing
> you want. I assume you do some image processing with the data?!?!

What's the plan about the "MOB-extension"?

From a development point of view I can build HBase with the
"MOB-extension", but from a sysadmin point of view a 'package'
(jar, zip, deb, rpm, ...) is much easier to maintain.

Currently there are no plans to analyse the images, but who knows what 
the
future brings.

We need to make some "accesslog" analyzing like piwik or awffull.
Maybe elasticsearch is a better tool for that?

>> .) What is the preferred way to us HBase  with PHP?
> 
> The native client lib is in java. This is the best way to go. But if 
> you
> need only basic access from the php application, then thrift or rest
> would be a good choice.
> 
> http://wiki.apache.org/hadoop/Hbase/ThriftApi
> http://wiki.apache.org/hadoop/Hbase/Stargate

Stargate is a cool name ;-)

> There are language bindings for both
> 
>> .) How difficult is it to use HBase with PHP?
> Depending on what you are trying to do. If you just do a little
> fetching, updating, inserting etc. it's pretty easy. More complicate
> stuff I would do in java and expose it by a custom api by a java 
> service.
> 
>> .) What's a good solution for the 37 TB or the upcoming ~120 TB to
>> distribute?
>>    [ ] N Servers with 1 37 TB mountpoints per server?
>>    [ ] N Servers with x TB mountpoints pers server?
>>    [ ] other:
> that's "not your business". hbase/hadoop does the trick for you. hbase
> distributes the data, replicates it etc.. You will only talk to the 
> master.

Well, but at the end of the day I will need physical storage distributed
over x servers.

My question is: do I need to make sure that every server has enough
storage for the whole data set?

As far as I have understood, the hadoop client sees a 'filesystem' with
37 TB or 120 TB, but from the server point of view, how should I plan the
storage/server setup for the datanodes?

Going by the hadoophbase-capacity-planning link below and

http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/

#####
....
Here are the recommended specifications for DataNode/TaskTrackers in a 
balanced Hadoop cluster:

     12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
...
#####

What happens when a datanode has 20 TB but the whole hadoop/HBase 2-node
cluster has 40?

I see I'm still new to the hadoop/HBase concepts.

>> .) Is HBase a good value for $Storage?
> yes ;)
> 
>> .) Is HBase a good value for $DB?
>>     DB-Size is smaller then 1 GB, I would use HBase just for HA 
>> features
>>     of Hadoop.
> well, the official documentation says:
> »First, make sure you have enough data. If you have hundreds of 
> millions
> or billions of rows, then HBase is a good candidate. If you only have a
> few thousand/million rows, then using a traditional RDBMS might be a
> better choice ...«

Okay, so for this I will stay on postgresql with pgbouncer.

> In my experience at around 1-10 million rows RDBMS are not really
> useable anymore. But I only used small/cheap hardware ... and don't 
> like
> RDBMS ;).

;-)

> Well, you will have at least 40 million rows ... and the plattform is
> growing. I think SQL isn't a choice anymore. And as you have heavy read
> and only a few writes hbase is a good fit.

?! why "40 million rows", do you mean the file tables?
In the DB is only some Data like, User account, id for a directory and 
so on.

>> .) Due to the fact that HBase is a file-system I could use
>>       /cams , for binary data
>>       /DB   , for DB storage
>>       /logs , for log storage
>>     but is this wise. On the 'disk' they are different RAIDs.
> hbase is a data store. This was probably copy pasted from the original
> hadoop question ;).

;-)

>> .) Should I plan a dedicated Network+Card for the 'cluster
>>    communication' as for the most other cluster software?
>>    From what I have read it looks not necessary but from security 
>> point
>>    of view, yes.
> 
> http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
> 
> Cloudera employees says that it wouldn't harm if you have to push a lot
> of data to the cluster.

Okay, so it is like other cluster setups.

>> .) Maybe the communication with the componnents (hadoop, zk, ...) 
>> could
>>    be setup ed with TLS?
> 
> hbase is build on top of hadoop/hdfs. This in the "hadoop domain".
> hadoop can encrypt the transported data by TLS, can encrypt the data on
> the disc, you can use kerberos auth (but this stuff I never did) etc.
> etc.. So the answer is yes.

Thanks.

> Last remark: You seem kind of bound to PHP. The hadoop world is written
> in java. Of course there are a lot of ways to do stuff in other
> languages, over interfaces etc. But the java api is the most powerful
> and sometimes there are no other ways then to use it directly.

Currently, yes, php is the main language.
I don't know of a good php solution similar to hadoop; does anyone else
know of one?

I will take a look on

https://wiki.apache.org/hadoop/PoweredBy

to get some Ideas for a working solution.

> Best wishes,
> 
> Wilm

Thanks for your feedback.
I will dig deeper into this topic and start to set up the components step
by step.

BR Aleks

Re: Newbie Question about 37TB binary storage on HBase

Posted by Wilm Schumacher <wi...@cawoom.com>.
Hi Aleks ;),

Am 27.11.2014 um 22:27 schrieb Aleks Laz:
> Our application is a nginx/php-fpm/postgresql Setup.
> The target design is nginx + proxy features / php-fpm / $DB / $Storage.
> 
> .) Can I mix HDFS /HBase for binary data storage and data analyzing?
yes. hbase is perfect for that. For storage it will work (with the
"MOB-extension") and with map reduce you can do whatever data analyzing
you want. I assume you do some image processing with the data?!?!
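
As a concrete illustration of the write path, a minimal hedged sketch in Java: whether or not the column family later gets the MOB treatment on the server side, the client simply puts the JPEG bytes as a cell value. The table name "images", column family "d", paths and row key are all hypothetical, and the sketch uses the newer (1.x) HBase client API:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreImage {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table images = conn.getTable(TableName.valueOf("images"))) {   // hypothetical table

            byte[] jpeg = Files.readAllBytes(Paths.get("/mnt/cams/42/2014/11/19/img-0001.jpg"));
            String rowKey = "42/2014/11/19/img-0001.jpg";

            // Write: the whole JPEG becomes one cell value.
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("jpg"), jpeg);
            images.put(put);

            // Read it back.
            Result r = images.get(new Get(Bytes.toBytes(rowKey)));
            byte[] back = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("jpg"));
            System.out.println("read " + back.length + " bytes");
        }
    }
}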

> .) What is the preferred way to us HBase  with PHP?
The native client lib is in java. This is the best way to go. But if you
need only basic access from the php application, then thrift or rest
would be a good choice.

http://wiki.apache.org/hadoop/Hbase/ThriftApi
http://wiki.apache.org/hadoop/Hbase/Stargate

There are language bindings for both.
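
For the rest route, here is a minimal hedged sketch (written in Java only to keep one language in this thread; a PHP client would issue the same HTTP GET via curl or file_get_contents). It assumes a hypothetical gateway host, a hypothetical "img_index" table, and an HBase REST gateway started, for example, with `bin/hbase rest start`:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class RestGet {
    public static void main(String[] args) throws Exception {
        // Hypothetical gateway host and table; the HBase REST server
        // ("Stargate") listens on port 8080 by default. Row keys containing
        // '/' have to be URL-encoded.
        String row = URLEncoder.encode("42/2014/11/19/img-0001.jpg", "UTF-8");
        URL url = new URL("http://rest-gateway.example:8080/img_index/" + row + "/loc:file");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/octet-stream");      // ask for the raw cell value

        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) > 0; ) {
                buf.write(chunk, 0, n);
            }
            System.out.println("value: " + buf.toString("UTF-8"));
        }
    }
}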

> .) How difficult is it to use HBase with PHP?
It depends on what you are trying to do. If you just do a little
fetching, updating, inserting etc. it's pretty easy. More complicated
stuff I would do in java and expose it through a custom api in a java
service.

> .) What's a good solution for the 37 TB or the upcoming ~120 TB to
> distribute?
>    [ ] N Servers with 1 37 TB mountpoints per server?
>    [ ] N Servers with x TB mountpoints pers server?
>    [ ] other:
that's "not your business". hbase/hadoop does the trick for you. hbase
distributes the data, replicates it etc.. You will only talk to the master.

> .) Is HBase a good value for $Storage?
yes ;)

> .) Is HBase a good value for $DB?
>     DB-Size is smaller then 1 GB, I would use HBase just for HA features
>     of Hadoop.
well, the official documentation says:
»First, make sure you have enough data. If you have hundreds of millions
or billions of rows, then HBase is a good candidate. If you only have a
few thousand/million rows, then using a traditional RDBMS might be a
better choice ...«

In my experience, at around 1-10 million rows RDBMS are not really
usable anymore. But I only used small/cheap hardware ... and I don't like
RDBMS ;).

Well, you will have at least 40 million rows ... and the platform is
growing. I think SQL isn't a choice anymore. And as you have heavy reads
and only a few writes, hbase is a good fit.

> .) Due to the fact that HBase is a file-system I could use
>       /cams , for binary data
>       /DB   , for DB storage
>       /logs , for log storage
>     but is this wise. On the 'disk' they are different RAIDs.
hbase is a data store. This was probably copy pasted from the original
hadoop question ;).

> .) Should I plan a dedicated Network+Card for the 'cluster
>    communication' as for the most other cluster software?
>    From what I have read it looks not necessary but from security point
>    of view, yes.
http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/

Cloudera employees say that it wouldn't hurt if you have to push a lot
of data to the cluster.

> .) Maybe the communication with the componnents (hadoop, zk, ...) could
>    be setup ed with TLS?
hbase is built on top of hadoop/hdfs, so this is in the "hadoop domain".
hadoop can encrypt the transported data with TLS, can encrypt the data on
disk, and you can use kerberos auth (though that is something I have never
done), etc. So the answer is yes.
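
As a rough sketch of where those knobs live (property names as documented for Hadoop 2.x and HBase; exact behaviour depends on your versions, so please verify against the security guides before relying on this):

    core-site.xml:   hadoop.security.authentication = kerberos
                     hadoop.rpc.protection          = privacy   (encrypts Hadoop RPC)
    hdfs-site.xml:   dfs.encrypt.data.transfer      = true      (encrypts the HDFS data transfer path)
    hbase-site.xml:  hbase.security.authentication  = kerberos
                     hbase.rpc.protection           = privacy   (encrypts HBase RPC)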

Last remark: you seem kind of bound to PHP. The hadoop world is written
in java. Of course there are a lot of ways to do stuff in other
languages, over interfaces etc. But the java api is the most powerful,
and sometimes there is no other way than to use it directly.

Best wishes,

Wilm