Posted to user@hbase.apache.org by "Rose, Joseph" <Jo...@childrens.harvard.edu> on 2015/03/03 16:32:46 UTC

Standalone == Dev Only?

Folks,

I’m new to HBase (but not new to these sorts of data stores). I think HBase would be a good fit for a project I’m working on, except for one thing: the amount of data we’re talking about here is far smaller than what’s usually recommended for HBase. As I read the docs, though, it seems like the main argument against small datasets is replication: HDFS requires a bunch of nodes right from the start, and that’s overkill for my use.

So, what’s the motivation behind labeling standalone HBase deployments “dev only”? If all I really need is a table full of keys and all of that will fit comfortably in a single node, and if I have my own backup solution (literally, backing up the VM on which it’ll run), why bother with HDFS and distributed HBase?

(As an aside, I could go to something like Berkeley DB but then I don’t get all the nice coprocessors and filters and so on, not to mention cell-level security. Because I work with patient data the latter is definitely a huge win.)

Thanks for your help.


Joseph Rose
Intelligent Health Laboratory
Boston Children’s Hospital


Re: Standalone == Dev Only?

Posted by Michael Segel <mi...@hotmail.com>.
You’re dealing with patient data, which is either very structured or semi-structured, so you can use an RDBMS if you really think about your schema. 

If you want an RDBMS that can be used to hold objects, look at Informix’s IDS, which is now IBM’s IDS. It has the extensibility to store objects using Avro, had someone at IBM built the DataBlades. (And yes, you can do ‘cell’, aka field-level, encryption too.) 

In terms of growth, how much data? 100TB or more? That seems to be the limit of systems like Oracle’s Exadata and Vertica. Informix does allow for federated queries; however, again, YMMV depending on your schema. 

Then there are options like HBase. 
But even for small amounts of data, you’re going to need a cluster of about 5 datanodes for the region servers. And you also need to consider what you are storing and how you are storing the data. Security is another issue, and trust me, depending on your requirements it can be a real bitch. 


There are other issues too, like stability. Depending on your admin’s skill, your schema and use case… YMMV. 

But if you have your heart set on HBase, also consider Splice Machine’s add-on, which gives some relational power to HBase. 

Not all RDBMS are created equal. 
Not all No-SQL databases are created equal. 

If you need certain security, think about Accumulo, or look at MapR’s releases. 
M3 is free, and if you want something that will scale and perform, there’s MapR’s M7, aka MapR-DB.

And until any of the Apache vendors pick a hardware vendor, MapR is the only one with a TPC.org benchmark under their belt, and since they were the first, they set the bar that others have to beat.


> On Mar 6, 2015, at 4:21 PM, Stack <st...@duboce.net> wrote:
> 
> On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
> Joseph.Rose@childrens.harvard.edu> wrote:
> 
>> So, I think Nick, St.Ack and Wilm have all made some excellent points, but
>> this last email more or less hit it on the head. Like I said, I'm working
>> with patient data and while the volume is small now, it's not going to
>> stay that way. And the cell-level security is a *huge* win; I'm sure you
>> folks have some idea how happy that feature makes me. I'd also rather be
>> writing coprocessors than triggers or, heaven forbid, PL/SQL.
>> 
>> But there's another, more fundamental thing: we're exploring other DB
>> architectures because classical RDBMS systems haven't always worked out so
>> well. In fact, we're having a bit of a hard time with the current project
>> because we've been constrained (thus far) to a relational system and it
>> doesn't seem to be a clean fit. A key/val store, on the other hand, will
>> have enough flexibility to get the job done, I think. It's all being
>> prototyped now, so we'll see.
>> 
>> 
> Ok. Sounds like you know the +/-s. Was just checking.
> 
> 
> 
>> I think the final issue with hadoop-common (re: unimplemented sync for
>> local filesystems) is the one showstopper for us. We have to have assured
>> durability. I'm willing to devote some cycles to get it done, so maybe I'm
>> the one that says this problem is worthwhile.
>> 
>> 
> I remember that was once the case but looking in codebase now, sync calls
> through to ProtobufLogWriter which does a 'flush' on output (though comment
> says this is a noop). The output stream is an instance of
> FSDataOutputStream made with a RawLOS. The flush should come out here:
> 
> 220     public void flush() throws IOException { fos.flush(); }
> 
> ... where fos is an instance of FileOutputStream.
> 
> In sync we go on to call hflush which looks like it calls flush again.
> 
> What hadoop/hbase versions we talking about? HADOOP-8861 added the above
> behavior for hadoop 1.2.
> 
> Try it I'd say.
> 
> St.Ack
> 
> 
> 
> 
> 
>> Thanks for chiming in. I'd love to hear more.
>> 
>> 
>> -j
>> 
>> 
>> On 3/6/15, 3:02 PM, "Wilm Schumacher" <wi...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> On 06.03.2015 at 19:18, Stack wrote:
>>>> Why not use an RDBMS then?
>>> 
>>> When I first read the hbase documentation I also stumbled over the
>>> "only use for large datasets" or "standalone only in dev mode" etc. From
>>> my point of view there are some arguments against RDBMSs and for e.g.
>>> hbase, although we talk about a single node application.
>>> 
>>> * scalability is a future investment. Even if the dataset is small now,
>>> it doesn't mean that it will stay small in the future. Scalability in size and
>>> computing power is always a good idea.
>>> 
>>> * query language: for a user hbase is more of a database library than a
>>> "DBMS". For me this is a big plus, as it forces the user to do it the
>>> right way. Just think of SQL-injection. Or CQL-injection for that
>>> matter. Query languages are like scripting languages. Makes easy stuff
>>> easier and hard stuff harder.
>>> 
>>> * fancy features: hbase has fancy features RDBMSs don't have. E.g.
>>> coprocessors. I know that e.g. mysql has "triggers", but they are not
>>> nearly as powerful as coprocessors. And don't forget that you have to
>>> write most of the triggers in this *curse word* SQL language if you don't
>>> want to use evil hacks.
>>> 
>>> * schema-less: another HUGE plus is the possibility to use it without a
>>> fixed schema. In SQL you would need several tables and do a lot of
>>> joins. And the output is way harder to get and to parse.
>>> 
>>> * ecosystem: when you use hbase you automatically get the whole hadoop,
>>> or better apache foundation, ecosystem right away. Not only hdfs, but
>>> mapred, lucene, spark, kafka etc. etc..
>>> 
>>> There are only two real arguments against hbase in that scenario:
>>> 
>>> * joins etc.: well, in sql that's a question of minutes. In hbase that
>>> takes a little more effort. BUT: then it's done the right way ;).
>>> 
>>> * RDBMSs are more widely known: well ... that's not the fault of hbase ;).
>>> 
>>> Thus, I think that the hbase community should be more self-reliant for
>>> that matter, even and especially for applications in the SQL realm ;).
>>> Which is a good opportunity to say congratulations for the hbase 1.0
>>> milestone. And thank you for that.
>>> 
>>> Best wishes
>>> 
>>> Wilm
>>> 
>> 
>> 










Re: Standalone == Dev Only?

Posted by Michael Segel <mi...@hotmail.com>.
I guess the old adage is true. 

If you only have a hammer, then every problem looks like a nail. 
As an architect, it’s your role to find the right tools to solve the problem in the most efficient and effective manner.  
So the first question you need to ask is if HBase is the right tool. 

The OP’s project isn’t one that should be put into HBase. 
Velocity? Volume? Variety? 

These are the three aspects of Big Data and they can also be used to test if a problem should be solved using HBase. You don’t need all three, but you should have at least two of the three if you have a good candidate. 

The other thing to consider is how you plan on using the data. If you’re not using M/R or HDFS, then you don’t want to use HBase in production. 

And as a good architect, you want to take the inverse of the problem and ask why not a relational database, or an existing hierarchical database. 
(Both technologies have been around 30+ years.) And it turns out that you can use one here. 

So the OP’s problem lacks the volume. 
It also lacks the variety. 

So if we ask a simple question of how to use an RDBMS to handle this… it’s pretty straightforward. 

Store the medical record(s) in either XML or JSON format. 

On ingestion, copy out only the fields required to identify a unique record.  That’s your base record storage. 

Indexing could be done one of two ways. 
1) You could use an inverted table. 
2) You could copy out the field to be used in the index as a column and then index that column. 

If you use an inverted table, your schema design would translate into HBase. 

Then when you access the data, you use the index to find the result set and for each record, you have the JSON object that you can use as a whole or just components. 

The pattern of storing the record in a single column as a text LOB and then creating indexes to identify and locate the records isn’t new. I used it at a client site over 15 years ago for an ODS implementation. 
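
To make the inverted-table variant concrete in HBase terms, a minimal sketch against the client API might look like this (the table names, the "d" column family and the "|" row-key separator are all invented, and both tables are assumed to already exist):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InvertedIndexSketch {
      static final byte[] F = Bytes.toBytes("d"); // hypothetical column family

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table records = conn.getTable(TableName.valueOf("records"));
             Table index = conn.getTable(TableName.valueOf("records_by_mrn"))) {

          // Base record: the whole document as a JSON LOB in a single cell.
          String recordId = "r0001";
          Put rec = new Put(Bytes.toBytes(recordId));
          rec.addColumn(F, Bytes.toBytes("json"),
              Bytes.toBytes("{\"mrn\":\"12345\",\"dx\":\"...\"}"));
          records.put(rec);

          // Inverted index row: <indexed field value>|<record id>.
          Put idx = new Put(Bytes.toBytes("12345|" + recordId));
          idx.addColumn(F, Bytes.toBytes("ref"), Bytes.toBytes(recordId));
          index.put(idx);

          // Lookup: range-scan the index on the field value ('}' is the next
          // byte after '|'), then fetch each JSON object whole or in part.
          Scan scan = new Scan(Bytes.toBytes("12345|"), Bytes.toBytes("12345}"));
          try (ResultScanner hits = index.getScanner(scan)) {
            for (Result hit : hits) {
              byte[] ref = hit.getValue(F, Bytes.toBytes("ref"));
              Result record = records.get(new Get(ref));
            }
          }
        }
      }
    }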

In terms of HBase… 
Stability depends on the hardware, the admin and the use cases. It’s still relatively unstable; in most cases nowhere near four nines. 

Considering that there are also regulatory compliance issues… e.g. security… this alone will rule HBase out in a standalone situation, and again, even with Kerberos implemented, you may not meet your security requirements. 

Bottom line, the OP is going to do what he’s going to do. All I can do is tell him it’s not a good idea, and why. 

This email thread is great column fodder for a blog, as well as for a presentation on why/why not HBase and Hadoop. It’s something that should be included in a design lecture or lectures, but unfortunately, most of the larger conferences are driven by the vendors, who have their own agendas and slots that they want to fill with marketing talks. 

BTW, I am really curious: if the OP is using a standalone instance of HBase, how does the immature HDFS encryption help secure his data?  ;-) 

HTH

-Mike


> On Mar 13, 2015, at 3:44 PM, Sean Busbey <bu...@cloudera.com> wrote:
> 
> On Fri, Mar 13, 2015 at 2:41 PM, Michael Segel <mi...@hotmail.com>
> wrote:
> 
>> 
>> In standalone mode, you’re writing to local disk. You lose the disk, you lose
>> the data, unless of course you’ve raided your drives.
>> Then when you lose the node, you lose the data because it’s not being
>> replicated. While this may not be a major issue or concern… you have to be
>> aware of its potential.
>> 
>> 
> It sounds like he has this issue covered via VM imaging.
> 
> 
> 
>> The other issue is security: HBase relies on the cluster’s
>> security.
>> To be clear, HBase relies on the cluster and the use of Kerberos to help
>> with authentication, so that only those who have the rights to see the
>> data can actually have access to it.
>> 
>> 
> 
> He can get around this by relying on the Thrift or REST services to act as
> an arbitrator, or he could make his own. So long as he separates access to
> the underlying cluster / hbase apis from whatever is exposing the data,
> this shouldn't be a problem.
> 
> 
> 
>> Then you have to worry about auditing. With respect to HBase, out of the
>> box, you don’t have any auditing.
>> 
>> 
> 
> HBase has auditing. By default it is disabled and it certainly could use
> some improvement. Documentation would be a good start. I'm sure the
> community would be happy to work with Joseph to close whatever gap he needs.
> 
> 
> 
> 
>> You also don’t have built in encryption.
>> You can do it, but then you have a bit of work ahead of you.
>> Cell level encryption? Accumulo?
>> 
>> 
> HBase has had encryption since the 0.98 line. It is stable now in the
> 1.0 release line. HDFS also supports encryption, though I'm sure using it
> with the LocalFileSystem would benefit from testing. There are vendors that
> can help with integration with proper key servers, if that is something
> Joseph needs and doesn't want to do on his own.
> 
> Accumulo does not do cell level encryption.
> 
> 
> 
>> There’s definitely more to it.
>> 
>> But the one killer thing… you need to be HIPAA compliant and the simplest
>> way to do this is to use a real RDBMS. If you need extensibility, look at
>> IDS from IBM (IBM bought Informix ages ago.)
>> 
>> I think based on the size of your data… you can get away with the free
>> version, and even if not, IBM does do discounts with universities and could
>> even sponsor research projects.
>> 
>> I don’t know your data, but 10^6 rows is still small.
>> 
>> The point I’m trying to make is that based on what you’ve said, HBase is
>> definitely not the right database for you.
>> 
>> 
> We haven't heard what the target data set size is. If Joseph has reason to
> believe that it will be big enough to warrant something like HBase (e.g.
> 10s of billions of rows), I think there's merit to his argument for
> starting with HBase. Single node use cases are definitely not something
> we've covered well to date, but it would probably help our overall
> usability story to do so.
> 
> 
> -- 
> Sean







Re: Standalone == Dev Only?

Posted by Sean Busbey <bu...@cloudera.com>.
On Fri, Mar 13, 2015 at 2:41 PM, Michael Segel <mi...@hotmail.com>
wrote:

>
> In standalone mode, you’re writing to local disk. You lose the disk, you lose
> the data, unless of course you’ve raided your drives.
> Then when you lose the node, you lose the data because it’s not being
> replicated. While this may not be a major issue or concern… you have to be
> aware of its potential.
>
>
It sounds like he has this issue covered via VM imaging.



> The other issue is security: HBase relies on the cluster’s
> security.
> To be clear, HBase relies on the cluster and the use of Kerberos to help
> with authentication, so that only those who have the rights to see the
> data can actually have access to it.
>
>

He can get around this by relying on the Thrift or REST services to act as
an arbitrator, or he could make his own. So long as he separates access to
the underlying cluster / hbase apis from whatever is exposing the data,
this shouldn't be a problem.
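
A rough sketch of that separation, using the stock REST gateway and its bundled Java client (the gateway host, port and table name are made up, and whatever authenticates callers in front of the gateway is left out):

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.rest.client.Client;
    import org.apache.hadoop.hbase.rest.client.Cluster;
    import org.apache.hadoop.hbase.rest.client.RemoteHTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RestGatewaySketch {
      public static void main(String[] args) throws Exception {
        // The application only ever talks to the REST gateway; direct
        // cluster access (and its credentials) stays behind the arbitrator.
        Cluster cluster = new Cluster();
        cluster.add("hbase-rest.example.org", 8080);
        Client client = new Client(cluster);
        RemoteHTable table = new RemoteHTable(client, "records");
        try {
          Result r = table.get(new Get(Bytes.toBytes("r0001")));
          // ... per-user authorization happens in, or in front of, the gateway
        } finally {
          table.close();
        }
      }
    }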



> Then you have to worry about auditing. With respect to HBase, out of the
> box, you don’t have any auditing.
>
>

HBase has auditing. By default it is disabled and it certainly could use
some improvement. Documentation would be a good start. I'm sure the
community would be happy to work with Joseph to close whatever gap he needs.
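
For reference, the audit trail that does exist flows through the AccessController's SecurityLogger; a sketch of switching it on in HBase's log4j.properties might be:

    # Hypothetical log4j.properties fragment: emit access-control audit
    # events at TRACE into their own rolling file.
    log4j.logger.SecurityLogger.org.apache.hadoop.hbase.security.access.AccessController=TRACE, AUDIT
    log4j.additivity.SecurityLogger.org.apache.hadoop.hbase.security.access.AccessController=false
    log4j.appender.AUDIT=org.apache.log4j.RollingFileAppender
    log4j.appender.AUDIT.File=${hbase.log.dir}/audit.log
    log4j.appender.AUDIT.layout=org.apache.log4j.PatternLayout
    log4j.appender.AUDIT.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n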




> You also don’t have built in encryption.
> You can do it, but then you have a bit of work ahead of you.
> Cell level encryption? Accumulo?
>
>
HBase has had encryption since the 0.98 line. It is stable now in the
1.0 release line. HDFS also supports encryption, though I'm sure using it
with the LocalFileSystem would benefit from testing. There are vendors that
can help with integration with proper key servers, if that is something
Joseph needs and doesn't want to do on his own.

Accumulo does not do cell level encryption.
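
For the curious, a sketch of what enabling HBase's transparent encryption involves in hbase-site.xml (the keystore path, password and key alias are placeholders):

    <property>
      <name>hbase.crypto.keyprovider</name>
      <value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
    </property>
    <property>
      <name>hbase.crypto.keyprovider.parameters</name>
      <value>jceks:///etc/hbase/conf/hbase.jks?password=changeit</value>
    </property>
    <property>
      <name>hbase.crypto.master.key.name</name>
      <value>hbase</value>
    </property>

Column families then opt in individually, e.g. alter 't', {NAME => 'f', ENCRYPTION => 'AES'} from the shell.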



> There’s definitely more to it.
>
> But the one killer thing… you need to be HIPAA compliant and the simplest
> way to do this is to use a real RDBMS. If you need extensibility, look at
> IDS from IBM (IBM bought Informix ages ago.)
>
> I think based on the size of your data… you can get away with the free
> version, and even if not, IBM does do discounts with universities and could
> even sponsor research projects.
>
> I don’t know your data, but 10^6 rows is still small.
>
> The point I’m trying to make is that based on what you’ve said, HBase is
> definitely not the right database for you.
>
>
We haven't heard what the target data set size is. If Joseph has reason to
believe that it will be big enough to warrant something like HBase (e.g.
10s of billions of rows), I think there's merit to his argument for
starting with HBase. Single node use cases are definitely not something
we've covered well to date, but it would probably help our overall
usability story to do so.


-- 
Sean

Re: Standalone == Dev Only?

Posted by Michael Segel <mi...@hotmail.com>.
Joseph, 

In standalone mode, you’re writing to local disk. You lose the disk, you lose the data, unless of course you’ve raided your drives. 
Then when you lose the node, you lose the data because it’s not being replicated. While this may not be a major issue or concern… you have to be aware of its potential. 

The other issue is security: HBase relies on the cluster’s security. 
To be clear, HBase relies on the cluster and the use of Kerberos to help with authentication, so that only those who have the rights to see the data can actually have access to it. 

Then you have to worry about auditing. With respect to HBase, out of the box, you don’t have any auditing. 

With respect to stability,  YMMV.  HBase is only as stable as the admin. 

You also don’t have built in encryption.  
You can do it, but then you have a bit of work ahead of you. 
Cell level encryption? Accumulo?

There’s definitely more to it. 

But the one killer thing… you need to be HIPAA compliant and the simplest way to do this is to use a real RDBMS. If you need extensibility, look at IDS from IBM (IBM bought Informix ages ago.) 

I think based on the size of your data… you can get away with the free version, and even if not, IBM does do discounts with universities and could even sponsor research projects. 

I don’t know your data, but 10^6 rows is still small.  

The point I’m trying to make is that based on what you’ve said, HBase is definitely not the right database for you. 


> On Mar 13, 2015, at 1:56 PM, Rose, Joseph <Jo...@childrens.harvard.edu> wrote:
> 
> Michael,
> 
> Thanks for your concern. Let me ask a few questions, since you’re implying
> that HDFS is the only way to reduce risk and ensure security, which is not
> the assumption under which I’ve been working.
> 
> A brief rundown of our problem’s characteristics, since I haven’t really
> described what we’re doing:
> * We’re read heavy, write light. It’s likely we’ll do a large import of
> the data and update less than 0.1% per day.
> * The dataset isn’t huge, at the moment (it will likely become huge in the
> future.) If I were to go the RDBMS route I’d guess it could all fit on a
> dual core i5 machine with 2G memory and a quarter terabyte disk — and that
> might be over spec’d. What we’re doing is functional and solves a certain
> problem but is also a prototype for a much larger dataset.
> * We do need security, you’re absolutely right, and the data is subject to
> HIPAA.
> * Availability should be good but we don’t have to go overboard. A couple
> of nines would be just fine.
> * We plan on running this on a fairly small VM. The VM will be backed up
> nightly.
> 
> So, with that in mind, let me make sure I’ve got this right.
> 
> Your main points were data loss and security. As I understand it, HDFS
> might be the right choice for dozens of terabytes to petabyte scale (where
> it effectively becomes impossible to do a clean backup, since the odds of
> an undetected, hardware-level error during replication are not
> insignificant, even if you can find enough space.) But we’re talking gigs
> — easily & reliably replicated (I do it on my home machine all the time.)
> And since it looks like HBase has a stable file system after committing
> mutations, shutting down changes, doing a backup & re-enabling mutations
> seem like a fine choice. Do you see a hole with this approach?
> 
> As for security, and as I understand it, HBase’s security model, both for
> tagging and encryption, is built into the database layer, not HDFS. We
> very much want cell-level security with roles (because HIPAA) and
> encryption (also because HIPAA) but I don’t think that has anything to do
> with the underlying filesystem. Again, is there something here I’ve missed?
> 
> When we get to 10^6+ rows we will probably build out a small cluster.
> We’re well below that threshold at the moment but will get there soon
> enough.
> 
> 
> -j
> 
> 
> On 3/13/15, 1:46 PM, "Michael Segel" <mi...@hotmail.com> wrote:
> 
>> Guys, 
>> 
>> More than just needing some love.
>> No HDFS… means data at risk.
>> No HDFS… means that standalone will have security issues.
>> 
>> Patient Data? HINT: HIPAA.
>> 
>> Please think your design through and if you go w HBase… you will want to
>> build out a small cluster.
>> 
>>> On Mar 10, 2015, at 6:16 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>>> 
>>> As Stack and Andrew said, just wanted to give you fair warning that this
>>> mode may need some love. Likewise, there are probably alternatives that
>>> run
>>> a bit lighter weight, though you flatter us with the reminder of the
>>> long
>>> feature list.
>>> 
>>> I have no problem with helping to fix and committing fixes to bugs that
>>> crop up in local mode operations. Bring 'em on!
>>> 
>>> -n
>>> 
>>> On Tue, Mar 10, 2015 at 3:56 PM, Alex Baranau <al...@gmail.com>
>>> wrote:
>>> 
>>>> On:
>>>> 
>>>> - Future investment in a design that scales better
>>>> 
>>>> Indeed, designing against a key value store is different from designing
>>>> against an RDBMS.
>>>> 
>>>> I wonder if you explored an option to abstract the storage layer and
>>>> use a
>>>> "single node purposed" store until you grow enough to switch to another
>>>> one?
>>>> 
>>>> E.g. you could use LevelDB [1] that is pretty fast (and there's a java
>>>> rewrite of it, if you need java APIs [2]). We use it in CDAP [3] in a
>>>> standalone version to make the development environment (SDK) lighter.
>>>> We
>>>> swap it with HBase in distributed mode without changing the application
>>>> code. It doesn't have coprocessors and other specific to HBase
>>>> features you
>>>> are talking about, though. But you can figure out how to bridge client
>>>> APIs
>>>> with an abstraction layer (e.g. we have common Table interface [4]).
>>>> You
>>>> can even add versions on cells (see [5] for example of how we do it).
>>>> 
>>>> Also, you could use an RDBMS behind the key-value abstraction, to start with,
>>>> while keeping your app design clean of RDBMS specifics.
>>>> 
>>>> Alex Baranau
>>>> 
>>>> [1] https://github.com/google/leveldb
>>>> [2] https://github.com/dain/leveldb
>>>> [3] http://cdap.io
>>>> [4]
>>>> https://github.com/caskdata/cdap/blob/develop/cdap-api/src/main/java/co/cask/cdap/api/dataset/table/Table.java
>>>> [5]
>>>> https://github.com/caskdata/cdap/blob/develop/cdap-data-fabric/src/main/java/co/cask/cdap/data2/dataset2/lib/table/leveldb/LevelDBTableCore.java
>>>> 
>>>> --
>>>> http://cdap.io - open source framework to build and run data applications
>>>> on Hadoop & HBase
>>>> 
>>>> On Tue, Mar 10, 2015 at 8:42 AM, Rose, Joseph <
>>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>>> 
>>>>> Sorry, never answered your question about versions. I have 1.0.0
>>>>> version
>>>>> of hbase, which has hadoop-common 2.5.1 in its lib folder.
>>>>> 
>>>>> 
>>>>> -j
>>>>> 
>>>>> 
>>>>> On 3/10/15, 11:36 AM, "Rose, Joseph"
>>>>> <Jo...@childrens.harvard.edu>
>>>>> wrote:
>>>>> 
>>>>>> I tried it and it does work now. It looks like the interface for
>>>>>> hadoop.fs.Syncable changed in March, 2012 to remove the deprecated
>>>> sync()
>>>>>> method and define only hsync() instead. The same committer did the
>>>>>> right
>>>>>> thing and removed sync() from FSDataOutputStream at the same time.
>>>>>> The
>>>>>> remaining hsync() method calls flush() if the underlying stream
>>>>>> doesn't
>>>>>> implement Syncable.
>>>>>> 
>>>>>> 
>>>>>> -j
>>>>>> 
>>>>>> 
>>>>>> On 3/6/15, 5:24 PM, "Stack" <st...@duboce.net> wrote:
>>>>>> 
>>>>>>> On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
>>>>>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>>>>>> 
>>>>>>>> I think the final issue with hadoop-common (re: unimplemented sync
>>>> for
>>>>>>>> local filesystems) is the one showstopper for us. We have to have
>>>>>>>> assured
>>>>>>>> durability. I'm willing to devote some cycles to get it done, so
>>>> maybe
>>>>>>>> I'm
>>>>>>>> the one that says this problem is worthwhile.
>>>>>>>> 
>>>>>>>> 
>>>>>>> I remember that was once the case but looking in codebase now, sync
>>>> calls
>>>>>>> through to ProtobufLogWriter which does a 'flush' on output (though
>>>>>>> comment
>>>>>>> says this is a noop). The output stream is an instance of
>>>>>>> FSDataOutputStream made with a RawLOS. The flush should come out
>>>>>>> here:
>>>>>>> 
>>>>>>> 220     public void flush() throws IOException { fos.flush(); }
>>>>>>> 
>>>>>>> ... where fos is an instance of FileOutputStream.
>>>>>>> 
>>>>>>> In sync we go on to call hflush which looks like it calls flush
>>>>>>> again.
>>>>>>> 
>>>>>>> What hadoop/hbase versions we talking about? HADOOP-8861 added the
>>>> above
>>>>>>> behavior for hadoop 1.2.
>>>>>>> 
>>>>>>> Try it I'd say.
>>>>>>> 
>>>>>>> St.Ack
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 







Re: Standalone == Dev Only?

Posted by "Rose, Joseph" <Jo...@childrens.harvard.edu>.
Michael,

Thanks for your concern. Let me ask a few questions, since you’re implying
that HDFS is the only way to reduce risk and ensure security, which is not
the assumption under which I’ve been working.

A brief rundown of our problem’s characteristics, since I haven’t really
described what we’re doing:
* We’re read heavy, write light. It’s likely we’ll do a large import of
the data and update less than 0.1% per day.
* The dataset isn’t huge, at the moment (it will likely become huge in the
future.) If I were to go the RDBMS route I’d guess it could all fit on a
dual core i5 machine with 2G memory and a quarter terabyte disk — and that
might be over spec’d. What we’re doing is functional and solves a certain
problem but is also a prototype for a much larger dataset.
* We do need security, you’re absolutely right, and the data is subject to
HIPAA.
* Availability should be good but we don’t have to go overboard. A couple
of nines would be just fine.
* We plan on running this on a fairly small VM. The VM will be backed up
nightly.

So, with that in mind, let me make sure I’ve got this right.

Your main points were data loss and security. As I understand it, HDFS
might be the right choice for dozens of terabytes to petabyte scale (where
it effectively becomes impossible to do a clean backup, since the odds of
an undetected, hardware-level error during replication are not
insignificant, even if you can find enough space.) But we’re talking gigs
— easily & reliably replicated (I do it on my home machine all the time.)
And since it looks like HBase has a stable file system after committing
mutations, shutting down changes, doing a backup & re-enabling mutations
seem like a fine choice. Do you see a hole with this approach?
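
The quiesce-and-back-up sequence described in that paragraph maps onto the Admin API roughly like this (the table and snapshot names are invented):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class QuiesceAndBackup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName t = TableName.valueOf("records"); // hypothetical table
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          admin.flush(t);                       // commit in-memory mutations to files
          admin.disableTable(t);                // shut down changes
          admin.snapshot("records-nightly", t); // consistent point-in-time marker
          // ... the external VM/file backup runs here ...
          admin.enableTable(t);                 // re-enable mutations
        }
      }
    }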

As for security, and as I understand it, HBase’s security model, both for
tagging and encryption, is built into the database layer, not HDFS. We
very much want cell-level security with roles (because HIPAA) and
encryption (also because HIPAA) but I don’t think that has anything to do
with the underlying filesystem. Again, is there something here I’ve missed?
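
On the tagging side, a sketch with the visibility labels API from 0.98+ (the expression, row and names are invented, and the labels must first be defined and granted to users, e.g. via VisibilityClient):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.security.visibility.Authorizations;
    import org.apache.hadoop.hbase.security.visibility.CellVisibility;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VisibilitySketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("records"))) {
          // Tag the cell with a visibility expression on write.
          Put put = new Put(Bytes.toBytes("r0001"));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("dx"), Bytes.toBytes("..."));
          put.setCellVisibility(new CellVisibility("PHI&(DOCTOR|NURSE)"));
          table.put(put);

          // Readers only see the cell if their authorizations satisfy it.
          Get get = new Get(Bytes.toBytes("r0001"));
          get.setAuthorizations(new Authorizations("PHI", "DOCTOR"));
          Result visible = table.get(get);
        }
      }
    }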

When we get to 10^6+ rows we will probably build out a small cluster.
We’re well below that threshold at the moment but will get there soon
enough.


-j


On 3/13/15, 1:46 PM, "Michael Segel" <mi...@hotmail.com> wrote:

>Guys, 
>
>More than just needing some love.
>No HDFS… means data at risk.
>No HDFS… means that standalone will have security issues.
>
>Patient Data? HINT: HIPAA.
>
>Please think your design through and if you go w HBase… you will want to
>build out a small cluster.
>
>> On Mar 10, 2015, at 6:16 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>> 
>> As Stack and Andrew said, just wanted to give you fair warning that this
>> mode may need some love. Likewise, there are probably alternatives that
>>run
>> a bit lighter weight, though you flatter us with the reminder of the
>>long
>> feature list.
>> 
>> I have no problem with helping to fix and committing fixes to bugs that
>> crop up in local mode operations. Bring 'em on!
>> 
>> -n
>> 
>> On Tue, Mar 10, 2015 at 3:56 PM, Alex Baranau <al...@gmail.com>
>> wrote:
>> 
>>> On:
>>> 
>>> - Future investment in a design that scales better
>>> 
>>> Indeed, designing against a key value store is different from designing
>>> against an RDBMS.
>>> 
>>> I wonder if you explored an option to abstract the storage layer and
>>>use a
>>> "single node purposed" store until you grow enough to switch to another
>>> one?
>>> 
>>> E.g. you could use LevelDB [1] that is pretty fast (and there's a java
>>> rewrite of it, if you need java APIs [2]). We use it in CDAP [3] in a
>>> standalone version to make the development environment (SDK) lighter.
>>>We
>>> swap it with HBase in distributed mode without changing the application
>>> code. It doesn't have coprocessors and other specific to HBase
>>>features you
>>> are talking about, though. But you can figure out how to bridge client
>>>APIs
>>> with an abstraction layer (e.g. we have common Table interface [4]).
>>>You
>>> can even add versions on cells (see [5] for example of how we do it).
>>> 
>>> Also, you could use an RDBMS behind the key-value abstraction, to start with,
>>> while keeping your app design clean of RDBMS specifics.
>>> 
>>> Alex Baranau
>>> 
>>> [1] https://github.com/google/leveldb
>>> [2] https://github.com/dain/leveldb
>>> [3] http://cdap.io
>>> [4]
>>> https://github.com/caskdata/cdap/blob/develop/cdap-api/src/main/java/co/cask/cdap/api/dataset/table/Table.java
>>> [5]
>>> https://github.com/caskdata/cdap/blob/develop/cdap-data-fabric/src/main/java/co/cask/cdap/data2/dataset2/lib/table/leveldb/LevelDBTableCore.java
>>> 
>>> --
>>> http://cdap.io - open source framework to build and run data applications
>>> on Hadoop & HBase
>>> 
>>> On Tue, Mar 10, 2015 at 8:42 AM, Rose, Joseph <
>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>> 
>>>> Sorry, never answered your question about versions. I have 1.0.0
>>>>version
>>>> of hbase, which has hadoop-common 2.5.1 in its lib folder.
>>>> 
>>>> 
>>>> -j
>>>> 
>>>> 
>>>> On 3/10/15, 11:36 AM, "Rose, Joseph"
>>>><Jo...@childrens.harvard.edu>
>>>> wrote:
>>>> 
>>>>> I tried it and it does work now. It looks like the interface for
>>>>> hadoop.fs.Syncable changed in March, 2012 to remove the deprecated
>>> sync()
>>>>> method and define only hsync() instead. The same committer did the
>>>>>right
>>>>> thing and removed sync() from FSDataOutputStream at the same time.
>>>>>The
>>>>> remaining hsync() method calls flush() if the underlying stream
>>>>>doesn't
>>>>> implement Syncable.
>>>>> 
>>>>> 
>>>>> -j
>>>>> 
>>>>> 
>>>>> On 3/6/15, 5:24 PM, "Stack" <st...@duboce.net> wrote:
>>>>> 
>>>>>> On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
>>>>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>>>>> 
>>>>>>> I think the final issue with hadoop-common (re: unimplemented sync
>>> for
>>>>>>> local filesystems) is the one showstopper for us. We have to have
>>>>>>> assured
>>>>>>> durability. I'm willing to devote some cycles to get it done, so
>>> maybe
>>>>>>> I'm
>>>>>>> the one that says this problem is worthwhile.
>>>>>>> 
>>>>>>> 
>>>>>> I remember that was once the case but looking in codebase now, sync
>>> calls
>>>>>> through to ProtobufLogWriter which does a 'flush' on output (though
>>>>>> comment
>>>>>> says this is a noop). The output stream is an instance of
>>>>>> FSDataOutputStream made with a RawLOS. The flush should come out
>>>>>>here:
>>>>>> 
>>>>>> 220     public void flush() throws IOException { fos.flush(); }
>>>>>> 
>>>>>> ... where fos is an instance of FileOutputStream.
>>>>>> 
>>>>>> In sync we go on to call hflush which looks like it calls flush
>>>>>>again.
>>>>>> 
>>>>>> What hadoop/hbase versions we talking about? HADOOP-8861 added the
>>> above
>>>>>> behavior for hadoop 1.2.
>>>>>> 
>>>>>> Try it I'd say.
>>>>>> 
>>>>>> St.Ack
>>>>> 
>>>> 
>>>> 
>>> 
>


Re: Standalone == Dev Only?

Posted by Michael Segel <mi...@hotmail.com>.
Guys, 

More than just needing some love. 
No HDFS… means data at risk. 
No HDFS… means that standalone will have security issues. 

Patient Data? HINT: HIPAA.

Please think your design through and if you go w HBase… you will want to build out a small cluster. 

> On Mar 10, 2015, at 6:16 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> 
> As Stack and Andrew said, just wanted to give you fair warning that this
> mode may need some love. Likewise, there are probably alternatives that run
> a bit lighter weight, though you flatter us with the reminder of the long
> feature list.
> 
> I have no problem with helping to fix and committing fixes to bugs that
> crop up in local mode operations. Bring 'em on!
> 
> -n
> 
> On Tue, Mar 10, 2015 at 3:56 PM, Alex Baranau <al...@gmail.com>
> wrote:
> 
>> On:
>> 
>> - Future investment in a design that scales better
>> 
>> Indeed, designing against a key value store is different from designing
>> against an RDBMS.
>> 
>> I wonder if you explored an option to abstract the storage layer and use a
>> "single node purposed" store until you grow enough to switch to another
>> one?
>> 
>> E.g. you could use LevelDB [1] that is pretty fast (and there's a java
>> rewrite of it, if you need java APIs [2]). We use it in CDAP [3] in a
>> standalone version to make the development environment (SDK) lighter. We
>> swap it with HBase in distributed mode without changing the application
>> code. It doesn't have coprocessors and other specific to HBase features you
>> are talking about, though. But you can figure out how to bridge client APIs
>> with an abstraction layer (e.g. we have common Table interface [4]). You
>> can even add versions on cells (see [5] for example of how we do it).
>> 
>> Also, you could use an RDBMS behind the key-value abstraction, to start with,
>> while keeping your app design clean of RDBMS specifics.
>> 
>> Alex Baranau
>> 
>> [1] https://github.com/google/leveldb
>> [2] https://github.com/dain/leveldb
>> [3] http://cdap.io
>> [4]
>> 
>> https://github.com/caskdata/cdap/blob/develop/cdap-api/src/main/java/co/cask/cdap/api/dataset/table/Table.java
>> [5]
>> 
>> https://github.com/caskdata/cdap/blob/develop/cdap-data-fabric/src/main/java/co/cask/cdap/data2/dataset2/lib/table/leveldb/LevelDBTableCore.java
>> 
>> --
>> http://cdap.io - open source framework to build and run data applications
>> on Hadoop & HBase
>> 
>> On Tue, Mar 10, 2015 at 8:42 AM, Rose, Joseph <
>> Joseph.Rose@childrens.harvard.edu> wrote:
>> 
>>> Sorry, never answered your question about versions. I have 1.0.0 version
>>> of hbase, which has hadoop-common 2.5.1 in its lib folder.
>>> 
>>> 
>>> -j
>>> 
>>> 
>>> On 3/10/15, 11:36 AM, "Rose, Joseph" <Jo...@childrens.harvard.edu>
>>> wrote:
>>> 
>>>> I tried it and it does work now. It looks like the interface for
>>>> hadoop.fs.Syncable changed in March, 2012 to remove the deprecated
>> sync()
>>>> method and define only hsync() instead. The same committer did the right
>>>> thing and removed sync() from FSDataOutputStream at the same time. The
>>>> remaining hsync() method calls flush() if the underlying stream doesn't
>>>> implement Syncable.
>>>> 
>>>> 
>>>> -j
>>>> 
>>>> 
>>>> On 3/6/15, 5:24 PM, "Stack" <st...@duboce.net> wrote:
>>>> 
>>>>> On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
>>>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>>>> 
>>>>>> I think the final issue with hadoop-common (re: unimplemented sync
>> for
>>>>>> local filesystems) is the one showstopper for us. We have to have
>>>>>> assured
>>>>>> durability. I'm willing to devote some cycles to get it done, so
>> maybe
>>>>>> I'm
>>>>>> the one that says this problem is worthwhile.
>>>>>> 
>>>>>> 
>>>>> I remember that was once the case but looking in codebase now, sync
>> calls
>>>>> through to ProtobufLogWriter which does a 'flush' on output (though
>>>>> comment
>>>>> says this is a noop). The output stream is an instance of
>>>>> FSDataOutputStream made with a RawLOS. The flush should come out here:
>>>>> 
>>>>> 220     public void flush() throws IOException { fos.flush(); }
>>>>> 
>>>>> ... where fos is an instance of FileOutputStream.
>>>>> 
>>>>> In sync we go on to call hflush which looks like it calls flush again.
>>>>> 
>>>>> What hadoop/hbase versions we talking about? HADOOP-8861 added the
>> above
>>>>> behavior for hadoop 1.2.
>>>>> 
>>>>> Try it I'd say.
>>>>> 
>>>>> St.Ack
>>>> 
>>> 
>>> 
>> 







Re: Standalone == Dev Only?

Posted by Nick Dimiduk <nd...@gmail.com>.
As Stack and Andrew said, just wanted to give you fair warning that this
mode may need some love. Likewise, there are probably alternatives that run
a bit lighter weight, though you flatter us with the reminder of the long
feature list.

I have no problem with helping to fix and committing fixes to bugs that
crop up in local mode operations. Bring 'em on!

-n

On Tue, Mar 10, 2015 at 3:56 PM, Alex Baranau <al...@gmail.com>
wrote:

> On:
>
> - Future investment in a design that scales better
>
> Indeed, designing against a key value store is different from designing
> against an RDBMS.
>
> I wonder if you explored an option to abstract the storage layer and use a
> "single node purposed" store until you grow enough to switch to another
> one?
>
> E.g. you could use LevelDB [1] that is pretty fast (and there's a java
> rewrite of it, if you need java APIs [2]). We use it in CDAP [3] in a
> standalone version to make the development environment (SDK) lighter. We
> swap it with HBase in distributed mode without changing the application
> code. It doesn't have coprocessors and other specific to HBase features you
> are talking about, though. But you can figure out how to bridge client APIs
> with an abstraction layer (e.g. we have common Table interface [4]). You
> can even add versions on cells (see [5] for example of how we do it).
>
> Also, you could use an RDBMS behind the key-value abstraction, to start with,
> while keeping your app design clean of RDBMS specifics.
>
> Alex Baranau
>
> [1] https://github.com/google/leveldb
> [2] https://github.com/dain/leveldb
> [3] http://cdap.io
> [4]
>
> https://github.com/caskdata/cdap/blob/develop/cdap-api/src/main/java/co/cask/cdap/api/dataset/table/Table.java
> [5]
>
> https://github.com/caskdata/cdap/blob/develop/cdap-data-fabric/src/main/java/co/cask/cdap/data2/dataset2/lib/table/leveldb/LevelDBTableCore.java
>
> --
> http://cdap.io - open source framework to build and run data applications
> on Hadoop & HBase
>
> On Tue, Mar 10, 2015 at 8:42 AM, Rose, Joseph <
> Joseph.Rose@childrens.harvard.edu> wrote:
>
> > Sorry, never answered your question about versions. I have 1.0.0 version
> > of hbase, which has hadoop-common 2.5.1 in its lib folder.
> >
> >
> > -j
> >
> >
> > On 3/10/15, 11:36 AM, "Rose, Joseph" <Jo...@childrens.harvard.edu>
> > wrote:
> >
> > >I tried it and it does work now. It looks like the interface for
> > >hadoop.fs.Syncable changed in March, 2012 to remove the deprecated
> sync()
> > >method and define only hsync() instead. The same committer did the right
> > >thing and removed sync() from FSDataOutputStream at the same time. The
> > >remaining hsync() method calls flush() if the underlying stream doesn't
> > >implement Syncable.
> > >
> > >
> > >-j
> > >
> > >
> > >On 3/6/15, 5:24 PM, "Stack" <st...@duboce.net> wrote:
> > >
> > >>On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
> > >>Joseph.Rose@childrens.harvard.edu> wrote:
> > >>
> > >>> I think the final issue with hadoop-common (re: unimplemented sync
> for
> > >>> local filesystems) is the one showstopper for us. We have to have
> > >>>assured
> > >>> durability. I'm willing to devote some cycles to get it done, so
> maybe
> > >>>I'm
> > >>> the one that says this problem is worthwhile.
> > >>>
> > >>>
> > >>I remember that was once the case but looking in codebase now, sync
> calls
> > >>through to ProtobufLogWriter which does a 'flush' on output (though
> > >>comment
> > >>says this is a noop). The output stream is an instance of
> > >>FSDataOutputStream made with a RawLOS. The flush should come out here:
> > >>
> > >>220     public void flush() throws IOException { fos.flush(); }
> > >>
> > >>... where fos is an instance of FileOutputStream.
> > >>
> > >>In sync we go on to call hflush which looks like it calls flush again.
> > >>
> > >>What hadoop/hbase versions we talking about? HADOOP-8861 added the
> above
> > >>behavior for hadoop 1.2.
> > >>
> > >>Try it I'd say.
> > >>
> > >>St.Ack
> > >
> >
> >
>

Re: Standalone == Dev Only?

Posted by Alex Baranau <al...@gmail.com>.
On:

- Future investment in a design that scales better

Indeed, designing against a key value store is different from designing
against an RDBMS.

I wonder if you explored an option to abstract the storage layer and use a
"single node purposed" store until you grow enough to switch to another one?

E.g. you could use LevelDB [1] that is pretty fast (and there's a java
rewrite of it, if you need java APIs [2]). We use it in CDAP [3] in a
standalone version to make the development environment (SDK) lighter. We
swap it with HBase in distributed mode without changing the application
code. It doesn't have coprocessors and other specific to HBase features you
are talking about, though. But you can figure out how to bridge client APIs
with an abstraction layer (e.g. we have common Table interface [4]). You
can even add versions on cells (see [5] for example of how we do it).

Also, you could use an RDBMS behind the key-value abstraction, to start with,
while keeping your app design clean of RDBMS specifics.
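
One way to picture such an abstraction layer, sketched with invented names (this is not CDAP's actual API):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    // The app codes against this; the binding (LevelDB, HBase, RDBMS) is a
    // deployment choice.
    interface KeyValueStore extends AutoCloseable {
      byte[] get(byte[] row, byte[] column) throws Exception;
      void put(byte[] row, byte[] column, byte[] value) throws Exception;
    }

    // HBase binding: delegates to the client Table with a fixed column family.
    class HBaseStore implements KeyValueStore {
      private static final byte[] FAMILY = Bytes.toBytes("d");
      private final Table table;
      HBaseStore(Connection conn, String name) throws Exception {
        this.table = conn.getTable(TableName.valueOf(name));
      }
      public byte[] get(byte[] row, byte[] column) throws Exception {
        return table.get(new Get(row)).getValue(FAMILY, column);
      }
      public void put(byte[] row, byte[] column, byte[] value) throws Exception {
        Put p = new Put(row);
        p.addColumn(FAMILY, column, value);
        table.put(p);
      }
      public void close() throws Exception { table.close(); }
    }
    // A LevelDB binding would encode (row, column) into one key, since
    // LevelDB is a flat sorted map; an RDBMS binding would use one table
    // with (row, column) as the primary key.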

Alex Baranau

[1] https://github.com/google/leveldb
[2] https://github.com/dain/leveldb
[3] http://cdap.io
[4]
https://github.com/caskdata/cdap/blob/develop/cdap-api/src/main/java/co/cask/cdap/api/dataset/table/Table.java
[5]
https://github.com/caskdata/cdap/blob/develop/cdap-data-fabric/src/main/java/co/cask/cdap/data2/dataset2/lib/table/leveldb/LevelDBTableCore.java

--
http://cdap.io - open source framework to build and run data applications
on Hadoop & HBase

On Tue, Mar 10, 2015 at 8:42 AM, Rose, Joseph <
Joseph.Rose@childrens.harvard.edu> wrote:

> Sorry, never answered your question about versions. I have 1.0.0 version
> of hbase, which has hadoop-common 2.5.1 in its lib folder.
>
>
> -j
>
>
> On 3/10/15, 11:36 AM, "Rose, Joseph" <Jo...@childrens.harvard.edu>
> wrote:
>
> >I tried it and it does work now. It looks like the interface for
> >hadoop.fs.Syncable changed in March, 2012 to remove the deprecated sync()
> >method and define only hsync() instead. The same committer did the right
> >thing and removed sync() from FSDataOutputStream at the same time. The
> >remaining hsync() method calls flush() if the underlying stream doesn't
> >implement Syncable.
> >
> >
> >-j
> >
> >
> >On 3/6/15, 5:24 PM, "Stack" <st...@duboce.net> wrote:
> >
> >>On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
> >>Joseph.Rose@childrens.harvard.edu> wrote:
> >>
> >>> I think the final issue with hadoop-common (re: unimplemented sync for
> >>> local filesystems) is the one showstopper for us. We have to have
> >>>assured
> >>> durability. I'm willing to devote some cycles to get it done, so maybe
> >>>I'm
> >>> the one that says this problem is worthwhile.
> >>>
> >>>
> >>I remember that was once the case but looking in codebase now, sync calls
> >>through to ProtobufLogWriter which does a 'flush' on output (though
> >>comment
> >>says this is a noop). The output stream is an instance of
> >>FSDataOutputStream made with a RawLOS. The flush should come out here:
> >>
> >>220     public void flush() throws IOException { fos.flush(); }
> >>
> >>... where fos is an instance of FileOutputStream.
> >>
> >>In sync we go on to call hflush which looks like it calls flush again.
> >>
> >>What hadoop/hbase versions we talking about? HADOOP-8861 added the above
> >>behavior for hadoop 1.2.
> >>
> >>Try it I'd say.
> >>
> >>St.Ack
> >
>
>

Re: Standalone == Dev Only?

Posted by "Rose, Joseph" <Jo...@childrens.harvard.edu>.
Sorry, never answered your question about versions. I have 1.0.0 version
of hbase, which has hadoop-common 2.5.1 in its lib folder.


-j


On 3/10/15, 11:36 AM, "Rose, Joseph" <Jo...@childrens.harvard.edu>
wrote:

>I tried it and it does work now. It looks like the interface for
>hadoop.fs.Syncable changed in March, 2012 to remove the deprecated sync()
>method and define only hsync() instead. The same committer did the right
>thing and removed sync() from FSDataOutputStream at the same time. The
>remaining hsync() method calls flush() if the underlying stream doesn’t
>implement Syncable.
>
>
>-j
>
>
>On 3/6/15, 5:24 PM, "Stack" <st...@duboce.net> wrote:
>
>>On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
>>Joseph.Rose@childrens.harvard.edu> wrote:
>>
>>> I think the final issue with hadoop-common (re: unimplemented sync for
>>> local filesystems) is the one showstopper for us. We have to have
>>>assured
>>> durability. I'm willing to devote some cycles to get it done, so maybe
>>>I'm
>>> the one that says this problem is worthwhile.
>>>
>>>
>>I remember that was once the case but looking in codebase now, sync calls
>>through to ProtobufLogWriter which does a 'flush' on output (though
>>comment
>>says this is a noop). The output stream is an instance of
>>FSDataOutputStream made with a RawLOS. The flush should come out here:
>>
>>220     public void flush() throws IOException { fos.flush(); }
>>
>>... where fos is an instance of FileOutputStream.
>>
>>In sync we go on to call hflush which looks like it calls flush again.
>>
>>What hadoop/hbase versions we talking about? HADOOP-8861 added the above
>>behavior for hadoop 1.2.
>>
>>Try it I'd say.
>>
>>St.Ack
>


Re: Standalone == Dev Only?

Posted by "Rose, Joseph" <Jo...@childrens.harvard.edu>.
I tried it and it does work now. It looks like the interface for
hadoop.fs.Syncable changed in March, 2012 to remove the deprecated sync()
method and define only hsync() instead. The same committer did the right
thing and removed sync() from FSDataOutputStream at the same time. The
remaining hsync() method calls flush() if the underlying stream doesn’t
implement Syncable.


-j
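
The fallback described above has roughly this shape (an illustrative paraphrase, not the actual Hadoop source):

    import java.io.IOException;
    import java.io.OutputStream;

    interface Syncable {
      void hsync() throws IOException;
    }

    class SyncFallbackSketch {
      private final OutputStream out;
      SyncFallbackSketch(OutputStream out) { this.out = out; }

      void hsync() throws IOException {
        if (out instanceof Syncable) {
          ((Syncable) out).hsync(); // the stream offers a real sync
        } else {
          out.flush();              // fallback; note a raw FileOutputStream
                                    // would need getFD().sync() for a true
                                    // force-to-disk
        }
      }
    }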


On 3/6/15, 5:24 PM, "Stack" <st...@duboce.net> wrote:

>On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
>Joseph.Rose@childrens.harvard.edu> wrote:
>
>> I think the final issue with hadoop-common (re: unimplemented sync for
>> local filesystems) is the one showstopper for us. We have to have
>>assured
>> durability. I¹m willing to devote some cycles to get it done, so maybe
>>I¹m
>> the one that says this problem is worthwhile.
>>
>>
>I remember that was once the case but looking in codebase now, sync calls
>through to ProtobufLogWriter which does a 'flush' on output (though
>comment
>says this is a noop). The output stream is an instance of
>FSDataOutputStream made with a RawLOS. The flush should come out here:
>
>220     public void flush() throws IOException { fos.flush(); }
>
>... where fos is an instance of FileOutputStream.
>
>In sync we go on to call hflush which looks like it calls flush again.
>
>What hadoop/hbase versions we talking about? HADOOP-8861 added the above
>behavior for hadoop 1.2.
>
>Try it I'd say.
>
>St.Ack


Re: Standalone == Dev Only?

Posted by Stack <st...@duboce.net>.
On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
Joseph.Rose@childrens.harvard.edu> wrote:

> So, I think Nick, St.Ack and Wilm have all made some excellent points, but
> this last email more or less hit it on the head. Like I said, I'm working
> with patient data and while the volume is small now, it's not going to
> stay that way. And the cell-level security is a *huge* win; I'm sure you
> folks have some idea how happy that feature makes me. I'd also rather be
> writing coprocessors than triggers or, heaven forbid, PL/SQL.
>
> But there's another, more fundamental thing: we're exploring other DB
> architectures because classical RDBMS systems haven't always worked out so
> well. In fact, we're having a bit of a hard time with the current project
> because we've been constrained (thus far) to a relational system and it
> doesn't seem to be a clean fit. A key/val store, on the other hand, will
> have enough flexibility to get the job done, I think. It's all being
> prototyped now, so we'll see.
>
>
Ok. Sounds like you know the +/-s. Was just checking.



> I think the final issue with hadoop-common (re: unimplemented sync for
> local filesystems) is the one showstopper for us. We have to have assured
> durability. I'm willing to devote some cycles to get it done, so maybe I'm
> the one that says this problem is worthwhile.
>
>
I remember that was once the case but looking in codebase now, sync calls
through to ProtobufLogWriter which does a 'flush' on output (though comment
says this is a noop). The output stream is an instance of
FSDataOutputStream made with a RawLOS. The flush should come out here:

220     public void flush() throws IOException { fos.flush(); }

... where fos is an instance of FileOutputStream.

In sync we go on to call hflush which looks like it calls flush again.

What hadoop/hbase versions we talking about? HADOOP-8861 added the above
behavior for hadoop 1.2.

Try it I'd say.

St.Ack





> Thanks for chiming in. I'd love to hear more.
>
>
> -j
>
>
> On 3/6/15, 3:02 PM, "Wilm Schumacher" <wi...@gmail.com> wrote:
>
> >Hi,
> >
> >On 06.03.2015 at 19:18, Stack wrote:
> >> Why not use an RDBMS then?
> >
> >When I first read the hbase documentation I also stumbled over the
> >"only use for large datasets" or "standalone only in dev mode" etc. From
> >my point of view there are some arguments against RDBMSs and for e.g.
> >hbase, although we talk about a single node application.
> >
> >* scalability is a future investment. Even if the dataset is small now,
> >it doesn't mean that it will stay small in the future. Scalability in size and
> >computing power is always a good idea.
> >
> >* query language: for a user hbase is more of a database library than a
> >"DBMS". For me this is a big plus, as it forces the user to do it the
> >right way. Just think of SQL-injection. Or CQL-injection for that
> >matter. Query languages are like scripting languages. Makes easy stuff
> >easier and hard stuff harder.
> >
> >* fancy features: hbase has fancy features RDBMSs don't have. E.g.
> >coprocessors. I know that e.g. mysql has "triggers", but they are not
> >nearly as powerful as coprocessors. And don't forget that you have to
> >write most of the triggers in this *curse word* SQL language if you don't
> >want to use evil hacks.
> >
> >* schema-less: another HUGE plus is the possibility to use it without a
> >fixed schema. In SQL you would need several tables and do a lot of
> >joins. And the output is way harder to get and to parse.
> >
> >* ecosystem: when you use hbase you automatically get the whole hadoop,
> >or better apache foundation, ecosystem right away. Not only hdfs, but
> >mapred, lucene, spark, kafka etc. etc..
> >
> >There are only two real arguments against hbase in that scenario:
> >
> >* joins etc.: well, in sql that's a question of minutes. In hbase that
> >takes a little more effort. BUT: then it's done the right way ;).
> >
> >* RDBMSs are more widely known: well ... that's not the fault of hbase ;).
> >
> >Thus, I think that the hbase community should be more self-reliant for
> >that matter, even and especially for applications in the SQL realm ;).
> >Which is a good opportunity to say congratulations for the hbase 1.0
> >milestone. And thank you for that.
> >
> >Best wishes
> >
> >Wilm
> >
>
>

Re: Standalone == Dev Only?

Posted by Andrew Purtell <ap...@apache.org>.
... And if you have at most "small data" at this stage, you might be able
to cut the heap sizes of the HDFS daemons in half.

On Fri, Mar 6, 2015 at 2:18 PM, Andrew Purtell <ap...@apache.org> wrote:

> > I think the final issue with hadoop-common (re: unimplemented sync for local
> filesystems) is the one showstopper for us.
>
> Although the unnecessary overhead would be significant, you could run a
> stripped down HDFS stack on the VM. Give the NameNode, SecondaryNameNode,
> and DataNode 1GB of heap only (so this sacrifices 3GB of RAM), configure
> replication to default to 1, enable short-circuit reads for direct file
> access during reads, and enable sync-behind-writes on the DataNode. If using
> ext3 or ext4, mount with dirsync. (Or use XFS.) This is about as good as you'll
> be able to do until that Hadoop JIRA is addressed. It could get you over
> the hump. I do this sort of thing for testing the full HDFS+HBase stack
> when I'm unable to get my hands on a cluster.
>
>
> On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
> Joseph.Rose@childrens.harvard.edu> wrote:
>
>> So, I think Nick, St.Ack and Wilm have all made some excellent points, but
>> this last email more or less hit it on the head. Like I said, I'm working
>> with patient data and while the volume is small now, it's not going to
>> stay that way. And the cell-level security is a *huge* win -- I'm sure you
>> folks have some idea how happy that feature makes me. I'd also rather be
>> writing coprocessors than triggers or -- heaven forbid -- PL/SQL.
>>
>> But there's another, more fundamental thing: we're exploring other DB
>> architectures because classical RDBMS systems haven't always worked out so
>> well. In fact, we're having a bit of a hard time with the current project
>> because we've been constrained (thus far) to a relational system and it
>> doesn't seem to be a clean fit. A key/val store, on the other hand, will
>> have enough flexibility to get the job done, I think. It's all being
>> prototyped now, so we'll see.
>>
>> I think the final issue with hadoop-common (re: unimplemented sync for
>> local filesystems) is the one showstopper for us. We have to have assured
>> durability. I'm willing to devote some cycles to get it done, so maybe I'm
>> the one who says this problem is worthwhile.
>>
>> Thanks for chiming in. I'd love to hear more.
>>
>>
>> -j
>>
>>
>> On 3/6/15, 3:02 PM, "Wilm Schumacher" <wi...@gmail.com> wrote:
>>
>> >Hi,
>> >
>> >On 06.03.2015 at 19:18, Stack wrote:
>> >> Why not use an RDBMS then?
>> >
>> >When I first read the hbase documentation I also stumbled over the
>> >"only use for large datasets" or "standalone only in dev mode" etc. From
>> >my point of view there are some arguments against RDBMSs and for e.g.
>> >hbase, even if we are talking about a single-node application.
>> >
>> >* scalability is a future investment. Even if the dataset is small now,
>> >that doesn't mean it will stay that way. Scalability in size and
>> >computing power is always a good idea.
>> >
>> >* query language: for a user, hbase is more of a database library than a
>> >"DBMS". For me this is a big plus, as it forces the user to do it the
>> >right way. Just think of SQL injection. Or CQL injection, for that
>> >matter. Query languages are like scripting languages: they make easy
>> >stuff easier and hard stuff harder.
>> >
>> >* fancy features: hbase has fancy features RDBMSs don't have, e.g.
>> >coprocessors. I know that e.g. mysql has "triggers", but they are not
>> >nearly as powerful as coprocessors. And don't forget that you have to
>> >write most of the triggers in this *curse word* SQL language if you
>> >don't want to use evil hacks.
>> >
>> >* schema-less: another HUGE plus is the possibility to use it without a
>> >fixed schema. In SQL you would need several tables and do a lot of
>> >joins. And the output is way harder to get and to parse.
>> >
>> >* ecosystem: when you use hbase you automatically get the whole hadoop,
>> >or better apache foundation, ecosystem right away. Not only hdfs, but
>> >mapred, lucene, spark, kafka etc.
>> >
>> >There are only two real arguments against hbase in that scenario:
>> >
>> >* joins etc.: well, in sql that's a question of minutes. In hbase it
>> >takes a little more effort. BUT: then it's done the right way ;).
>> >
>> >* RDBMSs are more widely known: well ... that's not the fault of hbase ;).
>> >
>> >Thus, I think that the hbase community should be more self-confident on
>> >that point, even and especially for applications in the SQL realm ;).
>> >Which is a good opportunity to say congratulations for the hbase 1.0
>> >milestone. And thank you for that.
>> >
>>
>
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Standalone == Dev Only?

Posted by Andrew Purtell <ap...@apache.org>.
> I think the final issue with hadoop-common (re: unimplemented sync for local
filesystems) is the one showstopper for us.

Although the unnecessary overhead would be significant, you could run a
stripped down HDFS stack on the VM. Give the NameNode, SecondaryNameNode,
and DataNode 1GB of heap only (so this sacrifices 3GB of RAM), configure
replication to default to 1, enable short-circuit reads for direct file
access during reads, and enable sync-behind-writes on the DataNode. If using
ext3 or ext4, mount with dirsync. (Or use XFS.) This is about as good as you'll
be able to do until that Hadoop JIRA is addressed. It could get you over
the hump. I do this sort of thing for testing the full HDFS+HBase stack
when I'm unable to get my hands on a cluster.
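
For reference, here is a sketch of what that might look like in hdfs-site.xml.
Property names are per stock Apache Hadoop 2.x, the socket path is only an
example, and the 1GB heaps would be set via HADOOP_HEAPSIZE (or the per-daemon
_OPTS variables) in hadoop-env.sh:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value> <!-- single DataNode, so no replicas -->
      </property>
      <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value> <!-- direct local file access on reads -->
      </property>
      <property>
        <name>dfs.domain.socket.path</name>
        <value>/var/lib/hadoop-hdfs/dn_socket</value> <!-- needed by short-circuit reads -->
      </property>
      <property>
        <name>dfs.datanode.sync.behind.writes</name>
        <value>true</value> <!-- ask the kernel to sync written data eagerly -->
      </property>
    </configuration>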


On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
Joseph.Rose@childrens.harvard.edu> wrote:

> So, I think Nick, St.Ack and Wilm have all made some excellent points, but
> this last email more or less hit it on the head. Like I said, I'm working
> with patient data and while the volume is small now, it's not going to
> stay that way. And the cell-level security is a *huge* win -- I'm sure you
> folks have some idea how happy that feature makes me. I'd also rather be
> writing coprocessors than triggers or -- heaven forbid -- PL/SQL.
>
> But there's another, more fundamental thing: we're exploring other DB
> architectures because classical RDBMS systems haven't always worked out so
> well. In fact, we're having a bit of a hard time with the current project
> because we've been constrained (thus far) to a relational system and it
> doesn't seem to be a clean fit. A key/val store, on the other hand, will
> have enough flexibility to get the job done, I think. It's all being
> prototyped now, so we'll see.
>
> I think the final issue with hadoop-common (re: unimplemented sync for
> local filesystems) is the one showstopper for us. We have to have assured
> durability. I'm willing to devote some cycles to get it done, so maybe I'm
> the one who says this problem is worthwhile.
>
> Thanks for chiming in. I'd love to hear more.
>
>
> -j
>
>
> On 3/6/15, 3:02 PM, "Wilm Schumacher" <wi...@gmail.com> wrote:
>
> >Hi,
> >
> >On 06.03.2015 at 19:18, Stack wrote:
> >> Why not use an RDBMS then?
> >
> >When I first read the hbase documentation I also stumbled over the
> >"only use for large datasets" or "standalone only in dev mode" etc. From
> >my point of view there are some arguments against RDBMSs and for e.g.
> >hbase, even if we are talking about a single-node application.
> >
> >* scalability is a future investment. Even if the dataset is small now,
> >that doesn't mean it will stay that way. Scalability in size and
> >computing power is always a good idea.
> >
> >* query language: for a user, hbase is more of a database library than a
> >"DBMS". For me this is a big plus, as it forces the user to do it the
> >right way. Just think of SQL injection. Or CQL injection, for that
> >matter. Query languages are like scripting languages: they make easy
> >stuff easier and hard stuff harder.
> >
> >* fancy features: hbase has fancy features RDBMSs don't have, e.g.
> >coprocessors. I know that e.g. mysql has "triggers", but they are not
> >nearly as powerful as coprocessors. And don't forget that you have to
> >write most of the triggers in this *curse word* SQL language if you
> >don't want to use evil hacks.
> >
> >* schema-less: another HUGE plus is the possibility to use it without a
> >fixed schema. In SQL you would need several tables and do a lot of
> >joins. And the output is way harder to get and to parse.
> >
> >* ecosystem: when you use hbase you automatically get the whole hadoop,
> >or better apache foundation, ecosystem right away. Not only hdfs, but
> >mapred, lucene, spark, kafka etc.
> >
> >There are only two real arguments against hbase in that scenario:
> >
> >* joins etc.: well, in sql that's a question of minutes. In hbase it
> >takes a little more effort. BUT: then it's done the right way ;).
> >
> >* RDBMSs are more widely known: well ... that's not the fault of hbase ;).
> >
> >Thus, I think that the hbase community should be more self-confident on
> >that point, even and especially for applications in the SQL realm ;).
> >Which is a good opportunity to say congratulations for the hbase 1.0
> >milestone. And thank you for that.
> >
>

Re: Standalone == Dev Only?

Posted by "Rose, Joseph" <Jo...@childrens.harvard.edu>.
So, I think Nick, St.Ack and Wilm have all made some excellent points, but
this last email more or less hit it on the head. Like I said, I'm working
with patient data and while the volume is small now, it's not going to
stay that way. And the cell-level security is a *huge* win -- I'm sure you
folks have some idea how happy that feature makes me. I'd also rather be
writing coprocessors than triggers or -- heaven forbid -- PL/SQL.
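
Since cell-level security comes up repeatedly in this thread, a concrete
illustration may help: in HBase 0.98+ it is exposed as visibility labels via
the VisibilityController coprocessor. A minimal sketch, assuming that
coprocessor is enabled and that a hypothetical "PHI" label has already been
defined by an administrator (table, row, and column names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.security.visibility.CellVisibility;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VisibilityExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("patients"))) {
          Put put = new Put(Bytes.toBytes("patient-123"));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("dx"),
              Bytes.toBytes("..."));
          // Only principals whose auths include the "PHI" label can read
          // this cell.
          put.setCellVisibility(new CellVisibility("PHI"));
          table.put(put);
        }
      }
    }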

But there's another, more fundamental thing: we're exploring other DB
architectures because classical RDBMS systems haven't always worked out so
well. In fact, we're having a bit of a hard time with the current project
because we've been constrained (thus far) to a relational system and it
doesn't seem to be a clean fit. A key/val store, on the other hand, will
have enough flexibility to get the job done, I think. It's all being
prototyped now, so we'll see.

I think the final issue with hadoop-common (re: unimplemented sync for
local filesystems) is the one showstopper for us. We have to have assured
durability. I'm willing to devote some cycles to get it done, so maybe I'm
the one who says this problem is worthwhile.

Thanks for chiming in. I'd love to hear more.


-j


On 3/6/15, 3:02 PM, "Wilm Schumacher" <wi...@gmail.com> wrote:

>Hi,
>
>On 06.03.2015 at 19:18, Stack wrote:
>> Why not use an RDBMS then?
>
>When I first read the hbase documentation I also stumbled over the
>"only use for large datasets" or "standalone only in dev mode" etc. From
>my point of view there are some arguments against RDBMSs and for e.g.
>hbase, even if we are talking about a single-node application.
>
>* scalability is a future investment. Even if the dataset is small now,
>that doesn't mean it will stay that way. Scalability in size and
>computing power is always a good idea.
>
>* query language: for a user, hbase is more of a database library than a
>"DBMS". For me this is a big plus, as it forces the user to do it the
>right way. Just think of SQL injection. Or CQL injection, for that
>matter. Query languages are like scripting languages: they make easy
>stuff easier and hard stuff harder.
>
>* fancy features: hbase has fancy features RDBMSs don't have, e.g.
>coprocessors. I know that e.g. mysql has "triggers", but they are not
>nearly as powerful as coprocessors. And don't forget that you have to
>write most of the triggers in this *curse word* SQL language if you
>don't want to use evil hacks.
>
>* schema-less: another HUGE plus is the possibility to use it without a
>fixed schema. In SQL you would need several tables and do a lot of
>joins. And the output is way harder to get and to parse.
>
>* ecosystem: when you use hbase you automatically get the whole hadoop,
>or better apache foundation, ecosystem right away. Not only hdfs, but
>mapred, lucene, spark, kafka etc.
>
>There are only two real arguments against hbase in that scenario:
>
>* joins etc.: well, in sql that's a question of minutes. In hbase it
>takes a little more effort. BUT: then it's done the right way ;).
>
>* RDBMSs are more widely known: well ... that's not the fault of hbase ;).
>
>Thus, I think that the hbase community should be more self-confident on
>that point, even and especially for applications in the SQL realm ;).
>Which is a good opportunity to say congratulations for the hbase 1.0
>milestone. And thank you for that.
>
>Best wishes
>
>Wilm
>


Re: Standalone == Dev Only?

Posted by Wilm Schumacher <wi...@gmail.com>.
Hi,

On 06.03.2015 at 19:18, Stack wrote:
> Why not use an RDBMS then?

When I first read the hbase documentation I also stumbled over the
"only use for large datasets" or "standalone only in dev mode" etc. From
my point of view there are some arguments against RDBMSs and for e.g.
hbase, even if we are talking about a single-node application.

* scalability is a future investment. Even if the dataset is small now,
that doesn't mean it will stay that way. Scalability in size and
computing power is always a good idea.

* query language: for a user, hbase is more of a database library than a
"DBMS". For me this is a big plus, as it forces the user to do it the
right way. Just think of SQL injection. Or CQL injection, for that
matter. Query languages are like scripting languages: they make easy
stuff easier and hard stuff harder.

* fancy features: hbase has fancy features RDBMSs don't have, e.g.
coprocessors. I know that e.g. mysql has "triggers", but they are not
nearly as powerful as coprocessors. And don't forget that you have to
write most of the triggers in this *curse word* SQL language if you
don't want to use evil hacks. (See the coprocessor sketch after this
mail.)

* schema-less: another HUGE plus is the possibility to use it without a
fixed schema. In SQL you would need several tables and do a lot of
joins. And the output is way harder to get and to parse.

* ecosystem: when you use hbase you automatically get the whole hadoop,
or better apache foundation, ecosystem right away. Not only hdfs, but
mapred, lucene, spark, kafka etc.

There are only two real arguments against hbase in that scenario:

* joins etc.: well, in sql that's a question of minutes. In hbase it
takes a little more effort. BUT: then it's done the right way ;).

* RDBMSs are more widely known: well ... that's not the fault of hbase ;).

Thus, I think that the hbase community should be more self-confident on
that point, even and especially for applications in the SQL realm ;).
Which is a good opportunity to say congratulations for the hbase 1.0
milestone. And thank you for that.

Best wishes

Wilm
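
To make the coprocessor point above concrete (see the bullet on fancy
features), here is a minimal sketch of a RegionObserver against the HBase
1.0 coprocessor API -- roughly the niche a trigger fills in an RDBMS. The
class, family, and qualifier names are made up:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    // Runs inside the RegionServer before every Put on tables it is
    // attached to -- here, stamping a last-modified audit column.
    public class AuditObserver extends BaseRegionObserver {
      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
          Put put, WALEdit edit, Durability durability) throws IOException {
        put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("modified"),
            Bytes.toBytes(System.currentTimeMillis()));
      }
    }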


Re: Standalone == Dev Only?

Posted by Stack <st...@duboce.net>.
On Tue, Mar 3, 2015 at 7:32 AM, Rose, Joseph <
Joseph.Rose@childrens.harvard.edu> wrote:

> Folks,
>
> I’m new to HBase (but not new to these sorts of data stores.) I think
> HBase would be a good fit for a project I’m working on, except for one
> thing: the amount of data we’re talking about, here, is far smaller than
> what’s usually recommended for HBase. As I read the docs, though, it seems
> like the main argument against small datasets is replication: HDFS requires
> a bunch of nodes right from the start and that’s overkill for my use.
>
>
Why not use an RDBMS then?


> So, what’s the motivation behind labeling standalone HBase deployments
> “dev only”? If all I really need is a table full of keys and all of that
> will fit comfortably in a single node, and if I have my own backup solution
> (literally, backing up the VM on which it’ll run), why bother with HDFS and
> distributed HBase?
>
> (As an aside, I could go to something like Berkeley DB but then I don’t
> get all the nice coprocessors and filters and so on, not to mention
> cell-level security. Because I work with patient data the latter is
> definitely a huge win.)
>
>
What Nick said. Standalone and 'throwaway' are usually found in the same
sentence, so little consideration (testing/verification) has been done to
ensure it works well.

That said, it basically works and I know of at least one instance where a
standalone instance is hosting tsdb for a decent-sized cluster.

St.Ack



> Thanks for your help.
>
>
> Joseph Rose
> Intelligent Health Laboratory
> Boston Children’s Hospital
>
>

Re: Standalone == Dev Only?

Posted by Nick Dimiduk <nd...@gmail.com>.
Hi Joseph,

Generally speaking, we've thought of stand-alone mode as dev/testing because
the common use case for HBase is larger datasets. There's nothing
specifically non-production about a stand-alone mode, though you obviously
won't have high-availability, and there may be bugs in the code paths that
are less used. For instance, issues like HADOOP-7844 pop up and are not
necessarily prioritized because local mode isn't the common production case.

There's also been discussion of refactoring HBase to extract its storage
engine, making it available as a library that can be embedded in other
applications. I'm not sure of the status of that discussion (or if there's
a ticket for it), but it may also be suitable for your use-case. CC Enis as
he's been vocal about this recently.

-n

https://issues.apache.org/jira/browse/HADOOP-7844

On Tue, Mar 3, 2015 at 7:32 AM, Rose, Joseph <
Joseph.Rose@childrens.harvard.edu> wrote:

> Folks,
>
> I’m new to HBase (but not new to these sorts of data stores.) I think
> HBase would be a good fit for a project I’m working on, except for one
> thing: the amount of data we’re talking about, here, is far smaller than
> what’s usually recommended for HBase. As I read the docs, though, it seems
> like the main argument against small datasets is replication: HDFS requires
> a bunch of nodes right from the start and that’s overkill for my use.
>
> So, what’s the motivation behind labeling standalone HBase deployments
> “dev only”? If all I really need is a table full of keys and all of that
> will fit comfortably in a single node, and if I have my own backup solution
> (literally, backing up the VM on which it’ll run), why bother with HDFS and
> distributed HBase?
>
> (As an aside, I could go to something like Berkeley DB but then I don’t
> get all the nice coprocessors and filters and so on, not to mention
> cell-level security. Because I work with patient data the latter is
> definitely a huge win.)
>
> Thanks for your help.
>
>
> Joseph Rose
> Intelligent Health Laboratory
> Boston Children’s Hospital
>
>