Posted to user@phoenix.apache.org by Ben Liang <li...@hotmail.com> on 2015/04/06 00:26:08 UTC

How to Manage Data Architecture & Modeling for HBase

Hi all,
	Do you have any tools for managing data architecture and modeling for HBase (or Phoenix)? Can we use PowerDesigner or ERwin to do it?
	
	Please give me some advice.
	
Regards,
Ben Liang


Re: How to Manage Data Architecture & Modeling for HBase

Posted by Imants Cekusins <im...@gmail.com>.
>  tools to manage Data Architecture & Modeling for HBase

To aid in visualizing table structure, you could use Enterprise Architect (EA).

Even though HBase cells store BLOBs, quite often these BLOBs are
serialized classes.

In EA, classes can appear in table definitions as field types.

It is possible to put a mix of table definitions, class definitions
(and a number of other diagram items) on the same diagram.

EA probably won't generate HBase DDL or DML code for you. The
diagrams are, however, useful for visualization, documentation, and
discussion.


I am not familiar with PowerDesigner or ERwin, so I can't comment on those.

Re: How to Manage Data Architecture & Modeling for HBase

Posted by Michael Segel <mi...@hotmail.com>.
So this is the hardest thing to do… teaching someone not to look at the data in terms of an RDBMS model. 

And there aren’t any hard and fast rules… 

Let's look at an example. 

You’re creating an application for Medicare/Medicaid to help identify potential abuses and fraud within the system. 

In part of your application, you’re going to store all relevant patient information and billing/claim records. 

Within your patient claim data, you have a procedure code. 

In a traditional RDBMS DW, you'd have a lookup (dimension) table relating the code to its description and whatever other data, and you'd link to it from your patient record. 

But in HBase, your claim record would have all of this information with no reference to the lookup table. 

You would still want the lookup table in your application so that you could load it into memory when you're writing or processing records, yet you're storing the relevant fact data in the record itself. But the lookup table isn't joined with your base claim data. (When the claim comes in, you may get only the diagnostic code, but during the ingestion process you'd want to add in the relevant information surrounding that code. This could be anything from a description to the entire record.) 
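The enrich-at-ingest step above can be sketched in plain Python (a hypothetical example; the field names and codes are invented, and a real pipeline would write the enriched record to HBase):

```python
# Hypothetical sketch: denormalize at ingest time. The procedure-code
# lookup table lives in memory; each incoming claim is enriched with the
# code's details before it is written, so reads never need a join.
PROCEDURE_CODES = {
    "99213": {"description": "Office visit, established patient",
              "category": "E/M"},
}

def enrich_claim(claim, lookup=PROCEDURE_CODES):
    """Copy the lookup-table facts into the claim record itself."""
    details = lookup.get(claim["procedure_code"], {})
    enriched = dict(claim)
    enriched["procedure_description"] = details.get("description", "UNKNOWN")
    enriched["procedure_category"] = details.get("category", "UNKNOWN")
    return enriched

claim = {"patient_id": "P001", "procedure_code": "99213", "amount": 120.0}
enriched = enrich_claim(claim)
```

The lookup table is still maintained separately for the ingest process, but the stored claim row is self-contained.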

In theory, HBase should not be normalized.  The idea is that when I pull a record from the base table, most if not all of the data should be present. 
This is why a hierarchical model is a better fit. 

In terms of a DW, you don't have a star schema. In fact, you really shouldn't have much of a schema beyond a single box, or a simple schema with a box and children representing the column families. 

The best example that I can give comes from the BBC's Sherlock Holmes series. In one episode, the villain kept a mental image of a library full of record cards, and this is how he accessed the information he used to blackmail people. 

So think of a medical records filing cabinet. When you go to see the doctor, he pulls out your folder and it contains everything that he has on you and your medical history. It's all there in one record. He pulls out the folder and your medical history is in reverse chronological order: each patient visit, lab result, etc. 
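The "folder in reverse chronological order" layout maps naturally onto a row-key design. A minimal sketch, with a hypothetical key format: HBase sorts row keys lexicographically, so subtracting the timestamp from a maximum value flips the order.

```python
import sys

# Hypothetical row key: patient id plus an inverted, zero-padded timestamp.
# Lexicographic ordering of these keys puts the newest event first, so a
# scan starting at the patient's prefix reads the history newest-to-oldest.
MAX_TS = sys.maxsize

def row_key(patient_id, event_ts):
    # zero-pad to a fixed width so string order matches numeric order
    return f"{patient_id}#{MAX_TS - event_ts:019d}"

keys = sorted(row_key("P001", ts) for ts in (1428000000, 1428100000, 1428200000))
# keys[0] now corresponds to the most recent timestamp
```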

You have to remember that in HBase, you don't want to join tables to get a result. Too slow and too cumbersome. Remember, it's a distributed database. 

This is why you have to look at systems from the '80s like Revelation (Dick Pick's OS/database), UniVerse/U2 (Ascential/Informix/IBM), and others. 

HTH

-Mike

> On Apr 6, 2015, at 8:34 AM, Ben Liang <li...@hotmail.com> wrote:
> 
> Thank you for your prompt reply.
> 
> In my daily work, I mainly use Oracle DB to build data warehouses with star-schema data modeling, for financial analysis and marketing analysis.
> Now I am trying to use HBase to do it. 
> 
> I have a question:
> 1) Many tables from ERP need incremental loading every day, including some inserts and some updates. Is this scenario appropriate for using HBase to build a data warehouse?
> 2) Are there any case studies of enterprise BI solutions with HBase? 
> 
> thanks.
> 
> 
> Regards,
> Ben Liang
> 
>> On Apr 6, 2015, at 20:27, Michael Segel <mi...@hotmail.com> wrote:
>> 
>> Yeah. Jean-Marc is right. 
>> 
>> You have to think more in terms of a hierarchical model where you’re modeling records not relationships. 
>> 
>> Your model would look like a single ER box per record type. 
>> 
>> The HBase schema is very simple.  Tables, column families and that’s it for static structures.  Even then, column families tend to get misused. 
>> 
>> If you’re looking at a relational model… Phoenix or Splice Machine would allow you to do something… although Phoenix is still VERY primitive. 
>> (Do they take advantage of cell versioning like Splice Machine does yet?) 
>> 
>> 
>> There are a couple of interesting things where you could create your own modeling tool / syntax (relationships)… 
>> 
>> 1) HBase is more 3D than RDBMS 2D and similar to ORDBMSs. 
>> 2) You can join entities on either a FK principle or on a weaker relationship type. 
>> 
>> HBase stores CLOBs/BLOBs in each cell. It’s all just byte arrays with a finite bounded length, not to exceed the size of a region. So you could store an entire record as a CLOB within a cell. It’s in this sense, with a cell representing multiple attributes of your object/record, that you gain an additional dimension, and why you only need to use a single data type. 
>> 
>> HBase and Hadoop in general allow one to join orthogonal data sets that have a weak relationship.  So while you can still join sets against a FK which implies a relationship, you don’t have to do it. 
>> 
>> Imagine if you wanted to find out the average cost of a front end collision by car of college aged drivers by major. 
>> You would be joining insurance records against registrations for all of the universities in the US for those students between the ages of 17 and 25. 
>> 
>> How would you model this when in fact neither defining attribute is a FK? 
>> (This is why you need a good Secondary Indexing implementation and not something brain dead that wasn’t alcohol induced. ;-) 
>> 
>> Does that make sense? 
>> 
>> Note: I don’t know if anyone like CCCis, Allstate, State Farm, or Progressive Insurance are doing anything like this. But they could.
>> 
>>> On Apr 5, 2015, at 7:54 PM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:
>>> 
>>> Not sure you want to ever do that... Designing an HBase application is far
>>> different from designing an RDBMS one. Not sure those tools fit well here.
>>> 
>>> What's your goal? Designing your HBase schema somewhere and then letting the
>>> tool generate your HBase tables?
>>> 
>>> 2015-04-05 18:26 GMT-04:00 Ben Liang <li...@hotmail.com>:
>>> 
>>>> Hi all,
>>>>      Do you have any tools to manage Data Architecture & Modeling for
>>>> HBase( or Phoenix) ?  Can we  use Powerdesinger or ERWin to do it?
>>>> 
>>>>      Please give me some advice.
>>>> 
>>>> Regards,
>>>> Ben Liang
>>>> 
>>>> 
>> 
>> The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
>> Use at your own risk. 
>> Michael Segel
>> michael_segel (AT) hotmail.com
>> 
>> 
>> 
>> 
>> 
> 


Re: How to Manage Data Architecture & Modeling for HBase

Posted by Jiten Gore <ji...@gores.net>.
@Ben, you have gotten great tips on HBase schema design here.

If you are looking for a visual tool, feel free to try the attached Word template (open it using the outline view). I created it after noticing that there were no existing tools to display and review HBase schemas, and, as others have commented on this thread, relational tools do not make much sense here.


Re: How to Manage Data Architecture & Modeling for HBase

Posted by Michael Segel <mi...@hotmail.com>.
Ok… 

The hash table isn’t a really good analogy.
It’s really a scan of the META table that identifies which region holds a given key. 

With respect to your use case… 

You can use HBase to solve it. 
You will need to pre-generate your indexes, and you can expect your data to grow exponentially. You shouldn’t rule it out; just plan for it. 
In terms of secondary indexes… which one? 
Since HBase doesn’t natively support them, you could use anything: inverted tables, or Lucene for that matter. (Or some other format.) 

In terms of using the indexes… you would have to do a query/scan against the indexes, then take the intersection of the result set(s). 
(This step could be omitted if using Lucene, but there are other issues, like memory management, so that your index’s memory footprint can be controlled. However, even here there are challenges.) 

So if you want to start simple… do an inverted table. Even here you have a choice: you can have a thin row, or you can store X number of keys in the inverted row. It gets back to fat row vs. thin row, or something in between. Again, there are permutations of the basic pattern with differing amounts of complexity and performance. (Note: we didn’t have time to walk through and benchmark these options, and they’re still relatively theoretical.) 
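A toy illustration of the "thin row" inverted table and the intersection step (hypothetical; a Python dict stands in for the index table, and a prefix match stands in for an HBase prefix scan):

```python
# Hypothetical base table: row key -> claim attributes.
claims = {
    "row1": {"state": "IL", "code": "99213"},
    "row2": {"state": "IL", "code": "99214"},
    "row3": {"state": "WI", "code": "99213"},
}

def build_inverted(table, attr):
    """Thin-row inverted index: one index row per (value, base row key)."""
    return {f"{rec[attr]}#{rowkey}": rowkey for rowkey, rec in table.items()}

def lookup(index, value):
    """Stand-in for a prefix scan over the index table."""
    prefix = value + "#"
    return {rk for k, rk in index.items() if k.startswith(prefix)}

state_idx = build_inverted(claims, "state")
code_idx = build_inverted(claims, "code")
# Query "state = IL AND code = 99213": scan both indexes, then intersect.
hits = lookup(state_idx, "IL") & lookup(code_idx, "99213")
```

A fat-row variant would instead store many base-table keys in a single index row; the trade-off is row size versus the number of index rows scanned.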

Without knowing more about your use case, it’s hard to say what will and what won’t work. (E.g., choosing which attribute to index has an impact.) The size of your raw data set matters too, and then its size with the differing indexes. 

Outside of HBase, if you’re running MapR, they do have MapR-DB, which doesn’t have some of the issues you have with HBase; while more stable, it only runs on MapR. 
(I’m told it’s in the community edition, so when I get the chance, I’ll have to play with it.) 

HTH

-Mike

> On Apr 6, 2015, at 7:23 PM, Pamecha, Abhishek <ap...@paypal.com.INVALID> wrote:
> 
> Thanks for your clarifications. I see where I made assumptions and you corrected them. Primarily it was assuming my response times requirements for the use case below.
> 
> To explain: My use case was exactly what you mentioned in the post as the main use case for Hadoop, except that we required results in less than 1 minute for arbitrary filters and aggregations on keys in the client query. 
> 
> We were exploring using HBase for that retrieval, with a mix of some pre-generation done by M/R jobs and some filtering done by the application at query time. But there were quite a lot of dimensions in which to slice and dice, so pre-generation was exponential and quickly ruled out. On-the-spot filtering led to a lot of data reads in the application, as each row could only be organized in a certain fixed hierarchical fashion, which might not be best for a specific aggregation request (which, for example, required aggregated results to be grouped under the opposite hierarchy). Secondary indexes meant M/R jobs, which weren’t useful given our SLAs.
> 
> Anyway, I incorrectly assumed that the use case here also has such response times. It is absolutely OK for a response to take as much time as a normal M/R job and still satisfy the use-case requirements. As you said, that is indeed the selling point of Hadoop for solving numerous problems.
> 
> Re: hashtables:
> 	I have always viewed HBase as a large distributed hash table with key ranges mapped to data nodes. Partly this has come from my interpretation of HBase's own documentation and partly based on other NoSQL datastores which revolve around same concept. Is it not right that a key is "hashed" to map to which node it is served by? And that all the data for that key is within that node in a blob? My intention was to state that such a hash table design is "very good" for key based data look ups but NOT ideal for use cases where a certain subset of keys need to be scanned and its values aggregated. This essentially means a FULL SCAN of the datastore with a reliable, consistent but a large (for our usecase) response time. Again, other usecases can live with this latency as in many reporting applications for example.
> 
> I concur HBase is not the right solution for all scenarios. It is indeed my first choice when it comes to storing BLOBs against a key and useful for quick lookup and future updates against those keys. When it comes to aggregations and filters across the entire range of keys, I would prefer to explore traditional RDBMS cluster solution as well if it can be a fit and contrast the two.
> 
> I hope I could clarify my intentions. I couldn't agree more with your last statement quoted again:
> 
> "I am not suggesting that HBase is right for all occasions, because its not. But I am suggesting that a lot of effort and failed attempts can be avoided by understanding how to best use HBase and to not think in terms of relationships."
> 
> Thanks
> Abhishek
> 
> 
> 
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com] 
> Sent: Monday, April 06, 2015 4:20 PM
> To: user@hbase.apache.org
> Cc: user@phoenix.apache.org
> Subject: Re: How to Manage Data Architecture & Modeling for HBase
> 
> Ok… 
> 
> Need to clarify this… 
> 
> The use of “real time” is a bit misleading. It’s subjective real time. 
> 
> With respect to schema design… please see my longer post on design in this thread. Again, think hierarchical, which means that you get everything in a single get(). 
> 
> And yes, you have to think about your use case. In some use cases, you are using M/R, pulling data and doing calculations whose output goes into HBase, where another app will, in subjective real time, pull data from HBase for use. 
> 
> In my earlier post I talked about using HBase to join data from different data sets. This is one of the main use cases and arguments for Hadoop. That you want to gain value by taking data from different data sets where the combined data may yield insights that were not previously possible. 
> 
> I’m not sure where you are getting at with hash tables.  
> 
> I am not suggesting that HBase is right for all occasions, because its not. But I am suggesting that a lot of effort and failed attempts can be avoided by understanding how to best use HBase and to not think in terms of relationships. 
> 
> HTH
> 
> -Mike
> 
> 
> 
>> On Apr 6, 2015, at 12:09 PM, Pamecha, Abhishek <ap...@paypal.com.INVALID> wrote:
>> 
>> I would stress that if you envision any joins or arbitrary slices and dices at a later point in your application, you might want to either redesign your schema "very carefully"  or be ready for more time consuming ( not near real time) answers. We had explored a possible solution on similar lines but a hashtable approach (as expected)  isn’t the best for database joins OR slicing based on arbitrary columns across the whole dataset. We had to switch back to a relational db for our usecase.
>> 
>> Thanks,
>> Abhishek
>> 
>> -----Original Message-----
>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>> Sent: Monday, April 06, 2015 9:55 AM
>> To: user@hbase.apache.org
>> Cc: user@phoenix.apache.org
>> Subject: Re: How to Manage Data Architecture & Modeling for HBase
>> 
>> I should add that in terms of financial modeling…
>> 
>> Its easier to store derivatives and synthetic instruments because you aren’t really constrained by a relational model. 
>> (Derivatives are nothing more than a contract.)
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Apr 6, 2015, at 8:34 AM, Ben Liang <li...@hotmail.com> wrote:
>>> 
>>> Thank you for your prompt reply.
>>> 
>>> In my daily work, I mainly used Oracle DB to build a data warehouse with star topology data modeling, about financial analysis and marketing analysis.
>>> Now I trying to use Hbase to do it. 
>>> 
>>> I has a question,
>>> 1) many tables from ERP should be Incremental loading every day , 
>>> Including some insert and some update,  this scenario is appropriate 
>>> to use  hbase to build data worehose?
>>> 2) Is there some case about Enterprise BI Solutions with HBASE? 
>>> 
>>> thanks.
>>> 
>>> 
>>> Regards,
>>> Ben Liang
>>> 
>>>> On Apr 6, 2015, at 20:27, Michael Segel <mi...@hotmail.com> wrote:
>>>> 
>>>> Yeah. Jean-Marc is right. 
>>>> 
>>>> You have to think more in terms of a hierarchical model where you’re modeling records not relationships. 
>>>> 
>>>> Your model would look like a single ER box per record type. 
>>>> 
>>>> The HBase schema is very simple.  Tables, column families and that’s it for static structures.  Even then, column families tend to get misused. 
>>>> 
>>>> If you’re looking at a relational model… Phoenix or Splice Machines would allow you to do something… although Phoenix is still VERY primitive. 
>>>> (Do they take advantage of cell versioning like spice machines yet? 
>>>> )
>>>> 
>>>> 
>>>> There are a couple of interesting things where you could create your 
>>>> own modeling tool / syntax (relationships)…
>>>> 
>>>> 1) HBase is more 3D than RDBMS 2D and similar to ORDBMSs. 
>>>> 2) You can join entities on either a FK principle or on a weaker relationship type. 
>>>> 
>>>> HBase stores CLOBS/BLOBs in each cell. Its all just byte arrays with a finite bounded length not to exceed the size of a region. So you could store an entire record as a CLOB within a cell.  Its in this sense that a cell can represent multiple attributes of your object/record that you gain an additional dimension and why you only need to use a single data type. 
>>>> 
>>>> HBase and Hadoop in general allow one to join orthogonal data sets that have a weak relationship.  So while you can still join sets against a FK which implies a relationship, you don’t have to do it. 
>>>> 
>>>> Imagine if you wanted to find out the average cost of a front end collision by car of college aged drivers by major. 
>>>> You would be joining insurance records against registrations for all of the universities in the US for those students between the ages of 17 and 25. 
>>>> 
>>>> How would you model this when in fact neither defining attribute is a FK? 
>>>> (This is why you need a good Secondary Indexing implementation and 
>>>> not something brain dead that wasn’t alcohol induced. ;-)
>>>> 
>>>> Does that make sense? 
>>>> 
>>>> Note: I don’t know if anyone like CCCis, Allstate, State Farm, or Progressive Insurance are doing anything like this. But they could.
>>>> 
>>>>> On Apr 5, 2015, at 7:54 PM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:
>>>>> 
>>>>> Not sure you want to ever do that... Designing an HBase application 
>>>>> is far different from designing an RDBMS one. Not sure those tools fit well here.
>>>>> 
>>>>> What's you're goal? Designing your HBase schema somewhere and then 
>>>>> let the tool generate your HBase tables?
>>>>> 
>>>>> 2015-04-05 18:26 GMT-04:00 Ben Liang <li...@hotmail.com>:
>>>>> 
>>>>>> Hi all,
>>>>>>    Do you have any tools to manage Data Architecture & Modeling 
>>>>>> for HBase( or Phoenix) ?  Can we  use Powerdesinger or ERWin to do it?
>>>>>> 
>>>>>>    Please give me some advice.
>>>>>> 
>>>>>> Regards,
>>>>>> Ben Liang
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
> 


Re: How to Manage Data Architecture & Modeling for HBase

Posted by Nick Dimiduk <nd...@gmail.com>.
On Mon, Apr 6, 2015 at 5:23 PM, Pamecha, Abhishek <
apamecha@paypal.com.invalid> wrote:

> Re: hashtables:
>         I have always viewed HBase as a large distributed hash table with
> key ranges mapped to data nodes. Partly this has come from my
> interpretation of HBase's own documentation and partly based on other NoSQL
> datastores which revolve around same concept. Is it not right that a key is
> "hashed" to map to which node it is served by? And that all the data for
> that key is within that node in a blob? My intention was to state that such
> a hash table design is "very good" for key based data look ups but NOT
> ideal for use cases where a certain subset of keys need to be scanned and
> its values aggregated. This essentially means a FULL SCAN of the datastore
> with a reliable, consistent but a large (for our usecase) response time.
> Again, other usecases can live with this latency as in many reporting
> applications for example.


Not quite. HBase is an ordered, range-partitioned map. Indeed you have
random lookup based on key, but those keys are strictly ordered. A region
is a key range, so all sequential keys (like [a...f]) are stored together
in a single region. Regions themselves are spread across the cluster
uniformly, so there is no guarantee that two sequential regions will
be hosted on the same region server. Thus, HBase is quite good at
sequential access as well as random access.

The Data Model [0] and Schema Design [1] sections of our online manual
explain this in more detail.

[0]: http://hbase.apache.org/book.html#datamodel
[1]: http://hbase.apache.org/book.html#schema
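The ordered, range-partitioned behavior can be sketched in a few lines (a toy model, not the HBase client API; a sorted key list stands in for the regions):

```python
import bisect

# Toy model: HBase keeps row keys sorted; a region covers a contiguous
# key range, so a scan over [start, stop) touches only the keys (and, in
# a real cluster, only the regions) inside that range.
rows = {"apple": 1, "banana": 2, "cherry": 3, "fig": 4, "grape": 5}
sorted_keys = sorted(rows)

def scan(start, stop):
    """Return keys in [start, stop), as an HBase Scan would."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

point_get = rows["cherry"]     # random lookup by key
range_scan = scan("b", "d")    # sequential access over a key range
```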


> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Monday, April 06, 2015 4:20 PM
> To: user@hbase.apache.org
> Cc: user@phoenix.apache.org
> Subject: Re: How to Manage Data Architecture & Modeling for HBase
>
> Ok…
>
> Need to clarify this…
>
> The use of real time is a bit misleading. Its subjective real time.
>
> With respect to schema design… please see my longer post on design in this
> thread. Again think Hierarchical which means that you get everything in a
> single get().
>
> And yes, you have to think about your use case.  In some use cases, you
> are using M/R and pulling data and doing calculations which is output in to
> HBase where another app will in subjective real time , pull data from hbase
> for use.
>
> In my earlier post I talked about using HBase to join data from different
> data sets. This is one of the main use cases and arguments for Hadoop. That
> you want to gain value by taking data from different data sets where the
> combined data may yield insights that were not previously possible.
>
> I’m not sure where you are getting at with hash tables.
>
> I am not suggesting that HBase is right for all occasions, because its
> not. But I am suggesting that a lot of effort and failed attempts can be
> avoided by understanding how to best use HBase and to not think in terms of
> relationships.
>
> HTH
>
> -Mike
>
>
>
> > On Apr 6, 2015, at 12:09 PM, Pamecha, Abhishek
> <ap...@paypal.com.INVALID> wrote:
> >
> > I would stress that if you envision any joins or arbitrary slices and
> dices at a later point in your application, you might want to either
> redesign your schema "very carefully"  or be ready for more time consuming
> ( not near real time) answers. We had explored a possible solution on
> similar lines but a hashtable approach (as expected)  isn’t the best for
> database joins OR slicing based on arbitrary columns across the whole
> dataset. We had to switch back to a relational db for our usecase.
> >
> > Thanks,
> > Abhishek
> >
> > -----Original Message-----
> > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > Sent: Monday, April 06, 2015 9:55 AM
> > To: user@hbase.apache.org
> > Cc: user@phoenix.apache.org
> > Subject: Re: How to Manage Data Architecture & Modeling for HBase
> >
> > I should add that in terms of financial modeling…
> >
> > Its easier to store derivatives and synthetic instruments because you
> aren’t really constrained by a relational model.
> > (Derivatives are nothing more than a contract.)
> >
> > HTH
> >
> > -Mike
> >
> >> On Apr 6, 2015, at 8:34 AM, Ben Liang <li...@hotmail.com> wrote:
> >>
> >> Thank you for your prompt reply.
> >>
> >> In my daily work, I mainly used Oracle DB to build a data warehouse
> with star topology data modeling, about financial analysis and marketing
> analysis.
> >> Now I trying to use Hbase to do it.
> >>
> >> I has a question,
> >> 1) many tables from ERP should be Incremental loading every day ,
> >> Including some insert and some update,  this scenario is appropriate
> >> to use  hbase to build data worehose?
> >> 2) Is there some case about Enterprise BI Solutions with HBASE?
> >>
> >> thanks.
> >>
> >>
> >> Regards,
> >> Ben Liang
> >>
> >>> On Apr 6, 2015, at 20:27, Michael Segel <mi...@hotmail.com>
> wrote:
> >>>
> >>> Yeah. Jean-Marc is right.
> >>>
> >>> You have to think more in terms of a hierarchical model where you’re
> modeling records not relationships.
> >>>
> >>> Your model would look like a single ER box per record type.
> >>>
> >>> The HBase schema is very simple.  Tables, column families and that’s
> it for static structures.  Even then, column families tend to get misused.
> >>>
> >>> If you’re looking at a relational model… Phoenix or Splice Machines
> would allow you to do something… although Phoenix is still VERY primitive.
> >>> (Do they take advantage of cell versioning like spice machines yet?
> >>> )
> >>>
> >>>
> >>> There are a couple of interesting things where you could create your
> >>> own modeling tool / syntax (relationships)…
> >>>
> >>> 1) HBase is more 3D than RDBMS 2D and similar to ORDBMSs.
> >>> 2) You can join entities on either a FK principle or on a weaker
> relationship type.
> >>>
> >>> HBase stores CLOBS/BLOBs in each cell. Its all just byte arrays with a
> finite bounded length not to exceed the size of a region. So you could
> store an entire record as a CLOB within a cell.  Its in this sense that a
> cell can represent multiple attributes of your object/record that you gain
> an additional dimension and why you only need to use a single data type.
> >>>
> >>> HBase and Hadoop in general allow one to join orthogonal data sets
> that have a weak relationship.  So while you can still join sets against a
> FK which implies a relationship, you don’t have to do it.
> >>>
> >>> Imagine if you wanted to find out the average cost of a front end
> collision by car of college aged drivers by major.
> >>> You would be joining insurance records against registrations for all
> of the universities in the US for those students between the ages of 17 and
> 25.
> >>>
> >>> How would you model this when in fact neither defining attribute is a
> FK?
> >>> (This is why you need a good Secondary Indexing implementation and
> >>> not something brain dead that wasn’t alcohol induced. ;-)
> >>>
> >>> Does that make sense?
> >>>
> >>> Note: I don’t know if anyone like CCCis, Allstate, State Farm, or
> Progressive Insurance are doing anything like this. But they could.
> >>>
> >>>> On Apr 5, 2015, at 7:54 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
> >>>>
> >>>> Not sure you want to ever do that... Designing an HBase application
> >>>> is far different from designing an RDBMS one. Not sure those tools
> fit well here.
> >>>>
> >>>> What's you're goal? Designing your HBase schema somewhere and then
> >>>> let the tool generate your HBase tables?
> >>>>
> >>>> 2015-04-05 18:26 GMT-04:00 Ben Liang <li...@hotmail.com>:
> >>>>
> >>>>> Hi all,
> >>>>>     Do you have any tools to manage Data Architecture & Modeling
> >>>>> for HBase( or Phoenix) ?  Can we  use Powerdesinger or ERWin to do
> it?
> >>>>>
> >>>>>     Please give me some advice.
> >>>>>
> >>>>> Regards,
> >>>>> Ben Liang
> >>>>>
> >>>>>
> >>>
> >>
> >
>
>

Re: How to Manage Data Architecture & Modeling for HBase

Posted by Nick Dimiduk <nd...@gmail.com>.
On Mon, Apr 6, 2015 at 5:23 PM, Pamecha, Abhishek <
apamecha@paypal.com.invalid> wrote:

> Re: hashtables:
>         I have always viewed HBase as a large distributed hash table with
> key ranges mapped to data nodes. Partly this has come from my
> interpretation of HBase's own documentation and partly based on other NoSQL
> datastores which revolve around same concept. Is it not right that a key is
> "hashed" to map to which node it is served by? And that all the data for
> that key is within that node in a blob? My intention was to state that such
> a hash table design is "very good" for key based data look ups but NOT
> ideal for use cases where a certain subset of keys need to be scanned and
> its values aggregated. This essentially means a FULL SCAN of the datastore
> with a reliable, consistent but a large (for our usecase) response time.
> Again, other usecases can live with this latency as in many reporting
> applications for example.


Not quite. HBase is an ordered, range-partitioned map. Indeed you have
random lookup based on key, but those keys are strictly ordered. A region
is a key range, so all sequential keys (like [a...f]) are stored together
in a single region. Regions themselves are spread across the cluster
uniformly, so there's no guarantees made that two sequential regions would
be hosted on the same region server. Thus, HBase is quite good at
sequential access as well as random access.

The Data Model [0] and Schema Design [1] sections of our online manual
explain this in more detail.

[0]: http://hbase.apache.org/book.html#datamodel
[1]: http://hbase.apache.org/book.html#schema



RE: How to Manage Data Architecture & Modeling for HBase

Posted by "Pamecha, Abhishek" <ap...@paypal.com.INVALID>.
Thanks for your clarifications. I see where I made assumptions and you corrected them. Primarily, I was assuming my own response-time requirements applied to the use case below.

To explain: my use case was exactly what you described in your post as the main use case for Hadoop, except that we required results in under one minute for arbitrary filters and aggregations on keys in the client query.

We were exploring HBase for that retrieval, with a mix of pre-generation done by M/R jobs and some filtering done by the application at query time. But there were so many dimensions in which to slice and dice that pre-generation grew exponentially and was quickly ruled out. On-the-spot filtering led to a lot of data reads in the application, since each row could only be organized in one fixed hierarchical fashion, which might not be best for a specific aggregation request (one that, for example, required results grouped under the opposite hierarchy). Secondary indexes meant M/R jobs, which weren't useful given our SLAs.

Anyway, I incorrectly assumed that the use case here came with similar response times. It is absolutely fine for a response to take as long as a normal M/R job and still satisfy the use case's requirements. That, as you said, is indeed the selling point of Hadoop for solving numerous problems.

Re: hashtables:
	I have always viewed HBase as a large distributed hash table, with key ranges mapped to data nodes. Partly this comes from my interpretation of HBase's own documentation, and partly from other NoSQL datastores built around the same concept. Is it not right that a key is "hashed" to determine which node serves it, and that all the data for that key lives on that node as a blob? My intention was to state that such a hash-table design is very good for key-based lookups, but NOT ideal for use cases where a subset of keys needs to be scanned and its values aggregated. That essentially means a FULL SCAN of the datastore, with a reliable and consistent but (for our use case) large response time. Again, other use cases can live with this latency, as in many reporting applications.
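To make that cost difference concrete, here is a toy sketch (plain Python, with a sorted list standing in for an HBase table's key-ordered rows; keys and column names are invented, and this is not the real client API):

```python
import bisect

# Rows sorted by key, as within an HBase region. A key-prefix scan touches
# only the matching key range; filtering on a non-key column has no index,
# so every row must be read -- the full-scan cost described above.
# Each function returns (matches, rows_examined) to show the cost.

rows = sorted([
    ("us#2015-01#k1", {"amt": 10}),
    ("us#2015-02#k2", {"amt": 20}),
    ("eu#2015-01#k3", {"amt": 30}),
    ("eu#2015-02#k4", {"amt": 40}),
])

def prefix_scan(rows, prefix):
    keys = [k for k, _ in rows]
    start = bisect.bisect_left(keys, prefix)  # jump straight to the range
    out, examined = [], 0
    for k, v in rows[start:]:
        if not k.startswith(prefix):
            break  # sorted order lets us stop early
        examined += 1
        out.append(v)
    return out, examined

def filter_scan(rows, col, pred):
    # No index on `col`: every row in the table is examined.
    out = [v for _, v in rows if pred(v[col])]
    return out, len(rows)

hits, read = prefix_scan(rows, "eu#")          # examines only 2 rows
total, read_all = filter_scan(rows, "amt", lambda a: a >= 30)  # examines all 4
```

The asymmetry scales: the prefix scan stays proportional to the result size, while the column filter stays proportional to the table size, which is exactly why arbitrary slicing blew past our SLA.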

I concur that HBase is not the right solution for all scenarios. It is indeed my first choice for storing BLOBs against a key, and it is useful for quick lookups and subsequent updates against those keys. For aggregations and filters across the entire range of keys, I would prefer to also explore a traditional RDBMS cluster solution, if it can be a fit, and contrast the two.

I hope this clarifies my intentions. I couldn't agree more with your last statement, quoted again:

"I am not suggesting that HBase is right for all occasions, because its not. But I am suggesting that a lot of effort and failed attempts can be avoided by understanding how to best use HBase and to not think in terms of relationships."

Thanks
Abhishek








Re: How to Manage Data Architecture & Modeling for HBase

Posted by Michael Segel <mi...@hotmail.com>.
Ok… 

Need to clarify this… 

The use of "real time" is a bit misleading. It's subjective real time.

With respect to schema design… please see my longer post on design in this thread. Again, think hierarchical, which means you get everything in a single get().
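A toy illustration of the single-get() point (plain Python, with a dict standing in for a table; the row key and "o:"/"item:" column names are invented, not a real design):

```python
# The parent order and its child line items are all qualifiers of ONE row,
# so one point lookup returns the whole hierarchy -- no joins and no
# second round trip, unlike a normalized order/line-item pair of tables.

table = {
    "order#1001": {
        "o:customer": "acme",
        "o:total": "300",
        "item:1": "widget|2|100",   # name|qty|price packed per line item
        "item:2": "gadget|1|200",
    },
}

def get(row_key):
    # Single get(): everything stored for the record comes back at once.
    return table.get(row_key, {})

record = get("order#1001")
items = {q: v for q, v in record.items() if q.startswith("item:")}
```

With the real client this would be one `table.get(new Get(rowKey))`; the schema choice of packing children into the parent row is what makes that single call sufficient.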

And yes, you have to think about your use case. In some use cases you use M/R to pull data and do calculations whose output lands in HBase, where another app will then, in subjective real time, pull data from HBase for use.

In my earlier post I talked about using HBase to join data from different data sets. This is one of the main use cases and arguments for Hadoop: you gain value by combining different data sets, where the combined data may yield insights that were not previously possible.

I'm not sure what you are getting at with hash tables.

I am not suggesting that HBase is right for all occasions, because it's not. But I am suggesting that a lot of effort and failed attempts can be avoided by understanding how best to use HBase, and by not thinking in terms of relationships.

HTH

-Mike





RE: How to Manage Data Architecture & Modeling for HBase

Posted by "Pamecha, Abhishek" <ap...@paypal.com.INVALID>.
I would stress that if you envision any joins or arbitrary slices and dices at a later point in your application, you might want to either redesign your schema "very carefully"  or be ready for more time consuming ( not near real time) answers. We had explored a possible solution on similar lines but a hashtable approach (as expected)  isn’t the best for database joins OR slicing based on arbitrary columns across the whole dataset. We had to switch back to a relational db for our usecase.

Thanks,
Abhishek



Re: How to Manage Data Architecture & Modeling for HBase

Posted by Michael Segel <mi...@hotmail.com>.
I should add that in terms of financial modeling… 

It's easier to store derivatives and synthetic instruments because you aren't really constrained by a relational model.
(Derivatives are nothing more than a contract.) 
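For example (a hedged sketch, not real trading code; the field names are invented), a contract with arbitrary per-instrument terms can simply be serialized into one cell, and a new instrument type needs no schema migration:

```python
import json

# Serialize an arbitrary "contract" dict to a blob for a single HBase cell
# and read it back. Different instruments can carry entirely different
# terms in the same column, since the cell is just bytes.

def to_cell(contract):
    return json.dumps(contract, sort_keys=True).encode("utf-8")

def from_cell(blob):
    return json.loads(blob.decode("utf-8"))

swap = {"type": "IRS", "notional": 1000000, "fixed_rate": 0.025, "tenor_y": 5}
option = {"type": "call", "strike": 42.0, "expiry": "2016-06-17"}  # different terms, same cell format

assert from_cell(to_cell(swap)) == swap
```

In a relational model each new instrument type would force new tables or sparse nullable columns; here it is just another blob under another row key.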

HTH

-Mike


The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: How to Manage Data Architecture & Modeling for HBase

Posted by Ben Liang <li...@hotmail.com>.
Thank you for your prompt reply.

In my daily work I have mainly used Oracle DB to build a data warehouse with star-schema data modeling, for financial analysis and marketing analysis.
Now I am trying to use HBase to do the same.

I have a couple of questions:
1) Many tables from the ERP system need incremental loads every day, including some inserts and some updates. Is this scenario appropriate for using HBase to build a data warehouse?
2) Are there any case studies of enterprise BI solutions built with HBase?

thanks.


Regards,
Ben Liang

> On Apr 6, 2015, at 20:27, Michael Segel <mi...@hotmail.com> wrote:
> 
> Yeah. Jean-Marc is right. 
> 
> You have to think more in terms of a hierarchical model where you’re modeling records not relationships. 
> 
> Your model would look like a single ER box per record type. 
> 
> The HBase schema is very simple.  Tables, column families and that’s it for static structures.  Even then, column families tend to get misused. 
> 
> If you’re looking at a relational model… Phoenix or Splice Machines would allow you to do something… although Phoenix is still VERY primitive. 
> (Do they take advantage of cell versioning like spice machines yet? ) 
> 
> 
> There are a couple of interesting things where you could create your own modeling tool / syntax (relationships)… 
> 
> 1) HBase is more 3D than RDBMS 2D and similar to ORDBMSs. 
> 2) You can join entities on either a FK principle or on a weaker relationship type. 
> 
> HBase stores CLOBS/BLOBs in each cell. Its all just byte arrays with a finite bounded length not to exceed the size of a region. So you could store an entire record as a CLOB within a cell.  Its in this sense that a cell can represent multiple attributes of your object/record that you gain an additional dimension and why you only need to use a single data type. 
> 
> HBase and Hadoop in general allow one to join orthogonal data sets that have a weak relationship.  So while you can still join sets against a FK which implies a relationship, you don’t have to do it. 
> 
> Imagine if you wanted to find out the average cost of a front-end collision, by car, for college-aged drivers by major. 
> You would be joining insurance records against registrations for all of the universities in the US for those students between the ages of 17 and 25. 
> 
> How would you model this when in fact neither defining attribute is an FK? 
> (This is why you need a good Secondary Indexing implementation and not something brain dead that wasn’t alcohol induced. ;-) 
> 
> Does that make sense? 
> 
> Note: I don’t know if anyone like CCCis, Allstate, State Farm, or Progressive Insurance are doing anything like this. But they could.
> 
>> On Apr 5, 2015, at 7:54 PM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:
>> 
>> Not sure you want to ever do that... Designing an HBase application is far
>> different from designing an RDBMS one. Not sure those tools fit well here.
>> 
>> What's your goal? Designing your HBase schema somewhere and then letting the
>> tool generate your HBase tables?
>> 
>> 2015-04-05 18:26 GMT-04:00 Ben Liang <li...@hotmail.com>:
>> 
>>> Hi all,
>>>       Do you have any tools to manage Data Architecture & Modeling for
>>> HBase (or Phoenix)?  Can we use PowerDesigner or ERwin to do it?
>>> 
>>>       Please give me some advice.
>>> 
>>> Regards,
>>> Ben Liang
>>> 
>>> 
> 
> The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
> Use at your own risk. 
> Michael Segel
> michael_segel (AT) hotmail.com
> 
> 
> 
> 
> 


Re: How to Manage Data Architecture & Modeling for HBase

Posted by Michael Segel <mi...@hotmail.com>.
Yeah. Jean-Marc is right. 

You have to think more in terms of a hierarchical model where you’re modeling records not relationships. 

Your model would look like a single ER box per record type. 
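As a toy sketch of "one ER box per record type" (plain Python, invented field names, nothing here talks to a real cluster), a star-schema fact can collapse into one denormalized row under a composite row key:

```python
# Toy sketch: one record type modeled as a single HBase-style row.
# The fields (region, customer_id, order_ts) are invented for illustration.

def make_row_key(region: str, customer_id: str, order_ts: int) -> bytes:
    """Composite row key: coarse prefix first, then id, then a reversed
    timestamp so newer orders sort lexicographically first per customer."""
    reversed_ts = 2**63 - 1 - order_ts
    return f"{region}|{customer_id}|{reversed_ts:020d}".encode("utf-8")

# In an RDBMS star schema these would be FKs into dimension tables; here
# the dimension attributes are denormalized straight into the record.
record = {
    b"d:customer_name": b"Acme Corp",
    b"d:product_desc":  b"Widget, blue",
    b"d:amount":        b"1299.00",
}

key = make_row_key("EMEA", "cust-0042", 1428278400)
print(key)
```

The point of the sketch is the key design, not the storage: the record carries its own dimension attributes, so reads need no join.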

The HBase schema is very simple.  Tables, column families and that’s it for static structures.  Even then, column families tend to get misused. 

If you’re looking at a relational model… Phoenix or Splice Machine would allow you to do something… although Phoenix is still VERY primitive. 
(Does Phoenix take advantage of cell versioning like Splice Machine yet?) 


There are a couple of interesting things where you could create your own modeling tool / syntax (relationships)… 

1) HBase is more 3D than RDBMS 2D and similar to ORDBMSs. 
2) You can join entities on either a FK principle or on a weaker relationship type. 

HBase stores CLOBs/BLOBs in each cell. It’s all just byte arrays, with a finite length bounded by the size of a region, so you could store an entire record as a CLOB within a cell. It’s in this sense (a single cell holding multiple attributes of your object/record) that you gain an additional dimension, and it’s why you only need to use a single data type. 
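A minimal sketch of the record-as-CLOB idea (plain Python with stdlib json; a dict stands in for the HBase table, since this isn't wired to a real cluster):

```python
import json

# Toy stand-in for an HBase table: row key -> {column -> cell bytes}.
table = {}

def put_record(row_key: bytes, record: dict) -> None:
    # Serialize the whole record into ONE cell: a single byte array
    # (the "CLOB") holding every attribute of the record.
    cell = json.dumps(record, sort_keys=True).encode("utf-8")
    table[row_key] = {b"d:rec": cell}

def get_record(row_key: bytes) -> dict:
    return json.loads(table[row_key][b"d:rec"].decode("utf-8"))

# The procedure description is copied in (denormalized), not joined in.
put_record(b"claim#0001", {"patient": "p-17", "proc_code": "99213",
                           "proc_desc": "Office visit", "amount": 125.0})
print(get_record(b"claim#0001")["proc_desc"])  # Office visit
```

With a real client the cell would go in via a `Put`, but the shape is the same: one byte array per cell, one record per cell if you choose.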

HBase and Hadoop in general allow one to join orthogonal data sets that have a weak relationship.  So while you can still join sets on an FK, which implies a relationship, you don’t have to. 

Imagine if you wanted to find out the average cost of a front-end collision, by car, for college-aged drivers by major. 
You would be joining insurance records against registrations for all of the universities in the US for those students between the ages of 17 and 25. 

How would you model this when in fact neither defining attribute is an FK? 
(This is why you need a good Secondary Indexing implementation and not something brain dead that wasn’t alcohol induced. ;-) 
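As a toy illustration of the insurance example above (all names and numbers invented; plain Python stands in for a MapReduce or query-engine join), a weak, non-FK join looks like:

```python
# Invented sample data: insurance claims and university registrations.
claims = [
    {"driver": "Ann Lee", "age": 20, "vehicle": "Civic", "front_end_cost": 3200},
    {"driver": "Bo Ruiz", "age": 40, "vehicle": "Civic", "front_end_cost": 1800},
]
registrations = [
    {"student": "Ann Lee", "age": 20, "major": "Physics"},
]

def weak_join(claims, regs):
    """Match on (name, age in 17..25): a weak relationship, not an FK."""
    by_key = {(r["student"], r["age"]): r
              for r in regs if 17 <= r["age"] <= 25}
    for c in claims:
        r = by_key.get((c["driver"], c["age"]))
        if r is not None:
            yield {"major": r["major"], "vehicle": c["vehicle"],
                   "cost": c["front_end_cost"]}

print(list(weak_join(claims, registrations)))
# [{'major': 'Physics', 'vehicle': 'Civic', 'cost': 3200}]
```

Neither (name, age) field is a key in either data set; without a secondary index, a scan over the full tables is what makes this join possible at all.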

Does that make sense? 

Note: I don’t know if anyone like CCCis, Allstate, State Farm, or Progressive Insurance are doing anything like this. But they could.

> On Apr 5, 2015, at 7:54 PM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:
> 
> Not sure you want to ever do that... Designing an HBase application is far
> different from designing an RDBMS one. Not sure those tools fit well here.
> 
> What's your goal? Designing your HBase schema somewhere and then letting the
> tool generate your HBase tables?
> 
> 2015-04-05 18:26 GMT-04:00 Ben Liang <li...@hotmail.com>:
> 
>> Hi all,
>>        Do you have any tools to manage Data Architecture & Modeling for
>> HBase (or Phoenix)?  Can we use PowerDesigner or ERwin to do it?
>> 
>>        Please give me some advice.
>> 
>> Regards,
>> Ben Liang
>> 
>> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: How to Manage Data Architecture & Modeling for HBase

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Not sure you want to ever do that... Designing an HBase application is far
different from designing an RDBMS one. Not sure those tools fit well here.

What's your goal? Designing your HBase schema somewhere and then letting the
tool generate your HBase tables?

2015-04-05 18:26 GMT-04:00 Ben Liang <li...@hotmail.com>:

> Hi all,
>         Do you have any tools to manage Data Architecture & Modeling for
> HBase (or Phoenix)?  Can we use PowerDesigner or ERwin to do it?
>
>         Please give me some advice.
>
> Regards,
> Ben Liang
>
>