You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by jatinpreet <ja...@gmail.com> on 2014/11/24 18:31:57 UTC

HBase entity relationship

Hi,

I am designing my HBase table schema. I have two entities that are related
to each other in a nested structure. For example, consider two entities A
and B. Both of them are complex types.

Entity A contains one or more entity B values. Both entities have their own
tables. Each row of entity A can have multiple columns with the reference to
entity B table's row key. So, the qualifier goes like, b_{b-rowkey}. Each
row of entity B has a column referencing to the parent entity A's row.

At any time if I need to fetch all the B's contained in a single entity A,
which is better, fetching all row IDs of entity B from the entity A's row
and then going to the entity B table for those specific keys. Or scanning
for all the rows with a filter on the column with entity A's key in entity
B's table.

I would appreciate any help in this matter.

Thanks,
Jatin



--
View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: HBase entity relationship

Posted by jatinpreet <ja...@gmail.com>.

That makes sense Wilm. You have been of immense help and I can't thank you
enough. I am more confident of my schema design and access patterns now.

Jatin



--
View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296p4066379.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: HBase entity relationship

Posted by Wilm Schumacher <wi...@cawoom.com>.


Am 26.11.2014 um 08:05 schrieb jatinpreet:
> I am curious to know if a hybrid of approaches 2 and 3 could be used. This
> means having the rowkeys of jobs inside source row like in approach 2. And
> having the parent source rowkey as a column in job row as in approach 3. 
If you just use "option 2" and add a "source" qualifier to the meta data
in the job as you mentioned, you would have such an approach.

But I don't see the an advantage of this idea. But there are
disadvantages. By a hybrid approach you decouple the tables, which is a
bad plan.

Assume you delete a source, but don't delete the jobs (e.g. because the
operation was interrupted). If you use "source" as leading concept, the
data cannot be accessed anymore (as the source is missing), but in the
database. In the this case you have "dead data" in the database. But if
you rely somehow on the "source" (hybrid approach) in a job, you have a
"wild pointer". The first is not good, the second is really bad.


> Now If I want to access all jobs for a particular source, I have the below
> mentioned options.
> 
> 1. Get all job rowkeys from source row, and then use them to fetch the jobs.
> This would be a direct getRows operation and as the rows in HBase are
> indexed by rowkey, I expect this to be fast. But in the process I make two
> calls to two separate tables.
> 
> 2.  As I also have source rowkey in each job row as a column, I can use a
> filter to get all jobs in a single scan of the table but this column is not
> indexed and hence the whole table will be scanned naively if I am correct. 
> 
> Now, in terms of response time, which of these methods will be faster.

The first, as you pointed out.

For n jobs, you would have to make at most n+1 gets. Plan 2 would need a
lot more computations because in the worst case you would have to
iterate every job (which could be millions as you mentioned).

A "get" of a row is the fastest way to fetch data.

You could imagine the source as the "index" of the jobs, which is always
faster than a filter or map reduce.

> Also, I didn't get the idea behind 'if at some point a reevaluation of a
> source has to be done, you could simply use a row lock to prevent race
> conditions'. An elaboration of this would be great!
As I have an idea of what you try, I do not know all the requirements.
And as we all know requirements change over time. So an approach has to
make future use cases as easy as possible. So I tried to make some
points on the advantages.

I could imagine an use case, where you have have to reevaluate all
sources (e.g. "no jobs from the a specific sector anymore"). So you have
to iterate all sources and reevaluate the source one by one, kick the
old jobs, add new ones for the source etc. This would take some time for
each row/source and as you don't want to take down the whole service the
users would access the db at the time of the reevaluation. This could
produce race conditions, which you can avoid by using a row lock of the
specific row you are working on at the time.

Or if your customer/boss or yourself says, that new inserted sources has
to be deleted regularly (evil jobs from evil guys). So if you just make
a list and delete it naively, you could produce race conditions, which
you can avoid by row locks, too.

It was just a small remark on something that was on my mind at the
moment I was writing the mail ;).

Best wishes

Wilm

Re: HBase entity relationship

Posted by jatinpreet <ja...@gmail.com>.

Thanks Wilm for the wonderful answer. I really like your recommendation
numbered 2 of having two tables with the rowkey of jobs inside source row.

I am curious to know if a hybrid of approaches 2 and 3 could be used. This
means having the rowkeys of jobs inside source row like in approach 2. And
having the parent source rowkey as a column in job row as in approach 3. 

Now If I want to access all jobs for a particular source, I have the below
mentioned options.

1. Get all job rowkeys from source row, and then use them to fetch the jobs.
This would be a direct getRows operation and as the rows in HBase are
indexed by rowkey, I expect this to be fast. But in the process I make two
calls to two separate tables.

2.  As I also have source rowkey in each job row as a column, I can use a
filter to get all jobs in a single scan of the table but this column is not
indexed and hence the whole table will be scanned naively if I am correct. 

Now, in terms of response time, which of these methods will be faster.

Also, I didn't get the idea behind 'if at some point a reevaluation of a
source has to be done, you could simply use a row lock to prevent race
conditions'. An elaboration of this would be great!

Thanks!
Jatin





--
View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296p4066374.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: HBase entity relationship

Posted by Wilm Schumacher <wi...@cawoom.com>.

Hi,

thx for the example. That makes it more easy to consider some options.

In my opinion you have 3 basic options.

1.) leading source I

As I assumed, "source" seems to be the leading concept. Every "job" has
to have a "source". So you could pack the "jobs" in the "source"

So you could make a column family "data", with the meta data of the
source, and "jobs" where you put the jobs

Advantage: clean design. E.g. kicking all jobs from one source is easy.
Scanning for sources is easy, scanning for jobs is easy (just concanate
the job columns by source). Getting of all jobs is easy. Just a get to a
row and fetch the jobs column family. In this design a "key" for a job
will look like "<sourceID>-<timestamp>" to target one job ecactly

(small remark: by this design jobs and sources are still kind of
seperated. If you just need the meta data, you only fetch the column
family "data" and you're good to go)

Disadvantage: a "job" has to be represented by a byte array (as it is a
cell). e.g. json. Thus you have to parse it every time you need a
specific job.

So a row could look like this

"sourceXYZ" => {
  data : {
    "description" : "foo bar" ,
    "type" : "typeX"
  } ,
  jobs : {
    <timestamp1> : "{ 'job-data1' : 'foo' , 'job-data2' : bar }" ,
    <timestamp2> : "{ 'job-data1' : 'foz' , 'job-data2' : baz }" ,
    ...
  }
}

And if you want to delete a specific job .... just delete it from the
column family. No need to update the source data. And if you want to add
a job to a source, you can just add it.

And if there are "free hanging" jobs, jobs where the source is not an
URL, you could make just one row. "useradded" or something like that.

2.) leading source II

like above, but you represent a job by a key for a second table

"sourceXYZ" => {
  data : {
    "description" : "foo bar" ,
    "type" : "typeX"
  } ,
  jobs : {
    <timestamp1> : "job34" ,
    <timestamp2> : "job56" ,
    ...
  }
}

"job34" => {
  data : {
    "job-data1" : "foo" ,
    "job-data2" : "bar"
  }
}

advantage: like above, but furthermore no json parsing needed

disadvantage: to get a "full get" of one source you need one "getRow",
and one "getRows", or multiple single gets.

3.) two separate tables

like your plan. One table for the sources, one table for the jobs. with
key for "source" in the jobs table.

Advantage: simple to manipulate single jobs (add, delete etc.)
Disadvantage: complicate for more complicate operations (kick all jobs
for one source)

====

However, I think option 2 is the best way from the above. CPU-time
saving and most flexible. E.g. if at some point a reevaluation of a
source has to be done, you could simply use a row lock to prevent race
conditions.

All other more rare operations (e.g. "kick all sources of type 'typeY'")
can be done by simple MapReduce.

This would be be my recommendation. But perhaps someone else has another
idea.

Best

Wilm

Am 25.11.2014 um 06:43 schrieb jatinpreet:
> Thanks Wilm, 
> 
> Let me try to explain my scenario in more detail. Let me talk about two
> specific entities, Jobs and Sources. 
> 
> *Source- *A URL that is source of some data. It also contains other
> meta-info like description, type etc. So, the required columns are,
> source_name, url, description, type. 
> 
> *Job- *An independent entity created with data from the selected sources.
> Apart from job information, we need to keep a track of which sources were
> selected for this job, and this list is editable, hence addition/removal are
> possible. The columns needed in job are, job_name, description,
> source_{source-rowkey} and so on. 
> 
> I was considering following options, 
> 
> 1. Create a JSON of each source and dump it into the value field of
> source_{timestamp} column. But I need to be able to list all of the
> available sources before creating a job. This would mean scanning all jobs
> and finding just the unique sources from the all the lists. This seems like
> an overkill. 
> Another problem with this approach is that I would have to write my own
> custom filters if I need to filter jobs on basis of source. 
> 
> 2. Create a new table for sources and keep the rowkeys of the sources in job
> rows. This turns out to be somewhat like foreign keys thoguh which
> understandably is awkward for HBase. But now I have the option of scanning
> the sources table for listing purposes. 
> And this is where my question originated. When I need to fetch sources for a
> particular job I could just filter them based on job key column from source
> table. This would mean a long scan on all rows of the source table. 
> Another option is, to fetch the list of source rowkeys from job row and then
> directly hit the source table for these specific rowkeys. 
> If this option sustains, which of the above methods if more prudent. 
> 
> 
> This example might not seem to be based on huge data but I do expect
> millions of jobs to be created. Also, this is a common pattern which I need
> to implement in other parts of HBase tables too. 
> 
> Thanks, 
> Jatin
> 
> 
> 
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296p4066327.html
> Sent from the HBase User mailing list archive at Nabble.com.
>

Re: HBase entity relationship

Posted by jatinpreet <ja...@gmail.com>.

Thanks Wilm,

Let me try to explain my scenario in more detail. Let me talk about two
specific entities, Jobs and Sources.

*Source- *A URL that is source of some data. It also contains other
meta-info like description, type etc. So, the required columns are,
source_name, url, description, type.

*Job- *An independent entity created with data from the selected sources.
Apart from job information, we need to keep a track of which sources were
selected for this job, and this list is editable, hence addition/removal are
possible. The columns needed in job are, job_name, description,
source_{source-rowkey} and so on.

I was considering following options,

1. Create a JSON of each source and dump it into the value field of
source_{timestamp} column. But I need to be able to list all of the
available sources before creating a job. This would mean scanning all jobs
and finding just the unique sources from the all the lists. This seems like
an overkill.
Another problem with this approach is that I would have to write my own
custom filters if I need to filter jobs on basis of source.

2. Create a new table for sources and keep the rowkeys of the sources in job
rows. This turns out to be somewhat like foreign keys thoguh which
understandably is awkward for HBase. But now I have the option of scanning
the sources table for listing purposes.
And this is where my question originated. When I need to fetch sources for a
particular job I could just filter them based on job key column from source
table. This would mean a long scan on all rows of the source table.
Another option is, to fetch the list of source rowkeys from job row and then
directly hit the source table for these specific rowkeys.
If this option sustains, which of the above methods if more prudent.

This example might not seem to be based on huge data but I do expect
millions of jobs to be created. Also, this is a common pattern which I need
to implement in other parts of HBase tables too.

Thanks,
Jatin

--
View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296p4066327.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: HBase entity relationship

Posted by jatinpreet <ja...@gmail.com>.

Thanks Wilm, Let me try to explain my scenario in more detail. Let me talk
about two specific entities, Jobs and Sources.*Source-* A URL that is source
of some data. It also contains other meta-info like description, type etc.
So, the required columns are, source_name, url, description, type.*Job-* An
independent entity created with data from the selected sources. Apart from
job information, we need to keep a track of which sources were selected for
this job, and this list is editable, hence addition/removal are possible.
The columns needed in job are, job_name, description, source_{source-rowkey}
and so on.I was considering following options,1. Create a JSON of each
source and dump it into the value field of source_{timestamp} column. But I
need to be able to list all of the available sources before creating a job.
This would mean scanning all jobs and finding just the unique sources from
the all the lists. This seems like an overkill.Another problem with this
approach is that I would have to write my own custom filters if I need to
filter jobs on basis of source.2. Create a new table for sources and keep
the rowkeys of the sources in job rows. This turns out to be somewhat like
foreign keys thoguh which understandably is awkward for HBase. But now I
have the option of scanning the sources table for listing purposes. And this
is where my question originated. When I need to fetch sources for a
particular job I could just filter them based on job key column from source
table. This would mean a long scan on all rows of the source table.Another
option is, to fetch the list of source rowkeys from job row and then
directly hit the source table for these specific rowkeys. If this option
sustains, which of the above methods if more prudent.This example might not
seem to be based on huge data but I do expect millions of jobs to be
created. Also, this is a common pattern which I need to implement in other
parts of HBase tables too.Thanks,Jatin

--
View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296p4066326.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: HBase entity relationship

Posted by Wilm Schumacher <wi...@cawoom.com>.

Hi,

perhaps I'm wrong, but this sounds a little bit "sql-ish" to me,
relations by ids etc..

Is there a hierarchy in the data? Is A some sort of "container" for B?
Or are the connections arbitrary?

Could your make an example where this relation of classes A and B fit or
make a more verbose explanation of your problem?

I ask because there would be no harm in putting all data from B in an A
row. hbase is build for trillions of rows and millions of columns, so
you could fit all Bs related to an A into the specific A.

E.g. if you have resources A (e.g. cars), and a number of users B which
do stuff on A (e.g. renting), you could make up an A by two column
families ("data", "renting"), where data contains the resource specific
data (e.g. speed etc.), and "renting" would contain thousands of renting
informations, e.g. "renting:<timestamp>" => "<username>", so there would
be no need of using a second (or third) table for the renting informations.

An complete explanation of the problem could be very complicate (and
perhaps you do not want to share the specific problem), so perhaps you
find a similar problem which fits your problem space. This would make an
answer much easier ;).

Best

Wilm

Am 24.11.2014 um 18:31 schrieb jatinpreet:
> Hi,
> 
> I am designing my HBase table schema. I have two entities that are related
> to each other in a nested structure. For example, consider two entities A
> and B. Both of them are complex types.
> 
> Entity A contains one or more entity B values. Both entities have their own
> tables. Each row of entity A can have multiple columns with the reference to
> entity B table's row key. So, the qualifier goes like, b_{b-rowkey}. Each
> row of entity B has a column referencing to the parent entity A's row.
> 
> At any time if I need to fetch all the B's contained in a single entity A,
> which is better, fetching all row IDs of entity B from the entity A's row
> and then going to the entity B table for those specific keys. Or scanning
> for all the rows with a filter on the column with entity A's key in entity
> B's table.
> 
> I would appreciate any help in this matter.
> 
> Thanks,
> Jatin
> 
> 
> 
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296.html
> Sent from the HBase User mailing list archive at Nabble.com.
>