Posted to user@hbase.apache.org by Ishaaq Chandy <is...@gmail.com> on 2009/07/03 09:10:51 UTC

indexing question

Hi all,
I am pretty new to HBase so forgive me if this seems like a silly question.

Each row in my HBase table is a geographical location that is related to
other locations. For example, one relationship is the CONTAIN relationship.
So, Europe CONTAINs England, France, Spain etc. There is an inverse
relationship as well called PARENT, so England has a PARENT called Europe.
However, note that, for various business reasons not pertinent to this
discussion, the inverse relationship need not always be set, i.e. we may not
store France with a PARENT value of Europe, even though Europe CONTAINs
France.

So, I store each location as a row with an id and the payload data for that
location as a separate data column. This data column includes the sets of
ids of the related locations.

Now, I want to be able to update/delete locations consistently. So, in my
example above, I might want to delete France, in which case I also want to
make sure that I delete the CONTAINs relationship that Europe has with
France, as that is now obsolete. What is the most efficient way to do this?
I want to minimise the number of writes I would have to do - though
optimising read performance matters more, as writes do not happen that
often (this is geographic data, after all).

My thoughts are: I will have to do 1+n writes to do a delete - i.e. 1 write
operation to delete France and n write operations to delete the
relationships that n other locations may have to France. In the case of a
root location like Europe, which may have a large number of locations that
relate to it, this may be expensive, but I see no other way.

So I was wondering: how do I index this to speed it up as far as possible?
Given the location Europe, what are the fields I should include in its row,
and how do I index them? I could create a column family for each
relationship type with a label - the label being the id of the location
this location is related to. So, for example, the Europe row would have a
column called CONTAIN:England (assuming "England" is the id for the England
row - in reality it would be a UUID). I would then have as many labels
under the CONTAIN family for Europe as locations that Europe contains.
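
For concreteness, here is roughly what writing one such cell might look
like with the newer Java client API (an unverified sketch - the table
name, the ids and the empty cell value are all made up):

=== (unverified sketch)

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AddContains {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable("locations");
    byte[] europeId = Bytes.toBytes("europe-uuid");   // made-up id
    byte[] englandId = Bytes.toBytes("england-uuid"); // made-up id
    // One cell per relationship: family CONTAIN, qualifier = related id,
    // value unused.
    Put put = new Put(europeId);
    put.add(Bytes.toBytes("CONTAIN"), englandId, Bytes.toBytes(""));
    table.put(put);
  }
}

==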

How would I index this and ensure that, when deleting France, the query
"list all locations that CONTAIN France" returns Europe (and whatever else)
as quickly as possible?

Thanks,
Ishaaq

Re: indexing question

Posted by Jonathan Gray <jl...@streamy.com>.
IndexedTable will always be much slower than HTable.

HTable is just the client interface to HBase.  IndexedTable is a special
facility that extends HTable to allow secondary-index lookups through a
single client call.

The reason it's much slower is that it's actually two separate
queries: first you go to the index table, then you go to the indexed
table.  With two queries, I'd expect it to be 1.5-2x slower.
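
Conceptually a secondary-index read is just this two-step pattern with
plain HTables (unverified sketch - the index table name and the ref:row
layout are made up, not IndexedTable's actual schema):

=== (unverified sketch)

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoStepRead {
  public static void main(String[] args) throws IOException {
    HTable index = new HTable("myTable-index"); // made-up index table
    HTable data  = new HTable("myTable");       // the indexed table

    // Query 1: the index row is keyed by the indexed value and stores
    // the row key of the real row.
    Result idx = index.get(new Get(Bytes.toBytes("some-indexed-value")));
    byte[] mainRowKey =
        idx.getValue(Bytes.toBytes("ref"), Bytes.toBytes("row"));

    // Query 2: fetch the real row from the indexed table.
    Result row = data.get(new Get(mainRowKey));
    System.out.println(row);
  }
}

==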

JG


Re: indexing question

Posted by "Ramesh.Ramasamy" <ra...@gmail.com>.
Hi St.Ack,

>In 0.19.x, this facility is experimental but works.

Does that mean it may or may not give faster data reads than HTable?

In my case, HTable is still giving better performance than IndexedTable (I'm
experimenting with HBase 0.19.1; my sample IndexedTable has 1 family
with 12 qualifiers and 35 million rows - maybe I should try this as a wide
table instead of the tall table I have presently).

The combination of HTable + ColumnValueFilter is awesome, but not
IndexedTable + ColumnValueFilter.
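
The filtered scan I mean looks roughly like this (unverified sketch - in
the 0.20-style API the filter is called SingleColumnValueFilter, and the
table, family, qualifier and value names are made up):

=== (unverified sketch)

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScan {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable("myTable");
    // Keep only rows whose f:q cell equals "v".  The filter runs
    // server-side, but the scan still touches every region.
    Scan scan = new Scan();
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("f"), Bytes.toBytes("q"),
        CompareOp.EQUAL, Bytes.toBytes("v")));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}

==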

TIA,
Ramesh


Re: indexing question

Posted by stack <st...@duboce.net>.
On Sat, Jul 4, 2009 at 10:30 PM, Ishaaq Chandy <is...@gmail.com> wrote:

>
> However, I am still curious about the earlier comments about secondary
> indices not being available - does this mean I should avoid using
> IndexedTable and IndexConfiguration completely? I am using 0.19.3.
>



The lads were probably talking about TRUNK.  Transactional and indexed HBase
were absent for a while as they were being rejiggered to fit new HBase
internals.  They are still in a state of flux but should be put back
together by the time of the 0.20.0 release.

In 0.19.x, this facility is experimental but works.

St.Ack

Re: indexing question

Posted by Ishaaq Chandy <is...@gmail.com>.
Thanks for the responses guys, sounds like multiple relationship columns are
the way to go.

However, I am still curious about the earlier comments about secondary
indices not being available - does this mean I should avoid using
IndexedTable and IndexConfiguration completely? I am using 0.19.3.

Thanks,
Ishaaq


Re: indexing question

Posted by Jonathan Gray <jl...@streamy.com>.
Ishaaq,

I think Michael has the right idea.  You can do this in a single table 
with two families per relationship.

The row key is the location uuid.  And you could have 6 families:

CONTAINS, CONTAINED_IN, PARENTS, PARENT_OF, SURROUNDS, SURROUNDED_BY

You would duplicate all your data, but random writes are quite
fast/cheap in HBase, so updates to this big fat row would be very fast -
a big advantage of HBase.

So to delete France, you would grab all the families from its row and
iterate over them, deleting the other side of each relationship.

=== (unverified pseudo code)

// Needs java.util.NavigableMap, org.apache.hadoop.hbase.client.*, and
// org.apache.hadoop.hbase.util.Bytes; the client calls throw IOException.
HTable ht = new HTable("locations");
Result result = ht.get(new Get(france_id));

// Each qualifier in France's CONTAINED_IN family is the row key of a
// location that contains France.
NavigableMap<byte[],byte[]> containedIn =
   result.getFamilyMap(Bytes.toBytes("CONTAINED_IN"));

for (byte[] containedIn_id : containedIn.keySet()) {
   // Delete the other side of the relationship: the CONTAINS:<france_id>
   // cell in the containing location's row.
   Delete delete = new Delete(containedIn_id);
   delete.deleteColumns(Bytes.toBytes("CONTAINS"), france_id);
   ht.delete(delete);
}

// Finally, delete France's own row.
ht.delete(new Delete(france_id));

==

You'd have to do that for all 6 relationships, so it doubles your delete 
work as well.

How you design this really depends on what your requirements are.  Many
times I'll start with a design like this, which basically allows any kind
of operation relatively quickly at the expense of doubling the data.

Most other designs are going to require some type of scanning.

JG



Re: indexing question

Posted by Michael Robellard <mi...@robellard.com>.
Assuming you always know the location you want to start from:

Can't you have a table with a column family called contains:, which holds
all the places that place contains?

Another column family for the contained-in relationship,

and a third column family for surrounding?

The keys for each column value would be the row key for the location, and
if you had information that you used all the time you could store it in
the column value, so you don't have to do a second-table lookup every
time.
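
A rough, unverified sketch of one such write (the family name, ids and
stored value are made up) - the cell value carries the bit of data you
would otherwise need a second lookup for:

=== (unverified sketch)

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DenormalizedWrite {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable("locations");
    byte[] europeId = Bytes.toBytes("europe-uuid"); // made-up id
    byte[] franceId = Bytes.toBytes("france-uuid"); // made-up id

    // Qualifier is the related row key; the value holds data you always
    // need (here, the display name), saving the second lookup on read.
    Put put = new Put(europeId);
    put.add(Bytes.toBytes("contains"), franceId, Bytes.toBytes("France"));
    table.put(put);
  }
}

==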



Re: indexing question

Posted by Ishaaq Chandy <is...@gmail.com>.
No, it doesn't sound 'raw', 'painful' or 'error prone' to me - I am well
aware of the reasons for using HBase over a traditional RDBMS - so I am not
complaining about this.

No, I was asking the question because I was not sure what the best approach
would be.


By the way, I did not convey the whole story - there is actually a third
type of relationship as well - SURROUNDING - i.e. adjacent geographical
locations SURROUND each other (again, for business reasons, this
relationship is not necessarily always symmetric - though it usually is).

So, when you say HBase doesn't provide declarative secondary indices you
lost me - what are these? How are these different from the ones available
via IndexedTable and IndexSpecification?

Hmm, I was hoping that by using sparse values in a column family labelled
by the location ids I would just have to search for rows with a non-empty
value in the CONTAIN:France column to retrieve the results for the example
query I mentioned. I understand that would make the CONTAIN column
family (and the PARENT and SURROUNDING families too) quite wide, but I
remember reading somewhere that that is quite acceptable for HBase.
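
Something like the following is what I had in mind (an unverified sketch
with the newer client API; the ids are made up) - though as I understand
it this still scans the whole table, since the column is not indexed:

=== (unverified sketch)

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WhoContainsFrance {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable("locations");
    byte[] franceId = Bytes.toBytes("france-uuid"); // made-up id

    // Ask only for CONTAIN:<franceId>; rows lacking that cell are
    // skipped in the results, but the scan itself walks the whole table.
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("CONTAIN"), franceId);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println("container: " + Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}

==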

Further, I was hoping, since the column labels themselves contain the data
I am searching for, that there would be an efficient way to do this (I
don't know why or how - I was just hoping).

Anyway, if it means that the only way to do this efficiently in HBase is
using four tables - one for the locations and one for each of the three
types of relationships - then so be it; that is what I'll have to do. I was
just hoping for a simpler alternative with my idea to use column families
labelled by the location ids.

Ishaaq


Re: indexing question

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

HBase doesn't provide declarative secondary indexes.  Your app code
needs to maintain them, writing into 2 tables with dual writes.  You
don't have to duplicate the data; you can just use the secondary index as
a pointer into the main table, at the cost of having to chase down
potentially thousands of extra RPCs.  There are no HBase transactions
when you are modifying multiple tables, but that isn't as big of a
problem as it seems.
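
The dual-write pattern is nothing fancier than this (unverified sketch -
the table names, family names and ids are made up); the index table is
keyed by the value you want to query on and just points back at the main
row:

=== (unverified sketch)

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DualWrite {
  public static void main(String[] args) throws IOException {
    HTable main = new HTable("locations");
    HTable index = new HTable("locations-by-parent"); // app-maintained

    byte[] franceId = Bytes.toBytes("france-uuid"); // made-up id
    byte[] europeId = Bytes.toBytes("europe-uuid"); // made-up id

    // Write 1: the real row.
    Put p1 = new Put(franceId);
    p1.add(Bytes.toBytes("data"), Bytes.toBytes("parent"), europeId);
    main.put(p1);

    // Write 2: the index row, keyed by parent, pointing back at France.
    // There is no cross-table transaction: if this write fails, the app
    // must detect it and retry.
    Put p2 = new Put(europeId);
    p2.add(Bytes.toBytes("children"), franceId, Bytes.toBytes(""));
    index.put(p2);
  }
}

==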

If all this sounds very 'raw' and 'painful' and 'error prone', let me
remind you what HBase is for, and perhaps you can make a better
choice.

HBase is for when you hit the limits of what you can do with MySQL.  When
you work to scale MySQL you end up taking the following steps:
- dropping transactions
- dropping secondary indexes (they are slow on MySQL/InnoDB)
- splitting the indexes of multiple tables across different databases
- sharding (the last step)

Once you hit the magical 300-500GB size and master-slave replication
scaling has taken you as far as it can, you need to move on to different
techniques and technology.  This is where HBase picks up.

So all the things you list below as 'negatives' are the reality on the
ground when you scale, no matter what technology you use.  If they
sound too ugly to you, perhaps you really need MySQL?



Re: indexing question

Posted by tim robertson <ti...@gmail.com>.
Those 2 tables could be collapsed into 1 table with 2 column families, of
course...


Re: indexing question

Posted by tim robertson <ti...@gmail.com>.
Hi,

Disclaimer: I am a newbie, so this is just one option, and I am basing it
on my understanding that secondary indexes are not yet working in
HBase...

So, since HBase has very fast "get by primary key" but is *still* (?)
without working secondary indexes, you would need to do scans to find
the records.  A workaround would be to have 2 more tables,
"Country_Contains" and "Country_Contained_In"; in each table, the
primary key is the unique ID of the country, and the payload is the
keys of the rows in the main table.  Basically this is creating 2
tables to act as the index manually.  It duplicates data, and it would
require management of 3 tables wrapped in a transaction when doing
CRUD, but it would allow lookup of the rows to modify without needing
to scan.
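
For the delete-France case that could go something like this (an
unverified sketch using the newer client API; the family names and ids
are made up):

=== (unverified sketch)

import java.io.IOException;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteViaIndex {
  public static void main(String[] args) throws IOException {
    HTable containedIn = new HTable("Country_Contained_In"); // the index
    HTable locations = new HTable("locations");              // main table
    byte[] franceId = Bytes.toBytes("france-uuid"); // made-up id

    // The index row for France lists the main-table keys of the
    // locations that contain it.
    Result row = containedIn.get(new Get(franceId));
    NavigableMap<byte[], byte[]> containers =
        row.getFamilyMap(Bytes.toBytes("keys"));
    if (containers != null) {
      for (byte[] containerKey : containers.keySet()) {
        // Remove France from each container's row in the main table.
        Delete d = new Delete(containerKey);
        d.deleteColumns(Bytes.toBytes("contains"), franceId);
        locations.delete(d);
      }
    }
    // Finally remove France's own row (index rows not shown).
    locations.delete(new Delete(franceId));
  }
}

==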

Just one idea...

Cheers,

Tim



