Posted to user@hbase.apache.org by bharath vissapragada <bh...@gmail.com> on 2009/08/17 16:08:56 UTC

Indexed Table in Hbase

Hi all,

I have gone through the IndexedTableAdmin classes in the HBase 0.19.3 API. I
have seen some methods used to create an indexed table (on some column). I
have some doubts regarding the same:

1) Are these somewhat similar to hash indexes (in an RDBMS), where I can easily
look up a column value and find its corresponding row key(s)?
2) Is there any performance gain when I use IndexedTable to search for a
particular column value, instead of scanning an entire normal HTable?

Kindly clarify my doubts.

Thanks in advance.

Re: Indexed Table in Hbase

Posted by Ski Gh3 <sk...@gmail.com>.
Agree, I think the index-scan can probably be more useful than the
index-get.
Actually in Jonathan's example, I would compose the index table key with
"indexedvalue"+"primarykey".
Many rows may have the same indexed values (not in this email example, but
think about other stuff),
then I can get all primary keys with the same indexed values.
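
A minimal sketch of that key layout, using a sorted in-memory map as a stand-in for the index table rather than the HBase client API. The separator character, the class name, and the sample data are assumptions for illustration only; the separator is there so that a lookup for "one" does not also match keys built from a longer value such as "one2".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompositeIndexKeySketch {
    // Hypothetical separator between the indexed value and the primary key.
    static final char SEP = '\u0000';

    // Index row key = indexed value + separator + primary key of the base row.
    static String indexKey(String indexedValue, String primaryKey) {
        return indexedValue + SEP + primaryKey;
    }

    // Range scan over every index row whose key starts with the indexed value.
    static List<String> primaryKeysFor(TreeMap<String, String> indexTable, String indexedValue) {
        String start = indexedValue + SEP;
        String stop = indexedValue + (char) (SEP + 1); // first key past the prefix range
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : indexTable.subMap(start, true, stop, false).entrySet()) {
            result.add(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sorted map standing in for the index table: row key -> primary key.
        TreeMap<String, String> indexTable = new TreeMap<>();
        indexTable.put(indexKey("a@example.com", "user1"), "user1");
        indexTable.put(indexKey("a@example.com", "user7"), "user7");
        indexTable.put(indexKey("b@example.com", "user2"), "user2");
        System.out.println(primaryKeysFor(indexTable, "a@example.com")); // prints [user1, user7]
    }
}
```

Because the rows for one indexed value sort next to each other, a single short range scan returns all matching primary keys.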

Cheers

On Mon, Aug 17, 2009 at 10:23 AM, Gary Helmling <gh...@gmail.com> wrote:

> When defining the IndexSpecification for your table, you can pass your
> own implementation of
> org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.
>
> This allows you to control how the row keys are generated for the
> secondary index table.  For example, you could append the original
> table's row key to the indexed value to ensure uniqueness in
> referencing the original rows.
>
> When you create an indexed scanner, the secondary index code opens and
> wraps a scanner on the secondary index table, based on the start row
> you specify (the indexed value you're looking up).  It applies any
> filter passed to rows on the secondary index table, so make sure
> anything you want to filter on is listed in the "indexed columns" in
> your IndexSpecification.
>
> For any rows returned by the wrapped scanner, the client code then
> does a get for the original table record (the original row key is
> stored in the "__INDEX__" column family I think).
>
> So in total, when using secondary indexes, you wind up with 1 scan + N
> gets to look at N rows.
>
> At least, this was my understanding of how things worked as of 0.19.
> I'm actually moving indexing into my app layer as I update to 0.20.
>
> Hope this helps.
>
> --gh
>
>
> On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray<jl...@streamy.com> wrote:
> > I'm actually unsure about that.  Look at the code or experiment.
> >
> > Seems to me that there would be a uniqueness requirement; otherwise,
> > what do you expect the behavior to be?  A get can only return a single
> > row, so multiple index hits don't really make sense.
> >
> > Clint?  You out there? :)
> >
> > JG
> >
> > bharath vissapragada wrote:
> >>
> >> I got it. I think this is definitely useful in my app, because I am
> >> performing a full table scan every time to select the row keys based
> >> on some column values.
> >>
> >> But we can have more than one row key for the same column value. Can
> >> you please tell me how they are stored?
> >>
> >> Thanks in advance
> >>
> >> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray <jl...@streamy.com> wrote:
> >>
> >>> It's not an actual hash or btree index, but rather secondary indexes in
> >>> HBase are implemented by creating an additional HBase table.
> >>>
> >>> If I have a table "users" (row key is userid) with family "data" and
> >>> column "email", and I want to index the value in that column...
> >>>
> >>> I can create a table "users_email" where the row key is the email
> >>> address (value from the column in "users" table) and a single column
> >>> that contains the userid.
> >>>
> >>> Doing an "index lookup" would mean doing a get on "users_email" and
> >>> then using that userid to do a lookup on the "users" table.
> >>>
> >>> IndexedTable does this transparently, but still does require two
> >>> queries.  So it's slower than a single query, but certainly faster
> >>> than a full table scan.
> >>>
> >>> If you need hash-level performance on the index lookup, there are lots
> >>> of solutions outside of HBase that would work... In-memory Java
> >>> HashMap, Tokyo Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need
> >>> full-text indexing, you can use Lucene or the like.
> >>>
> >>> Make sense?
> >>>
> >>> JG
> >>>
> >>>
> >>> bharath vissapragada wrote:
> >>>
> >>>> But I have read somewhere that secondary indexes are somewhat slow
> >>>> compared to normal HBase tables. Does that affect the performance?
> >>>>
> >>>> Also, do you know the type of index created on the column (I mean
> >>>> hash type, btree, etc.)?
> >>>>
> >>>> On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov <e2...@yahoo.com> wrote:
> >>>>
> >>>>> Hi!
> >>>>>
> >>>>> As far as I understand you are talking about the secondary indexes.
> >>>>> Yes, they can be used to quickly get the rowkey by a value in the
> >>>>> indexed column.
> >>>>>
> >>>>> --Kirill

Re: Indexed Table in Hbase

Posted by Schubert Zhang <zs...@gmail.com>.
The two approaches from Gary.H and Travis.H both work.
But I think there is a risk with Travis.H's (columns) approach when there
are many keys for a column value: the total size of an index table row
may then be larger than a region size. I think this is not a general
approach; you should be very clear about your application.

In one of our implementations, we also use timestamps to store multiple
rowkeys in the index table, just as Bharath says. But there are also risks:
(1) If two rows with the same index-column value are inserted at the same time,
the timestamps may be the same, and the later index entry will overwrite
the previous one. (2) The same risk as Travis.H's (columns) approach.
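
Risk (1) can be illustrated with a small sketch that models one index cell as a map from timestamp to base-table row key. This is a plain-Java model under that assumption, not HBase client code; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class TimestampIndexSketch {
    // One index cell, modeled as timestamp -> base-table row key.
    // A write that reuses an existing timestamp replaces the stored value.
    static void putIndexEntry(TreeMap<Long, String> cellVersions, long timestamp, String baseRowKey) {
        cellVersions.put(timestamp, baseRowKey);
    }

    // All row keys that survive in the cell, oldest timestamp first.
    static List<String> storedRowKeys(TreeMap<Long, String> cellVersions) {
        return new ArrayList<>(cellVersions.values());
    }

    public static void main(String[] args) {
        TreeMap<Long, String> cellVersions = new TreeMap<>();
        putIndexEntry(cellVersions, 1000L, "row1");
        putIndexEntry(cellVersions, 1000L, "row2"); // same timestamp: replaces "row1"
        putIndexEntry(cellVersions, 1001L, "row3");
        System.out.println(storedRowKeys(cellVersions)); // prints [row2, row3]
    }
}
```

The entry for "row1" is silently lost, which is why the timestamp approach needs timestamps that are guaranteed to differ per indexed row.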

Schubert

On Tue, Aug 18, 2009 at 6:39 PM, bharath vissapragada <
bharathvissapragada1990@gmail.com> wrote:

> Thanks Gary .. for explaining .. I got it ...

Re: Indexed Table in Hbase

Posted by bharath vissapragada <bh...@gmail.com>.
Thanks Gary .. for explaining .. I got it ...


Re: Indexed Table in Hbase

Posted by Gary Helmling <gh...@gmail.com>.
Hi Bharath,

If you're using the default key generator
(org.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator),
it actually appends the base table row key for you.  So even though
the column value may be the same for multiple rows, the secondary
index table will still have 1 row for each row with the value in the
original table.  Here is the relevant method from SimpleIndexKeyGenerator:

  public byte[] createIndexKey(byte[] rowKey, Map<byte[], byte[]> columns) {
    return Bytes.add(columns.get(column), rowKey);
  }

So, say you have a table "mytable", with the columns:
    info:keycol       (say this is the one you want to index)
    info:col2
    info:col3

If you define your table with the index specification -- new
IndexSpecification("keycol", Bytes.toBytes("info:keycol")) -- then
HBase will create the secondary index table named "mytable-by_keycol".

Then, say you add the following rows to "mytable":

"row1":  info:keycol="one", info:col2="abc", info:col3="def"
"row2":  info:keycol="one", info:col2="ghi", info:col3="jkl"

At this point, your index table ("mytable-by_keycol") will have the
following rows:

"onerow1": info:keycol="one", __INDEX__:ROW="row1"
"onerow2": info:keycol="one", __INDEX__:ROW="row2"

So you wind up with 2 rows in the index table (with unique row keys)
pointing back at the original table rows, even though we've only
stored a single distinct value for info:keycol.
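
The key construction in this example can be reproduced with a small plain-Java sketch. The concat helper below is a stand-in for the byte-array concatenation in the method quoted above, and the class name is hypothetical; nothing here calls HBase itself.

```java
import java.nio.charset.StandardCharsets;

class SimpleIndexKeySketch {
    // Plain-Java stand-in for concatenating two byte arrays.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Index row key = indexed column value followed by the base table row key.
    static String indexKey(String columnValue, String rowKey) {
        byte[] key = concat(columnValue.getBytes(StandardCharsets.UTF_8),
                            rowKey.getBytes(StandardCharsets.UTF_8));
        return new String(key, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(indexKey("one", "row1")); // prints onerow1
        System.out.println(indexKey("one", "row2")); // prints onerow2
    }
}
```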

To access the rows by the secondary index, create a scanner using
IndexedTable.getIndexedScanner(...).  I don't think there's support
for using the indexes when performing a random read with
HTable.getRow()/HTable.get().  (But maybe I'm wrong?)
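
As a rough model of what such an indexed scan costs (one scan of the index table plus one get per hit, as described earlier in the thread), here is a sketch with in-memory maps standing in for both tables. The class name, the explicit stop row, the third row "row3", and the get counter are assumptions for illustration; this is not the IndexedTable API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IndexedScanSketch {
    static int getCount = 0; // counts point lookups against the base table

    // Stand-in for a single-row get on the base table.
    static String get(Map<String, String> baseTable, String rowKey) {
        getCount++;
        return baseTable.get(rowKey);
    }

    // One range scan over the index table, then one get per index hit.
    static List<String> indexedScan(TreeMap<String, String> indexTable,
                                    Map<String, String> baseTable,
                                    String startRow, String stopRow) {
        List<String> results = new ArrayList<>();
        for (String baseRowKey : indexTable.subMap(startRow, true, stopRow, false).values()) {
            results.add(get(baseTable, baseRowKey));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> baseTable = new HashMap<>();
        baseTable.put("row1", "info:col2=abc");
        baseTable.put("row2", "info:col2=ghi");
        baseTable.put("row3", "info:col2=xyz");

        // Index row key -> base table row key (the __INDEX__:ROW value).
        TreeMap<String, String> indexTable = new TreeMap<>();
        indexTable.put("onerow1", "row1");
        indexTable.put("onerow2", "row2");
        indexTable.put("tworow3", "row3");

        List<String> hits = indexedScan(indexTable, baseTable, "one", "onf");
        // prints [info:col2=abc, info:col2=ghi] fetched with 2 gets
        System.out.println(hits + " fetched with " + getCount + " gets");
    }
}
```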

As Travis mentions, you could always use an alternate approach to
implement your own indexing (use the index value as the row key for
your own index table and store the original table row keys as
individual columns).  I'm using the same approach for one access
pattern and so far it seems to work very well.

But as far as I know, the built-in secondary indexing assumes 1
secondary index table row -> 1 original table row.

Sorry if this got a bit long-winded.  It gets a little complicated to
explain in text...

--gh


On Mon, Aug 17, 2009 at 1:46 PM, bharath
vissapragada<bh...@gmail.com> wrote:
> Thanks for your explanation, Gary.
>
> Consider my case, where I can have repeated values. So are you saying that I
> should edit the IndexKeyGenerator so that, instead of storing
> (column -> rowkey), it stores (column -> rowkey1, rowkey2)
> as different timestamps? If yes, is that a good way?

RE: Indexed Table in Hbase

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
I'm not familiar with tableindexed at all, but my manually indexed tables have the value as the row key, and a single column for each row of the original table that has that value.

The key user@domain.com would have columns rows:user1, rows:user7, rows:user12, etc.

Then just do a get on user@domain.com and you'll have a whole list of users with that email address. The added benefit is that you can put some useful piece of info into any of the rows:user1 cells like whether the address is primary, or whatever fits your design.
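
A minimal sketch of this layout, with nested in-memory maps standing in for the index table. The class and method names and the "primary"/"secondary" notes are assumptions for illustration; only the "rows:" column naming comes from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ManualIndexSketch {
    static final String FAMILY = "rows:";

    // Index row key is the indexed value; each original row gets its own column.
    static void addToIndex(Map<String, TreeMap<String, String>> indexTable,
                           String indexedValue, String originalRowKey, String note) {
        indexTable.computeIfAbsent(indexedValue, k -> new TreeMap<>())
                  .put(FAMILY + originalRowKey, note);
    }

    // A single get on the index row yields every original row key for that value.
    static List<String> lookup(Map<String, TreeMap<String, String>> indexTable, String indexedValue) {
        List<String> rowKeys = new ArrayList<>();
        TreeMap<String, String> columns = indexTable.get(indexedValue);
        if (columns != null) {
            for (String column : columns.keySet()) {
                rowKeys.add(column.substring(FAMILY.length()));
            }
        }
        return rowKeys;
    }

    public static void main(String[] args) {
        Map<String, TreeMap<String, String>> indexTable = new HashMap<>();
        addToIndex(indexTable, "user@domain.com", "user1", "primary");
        addToIndex(indexTable, "user@domain.com", "user7", "secondary");
        addToIndex(indexTable, "user@domain.com", "user12", "secondary");
        // Columns sort lexicographically, so user12 comes before user7.
        System.out.println(lookup(indexTable, "user@domain.com")); // prints [user1, user12, user7]
    }
}
```

All row keys for one value live in a single index row here, so that row grows with the number of matches.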

Just a thought, perhaps you could implement that method with the tableindexed.IndexKeyGenerator that Gary mentioned.
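
As a rough illustration of the layout described above, here is a minimal in-memory sketch. It is not the HBase client API: the class ManualIndexSketch and its methods are hypothetical names, and a sorted map stands in for the index table. The row key is the indexed value, and each matching original row gets its own "rows:<rowkey>" column whose cell can carry extra information.

```java
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a manually maintained index table (not the HBase client API). */
final class ManualIndexSketch {
    // index row key (the indexed value) -> columns ("rows:<original row key>" -> cell value)
    private final Map<String, TreeMap<String, String>> indexTable = new TreeMap<>();

    /** Record that originalRowKey holds indexedValue; note is optional per-cell info. */
    void put(String indexedValue, String originalRowKey, String note) {
        indexTable.computeIfAbsent(indexedValue, k -> new TreeMap<>())
                  .put("rows:" + originalRowKey, note);
    }

    /** A single "get" on the index row returns every original row key with that value. */
    Map<String, String> get(String indexedValue) {
        return indexTable.getOrDefault(indexedValue, new TreeMap<>());
    }

    public static void main(String[] args) {
        ManualIndexSketch index = new ManualIndexSketch();
        index.put("user@domain.com", "user1", "primary");
        index.put("user@domain.com", "user7", "secondary");
        index.put("user@domain.com", "user12", "secondary");
        // Column qualifiers identify the matching rows; cell values hold the notes.
        System.out.println(index.get("user@domain.com"));
    }
}
```

With this layout, repeated values cost one extra column per original row rather than one extra index row, so a lookup stays a single get.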

Thanks,

Travis Hegner
http://www.travishegner.com/




Re: Indexed Table in Hbase

Posted by bharath vissapragada <bh...@gmail.com>.
Thanks for your explanation, Gary.

Consider my case, where the indexed values can repeat. Are you saying that I
should write the IndexKeyGenerator so that, instead of storing
(column -> rowkey), the index stores (column -> rowkey1, rowkey2)
as different timestamps? If so, is that a good approach?


Re: Indexed Table in Hbase

Posted by Gary Helmling <gh...@gmail.com>.
When defining the IndexSpecification for your table, you can pass your
own implementation of
org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.

This allows you to control how the row keys are generated for the
secondary index table.  For example, you could append the original
table's row key to the indexed value to ensure uniqueness in
referencing the original rows.

When you create an indexed scanner, the secondary index code opens and
wraps a scanner on the secondary index table, based on the start row
you specify (the indexed value you're looking up).  It applies any
filter passed to rows on the secondary index table, so make sure
anything you want to filter on is listed in the "indexed columns" in
your IndexSpecification.

For any rows returned by the wrapped scanner, the client code then
does a get for the original table record (the original row key is
stored in the "__INDEX__" column family I think).

So in total, when using secondary indexes, you wind up with 1 scan + N
gets to look at N rows.

At least, this was my understanding of how things worked as of 0.19.
I'm actually moving indexing into my app layer as I update to 0.20.
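
The composite-key idea and the "1 scan + N gets" access pattern can be sketched in plain Java. This is only an in-memory model under stated assumptions, not the tableindexed API: CompositeKeyIndexSketch, SEP, put and lookup are hypothetical names, sorted maps stand in for the two HBase tables, and the separator is assumed never to occur inside an indexed value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory model of a composite-key secondary index; sorted maps stand in for HBase tables. */
final class CompositeKeyIndexSketch {
    // Hypothetical separator; assumed never to occur inside an indexed value.
    static final char SEP = '\u0000';

    final TreeMap<String, String> primary = new TreeMap<>(); // row key -> record
    final TreeMap<String, String> index = new TreeMap<>();   // value + SEP + row key -> row key

    /** Write the record and its index entry. Stale entries from updates are not cleaned up here. */
    void put(String rowKey, String indexedValue, String record) {
        primary.put(rowKey, record);
        index.put(indexedValue + SEP + rowKey, rowKey);
    }

    /** One range scan over the index, then one get on the primary table per hit. */
    List<String> lookup(String indexedValue) {
        String start = indexedValue + SEP;              // inclusive start row
        String stop = indexedValue + (char) (SEP + 1);  // exclusive stop row
        List<String> records = new ArrayList<>();
        for (String rowKey : index.subMap(start, stop).values()) {
            records.add(primary.get(rowKey));
        }
        return records;
    }
}
```

Because the original row key is part of the index key, two rows with the same indexed value produce two distinct index rows that sort next to each other, which is what makes the range scan return all of them.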

Hope this helps.

--gh



Re: Indexed Table in Hbase

Posted by bharath vissapragada <bh...@gmail.com>.
Generally one would expect that, apart from the rowkey, other columns can hold
repeated values, and that is the case in my application.
There seems to be no function in the API that handles this.

If anyone knows more about it or has faced the same situation, kindly reply.

Thanks .



Re: Indexed Table in Hbase

Posted by Jonathan Gray <jl...@streamy.com>.
I'm actually unsure about that.  Look at the code or experiment.

Seems to me that there would be a uniqueness requirement; otherwise, what
do you expect the behavior to be?  A get can only return a single row,
so multiple index hits don't really make sense.

Clint?  You out there? :)

JG


Re: Indexed Table in Hbase

Posted by bharath vissapragada <bh...@gmail.com>.
I got it. I think this is definitely useful in my app, because I am
performing a full table scan every time to select rowkeys based on
some column values.

But we can have more than one rowkey for the same column value. Can you please
tell me how they are stored?

Thanks in advance


Re: Indexed Table in Hbase

Posted by Jonathan Gray <jl...@streamy.com>.
It's not an actual hash or btree index, but rather secondary indexes in 
HBase are implemented by creating an additional HBase table.

If I have a table "users" (row key is userid) with family "data" and 
column "email", and I want to index the value in that column...

I can create a table "users_email" where the row key is the email 
address (value from the column in "users" table) and a single column 
that contains the userid.

Doing an "index lookup" would mean doing a get on "users_email" and then 
using that userid to do a lookup on the "users" table.

IndexedTable does this transparently, but still does require two 
queries.  So it's slower than a single query, but certainly faster than 
a full table scan.
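
To make the two-query pattern concrete, here is a minimal sketch in plain Java. It only models the idea and does not use the IndexedTable API: TwoStepLookupSketch, putUser and findByEmail are hypothetical names, and hash maps stand in for the "users" and "users_email" tables.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain maps standing in for the "users" table and its "users_email" index table. */
final class TwoStepLookupSketch {
    final Map<String, Map<String, String>> users = new HashMap<>(); // userid -> (column -> value)
    final Map<String, String> usersEmail = new HashMap<>();         // email -> userid

    /** Write the user row and keep the index table in step with it. */
    void putUser(String userId, String email) {
        Map<String, String> row = new HashMap<>();
        row.put("data:email", email);
        users.put(userId, row);
        // A second user with the same email would overwrite this index entry.
        usersEmail.put(email, userId);
    }

    /** Query 1: get on the index table. Query 2: get on the primary table. */
    Map<String, String> findByEmail(String email) {
        String userId = usersEmail.get(email);
        return userId == null ? null : users.get(userId);
    }
}
```

Note that this simple layout assumes each indexed value maps to exactly one userid; repeated values need a different index key design.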

If you need hash-level performance on the index lookup, there are lots 
of solutions outside of HBase that would work... In-memory Java HashMap, 
Tokyo Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text 
indexing, you can use Lucene or the like.

Make sense?

JG


Re: Indexed Table in Hbase

Posted by bharath vissapragada <bh...@gmail.com>.
But I have read somewhere that secondary indexes are somewhat slow compared
to normal HBase tables. Does that affect performance?

Also, do you know what type of index is created on the column (hash, B-tree,
etc.)?


Re: Indexed Table in Hbase

Posted by Kirill Shabunov <e2...@yahoo.com>.
Hi!

As far as I understand you are talking about the secondary indexes. Yes, 
they can be used to quickly get the rowkey by a value in the indexed column.

--Kirill

bharath vissapragada wrote:
> Hi all ,
> 
> I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API ..  I
> have seen some methods used to create an Indexed Table (on some column).. I
> have some doubts regarding the same ...
> 
> 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can easily
> lookup a column value and find it's corresponding rowkey(s)
> 2) Can i find any performance gain when i use IndexedTable to search for a
> paritcular column value .. instead of scanning an entire normal HTable ..
> 
> Kindly clarify my doubts
> 
> Thanks in advance
>