Posted to user@phoenix.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2014/03/26 03:58:15 UTC

Schema on read vs. secondary indexes

Hi,

When I saw "Schema on read" my heart jumped because I thought that meant:

1) being able to insert rows without having to define columns ahead of
time, and

2) being able to query against any column in a row without having to know
which columns one will be searching against.  For example, if a row with
"anyRandomColumn" gets added, I could run a query like select .... where
anyRandomColumn='foo' and select that row even though I didn't set a
secondary index on anyRandomColumn.

But after reading a bit about Phoenix I think Phoenix can do 1), but cannot
do 2) -- one has to tell it which columns to build indexes on.  Is this
correct?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

Re: Schema on read vs. secondary indexes

Posted by James Taylor <ja...@apache.org>.
On Thu, Mar 27, 2014 at 9:36 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> Sorry, more Qs.  This is exciting stuff!
>

No problem. Thanks for the interest!


>
> On Wed, Mar 26, 2014 at 2:29 PM, James Taylor <ja...@apache.org> wrote:
>
>> On Wed, Mar 26, 2014 at 9:17 AM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com> wrote:
>>
>>> Hi James,
>>>
>>> On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <ja...@apache.org> wrote:
>>>
>>>> Hi Otis,
>>>> That's an excellent idea. Phoenix does support (1) & (2), but we don't
>>>> support adding a secondary index on a dynamic column. However, there's
>>>> really no reason why we couldn't - we've never had anyone ask for this. Our
>>>> mutable secondary index support is done at the HBase level, so as long as
>>>> only Puts and Deletes are done on the data, it should work fine. We'd just
>>>> need to add the syntax for dynamic column declaration for CREATE INDEX to
>>>> our grammar.
>>>>
>>>
>>> Only Puts in our case.  And batch Deletes when data's TTL is reached,
>>> but I assume an external job/process would have to go through HBase rows
>>> and delete them, and that would automatically update any data structures and
>>> files Phoenix uses?
>>>
>>
>> If you're using the standard batch Delete API, you'd be fine. If you're
>> relying on setting a TTL on the HBase table, then you'd need to set the
>> same TTL on the index table (and this would be somewhat dangerous, as HBase
>> might prune the data table and index table at different times).
>>
>
> How about time-based partitioning instead of TTL? e.g. a table for today,
> a table for yesterday, etc. (or maybe not "table", but some sort of
> non-table partitioning mechanism?).
> I'm asking because I assume dropping tables/partitions when they are old
> enough to be removed would be faster than having HBase scan everything and
> remove based on TTL.
>

The TTL feature in HBase piggybacks on compaction, so there's no additional
overhead. As far as partitioning, essentially your row key in HBase is your
partitioning mechanism, so if you use a date/time column as the leading
part of your primary key in Phoenix, you've partitioned the table by time.
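
For example, a minimal sketch in Phoenix SQL (table and column names here
are made up, not from the thread):

  -- Leading date column => the row key leads with time => time partitioning.
  CREATE TABLE IF NOT EXISTS cars (
      sale_date DATE NOT NULL,
      car_id    BIGINT NOT NULL,
      model     VARCHAR,
      colour    VARCHAR
      CONSTRAINT pk PRIMARY KEY (sale_date, car_id));

  -- A 10-day window then becomes one contiguous range scan:
  SELECT colour, COUNT(*) FROM cars
  WHERE sale_date > CURRENT_DATE() - 10
  GROUP BY colour;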


>
> And if such partitioned structure is doable at the HBase level (e.g.
> multiple tables), would that work for Phoenix?  For example, if I have
> daily partitions/tables, and I query for SELECT model, colour FROM Cars
> WHERE saleDate BETWEEN now() and now() - 10 days GROUP BY colour, would
> Phoenix be able to query across all 10 daily partitions?
>
>>>> What's the use case you have in mind? Keep in mind too, that adding
>>>> secondary indexes has an impact on write performance (from the HBase POV,
>>>> you're doing two Puts instead of one and there's some cost associated with
>>>> the incremental maintenance).
>>>>
>>>
>>> Think of it as an organization-wide "throw anything in and query it
>>> right away database". :)
>>>
>>
>> +1. I like it!
>>
>>>
>>> Ideally, I'd like to be able to have HBase that is schemaless ("throw
>>> anything in"), that is multi-tenant ("anyone at my company can start
>>> inserting their data"), where each tenant can start adding rows with their
>>> own columns ("schemaless"), and then search against any of them right away,
>>> without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
>>> in an org with such a database service and I decide I want to start
>>> tracking the number of user logins in my web application.  So I may decide
>>> to start adding rows like this:
>>>
>>
>> Ah, this is an ideal use case for our multi-tenant support. Take a look
>> at this: http://phoenix.incubator.apache.org/multi-tenancy.html. So
>> you'd create a "base" multi-tenant table with the columns below. Then each
>> tenant would create a "view" over this base table and they could evolve
>> independently.
>>
>
> But imagine there is no base table.  Really, imagine only something like
> "timestamp" column is required everywhere, and maybe tenantID, but all
> other columns are completely ad-hoc.  That is, somebody in the Sales
> department will want to insert rows with sales-specific columns, Marketing
> person will shove in marketing data, the IT person may throw in names and
> hardware specs and locations of their 10000 servers, the engineer may
> decide to stuff performance metrics in this, and the front-end developer
> may feed in clickstream data.  So there is no such thing as "base table",
> it's really completely open.
>
> Would the multi-tenant approach with a super minimal base table still
> work, even though the number of shared/base columns would be very, very
> small compared to the total number of different columns across all tenants?
>

Yes, that's no problem, and not uncommon. You might just have a tenantID
column (that's a requirement as the leading column) plus a date/time
column. Not sure if you'd have a userID column as well. Note that with
Phoenix, our DATE type maintains time at the granularity of milliseconds,
so assuming that's a fine enough granularity, I'd recommend that over a
TIMESTAMP.
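
As a rough sketch (following the multi-tenancy page linked earlier in this
thread; names are made up):

  -- Base table: tenant_id must be the leading primary key column.
  CREATE TABLE event_base (
      tenant_id VARCHAR NOT NULL,
      created   DATE NOT NULL,
      event_id  BIGINT NOT NULL
      CONSTRAINT pk PRIMARY KEY (tenant_id, created, event_id))
      MULTI_TENANT=true;

  -- Each tenant, from its own tenant-specific connection, layers its
  -- ad-hoc columns on a view; views evolve independently:
  CREATE VIEW login_events (
      user_id VARCHAR, country_of_user VARCHAR, user_agent VARCHAR)
      AS SELECT * FROM event_base;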

>
>>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>>> operatingSystem
>>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>>> operatingSystem
>>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>>> operatingSystem
>>>   ...
>>>
>>> And then I want to switch to some UI where I can enter:
>>>
>>>   select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
>>> group by gender
>>>
>>> I assume this will work ONLY if somebody does CREATE INDEX on userAgent
>>> and countryOfUser first, right?
>>>
>>
>> No, this will work fine with and without an index. An index will just
>> make the query execute faster (at the cost of slower writes, of course).
>>
>
> Right, but how much slower are reads without an index vs. what is the cost
> of maintaining that index at write time?
>

That's difficult to generalize - it depends also on the filters in your
WHERE clause, as they get pushed down to HBase as well. Remember, though,
that in a multi-tenant world, you'll always be constraining the rows that
get queried by the tenantID (that's done automatically for you
under-the-covers when a tenant-specific connection is used).
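
For instance, with a tenant-specific connection (a sketch; the TenantId
connection property is described on the multi-tenancy page, and the view
name is made up):

  -- Connecting with e.g. jdbc:phoenix:zkhost;TenantId=marketing scopes
  -- every statement to that tenant's rows automatically:
  SELECT user_agent, COUNT(*) FROM login_events GROUP BY user_agent;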


>
> If you look at my examples above, some of them will be low volume (e.g.
> sales figures), but some will be *crazy* high volume with many thousands of
> inserts per second across hundreds or more tenants.
> So data volume will grow very very fast.
> As such, I'm guessing searching against non-indexed columns will go well
> beyond a second or two.
> But I don't have a sense of the index write cost (disk IO, CPU, disk
> space, etc.) and memory cost (I'm assuming some portion of the index is
> read into memory/heap and cached?)
>

Besides metadata caching, the caching is all handled by HBase (through its
block cache).


>
> Are there any guidelines or anecdotes or benchmarks that would give one
> some rough ideas about this?
>
>>> Assuming the answer is yes:
>>> 1) How much overhead does this add in terms of disk space or heap/memory
>>> or CPU used for creation/updates of this index?
>>>
>>
>> Take a look at this:
>> http://phoenix.incubator.apache.org/secondary_indexing.html and in
>> particular, check out the presentation at the bottom that Jesse put
>> together, as that provides more technical depth. Post 3.0/4.0 release, we
>> (specifically Rajeshbabu) plan to also support an orthogonal "local index"
>> mechanism where the index data and table data are colocated on the same RS.
>> He has a nice writeup on it here: https://github.com/Huawei-Hadoop/hindex
>>
>
> I had a look, thanks.
> But this makes me wonder - isn't this, at least conceptually, then very,
> very, very similar to saying "Here is my HBase table, let me hook up Lucene
> and build side-car Lucene index(es), so I can use them for queries that
> would be better handled through the index Lucene provides instead of doing
> indexless scans against the primary HBase table"?  Sure, there is no
>> inverted index in the hindex or in secondary indexing, but isn't the concept
> the same?
> If yes, why not just use Lucene for example?
>

The difference is the level at which these secondary index solutions are
integrated with HBase. The HIndex solution is going to be awesome on the
write side of the fence - basically you're not hit with any additional cost
for writes. The crux of their solution is a custom load balancer that
ensures that the index data remains colocated with the corresponding table
data. There are tons of corner cases to get right, and using Lucene, I
suspect it would be difficult to get those right, b/c you'd have to integrate
at a higher level.

Our current secondary index solution works well if many more reads are done
than writes. You take no hit at read time - it's as if you had a
different/optimal row key for your query, and you're just querying the
index table instead of the data table (which is really what's happening).


>
> And the reason I'm asking is because we have tried using Lucene-based tech
> for the use case described here.  The problem was the maintenance of the
> index.  Lucene has the notion of segments and these segments need to be
> merged (http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> - nice video showing the merges) as one indexes data, and these merges make
> it very hard to process thousands of inserts per second.
>

That is one fantastic blog post. Wow. I'd recommend first measuring the
insert performance of HBase to see if it's going to meet your needs. HBase
is another implementation of an LSM tree. There are only so many ways to
"skin a cat" (as Lars H likes to say), so I wouldn't expect things to be
drastically different than they were with Lucene (I say, knowing nothing about
Lucene :-) ).


>
> So seeing "secondary indexing" in Phoenix makes me worried that we'd see
> the same problem here.
> Although.... I suspect Phoenix's indexing is faster because it sounds like
> the extra processing is simpler - sounds like the main thing that happens
> is that "special keys" are constructed and inserted in separate HBase
> tables.  So there is no notion of inverted index, segment merges, just the
> usual puts, region splits and compactions, etc.  Right?
>

Yes, that's correct. The extra processing is to look up the prior value of a
row that is being changed. This is required during index maintenance to
issue a delete of the prior index row plus to form the row key of the new
index row (as oftentimes, for an index with multiple columns, you only
have the new column value for one of the columns, the one that changed, and
you need to find the current values for the other ones as well).

Note that we do have an optimization for write-once data where you can
declare at DDL time that your table has IMMUTABLE_ROWS=true. In this case,
we do not need to do incremental index maintenance, as the same row is
never written more than once.
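
A sketch for write-once data such as logs or metrics (names made up):

  CREATE TABLE metrics (
      host VARCHAR NOT NULL,
      ts   DATE NOT NULL,
      cpu  DECIMAL
      CONSTRAINT pk PRIMARY KEY (host, ts))
      IMMUTABLE_ROWS=true;

  -- No prior-value lookup is needed on writes, since rows never change:
  CREATE INDEX metrics_by_ts ON metrics (ts) INCLUDE (cpu);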


>
>>> 2) I assume there is nothing to autodetect new columns being inserted,
>>> so one would have to "trap" insertion of new columns and quickly do CREATE
>>> INDEX?
>>>
>>
>>  See above - indexes are really orthogonal to querying. They just make
>> them faster.
>>
>
> Right, but when the Marketing person says "Hey, let's start tracking <some
> new type of marketing data with new columns>" then I assume something
> should issue CREATE TABLE or CREATE VIEW and/or CREATE INDEX
> statements first, before anything is inserted, right?  Actually, I'm
> guessing CREATE INDEX can be done at a later time, but the first two would
> have to be done before inserts?
>

Yes, that's correct. You can issue the CREATE VIEW or ALTER VIEW on-the-fly
(and even include IF NOT EXISTS if you want to just ignore the case
where the table or column already exists). These DDL statements are not your
typical heavyweight RDBMS calls. They basically boil down to doing a
batched Put on our metadata table.
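
For example (a sketch; names made up):

  -- Safe to issue on-the-fly before inserting; IF NOT EXISTS turns these
  -- into no-ops when the view or column already exists:
  CREATE VIEW IF NOT EXISTS campaign_events AS SELECT * FROM event_base;
  ALTER VIEW campaign_events ADD IF NOT EXISTS campaign_id VARCHAR;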


>
>
>>  3) How realistic is such a system in a large org with a few 10s of
>>> thousands of potential users throwing in millions of rows per day and
>>> expecting responses in a few seconds at most, for simple queries like in my
>>> above example?  Sure, depends on the details and hardware, etc., but is
>>> CREATE INDEX on any one column really realistic?
>>>
>>
>> There's the rub, and why we added multi-tenant support. HBase won't scale
>> past a few hundred tables (at most). In general, HBase is happier with few
>> big tables than lots of small tables. With multi-tenancy, the Phoenix
>> tables share the same physical HBase table. Then each "tenant" has his
>> "view" which is independent of other tenant's views (i.e. columns can be
>> added to one and won't affect the other). The advantage of this (beyond the
>> scaling one) is that each tenant can still introspect their own table to
>> see what columns it has. Plus you get all the nice type checking and error
>> checking that you'd want, AND a tenant can add secondary indexes if they
>> need better perf (these are shared in a second HBase table).
>>
>
> Makes sense.
>
> I was going to ask about partitioning as a way to handle (querying
> against) large volumes of data.  This is related to my Q above about
> date-based partitioning.  But I'm wondering if one can go further.
>  Partitioning by date, partitioning by tenant, but then also partitioning
> by some other columns, which would be different for each type of data being
> inserted. e.g. for sales data maybe the partitions would be date, tenantID,
> but then also customerCountry, customerGender, etc.  For performance
> metrics data maybe it would be date, tenantID, but then also environment
> (prod vs. dev), or applicationType (e.g. my HBase cluster performance
> metrics vs. my Tomcat performance metrics), and so on.
>

Essentially, a secondary index is declaring a partitioning. The indexed
columns make up the row key which in HBase determines the partitioning.


>
> Is there any room for that in Phoenix to further speed up queries? (or
> maybe that's what Impala does?  Didn't fully read up on that yet...)
>

There's always room for better performance! :-) Some at the Phoenix layer,
some at the HBase layer. We work with the HBase folks constantly - they're
super responsive and great to work with. Lots going on there.

Would be great to have you contribute too if this sounds promising.


>
> Thank you for answering all my questions so far, I really appreciate that!
>
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
>>
>>>
>>>
>>>>
>>>> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
>>>> otis.gospodnetic@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> When I saw "Schema on read" my heart jumped because I thought that
>>>>> meant:
>>>>>
>>>>> 1) being able to insert rows without having to define columns ahead of
>>>>> time, and
>>>>>
>>>>> 2) being able to query against any column in a row without having to
>>>>> know which columns one will be searching against.  For example, if a row
>>>>> with "anyRandomColumn" gets added, I could run a query like select ....
>>>>> where anyRandomColumn='foo' and select that row even though I didn't set a
>>>>> secondary index on anyRandomColumn.
>>>>>
>>>>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>>>>> cannot do 2) -- one has to tell it which columns to build indexes on.  Is this
>>>>> correct?
>>>>>
>>>>> Thanks,
>>>>> Otis
>>>>> --
>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Schema on read vs. secondary indexes

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

Sorry, more Qs.  This is exciting stuff!

On Wed, Mar 26, 2014 at 2:29 PM, James Taylor <ja...@apache.org> wrote:

> On Wed, Mar 26, 2014 at 9:17 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi James,
>>
>> On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <ja...@apache.org> wrote:
>>
>>> Hi Otis,
>>> That's an excellent idea. Phoenix does support (1) & (2), but we don't
>>> support adding a secondary index on a dynamic column. However, there's
>>> really no reason why we couldn't - we've never had anyone ask for this. Our
>>> mutable secondary index support is done at the HBase level, so as long as
>>> only Puts and Deletes are done on the data, it should work fine. We'd just
>>> need to add the syntax for dynamic column declaration for CREATE INDEX to
>>> our grammar.
>>>
>>
>> Only Puts in our case.  And batch Deletes when data's TTL is reached, but
>> I assume an external job/process would have to go through HBase rows and
>> delete them, and that would automatically update any data structures and files
>> Phoenix uses?
>>
>
> If you're using the standard batch Delete API, you'd be fine. If you're
> relying on setting a TTL on the HBase table, then you'd need to set the
> same TTL on the index table (and this would be somewhat dangerous, as HBase
> might prune the data table and index table at different times).
>

How about time-based partitioning instead of TTL? e.g. a table for today, a
table for yesterday, etc. (or maybe not "table", but some sort of non-table
partitioning mechanism?).
I'm asking because I assume dropping tables/partitions when they are old
enough to be removed would be faster than having HBase scan everything and
remove based on TTL.

And if such partitioned structure is doable at the HBase level (e.g.
multiple tables), would that work for Phoenix?  For example, if I have
daily partitions/tables, and I query for SELECT model, colour FROM Cars
WHERE saleDate BETWEEN now() and now() - 10 days GROUP BY colour, would
Phoenix be able to query across all 10 daily partitions?

>>> What's the use case you have in mind? Keep in mind too, that adding
>>> secondary indexes has an impact on write performance (from the HBase POV,
>>> you're doing two Puts instead of one and there's some cost associated with
>>> the incremental maintenance).
>>>
>>
>> Think of it as an organization-wide "throw anything in and query it right
>> away database". :)
>>
>
> +1. I like it!
>
>>
>> Ideally, I'd like to be able to have HBase that is schemaless ("throw
>> anything in"), that is multi-tenant ("anyone at my company can start
>> inserting their data"), where each tenant can start adding rows with their
>> own columns ("schemaless"), and then search against any of them right away,
>> without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
>> in an org with such a database service and I decide I want to start
>> tracking the number of user logins in my web application.  So I may decide
>> to start adding rows like this:
>>
>
> Ah, this is an ideal use case for our multi-tenant support. Take a look at
> this: http://phoenix.incubator.apache.org/multi-tenancy.html. So you'd
> create a "base" multi-tenant table with the columns below. Then each tenant
> would create a "view" over this base table and they could evolve
> independently.
>

But imagine there is no base table.  Really, imagine only something like
"timestamp" column is required everywhere, and maybe tenantID, but all
other columns are completely ad-hoc.  That is, somebody in the Sales
department will want to insert rows with sales-specific columns, Marketing
person will shove in marketing data, the IT person may throw in names and
hardware specs and locations of their 10000 servers, the engineer may
decide to stuff performance metrics in this, and the front-end developer
may feed in clickstream data.  So there is no such thing as "base table",
it's really completely open.

Would the multi-tenant approach with a super minimal base table still work,
even though the number of shared/base columns would be very, very small
compared to the total number of different columns across all tenants?

>>   timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>> operatingSystem
>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>> operatingSystem
>>   ...
>>
>> And then I want to switch to some UI where I can enter:
>>
>>   select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
>> group by gender
>>
>> I assume this will work ONLY if somebody does CREATE INDEX on userAgent
>> and countryOfUser first, right?
>>
>
> No, this will work fine with and without an index. An index will just make
> the query execute faster (at the cost of slower writes, of course).
>

Right, but how much slower are reads without an index vs. what is the cost
of maintaining that index at write time?

If you look at my examples above, some of them will be low volume (e.g.
sales figures), but some will be *crazy* high volume with many thousands of
inserts per second across hundreds or more tenants.
So data volume will grow very very fast.
As such, I'm guessing searching against non-indexed columns will go well
beyond a second or two.
But I don't have a sense of the index write cost (disk IO, CPU, disk space,
etc.) and memory cost (I'm assuming some portion of the index is read into
memory/heap and cached?)

Are there any guidelines or anecdotes or benchmarks that would give one
some rough ideas about this?

>> Assuming the answer is yes:
>> 1) How much overhead does this add in terms of disk space or heap/memory
>> or CPU used for creation/updates of this index?
>>
>
> Take a look at this:
> http://phoenix.incubator.apache.org/secondary_indexing.html and in
> particular, check out the presentation at the bottom that Jesse put
> together, as that provides more technical depth. Post 3.0/4.0 release, we
> (specifically Rajeshbabu) plan to also support an orthogonal "local index"
> mechanism where the index data and table data are colocated on the same RS.
> He has a nice writeup on it here: https://github.com/Huawei-Hadoop/hindex
>

I had a look, thanks.
But this makes me wonder - isn't this, at least conceptually, then very,
very, very similar to saying "Here is my HBase table, let me hook up Lucene
and build side-car Lucene index(es), so I can use them for queries that
would be better handled through the index Lucene provides instead of doing
indexless scans against the primary HBase table"?  Sure, there is no
inverted index in the hindex or in secondary indexing, but isn't the concept
the same?
If yes, why not just use Lucene for example?

And the reason I'm asking is because we have tried using Lucene-based tech
for the use case described here.  The problem was the maintenance of the
index.  Lucene has the notion of segments and these segments need to be
merged (http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
- nice video showing the merges) as one indexes data, and these merges make
it very hard to process thousands of inserts per second.

So seeing "secondary indexing" in Phoenix makes me worried that we'd see
the same problem here.
Although.... I suspect Phoenix's indexing is faster because it sounds like
the extra processing is simpler - sounds like the main thing that happens
is that "special keys" are constructed and inserted in separate HBase
tables.  So there is no notion of inverted index, segment merges, just the
usual puts, region splits and compactions, etc.  Right?


>> 2) I assume there is nothing to autodetect new columns being inserted, so
>> one would have to "trap" insertion of new columns and quickly do CREATE
>> INDEX?
>>
>
> See above - indexes are really orthogonal to querying. They just make them
> faster.
>

Right, but when the Marketing person says "Hey, let's start tracking <some
new type of marketing data with new columns>" then I assume something
should issue CREATE TABLE or CREATE VIEW and/or CREATE INDEX
statements first, before anything is inserted, right?  Actually, I'm
guessing CREATE INDEX can be done at a later time, but the first two would
have to be done before inserts?


> 3) How realistic is such a system in a large org with a few 10s of
>> thousands of potential users throwing in millions of rows per day and
>> expecting responses in a few seconds at most, for simple queries like in my
>> above example?  Sure, depends on the details and hardware, etc., but is
>> CREATE INDEX on any one column really realistic?
>>
>
> There's the rub, and why we added multi-tenant support. HBase won't scale
> past a few hundred tables (at most). In general, HBase is happier with few
> big tables than lots of small tables. With multi-tenancy, the Phoenix
> tables share the same physical HBase table. Then each "tenant" has his
> "view" which is independent of other tenant's views (i.e. columns can be
> added to one and won't affect the other). The advantage of this (beyond the
> scaling one) is that each tenant can still introspect their own table to
> see what columns it has. Plus you get all the nice type checking and error
> checking that you'd want, AND a tenant can add secondary indexes if they
> need better perf (these are shared in a second HBase table).
>

Makes sense.

I was going to ask about partitioning as a way to handle (querying against)
large volumes of data.  This is related to my Q above about date-based
partitioning.  But I'm wondering if one can go further.  Partitioning by
date, partitioning by tenant, but then also partitioning by some other
columns, which would be different for each type of data being inserted.
e.g. for sales data maybe the partitions would be date, tenantID, but then
also customerCountry, customerGender, etc.  For performance metrics data
maybe it would be date, tenantID, but then also environment (prod vs. dev),
or applicationType (e.g. my HBase cluster performance metrics vs. my Tomcat
performance metrics), and so on.

Is there any room for that in Phoenix to further speed up queries? (or
maybe that's what Impala does?  Didn't fully read up on that yet...)

Thank you for answering all my questions so far, I really appreciate that!

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




>
>>
>>
>>>
>>> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
>>> otis.gospodnetic@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When I saw "Schema on read" my heart jumped because I thought that
>>>> meant:
>>>>
>>>> 1) being able to insert rows without having to define columns ahead of
>>>> time, and
>>>>
>>>> 2) being able to query against any column in a row without having to
>>>> know which columns one will be searching against.  For example, if a row
>>>> with "anyRandomColumn" gets added, I could run a query like select ....
>>>> where anyRandomColumn='foo' and select that row even though I didn't set a
>>>> secondary index on anyRandomColumn.
>>>>
>>>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>>>> cannot do 2) -- one has to tell it which columns to build indexes on.  Is this
>>>> correct?
>>>>
>>>> Thanks,
>>>> Otis
>>>> --
>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>
>>
>

Re: Schema on read vs. secondary indexes

Posted by James Taylor <ja...@apache.org>.
On Wed, Mar 26, 2014 at 9:17 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi James,
>
> On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <ja...@apache.org> wrote:
>
>> Hi Otis,
>> That's an excellent idea. Phoenix does support (1) & (2), but we don't
>> support adding a secondary index on a dynamic column. However, there's
>> really no reason why we couldn't - we've never had anyone ask for this. Our
>> mutable secondary index support is done at the HBase level, so as long as
>> only Puts and Deletes are done on the data, it should work fine. We'd just
>> need to add the syntax for dynamic column declaration for CREATE INDEX to
>> our grammar.
>>
>
> Only Puts in our case.  And batch Deletes when data's TTL is reached, but
> I assume an external job/process would have to go through HBase rows and
> delete them, and that would automatically update any data structures and files
> Phoenix uses?
>

If you're using the standard batch Delete API, you'd be fine. If you're
relying on setting a TTL on the HBase table, then you'd need to set the
same TTL on the index table (and this would be somewhat dangerous, as HBase
might prune the data table and index table at different times).


>
>
>> What's the use case you have in mind? Keep in mind too, that adding
>> secondary indexes has an impact on write performance (from the HBase POV,
>> you're doing two Puts instead of one and there's some cost associated with
>> the incremental maintenance).
>>
>
> Think of it as an organization-wide "throw anything in and query it right
> away database". :)
>

+1. I like it!

>
> Ideally, I'd like to be able to have HBase that is schemaless ("throw
> anything in"), that is multi-tenant ("anyone at my company can start
> inserting their data"), where each tenant can start adding rows with their
> own columns ("schemaless"), and then search against any of them right away,
> without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
> in an org with such a database service and I decide I want to start
> tracking the number of user logins in my web application.  So I may decide
> to start adding rows like this:
>

Ah, this is an ideal use case for our multi-tenant support. Take a look at
this: http://phoenix.incubator.apache.org/multi-tenancy.html. So you'd
create a "base" multi-tenant table with the columns below. Then each tenant
would create a "view" over this base table and they could evolve
independently.


>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
> operatingSystem
>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
> operatingSystem
>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
> operatingSystem
>   ...
>
> And then I want to switch to some UI where I can enter:
>
>   select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
> group by gender
>
> I assume this will work ONLY if somebody does CREATE INDEX on userAgent
> and countryOfUser first, right?
>

No, this will work fine with and without an index. An index will just make
the query execute faster (at the cost of slower writes, of course).


>
> Assuming the answer is yes:
> 1) How much overhead does this add in terms of disk space or heap/memory
> or CPU used for creation/updates of this index?
>

Take a look at this:
http://phoenix.incubator.apache.org/secondary_indexing.html and in
particular, check out the presentation at the bottom that Jesse put
together, as that provides more technical depth. Post 3.0/4.0 release, we
(specifically Rajeshbabu) plan to also support an orthogonal "local index"
mechanism where the index data and table data are colocated on the same RS.
He has a nice writeup on it here: https://github.com/Huawei-Hadoop/hindex


> 2) I assume there is nothing to autodetect new columns being inserted, so
> one would have to "trap" insertion of new columns and quickly do CREATE
> INDEX?
>

See above - indexes are really orthogonal to querying. They just make them
faster.


> 3) How realistic is such a system in a large org with a few 10s of
> thousands of potential users throwing in millions of rows per day and
> expecting responses in a few seconds at most, for simple queries like in my
> above example?  Sure, depends on the details and hardware, etc., but is
> CREATE INDEX on any one column really realistic?
>

There's the rub, and why we added multi-tenant support. HBase won't scale
past a few hundred tables (at most). In general, HBase is happier with few
big tables than lots of small tables. With multi-tenancy, the Phoenix
tables share the same physical HBase table. Then each "tenant" has his
"view" which is independent of other tenant's views (i.e. columns can be
added to one and won't affect the other). The advantage of this (beyond the
scaling one) is that each tenant can still introspect their own table to
see what columns it has. Plus you get all the nice type checking and error
checking that you'd want, AND a tenant can add secondary indexes if they
need better perf (these are shared in a second HBase table).

Thanks,
James

>
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
>>
>> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> When I saw "Schema on read" my heart jumped because I thought that meant:
>>>
>>> 1) being able to insert rows without having to define columns ahead of
>>> time, and
>>>
>>> 2) being able to query against any column in a row without having to
>>> know which columns one will be searching against.  For example, if a row
>>> with "anyRandomColumn" gets added, I could run a query like select ....
>>> where anyRandomColumn='foo' and select that row even though I didn't set a
>>> secondary index on anyRandomColumn.
>>>
>>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>>> cannot do 2) -- one has to tell it which columns to build indexes on.  Is this
>>> correct?
>>>
>>> Thanks,
>>> Otis
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>
>

Re: Schema on read vs. secondary indexes

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi James,

On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <ja...@apache.org> wrote:

> Hi Otis,
> That's an excellent idea. Phoenix does support (1) & (2), but we don't
> support adding a secondary index on a dynamic column. However, there's
> really no reason why we couldn't - we've never had anyone ask for this. Our
> mutable secondary index support is done at the HBase level, so as long as
> only Puts and Deletes are done on the data, it should work fine. We'd just
> need to add the syntax for dynamic column declaration for CREATE INDEX to
> our grammar.
>

Only Puts in our case.  And batch Deletes when data's TTL is reached, but I
assume an external job/process would have to go through HBase rows and
delete them, and that would automatically update any data structures and files
Phoenix uses?


> What's the use case you have in mind? Keep in mind too, that adding
> secondary indexes has an impact on write performance (from the HBase POV,
> you're doing two Puts instead of one and there's some cost associated with
> the incremental maintenance).
>

Think of it as an organization-wide "throw anything in and query it right
away database". :)

Ideally, I'd like to be able to have HBase that is schemaless ("throw
anything in"), that is multi-tenant ("anyone at my company can start
inserting their data"), where each tenant can start adding rows with their
own columns ("schemaless"), and then search against any of them right away,
without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
in an org with such a database service and I decide I want to start
tracking the number of user logins in my web application.  So I may decide
to start adding rows like this:

  timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
  timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
  timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
  ...

And then I want to switch to some UI where I can enter:

  select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
group by gender

I assume this will work ONLY if somebody does CREATE INDEX on userAgent and
countryOfUser first, right?

Assuming the answer is yes:
1) How much overhead does this add in terms of disk space or heap/memory or
CPU used for creation/updates of this index?
2) I assume there is nothing to autodetect new columns being inserted, so
one would have to "trap" insertion of new columns and quickly do CREATE
INDEX?
3) How realistic is such a system in a large org with a few 10s of
thousands of potential users throwing in millions of rows per day and
expecting responses in a few seconds at most, for simple queries like in my
above example?  Sure, depends on the details and hardware, etc., but is
CREATE INDEX on any one column really realistic?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




>
> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi,
>>
>> When I saw "Schema on read" my heart jumped because I thought that meant:
>>
>> 1) being able to insert rows without having to define columns ahead of
>> time, and
>>
>> 2) being able to query against any column in a row without having to know
>> which columns one will be searching against.  For example, if a row with
>> "anyRandomColumn" gets added, I could run a query like select .... where
>> anyRandomColumn='foo' and select that row even though I didn't set a
>> secondary index on anyRandomColumn.
>>
>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>> cannot do 2) -- one has to tell it which columns to build indexes on.  Is this
>> correct?
>>
>> Thanks,
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>

Re: Schema on read vs. secondary indexes

Posted by James Taylor <ja...@apache.org>.
Hi Otis,
That's an excellent idea. Phoenix does support (1) & (2), but we don't
support adding a secondary index on a dynamic column. However, there's
really no reason why we couldn't - we've never had anyone ask for this. Our
mutable secondary index support is done at the HBase level, so as long as
only Puts and Deletes are done on the data, it should work fine. We'd just
need to add the syntax for dynamic column declaration for CREATE INDEX to
our grammar.
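
For reference, a sketch of the existing dynamic column syntax that already
covers (1) and (2) (table and column names are made up):

  CREATE TABLE event_log (event_id BIGINT PRIMARY KEY);

  -- (1) Declare anyRandomColumn at UPSERT time rather than in the DDL:
  UPSERT INTO event_log (event_id, anyRandomColumn VARCHAR)
      VALUES (1, 'foo');

  -- (2) Declare it again at query time; no CREATE INDEX is required
  -- (without one, Phoenix falls back to a scan):
  SELECT event_id FROM event_log (anyRandomColumn VARCHAR)
      WHERE anyRandomColumn = 'foo';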

What's the use case you have in mind? Keep in mind too, that adding
secondary indexes has an impact on write performance (from the HBase POV,
you're doing two Puts instead of one and there's some cost associated with
the incremental maintenance).

Thanks,
James


On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> When I saw "Schema on read" my heart jumped because I thought that meant:
>
> 1) being able to insert rows without having to define columns ahead of
> time, and
>
> 2) being able to query against any column in a row without having to know
> which columns one will be searching against.  For example, if a row with
> "anyRandomColumn" gets added, I could run a query like select .... where
> anyRandomColumn='foo' and select that row even though I didn't set a
> secondary index on anyRandomColumn.
>
> But after reading a bit about Phoenix I think Phoenix can do 1), but
> cannot do 2) -- one has to tell it which columns to build indexes on.  Is this
> correct?
>
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>