Posted to dev@accumulo.apache.org by z11373 <z1...@outlook.com> on 2015/08/17 17:36:42 UTC

sharding via different tables

Hi,
We have a requirement to shard by customer id. I see two options:
1. use the customer id as the column family
2. create separate tables for each customer id

The downside of option #1 is that deleting the rows for a specific customer
id would be pretty expensive (with option #2, it's as simple as deleting the
tables). I'm also not sure whether scans would be slower, though we can
filter by column family and Accumulo is optimized for that.

The downside of option #2 is that as we add customers, we'll end up with a
lot of tables. The current implementation needs 4 tables, so we'll have at
least (# of customers * 4) tables in Accumulo. Does Accumulo have a limit on
the number of tables?

I personally prefer option #2, but perhaps some of you have direct experience
with this kind of issue and can share what you learned.

Thanks,
Z



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/sharding-via-different-tables-tp14884.html
Sent from the Developers mailing list archive at Nabble.com.

Re: sharding via different tables

Posted by Christopher <ct...@apache.org>.

I'd expect performance to be slightly better with separate tables than
with locality groups, because managing locality groups, while relatively
cheap, is not entirely free.

Namespaces work like a table prefix, but also provide a means to
easily configure all of their tables at once. So, they're either
slightly or significantly better than a table prefix, depending on
your needs.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Mon, Aug 17, 2015 at 3:01 PM, z11373 <z1...@outlook.com> wrote:
> Thanks, Christopher, for the valuable insight.
> Right now we don't have a scenario that needs to query data from multiple
> customers at once. We might some time in the future, but that 'future'
> could be years from now (or perhaps never), so I am inclined to implement
> them as separate tables for now.
>
> Though they are in separate tables, I will still apply a column visibility
> to each row in the table. The visibility string could be something like the
> customer id. The caller will be another app of ours, so we can trust it
> (it still needs to pass that customer id as the authz string).
>
> In terms of scan performance, is it true that it won't matter much whether
> we shard by column family or by table, since I'd think we can also create a
> separate locality group for each column family?
>
> Thanks for the tip on using namespaces. Originally I was thinking of
> prefixing the table names with the customer id. I guess there is no
> difference, right?
>
> Thanks,
> Z

Re: sharding via different tables

Posted by z11373 <z1...@outlook.com>.

Thanks, Christopher, for the valuable insight.
Right now we don't have a scenario that needs to query data from multiple
customers at once. We might some time in the future, but that 'future'
could be years from now (or perhaps never), so I am inclined to implement
them as separate tables for now.

Though they are in separate tables, I will still apply a column visibility
to each row in the table. The visibility string could be something like the
customer id. The caller will be another app of ours, so we can trust it
(it still needs to pass that customer id as the authz string).

In terms of scan performance, is it true that it won't matter much whether
we shard by column family or by table, since I'd think we can also create a
separate locality group for each column family?

Thanks for the tip on using namespaces. Originally I was thinking of
prefixing the table names with the customer id. I guess there is no
difference, right?

Thanks,
Z




Re: sharding via different tables

Posted by Christopher <ct...@apache.org>.

On Mon, Aug 17, 2015 at 11:36 AM, z11373 <z1...@outlook.com> wrote:
> Hi,
> We have a requirement to shard by customer id. I see two options:
> 1. use the customer id as the column family
> 2. create separate tables for each customer id
>
> The downside of option #1 is that deleting the rows for a specific customer
> id would be pretty expensive (with option #2, it's as simple as deleting the
> tables). I'm also not sure whether scans would be slower, though we can
> filter by column family and Accumulo is optimized for that.
>
> The downside of option #2 is that as we add customers, we'll end up with a
> lot of tables. The current implementation needs 4 tables, so we'll have at
> least (# of customers * 4) tables in Accumulo. Does Accumulo have a limit on
> the number of tables?
>
> I personally prefer option #2, but perhaps some of you have direct experience
> with this kind of issue and can share what you learned.

First, regarding your question: No, Accumulo does not have any limits
on the number of tables. More tables means more stored per-table state
in ZooKeeper (and possibly more in the metadata tables), so that's
something to keep in mind, but you're not likely to run into problems
creating a few hundred tables.

Second, you have more options than that for table schemas. It really
depends on your goals, though.

Will you ever need to query data from multiple customers at once? If
not, separate tables might be an option. Since you need 4 tables per
customer, you could also do one namespace per customer, each
namespace with its own set of 4 (or more) tables.
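
A minimal sketch of what the namespace-per-customer layout could look like,
written in plain Python so that it does not depend on any Accumulo client.
The four table suffixes, the "cust_" prefix, and both function names are
hypothetical (the thread does not name the tables), and the naming rule in
the check (word characters only, joined as "<namespace>.<table>") is an
assumption to verify against your Accumulo version.

```python
import re

# Hypothetical names for the four per-customer tables; the thread does not name them.
TABLE_SUFFIXES = ["data", "index", "stats", "meta"]


def customer_tables(customer_id, suffixes=TABLE_SUFFIXES):
    """Return the namespace-qualified table names for one customer."""
    namespace = "cust_" + customer_id
    # Assumption: namespace names may contain only letters, digits, and '_'.
    if not re.fullmatch(r"\w+", namespace, re.ASCII):
        raise ValueError("customer id must contain only letters, digits, or '_'")
    return [namespace + "." + suffix for suffix in suffixes]


def total_tables(num_customers, tables_per_customer=4):
    """Table count grows linearly: customers * tables per customer."""
    return num_customers * tables_per_customer
```

With this layout, removing a customer would amount to deleting that
customer's tables and namespace, and total_tables gives a quick estimate of
how much per-table state the cluster would have to carry.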

If you expect that you will ever need to query data from multiple
customers at once (or if you find it easier to manage fewer tables),
you may want to consider putting your data in a single set of tables.
You can separate data using the visibility (one
authorization per customer data set), column family (for performance,
you can create a locality group per customer/column family), or you
can segment the table by prefixing your rows with the customer so that
each customer's data is logically separate. You can also combine the
visibility/authorization strategy with another strategy, if you want
to enforce some access controls and facilitate query performance.
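
The row-prefix strategy above can be sketched as follows, again in plain
Python with no Accumulo dependency. It relies on rows being sorted
lexicographically by bytes, so all of one customer's rows fall into a single
contiguous range. The separator byte and the function names are assumptions
made for illustration, not part of any Accumulo API.

```python
def customer_row(customer_id, row_id, sep="\x00"):
    """Build a row key of the form <customer id><separator><original row>."""
    return (customer_id + sep + row_id).encode("utf-8")


def prefix_range(customer_id, sep="\x00"):
    """Return (start, end) byte bounds, end exclusive, covering one customer."""
    start = (customer_id + sep).encode("utf-8")
    end = bytearray(start)
    # Drop trailing 0xff bytes, then increment the last remaining byte to get
    # the smallest byte string that sorts after every key with this prefix.
    while end and end[-1] == 0xFF:
        end.pop()
    if not end:
        return start, None  # no upper bound exists for an all-0xff prefix
    end[-1] += 1
    return start, bytes(end)
```

The separator keeps one customer id from being a prefix of another (for
example "acme" and "acme2"), which is why the range is built from the id
plus the separator rather than from the id alone.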

The biggest driving factors for your schema should really be: "How do
I expect to query this data?" and "How do I want to protect this
data?"
If you only ever query customer data separately, and you're okay with
protecting the data at the application layer (when it selects which
table to read from), then separate tables only is probably sufficient.
Otherwise, there are many more options.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii