Posted to user@accumulo.apache.org by Russ Weeks <rw...@newbrightidea.com> on 2014/04/06 09:16:21 UTC

RowID format tradeoffs

Hi,

I'm looking for advice re. the best way to structure my row IDs.
Monotonically increasing IDs have the very appealing property that I can
quickly scan all recently-ingested unprocessed rows, particularly because I
maintain a "checkpoint" of the most-recently processed row.

Of course, the problem with increasing IDs is that it's the lowest-order
bits which are changing, which (I think?) means it's less optimal for
distributing data across my cluster. I guess that the ways to get around
this are to either reverse the ID or to define partitions, and use the
partition ID as the high-order bits of the row id? Reversing the ID will
destroy the property I describe above; I guess that using partitions may
preserve it as long as I use a BatchScanner, but would a BatchScanner play
nicely with AccumuloInputFormat? So many questions.

Anyway, I think there's a pretty good chance that I'm missing something
obvious in this analysis. For instance, if it's easy to "rebalance" the
data across my tablet servers periodically, then I'd probably just stick
with increasing IDs.

Very interested to hear your advice, or the pros and cons of any of these
approaches.

Thanks,
-Russ

Re: RowID format tradeoffs

Posted by Christopher <ct...@apache.org>.
Russ-

Yeah, there's probably some relationship between numShards and
numTServers. Something like numShards >= n * numTServers, where n >=
1. You may want to experiment with different values of n.

As David alludes to, there's probably some penalty if n is too high.
I'd suggest some experimentation for your circumstance.

David-

The relationship is: numShards (the number of distinct hash(row) % numShards
values) <= numTablets. You can plan for additional nodes by choosing a higher
initial n, or by adding a new prefix to your ShardIDs for new data. Your
query would then have to know about both the old shard format and the new
shard format, to include all the relevant ranges for batch scanning /
map-reducing.
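
(A rough sketch of what that query-side awareness could look like, assuming
an existing BatchScanner and using invented shard counts, a "v2_" prefix for
the new format, zero-padded shard IDs, and startDate/endDate strings; none
of these details are prescribed above:)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.accumulo.core.data.Range;

    // Cover both the old "ShardID_IngestDate" rows and the newer
    // "v2_ShardID_IngestDate" rows in a single query.
    List<Range> ranges = new ArrayList<Range>();
    for (int shard = 0; shard < oldNumShards; shard++) {
      String p = String.format("%02d_", shard);
      ranges.add(new Range(p + startDate, p + endDate));
    }
    for (int shard = 0; shard < newNumShards; shard++) {
      String p = "v2_" + String.format("%02d_", shard);
      ranges.add(new Range(p + startDate, p + endDate));
    }
    batchScanner.setRanges(ranges);  // or AccumuloInputFormat.setRanges(job, ranges)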

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: RowID format tradeoffs

Posted by David Medinets <da...@gmail.com>.
What is the penalty if you set the number of shards to 999 when you only
have 20 nodes?
How is the shard number related to the number of tablets?
What would happen if you doubled the number of nodes?



Re: RowID format tradeoffs

Posted by Russ Weeks <rw...@newbrightidea.com>.
Hi, Chris,

Thanks for your response, sorry it's taken me so long to process it.

I guess there needs to be some sort of relationship between the number of
shards and the number of tablet servers, right? Would you typically set
numShards to be the greatest # of tablet servers you'd anticipate needing,
and then maintain a mapping on the client side to say, "OK, right now my
optimal range set is
([shard0_mindate...shard15_maxdate],[shard16_mindate...shard32_maxdate]...)"?
Do you see what I'm getting at? Because it seems like the alternative is to
re-encode all your row IDs as the number of shards changes, and I'd
*really* like my row ids to be immutable.

Right now we're using a distributed service very similar to Snowflake[1]
for row ID generation. It's not exactly monotonically increasing, but
nearly so.

-Russ

[1]: https://github.com/twitter/snowflake/



Re: RowID format tradeoffs

Posted by Christopher <ct...@apache.org>.
You could try sharding:

If your RowID is ingest date (to keep the ability to scan over recently
ingested data, as you describe), you could use a RowID of
"ShardID_IngestDate" instead, where:

ShardID = hash(row) % numShards

This will result in numShards rows for each IngestDate; numShards is a
value you choose to suit your cluster. You can pre-split your table, one
split per ShardID, for better ingest and query performance.
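
(For illustration only, a sketch of how such a row ID could be built and the
table pre-split on the shard prefixes; the zero-padding, date precision,
hash choice, table name, and the recordKey/connector variables are my
assumptions, not part of the suggestion above:)

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TreeSet;
    import org.apache.hadoop.io.Text;

    int numShards = 16;  // pick something appropriate to your cluster
    String ingestDate = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    // Hash something stable about the record, not the finished row ID itself.
    int shard = (recordKey.hashCode() & 0x7fffffff) % numShards;
    String rowId = String.format("%02d", shard) + "_" + ingestDate;

    // Pre-split the table so each shard prefix can land on its own tablet.
    TreeSet<Text> splits = new TreeSet<Text>();
    for (int i = 1; i < numShards; i++) {
      splits.add(new Text(String.format("%02d_", i)));
    }
    connector.tableOperations().addSplits("events", splits);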

As for AccumuloInputFormat, it uses a regular scanner internally, but
it supports multiple ranges, just like the BatchScanner, creating a
separate mapper for each range. All you need to do is query numShards
ranges.
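
(A sketch of the query side under the same assumed layout; the date window,
table name, and thread count are placeholders of mine:)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.security.Authorizations;

    // One range per shard, all covering the same ingest-date window.
    List<Range> ranges = new ArrayList<Range>();
    for (int shard = 0; shard < numShards; shard++) {
      String p = String.format("%02d_", shard);
      ranges.add(new Range(p + "20140401", p + "20140407"));
    }
    BatchScanner bs = connector.createBatchScanner("events", Authorizations.EMPTY, 8);
    bs.setRanges(ranges);
    // For map-reduce, hand the same ranges to the input format instead:
    // AccumuloInputFormat.setRanges(job, ranges);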

Note: It sounds like you're currently using a 1-up increasing value
for the current RowID. You may want to consider using IngestDate as
described above (to whatever degree of precision you need, e.g.
YYYYMMddHHmmss...) or similar. This lets you avoid maintaining a
counter synchronized across your ingesters, and could help you scale
your ingest by parallelizing it with fewer concurrency issues. It also
lets you analyze your ingest performance over time and run queries
like "data processed in the last day". However, you'll lose the ability
to ask for the "last 100 rows processed". The sharding approach works
with either, though.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: RowID format tradeoffs

Posted by Ariel Valentin <ar...@arielvalentin.com>.
Russ,

I experienced the same problem. In the end, what we decided to do was to
take another property, use it as a prefix, and then pre-split the tables,
e.g. apples\0454316778.
We still have situations where nodes run hot during peak usage, but we are
able to live with it.
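
(A minimal sketch of that approach, assuming a small known set of prefix
values; the prefixes, table name, and connector variable are mine:)

    import java.util.TreeSet;
    import org.apache.hadoop.io.Text;

    // Row ID = prefix + null byte + the original (roughly monotonic) ID,
    // with the table pre-split on the prefixes so each one gets its own tablets.
    String rowId = "apples" + "\0" + "454316778";

    TreeSet<Text> splits = new TreeSet<Text>();
    for (String prefix : new String[] {"apples", "bananas", "oranges"}) {
      splits.add(new Text(prefix + "\0"));
    }
    connector.tableOperations().addSplits("inventory", splits);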

Thanks,
Ariel
---
Sent from my mobile device. Please excuse any errors.
