Posted to user@hbase.apache.org by Flavio Pompermaier <po...@okkam.it> on 2013/07/02 18:13:25 UTC

Help in designing row key

Hi to everybody,

in my use case I have to perform batch analysis skipping old data.
For example, I want to process all rows created after a certain timestamp,
passed as a parameter.

What is the most effective way to do this?
Should I design my row-key to embed the timestamp?
Or is just filtering by the timestamp of the row fast as well? Or what else?

Initially I was thinking to compose my key as:
timestamp|source|title|type

but:

1) Using a timestamp in row-keys is discouraged
2) If this design is ok, with this approach I still have problems
filtering by timestamp because I cannot find a way to filter numerically
(instead of alphanumerically/by string). Example:
1372776400441|something has a lesser timestamp
than 1372778470913|somethingelse, but I cannot filter all rows whose key is
"numerically" greater than 1372776400441. Is it possible to overcome this
issue?
3) If this design is not ok, should I filter by a simpler row-key plus a
filter on timestamp? Or what else?

Best,
Flavio

Re: Help in designing row key

Posted by Flavio Pompermaier <po...@okkam.it>.
Yes, I saw it. I followed Ted's advice to use
scan.setTimeRange(sometimestamp, Long.MAX_VALUE)
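
For reference, a minimal sketch of the resulting scan (0.94-era client API;
"mytable" and the sometimestamp variable are placeholders):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");
Scan scan = new Scan();
// Only cells written at or after sometimestamp are returned.
scan.setTimeRange(sometimestamp, Long.MAX_VALUE);
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result r : scanner) {
    // process each row that has data newer than sometimestamp
  }
} finally {
  scanner.close();
  table.close();
}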

On Wed, Jul 3, 2013 at 11:23 PM, Asaf Mesika <as...@gmail.com> wrote:

> Seems right. You can make it more efficient by creating your result array
> in advance and then filling it.
> Regarding time filtering: have you seen that in Scan you can set a start
> time and an end time?
>
> On Wednesday, July 3, 2013, Flavio Pompermaier wrote:
>
> > All my enums produce positive integers so I don't have +/-ve integer
> > problems.
> > Obviously, if I use fixed-length rowKeys I could take away the separator..
> >
> > Sorry, but I'm a newbie in this field.. I'm trying to understand how to
> > compose my key with the bytes..
> > Is the following correct?
> >
> > final byte[] firstToken = Bytes.toBytes(source);
> > final byte[] secondToken = Bytes.toBytes(type);
> > final byte[] thirdToken = Bytes.toBytes(qualifier);
> > final byte[] fourthToken = Bytes.toBytes(md5ofSomeString);
> > byte[] rowKey = Bytes.add(firstToken,secondToken,thirdToken);
> > rowKey =  Bytes.add(rowKey,fourthToken);
> >
> > Best,
> > Flavio
> >
> >
> > On Wed, Jul 3, 2013 at 11:58 AM, Anoop John <an...@gmail.com>
> wrote:
> >
> > > When you make the RK and convert the int parts into byte[] (use
> > > org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
> > > for every int.. Be careful about the ordering... When you convert a +ve
> > > and a -ve integer into byte[] and do a lexicographical compare (as done
> > > in HBase), you will see the -ve number being greater than the +ve.. If
> > > you don't have to deal with -ve numbers, no issues :)
> > >
> > > Well, when all the parts of the RK are of fixed width, will you need
> > > any separator??
> > >
> > > -Anoop-
> > >
> > > On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <
> pompermaier@okkam.it
> > > >wrote:
> > >
> > > > Yeah, I was thinking to use a normalization step in order to allow
> > > > the use of FuzzyRowFilter, but what is not clear to me is whether
> > > > integers must also be normalized or not.
> > > > I will explain myself better. Suppose that I follow your advice and I
> > > > produce keys like:
> > > >  - 1|1|somehash|sometimestamp
> > > >  - 55|555|somehash|sometimestamp
> > > >
> > > > Would they match the same pattern, or do I have to normalize them to
> > > > the following?
> > > >  - 001|001|somehash|sometimestamp
> > > >  - 055|555|somehash|sometimestamp
> > > >
> > > > Moreover, I noticed that you used dots ('.') to separate things
> > > > instead of pipe ('|').. is there a reason for that (maybe performance
> > > > or whatever) or is it just your favourite separator?
> > > >
> > > > Best,
> > > > Flavio
> > > >
> > > >
> > > > On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
> > > >
> > > > > I'm not sure if you're eliding this fact or not, but you'd be much
> > > > > better off if you used a fixed-width format for your keys. So in
> your
> > > > > example, you'd have:
> > > > >
> > > > > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed
> 128-bit
> > > > > hash.8-byte timestamp
> > > > >
> > > > > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> > > > >
> > > > > The advantage of this is not only that it's significantly less data
> > > > > (remember your key is stored on each KeyValue), but also you can
> now
> > > > > use FuzzyRowFilter and other techniques to quickly perform scans.
> The
> > > > > disadvantage is that you have to normalize the source-> integer
> but I
> > > > > find I can either store that in an enum or cache it for a long time
> > so
> > > > > it's not a big issue.
> > > > >
> > > > > -Mike
> > > > >
> > > > > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <
> > > pompermaier@okkam.it
> > > > >
> > > > > wrote:
> > > > > > Thank you very much for the great support!
> > > > > > This is how I thought to design my key:
> > > > > >
> > > > > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > > > > EXAMPLE:
> > > > > >
> > google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > > > > >
> > > > > > Do you think my key could be good for my scope (my search will be
> > > > > > essentially by source or source|type)?
> > > > > > Another point is that initially I will not have so many sources,
> > so I
> > > > > will
> > > > > > probably have only google|* but in the next phases there could be
> > > more
> > > > > > sources..
> > > > > >
> > > > > > Best,
> > > > > > Flavio
> > > > > >
> > > > > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > > >
> > > > > >> For #1, yes - the client receives less data after filtering.
> > > > > >>
> > > > > >> For #2, please take a look at TestMultiVersions
> > > > > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java
> in
> > > > 0.94)
> > > > > >> for time range:
> > > > >
>

Re: Help in designing row key

Posted by Asaf Mesika <as...@gmail.com>.
Seems right. You can make it more efficient by creating your result array
in advance and then filling it.
Regarding time filtering: have you seen that in Scan you can set a start
time and an end time?

On Wednesday, July 3, 2013, Flavio Pompermaier wrote:

> All my enums produce positive integers so I don't have +/-ve integer
> problems.
> Obviously, if I use fixed-length rowKeys I could take away the separator..
>
> Sorry, but I'm a newbie in this field.. I'm trying to understand how to
> compose my key with the bytes..
> Is the following correct?
>
> final byte[] firstToken = Bytes.toBytes(source);
> final byte[] secondToken = Bytes.toBytes(type);
> final byte[] thirdToken = Bytes.toBytes(qualifier);
> final byte[] fourthToken = Bytes.toBytes(md5ofSomeString);
> byte[] rowKey = Bytes.add(firstToken,secondToken,thirdToken);
> rowKey =  Bytes.add(rowKey,fourthToken);
>
> Best,
> Flavio
>
>
> On Wed, Jul 3, 2013 at 11:58 AM, Anoop John <an...@gmail.com> wrote:
>
> > When you make the RK and convert the int parts into byte[] (use
> > org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
> > for every int.. Be careful about the ordering... When you convert a +ve
> > and a -ve integer into byte[] and do a lexicographical compare (as done
> > in HBase), you will see the -ve number being greater than the +ve.. If
> > you don't have to deal with -ve numbers, no issues :)
> >
> > Well, when all the parts of the RK are of fixed width, will you need
> > any separator??
> >
> > -Anoop-
> >
> > On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <pompermaier@okkam.it
> > >wrote:
> >
> > > Yeah, I was thinking to use a normalization step in order to allow
> > > the use of FuzzyRowFilter, but what is not clear to me is whether
> > > integers must also be normalized or not.
> > > I will explain myself better. Suppose that I follow your advice and I
> > > produce keys like:
> > >  - 1|1|somehash|sometimestamp
> > >  - 55|555|somehash|sometimestamp
> > >
> > > Would they match the same pattern, or do I have to normalize them to
> > > the following?
> > >  - 001|001|somehash|sometimestamp
> > >  - 055|555|somehash|sometimestamp
> > >
> > > Moreover, I noticed that you used dots ('.') to separate things
> > > instead of pipe ('|').. is there a reason for that (maybe performance
> > > or whatever) or is it just your favourite separator?
> > >
> > > Best,
> > > Flavio
> > >
> > >
> > > On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
> > >
> > > > I'm not sure if you're eliding this fact or not, but you'd be much
> > > > better off if you used a fixed-width format for your keys. So in your
> > > > example, you'd have:
> > > >
> > > > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > > > hash.8-byte timestamp
> > > >
> > > > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> > > >
> > > > The advantage of this is not only that it's significantly less data
> > > > (remember your key is stored on each KeyValue), but also you can now
> > > > use FuzzyRowFilter and other techniques to quickly perform scans. The
> > > > disadvantage is that you have to normalize the source-> integer but I
> > > > find I can either store that in an enum or cache it for a long time
> so
> > > > it's not a big issue.
> > > >
> > > > -Mike
> > > >
> > > > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <
> > pompermaier@okkam.it
> > > >
> > > > wrote:
> > > > > Thank you very much for the great support!
> > > > > This is how I thought to design my key:
> > > > >
> > > > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > > > EXAMPLE:
> > > > >
> google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > > > >
> > > > > Do you think my key could be good for my scope (my search will be
> > > > > essentially by source or source|type)?
> > > > > Another point is that initially I will not have so many sources,
> so I
> > > > will
> > > > > probably have only google|* but in the next phases there could be
> > more
> > > > > sources..
> > > > >
> > > > > Best,
> > > > > Flavio
> > > > >
> > > > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >
> > > > >> For #1, yes - the client receives less data after filtering.
> > > > >>
> > > > >> For #2, please take a look at TestMultiVersions
> > > > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in
> > > 0.94)
> > > > >> for time range:
> > > >

Re: Help in designing row key

Posted by Ted Yu <yu...@gmail.com>.
The two-argument Bytes.add() overload calls:

    return add(a, b, HConstants.EMPTY_BYTE_ARRAY);

where a new byte array is allocated:

    byte [] result = new byte[a.length + b.length + c.length];

Meaning your code below would allocate two byte arrays instead of one.

Consider writing a method that accepts 4 byte [] parameters.
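
A minimal sketch of such a helper (a plain static method, not part of the
Bytes API):

public static byte[] concat(byte[] a, byte[] b, byte[] c, byte[] d) {
  // One allocation for all four parts, instead of two with chained add().
  byte[] result = new byte[a.length + b.length + c.length + d.length];
  int off = 0;
  System.arraycopy(a, 0, result, off, a.length); off += a.length;
  System.arraycopy(b, 0, result, off, b.length); off += b.length;
  System.arraycopy(c, 0, result, off, c.length); off += c.length;
  System.arraycopy(d, 0, result, off, d.length);
  return result;
}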

Cheers

On Wed, Jul 3, 2013 at 3:20 AM, Flavio Pompermaier <po...@okkam.it>wrote:

> All my enums produce positive integers so I don't have +/-ve integer
> problems.
> Obviously, if I use fixed-length rowKeys I could take away the separator..
>
> Sorry, but I'm a newbie in this field.. I'm trying to understand how to
> compose my key with the bytes..
> Is the following correct?
>
> final byte[] firstToken = Bytes.toBytes(source);
> final byte[] secondToken = Bytes.toBytes(type);
> final byte[] thirdToken = Bytes.toBytes(qualifier);
> final byte[] fourthToken = Bytes.toBytes(md5ofSomeString);
> byte[] rowKey = Bytes.add(firstToken,secondToken,thirdToken);
> rowKey =  Bytes.add(rowKey,fourthToken);
>
> Best,
> Flavio
>
>
> On Wed, Jul 3, 2013 at 11:58 AM, Anoop John <an...@gmail.com> wrote:
>
> > When you make the RK and convert the int parts into byte[] (use
> > org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
> > for every int.. Be careful about the ordering... When you convert a +ve
> > and a -ve integer into byte[] and do a lexicographical compare (as done
> > in HBase), you will see the -ve number being greater than the +ve.. If
> > you don't have to deal with -ve numbers, no issues :)
> >
> > Well, when all the parts of the RK are of fixed width, will you need
> > any separator??
> >
> > -Anoop-
> >
> > On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <pompermaier@okkam.it
> > >wrote:
> >
> > > Yeah, I was thinking to use a normalization step in order to allow
> > > the use of FuzzyRowFilter, but what is not clear to me is whether
> > > integers must also be normalized or not.
> > > I will explain myself better. Suppose that I follow your advice and I
> > > produce keys like:
> > >  - 1|1|somehash|sometimestamp
> > >  - 55|555|somehash|sometimestamp
> > >
> > > Would they match the same pattern, or do I have to normalize them to
> > > the following?
> > >  - 001|001|somehash|sometimestamp
> > >  - 055|555|somehash|sometimestamp
> > >
> > > Moreover, I noticed that you used dots ('.') to separate things
> > > instead of pipe ('|').. is there a reason for that (maybe performance
> > > or whatever) or is it just your favourite separator?
> > >
> > > Best,
> > > Flavio
> > >
> > >
> > > On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
> > >
> > > > I'm not sure if you're eliding this fact or not, but you'd be much
> > > > better off if you used a fixed-width format for your keys. So in your
> > > > example, you'd have:
> > > >
> > > > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > > > hash.8-byte timestamp
> > > >
> > > > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> > > >
> > > > The advantage of this is not only that it's significantly less data
> > > > (remember your key is stored on each KeyValue), but also you can now
> > > > use FuzzyRowFilter and other techniques to quickly perform scans. The
> > > > disadvantage is that you have to normalize the source-> integer but I
> > > > find I can either store that in an enum or cache it for a long time
> so
> > > > it's not a big issue.
> > > >
> > > > -Mike
> > > >
> > > > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <
> > pompermaier@okkam.it
> > > >
> > > > wrote:
> > > > > Thank you very much for the great support!
> > > > > This is how I thought to design my key:
> > > > >
> > > > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > > > EXAMPLE:
> > > > >
> google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > > > >
> > > > > Do you think my key could be good for my scope (my search will be
> > > > > essentially by source or source|type)?
> > > > > Another point is that initially I will not have so many sources,
> so I
> > > > will
> > > > > probably have only google|* but in the next phases there could be
> > more
> > > > > sources..
> > > > >
> > > > > Best,
> > > > > Flavio
> > > > >
> > > > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >
> > > > >> For #1, yes - the client receives less data after filtering.
> > > > >>
> > > > >> For #2, please take a look at TestMultiVersions
> > > > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in
> > > 0.94)
> > > > >> for time range:
> > > > >>
> > > > >>     scan = new Scan();
> > > > >>
> > > > >>     scan.setTimeRange(1000L, Long.MAX_VALUE);
> > > > >> For row key selection, you need a filter. Take a look at
> > > > >> FuzzyRowFilter.java
> > > > >>
> > > > >> Cheers
> > > > >>
> > > > >> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <
> > > > pompermaier@okkam.it
> > > > >> >wrote:
> > > > >>
> > > > >> >  Thanks for the reply! I thus have two questions more:
> > > > >> >
> > > > >> > 1) is it true that filtering on timestamps doesn't affect
> > > > performance..?
> > > > >> > 2) could you send me a little snippet of how you would do such a
> > > > filter
> > > > >> (by
> > > > >> > row key + timestamps)? For example get all rows whose key starts
> > > with
> > > > >> > 'someid-' and whose timestamps is greater than some timestamp?
> > > > >> >
> > > > >> > Best,
> > > > >> > Flavio
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > >> >
> > > > >> > > bq. Using timestamp in row-keys is discouraged
> > > > >> > >
> > > > >> > > The above is true.
> > > > >> > > Prefixing row key with timestamp would create hot region.
> > > > >> > >
> > > > >> > > bq. should I filter by a simpler row-key plus a filter on
> > > timestamp?
> > > > >> > >
> > > > >> > > You can do the above.
> > > > >> > >
> > > > >> > > On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <
> > > > >> pompermaier@okkam.it
> > > > >> > > >wrote:
> > > > >> > >
> > > > >> > > > Hi to everybody,
> > > > >> > > >
> > > > >> > > > in my use case I have to perform batch analysis skipping old
> > > data.
> > > > >> > > > For example, I want to process all rows created after a
> > certain
> > > > >> > > timestamp,
> > > > >> > > > passed as parameter.
> > > > >> > > >
> > > > >> > > > What is the most effective way to do this?
> > > > >> > > > Should I design my row-key to embed timestamp?
> > > > >> > > > Or just filtering by timestamp of the row is fast as well?
> Or
> > > what
> > > > >> > else?
> > > > >> > > >
> > > > >> > > > Initially I was thinking to compose my key as:
> > > > >> > > > timestamp|source|title|type
> > > > >> > > >
> > > > >> > > > but:
> > > > >> > > >
> > > > >> > > > 1) Using timestamp in row-keys is discouraged
> > > > >> > > > 2) If this design is ok, using this approach I still have
> > > problems
> > > > >> > > > filtering by timestamp because I cannot find a way to
> > > > >> > > > numerically filter
> > > > >> > > > (instead of alphanumerically/by string). Example:
> > > > >> > > > 1372776400441|something has timestamp lesser
> > > > >> > > > than 1372778470913|somethingelse but I cannot filter all row
> > > whose
> > > > >> key
> > > > >> > is
> > > > >> > > > "numerically" greater than 1372776400441. Is it possible to
> > > > overcome
> > > > >> > this
> > > > >> > > > issue?
> > > > >> > > > 3) If this design is not ok, should I filter by a simpler
> > > row-key
> > > > >> plus
> > > > >> > a
> > > > >> > > > filter on timestamp? Or what else?
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Flavio
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>

Re: Help in designing row key

Posted by Flavio Pompermaier <po...@okkam.it>.
All my enums produce positive integers so I don't have +/-ve integer
problems.
Obviously, if I use fixed-length rowKeys I could take away the separator..

Sorry, but I'm a newbie in this field.. I'm trying to understand how to
compose my key with the bytes..
Is the following correct?

final byte[] firstToken = Bytes.toBytes(source);
final byte[] secondToken = Bytes.toBytes(type);
final byte[] thirdToken = Bytes.toBytes(qualifier);
final byte[] fourthToken = Bytes.toBytes(md5ofSomeString);
byte[] rowKey = Bytes.add(firstToken,secondToken,thirdToken);
rowKey =  Bytes.add(rowKey,fourthToken);

Best,
Flavio


On Wed, Jul 3, 2013 at 11:58 AM, Anoop John <an...@gmail.com> wrote:

> When you make the RK and convert the int parts into byte[] (use
> org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
> for every int.. Be careful about the ordering... When you convert a +ve
> and a -ve integer into byte[] and do a lexicographical compare (as done
> in HBase), you will see the -ve number being greater than the +ve.. If
> you don't have to deal with -ve numbers, no issues :)
>
> Well, when all the parts of the RK are of fixed width, will you need
> any separator??
>
> -Anoop-
>
> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <pompermaier@okkam.it
> >wrote:
>
> > Yeah, I was thinking to use a normalization step in order to allow
> > the use of FuzzyRowFilter, but what is not clear to me is whether
> > integers must also be normalized or not.
> > I will explain myself better. Suppose that I follow your advice and I
> > produce keys like:
> >  - 1|1|somehash|sometimestamp
> >  - 55|555|somehash|sometimestamp
> >
> > Would they match the same pattern, or do I have to normalize them to
> > the following?
> >  - 001|001|somehash|sometimestamp
> >  - 055|555|somehash|sometimestamp
> >
> > Moreover, I noticed that you used dots ('.') to separate things
> > instead of pipe ('|').. is there a reason for that (maybe performance
> > or whatever) or is it just your favourite separator?
> >
> > Best,
> > Flavio
> >
> >
> > On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
> >
> > > I'm not sure if you're eliding this fact or not, but you'd be much
> > > better off if you used a fixed-width format for your keys. So in your
> > > example, you'd have:
> > >
> > > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > > hash.8-byte timestamp
> > >
> > > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> > >
> > > The advantage of this is not only that it's significantly less data
> > > (remember your key is stored on each KeyValue), but also you can now
> > > use FuzzyRowFilter and other techniques to quickly perform scans. The
> > > disadvantage is that you have to normalize the source-> integer but I
> > > find I can either store that in an enum or cache it for a long time so
> > > it's not a big issue.
> > >
> > > -Mike
> > >
> > > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <
> pompermaier@okkam.it
> > >
> > > wrote:
> > > > Thank you very much for the great support!
> > > > This is how I thought to design my key:
> > > >
> > > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > > EXAMPLE:
> > > > google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > > >
> > > > Do you think my key could be good for my scope (my search will be
> > > > essentially by source or source|type)?
> > > > Another point is that initially I will not have so many sources, so I
> > > will
> > > > probably have only google|* but in the next phases there could be
> more
> > > > sources..
> > > >
> > > > Best,
> > > > Flavio
> > > >
> > > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > >> For #1, yes - the client receives less data after filtering.
> > > >>
> > > >> For #2, please take a look at TestMultiVersions
> > > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in
> > 0.94)
> > > >> for time range:
> > > >>
> > > >>     scan = new Scan();
> > > >>
> > > >>     scan.setTimeRange(1000L, Long.MAX_VALUE);
> > > >> For row key selection, you need a filter. Take a look at
> > > >> FuzzyRowFilter.java
> > > >>
> > > >> Cheers
> > > >>
> > > >> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <
> > > pompermaier@okkam.it
> > > >> >wrote:
> > > >>
> > > >> >  Thanks for the reply! I thus have two questions more:
> > > >> >
> > > >> > 1) is it true that filtering on timestamps doesn't affect
> > > performance..?
> > > >> > 2) could you send me a little snippet of how you would do such a
> > > filter
> > > >> (by
> > > >> > row key + timestamps)? For example get all rows whose key starts
> > with
> > > >> > 'someid-' and whose timestamps is greater than some timestamp?
> > > >> >
> > > >> > Best,
> > > >> > Flavio
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > >> >
> > > >> > > bq. Using timestamp in row-keys is discouraged
> > > >> > >
> > > >> > > The above is true.
> > > >> > > Prefixing row key with timestamp would create hot region.
> > > >> > >
> > > >> > > bq. should I filter by a simpler row-key plus a filter on
> > timestamp?
> > > >> > >
> > > >> > > You can do the above.
> > > >> > >
> > > >> > > On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <
> > > >> pompermaier@okkam.it
> > > >> > > >wrote:
> > > >> > >
> > > >> > > > Hi to everybody,
> > > >> > > >
> > > >> > > > in my use case I have to perform batch analysis skipping old
> > data.
> > > >> > > > For example, I want to process all rows created after a
> certain
> > > >> > > timestamp,
> > > >> > > > passed as parameter.
> > > >> > > >
> > > >> > > > What is the most effective way to do this?
> > > >> > > > Should I design my row-key to embed timestamp?
> > > >> > > > Or just filtering by timestamp of the row is fast as well? Or
> > what
> > > >> > else?
> > > >> > > >
> > > >> > > > Initially I was thinking to compose my key as:
> > > >> > > > timestamp|source|title|type
> > > >> > > >
> > > >> > > > but:
> > > >> > > >
> > > >> > > > 1) Using timestamp in row-keys is discouraged
> > > >> > > > 2) If this design is ok, using this approach I still have
> > problems
> > > >> > > > filtering by timestamp because I cannot find a way to
> > > >> > > > numerically filter
> > > >> > > > (instead of alphanumerically/by string). Example:
> > > >> > > > 1372776400441|something has timestamp lesser
> > > >> > > > than 1372778470913|somethingelse but I cannot filter all row
> > whose
> > > >> key
> > > >> > is
> > > >> > > > "numerically" greater than 1372776400441. Is it possible to
> > > overcome
> > > >> > this
> > > >> > > > issue?
> > > >> > > > 3) If this design is not ok, should I filter by a simpler
> > row-key
> > > >> plus
> > > >> > a
> > > >> > > > filter on timestamp? Or what else?
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Flavio
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>

Re: Help in designing row key

Posted by James Taylor <jt...@salesforce.com>.
Sure, but FYI Phoenix is not just faster, but much easier as well (as 
this email chain shows).

On 07/03/2013 04:25 AM, Flavio Pompermaier wrote:
> No, I've never seen Phoenix, but it looks like a very useful project!
> However, I don't have such strict performance requirements in my use case;
> I just want regions that are as balanced as possible.
> So I think that in this case I will still use Bytes concatenation, if
> someone can confirm I'm doing it the right way.
>
>
> On Wed, Jul 3, 2013 at 12:33 PM, James Taylor <jt...@salesforce.com>wrote:
>
>> Hi Flavio,
>> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)?
>> It will allow you to model your multi-part row key like this:
>>
>> CREATE TABLE flavio.analytics (
>>      source INTEGER,
>>      type INTEGER,
>>      qual VARCHAR,
>>      hash VARCHAR,
>>      ts DATE
>>      CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts) // Defines
>> columns that make up the row key
>> )
>>
>> Then you can issue SQL queries like this (to query for the last 7 days
>> worth of data):
>> SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN (55,66)
>> AND ts > CURRENT_DATE() - 7
>>
>> This will internally take advantage of our SkipScan
>> (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
>> to jump through your key space similar to FuzzyRowFilter, but in parallel
>> from the client taking into account your region boundaries.
>>
>> Or do more complex GROUP BY queries like this (to aggregate over the last
>> 30 days worth of data, bucketized by day):
>> SELECT type,COUNT(*) FROM flavio.analytics WHERE ts > CURRENT_DATE() - 30
>> GROUP BY type,TRUNCATE(ts,'DAY')
>>
>> No need to worry about lexicographical sort order, flipping sign bits,
>> normalizing/padding integer values, and all the other nuances of working
>> with an API that works at the level of bytes. No need to write and manage
>> installation of your own coprocessors to make aggregation efficient,
>> perform topN queries, etc.
>>
>> HTH.
>>
>> Regards,
>> James
>> @JamesPlusPlus
>>
>>
>> On 07/03/2013 02:58 AM, Anoop John wrote:
>>
>>> When you make the RK and convert the int parts into byte[] (use
>>> org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
>>> for every int.. Be careful about the ordering... When you convert a +ve
>>> and a -ve integer into byte[] and do a lexicographical compare (as done
>>> in HBase), you will see the -ve number being greater than the +ve.. If
>>> you don't have to deal with -ve numbers, no issues :)
>>>
>>> Well, when all the parts of the RK are of fixed width, will you need
>>> any separator??
>>>
>>> -Anoop-
>>>
>>> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <pompermaier@okkam.it
>>>> wrote:
>>>> Yeah, I was thinking to use a normalization step in order to allow
>>>> the use of FuzzyRowFilter, but what is not clear to me is whether
>>>> integers must also be normalized or not.
>>>> I will explain myself better. Suppose that I follow your advice and I
>>>> produce keys like:
>>>>  - 1|1|somehash|sometimestamp
>>>>  - 55|555|somehash|sometimestamp
>>>>
>>>> Would they match the same pattern, or do I have to normalize them to
>>>> the following?
>>>>  - 001|001|somehash|sometimestamp
>>>>  - 055|555|somehash|sometimestamp
>>>>
>>>> Moreover, I noticed that you used dots ('.') to separate things
>>>> instead of pipe ('|').. is there a reason for that (maybe performance
>>>> or whatever) or is it just your favourite separator?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>>
>>>> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
>>>>
>>>>   I'm not sure if you're eliding this fact or not, but you'd be much
>>>>> better off if you used a fixed-width format for your keys. So in your
>>>>> example, you'd have:
>>>>>
>>>>> PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
>>>>> hash.8-byte timestamp
>>>>>
>>>>> Example: \x00\x00\x00\x01\x00\x00\x02\x03....
>>>>>
>>>>> The advantage of this is not only that it's significantly less data
>>>>> (remember your key is stored on each KeyValue), but also you can now
>>>>> use FuzzyRowFilter and other techniques to quickly perform scans. The
>>>>> disadvantage is that you have to normalize the source-> integer but I
>>>>> find I can either store that in an enum or cache it for a long time so
>>>>> it's not a big issue.
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it
>>>>>
>>>>> wrote:
>>>>>
>>>>>> Thank you very much for the great support!
>>>>>> This is how I thought to design my key:
>>>>>>
>>>>>> PATTERN: source|type|qualifier|hash(name)|timestamp
>>>>>> EXAMPLE:
>>>>>> google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
>>>>>>
>>>>>> Do you think my key could be good for my scope (my search will be
>>>>>> essentially by source or source|type)?
>>>>>> Another point is that initially I will not have so many sources, so I
>>>>>>
>>>>> will
>>>>>
>>>>>> probably have only google|* but in the next phases there could be more
>>>>>> sources..
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>> On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>
>>>>>>   For #1, yes - the client receives less data after filtering.
>>>>>>> For #2, please take a look at TestMultiVersions
>>>>>>> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java
>>>>>>> in
>>>>>>>
>>>>>> 0.94)
>>>>> for time range:
>>>>>>>       scan = new Scan();
>>>>>>>
>>>>>>>       scan.setTimeRange(1000L, Long.MAX_VALUE);
>>>>>>> For row key selection, you need a filter. Take a look at
>>>>>>> FuzzyRowFilter.java
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <
>>>>>>>
>>>>>> pompermaier@okkam.it
>>>>>> wrote:
>>>>>>>>    Thanks for the reply! I thus have two questions more:
>>>>>>>>
>>>>>>>> 1) is it true that filtering on timestamps doesn't affect
>>>>>>>>
>>>>>>> performance..?
>>>>>> 2) could you send me a little snippet of how you would do such a
>>>>>>> filter
>>>>>> (by
>>>>>>>> row key + timestamps)? For example get all rows whose key starts
>>>>>>>>
>>>>>>> with
>>>>>   'someid-' and whose timestamps is greater than some timestamp?
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>   bq. Using timestamp in row-keys is discouraged
>>>>>>>>> The above is true.
>>>>>>>>> Prefixing row key with timestamp would create hot region.
>>>>>>>>>
>>>>>>>>> bq. should I filter by a simpler row-key plus a filter on
>>>>>>>>>
>>>>>>>> timestamp?
>>>>>   You can do the above.
>>>>>>>>> On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <
>>>>>>>>>
>>>>>>>> pompermaier@okkam.it
>>>>>>>> wrote:
>>>>>>>>>> Hi to everybody,
>>>>>>>>>>
>>>>>>>>>> in my use case I have to perform batch analysis skipping old
>>>>>>>>>>
>>>>>>>>> data.
>>>>>   For example, I want to process all rows created after a certain
>>>>>>>>> timestamp,
>>>>>>>>>
>>>>>>>>>> passed as parameter.
>>>>>>>>>>
>>>>>>>>>> What is the most effective way to do this?
>>>>>>>>>> Should I design my row-key to embed timestamp?
>>>>>>>>>> Or just filtering by timestamp of the row is fast as well? Or
>>>>>>>>>>
>>>>>>>>> what
>>>>>   else?
>>>>>>>>> Initially I was thinking to compose my key as:
>>>>>>>>>> timestamp|source|title|type
>>>>>>>>>>
>>>>>>>>>> but:
>>>>>>>>>>
>>>>>>>>>> 1) Using timestamp in row-keys is discouraged
>>>>>>>>>> 2) If this design is ok, using this approach I still have
>>>>>>>>>>
>>>>>>>>> problems
>>>>>>>>> filtering by timestamp because I cannot find a way to
>>>>>>>>> numerically filter
>>>>>>>>> (instead of alphanumerically/by string). Example:
>>>>>>>>>> 1372776400441|something has timestamp lesser
>>>>>>>>>> than 1372778470913|somethingelse but I cannot filter all row
>>>>>>>>>>
>>>>>>>>> whose
>>>>> key
>>>>>>>> is
>>>>>>>>
>>>>>>>>> "numerically" greater than 1372776400441. Is it possible to
>>>>>>>>> overcome
>>>>>> this
>>>>>>>>> issue?
>>>>>>>>>> 3) If this design is not ok, should I filter by a simpler
>>>>>>>>>>
>>>>>>>>> row-key
>>>>> plus
>>>>>>>> a
>>>>>>>>
>>>>>>>>> filter on timestamp? Or what else?
>>>>>>>>>> Best,
>>>>>>>>>> Flavio
>>>>>>>>>>
>>>>>>>>>>
>
> --
>
> Flavio Pompermaier
> Development Department
> _______________________________________________
> OKKAM Srl - www.okkam.it
>
> Phone: +(39) 0461 283 702
> Fax: +(39) 0461 186 6433
> Email: f.pompermaier@okkam.it
> Headquarters: Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> Registered office: Trento (Italy), via Segantini 23
>


Re: Help in designing row key

Posted by Flavio Pompermaier <po...@okkam.it>.
No, I've never seen Phoenix, but it looks like a very useful project!
However, I don't have such strict performance requirements in my use case;
I just want regions that are as balanced as possible.
So I think that in this case I will still use Bytes concatenation, if
someone can confirm I'm doing it the right way.


On Wed, Jul 3, 2013 at 12:33 PM, James Taylor <jt...@salesforce.com>wrote:

> Hi Flavio,
> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)?
> It will allow you to model your multi-part row key like this:
>
> CREATE TABLE flavio.analytics (
>     source INTEGER,
>     type INTEGER,
>     qual VARCHAR,
>     hash VARCHAR,
>     ts DATE
>     CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts) // Defines
> columns that make up the row key
> )
>
> Then you can issue SQL queries like this (to query for the last 7 days
> worth of data):
> SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN (55,66)
> AND ts > CURRENT_DATE() - 7
>
> This will internally take advantage of our SkipScan
> (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
> to jump through your key space similar to FuzzyRowFilter, but in parallel
> from the client taking into account your region boundaries.
>
> Or do more complex GROUP BY queries like this (to aggregate over the last
> 30 days worth of data, bucketized by day):
> SELECT type,COUNT(*) FROM flavio.analytics WHERE ts > CURRENT_DATE() - 30
> GROUP BY type,TRUNCATE(ts,'DAY')
>
> No need to worry about lexicographical sort order, flipping sign bits,
> normalizing/padding integer values, and all the other nuances of working
> with an API that works at the level of bytes. No need to write and manage
> installation of your own coprocessors to make aggregation efficient,
> perform topN queries, etc.
>
> HTH.
>
> Regards,
> James
> @JamesPlusPlus
>
>
> On 07/03/2013 02:58 AM, Anoop John wrote:
>
>> When you make the RK and convert the int parts into byte[] (use
>> org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
>> for every int.. Be careful about the ordering... When you convert a +ve
>> and a -ve integer into byte[] and do a lexicographical compare (as done
>> in HBase), you will see the -ve number being greater than the +ve.. If
>> you don't have to deal with -ve numbers, no issues :)
>>
>> Well, when all the parts of the RK are of fixed width, will you need
>> any separator??
>>
>> -Anoop-
>>
>> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <pompermaier@okkam.it
>> >wrote:
>>
>>> Yeah, I was thinking to use a normalization step in order to allow
>>> the use of FuzzyRowFilter, but what is not clear to me is whether
>>> integers must also be normalized or not.
>>> I will explain myself better. Suppose that I follow your advice and I
>>> produce keys like:
>>>  - 1|1|somehash|sometimestamp
>>>  - 55|555|somehash|sometimestamp
>>>
>>> Would they match the same pattern, or do I have to normalize them to
>>> the following?
>>>  - 001|001|somehash|sometimestamp
>>>  - 055|555|somehash|sometimestamp
>>>
>>> Moreover, I noticed that you used dots ('.') to separate things
>>> instead of pipe ('|').. is there a reason for that (maybe performance
>>> or whatever) or is it just your favourite separator?
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
>>>
>>>  I'm not sure if you're eliding this fact or not, but you'd be much
>>>> better off if you used a fixed-width format for your keys. So in your
>>>> example, you'd have:
>>>>
>>>> PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
>>>> hash.8-byte timestamp
>>>>
>>>> Example: \x00\x00\x00\x01\x00\x00\x02\x03....
>>>>
>>>> The advantage of this is not only that it's significantly less data
>>>> (remember your key is stored on each KeyValue), but also you can now
>>>> use FuzzyRowFilter and other techniques to quickly perform scans. The
>>>> disadvantage is that you have to normalize the source-> integer but I
>>>> find I can either store that in an enum or cache it for a long time so
>>>> it's not a big issue.
>>>>
>>>> -Mike
>>>>
>>>> On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <
>>>> pompermaier@okkam.it
>>>>
>>>> wrote:
>>>>
>>>>> Thank you very much for the great support!
>>>>> This is how I thought to design my key:
>>>>>
>>>>> PATTERN: source|type|qualifier|hash(name)|timestamp
>>>>> EXAMPLE:
>>>>> google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
>>>>>
>>>>> Do you think my key could be good for my scope (my search will be
>>>>> essentially by source or source|type)?
>>>>> Another point is that initially I will not have so many sources, so I
>>>>>
>>>> will
>>>>
>>>>> probably have only google|* but in the next phases there could be more
>>>>> sources..
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>> On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>
>>>>>  For #1, yes - the client receives less data after filtering.
>>>>>>
>>>>>> For #2, please take a look at TestMultiVersions
>>>>>> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java
>>>>>> in
>>>>>>
>>>>> 0.94)
>>>
>>>> for time range:
>>>>>>
>>>>>>      scan = new Scan();
>>>>>>
>>>>>>      scan.setTimeRange(1000L, Long.MAX_VALUE);
>>>>>> For row key selection, you need a filter. Take a look at
>>>>>> FuzzyRowFilter.java
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <
>>>>>>
>>>>> pompermaier@okkam.it
>>>>
>>>>> wrote:
>>>>>>>   Thanks for the reply! I thus have two questions more:
>>>>>>>
>>>>>>> 1) is it true that filtering on timestamps doesn't affect
>>>>>>>
>>>>>> performance..?
>>>>
>>>>> 2) could you send me a little snippet of how you would do such a
>>>>>>>
>>>>>> filter
>>>>
>>>>> (by
>>>>>>
>>>>>>> row key + timestamps)? For example get all rows whose key starts
>>>>>>>
>>>>>> with
>>>
>>>>  'someid-' and whose timestamps is greater than some timestamp?
>>>>>>>
>>>>>>> Best,
>>>>>>> Flavio
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>
>>>>>>>  bq. Using timestamp in row-keys is discouraged
>>>>>>>>
>>>>>>>> The above is true.
>>>>>>>> Prefixing row key with timestamp would create hot region.
>>>>>>>>
>>>>>>>> bq. should I filter by a simpler row-key plus a filter on
>>>>>>>>
>>>>>>> timestamp?
>>>
>>>>  You can do the above.
>>>>>>>>
>>>>>>>> On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <
>>>>>>>>
>>>>>>> pompermaier@okkam.it
>>>>>>
>>>>>>> wrote:
>>>>>>>>> Hi to everybody,
>>>>>>>>>
>>>>>>>>> in my use case I have to perform batch analysis skipping old
>>>>>>>>>
>>>>>>>> data.
>>>
>>>>  For example, I want to process all rows created after a certain
>>>>>>>>>
>>>>>>>> timestamp,
>>>>>>>>
>>>>>>>>> passed as parameter.
>>>>>>>>>
>>>>>>>>> What is the most effective way to do this?
>>>>>>>>> Should I design my row-key to embed timestamp?
>>>>>>>>> Or just filtering by timestamp of the row is fast as well? Or
>>>>>>>>>
>>>>>>>> what
>>>
>>>>  else?
>>>>>>>
>>>>>>>> Initially I was thinking to compose my key as:
>>>>>>>>> timestamp|source|title|type
>>>>>>>>>
>>>>>>>>> but:
>>>>>>>>>
>>>>>>>>> 1) Using timestamp in row-keys is discouraged
>>>>>>>>> 2) If this design is ok, using this approach I still have
>>>>>>>>>
>>>>>>>> problems
>>>
>>>>>>>>> filtering by timestamp because I cannot find a way to
>>>>>>>>> numerically filter
>>>>>>>
>>>>>>>> (instead of alphanumerically/by string). Example:
>>>>>>>>> 1372776400441|something has timestamp lesser
>>>>>>>>> than 1372778470913|somethingelse but I cannot filter all row
>>>>>>>>>
>>>>>>>> whose
>>>
>>>> key
>>>>>>
>>>>>>> is
>>>>>>>
>>>>>>>> "numerically" greater than 1372776400441. Is it possible to
>>>>>>>>>
>>>>>>>> overcome
>>>>
>>>>> this
>>>>>>>
>>>>>>>> issue?
>>>>>>>>> 3) If this design is not ok, should I filter by a simpler
>>>>>>>>>
>>>>>>>> row-key
>>>
>>>> plus
>>>>>>
>>>>>>> a
>>>>>>>
>>>>>>>> filter on timestamp? Or what else?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Flavio
>>>>>>>>>
>>>>>>>>>
>


-- 

Flavio Pompermaier
Development Department
_______________________________________________
OKKAM Srl - www.okkam.it

Phone: +(39) 0461 283 702
Fax: +(39) 0461 186 6433
Email: f.pompermaier@okkam.it
Headquarters: Trento (Italy), fraz. Villazzano, Salita dei Molini 2
Registered office: Trento (Italy), via Segantini 23


Re: Help in designing row key

Posted by James Taylor <jt...@salesforce.com>.
Hi Flavio,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? 
It will allow you to model your multi-part row key like this:

CREATE TABLE flavio.analytics (
     source INTEGER,
     type INTEGER,
     qual VARCHAR,
     hash VARCHAR,
     ts DATE
     CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts) // Defines 
columns that make up the row key
)

Then you can issue SQL queries like this (to query for the last 7 days 
worth of data):
SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN 
(55,66) AND ts > CURRENT_DATE() - 7

This will internally take advantage of our SkipScan 
(http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html) 
to jump through your key space similar to FuzzyRowFilter, but in 
parallel from the client taking into account your region boundaries.

Or do more complex GROUP BY queries like this (to aggregate over the 
last 30 days worth of data, bucketized by day):
SELECT type,COUNT(*) FROM flavio.analytics WHERE ts > CURRENT_DATE() - 
30 GROUP BY type,TRUNCATE(ts,'DAY')

No need to worry about lexicographical sort order, flipping sign bits, 
normalizing/padding integer values, and all the other nuances of working 
with an API that works at the level of bytes. No need to write and 
manage installation of your own coprocessors to make aggregation 
efficient, perform topN queries, etc.

HTH.

Regards,
James
@JamesPlusPlus
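
For completeness, a hedged sketch of running the first query over JDBC (the
driver class name below matches the forcedotcom-era Phoenix releases, but
check your version; the ZooKeeper quorum host is a placeholder):

Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
PreparedStatement stmt = conn.prepareStatement(
    "SELECT * FROM flavio.analytics WHERE source IN (1,2,5) " +
    "AND type IN (55,66) AND ts > CURRENT_DATE() - 7");
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
  // read columns by name, e.g. rs.getInt("source"), rs.getString("qual")
}
conn.close();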

On 07/03/2013 02:58 AM, Anoop John wrote:
> When you make the RK and convert the int parts into byte[] (use
> org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
> for every int.. Be careful about the ordering... When you convert a +ve
> and a -ve integer into byte[] and do a lexicographical compare (as done
> in HBase), you will see the -ve number being greater than the +ve.. If
> you don't have to deal with -ve numbers, no issues :)
>
> Well, when all the parts of the RK are of fixed width, will you need
> any separator??
>
> -Anoop-
>
> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <po...@okkam.it>wrote:
>
>> Yeah, I was thinking to use a normalization step in order to allow
>> the use of FuzzyRowFilter, but what is not clear to me is whether
>> integers must also be normalized or not.
>> I will explain myself better. Suppose that I follow your advice and I
>> produce keys like:
>>  - 1|1|somehash|sometimestamp
>>  - 55|555|somehash|sometimestamp
>>
>> Would they match the same pattern, or do I have to normalize them to
>> the following?
>>  - 001|001|somehash|sometimestamp
>>  - 055|555|somehash|sometimestamp
>>
>> Moreover, I noticed that you used dots ('.') to separate things
>> instead of pipe ('|').. is there a reason for that (maybe performance
>> or whatever) or is it just your favourite separator?
>>
>> Best,
>> Flavio
>>
>>
>> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
>>
>>> I'm not sure if you're eliding this fact or not, but you'd be much
>>> better off if you used a fixed-width format for your keys. So in your
>>> example, you'd have:
>>>
>>> PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
>>> hash.8-byte timestamp
>>>
>>> Example: \x00\x00\x00\x01\x00\x00\x02\x03....
>>>
>>> The advantage of this is not only that it's significantly less data
>>> (remember your key is stored on each KeyValue), but also you can now
>>> use FuzzyRowFilter and other techniques to quickly perform scans. The
>>> disadvantage is that you have to normalize the source-> integer but I
>>> find I can either store that in an enum or cache it for a long time so
>>> it's not a big issue.
>>>
>>> -Mike
>>>
>>> On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <pompermaier@okkam.it
>>>
>>> wrote:
>>>> Thank you very much for the great support!
>>>> This is how I thought to design my key:
>>>>
>>>> PATTERN: source|type|qualifier|hash(name)|timestamp
>>>> EXAMPLE:
>>>> google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
>>>>
>>>> Do you think my key could be good for my scope (my search will be
>>>> essentially by source or source|type)?
>>>> Another point is that initially I will not have so many sources, so I
>>> will
>>>> probably have only google|* but in the next phases there could be more
>>>> sources..
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> For #1, yes - the client receives less data after filtering.
>>>>>
>>>>> For #2, please take a look at TestMultiVersions
>>>>> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in
>> 0.94)
>>>>> for time range:
>>>>>
>>>>>      scan = new Scan();
>>>>>
>>>>>      scan.setTimeRange(1000L, Long.MAX_VALUE);
>>>>> For row key selection, you need a filter. Take a look at
>>>>> FuzzyRowFilter.java
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <
>>> pompermaier@okkam.it
>>>>>> wrote:
>>>>>>   Thanks for the reply! I thus have two questions more:
>>>>>>
>>>>>> 1) is it true that filtering on timestamps doesn't affect
>>> performance..?
>>>>>> 2) could you send me a little snippet of how you would do such a
>>> filter
>>>>> (by
>>>>>> row key + timestamps)? For example get all rows whose key starts
>> with
>>>>>> 'someid-' and whose timestamps is greater than some timestamp?
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>
>>>>>>> bq. Using timestamp in row-keys is discouraged
>>>>>>>
>>>>>>> The above is true.
>>>>>>> Prefixing row key with timestamp would create hot region.
>>>>>>>
>>>>>>> bq. should I filter by a simpler row-key plus a filter on
>> timestamp?
>>>>>>> You can do the above.
>>>>>>>
>>>>>>> On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it
>>>>>>>> wrote:
>>>>>>>> Hi to everybody,
>>>>>>>>
>>>>>>>> in my use case I have to perform batch analysis skipping old
>> data.
>>>>>>>> For example, I want to process all rows created after a certain
>>>>>>> timestamp,
>>>>>>>> passed as parameter.
>>>>>>>>
>>>>>>>> What is the most effective way to do this?
>>>>>>>> Should I design my row-key to embed timestamp?
>>>>>>>> Or just filtering by timestamp of the row is fast as well? Or
>> what
>>>>>> else?
>>>>>>>> Initially I was thinking to compose my key as:
>>>>>>>> timestamp|source|title|type
>>>>>>>>
>>>>>>>> but:
>>>>>>>>
>>>>>>>> 1) Using timestamp in row-keys is discouraged
>>>>>>>> 2) If this design is ok, using this approach I still have
>> problems
>>>>>>>> filtering by timestamp because I cannot find a way to
>>>>>>>> numerically filter
>>>>>>>> (instead of alphanumerically/by string). Example:
>>>>>>>> 1372776400441|something has timestamp lesser
>>>>>>>> than 1372778470913|somethingelse but I cannot filter all row
>> whose
>>>>> key
>>>>>> is
>>>>>>>> "numerically" greater than 1372776400441. Is it possible to
>>> overcome
>>>>>> this
>>>>>>>> issue?
>>>>>>>> 3) If this design is not ok, should I filter by a simpler
>> row-key
>>>>> plus
>>>>>> a
>>>>>>>> filter on timestamp? Or what else?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>>


Re: Help in designing row key

Posted by Anoop John <an...@gmail.com>.
When you make the RK and convert the int parts into byte[] (use
org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
for every int.. Be careful about the ordering... When you convert a +ve
and a -ve integer into byte[] and do a lexicographical compare (as done in
HBase), you will see the -ve number being greater than the +ve.. If you
don't have to deal with -ve numbers, no issues :)

Well, when all the parts of the RK are of fixed width, will you need any
separator??

-Anoop-
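
To illustrate the ordering caveat, a small standalone sketch (Bytes is
org.apache.hadoop.hbase.util.Bytes; someInt is a placeholder):

byte[] pos = Bytes.toBytes(1);   // 0x00 0x00 0x00 0x01
byte[] neg = Bytes.toBytes(-1);  // 0xFF 0xFF 0xFF 0xFF (two's complement)
// HBase compares row keys lexicographically as unsigned bytes,
// so here -1 sorts after +1:
int cmp = Bytes.compareTo(pos, neg); // negative, i.e. pos < neg
// If negative ints can occur, one common trick is to flip the sign bit
// before writing, so unsigned byte order matches signed int order:
byte[] sortable = Bytes.toBytes(someInt ^ Integer.MIN_VALUE);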

On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <po...@okkam.it>wrote:

> Yeah, I was thinking to use a normalization step in order to allow
> the use of FuzzyRowFilter, but what is not clear to me is whether
> integers must also be normalized or not.
> I will explain myself better. Suppose that I follow your advice and I
> produce keys like:
>  - 1|1|somehash|sometimestamp
>  - 55|555|somehash|sometimestamp
>
> Would they match the same pattern, or do I have to normalize them to
> the following?
>  - 001|001|somehash|sometimestamp
>  - 055|555|somehash|sometimestamp
>
> Moreover, I noticed that you used dots ('.') to separate things
> instead of pipe ('|').. is there a reason for that (maybe performance
> or whatever) or is it just your favourite separator?
>
> Best,
> Flavio
>
>
> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mi...@axiak.net> wrote:
>
> > I'm not sure if you're eliding this fact or not, but you'd be much
> > better off if you used a fixed-width format for your keys. So in your
> > example, you'd have:
> >
> > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > hash.8-byte timestamp
> >
> > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> >
> > The advantage of this is not only that it's significantly less data
> > (remember your key is stored on each KeyValue), but also you can now
> > use FuzzyRowFilter and other techniques to quickly perform scans. The
> > disadvantage is that you have to normalize the source-> integer but I
> > find I can either store that in an enum or cache it for a long time so
> > it's not a big issue.
> >
> > -Mike
> >
> > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <pompermaier@okkam.it
> >
> > wrote:
> > > Thank you very much for the great support!
> > > This is how I thought to design my key:
> > >
> > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > EXAMPLE:
> > > google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > >
> > > Do you think my key could be good for my scope (my search will be
> > > essentially by source or source|type)?
> > > Another point is that initially I will not have so many sources, so I
> > will
> > > probably have only google|* but in the next phases there could be more
> > > sources..
> > >
> > > Best,
> > > Flavio
> > >
> > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > >> For #1, yes - the client receives less data after filtering.
> > >>
> > >> For #2, please take a look at TestMultiVersions
> > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in
> 0.94)
> > >> for time range:
> > >>
> > >>     scan = new Scan();
> > >>
> > >>     scan.setTimeRange(1000L, Long.MAX_VALUE);
> > >> For row key selection, you need a filter. Take a look at
> > >> FuzzyRowFilter.java
> > >>
> > >> Cheers
> > >>
> > >> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <
> > pompermaier@okkam.it
> > >> >wrote:
> > >>
> > >> >  Thanks for the reply! I thus have two questions more:
> > >> >
> > >> > 1) is it true that filtering on timestamps doesn't affect
> > performance..?
> > >> > 2) could you send me a little snippet of how you would do such a
> > filter
> > >> (by
> > >> > row key + timestamps)? For example get all rows whose key starts
> with
> > >> > 'someid-' and whose timestamps is greater than some timestamp?
> > >> >
> > >> > Best,
> > >> > Flavio
> > >> >
> > >> >
> > >> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yu...@gmail.com> wrote:
> > >> >
> > >> > > bq. Using timestamp in row-keys is discouraged
> > >> > >
> > >> > > The above is true.
> > >> > > Prefixing row key with timestamp would create hot region.
> > >> > >
> > >> > > bq. should I filter by a simpler row-key plus a filter on
> timestamp?
> > >> > >
> > >> > > You can do the above.
> > >> > >
> > >> > > On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <
> > >> pompermaier@okkam.it
> > >> > > >wrote:
> > >> > >
> > >> > > > Hi to everybody,
> > >> > > >
> > >> > > > in my use case I have to perform batch analysis skipping old
> data.
> > >> > > > For example, I want to process all rows created after a certain
> > >> > > timestamp,
> > >> > > > passed as parameter.
> > >> > > >
> > >> > > > What is the most effective way to do this?
> > >> > > > Should I design my row-key to embed timestamp?
> > >> > > > Or just filtering by timestamp of the row is fast as well? Or
> what
> > >> > else?
> > >> > > >
> > >> > > > Initially I was thinking to compose my key as:
> > >> > > > timestamp|source|title|type
> > >> > > >
> > >> > > > but:
> > >> > > >
> > >> > > > 1) Using timestamp in row-keys is discouraged
> > >> > > > 2) If this design is ok, using this approach I still have
> problems
> > >> > > > filtering by timestamp because I cannot find a way to
> > >> > > > numerically filter
> > >> > > > (instead of alphanumerically/by string). Example:
> > >> > > > 1372776400441|something has timestamp lesser
> > >> > > > than 1372778470913|somethingelse but I cannot filter all row
> whose
> > >> key
> > >> > is
> > >> > > > "numerically" greater than 1372776400441. Is it possible to
> > overcome
> > >> > this
> > >> > > > issue?
> > >> > > > 3) If this design is not ok, should I filter by a simpler
> row-key
> > >> plus
> > >> > a
> > >> > > > filter on timestamp? Or what else?
> > >> > > >
> > >> > > > Best,
> > >> > > > Flavio
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>

Re: Help in designing row key

Posted by Flavio Pompermaier <po...@okkam.it>.
Yeah, I was thinking of using a normalization step in order to allow the use
of FuzzyRowFilter, but what is not clear to me is whether the integers must
also be normalized.
Let me explain myself better. Suppose that I follow your advice and I
produce keys like:
 - 1|1|somehash|sometimestamp
 - 55|555|somehash|sometimestamp

Would they match the same pattern, or do I have to normalize them to the
following? (A concrete comparison is sketched after the list.)
 - 001|001|somehash|sometimestamp
 - 055|555|somehash|sometimestamp
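
Just to make my doubt concrete, this is the kind of mismatch I mean (a toy
check with the Bytes utility; and, if I understand correctly, FuzzyRowFilter
matches byte positions, so the fields would also need to sit at fixed
offsets):

    import org.apache.hadoop.hbase.util.Bytes;

    // '5' (0x35) sorts after '1' (0x31), so byte order != numeric order:
    boolean unpaddedWrong =
        Bytes.compareTo(Bytes.toBytes("55|"), Bytes.toBytes("100|")) > 0;  // true
    // zero-padding keeps every field at a fixed offset and restores the order:
    boolean padded =
        Bytes.compareTo(Bytes.toBytes("055|"), Bytes.toBytes("100|")) < 0; // true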

Moreover, I noticed that you used dots ('.') to separate things instead of
pipes ('|')... is there a reason for that (maybe performance or whatever) or
is it just your favourite separator?

Best,
Flavio

Re: Help in designing row key

Posted by Mike Axiak <mi...@axiak.net>.
I'm not sure if you're eliding this fact or not, but you'd be much
better off if you used a fixed-width format for your keys. So in your
example, you'd have:

PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
hash.8-byte timestamp

Example: \x00\x00\x00\x01\x00\x00\x02\x03....

The advantage of this is not only that it's significantly less data
(remember your key is stored on each KeyValue), but also you can now
use FuzzyRowFilter and other techniques to quickly perform scans. The
disadvantage is that you have to normalize the source -> integer mapping, but
I find I can either store that in an enum or cache it for a long time, so
it's not a big issue.
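
For instance (rough, untested sketch; buildRowKey and the two int ids are
just placeholders for whatever mapping you keep around):

    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.util.Bytes;

    // 4 + 4 + 16 + 8 = 32-byte fixed-width row key
    byte[] buildRowKey(int sourceId, int typeId, String name, long ts)
        throws Exception {
      byte[] hash = MessageDigest.getInstance("MD5")
                                 .digest(Bytes.toBytes(name)); // 16 bytes
      return Bytes.add(
          Bytes.add(Bytes.toBytes(sourceId), Bytes.toBytes(typeId)), // 4 + 4
          hash,
          Bytes.toBytes(ts));                                        // 8 bytes
    }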

-Mike

Re: Help in designing row key

Posted by Flavio Pompermaier <po...@okkam.it>.
Thank you very much for the great support!
This is how I thought of designing my key:

PATTERN: source|type|qualifier|hash(name)|timestamp
EXAMPLE:
google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753

Do you think this key would work well for my use case (my searches will be
essentially by source or by source|type)?
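
For example, the kind of scan I have in mind would be something like this
(just a sketch; someTimestamp stands for the cutoff parameter I'd pass in):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    byte[] prefix = Bytes.toBytes("google|appliance|");
    Scan scan = new Scan();
    scan.setStartRow(prefix);                  // jump straight to the prefix
    scan.setFilter(new PrefixFilter(prefix));  // stop once rows pass the prefix
    scan.setTimeRange(someTimestamp, Long.MAX_VALUE); // skip old rows
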
Another point: initially I will not have many sources, so I will probably
have only google|*, but in later phases there could be more sources...

Best,
Flavio

Re: Help in designing row key

Posted by Ted Yu <yu...@gmail.com>.
For #1, yes - the client receives less data after filtering.

For #2, please take a look at TestMultiVersions
(./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
for time range:

    scan = new Scan();

    scan.setTimeRange(1000L, Long.MAX_VALUE);
For row key selection, you need a filter. Take a look at FuzzyRowFilter.java
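
A minimal sketch combining the two, assuming a 32-byte fixed-width row key
whose first 8 bytes are known (knownPrefix below is a placeholder for them)
and the remaining 24 can vary:

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Pair;

    Scan scan = new Scan();
    scan.setTimeRange(someTimestamp, Long.MAX_VALUE); // someTimestamp = cutoff

    byte[] pattern = new byte[32];
    System.arraycopy(knownPrefix, 0, pattern, 0, 8);  // fixed bytes up front
    byte[] mask = new byte[32];                       // 0 = byte must match
    Arrays.fill(mask, 8, 32, (byte) 1);               // 1 = byte can be anything
    scan.setFilter(new FuzzyRowFilter(
        Arrays.asList(new Pair<byte[], byte[]>(pattern, mask))));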

Cheers

Re: Help in designing row key

Posted by Flavio Pompermaier <po...@okkam.it>.
Thanks for the reply! I thus have two more questions:

1) Is it true that filtering on timestamps doesn't affect performance?
2) Could you send me a little snippet of how you would do such a filter (by
row key + timestamp)? For example, get all rows whose key starts with
'someid-' and whose timestamp is greater than some given timestamp?

Best,
Flavio

Re: Help in designing row key

Posted by Ted Yu <yu...@gmail.com>.
bq. Using timestamp in row-keys is discouraged

The above is true.
Prefixing the row key with a timestamp would create a hot region: the keys
increase monotonically, so every new write lands in the region that holds the
tail of the key space and a single region server takes all the write load.

bq. should I filter by a simpler row-key plus a filter on timestamp?

You can do the above.