Posted to user@hbase.apache.org by Nurettin Şimşek <nu...@gmail.com> on 2013/02/13 09:35:07 UTC

RowKey design with hashing

Hi All,

In our project, mail addresses are the row key. Which rowkey design should
we choose?

1) com.yahoo@xxxx (Reversed)
2) xxxx@yahoo.com
3) md5 hash(xxxx@yahoo.com)
4) Any other solution.

Many thanks.

-- 
M. Nurettin ŞİMŞEK

Re: RowKey design with hashing

Posted by Nurettin Şimşek <nu...@gmail.com>.

Thanks Jean,

3 can be good for us.

Re: RowKey design with hashing

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

I don't see any issue with #2, and it might be the simplest one. But
it all depends on your read pattern. If you need to scan by domain, 1
is better. If you need to list the emails without knowing them, 2 might
be better. If you only access a row given a specific address, 3 can be
good.

So I would say it all depends on what you want to do with it...
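
As a rough illustration of the three candidate keys, here is a minimal Python sketch. The function names are hypothetical and nothing here is HBase API; it only shows how each key could be derived from an address before it is written as a row key.

```python
import hashlib


def reversed_domain_key(email: str) -> str:
    """Option 1: domain labels reversed and placed first, e.g. com.yahoo@xxxx."""
    local, _, domain = email.rpartition("@")
    return ".".join(reversed(domain.split("."))) + "@" + local


def plain_key(email: str) -> str:
    """Option 2: the address as-is."""
    return email


def md5_key(email: str) -> bytes:
    """Option 3: fixed-width 16-byte MD5 digest of the address."""
    return hashlib.md5(email.encode("utf-8")).digest()


print(reversed_domain_key("xxxx@yahoo.com"))  # com.yahoo@xxxx
print(plain_key("xxxx@yahoo.com"))            # xxxx@yahoo.com
print(len(md5_key("xxxx@yahoo.com")))         # 16
```

With option 3 the original address cannot be recovered from the key, so it would have to be stored in a column if it is ever needed when listing rows.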

2013/2/13, Nurettin Şimşek <nu...@gmail.com>:
> I want to search email adress equality. There are many many domains not
> only yahoo.
>
> What is disadvantages of using hashing?
>

Re: RowKey design with hashing

Posted by Nurettin Şimşek <nu...@gmail.com>.

I want to look up email addresses by exact match. There are many, many domains, not
only yahoo.

What are the disadvantages of using hashing?

Re: RowKey design with hashing

Posted by Amit Sela <am...@infolinks.com>.

If you have a good distribution of domains then use the reversed domain
key, it will allow you to scan over domains faster.
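
To make the domain scan concrete, here is a small Python sketch that simulates a prefix scan over lexicographically sorted row keys. It assumes the reversed-domain layout (com.yahoo@xxxx); `prefix_range` and `scan` are hypothetical helpers, not HBase client calls. The stop row is simply the prefix with its last byte incremented.

```python
import bisect


def prefix_range(prefix: bytes):
    """Return (start, stop) so that start <= key < stop matches keys with this prefix."""
    stop = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if stop:
        stop[-1] += 1
    # An empty stop means "scan to the end of the table".
    return prefix, bytes(stop)


def scan(sorted_keys, start: bytes, stop: bytes):
    """Simulate a range scan over keys kept in byte order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop) if stop else len(sorted_keys)
    return sorted_keys[lo:hi]


keys = sorted([
    b"com.gmail@carol",
    b"com.yahoo@alice",
    b"com.yahoo@bob",
    b"org.apache@dev",
])
start, stop = prefix_range(b"com.yahoo@")
print(scan(keys, start, stop))  # [b'com.yahoo@alice', b'com.yahoo@bob']
```

With plain or hashed keys, rows of the same domain are not adjacent, so the same query would need a full scan with a filter.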

On Wed, Feb 13, 2013 at 10:40 AM, Alexander Ignatov
<ai...@mirantis.com> wrote:

> If you have only one domain 'yahoo.com' for all mail addresses you
> probably can use row keys as 'xxxx' without adding '@yahoo.com'.
>
> --
> Regards,
> Alexander Ignatov
>
>
>
> On 2/13/2013 12:35 PM, Nurettin Şimşek wrote:
>
>> Hi All,
>>
>> In our project mail adresses are row key. Which rowkey design  we should
>> choose?
>>
>> 1) com.yahoo@xxxx (Reversed)
>> 2) xxxx@yahoo.com
>> 3) md5 hash(xxxx@yahoo.com)
>> 4) Any other solution.
>>
>> Many thanks.
>>
>>
>

Re: RowKey design with hashing

Posted by Alexander Ignatov <ai...@mirantis.com>.

If you have only one domain 'yahoo.com' for all mail addresses you 
probably can use row keys as 'xxxx' without adding '@yahoo.com'.

-- 
Regards,
Alexander Ignatov

On 2/13/2013 12:35 PM, Nurettin Şimşek wrote:
> Hi All,
>
> In our project mail adresses are row key. Which rowkey design  we should
> choose?
>
> 1) com.yahoo@xxxx (Reversed)
> 2) xxxx@yahoo.com
> 3) md5 hash(xxxx@yahoo.com)
> 4) Any other solution.
>
> Many thanks.
>

Re: Updating from 0.90.2 to 0.94

Posted by Yusup Ashrap <ap...@gmail.com>.

Hi Viral, thanks for the reply.
The thing is that my table is fairly big, and I have at most one hour of downtime for this upgrade.
That is not enough to distcp 20TB of data to another backup cluster, even if I change the replication factor to 1,
and from my experience it's highly unlikely that distcp will succeed on the first attempt.

Thanks for the offer anyway.
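
For a sense of scale, the arithmetic behind "one hour is not enough" can be sketched in Python. The 20TB and one-hour figures come from this thread; decimal units and the `replication` multiplier on the destination side are assumptions for illustration, and real distcp throughput depends on the cluster and network.

```python
def required_throughput_gb_per_s(data_tb: float, window_hours: float,
                                 replication: int = 1) -> float:
    """Sustained write rate (decimal GB/s) needed to copy data_tb within window_hours."""
    total_bytes = data_tb * 1e12 * replication
    seconds = window_hours * 3600
    return total_bytes / seconds / 1e9


# 20 TB in one hour with a single replica: about 5.56 GB/s sustained.
print(round(required_throughput_gb_per_s(20, 1), 2))
# The same copy written with three replicas: about 16.67 GB/s.
print(round(required_throughput_gb_per_s(20, 1, replication=3), 2))
```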

-- 
Best Regards
Yusup Ashrap

On Tuesday, February 26, 2013 at 4:51 PM, Viral Bajaria wrote:

> Well if you can afford a longer downtime, you can always distcp your
> existing hbase data. This way if things get screwed up you can always
> restore a 0.90.x on that old backup. You cannot distcp while the cluster is
> running since it will not be able to get locks on file (I think I faced
> that issue but not sure since I did this upgrade mid-2012).
> 
> When I did an upgrade, I brought up a test 0.94 cluster using the distcp
> backup (I took like 2-3 copies since I was very paranoid). Once I ran all
> my production jobs against the test instance and I was happy with all the
> results, that's when I re-ran the script to do the migration. It was all
> scripted out but I no longer have access to those scripts or else I would
> have shared it with you.
> 
> Thanks,
> Viral
> 
>

Re: Updating from 0.90.2 to 0.94

Posted by Viral Bajaria <vi...@gmail.com>.

Well if you can afford a longer downtime, you can always distcp your
existing hbase data. This way if things get screwed up you can always
restore a 0.90.x on that old backup. You cannot distcp while the cluster is
running since it will not be able to get locks on file (I think I faced
that issue but not sure since I did this upgrade mid-2012).

When I did an upgrade, I brought up a test 0.94 cluster using the distcp
backup (I took like 2-3 copies since I was very paranoid). Once I ran all
my production jobs against the test instance and I was happy with all the
results, that's when I re-ran the script to do the migration. It was all
scripted out but I no longer have access to those scripts or else I would
have shared it with you.

Thanks,
Viral

On Tue, Feb 26, 2013 at 12:00 AM, Yusup Ashrap <ap...@gmail.com> wrote:

> Hi Kiran , thanks for reply
> From what I've read from online docs , downtime is inevitable for
> upgrading from 0.90.2 to 0.94,
> and I can afford some downtime. I cannot afford dataloss, so I am
> concerning potential problems with rolling back to 0.90.2 if I fail to
> upgrade.
>
>
> --
> Best Regards
> Yusup Ashrap
>
>
> On Tuesday, February 26, 2013 at 3:53 PM, kiran wrote:
>
> > Hi,
> >
> > We also upgraded the version very recently. If you can afford couple of
> > minutes downtime then you can safely bring down the cluster and do
> upgrade.
> > As such, there will be no data loss, but be careful with splits. The
> > default split policy has been changed in this version if I am not wrong.
> It
> > causes some weird things.
> >
> > Thanks
> > Kiran
> >
> >
> > On Tue, Feb 26, 2013 at 1:03 PM, Yusup Ashrap <aphawk@gmail.com (mailto:
> aphawk@gmail.com)> wrote:
> >
> > > hi all,
> > > I am updating production cluster from 0.90.2 to 0.94 . My table's size
> is
> > > about 20TB+ .
> > > Scheduled update includes upgrading both hbase,hadoop version,
> > > and I am also changing user with which I start up both hadoop and hbase
> > > cluster from user "A" to user "B".
> > >
> > > It's production environment , so I wanted know what steps I should not
> > > miss regarding this upgrade.
> > > Table is kinda big and I don't have backup cluster to backup my data.
> > > I wanted to know will there be a data loss scenario if I rollback after
> > > having failed to upgrade. thanks.
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > Yusup Ashrap
> > >
> >
> >
> >
> > --
> > Thank you
> > Kiran Sarvabhotla
> >
> > -----Even a correct decision is wrong when it is taken late
>
>

Re: Updating from 0.90.2 to 0.94

Posted by Stack <st...@duboce.net>.

On Tue, Feb 26, 2013 at 12:00 AM, Yusup Ashrap <ap...@gmail.com> wrote:

> Hi Kiran , thanks for reply
> From what I've read from online docs , downtime is inevitable for
> upgrading from 0.90.2 to 0.94,
>

Yes.

Going from 0.90.x to 0.92, you will need to restart.

You will be able to do a rolling upgrade going from 0.92 to 0.94.



> and I can afford some downtime. I cannot afford dataloss, so I am
> concerning potential problems with rolling back to 0.90.2 if I fail to
> upgrade.
>
>
This is tough.  We do not support rolling back, so as Viral suggests, it
would be an ugly copy of data.  Search the mailing list.  There should be a
description of doing a background distcp and then doing a copy after the
shutdown to copy the differences only (try it first... I don't know of any
locks, but if Viral says he had an issue, I would pay attention).  Ask again if
you have no luck finding the description...

St.Ack

Re: Updating from 0.90.2 to 0.94

Posted by Yusup Ashrap <ap...@gmail.com>.

Hi Kiran, thanks for the reply.
From what I've read in the online docs, downtime is inevitable when upgrading from 0.90.2 to 0.94,
and I can afford some downtime. I cannot afford data loss, so I am concerned about potential problems with rolling back to 0.90.2 if the upgrade fails.


-- 
Best Regards
Yusup Ashrap


On Tuesday, February 26, 2013 at 3:53 PM, kiran wrote:

> Hi,
> 
> We also upgraded the version very recently. If you can afford couple of
> minutes downtime then you can safely bring down the cluster and do upgrade.
> As such, there will be no data loss, but be careful with splits. The
> default split policy has been changed in this version if I am not wrong. It
> causes some weird things.
> 
> Thanks
> Kiran
> 
> 
> On Tue, Feb 26, 2013 at 1:03 PM, Yusup Ashrap <aphawk@gmail.com (mailto:aphawk@gmail.com)> wrote:
> 
> > hi all,
> > I am updating production cluster from 0.90.2 to 0.94 . My table's size is
> > about 20TB+ .
> > Scheduled update includes upgrading both hbase,hadoop version,
> > and I am also changing user with which I start up both hadoop and hbase
> > cluster from user "A" to user "B".
> > 
> > It's production environment , so I wanted know what steps I should not
> > miss regarding this upgrade.
> > Table is kinda big and I don't have backup cluster to backup my data.
> > I wanted to know will there be a data loss scenario if I rollback after
> > having failed to upgrade. thanks.
> > 
> > 
> > 
> > 
> > 
> > --
> > Best Regards
> > Yusup Ashrap
> > 
> 
> 
> 
> -- 
> Thank you
> Kiran Sarvabhotla
> 
> -----Even a correct decision is wrong when it is taken late

Re: Updating from 0.90.2 to 0.94

Posted by kiran <ki...@gmail.com>.

Hi,

We also upgraded very recently. If you can afford a couple of
minutes of downtime, then you can safely bring down the cluster and do the upgrade.
As such, there will be no data loss, but be careful with splits. The
default split policy has changed in this version, if I am not wrong. It
causes some weird things.

Thanks
Kiran

On Tue, Feb 26, 2013 at 1:03 PM, Yusup Ashrap <ap...@gmail.com> wrote:

> hi all,
> I am updating production cluster from 0.90.2 to 0.94 .  My table's size is
> about 20TB+ .
> Scheduled update includes upgrading both hbase,hadoop version,
>  and I am also changing user with which I  start up both hadoop and hbase
> cluster  from user "A" to user "B".
>
> It's production environment , so I wanted know what steps I should not
> miss regarding this upgrade.
> Table is kinda big and I don't have backup cluster to backup my data.
> I wanted to know will there be a data loss scenario if I rollback after
> having failed to upgrade. thanks.
>
>
>
>
>
> --
> Best Regards
> Yusup Ashrap
>
>

-- 
Thank you
Kiran Sarvabhotla

-----Even a correct decision is wrong when it is taken late

Updating from 0.90.2 to 0.94

Posted by Yusup Ashrap <ap...@gmail.com>.

Hi all,
I am upgrading a production cluster from 0.90.2 to 0.94. My table's size is about 20TB+.
The scheduled update includes upgrading both the HBase and Hadoop versions,
and I am also changing the user that starts both the Hadoop and HBase clusters from user "A" to user "B".

It's a production environment, so I want to know which steps I should not miss in this upgrade.
The table is fairly big and I don't have a backup cluster to back up my data.
I want to know whether there is a data loss scenario if I roll back after a failed upgrade. Thanks.





-- 
Best Regards
Yusup Ashrap

Re: RowKey design with hashing

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Ted,

Thanks for pointing me to HBASE-4218. I will take a look at it.

JM

2013/2/13 Ted Yu <yu...@gmail.com>

> My name is Ted, not Lars :-)
>
> On Wed, Feb 13, 2013 at 7:41 PM, Mehmet Simsek <nurettinsimsek@gmail.com
> >wrote:
>
> > Thanks Lars
> >
> > M.Nurettin Şimşek
> >
> > On 14 Şub 2013, at 05:18, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Jean-Marc:
> > > You can find almost all the details you need from this JIRA:
> > > HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding /
> prefix
> > > compression)
> > >
> > > Cheers
> > >
> > > On Wed, Feb 13, 2013 at 6:09 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > >> Hi Lars,
> > >>
> > >> Can you please tell more about key prefix block encoding? Or refer to
> > >> some blog/doc? How it works, what it is, etc.?
> > >>
> > >> Thanks,
> > >>
> > >> JM
> > >>
> > >> 2013/2/13, lars hofhansl <la...@apache.org>:
> > >>> Depends on you search pattern.
> > >>> If you never care about scans ordering i.e. you only do point gets to
> > see
> > >>> whether you've already seen an email address, do the hash part.
> > >>>
> > >>> I'd perfer #1 over #2, because it would let you do efficient key
> prefix
> > >>> block encoding (FAST_DIFF).
> > >>>
> > >>> -- Lars
> > >>>
> > >>>
> > >>>
> > >>> ________________________________
> > >>> From: Nurettin Şimşek <nu...@gmail.com>
> > >>> To: user@hbase.apache.org
> > >>> Sent: Wednesday, February 13, 2013 12:35 AM
> > >>> Subject: RowKey design with hashing
> > >>>
> > >>> Hi All,
> > >>>
> > >>> In our project mail adresses are row key. Which rowkey design  we
> > should
> > >>> choose?
> > >>>
> > >>> 1) com.yahoo@xxxx (Reversed)
> > >>> 2) xxxx@yahoo.com
> > >>> 3) md5 hash(xxxx@yahoo.com)
> > >>> 4) Any other solution.
> > >>>
> > >>> Many thanks.
> > >>>
> > >>> --
> > >>> M. Nurettin ŞİMŞEK
> > >>
> >
>

Re: RowKey design with hashing

Posted by Ted Yu <yu...@gmail.com>.

My name is Ted, not Lars :-)

On Wed, Feb 13, 2013 at 7:41 PM, Mehmet Simsek <nu...@gmail.com> wrote:

> Thanks Lars
>
> M.Nurettin Şimşek
>
> On 14 Şub 2013, at 05:18, Ted Yu <yu...@gmail.com> wrote:
>
> > Jean-Marc:
> > You can find almost all the details you need from this JIRA:
> > HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix
> > compression)
> >
> > Cheers
> >
> > On Wed, Feb 13, 2013 at 6:09 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> >> Hi Lars,
> >>
> >> Can you please tell more about key prefix block encoding? Or refer to
> >> some blog/doc? How it works, what it is, etc.?
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >> 2013/2/13, lars hofhansl <la...@apache.org>:
> >>> Depends on you search pattern.
> >>> If you never care about scans ordering i.e. you only do point gets to
> see
> >>> whether you've already seen an email address, do the hash part.
> >>>
> >>> I'd perfer #1 over #2, because it would let you do efficient key prefix
> >>> block encoding (FAST_DIFF).
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Nurettin Şimşek <nu...@gmail.com>
> >>> To: user@hbase.apache.org
> >>> Sent: Wednesday, February 13, 2013 12:35 AM
> >>> Subject: RowKey design with hashing
> >>>
> >>> Hi All,
> >>>
> >>> In our project mail adresses are row key. Which rowkey design  we
> should
> >>> choose?
> >>>
> >>> 1) com.yahoo@xxxx (Reversed)
> >>> 2) xxxx@yahoo.com
> >>> 3) md5 hash(xxxx@yahoo.com)
> >>> 4) Any other solution.
> >>>
> >>> Many thanks.
> >>>
> >>> --
> >>> M. Nurettin ŞİMŞEK
> >>
>

Re: RowKey design with hashing

Posted by Mehmet Simsek <nu...@gmail.com>.

Thanks Lars

M.Nurettin Şimşek

On 14 Şub 2013, at 05:18, Ted Yu <yu...@gmail.com> wrote:

> Jean-Marc:
> You can find almost all the details you need from this JIRA:
> HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix
> compression)
> 
> Cheers
> 
> On Wed, Feb 13, 2013 at 6:09 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
> 
>> Hi Lars,
>> 
>> Can you please tell more about key prefix block encoding? Or refer to
>> some blog/doc? How it works, what it is, etc.?
>> 
>> Thanks,
>> 
>> JM
>> 
>> 2013/2/13, lars hofhansl <la...@apache.org>:
>>> Depends on you search pattern.
>>> If you never care about scans ordering i.e. you only do point gets to see
>>> whether you've already seen an email address, do the hash part.
>>> 
>>> I'd perfer #1 over #2, because it would let you do efficient key prefix
>>> block encoding (FAST_DIFF).
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Nurettin Şimşek <nu...@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Wednesday, February 13, 2013 12:35 AM
>>> Subject: RowKey design with hashing
>>> 
>>> Hi All,
>>> 
>>> In our project mail adresses are row key. Which rowkey design  we should
>>> choose?
>>> 
>>> 1) com.yahoo@xxxx (Reversed)
>>> 2) xxxx@yahoo.com
>>> 3) md5 hash(xxxx@yahoo.com)
>>> 4) Any other solution.
>>> 
>>> Many thanks.
>>> 
>>> --
>>> M. Nurettin ŞİMŞEK
>>

Re: RowKey design with hashing

Posted by Ted Yu <yu...@gmail.com>.

Jean-Marc:
You can find almost all the details you need from this JIRA:
HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix
compression)

Cheers

On Wed, Feb 13, 2013 at 6:09 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Lars,
>
> Can you please tell more about key prefix block encoding? Or refer to
> some blog/doc? How it works, what it is, etc.?
>
> Thanks,
>
> JM
>
> 2013/2/13, lars hofhansl <la...@apache.org>:
> > Depends on you search pattern.
> > If you never care about scans ordering i.e. you only do point gets to see
> > whether you've already seen an email address, do the hash part.
> >
> > I'd perfer #1 over #2, because it would let you do efficient key prefix
> > block encoding (FAST_DIFF).
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Nurettin Şimşek <nu...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Wednesday, February 13, 2013 12:35 AM
> > Subject: RowKey design with hashing
> >
> > Hi All,
> >
> > In our project mail adresses are row key. Which rowkey design  we should
> > choose?
> >
> > 1) com.yahoo@xxxx (Reversed)
> > 2) xxxx@yahoo.com
> > 3) md5 hash(xxxx@yahoo.com)
> > 4) Any other solution.
> >
> > Many thanks.
> >
> > --
> > M. Nurettin ŞİMŞEK
>

Re: RowKey design with hashing

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Lars,

Can you please tell more about key prefix block encoding? Or refer to
some blog/doc? How it works, what it is, etc.?

Thanks,

JM

2013/2/13, lars hofhansl <la...@apache.org>:
> Depends on you search pattern.
> If you never care about scans ordering i.e. you only do point gets to see
> whether you've already seen an email address, do the hash part.
>
> I'd perfer #1 over #2, because it would let you do efficient key prefix
> block encoding (FAST_DIFF).
>
> -- Lars
>
>
>
> ________________________________
>  From: Nurettin Şimşek <nu...@gmail.com>
> To: user@hbase.apache.org
> Sent: Wednesday, February 13, 2013 12:35 AM
> Subject: RowKey design with hashing
>
> Hi All,
>
> In our project mail adresses are row key. Which rowkey design  we should
> choose?
>
> 1) com.yahoo@xxxx (Reversed)
> 2) xxxx@yahoo.com
> 3) md5 hash(xxxx@yahoo.com)
> 4) Any other solution.
>
> Many thanks.
>
> --
> M. Nurettin ŞİMŞEK

Re: RowKey design with hashing

Posted by lars hofhansl <la...@apache.org>.

Depends on your search pattern.
If you never care about scan ordering, i.e. you only do point gets to see whether you've already seen an email address, do the hash part.

I'd prefer #1 over #2, because it would let you do efficient key prefix block encoding (FAST_DIFF).

-- Lars
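
The intuition behind the prefix-encoding point can be sketched in Python: once keys are sorted, reversed-domain keys share much longer prefixes with their neighbours than plain addresses do. This is only an illustration with made-up sample addresses; it does not model how FAST_DIFF actually encodes KeyValues (see HBASE-4218 for that), and `shared_prefix_lengths` is a hypothetical helper.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_prefix_lengths(keys):
    """Prefix length shared by each key and its predecessor in sorted order."""
    ordered = sorted(keys)
    return [common_prefix_len(a, b) for a, b in zip(ordered, ordered[1:])]


plain = ["alice@yahoo.com", "bob@yahoo.com", "carol@gmail.com", "dave@gmail.com"]
reversed_keys = ["com.yahoo@alice", "com.yahoo@bob", "com.gmail@carol", "com.gmail@dave"]

print(shared_prefix_lengths(plain))          # [0, 0, 0]
print(shared_prefix_lengths(reversed_keys))  # [10, 4, 10]
```

Hashed keys behave like random bytes, so neighbouring keys share almost nothing and prefix encoding has little to work with.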



________________________________
 From: Nurettin Şimşek <nu...@gmail.com>
To: user@hbase.apache.org 
Sent: Wednesday, February 13, 2013 12:35 AM
Subject: RowKey design with hashing
 
Hi All,

In our project mail adresses are row key. Which rowkey design  we should
choose?

1) com.yahoo@xxxx (Reversed)
2) xxxx@yahoo.com
3) md5 hash(xxxx@yahoo.com)
4) Any other solution.

Many thanks.

-- 
M. Nurettin ŞİMŞEK