Posted to dev@trafodion.apache.org by "Liu, Ming (Ming)" <mi...@esgyn.cn> on 2016/02/12 13:15:26 UTC

how the SALT is calculated?

Hi, all,

I want to check the code that calculates the hash value for the _SALT_ column in Trafodion. Could anyone point me to the exact source code, that is, which file and which function does this?
I have tried for a while and cannot find it.

So that I can write a function F such that F(all clustering key columns) => the rowkey of the Trafodion table row.

Thanks,
Ming


Re: how the SALT is calculated?

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Yes, Qifan is correct. I forgot that the hash function must preserve order, and that there is no way to detect a collision while doing the calculation, so my suggestion makes no sense.
Maybe one should NOT use a long VARCHAR as a primary key in the first place :-)

Thanks,
Ming


Re: how the SALT is calculated?

Posted by Qifan Chen <qi...@esgyn.com>.
Hi Eric,

I am sure you know that finding a collision-free hash function for a data set is very hard :-(.

This is the ACM paper that I described briefly in a separate email:
http://dl.acm.org/citation.cfm?id=129623.

Basically, we constructed minimal perfect hash functions (MPHF) for large data sets. Each key is mapped to an integer with no collisions, and the number of integers is exactly the size of the data set. The function itself is just an array of computed bits that is almost minimal in size.

It is trivial to produce an order-preserving hash function with an MPHF.
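For a static key set, the simplest construction of an order-preserving minimal perfect hash is rank mapping: sort the keys once and map each key to its position. The sketch below is only an editorial illustration of that idea, not the compressed bit-array representation from the paper (the class and method names are made up):

```java
import java.util.*;

// Order-preserving minimal perfect hash for a *static* key set:
// each key maps to its rank in sorted order. Minimal (exactly n
// distinct values), perfect (no collisions), order-preserving.
class RankHash {
    private final Map<String, Integer> rank = new HashMap<>();

    RankHash(Collection<String> keys) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(keys));
        for (int i = 0; i < sorted.size(); i++)
            rank.put(sorted.get(i), i);
    }

    // Returns the hash value in [0, n) for a key from the original set.
    public int hash(String key) {
        Integer r = rank.get(key);
        if (r == null)
            throw new IllegalArgumentException("key not in set: " + key);
        return r;
    }
}
```

Unlike the bit-array MPHF in the paper, this lookup table costs O(n) space for the keys themselves; the research contribution is achieving near-minimal space for the function.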

The problem with using the research result commercially is going through
the university.

For the long-key problem at hand, I wonder if you have thought about using compression.

Thanks --Qifan






-- 
Regards, --Qifan

RE: how the SALT is calculated?

Posted by Eric Owhadi <er...@esgyn.com>.
Hi Ming,
I was not considering changing Hash2partfunc, as indeed that would be an incompatible change.
Rather, I think we should add SHA-1 and MD5 SQL scalar extension functions to our library, given the specific restrictions we have on primary keys. Of course people can still write a UDF to do so, but given the usefulness of this optimization, if someone wants to volunteer to add that as a core feature, that would be great.
BTW, I do not know the implications, from an import/export standpoint, of adding cryptographic algorithms to the software.
Eric
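As a sketch of what such a scalar function would compute, Java's standard MessageDigest class already provides SHA-1 and MD5. A hypothetical helper that replaces a long VARCHAR key part with its fixed 20-byte SHA-1 digest could look like this (the class and method names are illustrative, not Trafodion APIs):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: replace a long VARCHAR key part with its fixed-size
// SHA-1 digest (20 bytes), shrinking the rowkey while making collisions
// far less likely than with a 64-bit hash (160 bits vs. 64).
class KeyDigest {
    public static byte[] sha1(String longKeyPart) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return md.digest(longKeyPart.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-1 is required by the JDK spec", e);
        }
    }
}
```

The digest is fixed-width regardless of the input length, which is exactly the property needed to shorten a padded VARCHAR key part.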


Re: how the SALT is calculated?

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Hi, Eric,

I am not trying to change the hash function. Your proposal is really very interesting anyway; it will benefit us in a lot of ways! But the difficulty is as you describe: you have to find a hash function that makes the result unique. Using more than 64 bits is one way, but I suspect that only reduces the chance of collision rather than preventing it (I am not good at math). Maybe, in case of a hash collision, we could just append all the remaining key parts as we do today. Since collisions are very rare, this would dramatically reduce the size of the rowkey in general, which saves space in HFiles and affects performance in various ways.
Maybe you could consider leaving the hash unchanged and just resolving the collision when it happens? This is a non-backward-compatible change, so better to do it sooner rather than later, but I really feel very excited about this idea!

What I am trying to do is different, and is just an experiment: I want to write a simple prototype library that provides NoSQL-style access to Trafodion tables, only for singleton IUD and get operations, and maybe range scans as well. It would be pure Java code, so it can be used by Spark/MapReduce or applications like Storm, bypassing the overhead of the connectivity and SQL layers for somewhat better performance. It seems odd: Trafodion is building SQL on top of Hadoop, so why bypass that and read HBase directly? I don't have a good explanation, but I feel it may have some use cases. One in my mind is to integrate with Spark, with Trafodion as a data source; today the only path is via JdbcRDD, and I think providing yet another interface, if possible, will do no harm. I heard some traditional RDBMSs provide a NoSQL interface as well, though I am not sure what the use case is. So this is just an initial research effort to see feasibility; I have wanted to do this for a long time.

To do that, I have to do all the encoding/decoding in that library and calculate the right HBase rowkey. I cannot find a good way to invoke the current executor C++ code without involving the SQL compiler, so it seems simpler to copy all the logic and rewrite it in Java.
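The rowkey such a library must reproduce is, conceptually, the salt value followed by the key-encoded clustering columns. The fragment below is only a schematic of that layout, with made-up encoding rules (a 2-byte salt, big-endian integers, fixed-width zero-padded strings); the real Trafodion key encoding and hash would have to be ported from the executor sources:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Schematic rowkey builder: [salt][encoded key columns]. The encodings
// here (2-byte salt, big-endian int, fixed-width zero-padded string) are
// placeholders, not Trafodion's actual key encoding.
class RowKeySketch {
    public static byte[] build(int salt, int numPartitions,
                               int intCol, String strCol, int strWidth) {
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + strWidth);
        buf.putShort((short) (salt % numPartitions)); // salt prefix
        buf.putInt(intCol);                           // big-endian int column
        byte[] s = Arrays.copyOf(strCol.getBytes(StandardCharsets.UTF_8),
                                 strWidth);
        buf.put(s);                                   // zero-padded string column
        return buf.array();
    }
}
```

With such a builder in place, singleton gets become plain HBase Get calls on the computed key, which is the whole point of bypassing the SQL layer.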

Thanks,
Ming


RE: how the SALT is calculated?

Posted by Eric Owhadi <er...@esgyn.com>.
Hi Ming,
I am not sure what you are trying to implement, but I am going to guess a use
case:

Sometimes the primary key construct in Trafodion is long and contains strings with a large maximum length.
Given that these keys end up expanded and zero-padded in the HBase key, an optimization could consist of storing a hash of these long strings instead of the strings themselves, especially if we cannot benefit from keyed access.

So for this use, making the hash unique is key. I experimented with this idea using a 64-bit hash (applying hash2partfunc twice to make 64 bits), and when loading a table of 170,000,000 rows I got duplicates (hash collisions). So if your use case is built around the same idea, please consider a hashing function wider than 64 bits. The hash code that is used for partitioning does not care about collisions, since it is just used for partitioning...

Not sure if this helps,
Regards,
Eric
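As a back-of-the-envelope check on that experience: for an ideal uniform b-bit hash, the birthday bound puts the expected number of colliding pairs among n keys at roughly n(n-1)/2^(b+1). For 170,000,000 keys and a truly uniform 64-bit hash this works out to well under one expected collision, so observed duplicates suggest the doubled hash2partfunc value was not behaving like a uniform 64-bit hash. A minimal sketch of the arithmetic:

```java
// Birthday bound: expected number of colliding key pairs among n keys
// under an ideal uniform b-bit hash is approximately n*(n-1) / 2^(b+1).
class BirthdayBound {
    public static double expectedCollidingPairs(double n, int bits) {
        return n * (n - 1.0) / (2.0 * Math.pow(2.0, bits));
    }
}
```

For n = 1.7e8 and 64 bits this is about 8e-4 expected colliding pairs; a 128-bit or wider digest pushes the figure far lower still.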


Re: how the SALT is calculated?

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Thanks QiFan,

Following your hint, I found ExHDPHash::eval() and the corresponding hash() functions; I am trying to understand them.

Thanks,
Ming


Re: how the SALT is calculated?

Posted by Qifan Chen <qi...@esgyn.com>.
Hi Ming,

In Trafodion,

"salt using 8 partitions on a" is equivalent to "hash2partfunc(a for 8)".

"salt using 16 partitions on (a,b)" is equivalent to "hash2partfunc(a,b for
16)".


Thanks --Qifan
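Conceptually, then, the _SALT_ value is just a hash of the salting column values reduced modulo the partition count. The fragment below illustrates only that shape; Objects.hash is a stand-in for Trafodion's actual hash function (ExHDPHash), not a compatible reimplementation:

```java
import java.util.Objects;

// Conceptual stand-in for "salt using N partitions on (a, b)":
// hash the salting column values together, then reduce modulo N.
// Objects.hash is a placeholder; Trafodion's real function is ExHDPHash.
class SaltSketch {
    public static int salt(int numPartitions, Object... saltCols) {
        int h = Objects.hash(saltCols);
        return Math.floorMod(h, numPartitions); // always in [0, numPartitions)
    }
}
```

Whatever the underlying hash, the salt is deterministic in the key values, which is what lets every writer and reader agree on the target partition.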





-- 
Regards, --Qifan