
Posted to dev@spark.apache.org by kant kodali <ka...@gmail.com> on 2020/02/23 23:53:39 UTC

https://spark-project.atlassian.net/browse/SPARK-1153

Hi All,

Is there any chance of fixing this one,
https://spark-project.atlassian.net/browse/SPARK-1153, or perhaps offering a
workaround?

Currently, I have a bunch of events streaming into Kafka across various topics,
and each event is stamped with a UUIDv1, so it is easy to construct edges
using the UUIDs. I am not quite sure how to generate a long-based unique id
without synchronization in a distributed setting. I read this SO post
<https://stackoverflow.com/questions/15184820/how-to-generate-unique-positive-long-using-uuid>,
which shows two ways one may be able to achieve this:

1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE

2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)

However, I am concerned about collisions and would like to know the
probability of collisions for the above two approaches. Any suggestions?
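
For approach 1, a rough estimate is possible with the standard birthday-bound
approximation p ~ 1 - exp(-n(n-1) / (2N)). The sketch below assumes the ids
are uniformly random over N = 2^59 values: UUID.randomUUID() returns a version
4 UUID whose most significant long carries 4 fixed version bits, and the
Long.MAX_VALUE mask clears the sign bit, which leaves 59 random bits. The class
and method names are made up for illustration.

```java
public class CollisionEstimate {

    // Birthday-bound approximation for n ids drawn uniformly at random
    // from a space of 2^randomBits values: p ~ 1 - exp(-n(n-1) / (2N)).
    static double collisionProbability(double n, int randomBits) {
        double space = Math.pow(2.0, randomBits);
        double x = (n * (n - 1.0)) / (2.0 * space);
        // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
        return -Math.expm1(-x);
    }

    public static void main(String[] args) {
        // Assumption: 59 random bits survive in approach 1 (see lead-in).
        int bits = 59;
        long[] counts = {1_000_000L, 100_000_000L, 1_000_000_000L};
        for (long n : counts) {
            System.out.printf("n=%d  p~%.3e%n", n, collisionProbability(n, bits));
        }

        // Approach 2: show which bits of System.nanoTime() the mask keeps.
        System.out.println(Long.toHexString(~9223372036854251520L));
    }
}
```

Under that assumption the estimate is a little under one in a million for a
million ids, a little under 1% for a hundred million, and above 50% for a
billion. The formula does not apply to approach 2, because those ids are not
random: the mask ~9223372036854251520L is 0x800000000007FFFF, so it keeps the
low 19 bits of System.nanoTime() plus its sign bit, and whether two ids
collide depends on how the clocks on different machines line up rather than
on chance alone.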

I ran the Connected Components algorithm using GraphFrames. It runs well
when long-based ids are used, but with strings the performance drops
significantly, as pointed out in the ticket. I understand that the algorithm
depends heavily on hashing integers, but I wonder why not a fixed-length
byte[]? That way we could convert any datatype to a sequence of bytes.
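
To make the fixed-length idea concrete, here is a minimal sketch of packing a
UUID into exactly 16 bytes and reading it back. It only illustrates the
encoding; GraphX vertex ids are Long, so this is not something that plugs into
the existing API. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {

    // Pack the two 64-bit halves of a UUID into a 16-byte array (big-endian).
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Rebuild the UUID from the 16 bytes written by toBytes.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        byte[] encoded = toBytes(original);
        UUID decoded = fromBytes(encoded);
        System.out.println(encoded.length + " bytes, round trip ok: "
                + original.equals(decoded));
    }
}
```

Unlike truncating to a long, this keeps all 128 bits, so no collisions are
introduced by the conversion itself.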

Thanks!

Re: https://spark-project.atlassian.net/browse/SPARK-1153

Posted by kant kodali <ka...@gmail.com>.

Sorry, please ignore this. I accidentally ran it with GraphX instead of
GraphFrames.

I see the code here
https://github.com/graphframes/graphframes/blob/a30adaf53dece8c548d96c895ac330ecb3931451/src/main/scala/org/graphframes/GraphFrame.scala#L539-L555
which indeed generates its own id. That's great!

Thanks
