Posted to user@spark.apache.org by Kevin Tran <ke...@gmail.com> on 2016/09/04 11:43:41 UTC

Best ID Generator for ID field in parquet ?

Hi everyone,
In your opinion, what is the best ID generator for an ID field in Parquet?

UUID.randomUUID();
AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
AtomicLong counter = new AtomicLong(0);
....

Thanks,
Kevin.


----
https://issues.apache.org/jira/browse/SPARK-8406 (Race condition when
writing Parquet files)
https://github.com/apache/spark/pull/6864/files

Re: Best ID Generator for ID field in parquet ?

Posted by Mike Metzger <mi...@flexiblecreations.com>.
Hi Kevin -

   There isn't really a race condition, because the 64-bit value is split
into a 31-bit partition id (the upper portion) and a 33-bit incrementing id.
As long as each partition contains fewer than about 8 billion (2^33)
entries, ids cannot overlap, and executors never need to communicate with
each other to get the next id.
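
To make the split concrete, here is a minimal plain-Java sketch of the layout described above (partition id in the upper bits, record number in the lower 33 bits). The class and method names are illustrative only and are not part of Spark; the sketch mirrors the description in this message rather than Spark's source.

```java
// Sketch of the 64-bit layout described above: partition id in the upper
// portion, per-partition record number in the lower 33 bits.
// MonotonicIdLayout and its methods are hypothetical names, not Spark APIs.
final class MonotonicIdLayout {
    static final int RECORD_BITS = 33;
    static final long RECORD_MASK = (1L << RECORD_BITS) - 1;

    // Compose an id from a partition index and a record number within it.
    static long compose(int partition, long record) {
        return ((long) partition << RECORD_BITS) | (record & RECORD_MASK);
    }

    // Recover the partition index from an id.
    static int partitionOf(long id) {
        return (int) (id >>> RECORD_BITS);
    }

    // Recover the record number within the partition from an id.
    static long recordOf(long id) {
        return id & RECORD_MASK;
    }

    public static void main(String[] args) {
        long firstOfPartition0 = compose(0, 0);
        long firstOfPartition1 = compose(1, 0);
        System.out.println(firstOfPartition0); // 0
        System.out.println(firstOfPartition1); // 8589934592, i.e. 2^33
        System.out.println(partitionOf(firstOfPartition1) + " / " + recordOf(firstOfPartition1)); // 1 / 0
    }
}
```

Because each partition writes only into its own id range, two partitions can number their rows independently, which is why the ids jump by 2^33 between partitions.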

Depending on what you mean by duplication, there shouldn't be any within a
column as long as you maintain some sort of state (e.g. the startval Mich
shows, a previous max id, etc.). While these ids are unique in that sense,
they are not the same as a UUID / GUID, which is generally unique across
all entries assuming enough randomness. Think of the monotonically
increasing id as an auto-incrementing column from a relational database,
with potentially massive gaps between ids.
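
As a rough illustration of "maintaining state", the plain-Java sketch below simulates two loads whose ids follow the layout described earlier in this message, with the second load starting just after the first load's maximum id. All names (StartValSketch, rawId, idsForLoad) are hypothetical, and the partition and row counts are made up for the example. Note that every offset added this way eats into the 64-bit range.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch: two loads whose ids stay disjoint because the second
// load is offset by the previous maximum id plus one.
final class StartValSketch {
    // Partition index in the upper bits, record number in the lower 33 bits.
    static long rawId(int partition, long record) {
        return ((long) partition << 33) | record;
    }

    // Ids a load would receive: rawId + startVal for every row of every partition.
    static List<Long> idsForLoad(long startVal, int partitions, int rowsPerPartition) {
        List<Long> ids = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            for (long r = 0; r < rowsPerPartition; r++) {
                ids.add(rawId(p, r) + startVal);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Long> first = idsForLoad(0L, 2, 3);
        long previousMax = Collections.max(first);
        List<Long> second = idsForLoad(previousMax + 1, 2, 3);

        List<Long> all = new ArrayList<>(first);
        all.addAll(second);
        // Equal sizes mean no id was issued twice across the two loads.
        System.out.println(new HashSet<>(all).size() == all.size()); // true
    }
}
```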

Thanks

Mike


On Sun, Sep 4, 2016 at 6:41 PM, Kevin Tran <ke...@gmail.com> wrote:

> Hi Mich,
> Thank you for your input.
> Is monotonically_increasing_id safe from race conditions, or can it
> produce duplicate ids at some point with multiple threads, multiple
> instances, and so on?
>
> Can even System.currentTimeMillis() still produce duplicates?
>
> Cheers,
> Kevin.

Re: Best ID Generator for ID field in parquet ?

Posted by Kevin Tran <ke...@gmail.com>.
Hi Mich,
Thank you for your input.
Is monotonically_increasing_id safe from race conditions, or can it
produce duplicate ids at some point with multiple threads, multiple
instances, and so on?

Can even System.currentTimeMillis() still produce duplicates?

Cheers,
Kevin.
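
On the System.currentTimeMillis() question, a small plain-Java sketch can show the behaviour of the three candidates from the original post. The clock has millisecond resolution, so back-to-back calls within the same millisecond return the same value. An AtomicLong counter is unique only within a single JVM, and random UUIDs are unique only with very high probability. The class and method names below are hypothetical and the call count is arbitrary.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch comparing how many distinct values each candidate
// produces over n consecutive calls on one thread.
final class IdCandidates {
    // Distinct values among n back-to-back System.currentTimeMillis() calls.
    static int distinctTimestamps(int n) {
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(System.currentTimeMillis());
        }
        return seen.size();
    }

    // Distinct values among n increments of one AtomicLong (single JVM only).
    static int distinctCounterValues(int n) {
        AtomicLong counter = new AtomicLong(0);
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(counter.incrementAndGet());
        }
        return seen.size();
    }

    // Distinct values among n random (version 4) UUIDs.
    static int distinctUuids(int n) {
        Set<UUID> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(UUID.randomUUID());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("timestamps: " + distinctTimestamps(n) + " distinct of " + n);
        System.out.println("counter:    " + distinctCounterValues(n) + " distinct of " + n);
        System.out.println("uuids:      " + distinctUuids(n) + " distinct of " + n);
    }
}
```

The timestamp count comes out far below n on any reasonably fast machine, which is why a bare timestamp is a poor id on its own.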

On Mon, Sep 5, 2016 at 12:30 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> You can create a monotonically incrementing ID column on your table

Re: Best ID Generator for ID field in parquet ?

Posted by Mich Talebzadeh <mi...@gmail.com>.
You can create a monotonically incrementing ID column on your table

scala> val ll_18740868 = spark.table("accounts.ll_18740868")
scala> val startval = 1
scala> // keep the DataFrame in df; calling show() inside the assignment would make df a Unit
scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id() + startval)
scala> df.show(2)
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
|transactiondate|transactiontype| sortcode|accountnumber|transactiondescription|debitamount|creditamount|balance| id|
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
|     2011-12-30|            DEB|'30-64-72|     18740868|  WWW.GFT.COM CD 4628 |       50.0|        null| 304.89|  1|
|     2011-12-30|            DEB|'30-64-72|     18740868|  TDA.CONFECC.D.FRE...|      19.01|        null| 354.89|  2|
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+


Now you have a new ID column

HTH






Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


