Posted to user@spark.apache.org by Saurav Sinha <sa...@gmail.com> on 2016/10/17 13:57:59 UTC

Help in generating unique Id in spark row

Hi,

I am in a situation where I want to generate a unique id for each row.

I have used monotonicallyIncreasingId, but it gives increasing values and
starts generating from the beginning again if the job fails.

I have two questions here:

Q1. Does this method give me a unique id even after a failure? I want to
use that id as my Solr id.

Q2. If the answer to the previous question is no, is there a way to generate
a UUID for each row that is unique and does not change?

I ask because I have run into a situation where the UUID changes:


import java.util.UUID
import org.apache.spark.sql.functions.udf
import org.apache.spark.storage.StorageLevel

val idUDF = udf(() => UUID.randomUUID().toString)
val a = rawDataDf.withColumn("alarmUUID", idUDF())
a.persist(StorageLevel.MEMORY_AND_DISK)
rawDataDf.registerTempTable("rawAlarms")

///
/// I do some joins

but further below I do something like (b is a transformation of a):

sqlContext.sql("""Select a.alarmUUID, b.alarmUUID
                      from a right outer join b
                      on a.alarmUUID = b.alarmUUID""")

it gives output like:

+--------------------+--------------------+
|           alarmUUID|           alarmUUID|
+--------------------+--------------------+
|7d33a516-5532-410...|                null|
|                null|2439d6db-16a2-44b...|
+--------------------+--------------------+
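The disjoint UUIDs in the two columns are consistent with the UDF being
re-evaluated: UUID.randomUUID() is non-deterministic, so every recomputation
of the lineage assigns fresh ids. A minimal sketch of the effect in plain
Python, with two evaluations of the same "column" standing in for two
recomputations (the row keys are illustrative, not from the real data):

```python
import uuid

rows = ["alarm-1", "alarm-2", "alarm-3"]

# First evaluation of the "alarmUUID" column (DataFrame a)
a = {r: str(uuid.uuid4()) for r in rows}
# A recomputation of the same lineage runs the UDF again (DataFrame b)
b = {r: str(uuid.uuid4()) for r in rows}

# Joining on alarmUUID then matches nothing: no UUID from the first
# evaluation survives into the second, mirroring the null-filled output
common = set(a.values()) & set(b.values())
print(len(common))  # 0
```

Persisting the DataFrame helps only as long as no cached partition is lost;
a lost partition is rebuilt from the lineage and gets brand-new UUIDs.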



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062

Re: Help in generating unique Id in spark row

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
There is a way: you can use
org.apache.spark.sql.functions.monotonicallyIncreasingId; it will give each
row of your dataframe a unique id.
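For reference, the Spark docs describe the generated id as the partition ID
in the upper 31 bits and the per-partition record number in the lower 33
bits, which is why it is unique within a run but not stable across reruns.
A rough Python model of that layout (not Spark's actual code):

```python
def monotonically_increasing_id(partition_id: int, record_number: int) -> int:
    # Upper 31 bits: partition ID; lower 33 bits: record number in partition
    return (partition_id << 33) | record_number

# Ids are unique and increasing within a run, but they depend entirely on
# how rows land in partitions -- rerun the job after a failure with a
# different partitioning and the same row can get a different id
ids = [monotonically_increasing_id(p, i) for p in range(4) for i in range(1000)]
assert len(ids) == len(set(ids))
```

So this answers uniqueness within one job, but not the original question
about ids that survive a failure and rerun.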
 






Olivier Girardot | Associé
o.girardot@lateral-thoughts.com
+33 6 24 09 17 94

Re: Help in generating unique Id in spark row

Posted by ayan guha <gu...@gmail.com>.
Do you have any primary key or unique identifier in your data? Even if
multiple columns would make a composite key? In other words, can your data
have two otherwise identical rows with different unique ids? Also, does the
id have to be numeric?

You may want to pursue a hashing algorithm such as SHA to convert single or
composite unique columns into an id.
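A sketch of that idea in plain Python, hashing a hypothetical composite key
with SHA-256 (the key columns here are made up): a hash of stable key
columns is deterministic, so a recomputed lineage reproduces identical ids,
unlike a random UUID.

```python
import hashlib

def row_id(*key_cols) -> str:
    # Same key columns always hash to the same id, so the id survives
    # recomputation and reruns of the job
    payload = "|".join(str(c) for c in key_cols)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Deterministic: two independent computations agree
assert row_id("device-42", "2016-10-17T13:57:59") == \
       row_id("device-42", "2016-10-17T13:57:59")
```

In Spark SQL the same thing can be expressed with the built-in functions
sha2 and concat_ws, e.g. sha2(concat_ws("|", keyCol1, keyCol2), 256).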

Re: Help in generating unique Id in spark row

Posted by Saurav Sinha <sa...@gmail.com>.
Can anyone help me out?



