Posted to user@spark.apache.org by Everett Anderson <ev...@nuna.com.INVALID> on 2017/04/07 22:56:14 UTC

Assigning a unique row ID

Hi,

What's the best way to assign a truly unique row ID (rather than a hash) to
a DataFrame/Dataset?

I originally thought that functions.monotonically_increasing_id would do
this, but it seems to have a rather unfortunate property: if you add it as
a column to table A, then derive tables X, Y, and Z from A and save those,
the row ID values in X, Y, and Z may end up different for the same rows. I
assume this is because the column is computed lazily, so it is recomputed
independently each time one of those derived tables is materialized.
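
To make it concrete, here's roughly what I'm doing (a spark-shell style
sketch; the column names and paths are made up):

    import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

    // Table A with what I hoped would be a stable row ID.
    val a = spark.read.parquet("/data/table_a")
      .withColumn("row_id", monotonically_increasing_id())

    val x = a.filter(col("value") > 0)      // derived table X
    val y = a.select("row_id", "category")  // derived table Y

    // Each write re-runs the plan for A, so the same underlying row
    // can end up with a different row_id in X than in Y.
    x.write.parquet("/data/table_x")
    y.write.parquet("/data/table_y")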

Re: Assigning a unique row ID

Posted by Ankur Srivastava <an...@gmail.com>.
You can use zipWithIndex, or the approach Tim suggested, or even the one
you are using, but I believe the issue is that table A is being
re-materialized every time you run the new transformations. Are you
caching/persisting table A? If you do, you should not see this behavior.
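
For example, the zipWithIndex route would look roughly like this (an
untested sketch; tableA is your DataFrame and spark your SparkSession):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Append a dense, stable id by zipping the underlying RDD.
    val schema = StructType(tableA.schema.fields :+
      StructField("row_id", LongType, nullable = false))
    val withId = spark.createDataFrame(
      tableA.rdd.zipWithIndex().map { case (row, id) =>
        Row.fromSeq(row.toSeq :+ id)
      },
      schema)
    withId.cache()  // materialize once so derived tables see the same ids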

Thanks
Ankur

On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <se...@gmail.com> wrote:

> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator

Re: Assigning a unique row ID

Posted by Everett Anderson <ev...@nuna.com.INVALID>.
Indeed, I tried persist with MEMORY_AND_DISK and it works! (I'm wary of
MEMORY_ONLY for this, since Spark could recompute any partitions that
don't fit in memory, which would reassign the IDs.)
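
For the record, the change amounted to something like this (withIds
standing in for table A after the id column is added):

    import org.apache.spark.storage.StorageLevel

    val tableA = withIds.persist(StorageLevel.MEMORY_AND_DISK)
    tableA.count()  // materialize before deriving and saving X, Y, Z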

Thanks for the help, everybody!!


Re: Assigning a unique row ID

Posted by Everett Anderson <ev...@nuna.com.INVALID>.
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram <su...@gmail.com>
wrote:

> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.
>

Ah, okay, awesome. Let me give that a go.

Re: Assigning a unique row ID

Posted by Subhash Sriram <su...@gmail.com>.
Hi,

We use monotonically_increasing_id() as well, but just cache the table first like Ankur suggested. With that method, we get the same keys in all derived tables. 
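
Roughly this pattern, sketched from memory (sourceDf is whatever builds
the table in your job):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    val tableA = sourceDf
      .withColumn("row_id", monotonically_increasing_id())
      .cache()
    tableA.count()  // materialize the cache up front

    // Everything derived from tableA now sees the same row_id values.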

Thanks,
Subhash

Sent from my iPhone


Re: Assigning a unique row ID

Posted by Everett Anderson <ev...@nuna.com.INVALID>.
Hi,

Thanks, but that's using a random UUID. Collisions are certainly unlikely,
but not impossible, so uniqueness isn't actually guaranteed.

I'd prefer something like monotonically_increasing_id or RDD's
zipWithUniqueId, but with better behavioral characteristics -- so it
doesn't surprise people when two or more outputs derived from the same
original table end up with different IDs for the same rows.

It seems like this would be possible under the covers, but it would carry
the performance penalty of needing to do perhaps a count() and then also a
checkpoint.
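
Something along these lines is what I'm imagining (untested;
Dataset.checkpoint is Spark 2.1+, and df stands in for table A):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    // checkpoint() is eager by default: it materializes the result and
    // truncates the lineage, so the ids can't later be recomputed
    // differently for each derived table.
    val stableA = df.withColumn("row_id", monotonically_increasing_id())
      .checkpoint()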

I was hoping there's a better way.


On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <se...@gmail.com> wrote:

> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator

Re: Assigning a unique row ID

Posted by Tim Smith <se...@gmail.com>.
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
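
The gist of that answer, if I'm reading it right, is a random-UUID column
via a UDF -- roughly (df standing in for your DataFrame):

    import org.apache.spark.sql.functions.udf

    val uuid = udf(() => java.util.UUID.randomUUID().toString)
    val withId = df.withColumn("row_id", uuid())
    // Note: evaluated lazily, so the UUIDs are regenerated on each action
    // unless you cache or save first.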



--
Thanks,

Tim