You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Jacek Pliszka (Jira)" <ji...@apache.org> on 2022/10/24 19:21:00 UTC

[jira] [Comment Edited] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

    [ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623355#comment-17623355 ] 

Jacek Pliszka edited comment on ARROW-15474 at 10/24/22 7:20 PM:
-----------------------------------------------------------------

Lance - the code you have posted might not be very efficient - something like below should be faster:

1. add column with sequential number - index

{code:python}
import pyarrow.compute as pc
t1 = t.append_column('i', 
    pc.cumulative_sum(
        pc.equal(t.column('keys'), t.column('keys')).cast('int64')
    )
)
{code}

2. Find first row indices
{code:python}
t2 = t1.group_by(['keys', 'values']).aggregate([('i', 'min')]).column('i_min')
{code}

3. select rows with first row indices:
{code:python}
t.filter(pc.is_in(t1.column('i'), t2))
{code}


On my PC your code is 1.19s while code above is 0.25s. to_pandas.drop_duplicates was around .36s


was (Author: jacek.pliszka):
Lance - the code you have posted might not be very efficient - something like below should be faster:

1. add column with sequential number - index

{code:python}
import pyarrow.compute as pc
t1 = t.append_column('i', 
    pc.cumulative_sum(
        pc.equal(t.column('keys'), t.column('keys')).cast('int64')
    )
)
{code}

2. Find first row indices
{code:python}
t2 = t1.group_by(['keys', 'values']).aggregate([('i', 'min')]).column('i_min')
{code}

3. select rows with first row indices:
{code:python}
t.filter(pc.is_in(t1.column('i'), t2))
{code}


On my PC your code is 1.19s while code above is 0.25s

> [Python] Possibility of a table.drop_duplicates() function?
> -----------------------------------------------------------
>
>                 Key: ARROW-15474
>                 URL: https://issues.apache.org/jira/browse/ARROW-15474
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 6.0.1
>            Reporter: Lance Dacey
>            Priority: Major
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the example above would be great. 
> I usually am reading an append-only dataset and then I need to report on latest version of each row. To drop duplicates, I am temporarily converting the append-only table to a pandas DataFrame, and then I convert it back to a table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", "ascending")]).drop_duplicates(subset=["id"] keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)