Posted to user@arrow.apache.org by Gary Clark <gc...@gmail.com> on 2020/09/06 14:30:07 UTC

[Python] fastest filtering approach for applying some operations over 1 million rows

Hi all,

I am looking for some suggestions on how I can speed up an analytical
workload I am currently working on. This would really help me prove out
a use case for Apache Arrow becoming our go-to approach for
data-intensive operations.

I have 2 datasets:

dataset 1 has 1.5 million rows in 1 parquet file
dataset 2 has 60 million rows in 8 parquet files within a directory

For each row in dataset 1, there are several columns that become the
conditions for how I filter for matching records in dataset 2. I am
applying these filters when reading in dataset 2. Here is my code:

```
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import time

ds1 = pq.read_table('dataset1.parquet')

ds1_df = ds1.to_pandas().head(5)  # testing performance with a sample

for i, r in ds1_df.iterrows():
    start = time.time()
    filters = [
        ('timestamp_col', '>=', r.start_date),
        ('timestamp_col', '<=', r.end_date),
        ('condition1_non_null', '=', r.condition1),
        ('condition2_with_nulls', '=', r.condition2)
    ]

    # re-read dataset 2 with the row-specific filters pushed down
    ds2 = pq.read_table('./path/dataset2/', filters=filters)

    ds2_df = ds2.to_pandas()

    r['count_condition1_distinct_records'] = ds2_df['condition1_non_null'].nunique()

    print(time.time() - start)
```
Is this the fastest approach for something like this? I added some
timing, and it takes ~5-6 seconds per row, which works out to roughly 80
days for the full 1.5 million rows.

One thought I had was to load dataset 2 into memory once instead of on
each iteration, but I struggled to figure out whether PyArrow can apply
these kinds of filters to an in-memory table yet. I also couldn't figure
out how to filter on actual null values in the filter conditions (like
for condition 2, which has nulls), so instead I created the parquet
files with the NULLs encoded as strings.
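
For reference, here is a rough sketch of the in-memory approach I had in
mind, assuming a PyArrow build where the pyarrow.compute kernels and
Table.filter are available. The names start_date, end_date, condition1,
and condition2 are just stand-ins for the values taken from one
dataset-1 row:

```
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read dataset 2 once instead of re-reading it for every dataset-1 row.
ds2 = pq.read_table('./path/dataset2/')

# Range conditions on the timestamp column.
mask = pc.and_(
    pc.greater_equal(ds2['timestamp_col'], start_date),
    pc.less_equal(ds2['timestamp_col'], end_date),
)
# Equality condition on the non-null column.
mask = pc.and_(mask, pc.equal(ds2['condition1_non_null'], condition1))

# Real nulls can be matched directly with is_null(), so the
# NULL-as-string workaround should not be necessary.
if condition2 is None:
    mask = pc.and_(mask, pc.is_null(ds2['condition2_with_nulls']))
else:
    mask = pc.and_(mask, pc.equal(ds2['condition2_with_nulls'], condition2))

filtered = ds2.filter(mask)
n_distinct = len(pc.unique(filtered['condition1_non_null']))
```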

Any guidance is appreciated. Thanks!

-- 
Gary Clark
*Data Scientist & Data Engineer*
*B.S. Mechanical Engineering, Howard University '13*
+1 (717) 798-6916
gclarkjr5@gmail.com

Re: [Python] fastest filtering approach for applying some operations over 1 million rows

Posted by Jeff Reback <je...@gmail.com>.
Your code is extremely inefficient in pandas (iterrows should only ever be a last resort) - you should ask on StackOverflow for suggestions.

Simply merging and filtering is likely to work and be very fast.
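
Something like the following sketch (column names taken from the snippet above; whether the intermediate merge fits in memory depends on how selective the two equality keys are, so treat this as an illustration rather than a drop-in solution):

```
import pandas as pd

ds1_df = pd.read_parquet('dataset1.parquet')
ds2_df = pd.read_parquet('./path/dataset2/')

# Equi-join on the two condition columns, keeping dataset 1's row index
# (assumed to be a default integer index) so results can be written back
# per row.
merged = ds1_df.reset_index().merge(
    ds2_df,
    left_on=['condition1', 'condition2'],
    right_on=['condition1_non_null', 'condition2_with_nulls'],
)

# Keep only dataset-2 rows that fall inside each dataset-1 row's window.
merged = merged[merged['timestamp_col'].between(merged['start_date'],
                                                merged['end_date'])]

# One distinct count per original dataset-1 row.
counts = merged.groupby('index')['condition1_non_null'].nunique()
ds1_df['count_condition1_distinct_records'] = counts.reindex(ds1_df.index,
                                                             fill_value=0)
```

If the merge blows up memory, chunking dataset 1 and merging chunk by chunk is one option.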

Re: [Python] fastest filtering approach for applying some operations over 1 million rows

Posted by Matthew Corley <ma...@gmail.com>.
In this case, it looks like Gary would be served by the ability to filter a
table after read, using a mask (or several masks, it looks like) generated
by his first dataframe.  I still think the right answer is "do this in
Pandas for now, the dataframe-like APIs you are after don't yet exist in
PyArrow".  This means that you will bear the memory overhead of reading the
full dataset and also the conversion to pandas (worst case, peak memory
utilization double the dataset size).  I do remember there being some
recent addition to the Table API related to filtering, but I haven't
personally tested it and I think it's not available in all official builds
yet (I think it is present in conda releases on Linux, though).
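
For example, a minimal pandas-only sketch of that approach, assuming
dataset 2 fits in memory as a single DataFrame (column names come from
Gary's snippet; real nulls are matched with isna() rather than string
placeholders):

```
import pandas as pd

# Read both datasets once.
ds1_df = pd.read_parquet('dataset1.parquet')
ds2_df = pd.read_parquet('./path/dataset2/')

def count_distinct(row):
    # Boolean mask for one dataset-1 row against all of dataset 2.
    mask = (
        (ds2_df['timestamp_col'] >= row.start_date)
        & (ds2_df['timestamp_col'] <= row.end_date)
        & (ds2_df['condition1_non_null'] == row.condition1)
    )
    if pd.isna(row.condition2):
        mask &= ds2_df['condition2_with_nulls'].isna()
    else:
        mask &= ds2_df['condition2_with_nulls'] == row.condition2
    return ds2_df.loc[mask, 'condition1_non_null'].nunique()

ds1_df['count_condition1_distinct_records'] = ds1_df.apply(count_distinct,
                                                           axis=1)
```

This avoids re-reading the parquet files on every iteration, but it
still scans dataset 2 once per dataset-1 row, so the merge approach
suggested elsewhere in the thread is likely to scale better.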

Re: [Python] fastest filtering approach for applying some operations over 1 million rows

Posted by Micah Kornfield <em...@gmail.com>.
Hi Gary,
I'm not sure I fully understand the use case, but it sounds like this might
be best handled by a join, which doesn't currently exist in Arrow (though I
think the functionality exists in pandas?).

-Micah
