Posted to dev@arrow.apache.org by Li Jin <ic...@gmail.com> on 2023/02/15 15:17:59 UTC

Question about memory usage and type casting using pyarrow Table

Hello!

I have some questions about the memory usage of type casting with a pyarrow
Table. Let's say I have a pyarrow Table with 100 columns.

(1) If I want to cast n columns to a different type (e.g., float to int),
what is the smallest memory overhead I can achieve? (The memory overhead of 1
column, n columns, or 100 columns?)

(2) If I want to cast n timestamp columns from tz-naive to tz-UTC, what is
the smallest memory overhead I can achieve? (0, 1 column, n columns, or 100
columns?)

Thanks!
Li

Re: Question about memory usage and type casting using pyarrow Table

Posted by Weston Pace <we...@gmail.com>.
> (1) if I want to cast n columns to a different type (e.g., float to int).
> What is the smallest memory overhead that I can do? (memory overhead of 1
> column, n columns or 100 columns?)

You should be able to do this with only 1 column of overhead, though you
might need to go a little out of your way to ensure the original table is
deleted so that it doesn't hold on to the old columns:

Example:

```
import pyarrow as pa
import pyarrow.compute as pc

my_table = pa.Table.from_pydict({
    'a': list(range(100)),
    'b': list(range(100)),
    'c': list(range(100)),
})
print('Starting table')
print(my_table)
print(f'Starting RAM usage: {pa.default_memory_pool().bytes_allocated()}')

# Keep references to the columns, then drop the table so that it no longer
# holds on to the old columns.
cols = my_table.columns
names = my_table.column_names
del my_table

for idx in range(len(cols)):
    # Overwriting the list entry drops the last reference to the old column,
    # so at most one extra column is alive at any time.
    cols[idx] = pc.cast(cols[idx], pa.int16())
    print(f'RAM usage after converting col {idx}: '
          f'{pa.default_memory_pool().bytes_allocated()}')

new_table = pa.Table.from_arrays(cols, names=names)
print('Final table')
print(new_table)

print(f'Final RAM usage: {pa.default_memory_pool().bytes_allocated()}')
```

Output:

Starting table
pyarrow.Table
a: int64
b: int64
c: int64
----
a: [[0,1,2,3,4,...,95,96,97,98,99]]
b: [[0,1,2,3,4,...,95,96,97,98,99]]
c: [[0,1,2,3,4,...,95,96,97,98,99]]
Starting RAM usage: 2496
RAM usage after converting col 0: 1984
RAM usage after converting col 1: 1472
RAM usage after converting col 2: 960
Final table
pyarrow.Table
a: int16
b: int16
c: int16
----
a: [[0,1,2,3,4,...,95,96,97,98,99]]
b: [[0,1,2,3,4,...,95,96,97,98,99]]
c: [[0,1,2,3,4,...,95,96,97,98,99]]
Final RAM usage: 960

Re: Question about memory usage and type casting using pyarrow Table

Posted by Aldrin <ak...@ucsc.edu.INVALID>.
I think you can replace the schema metadata using [1]. You can perhaps also
do the same for the field metadata, depending on where timezone metadata
may be on a timestamp array [2].

[1]:
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.replace_schema_metadata
[2]:
https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.with_metadata

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

Re: Question about memory usage and type casting using pyarrow Table

Posted by Li Jin <ic...@gmail.com>.
Oh thanks, that could be a workaround! I thought pa tables are supposed to
be immutable. Is there a safe way to just change the metadata?

Re: Question about memory usage and type casting using pyarrow Table

Posted by Rok Mihevc <ro...@gmail.com>.
Well that's suboptimal. As a workaround I suppose you could just change the
metadata if the array is timezone aware.

Re: Question about memory usage and type casting using pyarrow Table

Posted by Li Jin <ic...@gmail.com>.
Oh found this comment:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156



Re: Question about memory usage and type casting using pyarrow Table

Posted by Li Jin <ic...@gmail.com>.
Not sure if this is actually a bug or expected behavior - I filed
https://github.com/apache/arrow/issues/34210

Re: Question about memory usage and type casting using pyarrow Table

Posted by Li Jin <ic...@gmail.com>.
Hmm, something feels off here. I did the following experiment on Arrow 11,
and casting timestamp-naive to int64 is much faster than casting
timestamp-naive to timestamp-utc:

In [16]: %time table.cast(schema_int)
CPU times: user 114 µs, sys: 30 µs, total: 144 µs
*Wall time: 231 µs*
Out[16]:
pyarrow.Table
time: int64
----
time: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]

In [17]: %time table.cast(schema_tz)
CPU times: user 119 ms, sys: 140 ms, total: 260 ms
*Wall time: 259 ms*
Out[17]:
pyarrow.Table
time: timestamp[ns, tz=UTC]
----
time: [[1970-01-01 00:00:00.000000000,1970-01-01
00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01
00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01
00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01
00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01
00:00:00.099999999]]

In [18]: table
Out[18]:
pyarrow.Table
time: timestamp[ns]
----
time: [[1970-01-01 00:00:00.000000000,1970-01-01
00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01
00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01
00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01
00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01
00:00:00.099999999]]

Re: Question about memory usage and type casting using pyarrow Table

Posted by Rok Mihevc <ro...@gmail.com>.
I'm not sure about (1) but I'm pretty sure for (2) doing a cast of tz-aware
timestamp to tz-naive should be a metadata-only change.

Re: Question about memory usage and type casting using pyarrow Table

Posted by Li Jin <ic...@gmail.com>.
Asking (2) because IIUC this is a metadata operation that could be zero
copy but I am not sure if this is actually the case.
