You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "nikfio (via GitHub)" <gi...@apache.org> on 2024/04/10 20:23:43 UTC

[I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

nikfio opened a new issue, #41132:
URL: https://github.com/apache/arrow/issues/41132

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi guys,
   
   I am trying to convert a timestamp column from string to timestamp datatype.
   I have this format: `%Y%m%d %H%M%S%f`
   To read, for example, a date equal to `20090101 185956000`.
   
   Assuming that
   `
   table['timestamp'][0] = '20090101 185956000'
   `
   
   Running the script below
   `
   import pyarrow.compute as pc
   `
   `
   pc.strptime(table['timestamp'][0], format='%Y%m%d %H%M%S%f', unit='ms')
   `
   throws the error:
   `
   *** pyarrow.lib.ArrowInvalid: Failed to parse string: '20090101 185956000' as a scalar of type timestamp[ms]
   `
   
   Meanwhile if I use strptime from datetime module:
   
   `
   test_date = str(table['timestamp'][0])
   `
   `
   datetime.strptime(str(df['timestamp'][0]), '%Y%m%d %H%M%S%f')
   `
   It gives as output a correctly read datetime object:
   `
   datetime.datetime(2009, 1, 1, 18, 59, 56)
   `
   
   
   Should the pyarrow strptime work like the standard datetime strptime function?
   What am I doing wrong?
   
   
   Many thanks,
   Nick
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #41132:
URL: https://github.com/apache/arrow/issues/41132#issuecomment-2052635146

   Thanks for confirming it worked and closing the isssue @nikfio.
   We might want to implement `%f` and other flags in the future if there's enough interest. It would require a discussion first and some C++ work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #41132:
URL: https://github.com/apache/arrow/issues/41132#issuecomment-2048573219

   Thanks for reporting this @nikfio !
   This is due to [`pc.strptime`](https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.strptime.html#pyarrow-compute-strptime) using the [C/C++ format semantics](https://man7.org/linux/man-pages/man3/strptime.3.html) for parsing instead of the Python ones. Namely it seems it's the `%f` flag that is causing this issue as this works fine:
   ```python
   >>> pc.strptime('20090101 185956', format='%Y%m%d %H%M%S', unit='ms')
   <pyarrow.TimestampScalar: '2009-01-01T18:59:56.000'>
   ```
   To work around this you can try [this (cumbersome) approach] for now:
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
   ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
   d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
   pc.add(ts2, d)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #41132:
URL: https://github.com/apache/arrow/issues/41132#issuecomment-2052555361

   Hey @nikfio, sorry I didn't use your raw data. This works for me on your example:
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   ts = pa.array(["20090101 185956123"], pa.string())
   ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 15), format="%Y%m%d %H%M%S", unit="ms")
   d = pc.utf8_slice_codeunits(ts, 15, 99).cast(pa.int64()).cast(pa.duration("ms"))
   pc.add(ts2, d)
   ```
   ```
   <pyarrow.lib.TimestampArray object at 0x73006a846680>
   [
     2009-01-01 18:59:56.123
   ]
   ```
   Your Pandas/numpy workaround looks good. I'm not sure which approach would be better for your usecase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "nikfio (via GitHub)" <gi...@apache.org>.
nikfio commented on issue #41132:
URL: https://github.com/apache/arrow/issues/41132#issuecomment-2052518403

   Hello @rok,
   
   thank you for your help.
   
   I tried your piece suggested. But executing the first operation still gives the same error :
   
   `ts2 = pc.strptime(pc.utf8_slice_codeunits('20090101 185956000',
                      0,
                                             19),
                     format='%Y%m%d %H%M%S%f',
                     unit="ns")`
   `*** pyarrow.lib.ArrowInvalid: Failed to parse string: '20090101 185956000' as a scalar of type timestamp[ns]`
   
   
   I know and I am sorry actually, this date format (`%Y%m%d %H%M%S%f`) is a pain in the ass.
   
   I managed to work aroud the issue by reading the timestamp at first as pyarrow string and the passing through a datetime conversion using pandas `to_datetime`. Then finally convert the datetime array into pyarrow array with type `timestamp('ms')`.
   
   1. convert timestamp with type pyarrow string  - call it `timestamp_str` - to pandas datetime using the date format wanted from the start:
   `import pandas as pd`
   `std_datetime = pd.to_datetime(timestamp_str.to_numpy(), 
                                          format=`%Y%m%d %H%M%S%f`)`
   2.  convert back to pyarrow `array`;
   `import pyarrow as pa`
   `timecol = pa.array(std_datetime, 
                                      type=pa.timestamp('ms'))`
   3. rebuild table as wanted 
   `target_schema = pa.schema([('timestamp', pa.timestamp('ms')), 'otehr columns types'])`
   `table = pa.Table.from arrays(
      [ timecol, 'other cols' ], schema=target_schema`
   
   Didn't wrote everything clear in point (3) to be more synthesized. 
   Hope it is understable from everyone, otherwise let me know.
   
   Thanks,
   Nick
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "nikfio (via GitHub)" <gi...@apache.org>.
nikfio commented on issue #41132:
URL: https://github.com/apache/arrow/issues/41132#issuecomment-2052609675

   Close as a prime solution has been suggested and an alternative one has been proposed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "nikfio (via GitHub)" <gi...@apache.org>.
nikfio closed issue #41132: [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f'
URL: https://github.com/apache/arrow/issues/41132


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #41132:
URL: https://github.com/apache/arrow/issues/41132#issuecomment-2048574656

   Tag #31324 for reference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' [arrow]

Posted by "nikfio (via GitHub)" <gi...@apache.org>.
nikfio commented on issue #41132:
URL: https://github.com/apache/arrow/issues/41132#issuecomment-2052608698

   Great @rok I just tested it and it  works fine as expected.
   
   sorry also my fault I didn't understand at first the usage of `pc.utf8_slice_codeunits`.
   
   I think I'll go with yours in order to have a all-pyarrow operation.
   
   
   Thanks a lot,
   Nick
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org