Posted to issues@arrow.apache.org by "WillAyd (via GitHub)" <gi...@apache.org> on 2023/06/16 19:53:22 UTC

[GitHub] [arrow-adbc] WillAyd opened a new issue, #812: Document Comparison to pandas?

WillAyd opened a new issue, #812:
URL: https://github.com/apache/arrow-adbc/issues/812

   I was experimenting with the ADBC PostgreSQL driver and comparing it to the equivalent pandas SQL read/write functions. I put a rough draft of the results up on my blog:
   
   https://willayd.com/leveraging-the-adbc-driver-in-analytics-workflows.html
   
   Do you think any of that is worth integrating into the documentation here? I'm not sure how much we want to highlight differences from other tools in the space.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-adbc] lidavidm commented on issue #812: Document Comparison to pandas?

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1608209196

   Actually I'll just take a look at what pandas does currently when I get a chance, and then think about how to mimic that


[GitHub] [arrow-adbc] WillAyd commented on issue #812: Document Comparison to pandas?

Posted by "WillAyd (via GitHub)" <gi...@apache.org>.
WillAyd commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1608244136

   Yeah, the `parse_dates` argument is there in case the driver itself cannot infer the date type, which lets pandas apply its own inference logic. But it isn't always required, and it is usually preferable to let the driver handle it. With sqlite you can see the dtype is maintained on a round trip:
   
   ```python
   >>> import pandas as pd
   >>> from sqlalchemy import create_engine
   >>> df = pd.DataFrame([[pd.Timestamp("2023-01-01")]], columns=["dt"]) 
   >>> engine = create_engine('sqlite://', echo=False)
   >>> df.to_sql("test", con=engine, index=False)
   >>> pd.read_sql("test", con=engine).dtypes
   dt    datetime64[ns]
   dtype: object
   ```


[GitHub] [arrow-adbc] lidavidm commented on issue #812: Document Comparison to pandas?

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1608266517

   Ah, I see, thanks. In that case, maybe the right option to provide is some way to map the SQLite column type to a date/time/datetime Arrow type and format string, and then Pandas can configure it to mimic the standard library sqlite3 module. (Though it sounds like SQLAlchemy can do this itself as well from that reference.)
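
A rough sketch of the idea above, using only the standard library rather than the ADBC SQLite driver or Arrow types. The `TEMPORAL_FORMATS` mapping and the `decode_value` and `read_table` helpers are hypothetical names made up for illustration; they are not an existing driver option. The sketch assumes the table name is trusted, since it is interpolated into the SQL text.

```python
import sqlite3
from datetime import date, datetime

# Hypothetical mapping: declared SQLite column type -> (target kind, format string).
TEMPORAL_FORMATS = {
    "DATE": ("date", "%Y-%m-%d"),
    "TIME": ("time", "%H:%M:%S"),
    "DATETIME": ("datetime", "%Y-%m-%d %H:%M:%S"),
}


def decode_value(declared_type, value):
    """Parse a stored string according to the column's declared type, if mapped."""
    if value is None or not declared_type:
        return value
    entry = TEMPORAL_FORMATS.get(declared_type.upper())
    if entry is None:
        return value
    kind, fmt = entry
    parsed = datetime.strptime(value, fmt)
    if kind == "date":
        return parsed.date()
    if kind == "time":
        return parsed.time()
    return parsed


def read_table(conn, table):
    """Read every row of a table, decoding columns by their declared type."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    declared = [row[2] for row in conn.execute(f"PRAGMA table_info({table})")]
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [
        tuple(decode_value(t, v) for t, v in zip(declared, row))
        for row in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, day DATE, at DATETIME, note TEXT)")
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    (1, "2023-01-01", "2023-01-01 12:30:00", "hi"),
)
decoded_rows = read_table(conn, "events")
conn.close()
```

Keying the mapping on the declared type keeps the configuration on the caller's side, so a caller could supply the same formats the standard library `sqlite3` module or SQLAlchemy would write.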


[GitHub] [arrow-adbc] WillAyd commented on issue #812: Document Comparison to pandas?

Posted by "WillAyd (via GitHub)" <gi...@apache.org>.
WillAyd commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1595307702

   That recipes page looks nice - I'll see what I can add there. 
   
   @datapythonista @MarcoGorelli think this is worth tweeting from the pandas account? I have a Mastodon account so I can post there, but this _might_ be good for Twitter users
   
   As far as your medium-to-long-term goal, I don't want to speak for the entire pandas team just yet, but I agree it would be good to integrate directly. The SQL part of the pandas codebase has a lot of legacy cruft and isn't as actively maintained as other parts, so pandas stands to gain a lot from using ADBC internally


[GitHub] [arrow-adbc] MarcoGorelli commented on issue #812: Document Comparison to pandas?

Posted by "MarcoGorelli (via GitHub)" <gi...@apache.org>.
MarcoGorelli commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1595661808

   nice!
   
   regarding posting - the access to Twitter is in the 1password (see Joris' email); if you join then you should be able to access it and post


[GitHub] [arrow-adbc] WillAyd commented on issue #812: Document Comparison to pandas?

Posted by "WillAyd (via GitHub)" <gi...@apache.org>.
WillAyd commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1608211661

   pandas will just defer to sqlalchemy or sqlite3. I think both just store those values as ISO strings. Here are relevant docs:
   
   https://docs.python.org/3/library/sqlite3.html#default-adapters-and-converters
   https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#date-and-time-types
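
To make the "ISO strings" point concrete, here is a small sketch using only the standard library `sqlite3` module. It registers its own adapter and converter instead of relying on the module defaults (which are deprecated as of Python 3.12); the function names `adapt_datetime_iso` and `convert_datetime_iso` are chosen for illustration.

```python
import sqlite3
from datetime import datetime


def adapt_datetime_iso(value: datetime) -> str:
    """Store a datetime as an ISO 8601 string with a space separator."""
    return value.isoformat(sep=" ")


def convert_datetime_iso(raw: bytes) -> datetime:
    """Turn the stored ISO 8601 text back into a datetime."""
    return datetime.fromisoformat(raw.decode())


sqlite3.register_adapter(datetime, adapt_datetime_iso)
sqlite3.register_converter("timestamp", convert_datetime_iso)

original = datetime(2023, 1, 1, 12, 30)

# PARSE_DECLTYPES makes sqlite3 pick a converter from the declared column type.
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE test (dt timestamp)")
conn.execute("INSERT INTO test VALUES (?)", (original,))

# Expressions carry no declared type, so these come back unconverted.
storage_class, stored_text = conn.execute(
    "SELECT typeof(dt), CAST(dt AS TEXT) FROM test"
).fetchone()

# Selecting the bare column triggers the registered converter.
round_tripped = conn.execute("SELECT dt FROM test").fetchone()[0]
conn.close()
```

Here SQLite itself only ever sees text; the datetime type exists purely in the Python-side adapter and converter pair.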


[GitHub] [arrow-adbc] lidavidm commented on issue #812: Document Comparison to pandas?

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1608223002

   Cool, thanks. 
   
   It looks like read_sql has you explicitly specify which columns to read as datetimes, so we can probably reasonably add an option for that to the SQLite driver. Though it might be easier/more consistent to just do it as a post-processing step instead of in-driver...? But given the layers in between, it may be valuable to just support it directly anyways.
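
For comparison, the post-processing route could look roughly like the following. This is a minimal sketch with the standard library only, assuming the values are stored as ISO 8601 strings; `parse_datetime_columns` is a hypothetical helper whose `parse_dates` parameter merely imitates the spirit of the pandas argument, not its full behavior.

```python
import sqlite3
from datetime import datetime


def parse_datetime_columns(column_names, rows, parse_dates):
    """Return rows with the named columns converted from ISO strings to datetime."""
    targets = {i for i, name in enumerate(column_names) if name in parse_dates}
    return [
        tuple(
            datetime.fromisoformat(value)
            if i in targets and value is not None
            else value
            for i, value in enumerate(row)
        )
        for row in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, dt TEXT)")
conn.execute("INSERT INTO test VALUES (1, '2023-01-01 00:00:00')")
conn.execute("INSERT INTO test VALUES (2, NULL)")

cur = conn.execute("SELECT id, dt FROM test ORDER BY id")
names = [col[0] for col in cur.description]  # column names from the cursor
converted = parse_datetime_columns(names, cur.fetchall(), parse_dates=["dt"])
conn.close()
```

Doing this outside the driver keeps the driver simple, at the cost of a second pass over the data after it has already been materialized.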


[GitHub] [arrow-adbc] WillAyd commented on issue #812: Document Comparison to pandas?

Posted by "WillAyd (via GitHub)" <gi...@apache.org>.
WillAyd commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1608200527

   FYI I started integration with pandas in https://github.com/pandas-dev/pandas/pull/53869 . Looks like we aren't too far off on meeting the pandas requirements, just need int8 support for postgres and datetime support for the postgres/sqlite drivers


[GitHub] [arrow-adbc] lidavidm commented on issue #812: Document Comparison to pandas?

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1595299752

   This is quite cool, thanks for sharing!
   
   I wonder what the best way might be to integrate this. Much of this is really about how Arrow as a whole compares to Pandas (e.g. the data types).
   
   Maybe we could consider explicit "if you did this in Pandas, do this with ADBC" examples? There are some examples going into the next release: https://arrow.apache.org/adbc/main/python/recipe/postgresql.html
   
   The other thing could be highlighting your post somehow (retweeting it?)
   
   Medium-to-long term, I was actually hoping we could integrate ADBC directly in the Pandas read/write_sql functions. 


[GitHub] [arrow-adbc] lidavidm commented on issue #812: Document Comparison to pandas?

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #812:
URL: https://github.com/apache/arrow-adbc/issues/812#issuecomment-1608204026

   Oh that's great!
   
   For SQLite: is there a standard date/time/datetime encoding? That seems to be the main issue with stuffing those values in SQLite.

