You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/19 18:26:47 UTC

[GitHub] [arrow-adbc] lidavidm opened a new issue, #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

lidavidm opened a new issue, #71:
URL: https://github.com/apache/arrow-adbc/issues/71

   Pgeon: https://github.com/0x0L/pgeon
   ConnectorX: https://sfu-db.github.io/connector-x/intro.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1340148677

   Async APIs are somewhere down on the list of things I would like to explore! But the 'base' API is all blocking. (I also haven't tried binding async C/C++ interfaces to Python's async APIs yet - I need to look at whether callbacks, polling, or something else is preferred/ergonomic.)
   
   Thanks for the heads up - I'll make sure to support the broader ecosystem (I quite like trio's ideas, even if I haven't gotten a chance to use it in practice).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540150909

   > Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.
   
   FWIW, this is mentioned in the issue above. I _think_ when I looked at it, it seemed like libpq would read the entire response before returning to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] kou commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540113320

   Yes.
   (https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.sql is the same performance as `select.c`.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] kou commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540106549

   I think that it's "after".
   The benchmark is https://github.com/apache/arrow-flight-sql-postgresql/tree/main/benchmark/integer .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540061633

   This benchmark (in slides) found that the libpq driver is very slow: https://www.clear-code.com/blog/2023/5/8/rubykaigi-2023-announce.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] kou commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540107731

   FYI: The slide URL in the blog post: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2023/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1222665133

   ## ConnectorX
   
   - Can partition the query along a given column, then fetch the partitions in parallel
   - "Copy-exactly-once" architecture
   - Uses preallocated buffers where possible (also, appears to do things like implement its own conversion to Python strings)
   
   These optimizations would probably be difficult to support, though we should preallocate where possible.
   
   ## Turbodbc
   
   ConnectorX's docs compare it to Turbodbc which tends to trail it, though Turbodbc does not appear to implement parallelization (that might explain the difference). 
   
   Turbodbc also lists some optimizations:
   https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html
   
   In particular, it can interleave I/O and conversion. That may be interesting for us, though libpq seems to only either give you a choice between row-at-a-time or getting all query results at once. 
   
   Turbodbc also implements some _memory_ optimizations: dictionary-encoding string fields, and dynamically determining the minimum integer width.
   
   ## pgeon
   - [Uses COPY](https://github.com/0x0L/pgeon/blob/daa0a82429934511f6863f637da484f707c815f9/src/c%2B%2B/pg_interface.cc) (DuckDB appears to do this too, though note DuckDB's postgres extension is GPL) That honestly seems to be the main optimization
   - Queries some metadata tables up front to determine proper types


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1222763745

   Ah, thanks. I noticed that, and it seems like FETCH also requires you to manage a server-side cursor which isn't great. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540109145

   Ah, so the 'libpq' column is https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.c ? In that case I would expect it to be slower by definition since we're doing extra work to convert the result set to Arrow. And the Flight SQL server has an advantage since it can grab the data directly from PostgreSQL without going through the PostgreSQL wire protocol.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] paleolimbot commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540157185

   Yes, it won't return anything less than one row at a time. But right now we do download -> decode -> download ->decode and we in theory could do
   
   ```
   download -> download -> download -> download -> download ->
                           sync -> decode -> wait              sync -> decode
   ```
   
   ...such that the only time the user pays for is download time. (Probably complicated to get right, though).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] 0x0L commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by GitBox <gi...@apache.org>.

0x0L commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1222735811

   @lidavidm for pgeon I have experimented with FETCH instead of COPY. COPY was the fastest method in my [limited] testings


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] dhirschfeld commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by GitBox <gi...@apache.org>.

dhirschfeld commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1340130033

   > *In particular, it can interleave I/O and conversion*
   
   If you're implementing an async interface, as a [`trio`](https://trio.readthedocs.io/en/stable/) user, it would be great if you could use [`anyio`](https://github.com/agronholm/anyio) rather than native `acyncio` features. This will enable the code to be used with any async library.
   
   Perhaps the most prominent Python library to support AnyIO is [`fastapi`](https://fastapi.tiangolo.com/async/#write-your-own-async-code), and that's where I'd (eventually) like to make use of `adbc` - asynchronously connecting to databases for displaying data in FastAPI dashboards.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] paleolimbot commented on issue #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #71:
URL: https://github.com/apache/arrow-adbc/issues/71#issuecomment-1540091089

   Is that before or after #636?
   
   FWIW, after that PR you could write benchmarks for reading a raw COPY buffer (i.e., without reading over a connection). Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org