You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/10 16:15:03 UTC

[GitHub] [arrow-adbc] zeroshade opened a new issue, #61: Simplify Execute and Query interface

zeroshade opened a new issue, #61:
URL: https://github.com/apache/arrow-adbc/issues/61

   Rather than the separate `Execute` / `GetStream` functions, it might be better to follow something similar to FlightSQL's interface or Go's `database/sql` API.
   
   Have two functions:
   
   * Execute a query without expecting a result set: `Execute(struct AdbcConnection*, const char*, struct AdbcResult*, struct AdbcError*)` where `AdbcResult` would contain an optional LastInsertedID and Number of Rows affected
   * Execute a query to retrieve a result set: `Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*)` where the `ArrowArrayStream` is populated with the result set.
   
   Corresponding methods would exist for a Statement just without the need for taking the `const char*` as it would already be prepared in the statement.
   
   Some benefits of this idea:
   * With the interface enforcing the idea of the query -> retrieving the results being a single API call it makes concurrency easier to handle (any complexity would be handled by the driver implementation) by consumers of the interface.
   * Drivers can be aware of whether or not a user is expecting a result set and can operate accordingly. They don't have to interrogate the backend to know whether or not a result set is available or if the query was an update/insert vs a select.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1218147989

   Ok, I think we'll want to tweak the partitioning interface further after this refactor, see #68.
   
   Also I suppose this supersedes the 'affected row count' issue in #55.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1210987487

   I think it may still have sense to have a generic Execute to ease compatibility with APIs that do not differentiate between the types of queries (and note JDBC has all three!), but having an execute-with-rows-affected and execute-with-result-set is reasonable. 
   
   Concurrency is a separate discussion; I'm torn on whether we should push the complexity into the driver (and declare up front that, for example, everything must be thread-safe), or declare that everything is _not_ thread-safe and leave clients to deal with it (so Go would have to lock the statement to use it), or (like DBAPI in Python) provide a way for drivers to indicate what they support (this is probably the worst of both worlds though). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1226388118

   @zeroshade just to follow up, does the linked PR look reasonable (at least interface-wise)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1210994922

   CC @pitrou, @hannes, @krlmlr if you have opinions here? 
   
   @lwhite1 had the same feedback about executeQuery/execute in Java last month. So for consistency a query method returning a result set makes sense at the very least.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] krlmlr commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

krlmlr commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1214402390

   Just to be sure we're on the same page:
   
   - a "query" is a single SQL string that can return a result set but doesn't have to
   - a "statement" is the result of preparing an SQL string with placeholders, so that parameters can be added
   - we're debating whether the caller should declare up front if a result is expected
   
   In R DBI we're using the term "query" to indicate something that returns a result set (with or without parameters), and a "statement" doesn't return a result set but we can query the number of rows affected.
   
   The more information we can share with the driver up front, the better. Do we really need two methods, though -- would a single `Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*)` with an optional third argument (can be `NULL`) also work? Or perhaps an option argument that indicates if we expect a result set, and return metadata (rows affected etc.) via the arrow stream?
   
   What should happen to queries that indicate that no result set is expected but where a result set is available? Is a warning useful or annoying?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1210989539

   Also, possibly the driver manager could define execute-with-result-set and execute-with-rows-affected in terms of the generic execute + generic getters to retrieve the affected rows/last row ID. So a basic driver (e.g. SQLite, which doesn't differentiate) would be straightforward to implement still but clients can be mostly none the wiser. It does increase the actual API surface to have all those permutations though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1214424587

   Right, and on the other hand, databases like SQLite have no reliable way to get the info. But APIs like JDBC, Python DBAPI, and Go's database API expose standard ways to get last inserted ID(s). I wasn't originally concerned with it since it feels like the 'wrong' use case to worry about so I'm fine ignoring it.
   
   I think what's being discussed here is basically to have the same things as DBI, and the `Query` you propose seems sufficient? I suppose a header-only library cannot provide an (overridable) convenience function, so `Query` would have to be a full-fledged API that can be stubbed by the driver manager.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] paleolimbot commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1227170540

   It seems like this would be cleaner as multiple functions (as Kirill noted) by output type? Maybe:
   
   ```c
   ADBC_EXPORT AdbcStatusCode AdbcStatementExecuteVoid(struct AdbcStatement* statement, struct AdbcError* error);
   ADBC_EXPORT AdbcStatusCode AdbcStatementExecute(struct AdbcStatement* statement, struct ArrowArrayStream* stream, struct AdbcError* error);
   ADBC_EXPORT AdbcStatusCode AdbcStatementExecuteParitioned(struct AdbcStatement* statement, void* something, struct AdbcError* error);
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1227216915

   Alright, I'll split into three functions then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1227596827

   Updated the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1217131457

   Ok, while digging I realized we also need to account for partitioning (=Flight/Flight SQL's ability to return multiple endpoints in a GetFlightInfo). So what I'm tentatively about to refactor towards is this:
   
   ```c
   /// \brief Execute a statement and get the results.
   ///
   /// This invalidates any prior result sets.
   ///
   /// \param[in] statement The statement to execute.
   /// \param[in] out_type The expected result type:
   ///   - ADBC_OUTPUT_TYPE_NONE if the query should not generate a
   ///     result set;
   ///   - ADBC_OUTPUT_TYPE_ARROW for an ArrowArrayStream;
   ///   - ADBC_OUTPUT_TYPE_PARTITIONS for a count of partitions (see \ref
   ///     adbc-statement-partition below).
   ///   The result set will be in out.
   /// \param[out] out The results. Must be NULL for output type NONE, a
   ///   pointer to an ArrowArrayStream for ARROW_ARRAY_STREAM, or a
   ///   pointer to a size_t for PARTITIONS.
   /// \param[out] rows_affected The number of rows affected if known,
   ///   else -1. Pass NULL if the client does not want this information.
   /// \param[out] error An optional location to return an error
   ///   message if necessary.
   ADBC_EXPORT
   AdbcStatusCode AdbcStatementExecute(struct AdbcStatement* statement, int output_type,
                                       void* out, int64_t* rows_affected,
                                       struct AdbcError* error);
   
   /// \brief No results are expected from AdbcStatementExecute.  Pass
   ///   NULL to out.
   #define ADBC_OUTPUT_TYPE_NONE 0
   /// \brief Arrow data is expected from AdbcStatementExecute.  Pass
   ///   ArrowArrayStream* to out.
   #define ADBC_OUTPUT_TYPE_ARROW 1
   /// \brief Partitions are expected from AdbcStatementExecute.  Pass
   ///   size_t* to out to get the number of partitions, and use
   ///   AdbcStatementGetPartitionDesc to get a partition.
   ///
   /// Drivers are not required to support partitioning.  In that case,
   /// AdbcStatementExecute will return ADBC_STATUS_NOT_IMPLEMENTED.
   #define ADBC_OUTPUT_TYPE_PARTITIONS 2
   ```
   
   _If_ we decide to add native support for "affected row IDs", that could also go here. Otherwise I don't expect that we'd need more enum variants here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm closed issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm closed issue #61: [Format] Simplify Execute and Query interface
URL: https://github.com/apache/arrow-adbc/issues/61


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] krlmlr commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

krlmlr commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1226765739

   For the record, SQLite now also has `RETURNING`: https://www.sqlite.org/lang_returning.html . I'm in favor of descoping "last inserted ID" for now.
   
   How is the caller supposed to know the size of the `void* out` buffer? Should it be `void** out` instead?
   
   I suspect the type of `out` depends on `output_type`? Would it be worth having separate methods per output type?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1215630552

   Ok, so how does this sound (here, I'm ignoring #59 provide a "just query" method):
   
   Remove `AdbcStatementGetStream`
   Change `AdbcStatementExecute` to `AdbcStatementExecute(struct AdbcStatement*, struct ArrowArrayStream*, size_t*, struct AdbcError*)` where the ArrowArrayStream is for the result set and the size_t is for the rows affected (if known, else 0)
   Change the various `AdbcConnectionGetInfo` methods to take `struct ArrowArrayStream*` as their output instead of `struct AdbcStatement*` (but clarify they still may count as a 'statement' for the purposes of concurrency - that is, they may acquire the same locks, etc. internally)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1214419966

   > Just to be sure we're on the same page:
   > 
   >     * a "query" is a single SQL string that can return a result set but doesn't have to
   > 
   >     * a "statement" is the result of preparing an SQL string with placeholders, so that parameters can be added
   > 
   >     * we're debating whether the caller should declare up front if a result is expected
   > 
   > 
   > In R DBI we're using the term "query" to indicate something that returns a result set (with or without parameters), and a "statement" doesn't return a result set but we can query the number of rows affected.
   
   That sounds reasonable. (`adbc.h` should define some terminology up front!) Though I'm still leaning towards having a separate `Statement` struct even for things that aren't necessarily prepared statements, just because that makes it easier to add options (as suggested) without potentially breaking ABI. 
   
   > The more information we can share with the driver up front, the better. Do we really need two methods, though -- would a single `Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*)` with an optional third argument (can be `NULL`) also work? Or perhaps an option argument that indicates if we expect a result set, and return metadata (rows affected etc.) via the arrow stream?
   
   A single method sounds reasonable. The `Statement` struct can keep the fine-grained APIs. I'd possibly argue that if you want more metadata, you should resort to the lower level APIs. Maybe we can just add the row count parameter as well, for convenience?
   
   ```c
   Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, size_t*, struct AdbcError*)
   ```
   
   If the query returns a result set, the ArrowArrayStream contains the result set and the size_t is the number of rows (if known, else 0). If not, the ArrowArrayStream is not set (or: contains last inserted IDs, if supported and configured) and the size_t contains rows affected.
   
   > 
   > What should happen to queries that indicate that no result set is expected but where a result set is available? Is a warning useful or annoying?
   
   I think we can just ignore the result set in that case. This also makes it easy to support "Retrieve the last inserted id for inserts into an auto-increment table" that @zeroshade mentioned in #55: the caller can set an option to return the inserted IDs, and they'll come back via the result set.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1227138564

   `out` depends on `output_type`, yes, so it's either `ArrowArrayStream*`, `NULL`, or some structure that I haven't fleshed out yet in the case of partitioned results. I'm mostly indifferent to whether it's a single function or multiple functions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1213320396

   Another reason to differentiate between queries with/without result sets: in a Postgres driver, that means we know when we can attempt to use `COPY (...) TO STDOUT (FORMAT binary)` (akin to DuckDB's integration with Postgres) to get bulk binary data instead of parsing data one row at a time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] krlmlr commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

krlmlr commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1214423568

   Last inserted IDs (or the results of computed columns in general, for that matter) can be obtained with the `RETURNING` syntax for most databases, SQL Server has `OUTPUT` . This seems far more robust to me than a "last inserted ID".
   
   R DBI has `dbGetQuery()` and `dbExecute()` as methods that are sufficient for most users, and `dbSendQuery()` + `dbBind()` + `dbFetch()` + `dbClearResult()` or `dbSendStatement()` + `dbBind()` + `dbClearResult()` . If ADBC is to remain header-only, what would a convenience method look like?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-adbc] zeroshade commented on issue #61: [Format] Simplify Execute and Query interface

Posted by GitBox <gi...@apache.org>.

zeroshade commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1215736580

   That seems reasonable to me @lidavidm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org