Posted to github@arrow.apache.org by "joellubi (via GitHub)" <gi...@apache.org> on 2023/12/05 11:38:22 UTC

Re: [I] go/adbc/driver/snowflake: improve bulk ingestion speed [arrow-adbc]

joellubi commented on issue #1327:
URL: https://github.com/apache/arrow-adbc/issues/1327#issuecomment-1840602154

   Following up on #1322.
   
   The Snowflake connector that our ADBC driver uses [claims to make optimizations](https://pkg.go.dev/github.com/snowflakedb/gosnowflake#hdr-Batch_Inserts_and_Binding_Parameters) when many values are bound to an `INSERT` statement. There are limits on when this optimization applies, but in this case the code already appears to go through the connector's optimized path. Since that path still doesn't deliver the throughput we would expect, it seems reasonable to handle bulk ingestion on the ADBC side and address some of the connector's existing limitations along the way.
   
   The primary limitations we'd want our solution to overcome:
   1. Currently each batch becomes its own temp stage. We would want to upload multiple (or all) batches to a single stage and load from there.
   2. The connector relies on conversion to Go types, which must then be written to a CSV file for the stage. We could likely do much better on Arrow type mapping by writing Parquet directly from Arrow as the stage format.
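
To make the two points above concrete, here is a rough sketch of the statement sequence such an approach might issue: one temporary stage, one `PUT` per Parquet file, and a single `COPY INTO` at the end. It only builds SQL strings and does not talk to Snowflake. The function name `planBulkIngest`, the stage name, and the file paths are hypothetical, and the sketch assumes plain unquoted identifiers and absolute local paths; it is not how the driver or the connector is actually implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// planBulkIngest is a hypothetical helper that returns the SQL statements for
// loading several local Parquet files through ONE temporary stage.
// Assumptions: table and stage are simple unquoted identifiers, and every
// entry in parquetFiles is an absolute path to a file already written from
// Arrow record batches.
func planBulkIngest(table, stage string, parquetFiles []string) []string {
	stmts := make([]string, 0, len(parquetFiles)+2)

	// A single stage shared by all batches, instead of one stage per batch.
	stmts = append(stmts, fmt.Sprintf(
		"CREATE TEMPORARY STAGE %s FILE_FORMAT = (TYPE = PARQUET)", stage))

	// Upload each file as-is; Parquet already handles its own compression.
	for _, f := range parquetFiles {
		stmts = append(stmts, fmt.Sprintf(
			"PUT 'file://%s' @%s AUTO_COMPRESS = FALSE", f, stage))
	}

	// One load for everything in the stage, matching Parquet columns by name.
	stmts = append(stmts, fmt.Sprintf(
		"COPY INTO %s FROM @%s MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE",
		table, stage))

	return stmts
}

func main() {
	stmts := planBulkIngest("my_table", "adbc_ingest_stage",
		[]string{"/tmp/batch_0.parquet", "/tmp/batch_1.parquet"})
	fmt.Println(strings.Join(stmts, ";\n") + ";")
}
```

The number of round trips then grows with the number of files rather than the number of rows, and the stage setup and the final load each happen once.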
   
   Open question: does `adbc_ingest` need to optimize ingestion of small tables as well? Currently the connector uses a single `INSERT` query, without staging any files, for very small tables. Using `COPY` in all cases _might_ not perform well in these scenarios. Perhaps we can start with `COPY` in all cases and add better handling for small tables later if these cases actually turn out to be a problem.
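
If small tables do turn out to need special handling later, one possible shape is a row-count cutoff that defaults to "always `COPY`". This is purely illustrative: `chooseStrategy` and the `smallTableRows` cutoff are hypothetical names, and any real threshold would have to come from benchmarks.

```go
package main

import "fmt"

// ingestStrategy names the load path used for one ingest call.
type ingestStrategy string

const (
	strategyInsert ingestStrategy = "INSERT" // single bound INSERT, no staging
	strategyCopy   ingestStrategy = "COPY"   // stage files, then COPY INTO
)

// chooseStrategy is a hypothetical policy function. A smallTableRows of zero
// (or less) disables the small-table path, so COPY is used in all cases,
// which matches the proposed starting point.
func chooseStrategy(rowCount, smallTableRows int64) ingestStrategy {
	if smallTableRows > 0 && rowCount < smallTableRows {
		return strategyInsert
	}
	return strategyCopy
}

func main() {
	fmt.Println(chooseStrategy(10, 0))       // cutoff disabled
	fmt.Println(chooseStrategy(10, 1000))    // below the cutoff
	fmt.Println(chooseStrategy(50000, 1000)) // above the cutoff
}
```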

