You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/16 20:08:27 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

alamb opened a new issue, #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I would like to ensure that DataFusion gets the correct answers for SQL queries (especially in tricky corner cases like the one described in https://github.com/apache/arrow-datafusion/issues/4211)
   
   From experience, both in DataFusion and in prior jobs, the effort required to maintain tests (both to add new tests as well as update existing tests) is substantial. Making it easier to add new tests and maintain existing ones will help us keep up velocity. 
   
   Right now, we have two sql integration style tests:
   
   1. the `integration` test from @Jimexist 🦾  https://github.com/apache/arrow-datafusion/tree/master/integration-tests: Runs a limited number of queries against data in both postgres and datafusion and compares the results
   * The `sql_integration` test https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests/sql: test setup, execution, and verification is written in rust. 
   
   The challenge with sql_integration test is that to add new tests or update existing ones, we need to change rust code and recompile, which takes a *loong* time
   
   Likewise, the integration test requires that the results are exactly the same as postgres which is not possible in all cases (like when testing for unsigned types, which postgres doesn't support, or testing some DataFusion specific thing)
   
   
   
   **Describe the solution you'd like**
   I would like some sort of data driven test to replace sql_integration
   
   You can see this style of test in duckdb:  https://github.com/duckdb/duckdb/tree/master/test/sql/join
   
   My ideal solution would be to implement a runner (ideally the same as [SQLLogicTests](https://duckdb.org/dev/testing#sqllogictests) from DuckDB)
   2. Using the same data file format as duckdb (will mean we could reuse their tests without much modification)
   3. Start porting as many of the tests in sql_integration over to this new format as possible)
   
   I implemented a impler version of this approach in https://github.com/influxdata/influxdb_iox/blob/main/query_tests/README.md which runs sql queries from a file and compares the result to known output.  I think the duckdb way is superior
   
   **Describe alternatives you've considered**
   Leave things the same
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mvanschellebeeck commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
mvanschellebeeck commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1317676032

   Hey @alamb, I'm trying to get more involved in the project and this looks like something more substantial that I can take on! 
   
   Let me know if that works with you and I can start poking around in the duckdb and influxdb approaches to data-driven tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] TennyZhuang commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
TennyZhuang commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1321193056

   Hi, I'm one of the maintainers in [sqllogictest-rs](https://github.com/risinglightdb/sqllogictest-rs). I'm very happy that arrow-datafusion and databend are interested in the project.
   
   Currently, the project is only used by [risinglight](https://github.com/risinglightdb/risinglight) and [risingwave](https://github.com/risingwavelabs/risingwave), while the maintainers for these three projects overlap highly. However, we really hope to increase the diversity of the sqllogictest project.
   
   ## Components
   
   There are several components in sqllogictest-rs:
   
   1. The core slt file parser.
   2. The driver trait `AsyncDB`.
   3. A cli tool to visit files by glob paths, parse, run and compare results in the different logical namespaces.
   4. (WIP/Experimental) A subprocess-based plugin framework to test the compatibility for non-rust drivers, e.g. [JDBC](https://github.com/risinglightdb/sqllogictest-driver-jdbc).
   
   
   ### Parser
   
   Currently, the parser part is generally stabilized and we add very few extensions to the original syntax, the most important thing is the `include` macro, which is useful for reusing some test codes.
   
   ### AsyncDB trait
   
   Currently, the trait is easy and general enough, it only accepts a SQL string and returns a concatenated result. This means that it's compatible with different SQL syntaxes (Postgres, MySQL, or even Spanner), but know very few things about the data type information. We even can't check whether the result types are correct for queries.
   
   We'll improve that, but I'm not sure if should we generalize the data types between different databases, it looks like hard work.
   
   ### Runner
   
   A simple and general part.
   
   ## Conclusion
   
   We have several ways of working together.
   
   1. Just fork the project, customize them by yourselves and contributions are always welcome!
   2. Use the sqllogictest-rs parser and customize other things by yourselves.
   3. Use the same project, and work together to make the `AsyncDB` and the `Postgres` implementations better.
   4. Make the project even more general, and open to every database interface. (Likely not a short-term goal).
   
   We would like to donate the project to some org if it would help our collaboration.
   
   Thanks for your interest in sqllogictest-rs!
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mvanschellebeeck commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
mvanschellebeeck commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1320762453

   Thanks @xudong963. sqllogictest-rs looks an awesome starting point!
   
   Looks like DuckDB runs an extended version of SQLLgic tests:
   > For testing plain SQL we use an extended version of the SQL logic test suite, adopted from [SQLite](https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki)
   
   ### But with:
   
   1. A port of the tests from [sql_integration](https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests/sql)
   2. And the utilisation of the **huge** set existing SQLLite sqllogictests (https://www.sqlite.org/sqllogictest/dir?ci=tip&name=test) 
   
   We could develop a pretty solid testing foundation with [sqllogictest-rs](https://github.com/risinglightdb/sqllogictest-rs) as the parser and runner.
   
   ### On Current Test Migration
   We could migrate a large chunk of tests with a clever regex that captures on the query and result variables and manually handle some hard-to-automate edge cases like [this one](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/create_drop.rs#L116) (which creates a few tables) and [this one](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/create_drop.rs#L482) (that expects a failure).
   
   ### On Forking vs Directly using sqllogictest-rs
   On using sqllogictest-rs vs forking it -  using it directly seems like a good first step to get the test suite up and running. Though I can imagine some datafusion-specific queries (that @alamb noted) will eventually force us to move onto a fork in the future. As an example - Materialize created their own sqllogictest driver, [mz_sqllogictest](https://dev.materialize.com/api/rust/mz_sqllogictest/index.html) to deal with this.
   
   ### As a v0, how does this approach sound?
   
   1. Add a `tests/sqllogictests` folder at the root of arrow-datafusion
   2. Use `sqllogitest-rs` and `datafusion-cli`
   3. Implement the sqllogictest-rs runner using `datafusion-cli`
   4. Migrate some tests from [sql_integration](https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests/sql)
   5. Setup a simple CI workflow to run the tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1318572830

   @xudong963  -- using https://github.com/risinglightdb/sqllogictest-rs (either directly or forking it) sounds like a great idea!
   
   Thank you for the write up 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1321855859

   > As a v0, how does this approach sound?
   
   @mvanschellebeeck  -- I think it sounds like a great idea and I can't wait to see it. Thank you so much


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1317606537

   I am marking this as "good first issue" as I think it is fairly self contained (the runner should only have to interact with DataFusion as a user) so internal knowledge isn't needed, but you would get a lot of experience with writing a rust tool, CI, etc)
   
   And you would get huge props from the community I think
   
   I am willing to help shepherd this project / review PRs / help with contributions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mvanschellebeeck commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
mvanschellebeeck commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1323775265

   > Hi @mvanschellebeeck, are you doing it?
   
   Yep! I can get out a draft PR today/tomorrow if that works with you?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1321902594

   > 4\. Migrate some tests from [sql_integration](https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests/sql)
   
   Others are good to me, but for this, I think maybe we don't migrate them first, I can directly add new tests from other DBs. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1321938255

   > Others are good to me, but for this, I think maybe we don't migrate them first, I can directly add new tests from other DBs.
   
   That would also be fine. One of the benfits of migrating a few of the tests from sql_integration would be to demonstrate that the harness could (eventually) subsume them all. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1333986150

   I think now that we have merged https://github.com/apache/arrow-datafusion/pull/4395 we should close this PR and I will write a follow on ticket / epic to track follow on improvements. Thanks again @xudong963 @mvanschellebeeck and everyone else who chimed in


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1333990025

   Filed https://github.com/apache/arrow-datafusion/issues/4460 for follow on work


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1323720926

   Hi @mvanschellebeeck, are you doing it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1331057196

   > datafusion-cli wasn't that easy to use as a library without [Command](https://doc.rust-lang.org/std/process/struct.Command.html) wizardry so I resorted to datafusion-core functions with format_batches and run_query at the bottom of this [file](https://github.com/apache/arrow-datafusion/blob/96bd42e0d5879470dc71281c297eb103115ca300/tests/sqllogictests/src/main.rs)
   
   I think hooking it up as you have it is the right way to go. We can look into testing datafusion-cli specific stuff as a follow on project
   
   > sqllogictests can not test the output column names (that are currently testing in integration tests)
   
   I think perhaps we could leave a few high level rust `#test` style tests for validating the column names and move the rest over
   
   > still figuring out a few TODOS in the aggregate.slt file
   
   👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1323823594

   @mvanschellebeeck Nice,no hurry, take your step.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1321862173

   @TennyZhuang  -- thank you very much for the writeup and explanation -- it sounds great. 
   
   > Use the same project, and work together to make the AsyncDB and the Postgres implementations better.
   
   I think this sounds like a great way to start. We can try to implement AsyncDB for DataFusion and see how it goes and propose extensions to `AsyncDB` if needed 🤔 
   
   > Make the project even more general, and open to every database interface. (Likely not a short-term goal).
   
   I love this idea, though agree it will be much more work; 
   
   > We would like to donate the project to some org if it would help our collaboration.
   
   From my perspective, given that sqllogictest-rs is apache licensed using it directly in DataFusion for testing would be totally fine as the only project affected would be DataFusion itself. I think additional considerations apply to new dependencies of DataFusion itself which not only affect DataFusion but all downstream projects that use DataFusion
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb closed issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)
URL: https://github.com/apache/arrow-datafusion/issues/4248


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mvanschellebeeck commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
mvanschellebeeck commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1328353056

   hey @xudong963, @alamb,
   
   I've got a draft PR up here: https://github.com/apache/arrow-datafusion/pull/4395
   
   In general, sqllogictests operate on a workflow where the `.slt` file creates tables, inserts values and then queries the values for expected results. From what I can tell, CREATE TABLE and INSERT INTO are not currently supported in datafusion so I raised #4396 and #4397 respectively. After they're implemented we'll be able to incorporate a lot of the existing sqllogictests.
   
   I've started migrating some of the tests from [aggregate.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/aggregates.rs) into an sqllogictest file, [aggregate.slt](https://github.com/apache/arrow-datafusion/blob/96bd42e0d5879470dc71281c297eb103115ca300/tests/sqllogictests/test_files/aggregate.slt). The queries here require a specific datafusion setup which I moved into [setup.rs](https://github.com/apache/arrow-datafusion/blob/96bd42e0d5879470dc71281c297eb103115ca300/tests/sqllogictests/src/setup.rs#L35).
   
   Some thoughts as I worked through this:
   - datafusion-cli wasn't that easy to use as a library without [Command](https://doc.rust-lang.org/std/process/struct.Command.html) wizardry so I resorted to datafusion-core functions with `format_batches` and `run_query` at the bottom of this [file](https://github.com/apache/arrow-datafusion/blob/96bd42e0d5879470dc71281c297eb103115ca300/tests/sqllogictests/src/main.rs)
   - sqllogictests can not test the output column names (that are currently testing in integration tests)
   - still figuring out a few TODOS in the aggregate.slt file
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1317682866

   That would be awesome @mvanschellebeeck  -- I think the idea of poking around ("background research" 😆 ) is a great way to start
   
   Also note there may be others who are interested but are not online due to their timezones so I think it would be good to ensure they can collaborate as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #4248: Make a data driven SQL testing tool (so we can reuse duckdb test suite, example)

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #4248:
URL: https://github.com/apache/arrow-datafusion/issues/4248#issuecomment-1318028518

   Nice to see the issue, thanks @alamb !
   
   Test completeness is important for a database in the early stages of development, not only to ensure that new features can be tested completely but also to ensure that existing features are not broken in the rapid iteration process.
   
   During our involvement in databend development, logicaltest brought a lot of benefits and made us feel confident about our query results. (FYI: https://databend.rs/doc/contributing/rfcs/new_sql_logic_test_framework.)
   
   Currently, databend has leveraged crdb, ydb, and duckdb test cases.  Thanks for the open-source community!
   
   I also opened a related issue last year (https://github.com/apache/arrow-datafusion/issues/1453).
   
   For arrow-datafusion, I think we can use rust to build our logictest framework, [sqllogictest-rs](https://github.com/risinglightdb/sqllogictest-rs) may be a good candidate. 
   After the framework is finished, we can leverage community's power to rich our logictest cases from other databases :)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org