
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/30 21:50:23 UTC

[GitHub] [arrow-datafusion] alippai opened a new issue #451: Add Linked data benchmarks

alippai opened a new issue #451:
URL: https://github.com/apache/arrow-datafusion/issues/451


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Recently I came across the LDBC benchmarks, which focus on graph-like workloads. I'm wondering whether DataFusion already covers the features the queries need. While I don't think it's as important as TPC-H, it would broaden coverage and help identify performance regressions during DataFusion development. This would be an extra tool for getting a broader picture in a structured way (at least more structured than ad-hoc queries).
   
   **Describe the solution you'd like**
   Supporting the queries written for PostgreSQL: https://github.com/ldbc/ldbc_snb_bi/tree/main/postgres/queries .
   
   **Describe alternatives you've considered**
   Not implementing it. Optimizing DataFusion to perform well on this particular benchmark is out of scope as well. My assumption is that OLAP should be first-class and this should be a second-class target.
   
   **Additional context**
   While it's not an OLAP workload, I believe DataFusion would perform relatively well, or even extremely well.
   
   Cc @Dandandan: IIRC you contributed the most in this area (CTE + UNION ALL).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #451: Add Linked data benchmarks

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #451:
URL: https://github.com/apache/arrow-datafusion/issues/451#issuecomment-851666274


   Thanks a lot again 👍 
   
   I think the challenging part with recursive CTE in DataFusion will be doing it efficiently with arrow data, as .
   So also into what vectorized engines (can) do here.
   It might not always be possible to do so; in that case we should have something that efficiently does row-by-row processing.
   
   Anti joins are another feature, but I think that should be relatively easy to add!
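
For reference, a minimal sketch of what an anti join computes, assuming a hash-based build/probe approach: keep the left rows whose key has no match on the right. The function name, the dict-based rows and the sample tables are hypothetical, and this is not DataFusion's implementation.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def hash_anti_join(left: Iterable[Row], right: Iterable[Row], key: str) -> List[Row]:
    """Return the rows of `left` that have no row in `right` with an equal `key`."""
    # Build phase: collect the join keys present on the right side.
    # NULL (None) keys never compare equal in SQL, so they are skipped.
    right_keys = {row[key] for row in right if row[key] is not None}
    # Probe phase: keep only the left rows whose key is absent from the set.
    # This mirrors NOT EXISTS semantics: a left row with a NULL key is kept.
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data: which persons never appear in `knows`?
persons = [{"id": 1}, {"id": 2}, {"id": 3}]
knows = [{"id": 1}, {"id": 3}, {"id": 3}]
unmatched = hash_anti_join(persons, knows, "id")
```

Only the keys of the right side are needed, so the build side can be a plain set rather than a full hash table of rows.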



[GitHub] [arrow-datafusion] alippai edited a comment on issue #451: Add Linked data benchmarks

Posted by GitBox <gi...@apache.org>.

alippai edited a comment on issue #451:
URL: https://github.com/apache/arrow-datafusion/issues/451#issuecomment-851672062


   In this case LSQB sounds like a better first target. 👍 
   
   > So also into what vectorized engines (can) do here.
   
   I have had bad experiences with dedicated "graph engines": usually a PostgreSQL- or SQL Server-based solution beats any dedicated solution out there, so I wouldn't worry that DataFusion's architecture is not fully exploited. Similarly, Differential Dataflow/Materialize or a naive Rust/C++ implementation traversing the data is ridiculously faster, so there is a chance that Arrow's memory model and parallel joins help. Still, I acknowledge that adding benchmarks measuring recursive CTEs might side-track the main DataFusion development. My gut feeling is that DataFusion would perform these queries relatively well, as they would work as "repeated high selectivity, high cardinality joins", and as far as I remember we are not particularly bad at that.
   




[GitHub] [arrow-datafusion] alippai commented on issue #451: Add Linked data benchmarks

Posted by GitBox <gi...@apache.org>.

alippai commented on issue #451:
URL: https://github.com/apache/arrow-datafusion/issues/451#issuecomment-851646571


   @Dandandan I'm not sure about the recursive CTE implementation part; however, PostgreSQL has a brief description of the algorithm: https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-SELECT . You are right that some queries need pretty complex features like subqueries, window functions, date handling, generate_subscripts. A more lightweight version of the benchmark is available as well: https://github.com/ldbc/lsqb/ . It focuses on various joins (join, antijoin, outer join) instead of the "recursive" workload.
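
The evaluation strategy described in the PostgreSQL documentation can be sketched roughly as follows: evaluate the non-recursive term once, then keep re-evaluating the recursive term against a working table until it produces no new rows. This is a simplified illustration of the `UNION` (duplicate-discarding) variant; the function names, the tuple-based rows and the edge list are hypothetical, and it says nothing about how DataFusion would implement it.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple

Row = Tuple[Hashable, ...]


def evaluate_recursive_cte(
    non_recursive_term: Iterable[Row],
    recursive_term: Callable[[List[Row]], Iterable[Row]],
) -> List[Row]:
    """Evaluate WITH RECURSIVE ... UNION, discarding duplicate rows."""
    seen: Set[Row] = set()
    result: List[Row] = []
    working: List[Row] = []
    # Step 1: evaluate the non-recursive term and seed the working table.
    for row in non_recursive_term:
        if row not in seen:
            seen.add(row)
            result.append(row)
            working.append(row)
    # Step 2: evaluate the recursive term against the working table until
    # an iteration yields no rows that were not already in the result.
    while working:
        intermediate: List[Row] = []
        for row in recursive_term(working):
            if row not in seen:
                seen.add(row)
                result.append(row)
                intermediate.append(row)
        # The intermediate table becomes the next working table.
        working = intermediate
    return result


# Hypothetical directed edge list (src, dst); note the cycle 1 -> 2 -> 3 -> 1.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)]


def one_hop(working: List[Row]) -> List[Row]:
    """Recursive term: join the working table with `edges` on node = src."""
    frontier = {node for (node,) in working}
    return [(dst,) for (src, dst) in edges if src in frontier]


reachable = evaluate_recursive_cte([(1,)], one_hop)
```

Each iteration processes the whole working table at once, so the recursive term is just a join of the current frontier against the edge table, and the duplicate check is what guarantees termination on cyclic data.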



[GitHub] [arrow-datafusion] alippai commented on issue #451: Add Linked data benchmarks

Posted by GitBox <gi...@apache.org>.

alippai commented on issue #451:
URL: https://github.com/apache/arrow-datafusion/issues/451#issuecomment-851647603


   For LSQB, here is the paper https://szarnyasg.github.io/tsmb-grades21/ms.pdf and a presentation https://docs.google.com/presentation/d/1pxyX_CWhFVYEttjTG2BrzuaMkEuLRxfhf5iX6n0leZI/mobilepresent?slide=id.gc6f9544c1_0_0



[GitHub] [arrow-datafusion] Dandandan commented on issue #451: Add Linked data benchmarks

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #451:
URL: https://github.com/apache/arrow-datafusion/issues/451#issuecomment-851679821


   Yeah, I believe joins are reasonably fast currently. I do need to do some comparisons (e.g. add the join queries to https://github.com/h2oai/db-benchmark/pull/182).
   
   There are still some smaller tweaks that can be made, and more can be done at the planning level, such as:
   
   * Implement a better hash join reordering algorithm
   * Improve planning based on size of tables / expected nr. of rows
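
To make the two planning ideas above concrete, here is a deliberately naive sketch, assuming the planner has an estimated row count per table: hash the smaller input of a join, and order tables by ascending estimated size. `TableStats`, both functions and the row counts are hypothetical, and the heuristic ignores join selectivity, so it is not a proposal for the actual algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableStats:
    name: str
    estimated_rows: int


def choose_build_side(left: TableStats, right: TableStats) -> Tuple[TableStats, TableStats]:
    """Return (build, probe): hash the smaller input and stream the larger one."""
    if left.estimated_rows <= right.estimated_rows:
        return left, right
    return right, left


def greedy_join_order(tables: List[TableStats]) -> List[str]:
    """Order tables by ascending estimated row count (ignores selectivity)."""
    return [t.name for t in sorted(tables, key=lambda t: t.estimated_rows)]


# Hypothetical row counts, for illustration only.
person = TableStats("person", 10_000)
knows = TableStats("knows", 200_000)
city = TableStats("city", 1_000)

build, probe = choose_build_side(knows, person)
order = greedy_join_order([person, knows, city])
```

A real reordering algorithm would also need cardinality estimates for intermediate join results, which this sketch leaves out.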



[GitHub] [arrow-datafusion] Dandandan commented on issue #451: Add Linked data benchmarks

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #451:
URL: https://github.com/apache/arrow-datafusion/issues/451#issuecomment-851634623


   I hadn't heard of this benchmark before, thanks for referencing it! Sounds really cool/useful.
   
   I believe for graph processing you'll (mostly) need support for recursive CTEs, which I guess is quite a bit more work than plain CTEs (which currently just reference the query / logical plan) + UNION ALL (which just returns all the partitions of the plans).
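
As a toy model of that UNION ALL description, assuming each plan produces a list of partitions: the union is simply the concatenation of those partition lists, with no merging or deduplication of rows. The `Partition` alias and `union_all` function are hypothetical and are not DataFusion's operator.

```python
from typing import Any, List

Partition = List[Any]


def union_all(*plans: List[Partition]) -> List[Partition]:
    """Concatenate the partition lists of the input plans, leaving rows untouched."""
    combined: List[Partition] = []
    for partitions in plans:
        combined.extend(partitions)
    return combined


# Two hypothetical plans with two and one partitions respectively.
plan_a = [[1, 2], [3]]
plan_b = [[3, 4]]
combined = union_all(plan_a, plan_b)
```

The value 3 appears in two partitions of the output, which is the expected UNION ALL behaviour.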
   
   Do you happen to have some reference material on recursive CTEs?
   
   I think it would be very valuable to plan / add support for graph processing 👍 

