Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/01 16:03:46 UTC

[GitHub] [arrow-datafusion] matthewmturner opened a new pull request #2133: Update quarterly roadmap for Q2

matthewmturner opened a new pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1971 
   
   # Rationale for this change
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] tustvold commented on a change in pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#discussion_r841077605



##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements

Review comment:
       Not entirely sure what this specifically is referring to, but I definitely intend to focus on improving the IO and scheduling stories in arrow-rs and DataFusion. See https://github.com/apache/arrow-rs/issues/1473 and https://github.com/apache/arrow-datafusion/issues/2079. Not sure if we want to explicitly call out the scheduling side of this.
   
   I may also get to proper filter pushdown to parquet if I have time - https://github.com/apache/arrow-rs/issues/1191
   
   Edit: I've proposed a change with a very high-level statement of what I hope to achieve w.r.t. scheduling.

##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements
+  - Reading, registering, and writing more file formats from both DataFrame API and SQL
+  - Additional options for IO including partitioning and metadata support
+- Memory Management

Review comment:
       ```suggestion
   - Work Scheduling
     - Improve predictability, observability and performance of IO and CPU-bound work
     - Develop a more explicit story for managing parallelism during plan execution
   - Memory Management
   ```
   
   I've yet to create a ticket for this, as I'm still exploring the problem domain, but the precursor discussions can be found at https://github.com/apache/arrow-rs/issues/1473 and https://github.com/apache/arrow-datafusion/issues/2079.







[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#discussion_r841040281



##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements
+  - Reading, registering, and writing more file formats from both DataFrame API and SQL
+  - Additional options for IO including partitioning and metadata support
+- Memory Management
+  - Add more operators for memory limited execution
+- Performance
+  - Incorporate row-format into operators such as aggregate
+  - Add row-format benchmarks

Review comment:
       ```suggestion
     - Add row-format benchmarks
     - Explore JIT-compiling complex expressions
   ```







[GitHub] [arrow-datafusion] alamb commented on a change in pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#discussion_r841062815



##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements
+  - Reading, registering, and writing more file formats from both DataFrame API and SQL
+  - Additional options for IO including partitioning and metadata support
+- Memory Management
+  - Add more operators for memory limited execution
+- Performance
+  - Incorporate row-format into operators such as aggregate
+  - Add row-format benchmarks
+  - Explore LLVM for JIT, with inline Rust functions as the primary goal
+- Documentation

Review comment:
       ```suggestion
     - Improve performance of Sort and Merge using Row Format / JIT expressions
   - Documentation
   ```

##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements

Review comment:
       FYI @tustvold 

##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements
+  - Reading, registering, and writing more file formats from both DataFrame API and SQL
+  - Additional options for IO including partitioning and metadata support
+- Memory Management
+  - Add more operators for memory limited execution
+- Performance
+  - Incorporate row-format into operators such as aggregate
+  - Add row-format benchmarks
+  - Explore LLVM for JIT, with inline Rust functions as the primary goal
+- Documentation

Review comment:
       I hope to contribute improvements to Sort performance (especially for multi-column sorts that include strings) this quarter as well. I don't have a writeup of that yet.







[GitHub] [arrow-datafusion] matthewmturner commented on pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#issuecomment-1086651738


   Thank you @alamb and @tustvold for the suggestions. I will get them added shortly.





[GitHub] [arrow-datafusion] matthewmturner edited a comment on pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
matthewmturner edited a comment on pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#issuecomment-1086651738


   Thank you @alamb, @Dandandan, and @tustvold for the suggestions. I will get them added shortly.





[GitHub] [arrow-datafusion] hntd187 commented on pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
hntd187 commented on pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#issuecomment-1086148208


   Ideally, I'd like to finalize the implementation for the streaming API and make an experimental impl available via `datafusion-streams`, which basically requires me to finalize the API contract.












[GitHub] [arrow-datafusion] jychen7 commented on pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
jychen7 commented on pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#issuecomment-1086417534


   LGTM





[GitHub] [arrow-datafusion] tustvold commented on a change in pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#discussion_r841081043



##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements
+  - Reading, registering, and writing more file formats from both DataFrame API and SQL
+  - Additional options for IO including partitioning and metadata support
+- Memory Management

Review comment:
       ```suggestion
   - Work Scheduling
     - Improve predictability, observability and performance of both IO and CPU-bound work
     - Develop a more explicit story for managing parallelism during plan execution
   - Memory Management
   ```
   
   I've yet to create a ticket for this, as I'm still exploring the problem domain, but the precursor discussions can be found at https://github.com/apache/arrow-rs/issues/1473 and https://github.com/apache/arrow-datafusion/issues/2079.







[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #2133: Update quarterly roadmap for Q2

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on a change in pull request #2133:
URL: https://github.com/apache/arrow-datafusion/pull/2133#discussion_r841082342



##########
File path: docs/source/specification/quarterly_roadmap.md
##########
@@ -21,52 +21,65 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+- IO Improvements

Review comment:
       Thanks @tustvold. I plan on finishing the work summarized in https://github.com/apache/arrow-datafusion/issues/1777, which is what that item refers to.



