You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/28 05:47:19 UTC

[GitHub] [arrow-datafusion] matthewmturner opened a new pull request #1893: Add write_ipc to ExecutionContext

matthewmturner opened a new pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1777 #3
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#issuecomment-1056242541


   @alamb sry i had basically copied this from `write_parquet` which was source of both your comments.
   
   I dont think the clone on the plan was needed, i removed it in `write_ipc`, `write_parquet`, and `write_csv`.  If you would rather i do those other functions in a separate PR i can do that.
   
   I think `IpcWriteOptions` should be clone - so im going to submit PR for that in arrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on a change in pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#discussion_r817389707



##########
File path: datafusion/src/execution/context.rs
##########
@@ -795,6 +794,56 @@ impl ExecutionContext {
         }
     }
 
+    /// Executes a query and writes the results to an Arrow IPC file.
+    pub async fn write_ipc(
+        &self,
+        plan: Arc<dyn ExecutionPlan>,
+        path: impl AsRef<str>,
+        writer_properties: Option<IpcWriteOptions>,
+    ) -> Result<()> {
+        let path = path.as_ref();
+        // create directory to contain the Parquet files (one per partition)
+        let fs_path = Path::new(path);
+        let runtime = self.runtime_env();
+        match fs::create_dir(fs_path) {
+            Ok(()) => {
+                let mut tasks = vec![];
+                for i in 0..plan.output_partitioning().partition_count() {
+                    let filename = format!("part-{}.arrow", i);
+                    let path = fs_path.join(&filename);
+                    let file = fs::File::create(path)?;
+                    let mut writer = match writer_properties {
+                        Some(props) => FileWriter::try_new_with_options(

Review comment:
       just to confirm - you would then add the `IpcWriteOptions` as a new parameter to `try_new`, right?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#issuecomment-1053915105


   I'm wondering if `IpcWriteOptions` should implement `clone` and/or `copy` like parquet's `WriterProperties`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#issuecomment-1057106499


   
   > [](https://github.com/matthewmturner)I think IpcWriteOptions should be clone - so im going to submit PR for that in arrow.
   
   👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#issuecomment-1074309233


   https://github.com/apache/arrow-datafusion/pull/2048 has the arrow upgrade


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#discussion_r816846453



##########
File path: datafusion/src/execution/context.rs
##########
@@ -795,6 +796,57 @@ impl ExecutionContext {
         }
     }
 
+    /// Executes a query and writes the results to an Arrow IPC file.
+    pub async fn write_ipc(
+        &self,
+        plan: Arc<dyn ExecutionPlan>,
+        path: impl AsRef<str>,
+        writer_properties: Option<IpcWriteOptions>,
+    ) -> Result<()> {
+        let path = path.as_ref();
+        // create directory to contain the Parquet files (one per partition)
+        let fs_path = Path::new(path);
+        let runtime = self.runtime_env();
+        match fs::create_dir(fs_path) {
+            Ok(()) => {
+                let mut tasks = vec![];
+                for i in 0..plan.output_partitioning().partition_count() {
+                    let plan = plan.clone();
+                    let filename = format!("part-{}.parquet", i);

Review comment:
       probably needs to be named something other than `.parquet`

##########
File path: datafusion/src/execution/context.rs
##########
@@ -795,6 +796,57 @@ impl ExecutionContext {
         }
     }
 
+    /// Executes a query and writes the results to an Arrow IPC file.
+    pub async fn write_ipc(
+        &self,
+        plan: Arc<dyn ExecutionPlan>,
+        path: impl AsRef<str>,
+        writer_properties: Option<IpcWriteOptions>,
+    ) -> Result<()> {
+        let path = path.as_ref();
+        // create directory to contain the Parquet files (one per partition)
+        let fs_path = Path::new(path);
+        let runtime = self.runtime_env();
+        match fs::create_dir(fs_path) {
+            Ok(()) => {
+                let mut tasks = vec![];
+                for i in 0..plan.output_partitioning().partition_count() {
+                    let plan = plan.clone();

Review comment:
       why does the plan need to be cloned?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#discussion_r817367895



##########
File path: datafusion/src/execution/context.rs
##########
@@ -795,6 +794,56 @@ impl ExecutionContext {
         }
     }
 
+    /// Executes a query and writes the results to an Arrow IPC file.
+    pub async fn write_ipc(
+        &self,
+        plan: Arc<dyn ExecutionPlan>,
+        path: impl AsRef<str>,
+        writer_properties: Option<IpcWriteOptions>,
+    ) -> Result<()> {
+        let path = path.as_ref();
+        // create directory to contain the Parquet files (one per partition)
+        let fs_path = Path::new(path);
+        let runtime = self.runtime_env();
+        match fs::create_dir(fs_path) {
+            Ok(()) => {
+                let mut tasks = vec![];
+                for i in 0..plan.output_partitioning().partition_count() {
+                    let filename = format!("part-{}.arrow", i);
+                    let path = fs_path.join(&filename);
+                    let file = fs::File::create(path)?;
+                    let mut writer = match writer_properties {
+                        Some(props) => FileWriter::try_new_with_options(

Review comment:
       I noticed you open a ticket https://github.com/apache/arrow-rs/pull/1382 to solve compiling failure. 👍
   
   I have another thought is that if we can let `FileWriter` only has a `try_new` method and deletes the `try_new_with_options` method, just like `ArrowWriter`? This can reduce API exposure to Arrow users.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1893: Add write_ipc to ExecutionContext

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1893:
URL: https://github.com/apache/arrow-datafusion/pull/1893#discussion_r817392987



##########
File path: datafusion/src/execution/context.rs
##########
@@ -795,6 +794,56 @@ impl ExecutionContext {
         }
     }
 
+    /// Executes a query and writes the results to an Arrow IPC file.
+    pub async fn write_ipc(
+        &self,
+        plan: Arc<dyn ExecutionPlan>,
+        path: impl AsRef<str>,
+        writer_properties: Option<IpcWriteOptions>,
+    ) -> Result<()> {
+        let path = path.as_ref();
+        // create directory to contain the Parquet files (one per partition)
+        let fs_path = Path::new(path);
+        let runtime = self.runtime_env();
+        match fs::create_dir(fs_path) {
+            Ok(()) => {
+                let mut tasks = vec![];
+                for i in 0..plan.output_partitioning().partition_count() {
+                    let filename = format!("part-{}.arrow", i);
+                    let path = fs_path.join(&filename);
+                    let file = fs::File::create(path)?;
+                    let mut writer = match writer_properties {
+                        Some(props) => FileWriter::try_new_with_options(

Review comment:
       Yes, `Option<IpcWriteOptions>`. Then we can `xxx.unwrap_or_else(|| IpcWriteOptions::default())` to process it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org