Posted to github@arrow.apache.org by "dadepo (via GitHub)" <gi...@apache.org> on 2023/02/24 04:42:37 UTC

[GitHub] [arrow-datafusion] dadepo opened a new issue, #5383: The output of write_csv and write_json methods is confusing.

dadepo opened a new issue, #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Using either `write_json` or `write_csv` on a DataFrame produces confusing results.
   
   For example this:
   
   ```rust
       df.clone().write_csv("./data.csv").await?;
       df.clone().write_json("./data.json").await?;
   ```
   
   led to the following results:
   
   ```
   tree -h data.csv/      
   [ 320]  data.csv/
   ├── [ 472]  part-0.csv
   ├── [   0]  part-1.csv
   ├── [   0]  part-2.csv
   ├── [   0]  part-3.csv
   ├── [   0]  part-4.csv
   ├── [   0]  part-5.csv
   ├── [   0]  part-6.csv
   └── [   0]  part-7.csv
   ```
   
   and 
   
   ```
   tree -h data.json/
   [ 320]  data.json/
   ├── [ 939]  part-0.json
   ├── [   0]  part-1.json
   ├── [   0]  part-2.json
   ├── [   0]  part-3.json
   ├── [   0]  part-4.json
   ├── [   0]  part-5.json
   ├── [   0]  part-6.json
   └── [   0]  part-7.json
   ```
   
   Only the `part-0` file contains any data; the rest of the files created in the directory are empty.
   
   **Describe the solution you'd like**
   
   Ideally a single file should be created with the results. If there is a technical reason why this cannot be the case, perhaps the methods should document why they behave this way, and whether (and how) the behaviour can be modified.
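
   A possible workaround, sketched here under assumptions (DataFusion's `DataFrame::repartition` with `Partitioning::RoundRobinBatch`, as used in the MRE later in this thread), is to coalesce to a single partition before writing, so only one part file is produced:

   ```rust
   use datafusion::error::Result;
   use datafusion::physical_plan::Partitioning;
   use datafusion::prelude::*;

   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
       let df = ctx.sql("select 1").await?;

       // Coalesce to one partition so the writer emits a single part file.
       // `RoundRobinBatch(1)` is assumed to round-robin all batches into
       // one output partition.
       df.repartition(Partitioning::RoundRobinBatch(1))?
           .write_csv("./data.csv")
           .await?;

       Ok(())
   }
   ```

   Writing a single file this way trades write parallelism for convenience, which is part of the trade-off discussed in this thread.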
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445347526

   > Another thought that comes to mind regarding this: if this is done, we could have cases where the parts written out aren't in sequential/increasing order, which could cause confusion as well, e.g. if parts 2 and 4 are the only ones with data then only those will appear on the filesystem like:
   
   This is an excellent point @Jefffrey  
   
   > I am not sure which is more desirable: having 'gaps' in the parts written vs. having empty parts. Or somehow write only the parts with data (which would break the parallel behaviour of the writes, unless we force a repartition).
   
   I agree that I don't know what is better. I don't really use the DataFrame API, so I don't know whether "write multiple files" is an important feature or just the most straightforward initial implementation.




[GitHub] [arrow-datafusion] alamb commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445133312

   Thank you @Jefffrey for the analysis
   
   > Not sure if in writing logic its possible to check if a partition is empty before attempting to write to disk?
   
   I think it would be best to defer creating the files until there is actually some data (aka don't create the writer until we have at least a single record batch to write)
   
   The other thing we could do is add some way to the DataFrame / `write_csv` API to say "I want the results in a single partition/file" -- perhaps by adding `DataFrame::repartition` or something, so the user can control whether they want multiple files (potentially faster to write) or a single file (slower to write, but easier to use).




[GitHub] [arrow-datafusion] Jefffrey commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1446029283

   > When it creates multiple files, is it creating one per "partition"? Or one per worker?
   
   Seems to be one per partition. In my example above, I can bump the partitioning count to 20 partitions, which will output 20 files.
   
   You have good points regarding Spark, especially about how its partitioning behaviour can be painful for smaller datasets.
   
   Yeah, it would be good to expose options for writing CSVs (and potentially JSON as well), as the current API is quite limiting in not allowing this configuration. An API like you suggested could be a good starting point.




[GitHub] [arrow-datafusion] dadepo commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "dadepo (via GitHub)" <gi...@apache.org>.
dadepo commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445350173

   As an end user who does not know much about the internal details/limitations of the underlying data formats, I'd like to ask:
   
   1. Is it required, for correct usage of the library, to be exposed to the fact that data could exist in "parts"?
   
   I ask because, if not, then for a user who wants to take a DataFrame and produce a CSV file from it, an implementation that produces just one file would be the most user-friendly approach, and it should be the default.
   
   There should still be the option of writing the data out in parts, and an advanced user who knows more about the underlying data format can decide to go for this approach, either by passing the appropriate flag or calling the appropriate method.




[GitHub] [arrow-datafusion] Jefffrey commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445473446

   > I agree that I don't know what is better. I don't really use the DataFrame API, so I don't know whether "write multiple files" is an important feature or just the most straightforward initial implementation.
   
   I feel it is an important feature, as having control over whether or not to repartition/coalesce before writes makes sense from a performance perspective. Apache Spark seems to have similar behaviour, writing empty partitions out, when I do a similar test.
   
   > I am not sure about this, although perhaps simply changing the code to ensure a header row is written (even if there was no data) would be less confusing overall?
   
   Yeah, I feel this header-row part is technically a separate bug (though related, since if we didn't write empty partitions we wouldn't have to fix it).
   
   > Is it required, for correct usage of the library, for the developer to be exposed to the fact that data could exist in "parts"?
   
   Speaking from a Spark background, I feel this is an important concept for maximizing parallelization via partitioning data. Having the default behaviour produce a single CSV on write might make sense from a user-friendliness standpoint, especially for smaller datasets, but it could have performance implications for larger ones, requiring a coalesce to a single partition.
   
   Though again, this is from a Spark perspective; I'm not sure how DataFusion compares regarding the performance of repartition/coalesce (especially since it's not a distributed engine like Spark).




[GitHub] [arrow-datafusion] alamb commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1448277392

   > IMO for now, it would be nice if DataFusion simply:
   > 1. Didn't write empty files.
   > 2. Provided an example of writing the result of an execution plan to a single CSV file using the arrow-csv writer.
   
   
   I agree these changes would be a great improvement




[GitHub] [arrow-datafusion] alamb commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445347791

   > P.S. Another thing I just noticed is that the empty partition files actually shouldn't be completely empty; they should have a header row. For CSV, the default is to include header rows in written files, so those empty parts should at least contain the header row.
   
   I am not sure about this, although perhaps simply changing the code to ensure a header row is written (even if there was no data) would be less confusing overall?




[GitHub] [arrow-datafusion] wjones127 commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445536499

   When it creates multiple files, is it creating one per "partition"? Or one per worker?
   
   I can understand the performance justification for one file per thread / worker. But I'm not sure about one file per partition, especially for a single-node engine, where the location of a partition isn't as important.
   
   Also, as @dadepo suggested, I don't think (in-memory) partitions are that useful to end users. I previously used Spark a lot, and my two biggest complaints about it were:
   
    1. It tightly coupled file layout with partitioning. If there were 5 large files, you wouldn't get more than 5 partitions unless you explicitly asked, even if the file format was something like Parquet that has a natural way to split into more bite-sized chunks (such as row groups). There was a similar issue if you had a bunch of tiny files; you would get way too many partitions/tasks.
    2. It asked me to tune the number of partitions, when what I would rather tune is the batch size. This is particularly true when I didn't know the size of the incoming data; `.repartition(2000)` made sense when 20GB of data was incoming but no sense when it was only 100MB.
   
   From an end user perspective, I'd much rather configure the number of workers and the batch size, not the number of partitions.
   
   IMO for now, it would be nice if DataFusion simply: 
   
   1. Didn't write empty files.
   2. Provided an example of writing the result of an execution plan to a single CSV file using the arrow-csv writer.
   
   In the long run, I'd prefer something similar to the Arrow C++ / [PyArrow dataset writer](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html). It handles partitioning of the dataset and provides knobs to control the number of files that are more meaningful to users, such as max file size and max number of open files.
   
   Eventually, it would be nice to support something like:
   
   ```rust
   write_csv(df, Partitioning::SingleFile, WriterOptions::default()).await?;
   write_parquet(
       df,
       Partitioning::Hive(vec!["year", "month", "day"]),
       WriterOptions::builder().max_rows_per_file(2_000_000).build(),
   ).await?;
   ```
   




[GitHub] [arrow-datafusion] vincev commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "vincev (via GitHub)" <gi...@apache.org>.
vincev commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1640485157

   In case it helps, [here](https://github.com/vincev/dply-rs/blob/402ca2c263d770be16781dd32f66fa02852bd7b9/src/engine/parquet.rs#L50) is some code that executes a `LogicalPlan` and writes its output to a single Parquet file; [csv.rs](https://github.com/vincev/dply-rs/blob/main/src/engine/csv.rs) does the same for CSV files.




[GitHub] [arrow-datafusion] Jefffrey commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445326546

   > I think it would be best to defer creating the files until there is actually some data (aka don't create the writer until we have at least a single record batch to write)
   
   @alamb  Another thought that comes to mind regarding this: if this is done, we could have cases where the parts written out aren't in sequential/increasing order, which could cause confusion as well, e.g. if parts 2 and 4 are the only ones with data then only those will appear on the filesystem like:
   
   ```
   [4.0K]  csv
   ├── [  11]  part-2.csv
   └── [  11]  part-4.csv
   ```
   
   I am not sure which is more desirable: having 'gaps' in the parts written vs. having empty parts. Or somehow write only the parts with data (which would break the parallel behaviour of the writes, unless we force a repartition).
   
   > The other thing we could do is add some way to the DataFrame / `write_csv` API to say "I want the results in a single partition/file" -- perhaps by adding `DataFrame::repartition` or something, so the user can control whether they want multiple files (potentially faster to write) or a single file (slower to write, but easier to use)
   
   This does sound like a good option to have for user flexibility, though it still leaves the question of what the default behaviour should be. Or maybe it's best to leave that decision to the user and document the method to hint towards it, since it would be a simple wrapper over the `repartition(...)` method.
   
   P.S. Another thing I just noticed is that the empty partition files actually shouldn't be completely empty; they should have a header row. For CSV, the default is to include header rows in written files, so those empty parts should at least contain the header row.
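
   To illustrate the header-row point, here is a small sketch using the arrow-csv writer directly, assuming the writer emits the header on the first `write` call even for a zero-row batch:

   ```rust
   use std::sync::Arc;

   use arrow::array::Int32Array;
   use arrow::csv::Writer;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // A batch with the right schema but zero rows, standing in for an
       // empty partition.
       let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
       let empty = RecordBatch::try_new(
           schema,
           vec![Arc::new(Int32Array::from(Vec::<i32>::new()))],
       )?;

       // `Writer::new` includes headers by default, so writing the empty
       // batch should leave just the header row in the file.
       let file = std::fs::File::create("part-1.csv")?;
       let mut writer = Writer::new(file);
       writer.write(&empty)?;
       Ok(())
   }
   ```

   If that holds, an "empty" part would at least be a valid CSV with a header rather than a zero-byte file.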




[GitHub] [arrow-datafusion] Jefffrey commented on issue #5383: The output of write_csv and write_json methods is confusing.

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #5383:
URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1442887119

   MRE:
   
   ```rust
   use datafusion::error::Result;
   use datafusion::physical_plan::Partitioning;
   use datafusion::prelude::*;

   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();

       ctx.sql("select 1")
           .await?
           .repartition(Partitioning::Hash(vec![lit(0)], 5))?
           .write_csv("csv")
           .await?;

       Ok(())
   }
   ```
   
   Output:
   
   ```
   jeffrey:~/Code/arrow-datafusion$ tree -h csv
   [4.0K]  csv
   ├── [   0]  part-0.csv
   ├── [   0]  part-1.csv
   ├── [   0]  part-2.csv
   ├── [   0]  part-3.csv
   └── [  11]  part-4.csv
   
   0 directories, 5 files
   ```
   
   You can see it's due to empty partitions still being written out to disk.
   
   Not sure if, in the writing logic, it's possible to check whether a partition is empty before attempting to write to disk?
   
   https://github.com/apache/arrow-datafusion/blob/1309267e713523bc5d1c23e34dcc934d6d30c22b/datafusion/core/src/physical_plan/file_format/csv.rs#L297-L314
   
   - Without requiring execution twice
   - Not to mention it could lead to a case where no files are written at all (if all partitions are empty); unsure if that's desirable
   
   It's probably easiest to update the documentation to reflect this behaviour?
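
   The "defer writer creation" idea can be sketched outside DataFusion with plain std I/O; `write_partitions` and its string "batches" are hypothetical stand-ins for record batches:

   ```rust
   use std::fs::File;
   use std::io::Write;

   /// Hypothetical illustration of deferred writer creation: a part file
   /// is only created once its partition produces a non-empty "batch"
   /// (here, just a string of CSV lines). Returns the number of files
   /// actually created.
   fn write_partitions(parts: &[Vec<&str>], dir: &str) -> std::io::Result<usize> {
       std::fs::create_dir_all(dir)?;
       let mut created = 0;
       for (i, batches) in parts.iter().enumerate() {
           let mut writer: Option<File> = None; // not created yet
           for batch in batches.iter().filter(|b| !b.is_empty()) {
               if writer.is_none() {
                   // First non-empty batch: now it is worth creating the file.
                   writer = Some(File::create(format!("{dir}/part-{i}.csv"))?);
                   created += 1;
               }
               writeln!(writer.as_mut().unwrap(), "{batch}")?;
           }
       }
       Ok(created)
   }

   fn main() -> std::io::Result<()> {
       // Partitions 0 and 2 are empty, so only part-1.csv and part-3.csv appear.
       let parts = vec![vec![], vec!["1"], vec![""], vec!["2", "3"]];
       let n = write_partitions(&parts, "lazy_csv")?;
       println!("{n} files created");
       Ok(())
   }
   ```

   With this approach, empty partitions simply never get a file, which produces the 'gaps' in part numbering discussed above.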




Re: [I] The output of write_csv and write_json methods results in many empty output files, which is confusing [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5383: The output of write_csv and write_json methods results in many empty output files, which is confusing
URL: https://github.com/apache/arrow-datafusion/issues/5383

