Posted to issues@arrow.apache.org by "Francois Saint-Jacques (Jira)" <ji...@apache.org> on 2020/04/11 03:52:00 UTC

[jira] [Comment Edited] (ARROW-8382) [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes

    [ https://issues.apache.org/jira/browse/ARROW-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081130#comment-17081130 ] 

Francois Saint-Jacques edited comment on ARROW-8382 at 4/11/20, 3:51 AM:
-------------------------------------------------------------------------

The end goal of this is to write data to disk so it can be re-used afterward, very likely from a different process. Transcoding one dataset into another (whether to change its format or its shape) is a rare event; it is done once per dataset. Once the data is transcoded to a new format (say Parquet or IPC), the new dataset is what gets used from then on. In data analysis workflows this often means creating a new process, and thus the DatasetFactory will be used to open/read it.

We should use DatasetFactory because that's what the user is going to use once the data is written to disk. Parsing strings is negligible, and I'll gladly take that cost if it means a simpler API that reflects the user experience.
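
To make the round trip concrete, here is roughly what the read side looks like once the data is on disk (a sketch only: the base directory is a placeholder and the factory signatures are approximate):

{code:c++}
// Sketch of the post-write consumer path via FileSystemDatasetFactory.
// The base directory is a placeholder; signatures are approximate.
auto fs = std::make_shared<arrow::fs::LocalFileSystem>();

arrow::fs::FileSelector selector;
selector.base_dir = "/data/my_dataset";  // wherever the write step put the files
selector.recursive = true;

auto format = std::make_shared<ParquetFileFormat>();
FileSystemFactoryOptions options;  // partitioning discovery, selector options, ...

// The factory discovers the files; Finish() materializes the Dataset.
ARROW_ASSIGN_OR_RAISE(
    auto factory, FileSystemDatasetFactory::Make(fs, selector, format, options));
ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
{code}

Whatever the writer API looks like internally, this is the path the user ends up on.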

And finally, this is only relevant if the data is partitioned with some logic. Quite often, the written data is partitioned purely for sharding/balancing, where rows are randomly and uniformly assigned to partitions just to "shard" the data. In such cases, the partition is almost never added as a column, since it doesn't carry any semantic value. The writer should definitely be optimized for this case.


> [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes 
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-8382
>                 URL: https://issues.apache.org/jira/browse/ARROW-8382
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset
>
> WritePlan should look like the following:
> {code:c++}
> class ARROW_DS_EXPORT WritePlan {
>  public:
>   /// Execute the WritePlan and return a FileSystemDataset as a result.
>   Result<FileSystemDataset> Execute(FileSystemDatasetFactory factory);
>
>  protected:
>   /// The schema of the Dataset which will be written
>   std::shared_ptr<Schema> schema;
>   /// The format into which fragments will be written
>   std::shared_ptr<FileFormat> format;
>
>   using SourceAndReader = std::pair<FileSource, std::shared_ptr<RecordBatchReader>>;
>   /// Files to write
>   std::vector<SourceAndReader> outputs;
> };
> {code}
> * Refactor FileFormat::Write(FileSource destination, RecordBatchReader reader); it is not yet clear whether it should take the output schema or whether the RecordBatchReader should already produce the right schema.
> * Add a class/function that constructs SourceAndReader pairs from Fragments, a Partitioning, and a base path (a hypothetical builder is sketched after the effects list below).
> * Move Write() out of FileSystemDataset into WritePlan. It could take a FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, not a requirement.
> * Simplify the writing routine to avoid the PathTree directory structure; it shouldn't be more complex than `for task in write_tasks: task()` (see the sketch just below). No path construction should happen there.
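> A minimal sketch of that loop, matching the proposed signatures above (hypothetical code, not an existing API):
> {code:c++}
> // Hypothetical: Execute reduces to a flat loop over pre-built write tasks.
> Result<FileSystemDataset> WritePlan::Execute(FileSystemDatasetFactory factory) {
>   for (auto& output : outputs) {
>     // No PathTree traversal and no path construction here: the FileSource
>     // was fully resolved by whoever built the outputs list.
>     ARROW_RETURN_NOT_OK(format->Write(output.first, output.second));
>   }
>   // Recreate the resulting dataset through the provided factory.
>   return factory.Finish();
> }
> {code}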
> The effects are:
> * Simplified WritePlan execution: it is abstracted away from path construction and can write to multiple FileSystems and/or Buffers, since it doesn't construct the FileSource itself.
> * By virtue of using RecordBatchReader instead of Fragment, it isn't tied to writing from Fragments; it can take any construct that yields a RecordBatchReader. It also means that WritePlan doesn't have to know about any Scan-related classes.
> * Writing can be done with or without partitioning; this logic belongs to whoever generates the SourceAndReader list (sketched below).
> * Should be simpler to test.
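> For the partitioned case, a hypothetical builder (names like MakeWriteTasks and ToReader are placeholders, not existing APIs) could own all of the partition and path logic:
> {code:c++}
> // Hypothetical builder: everything path- and partition-related happens here,
> // before the WritePlan ever sees the outputs. Uses the SourceAndReader alias
> // from the WritePlan sketch above.
> Result<std::vector<SourceAndReader>> MakeWriteTasks(
>     const std::vector<std::shared_ptr<Fragment>>& fragments,
>     const Partitioning& partitioning,
>     std::shared_ptr<fs::FileSystem> filesystem, const std::string& base_path) {
>   std::vector<SourceAndReader> outputs;
>   for (const auto& fragment : fragments) {
>     // Render the fragment's partition expression into a relative path,
>     // e.g. "year=2020/month=4", and anchor it under base_path.
>     ARROW_ASSIGN_OR_RAISE(std::string segment,
>                           partitioning.Format(fragment->partition_expression()));
>     // ToReader() stands in for whatever yields a RecordBatchReader.
>     outputs.emplace_back(FileSource(base_path + "/" + segment, filesystem),
>                          fragment->ToReader());
>   }
>   return outputs;
> }
> {code}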



--
This message was sent by Atlassian Jira
(v8.3.4#803005)