Posted to dev@arrow.apache.org by "Francois Saint-Jacques (Jira)" <ji...@apache.org> on 2020/04/09 14:05:02 UTC
[jira] [Created] (ARROW-8382) [C++][Dataset] Refactor WritePlan to
decouple from Fragment/Scan/Partition classes
Francois Saint-Jacques created ARROW-8382:
---------------------------------------------
Summary: [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes
Key: ARROW-8382
URL: https://issues.apache.org/jira/browse/ARROW-8382
Project: Apache Arrow
Issue Type: Improvement
Reporter: Francois Saint-Jacques
WritePlan should look like the following.
{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
  Result<FileSystemDataset> Execute();

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr<Schema> schema;

  /// The format into which fragments will be written
  std::shared_ptr<FileFormat> format;

  using SourceAndReader = std::pair<FileSource, RecordBatchReader>;
  /// The destinations to write, each paired with the reader supplying its data
  std::vector<SourceAndReader> outputs;
};
{code}
* Refactor FileFormat::Write(FileSource destination, RecordBatchReader). I'm not sure whether it should take the output schema, or whether the RecordBatchReader should already have the right schema.
* Add a class/function that constructs SourceAndReader from Fragments, Partitioning and a base path, and remove any Write/Fragment logic from partition.cc.
* Move Write() out of FileSystemDataset and into WritePlan. It could take a FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, not a requirement.
* Simplify the writing routine to avoid the PathTree directory structure; it shouldn't be more complex than `for task in write_tasks: task()`. No path construction should happen there.
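The flat task loop described in the last bullet could be sketched as follows. This is only an illustration under stated assumptions: FileSource, RecordBatchReader, VectorReader, FakeFileSystem, WriteTask, MakeWriteTasks and Execute below are simplified, hypothetical stand-ins written for this example, not the Arrow classes or any existing Arrow API. The reader is held by shared_ptr here because the stand-in is abstract.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only, NOT the Arrow classes.
struct FileSource {
  std::string path;  // destination only; it was constructed elsewhere
};
using Batch = std::vector<int>;  // stands in for a RecordBatch

class RecordBatchReader {
 public:
  virtual ~RecordBatchReader() = default;
  // Fills *out and returns true, or returns false once exhausted.
  virtual bool ReadNext(Batch* out) = 0;
};

// A reader over in-memory batches; any other producer would work as well.
class VectorReader : public RecordBatchReader {
 public:
  explicit VectorReader(std::vector<Batch> batches)
      : batches_(std::move(batches)) {}
  bool ReadNext(Batch* out) override {
    if (next_ >= batches_.size()) return false;
    *out = batches_[next_++];
    return true;
  }

 private:
  std::vector<Batch> batches_;
  std::size_t next_ = 0;
};

using SourceAndReader =
    std::pair<FileSource, std::shared_ptr<RecordBatchReader>>;

// Stand-in for a file system: destination path -> number of rows written.
using FakeFileSystem = std::map<std::string, std::size_t>;

using WriteTask = std::function<void()>;

// One task per output. No path is constructed here; each task only drains
// its reader into the destination it was handed.
std::vector<WriteTask> MakeWriteTasks(
    const std::vector<SourceAndReader>& outputs, FakeFileSystem* fs) {
  std::vector<WriteTask> tasks;
  for (const auto& output : outputs) {
    tasks.push_back([output, fs] {
      Batch batch;
      std::size_t rows = 0;
      while (output.second->ReadNext(&batch)) rows += batch.size();
      (*fs)[output.first.path] = rows;
    });
  }
  return tasks;
}

// The whole execution step: `for task in write_tasks: task()`.
void Execute(const std::vector<WriteTask>& tasks) {
  for (const auto& task : tasks) task();
}
```

Because each task is an independent closure, the same list could presumably be handed to a thread pool instead of a sequential loop without changing how the outputs are described.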
The effects are:
* Simplified WritePlan execution: it is abstracted away from path construction and can write to multiple FileSystems and/or Buffers, since it doesn't construct the FileSource.
* By virtue of using RecordBatchReader instead of Fragment, it isn't tied to writing from Fragments; it can take any construct that yields a RecordBatchReader. It also means that WritePlan doesn't have to know about any Scan-related classes.
* Writing can be done with or without partitioning; this logic is left to whoever generates the SourceAndReader list.
* Should be simpler to test.
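To illustrate how the partitioning decision can live entirely in whatever generates the output list, here is a minimal sketch. Everything in it is an assumption made for the example: Row, Rows, SourceAndRows, PlanUnpartitioned and PlanByYear are hypothetical names, a plain vector of rows stands in for a RecordBatchReader, and the `year=<value>` directory layout and `part-0` file name are illustrative choices rather than anything specified in this issue.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only, NOT the Arrow classes.
struct FileSource {
  std::string path;
};
struct Row {
  std::string year;
  int value;
};
using Rows = std::vector<Row>;  // stands in for a RecordBatchReader
using SourceAndRows = std::pair<FileSource, Rows>;

// Without partitioning: a single output under the base path.
std::vector<SourceAndRows> PlanUnpartitioned(const std::string& base,
                                             const Rows& rows) {
  std::vector<SourceAndRows> outputs;
  outputs.emplace_back(FileSource{base + "/part-0"}, rows);
  return outputs;
}

// With partitioning on `year`: one output per distinct key. All path
// construction happens here, before any plan is executed.
std::vector<SourceAndRows> PlanByYear(const std::string& base,
                                      const Rows& rows) {
  std::map<std::string, Rows> groups;  // ordered by key
  for (const auto& row : rows) groups[row.year].push_back(row);

  std::vector<SourceAndRows> outputs;
  for (auto& kv : groups) {
    outputs.emplace_back(FileSource{base + "/year=" + kv.first + "/part-0"},
                         std::move(kv.second));
  }
  return outputs;
}
```

Both helpers return the same kind of list, so a consumer of that list would not need to know whether partitioning took place. That also makes each half testable on its own.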