Posted to commits@submarine.apache.org by GitBox <gi...@apache.org> on 2022/09/20 14:09:59 UTC

[GitHub] [submarine] cdmikechen commented on pull request #989: SUBMARINE-1283. copy data for experiment before it running via distcp to minio

cdmikechen commented on PR #989:
URL: https://github.com/apache/submarine/pull/989#issuecomment-1252412078

   @FatalLin 
   After https://github.com/apache/submarine/pull/994 has been merged, there is a file conflict. Once the conflict is resolved, I think this PR can be merged first.
   
   Meanwhile, I reconsidered this after today's meeting, and I think it should be possible to adapt this prehandler operation by adding a `load_dataset` method to the `submarine-sdk`.
   For example, we could modify the quickstart to look like this:
   ```python
   hdfs_config = {'dfs.nameservices': 'example-cluster', 'dfs.ha.namenodes.example-cluster': 'nn1,nn2', ...}
   dataset = submarine.load_dataset('hdfs', hdfs_config, 'hdfs://warehouse/datasets/***.parquet')
   ```
   
   The underlying implementation of `load_dataset` registers the prehandler service as a pod and performs the experiment's dataset loading/training after the data has been copied successfully. In distributed mode, we can block until the pod has finished, so that each worker waits for the data copy to complete.
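   To make this concrete, here is a rough sketch (purely illustrative, not the actual implementation) of how the SDK side could launch the prehandler pod and then block on it. The image name, pod args, and namespace below are placeholders, and a real version would reuse submarine's existing Kubernetes client code:
   ```python
   # Illustrative sketch only: launch a prehandler pod that copies the dataset
   # (e.g. distcp from HDFS to minio) and block until it finishes, so every
   # worker waits for the copy before training starts.
   import json
   import time
   from kubernetes import client, config

   PREHANDLER_IMAGE = "apache/submarine:prehandler"  # placeholder image name

   def load_dataset(source_type, conf, path, namespace="default", timeout=600):
       config.load_incluster_config()  # assume we run inside the experiment pod
       api = client.CoreV1Api()

       pod = client.V1Pod(
           metadata=client.V1ObjectMeta(generate_name="prehandler-"),
           spec=client.V1PodSpec(
               restart_policy="Never",
               containers=[client.V1Container(
                   name="prehandler",
                   image=PREHANDLER_IMAGE,
                   args=[source_type, json.dumps(conf), path],  # illustrative args
               )],
           ),
       )
       name = api.create_namespaced_pod(namespace=namespace, body=pod).metadata.name

       # Block until the copy pod finishes; in distributed mode each worker
       # calls this and therefore waits for the data copy to complete.
       deadline = time.time() + timeout
       while time.time() < deadline:
           phase = api.read_namespaced_pod(name, namespace).status.phase
           if phase == "Succeeded":
               return path  # dataset is now available for the experiment
           if phase == "Failed":
               raise RuntimeError("prehandler pod failed")
           time.sleep(5)
       raise TimeoutError("timed out waiting for the data copy")
   ```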
   I will follow up later to see how `kubeflow` does distributed dataset loading. In the meantime, there are some good ideas in another project, [huggingface-datasets](https://github.com/huggingface/datasets), that we should learn from (huggingface also seems to download datasets locally first).

