Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2020/08/01 02:44:08 UTC

Re: [DISCUSS] Execute dataset scan tasks in distributed system

Hi Hongze,

> Has anyone ever tried using the Arrow Dataset API in a distributed system?

My understanding is that the Dataset project was initially intended for
running on a single-node machine.  It might be reasonable to extend it to
be usable in a distributed system, but I'll let the primary contributors
to Dataset chime in here.

> A better solution I can think of is to make scan tasks serializable so
> that we can distribute them directly to the machines. Currently they
> don't seem to be designed in such a way, since we allow contextual state
> to be used to create the tasks, e.g. the open readers in
> ParquetScanTask[1]. At the same time a built-in ser/de mechanism would
> be needed. Anyway, a bunch of work would have to be done.


Again, I'm not an expert, but this seems like a more promising path than
relying on the consistency of the iterator (especially because the
underlying dataset could change while trying to run in parallel).
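
To make that concrete, here is a minimal sketch of what a serializable
scan-task description might look like. This is purely hypothetical:
ScanTaskDescription and its fields are invented for illustration and are
not part of the Arrow API. The point is that a task would carry plain
data (a path, row groups, projected columns) instead of live readers, so
a built-in ser/de mechanism has something simple to work with:

    import json
    from dataclasses import dataclass, asdict
    from typing import List

    @dataclass
    class ScanTaskDescription:
        """Hypothetical scan-task description: plain data only, no open
        file handles or readers, so it can be shipped across machines."""
        file_path: str          # file to scan, resolvable on any worker
        row_groups: List[int]   # Parquet row groups assigned to the task
        columns: List[str]      # projected column names

        def serialize(self) -> str:
            return json.dumps(asdict(self))

        @staticmethod
        def deserialize(payload: str) -> "ScanTaskDescription":
            return ScanTaskDescription(**json.loads(payload))

    # Machine 1 builds and ships the payloads; each worker reconstructs
    # the description and opens its own local reader from the fields.
    task = ScanTaskDescription("data/part-0.parquet", [0, 1], ["a", "b"])
    assert ScanTaskDescription.deserialize(task.serialize()) == task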

Thanks,
Micah

On Tue, Jul 21, 2020 at 12:57 AM Hongze Zhang <no...@126.com> wrote:

> Hi all,
>
> Has anyone ever tried using the Arrow Dataset API in a distributed
> system? E.g. create scan tasks on machine 1, then send these tasks to
> machines 2, 3 and 4 for execution.
>
> So far I think a possible workaround is the following (see the sketch
> after the list):
>
> 1. Create Dataset on machine 1;
> 2. Call Scan(), collect all scan tasks from scan task iterator;
> 3. Say we have 5 tasks numbered 1, 2, 3, 4, 5, and we decide to run
> tasks 1 and 2 on machine 2, and tasks 3, 4, 5 on machine 3;
> 4. Send the assigned task numbers to machines 2 and 3 respectively;
> 5. Create a Dataset with the same configuration on machines 2 and 3, and
> call Scan() to create the same 5 tasks on each machine;
> 6. On machine 2, run tasks 1, 2 and skip tasks 3, 4, 5;
> 7. On machine 3, skip tasks 1, 2 and run tasks 3, 4, 5.
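>
> To make steps 5-7 concrete, here is a minimal sketch of the worker-side
> logic, assuming the Python dataset bindings of the time (where
> Dataset.scan() yields ScanTask objects); run_assigned_tasks and
> process are made-up names:
>
>     import pyarrow.dataset as ds
>
>     def run_assigned_tasks(uri, assigned):
>         # Every machine rebuilds the same dataset from the same
>         # configuration; a deterministic scan() order is assumed.
>         dataset = ds.dataset(uri, format="parquet")
>         for i, task in enumerate(dataset.scan(), start=1):
>             if i not in assigned:
>                 continue  # this task belongs to another machine
>             for batch in task.execute():
>                 process(batch)  # process() is a placeholder
>
>     # e.g. machine 2: run_assigned_tasks(uri, {1, 2})
>     #      machine 3: run_assigned_tasks(uri, {3, 4, 5})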
>
> This works correctly only if we assume that the method
> `Dataset::Scan()` returns exactly the same task iterator on every
> machine. I am also not sure whether unnecessary overhead is introduced
> in the process; after all, we'll run the scan method N times for N
> machines.
>
> A better solution I can think of is to make scan tasks serializable so
> that we can distribute them directly to the machines. Currently they
> don't seem to be designed in such a way, since we allow contextual state
> to be used to create the tasks, e.g. the open readers in
> ParquetScanTask[1]. At the same time a built-in ser/de mechanism would
> be needed. Anyway, a bunch of work would have to be done.
>
> So far I am not sure which way is more reasonable, or whether there is a
> better option than both. If you have any thoughts, please let me know.
>
> Best,
> Hongze
>
> [1]
> https://github.com/apache/arrow/blob/c09a82a388e79ddf4377f44ecfe515604f147270/cpp/src/arrow/dataset/file_parquet.cc#L52-L74
>