Posted to dev@madlib.apache.org by GitBox <gi...@apache.org> on 2020/11/19 00:57:54 UTC

[GitHub] [madlib] reductionista opened a new pull request #525: DL: Model Hopper Refactor

reductionista opened a new pull request #525:
URL: https://github.com/apache/madlib/pull/525


   This is a major refactor of the model hopper parallelism of the deep learning module.
   
   **A new data flow for the hop/training cycle**
   
   Background:
   
   In v1.17, MADlib creates 3 temporary tables to store the model weights while they are being
   hopped around and trained:  mst_weights_tbl, weights_to_update_tbl, and model_output_table.
   This gives rise to 3 different long-running queries that each involve at least one Redistribute Motion:
   the UDA query (which does the actual training), the UPDATE query, and the HOP query.  Most of the time in
   madlib_keras_fit_multiple() is spent in a loop over iterations and hops calling the run_training()
   function, which performs each of these 3 stages; all of the models circulate between
   these 3 temporary tables and get back to where they started after each full iteration.
   The overall effect is a lot of unnecessary motion of the model weights, going back and forth
   between different segment hosts several times for each call to run_training().  Each hop is
   really 3 or 4 hops in terms of data motion, which for a modest-sized data set can cause
   the hopping phase of the cycle to take just as long as the training phase, leaving GPUs with a lot
   of idle time.
   
   3 Biggest Changes:
   
   1.  This refactor is a major simplification: the 3 temporary tables are replaced by 2 (model_input_tbl and model_output_tbl), and the UPDATE stage is eliminated, leaving only the HOP and
   UDA stages.  The 3 shorter-running TRUNCATE & DROP queries that ran in between the 3 longer queries
   are likewise reduced to only 2 TRUNCATE & DROP queries.
   
   Two additional performance improvements closely related to the above are:
   2.  A __dist_key__ column is added to the model output table, and both of the temporary tables are
          DISTRIBUTED BY this key instead of by mst_key.  The __dist_key__ values have a 1-to-1 mapping with segment_ids,
          as in the batched source table, and __dist_key__ now serves as the hash key for all JOINs.
          The schedule table has both a __dist_key__ column and a __previous_dist_key__ column,
          so that it can guide the weights from the specific segment they were on previously
          to the one they are scheduled to be on for the next hop.  This avoids any additional
          Redistribute Motions caused by table Hash Joins.  With this change, there is no longer any
          movement of the models at all except during the hop itself.  They move
          exactly once per call to run_training().  (A sketch of the resulting hop query follows this list.)
   
   and
   3.  For model weight averaging (madlib_keras_fit()), we needed a merge and a final function
        in addition to the fit_transition function, so a UDA was the natural choice.  For the model
        hopper (madlib_keras_fit_multiple()), there is no need for a merge or final function; all we need is
        to call the fit_transition function directly on each row, so it was less clear whether a UDA
        was the right choice.  We found that calling it directly as a UDF, and using SD to pass
        the state from one row to the next (as we were already doing with the UDA), resulted
        in slightly better performance.  So now fit_transition can be called either as a UDF or as part
        of a UDA, and fit_transition_multiple_model() is always called as a UDF.
   
   
# Other Performance Enhancements
   
   4.  There is no need to hop the models between the last hop of the previous iteration and the first hop of the next iteration, so once per iteration we skip the hop and just rename model_output_tbl to model_input_tbl, instead of copying the models and then truncating and dropping model_output_tbl (sketched below).  For example, if there are 4 segments, v1.17 performs 4 model hops per iteration.  In this branch, the last hop is skipped and the next iteration just starts with the models on the segments they are already on, so the 4 hops are reduced to only 3.  The same amount of training occurs as before, and each model is still paired exactly once per iteration with each segment--just in a slightly different order.
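   A minimal sketch of the skipped hop (same placeholder names as the sketch above); the real code
   does this in run_training() when the hop is the first one of an iteration:
   ```
    # Sketch: the previous output table simply becomes the next input table.
    plpy.execute("""
        ALTER TABLE {model_output} RENAME TO {model_input}
    """.format(model_output=model_output_tbl, model_input=model_input_tbl))
   ```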
   
   5.  Much faster initialization code:  previously, we were reading the weights
         in from the original model output table (during warm start) and the model
         arch table (for transfer learning) one mst row at a time from the segments to
         master, then writing each of them back out one row at a time from master
         back to the segments with a large number of SELECT and INSERT queries.
         Now we just use a single query to copy the weights directly from the
         original model output table into the new model output table on the
         segments, without ever sending them to master.  A similar single
         query copies the transfer learning weights directly from model_arch to
         model_output for training.  Both of these happen in parallel on the
         segments, instead of in sequence on master.  During testing of warm
         start on a 20-segment cluster with 20 models, this resulted in a roughly 10x reduction
         in initialization time (26s instead of 5 minutes in v1.17).
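   One way to express that single warm-start copy is sketched below.  This is illustrative only
   (the actual query and table names in the module differ), but it shows the idea of updating the
   weights in place on the segments rather than round-tripping each row through master.
   ```
    # Illustrative sketch, not the exact MADlib query.  'prev_run_model_output'
    # is a hypothetical name for the model output table from the previous run.
    plpy.execute("""
        UPDATE {model_output} mo
        SET    model_weights = prev.model_weights
        FROM   prev_run_model_output prev
        WHERE  mo.mst_key = prev.mst_key
    """.format(model_output=model_output_tbl))
   ```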
   
   6.    Split get_model_arch_and_weights() into query_weights() and get_model_arch(),
           so we don't have to transfer weights from the segments to master in places
           where we only need the model_arch JSON.
   
   7.     Enables JIT XLA auto-clustering, if the installed version of tensorflow supports it (see the sketch below).
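   For reference, this is roughly what enabling XLA auto-clustering looks like in TF 1.x.  This is a
   generic TensorFlow sketch, not necessarily the exact call the module makes, and it assumes a
   tensorflow build with XLA support:
   ```
    import tensorflow as tf
    from keras import backend as K

    # Request XLA auto-clustering ("global JIT") for the session Keras will use.
    config = tf.ConfigProto()
    config.graph_options.optimizer_options.global_jit_level = \
        tf.OptimizerOptions.ON_1
    K.set_session(tf.Session(config=config))
   ```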
   
   # Other Changes
   
 8.  Simplified schedule rotation: the schedule table is created only once, then gets
      rotated on the segments, instead of being re-created many times by transferring
      data back and forth from master to segments to master on each hop.  There is no longer
      any need for separate "current_schedule" and "grand_schedule" data structures.
      These tables do not contain model weights, so there is not much of a
      performance benefit, but it's at least simpler and more maintainable.
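   The rotation itself is a single window-function query; a simplified sketch is below (the full
   version, from rotate_schedule_tbl(), is quoted in the review discussion further down):
   ```
    # Sketch: each model's current __dist_key__ becomes its __prev_dist_key__,
    # and LEAD()/FIRST_VALUE() pick the next segment in round-robin order.
    plpy.execute("""
        CREATE TABLE {next_schedule} AS
            SELECT mst_key, model_id,
                   __dist_key__ AS __prev_dist_key__,
                   COALESCE(
                       LEAD(__dist_key__)        OVER (ORDER BY __dist_key__),
                       FIRST_VALUE(__dist_key__) OVER (ORDER BY __dist_key__)
                   ) AS __dist_key__
            FROM {schedule}
    """.format(next_schedule='__madlib_temp_next_schedule__',   # placeholder
               schedule=schedule_tbl))
   ```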
   
   9.   Added some debugging output that can be enabled to help profile the
         performance of fit multiple, and to track which segment each mst_key
         is located on during each hop.  This also serves as an example for
         the utils/debug PR this is rebased on top of.
   
   10.  Removed AutoML's dependency on the internals of fit_multiple (needed to make AutoML
        compatible with the new FitMultiple class, but also good for avoiding having to
        do the same thing for any future updates to fit_multiple).  Better modularity.
   
    11.  Improved exception handling:  we now send the full stack traceback from the segments back to master (soon to be renamed "coordinator").  When an exception happens in fit_transition (or merge, final, etc.) we'll be able to see the stack trace of the internal UDF or UDA running on the segments, along with the stack trace of the outer UDF that runs on the
   coordinator.  It's attached to the DETAIL of the Exception and includes the line number where the error occurred--something that we had to guess at before, making debugging difficult.
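   The segment-side half of this is visible in the madlib_keras.sql_in diff below: fit_transition is
   wrapped in a try/except that formats the traceback and appends it to the error message after a
   marker string ('TransAggDetail').  The coordinator-side handling is sketched here in simplified
   form (the actual code attaches the traceback to the exception's DETAIL field; this sketch just
   appends it to the message, and uda_query is a placeholder for the training query):
   ```
    import plpy

    DETAIL_MARKER = 'TransAggDetail'   # marker appended by the segment-side wrapper
    uda_query = "SELECT ..."           # placeholder for the actual training query

    try:
        plpy.execute(uda_query)
    except plpy.SPIError as e:
        msg = e.args[0] if e.args else str(e)
        if DETAIL_MARKER in msg:
            short_msg, seg_traceback = msg.split(DETAIL_MARKER, 1)
            # Surface both the coordinator-side and the segment-side stack traces.
            plpy.error(short_msg + "\nSegment traceback:\n" + seg_traceback)
        else:
            raise
   ```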


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r558566450



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -134,57 +140,56 @@ class FitMultipleModel():
         self.info_str = ""
         self.dep_shape_col = add_postfix(mb_dep_var_col, "_shape")
         self.ind_shape_col = add_postfix(mb_indep_var_col, "_shape")
-        self.use_gpus = use_gpus
+        self.use_gpus = use_gpus if use_gpus else False
         self.segments_per_host = get_segments_per_host()
+        self.model_input_tbl = unique_string('model_input')
+        self.model_output_tbl = unique_string('model_output')
+        self.schedule_tbl = unique_string('schedule')

Review comment:
       This should be gone now.




----------------------------------------------------------------



[GitHub] [madlib] reductionista edited a comment on pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista edited a comment on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-732492520


   Thanks for testing it out Frank!
   
   You're the first to test this with tensorflow 1.13, and based on one of the errors you're seeing, I'm guessing you're also the first to test it with gpdb5.  I only tested it on gpdb6 and postgres.
   
   > ```
   > WARNING:  This version of tensorflow does not support XLA auto-cluster JIT optimization.  HINT:  upgrading tensorflow may improve performance.  (seg1 slice1 10.128.0.41:40001 pid=6271)
   > CONTEXT:  PL/Python function "fit_transition"
   > ```
   > 
   > What does user need to do to enable XLA? I am on TF 1.13.1 currently.
   
   Nothing, probably just need to upgrade to TF 1.14.  I should add the specific minimum version requirement in the warning message.  I think it was added in 1.14 but I've only used 1.10 and 1.14, so I'll need to confirm that.  Could also be that it's been in there for a while, and they just changed how you access it.  I'll look both of those up, and also run some more comprehensive performance tests to see how much of a difference it actually makes and in what kind of setups.  If it looks like it's not helping much anyway (even though the TF docs say it should) then we can get rid of this message... or maybe even not bother to enable it.  I figure if it does give better performance consistently across models, then we may as well let the user know there's an easy way to get that, without having to tune any parameters or upgrade hardware.  Would also be important to know if there are some cases where it could hurt performance.
    
   > ```
   > ERROR:  plpy.Error: madlib_keras_fit_multiple_model error: No GPUs configured on hosts. (plpython.c:5038)
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
   >     fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
   >   PL/Python function "madlib_keras_fit_multiple_model", line 147, in __init__
   >   PL/Python function "madlib_keras_fit_multiple_model", line 295, in get_accessible_gpus_for_seg
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > so it looks like `use gpus=NULL` is now defaulting to `TRUE` but it should default to `FALSE` i.e., CPUs. It used to default to CPUs.
   
   Good catch!  For some reason, I got a different error when I tried to reproduce this (about using BOOLEAN::None in a query).  But I think I see the cause of it--fixing.  I thought we had some dev-check tests for this; I'll add one if we don't.
   
   > ```
   > ERROR:  plpy.SPIError: PRIMARY KEY and DISTRIBUTED BY definitions incompatible
   > HINT:  When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
   >     fit_obj.fit_multiple_model()
   >   PL/Python function "madlib_keras_fit_multiple_model", line 241, in fit_multiple_model
   >   PL/Python function "madlib_keras_fit_multiple_model", line 509, in init_model_output_tbl
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > I have also seen this error when running on a single segment.
   
   Interesting!  Looks like this was due to an unnecessarily strict check in the gpdb5 server code, where it required a specific order for the keys in the PRIMARY KEY constraint.  The server team has fixed that in gpdb6:
   https://github.com/greenplum-db/gpdb/commit/e572af54e4255d501d261d51d32e1f3d3cf4b9c9
   
   I can't recall for sure if we decided we care about supporting deep learning in gpdb5 for this release.  But since it's an easy fix, I'll just reverse the order of the keys so that it works on both.


----------------------------------------------------------------



[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r560537577



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
##########
@@ -1793,13 +1793,23 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.fit_transition(
     segments_per_host           INTEGER,
     images_per_seg              INTEGER[],
     use_gpus                    BOOLEAN,
-    accessible_gpus_for_seg                INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
     prev_serialized_weights     BYTEA,
     is_final_iteration          BOOLEAN,
-    custom_function_map        BYTEA
+    custom_function_map         BYTEA
 ) RETURNS BYTEA AS $$
 PythonFunctionBodyOnlyNoSchema(`deep_learning', `madlib_keras')
-    return madlib_keras.fit_transition(**globals())
+    import traceback
+    from sys import exc_info
+    import plpy
+    try:
+        return madlib_keras.fit_transition(**globals())
+    except Exception as e:
+        etype, _, tb = exc_info()
+        detail = ''.join(traceback.format_exception(etype, e, tb))
+        message = e.args[0] + 'TransAggDetail' + detail

Review comment:
       Good suggestions; I think we should take care of this in a future PR.  This looks good for now.




----------------------------------------------------------------



[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r538740058



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})

Review comment:
       Actually, I decided I'm going to leave it `dist_key_col` here, but add a distribution rule for the rotate_schedule_tbl query to DISTRIBUTE BY `prev_dist_key`, as you suggested in another comment.
   
   The first hop is always just a table rename, so the initial schedule table never actually gets used for the hop itself.  It's only used to generate the next schedule table in the rotate_schedule_tbl query.  And in that query, it's the `dist_key_col` that matters... `prev_dist_key_col` is unused.  I guess we really don't need that column to exist at all... it's just there as a placeholder so that the schema is the same as the next schedule table that gets generated later by `rotate_schedule_tbl()`.
   
   




----------------------------------------------------------------



[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r538747152



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -75,8 +79,7 @@ segment.
 Note that this function is disabled for Postgres.
 """
 
-@MinWarning("warning")

Review comment:
       Thanks, I think I removed it temporarily for debugging and forgot to add it back :-) 




----------------------------------------------------------------



[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r538744805



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};

Review comment:
      I'm not sure whether the DISTRIBUTED BY in the `rotate_schedule_tbl` query actually matters... if I recall, in the hop query, there is one Redistribute motion based on the dist key and another one based on the prev dist key... so I left it off, figuring either way this small amount of data is going to have to move.  It's the movement of the other tables (input & output) we care about, since they're large.  I haven't checked today, but it's possible the schedule table will just get broadcast anyway, since it's much smaller than the model_input and model_output tables.  But I am going to add a DISTRIBUTED BY `prev_dist_key_col`, because that seems like the safest of the 3 options, in case the planner or orca gets changed in the future and it starts picking a bad plan... choosing either `prev_dist_key_col` or `dist_key_col` at least seems better than letting it randomly distribute.




----------------------------------------------------------------



[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r537809115



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
##########
@@ -1793,13 +1793,23 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.fit_transition(
     segments_per_host           INTEGER,
     images_per_seg              INTEGER[],
     use_gpus                    BOOLEAN,
-    accessible_gpus_for_seg                INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
     prev_serialized_weights     BYTEA,
     is_final_iteration          BOOLEAN,
-    custom_function_map        BYTEA
+    custom_function_map         BYTEA
 ) RETURNS BYTEA AS $$
 PythonFunctionBodyOnlyNoSchema(`deep_learning', `madlib_keras')
-    return madlib_keras.fit_transition(**globals())
+    import traceback
+    from sys import exc_info
+    import plpy
+    try:
+        return madlib_keras.fit_transition(**globals())
+    except Exception as e:
+        etype, _, tb = exc_info()
+        detail = ''.join(traceback.format_exception(etype, e, tb))
+        message = e.args[0] + 'TransAggDetail' + detail

Review comment:
       Good point.
   
     Originally I was thinking I would make a simple general mechanism for doing this for any python function running on the segments that we're calling from another python function on master with `plpy.execute()`.  But the story had already gotten long enough that I ended up just leaving it kind of hacked together, with duplicated code reused in both fit and fit multiple.
   
     Eventually, I think we should move it into `utilities` somewhere, but I haven't thought it through much.  Come to think of it, maybe I should move this over into `utilities/debug.py` and have it as a part of that PR instead of this one? 
    It could be enabled with a flag passed to plpy_execute() I guess.  Or maybe even just automatically enabled for that function, since it shouldn't affect anything if the `TransAggDetail` (or whatever keyword we choose) doesn't come back with the exception.




----------------------------------------------------------------



[GitHub] [madlib] fmcquillan99 commented on pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
fmcquillan99 commented on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-736941725


   
   re-testing...
   
   (1)
   initial tests for functionality - keras_fit()
   
   After upgrading to TF 1.14 and re-running a test with XLA auto-cluster JIT optimization, I did not see the warning message anymore.  Also, I saw 30% training time improvement from 237 sec/iteration to 166 sec/iteration for the query, which is great.
   
   
   (2)
   initial tests for functionality - keras_fit_multiple_model()
   
   first I started with single segment:
   
   This query no longer hits the GPU error, but it still errors out after several minutes.  Also, please turn off the verbose output:
   ```
   NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "__madlib_temp_model_output49543509_1606874577_33655832___pkey" for table "__madlib_temp_model_output49543509_1606874577_33655832__"
   CONTEXT:  SQL statement "
                                       CREATE TABLE __madlib_temp_model_output49543509_1606874577_33655832__
                                       (mst_key INTEGER,
                                        model_weights BYTEA,
                                        model_arch JSON,
                                        compile_params TEXT,
                                        fit_params TEXT,
                                        object_map BYTEA,
                                        __dist_key__ INTEGER,
                                        PRIMARY KEY (__dist_key__, mst_key)
                                       )
                                       DISTRIBUTED BY (__dist_key__)
                                       "
   PL/Python function "madlib_keras_fit_multiple_model"
   NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "cifar10_multi_model_info_pkey" for table "cifar10_multi_model_info"
   CONTEXT:  SQL statement "
               CREATE TABLE cifar10_multi_model_info (
                   mst_key INTEGER PRIMARY KEY,
                   model_id INTEGER,
                   compile_params TEXT,
                   fit_params TEXT,
                   model_type TEXT,
                   model_size DOUBLE PRECISION,
                   metrics_elapsed_time DOUBLE PRECISION[],
                   metrics_type TEXT[],
                   loss_type TEXT,
                   training_metrics_final DOUBLE PRECISION,
                   training_loss_final DOUBLE PRECISION,
                   training_metrics DOUBLE PRECISION[],
                   training_loss DOUBLE PRECISION[],
                   validation_metrics_final DOUBLE PRECISION,
                   validation_loss_final DOUBLE PRECISION,
                   validation_metrics DOUBLE PRECISION[],
                   validation_loss DOUBLE PRECISION[]
              ) "
   PL/Python function "madlib_keras_fit_multiple_model"
   NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'mst_key' as the Greenplum Database data distribution key for this table.
   HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
   CONTEXT:  SQL statement "
                   CREATE TABLE __madlib_temp_next_schedule42211658_1606874876_919189__ AS
                       SELECT
                           mst_key,
                           model_id,
                           __dist_key__ AS __prev_dist_key__,
                           COALESCE(
                               LEAD(__dist_key__)
                                   OVER(ORDER BY __dist_key__),
                               FIRST_VALUE(__dist_key__)
                                   OVER(ORDER BY __dist_key__)
                           ) AS __dist_key__
                       FROM __madlib_temp_schedule9385959_1606874577_2407209__;
               "
   PL/Python function "madlib_keras_fit_multiple_model"
   ERROR:  plpy.SPIError: AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string' (plpython.c:5038)  (seg0 slice1 10.128.0.41:40000 pid=9278) (plpython.c:5038)
   DETAIL:  Message skipped due to incorrect encoding.
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
       fit_obj.fit_multiple_model()
     PL/Python function "madlib_keras_fit_multiple_model", line 247, in fit_multiple_model
     PL/Python function "madlib_keras_fit_multiple_model", line 288, in train_multiple_model
     PL/Python function "madlib_keras_fit_multiple_model", line 929, in run_training
   PL/Python function "madlib_keras_fit_multiple_model"
   ```
   
   (3)
   initial tests for functionality - keras_fit_multiple_model()
   
   next I used 2 segments:
   
   This query errors out after several minutes.
   
   
   ```
   NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "__madlib_temp_model_output49543509_1606875004_33656259___pkey" for table "__madlib_temp_model_output49543509_1606875004_33656259__"
   CONTEXT:  SQL statement "
                                       CREATE TABLE __madlib_temp_model_output49543509_1606875004_33656259__
                                       (mst_key INTEGER,
                                        model_weights BYTEA,
                                        model_arch JSON,
                                        compile_params TEXT,
                                        fit_params TEXT,
                                        object_map BYTEA,
                                        __dist_key__ INTEGER,
                                        PRIMARY KEY (__dist_key__, mst_key)
                                       )
                                       DISTRIBUTED BY (__dist_key__)
                                       "
   PL/Python function "madlib_keras_fit_multiple_model"
   NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "cifar10_multi_model_info_pkey" for table "cifar10_multi_model_info"
   CONTEXT:  SQL statement "
               CREATE TABLE cifar10_multi_model_info (
                   mst_key INTEGER PRIMARY KEY,
                   model_id INTEGER,
                   compile_params TEXT,
                   fit_params TEXT,
                   model_type TEXT,
                   model_size DOUBLE PRECISION,
                   metrics_elapsed_time DOUBLE PRECISION[],
                   metrics_type TEXT[],
                   loss_type TEXT,
                   training_metrics_final DOUBLE PRECISION,
                   training_loss_final DOUBLE PRECISION,
                   training_metrics DOUBLE PRECISION[],
                   training_loss DOUBLE PRECISION[],
                   validation_metrics_final DOUBLE PRECISION,
                   validation_loss_final DOUBLE PRECISION,
                   validation_metrics DOUBLE PRECISION[],
                   validation_loss DOUBLE PRECISION[]
              ) "
   PL/Python function "madlib_keras_fit_multiple_model"
   ERROR:  plpy.SPIError: AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string' (plpython.c:5038)  (seg0 slice1 10.128.0.41:40000 pid=9489) (plpython.c:5038)
   DETAIL:  :
   Traceback (most recent call last):
     File "<string>", line 15, in __plpython_procedure_fit_transition_multiple_model_1572293
     File "/home/gpadmin/madlib/build/src/ports/greenplum/5/modules/deep_learning/madlib_keras.py", line 545, in fit_transition
       custom_function_map)
     File "/home/gpadmin/madlib/build/src/ports/greenplum/5/modules/deep_learning/madlib_keras.py", line 90, in get_init_model_and_sess
       segment_model = init_model(model_architecture, compile_params, custom_function_map)
     File "/home/gpadmin/madlib/build/src/ports/greenplum/5/modules/deep_learning/madlib_keras.py", line 497, in init_model
       segment_model = model_from_json(model_architecture)
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/engine/saving.py", line 492, in model_from_json
       return deserialize(config, custom_objects=custom_objects)
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/layers/__init__.py", line 55, in deserialize
       printable_module_name='layer')
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/utils/generic_utils.py", line 145, in deserialize_keras_object
       list(custom_objects.items())))
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/engine/sequential.py", line 301, in from_config
       model.add(layer)
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/engine/sequential.py", line 181, in add
       output_tensor = layer(self.outputs[0])
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/engine/base_layer.py", line 457, in __call__
       output = self.call(inputs, **kwargs)
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/layers/normalization.py", line 185, in call
       epsilon=self.epsilon)
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 1858, in normalize_batch_in_training
       if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 291, in _has_nchw_support
       explicitly_on_cpu = _is_current_explicit_device('CPU')
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 266, in _is_current_explicit_device
       device = _get_current_tf_device()
     File "/home/gpadmin/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 247, in _get_current_tf_device
       g._apply_device_functions(op)
     File "/usr/local/greenplum-db/ext/python/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
       op._set_device_from_string(device_string)
   AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
   Traceback (most recent call last):
     PL/Python function "fit_transition_multiple_model", line 20, in <module>
       raise e
   PL/Python function "fit_transition_multiple_model"
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
       fit_obj.fit_multiple_model()
     PL/Python function "madlib_keras_fit_multiple_model", line 247, in fit_multiple_model
     PL/Python function "madlib_keras_fit_multiple_model", line 288, in train_multiple_model
     PL/Python function "madlib_keras_fit_multiple_model", line 929, in run_training
   PL/Python function "madlib_keras_fit_multiple_model"
   ```
   


----------------------------------------------------------------



[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r537800172



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -196,72 +195,110 @@ class FitMultipleModel():
             self.dist_key_mapping_valid, self.images_per_seg_valid = \
                 get_image_count_per_seg_for_minibatched_data_from_db(
                     self.validation_table)
-        self.mst_weights_tbl = unique_string(desp='mst_weights')
-        self.mst_current_schedule_tbl = unique_string(desp='mst_current_schedule')
+        self.model_input_tbl = unique_string(desp='model_input')
+        self.schedule_tbl = unique_string(desp='schedule')
 
-        self.dist_keys = query_dist_keys(self.source_table, dist_key_col)
-        if len(self.msts) < len(self.dist_keys):
+        self.dist_keys = query_dist_keys(self.source_table, self.dist_key_col)
+        DEBUG.plpy.info("init_dist_keys = {0}".format(self.dist_keys))
+        self.max_dist_key = sorted(self.dist_keys)[-1]
+        DEBUG.plpy.info("sorted_dist_keys = {0}".format(sorted(self.dist_keys)))
+        DEBUG.plpy.info("max_dist_key = {0}".format(self.max_dist_key))
+        self.extra_dist_keys = []
+
+        num_msts = len(self.msts)
+        num_dist_keys = len(self.dist_keys)
+
+        if num_msts < num_dist_keys:
             self.msts_for_schedule = self.msts + [None] * \
-                                     (len(self.dist_keys) - len(self.msts))
+                                     (num_dist_keys - num_msts)
         else:
             self.msts_for_schedule = self.msts
+            if num_msts > num_dist_keys:
+                for i in range(num_msts - num_dist_keys):
+                    self.extra_dist_keys.append(self.max_dist_key + 1 + i)
+
+        DEBUG.plpy.info('dist_keys : {}'.format(self.dist_keys))
+        DEBUG.plpy.info('extra_dist_keys : {}'.format(self.extra_dist_keys))
+
         random.shuffle(self.msts_for_schedule)
-        self.grand_schedule = self.generate_schedule(self.msts_for_schedule)
-        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
-        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
-        if self.warm_start:
-            self.create_model_output_table_warm_start()
-        else:
-            self.create_model_output_table()
+        # Comma-separated list of the mst_keys, including NULL's
+        #  This will be used to pass the mst keys to the db as
+        #  a sql ARRAY[]
+        self.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                for mst in self.msts_for_schedule ]
 
-        self.weights_to_update_tbl = unique_string(desp='weights_to_update')
-        self.fit_multiple_model()
+        # List of all dist_keys, including any extra dist keys beyond
+        #  the # segments we'll be training on--these represent the
+        #  segments models will rest on while not training, which
+        #  may overlap with the ones that will have training on them.
+        self.all_dist_keys = self.dist_keys + self.extra_dist_keys
 
-        # Update and cleanup metadata tables
-        self.insert_info_table()
-        self.create_model_summary_table()
-        if self.warm_start:
-            self.cleanup_for_warm_start()
-        reset_cuda_env(original_cuda_env)
+        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
+        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
     def fit_multiple_model(self):
+        self.init_schedule_tbl()
+        self.init_model_output_tbl()
+        self.init_model_info_tbl()
+
         # WARNING: set orca off to prevent unwanted redistribution
         with OptimizerControl(False):
             self.start_training_time = datetime.datetime.now()
             self.metrics_elapsed_start_time = time.time()
             self.train_multiple_model()
             self.end_training_time = datetime.datetime.now()
 
-    def cleanup_for_warm_start(self):
+        # Update and cleanup metadata tables
+        self.insert_info_table()
+        self.create_model_summary_table()
+        self.write_final_model_output_tbl()
+        reset_cuda_env(self.original_cuda_env)
+
+    def write_final_model_output_tbl(self):
         """
-        1. drop original model table
+        1. drop original model table if exists
         2. rename temp to original
         :return:
         """
-        drop_query = "DROP TABLE IF EXISTS {}".format(
-            self.original_model_output_table)
-        plpy.execute(drop_query)
-        rename_table(self.schema_madlib, self.model_output_table,
-                     self.original_model_output_table)
+        final_output_table_create_query = """
+                                    DROP TABLE IF EXISTS {self.original_model_output_tbl};
+                                    CREATE TABLE {self.original_model_output_tbl} AS

Review comment:
       I'd consider changing it if there's an argument for the RENAME being faster, but I'm not sure it would be... I imagine the VACUUM might take just as long.  Either way, there shouldn't be any motion across segments since the DISTRIBUTED BY is the same on both.




----------------------------------------------------------------



[GitHub] [madlib] fmcquillan99 commented on pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
fmcquillan99 commented on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-731350733


   (1)
   initial tests for functionality - keras_fit()
   
   ```
   DROP TABLE IF EXISTS cifar_10_model, cifar_10_model_summary;
   SELECT madlib.madlib_keras_fit('cifar_10_train_data_packed_allseg',    -- source table
                                  'cifar_10_model',                -- model output table
                                  'model_arch_library',            -- model arch table
                                   1,                              -- model arch id
                                   $$ loss='categorical_crossentropy', optimizer='rmsprop(lr=0.0001, decay=1e-6)', metrics=['accuracy']$$,  -- compile_params
                                   $$ batch_size=32, epochs=3 $$,  -- fit_params
                                   3,                              -- num_iterations
                                   NULL,                          -- use GPUs
                                   'cifar_10_test_data_packed_allseg',    -- validation dataset 
                                   1                               -- metrics compute frequency 
                                 ); 
   ```
   produces warning:
   
   ```
   WARNING:  This version of tensorflow does not support XLA auto-cluster JIT optimization.  HINT:  upgrading tensorflow may improve performance.  (seg0 slice1 10.128.0.41:40000 pid=6270)
   CONTEXT:  PL/Python function "fit_transition"
   WARNING:  This version of tensorflow does not support XLA auto-cluster JIT optimization.  HINT:  upgrading tensorflow may improve performance.  (seg1 slice1 10.128.0.41:40001 pid=6271)
   CONTEXT:  PL/Python function "fit_transition"
   ```
   
   What does user need to do to enable XLA?  I am on TF 1.13.1 currently.
   
   Otherwise this ran fine, and warm start also seemed to work.
   
   
   (2)
   initial tests for functionality - keras_fit_multiple_model()
   
   first I started with single segment:
   ```
   SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM cifar_10_train_data_packed ORDER BY __dist_key__;
    __dist_key__ | independent_var_shape | dependent_var_shape | buffer_id 
   --------------+-----------------------+---------------------+-----------
               1 | {16667,32,32,3}       | {16667,10}          |         0
               1 | {16666,32,32,3}       | {16666,10}          |         2
               1 | {16667,32,32,3}       | {16667,10}          |         1
   (3 rows)
   
   SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM cifar_10_test_data_packed ORDER BY __dist_key__;
    __dist_key__ | independent_var_shape | dependent_var_shape | buffer_id 
   --------------+-----------------------+---------------------+-----------
               1 | {10000,32,32,3}       | {10000,10}          |         0
   (1 row)
   ```
   
   run multi fit:
   ```
   DROP TABLE IF EXISTS cifar10_multi_model, cifar10_multi_model_summary, cifar10_multi_model_info;
   SELECT madlib.madlib_keras_fit_multiple_model('cifar_10_train_data_packed',    -- source_table
                                                 'cifar10_multi_model',     -- model_output_table
                                                 'mst_table',               -- model_selection_table
                                                  3,                       -- num_iterations
                                                  NULL,                     -- use gpus
                                                 'cifar_10_test_data_packed',      -- validation dataset
                                                  1,                         -- metrics compute frequency
                                                  NULL,                      -- warm_start
                                                  'me',
                                                  'this is a test run'
                                                );
   ```
   
   produces error:
   ```
   ERROR:  plpy.Error: madlib_keras_fit_multiple_model error: No GPUs configured on hosts. (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
       fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
     PL/Python function "madlib_keras_fit_multiple_model", line 147, in __init__
     PL/Python function "madlib_keras_fit_multiple_model", line 295, in get_accessible_gpus_for_seg
   PL/Python function "madlib_keras_fit_multiple_model"
   
   ```
   
   so it looks like `use gpus=NULL` is now defaulting to `TRUE` but it should default to `FALSE` i.e., CPUs.  It used to default to CPUs.
   
   
   (3)
   initial tests for functionality - keras_fit_multiple_model()
   
   next I used 2 segments:
   ```
   SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM cifar_10_train_data_packed_allseg ORDER BY __dist_key__;
    __dist_key__ | independent_var_shape | dependent_var_shape | buffer_id 
   --------------+-----------------------+---------------------+-----------
               0 | {12500,32,32,3}       | {12500,10}          |         3
               0 | {12500,32,32,3}       | {12500,10}          |         1
               1 | {12500,32,32,3}       | {12500,10}          |         2
               1 | {12500,32,32,3}       | {12500,10}          |         0
   (4 rows)
   
   SELECT __dist_key__, independent_var_shape, dependent_var_shape, buffer_id FROM cifar_10_test_data_packed_allseg ORDER BY __dist_key__;
    __dist_key__ | independent_var_shape | dependent_var_shape | buffer_id 
   --------------+-----------------------+---------------------+-----------
               0 | {5000,32,32,3}        | {5000,10}           |         1
               1 | {5000,32,32,3}        | {5000,10}           |         0
   (2 rows)
   ```
   
   run multi fit:
   ```
   DROP TABLE IF EXISTS cifar10_multi_model, cifar10_multi_model_summary, cifar10_multi_model_info;
   SELECT madlib.madlib_keras_fit_multiple_model('cifar_10_train_data_packed_allseg',    -- source_table
                                                 'cifar10_multi_model',     -- model_output_table
                                                 'mst_table',               -- model_selection_table
                                                  3,                       -- num_iterations
                                                  NULL,                     -- use gpus
                                                 'cifar_10_test_data_packed_allseg',      -- validation dataset
                                                  1,                         -- metrics compute frequency
                                                  NULL,                      -- warm_start
                                                  'me',
                                                  'this is a test run'
                                                );
   ```
   
   which produced error:
   ```
   ERROR:  plpy.SPIError: PRIMARY KEY and DISTRIBUTED BY definitions incompatible
   HINT:  When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
       fit_obj.fit_multiple_model()
     PL/Python function "madlib_keras_fit_multiple_model", line 241, in fit_multiple_model
     PL/Python function "madlib_keras_fit_multiple_model", line 509, in init_model_output_tbl
   PL/Python function "madlib_keras_fit_multiple_model"
   ```
   
   I have also seen this error when running on a single segment.
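
   For reference, a minimal standalone illustration of the Greenplum rule behind this error (hypothetical table names, not what init_model_output_tbl actually creates):
   ```
   -- Fails with "PRIMARY KEY and DISTRIBUTED BY definitions incompatible":
   -- the distribution key is not part of the primary key.
   CREATE TABLE bad_example (
       mst_key      INTEGER PRIMARY KEY,
       __dist_key__ INTEGER
   ) DISTRIBUTED BY (__dist_key__);

   -- Works: the distribution key is a left-subset of the primary key.
   CREATE TABLE good_example (
       mst_key      INTEGER,
       __dist_key__ INTEGER,
       PRIMARY KEY (__dist_key__, mst_key)
   ) DISTRIBUTED BY (__dist_key__);
   ```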


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r560539401



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -629,149 +794,187 @@ class FitMultipleModel():
         # Therefore we want to have queries that do not add motions and all the
         # sub-queries running Keras/tensorflow operations reuse the same slice(process)
         # that was used for initializing GPU memory.
-        use_gpus = self.use_gpus if self.use_gpus else False
-        mst_weights_query = """
-            CREATE {self.unlogged_table} TABLE {self.mst_weights_tbl} AS
-                SELECT mst_tbl.*, wgh_tbl.{self.model_weights_col},
-                       model_arch_tbl.{self.model_arch_col}
-                FROM
-                    {self.mst_current_schedule_tbl} mst_tbl
-                    LEFT JOIN {self.model_output_table} wgh_tbl
-                    ON mst_tbl.{self.mst_key_col} = wgh_tbl.{self.mst_key_col}
-                        LEFT JOIN {self.model_arch_table} model_arch_tbl
-                        ON mst_tbl.{self.model_id_col} = model_arch_tbl.{self.model_id_col}
-                DISTRIBUTED BY ({dist_key_col})
-        """.format(dist_key_col=dist_key_col,
-                   **locals())
-        plpy.execute(mst_weights_query)
-        use_gpus = self.use_gpus if self.use_gpus else False
-        dep_shape_col = self.dep_shape_col
-        ind_shape_col = self.ind_shape_col
+
+        DEBUG.start_timing("run_training")
+        if hop > 0:
+            DEBUG.print_mst_keys(self.model_output_tbl, 'before_hop')
+            DEBUG.start_timing("hop")
+            hop_query = """
+                CREATE {self.unlogged_table} TABLE {self.model_input_tbl} AS
+                    SELECT o.{self.mst_key_col},
+                           o.{self.model_weights_col},
+                           o.{self.model_arch_col},
+                           o.{self.compile_params_col},
+                           o.{self.fit_params_col},
+                           o.{self.object_map_col},
+                           s.{self.dist_key_col}
+                    FROM {self.model_output_tbl} o JOIN {self.schedule_tbl} s
+                        ON o.{self.dist_key_col} = s.{self.prev_dist_key_col}
+                    DISTRIBUTED BY ({self.dist_key_col});
+            """.format(self=self)
+
+            DEBUG.plpy.execute(hop_query)
+
+            DEBUG.print_timing("hop")
+            DEBUG.print_mst_keys(self.model_input_tbl, 'after_hop')
+
+            DEBUG.start_timing("truncate_output")
+            self.truncate_and_drop(self.model_output_tbl)
+            DEBUG.print_timing("truncate_output")
+        else:
+            # Skip hop if it's the first in an iteration, just rename
+            plpy.execute("""
+                ALTER TABLE {self.model_output_tbl}
+                    RENAME TO {self.model_input_tbl}
+            """.format(self=self))
+ 
+        ind_shape = self.ind_shape_col
+        dep_shape = self.dep_shape_col
         dep_var = mb_dep_var_col
         indep_var = mb_indep_var_col
         source_table = self.source_table
-        where_clause = "WHERE {self.mst_weights_tbl}.{self.mst_key_col} IS NOT NULL".format(self=self)
+
         if self.use_caching:
             # Caching populates the independent_var and dependent_var into the cache on the very first hop
             # For the very_first_hop, we want to run the transition function on all segments, including
-            # the one's where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
+            # the ones where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
             # on mst_key. Once the cache is populated, with the independent_var and dependent_var values
             # for all subsequent hops pass independent_var and dependent_var as NULL's and use a dummy src
             # table to join for referencing the dist_key
             if is_very_first_hop:
                 plpy.execute("""
                     DROP TABLE IF EXISTS {self.cached_source_table};
-                    CREATE TABLE {self.cached_source_table} AS SELECT {dist_key_col} FROM {self.source_table} GROUP BY {dist_key_col} DISTRIBUTED BY({dist_key_col});
-                    """.format(self=self, dist_key_col=dist_key_col))
+                    CREATE TABLE {self.cached_source_table} AS
+                        SELECT {self.dist_key_col} FROM {self.source_table}
+                            GROUP BY {self.dist_key_col}
+                                DISTRIBUTED BY({self.dist_key_col});
+                    """.format(self=self))
             else:
-                dep_shape_col = 'ARRAY[0]'
-                ind_shape_col = 'ARRAY[0]'
-                dep_var = 'NULL'
-                indep_var = 'NULL'
+                dep_shape = ind_shape = 'NULL'
+                dep_var = indep_var = 'NULL'
                 source_table = self.cached_source_table
-            if is_very_first_hop or self.is_final_training_call:
-                where_clause = ""
-
-        uda_query = """
-            CREATE {self.unlogged_table} TABLE {self.weights_to_update_tbl} AS
-            SELECT {self.schema_madlib}.fit_step_multiple_model({mb_dep_var_col},
-                {mb_indep_var_col},
-                {dep_shape_col},
-                {ind_shape_col},
-                {self.mst_weights_tbl}.{self.model_arch_col}::TEXT,
-                {self.mst_weights_tbl}.{self.compile_params_col}::TEXT,
-                {self.mst_weights_tbl}.{self.fit_params_col}::TEXT,
-                src.{dist_key_col},
-                ARRAY{self.dist_key_mapping},
-                src.{self.gp_segment_id_col},
-                {self.segments_per_host},
-                ARRAY{self.images_per_seg_train},
-                {use_gpus}::BOOLEAN,
-                ARRAY{self.accessible_gpus_for_seg},
-                {self.mst_weights_tbl}.{self.model_weights_col}::BYTEA,
-                {is_final_training_call}::BOOLEAN,
-                {use_caching}::BOOLEAN,
-                {self.mst_weights_tbl}.{self.object_map_col}::BYTEA
-                )::BYTEA AS {self.model_weights_col},
-                {self.mst_weights_tbl}.{self.mst_key_col} AS {self.mst_key_col}
-                ,src.{dist_key_col} AS {dist_key_col}
-            FROM {source_table} src JOIN {self.mst_weights_tbl}
-                USING ({dist_key_col})
-            {where_clause}
-            GROUP BY src.{dist_key_col}, {self.mst_weights_tbl}.{self.mst_key_col}
-            DISTRIBUTED BY({dist_key_col})
-            """.format(mb_dep_var_col=dep_var,
-                       mb_indep_var_col=indep_var,
-                       dep_shape_col=dep_shape_col,
-                       ind_shape_col=ind_shape_col,
-                       is_final_training_call=self.is_final_training_call,
+
+        res = plpy.execute("""
+            SELECT count(*)
+            FROM {self.model_input_tbl}
+        """.format(self=self))
+        if res:
+            DEBUG.plpy.info("rows in model_input table: {}".format(res[0]['count']))
+        else:
+            DEBUG.plpy.error("No rows in model_input table!")
+
+#TODO: prepare this statement once, then just fill in the params with execute()
+#      on all the rest of the hops / iterations
+
+        DEBUG.start_timing("udf")
+        udf_query = plpy.prepare("""
+            CREATE {self.unlogged_table} TABLE {self.model_output_tbl} AS
+            SELECT
+                model_in.{self.mst_key_col},
+                CASE WHEN model_in.{self.dist_key_col} > {self.max_dist_key}
+                THEN
+                    model_in.{self.model_weights_col}
+                ELSE
+                    {self.schema_madlib}.fit_transition_multiple_model(
+                        {dep_var_col},
+                        {indep_var_col},
+                        {dep_shape},
+                        {ind_shape},
+                        model_in.{self.model_arch_col}::TEXT,
+                        model_in.{self.compile_params_col}::TEXT,
+                        model_in.{self.fit_params_col}::TEXT,
+                        src.{self.dist_key_col},
+                        ARRAY{self.dist_key_mapping},
+                        src.{self.gp_segment_id_col},
+                        {self.segments_per_host},
+                        ARRAY{self.images_per_seg_train},
+                        {self.use_gpus}::BOOLEAN,
+                        ARRAY{self.accessible_gpus_for_seg},
+                        model_in.{self.model_weights_col}::BYTEA,
+                        {self.is_final_training_call}::BOOLEAN,
+                        {use_caching}::BOOLEAN,
+                        model_in.{self.object_map_col}::BYTEA
+                    )
+                END::BYTEA AS {self.model_weights_col},
+                model_in.{self.model_arch_col},
+                model_in.{self.compile_params_col},
+                model_in.{self.fit_params_col},
+                model_in.{self.object_map_col},
+                model_in.{self.dist_key_col}
+            FROM {self.model_input_tbl} model_in
+                FULL JOIN {source_table} src
+                USING ({self.dist_key_col}) 
+            DISTRIBUTED BY({self.dist_key_col})
+            """.format(dep_var_col=dep_var,
+                       indep_var_col=indep_var,
+                       dep_shape=dep_shape,
+                       ind_shape=ind_shape,
                        use_caching=self.use_caching,
-                       dist_key_col=dist_key_col,
-                       use_gpus=use_gpus,
                        source_table=source_table,
-                       where_clause=where_clause,
                        self=self
                        )
-        plpy.execute(uda_query)
+        )
+
+        try:
+            plpy.execute(udf_query)
+        except plpy.SPIError as e:
+            msg = e.message
+            if not 'TransAggDetail' in msg:
+                raise e
+            e.message, detail = msg.split('TransAggDetail')
+            # Extract Traceback from segment, add to
+            #  DETAIL of error message on coordinator
+            e.args = (e.message,)
+            spidata = list(e.spidata)
+            spidata[1] = detail
+            e.spidata = tuple(spidata)
+            raise e
+
+        DEBUG.print_timing("udf")
+
+        res = plpy.execute("""
+            SELECT {self.mst_key_col} AS mst_key, {self.model_weights_col} IS NOT NULL AS weights
+                FROM {self.model_output_tbl}
+        """.format(self=self))
+        if res:
+            null_msts = len([None for row in res if row['mst_key'] is None])
+            null_weights = len([None for row in res if row['weights'] is False])
+            DEBUG.plpy.info(
+                "{} rows total ({} mst_key=NULL and {} weights=NULL) in model_output table."\
+                    .format(res.nrows(), null_msts, null_weights))
+        else:
+            plpy.error("No rows in output of UDF!")
 
-        update_query = """
-            UPDATE {self.model_output_table}
-            SET {self.model_weights_col} = {self.weights_to_update_tbl}.{self.model_weights_col}
-            FROM {self.weights_to_update_tbl}
-            WHERE {self.model_output_table}.{self.mst_key_col} = {self.weights_to_update_tbl}.{self.mst_key_col}
-        """.format(self=self)
-        plpy.execute(update_query)
+        plpy.execute("DELETE FROM {self.model_output_tbl} WHERE model_weights IS NULL".format(self=self))

Review comment:
       It would be interesting to figure out how much extra time a subquery would add, and whether there is a better, more efficient way to write this query. But that can be done in a future PR.
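
       One low-effort way to get a ballpark number (a sketch; substitute the generated temp table names) is to time the current DELETE under EXPLAIN ANALYZE in a rolled-back transaction, and time a subquery-based CTAS variant the same way:
   ```
   BEGIN;
   -- EXPLAIN ANALYZE actually executes the DML, so roll it back after reading the timing.
   EXPLAIN ANALYZE
       DELETE FROM model_output_example WHERE model_weights IS NULL;
   ROLLBACK;
   ```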




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r530005710



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -196,72 +195,110 @@ class FitMultipleModel():
             self.dist_key_mapping_valid, self.images_per_seg_valid = \
                 get_image_count_per_seg_for_minibatched_data_from_db(
                     self.validation_table)
-        self.mst_weights_tbl = unique_string(desp='mst_weights')
-        self.mst_current_schedule_tbl = unique_string(desp='mst_current_schedule')
+        self.model_input_tbl = unique_string(desp='model_input')
+        self.schedule_tbl = unique_string(desp='schedule')
 
-        self.dist_keys = query_dist_keys(self.source_table, dist_key_col)
-        if len(self.msts) < len(self.dist_keys):
+        self.dist_keys = query_dist_keys(self.source_table, self.dist_key_col)
+        DEBUG.plpy.info("init_dist_keys = {0}".format(self.dist_keys))
+        self.max_dist_key = sorted(self.dist_keys)[-1]
+        DEBUG.plpy.info("sorted_dist_keys = {0}".format(sorted(self.dist_keys)))
+        DEBUG.plpy.info("max_dist_key = {0}".format(self.max_dist_key))
+        self.extra_dist_keys = []
+
+        num_msts = len(self.msts)
+        num_dist_keys = len(self.dist_keys)
+
+        if num_msts < num_dist_keys:
             self.msts_for_schedule = self.msts + [None] * \
-                                     (len(self.dist_keys) - len(self.msts))
+                                     (num_dist_keys - num_msts)
         else:
             self.msts_for_schedule = self.msts
+            if num_msts > num_dist_keys:
+                for i in range(num_msts - num_dist_keys):
+                    self.extra_dist_keys.append(self.max_dist_key + 1 + i)
+
+        DEBUG.plpy.info('dist_keys : {}'.format(self.dist_keys))
+        DEBUG.plpy.info('extra_dist_keys : {}'.format(self.extra_dist_keys))
+
         random.shuffle(self.msts_for_schedule)
-        self.grand_schedule = self.generate_schedule(self.msts_for_schedule)
-        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
-        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
-        if self.warm_start:
-            self.create_model_output_table_warm_start()
-        else:
-            self.create_model_output_table()
+        # Comma-separated list of the mst_keys, including NULL's
+        #  This will be used to pass the mst keys to the db as
+        #  a sql ARRAY[]
+        self.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                for mst in self.msts_for_schedule ]
 
-        self.weights_to_update_tbl = unique_string(desp='weights_to_update')
-        self.fit_multiple_model()
+        # List of all dist_keys, including any extra dist keys beyond
+        #  the # segments we'll be training on--these represent the
+        #  segments models will rest on while not training, which
+        #  may overlap with the ones that will have training on them.
+        self.all_dist_keys = self.dist_keys + self.extra_dist_keys
 
-        # Update and cleanup metadata tables
-        self.insert_info_table()
-        self.create_model_summary_table()
-        if self.warm_start:
-            self.cleanup_for_warm_start()
-        reset_cuda_env(original_cuda_env)
+        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
+        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
     def fit_multiple_model(self):
+        self.init_schedule_tbl()
+        self.init_model_output_tbl()
+        self.init_model_info_tbl()
+
         # WARNING: set orca off to prevent unwanted redistribution
         with OptimizerControl(False):
             self.start_training_time = datetime.datetime.now()
             self.metrics_elapsed_start_time = time.time()
             self.train_multiple_model()
             self.end_training_time = datetime.datetime.now()
 
-    def cleanup_for_warm_start(self):
+        # Update and cleanup metadata tables
+        self.insert_info_table()
+        self.create_model_summary_table()
+        self.write_final_model_output_tbl()
+        reset_cuda_env(self.original_cuda_env)
+
+    def write_final_model_output_tbl(self):
         """
-        1. drop original model table
+        1. drop original model table if exists
         2. rename temp to original
         :return:
         """
-        drop_query = "DROP TABLE IF EXISTS {}".format(
-            self.original_model_output_table)
-        plpy.execute(drop_query)
-        rename_table(self.schema_madlib, self.model_output_table,
-                     self.original_model_output_table)
+        final_output_table_create_query = """
+                                    DROP TABLE IF EXISTS {self.original_model_output_tbl};
+                                    CREATE TABLE {self.original_model_output_tbl} AS

Review comment:
       Why not drop the original model output table and then rename `model_output_tbl` to `original_model_output_tbl`?
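
       Roughly what that alternative looks like in SQL (a sketch with placeholder names; the real code would go through the generated table names / the rename_table helper):
   ```
   DROP TABLE IF EXISTS original_model_output_tbl;
   ALTER TABLE model_output_tbl RENAME TO original_model_output_tbl;
   ```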

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -629,149 +794,187 @@ class FitMultipleModel():
         # Therefore we want to have queries that do not add motions and all the
         # sub-queries running Keras/tensorflow operations reuse the same slice(process)
         # that was used for initializing GPU memory.
-        use_gpus = self.use_gpus if self.use_gpus else False
-        mst_weights_query = """
-            CREATE {self.unlogged_table} TABLE {self.mst_weights_tbl} AS
-                SELECT mst_tbl.*, wgh_tbl.{self.model_weights_col},
-                       model_arch_tbl.{self.model_arch_col}
-                FROM
-                    {self.mst_current_schedule_tbl} mst_tbl
-                    LEFT JOIN {self.model_output_table} wgh_tbl
-                    ON mst_tbl.{self.mst_key_col} = wgh_tbl.{self.mst_key_col}
-                        LEFT JOIN {self.model_arch_table} model_arch_tbl
-                        ON mst_tbl.{self.model_id_col} = model_arch_tbl.{self.model_id_col}
-                DISTRIBUTED BY ({dist_key_col})
-        """.format(dist_key_col=dist_key_col,
-                   **locals())
-        plpy.execute(mst_weights_query)
-        use_gpus = self.use_gpus if self.use_gpus else False
-        dep_shape_col = self.dep_shape_col
-        ind_shape_col = self.ind_shape_col
+
+        DEBUG.start_timing("run_training")
+        if hop > 0:
+            DEBUG.print_mst_keys(self.model_output_tbl, 'before_hop')
+            DEBUG.start_timing("hop")
+            hop_query = """
+                CREATE {self.unlogged_table} TABLE {self.model_input_tbl} AS
+                    SELECT o.{self.mst_key_col},
+                           o.{self.model_weights_col},
+                           o.{self.model_arch_col},
+                           o.{self.compile_params_col},
+                           o.{self.fit_params_col},
+                           o.{self.object_map_col},
+                           s.{self.dist_key_col}
+                    FROM {self.model_output_tbl} o JOIN {self.schedule_tbl} s
+                        ON o.{self.dist_key_col} = s.{self.prev_dist_key_col}
+                    DISTRIBUTED BY ({self.dist_key_col});
+            """.format(self=self)
+
+            DEBUG.plpy.execute(hop_query)
+
+            DEBUG.print_timing("hop")
+            DEBUG.print_mst_keys(self.model_input_tbl, 'after_hop')
+
+            DEBUG.start_timing("truncate_output")
+            self.truncate_and_drop(self.model_output_tbl)
+            DEBUG.print_timing("truncate_output")
+        else:
+            # Skip hop if it's the first in an iteration, just rename
+            plpy.execute("""
+                ALTER TABLE {self.model_output_tbl}
+                    RENAME TO {self.model_input_tbl}
+            """.format(self=self))
+ 
+        ind_shape = self.ind_shape_col
+        dep_shape = self.dep_shape_col
         dep_var = mb_dep_var_col
         indep_var = mb_indep_var_col
         source_table = self.source_table
-        where_clause = "WHERE {self.mst_weights_tbl}.{self.mst_key_col} IS NOT NULL".format(self=self)
+
         if self.use_caching:
             # Caching populates the independent_var and dependent_var into the cache on the very first hop
             # For the very_first_hop, we want to run the transition function on all segments, including
-            # the one's where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
+            # the ones where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
             # on mst_key. Once the cache is populated, with the independent_var and dependent_var values
             # for all subsequent hops pass independent_var and dependent_var as NULL's and use a dummy src
             # table to join for referencing the dist_key
             if is_very_first_hop:
                 plpy.execute("""
                     DROP TABLE IF EXISTS {self.cached_source_table};
-                    CREATE TABLE {self.cached_source_table} AS SELECT {dist_key_col} FROM {self.source_table} GROUP BY {dist_key_col} DISTRIBUTED BY({dist_key_col});
-                    """.format(self=self, dist_key_col=dist_key_col))
+                    CREATE TABLE {self.cached_source_table} AS
+                        SELECT {self.dist_key_col} FROM {self.source_table}
+                            GROUP BY {self.dist_key_col}
+                                DISTRIBUTED BY({self.dist_key_col});
+                    """.format(self=self))
             else:
-                dep_shape_col = 'ARRAY[0]'
-                ind_shape_col = 'ARRAY[0]'
-                dep_var = 'NULL'
-                indep_var = 'NULL'
+                dep_shape = ind_shape = 'NULL'
+                dep_var = indep_var = 'NULL'
                 source_table = self.cached_source_table
-            if is_very_first_hop or self.is_final_training_call:
-                where_clause = ""
-
-        uda_query = """
-            CREATE {self.unlogged_table} TABLE {self.weights_to_update_tbl} AS
-            SELECT {self.schema_madlib}.fit_step_multiple_model({mb_dep_var_col},
-                {mb_indep_var_col},
-                {dep_shape_col},
-                {ind_shape_col},
-                {self.mst_weights_tbl}.{self.model_arch_col}::TEXT,
-                {self.mst_weights_tbl}.{self.compile_params_col}::TEXT,
-                {self.mst_weights_tbl}.{self.fit_params_col}::TEXT,
-                src.{dist_key_col},
-                ARRAY{self.dist_key_mapping},
-                src.{self.gp_segment_id_col},
-                {self.segments_per_host},
-                ARRAY{self.images_per_seg_train},
-                {use_gpus}::BOOLEAN,
-                ARRAY{self.accessible_gpus_for_seg},
-                {self.mst_weights_tbl}.{self.model_weights_col}::BYTEA,
-                {is_final_training_call}::BOOLEAN,
-                {use_caching}::BOOLEAN,
-                {self.mst_weights_tbl}.{self.object_map_col}::BYTEA
-                )::BYTEA AS {self.model_weights_col},
-                {self.mst_weights_tbl}.{self.mst_key_col} AS {self.mst_key_col}
-                ,src.{dist_key_col} AS {dist_key_col}
-            FROM {source_table} src JOIN {self.mst_weights_tbl}
-                USING ({dist_key_col})
-            {where_clause}
-            GROUP BY src.{dist_key_col}, {self.mst_weights_tbl}.{self.mst_key_col}
-            DISTRIBUTED BY({dist_key_col})
-            """.format(mb_dep_var_col=dep_var,
-                       mb_indep_var_col=indep_var,
-                       dep_shape_col=dep_shape_col,
-                       ind_shape_col=ind_shape_col,
-                       is_final_training_call=self.is_final_training_call,
+
+        res = plpy.execute("""

Review comment:
       Is this plpy.execute only for the plpy.info commands? If so, maybe we should consider whether we really need to run this plpy.execute and the plpy.info prints.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_wrapper.py_in
##########
@@ -64,6 +64,13 @@ def reset_cuda_env(value):
         if CUDA_VISIBLE_DEVICES_KEY in os.environ:
             del os.environ[CUDA_VISIBLE_DEVICES_KEY]
 
+def enable_xla():
+    os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit'
+    try:
+        K.tf.config.optimizer.set_jit(True)
+    except:
+        plpy.warning("This version of tensorflow does not support XLA auto-cluster JIT optimization.  HINT:  upgrading tensorflow may improve performance.")

Review comment:
       Maybe we should be more specific with the upgrade hint, since we don't support TF 2.0 yet.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -518,111 +567,130 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
     # Fit segment model on data
     #TODO consider not doing this every time
     fit_params = parse_and_validate_fit_params(fit_params)
-    segment_model.fit(x_train, y_train, **fit_params)
+    with K.tf.device(device_name):
+        segment_model.fit(x_train, y_train, **fit_params)
 
     # Aggregating number of images, loss and accuracy
     agg_image_count += len(x_train)
+    SD[SD_STORE.AGG_IMAGE_COUNT] = agg_image_count
     total_images = get_image_count_per_seg_from_array(dist_key_mapping.index(dist_key),
                                                       images_per_seg)
     is_last_row = agg_image_count == total_images
     return_state = get_state_to_return(segment_model, is_last_row, is_multiple_model,
                                        agg_image_count, total_images)
+
     if is_last_row:
+        SD[SD_STORE.AGG_IMAGE_COUNT] = 0  # Must be reset after each pass through images
         if is_final_iteration or is_multiple_model:
             SD_STORE.clear_SD(SD)
             clear_keras_session(sess)
 
+    trans_exit_time = time.time()
+    DEBUG.plpy.info("|_fit_transition_time_|{}|".format(trans_exit_time - trans_enter_time))
+
+    SD[SD_STORE.TRANS_EXIT_TIME] = trans_exit_time
     return return_state
 
-def fit_multiple_transition_caching(state, dependent_var, independent_var, dependent_var_shape,
-                             independent_var_shape, model_architecture,
-                             compile_params, fit_params, dist_key, dist_key_mapping,
-                             current_seg_id, segments_per_host, images_per_seg, use_gpus,
-                             accessible_gpus_for_seg, prev_serialized_weights,
-                             is_final_training_call, custom_function_map=None, **kwargs):
+def fit_multiple_transition_caching(
+    dependent_var, independent_var, dependent_var_shape, independent_var_shape,
+    model_architecture, compile_params, fit_params, dist_key, dist_key_mapping,
+    current_seg_id, segments_per_host, images_per_seg, use_gpus, accessible_gpus_for_seg,
+    serialized_weights, is_final_training_call, custom_function_map=None, **kwargs):
     """
     This transition function is called when caching is called for
     madlib_keras_fit_multiple_model().
-    The input params: dependent_var, independent_var are passed in
-    as None and dependent_var_shape, independent_var_shape as [0]
-    for all hops except the very first hop
+    The input params: dependent_var, independent_var,
+    dependent_var_shape and independent_var_shape are passed
+    in as None for all hops except the very first hop
     Some things to note in this function are:
-    - prev_serialized_weights can be passed in as None for the
+    - weights can be passed in as None for the
       very first hop and the final training call
     - x_train, y_train and cache_set is cleared from SD for
-      final_training_call = TRUE
+      is_final_training_call = True
     """
-    if not state:
-        agg_image_count = 0
+    SD = kwargs['SD']
+
+    trans_enter_time = time.time()
+    trans_exit_time = None
+    if SD_STORE.TRANS_EXIT_TIME in SD:
+        trans_exit_time = SD[SD_STORE.TRANS_EXIT_TIME]
+        SD[SD_STORE.TRANS_EXIT_TIME] = trans_enter_time
+
+    if SD_STORE.AGG_IMAGE_COUNT in SD:
+        agg_image_count = SD[SD_STORE.AGG_IMAGE_COUNT]
     else:
-        agg_image_count = float(state)
+        agg_image_count = 0
+        SD[SD_STORE.AGG_IMAGE_COUNT] = agg_image_count
 
-    SD = kwargs['SD']
-    is_cache_set = 'cache_set' in SD
+    if agg_image_count > 0:
+        if trans_exit_time:
+            DEBUG.plpy.info("|_gpdb_btw_rows_time_|{}|".format(trans_enter_time - trans_exit_time))
+    else:
+        if trans_exit_time:
+            DEBUG.plpy.info("|_gpdb_btw_hops_end_time_|{}|".format(trans_enter_time - trans_exit_time))
 
     # Prepare the data
-    if is_cache_set:
+    if dependent_var_shape is None:
         if 'x_train' not in SD or 'y_train' not in SD:
             plpy.error("cache not populated properly.")
-        total_images = None
         is_last_row = True
+        total_images = None
     else:
-        if not independent_var or not dependent_var:
-            return state
-        if 'x_train' not in SD:
+        if 'x_train' not in SD or 'y_train' not in SD:
             SD['x_train'] = list()
             SD['y_train'] = list()
+
         agg_image_count += independent_var_shape[0]
-        total_images = get_image_count_per_seg_from_array(dist_key_mapping.index(dist_key),
-                                                          images_per_seg)
+        SD[SD_STORE.AGG_IMAGE_COUNT] = agg_image_count
+        total_images = get_image_count_per_seg_from_array(
+            dist_key_mapping.index(dist_key), images_per_seg
+        )
         is_last_row = agg_image_count == total_images
-        if is_last_row:
-            SD['cache_set'] = True
         x_train_current = np_array_float32(independent_var, independent_var_shape)
         y_train_current = np_array_int16(dependent_var, dependent_var_shape)
         SD['x_train'].append(x_train_current)
         SD['y_train'].append(y_train_current)
 
     # Passed in weights can be None. Irrespective of the weights, we want to populate the cache for the very first hop.
     # But if the weights are None, we do not want to set any model. So early return in that case
-    if prev_serialized_weights is None:
+    if serialized_weights is None:
         if is_final_training_call:
             del SD['x_train']
             del SD['y_train']
-            del SD['cache_set']
-        return float(agg_image_count)
+        return None
 
     segment_model = None
     if is_last_row:
         device_name = get_device_name_and_set_cuda_env(accessible_gpus_for_seg[current_seg_id], current_seg_id)
-        segment_model, sess = get_init_model_and_sess(SD, device_name,
-                                                      accessible_gpus_for_seg[current_seg_id],
-                                                      segments_per_host,
-                                                      model_architecture, compile_params,
-                                                      custom_function_map)
-        set_model_weights(segment_model, prev_serialized_weights)
+        with K.tf.device(device_name):

Review comment:
       We already wrap the code with `K.tf.device(device_name)` inside the `get_init_model_and_sess` function, so we don't really need to do it here as well.
   

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
##########
@@ -1793,13 +1793,23 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.fit_transition(
     segments_per_host           INTEGER,
     images_per_seg              INTEGER[],
     use_gpus                    BOOLEAN,
-    accessible_gpus_for_seg                INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
     prev_serialized_weights     BYTEA,
     is_final_iteration          BOOLEAN,
-    custom_function_map        BYTEA
+    custom_function_map         BYTEA
 ) RETURNS BYTEA AS $$
 PythonFunctionBodyOnlyNoSchema(`deep_learning', `madlib_keras')
-    return madlib_keras.fit_transition(**globals())
+    import traceback
+    from sys import exc_info
+    import plpy
+    try:
+        return madlib_keras.fit_transition(**globals())
+    except Exception as e:
+        etype, _, tb = exc_info()
+        detail = ''.join(traceback.format_exception(etype, e, tb))
+        message = e.args[0] + 'TransAggDetail' + detail

Review comment:
       Instead of adding `TransAggDetail` to the error message, why not use a keyword that is more relevant to the fit multiple UDF? If there is a specific reason for this keyword, I would recommend documenting it.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -629,149 +794,187 @@ class FitMultipleModel():
         # Therefore we want to have queries that do not add motions and all the
         # sub-queries running Keras/tensorflow operations reuse the same slice(process)
         # that was used for initializing GPU memory.
-        use_gpus = self.use_gpus if self.use_gpus else False
-        mst_weights_query = """
-            CREATE {self.unlogged_table} TABLE {self.mst_weights_tbl} AS
-                SELECT mst_tbl.*, wgh_tbl.{self.model_weights_col},
-                       model_arch_tbl.{self.model_arch_col}
-                FROM
-                    {self.mst_current_schedule_tbl} mst_tbl
-                    LEFT JOIN {self.model_output_table} wgh_tbl
-                    ON mst_tbl.{self.mst_key_col} = wgh_tbl.{self.mst_key_col}
-                        LEFT JOIN {self.model_arch_table} model_arch_tbl
-                        ON mst_tbl.{self.model_id_col} = model_arch_tbl.{self.model_id_col}
-                DISTRIBUTED BY ({dist_key_col})
-        """.format(dist_key_col=dist_key_col,
-                   **locals())
-        plpy.execute(mst_weights_query)
-        use_gpus = self.use_gpus if self.use_gpus else False
-        dep_shape_col = self.dep_shape_col
-        ind_shape_col = self.ind_shape_col
+
+        DEBUG.start_timing("run_training")
+        if hop > 0:
+            DEBUG.print_mst_keys(self.model_output_tbl, 'before_hop')
+            DEBUG.start_timing("hop")
+            hop_query = """
+                CREATE {self.unlogged_table} TABLE {self.model_input_tbl} AS
+                    SELECT o.{self.mst_key_col},
+                           o.{self.model_weights_col},
+                           o.{self.model_arch_col},
+                           o.{self.compile_params_col},
+                           o.{self.fit_params_col},
+                           o.{self.object_map_col},
+                           s.{self.dist_key_col}
+                    FROM {self.model_output_tbl} o JOIN {self.schedule_tbl} s
+                        ON o.{self.dist_key_col} = s.{self.prev_dist_key_col}
+                    DISTRIBUTED BY ({self.dist_key_col});
+            """.format(self=self)
+
+            DEBUG.plpy.execute(hop_query)
+
+            DEBUG.print_timing("hop")
+            DEBUG.print_mst_keys(self.model_input_tbl, 'after_hop')
+
+            DEBUG.start_timing("truncate_output")
+            self.truncate_and_drop(self.model_output_tbl)
+            DEBUG.print_timing("truncate_output")
+        else:
+            # Skip hop if it's the first in an iteration, just rename
+            plpy.execute("""
+                ALTER TABLE {self.model_output_tbl}
+                    RENAME TO {self.model_input_tbl}
+            """.format(self=self))
+ 
+        ind_shape = self.ind_shape_col
+        dep_shape = self.dep_shape_col
         dep_var = mb_dep_var_col
         indep_var = mb_indep_var_col
         source_table = self.source_table
-        where_clause = "WHERE {self.mst_weights_tbl}.{self.mst_key_col} IS NOT NULL".format(self=self)
+
         if self.use_caching:
             # Caching populates the independent_var and dependent_var into the cache on the very first hop
             # For the very_first_hop, we want to run the transition function on all segments, including
-            # the one's where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
+            # the ones where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
             # on mst_key. Once the cache is populated, with the independent_var and dependent_var values
             # for all subsequent hops pass independent_var and dependent_var as NULL's and use a dummy src
             # table to join for referencing the dist_key
             if is_very_first_hop:
                 plpy.execute("""
                     DROP TABLE IF EXISTS {self.cached_source_table};
-                    CREATE TABLE {self.cached_source_table} AS SELECT {dist_key_col} FROM {self.source_table} GROUP BY {dist_key_col} DISTRIBUTED BY({dist_key_col});
-                    """.format(self=self, dist_key_col=dist_key_col))
+                    CREATE TABLE {self.cached_source_table} AS
+                        SELECT {self.dist_key_col} FROM {self.source_table}
+                            GROUP BY {self.dist_key_col}
+                                DISTRIBUTED BY({self.dist_key_col});
+                    """.format(self=self))
             else:
-                dep_shape_col = 'ARRAY[0]'
-                ind_shape_col = 'ARRAY[0]'
-                dep_var = 'NULL'
-                indep_var = 'NULL'
+                dep_shape = ind_shape = 'NULL'
+                dep_var = indep_var = 'NULL'
                 source_table = self.cached_source_table
-            if is_very_first_hop or self.is_final_training_call:
-                where_clause = ""
-
-        uda_query = """
-            CREATE {self.unlogged_table} TABLE {self.weights_to_update_tbl} AS
-            SELECT {self.schema_madlib}.fit_step_multiple_model({mb_dep_var_col},
-                {mb_indep_var_col},
-                {dep_shape_col},
-                {ind_shape_col},
-                {self.mst_weights_tbl}.{self.model_arch_col}::TEXT,
-                {self.mst_weights_tbl}.{self.compile_params_col}::TEXT,
-                {self.mst_weights_tbl}.{self.fit_params_col}::TEXT,
-                src.{dist_key_col},
-                ARRAY{self.dist_key_mapping},
-                src.{self.gp_segment_id_col},
-                {self.segments_per_host},
-                ARRAY{self.images_per_seg_train},
-                {use_gpus}::BOOLEAN,
-                ARRAY{self.accessible_gpus_for_seg},
-                {self.mst_weights_tbl}.{self.model_weights_col}::BYTEA,
-                {is_final_training_call}::BOOLEAN,
-                {use_caching}::BOOLEAN,
-                {self.mst_weights_tbl}.{self.object_map_col}::BYTEA
-                )::BYTEA AS {self.model_weights_col},
-                {self.mst_weights_tbl}.{self.mst_key_col} AS {self.mst_key_col}
-                ,src.{dist_key_col} AS {dist_key_col}
-            FROM {source_table} src JOIN {self.mst_weights_tbl}
-                USING ({dist_key_col})
-            {where_clause}
-            GROUP BY src.{dist_key_col}, {self.mst_weights_tbl}.{self.mst_key_col}
-            DISTRIBUTED BY({dist_key_col})
-            """.format(mb_dep_var_col=dep_var,
-                       mb_indep_var_col=indep_var,
-                       dep_shape_col=dep_shape_col,
-                       ind_shape_col=ind_shape_col,
-                       is_final_training_call=self.is_final_training_call,
+
+        res = plpy.execute("""
+            SELECT count(*)
+            FROM {self.model_input_tbl}
+        """.format(self=self))
+        if res:
+            DEBUG.plpy.info("rows in model_input table: {}".format(res[0]['count']))
+        else:
+            DEBUG.plpy.error("No rows in model_input table!")
+
+#TODO: prepare this statement once, then just fill in the params with execute()
+#      on all the rest of the hops / iterations
+
+        DEBUG.start_timing("udf")
+        udf_query = plpy.prepare("""
+            CREATE {self.unlogged_table} TABLE {self.model_output_tbl} AS
+            SELECT
+                model_in.{self.mst_key_col},
+                CASE WHEN model_in.{self.dist_key_col} > {self.max_dist_key}
+                THEN
+                    model_in.{self.model_weights_col}
+                ELSE
+                    {self.schema_madlib}.fit_transition_multiple_model(
+                        {dep_var_col},
+                        {indep_var_col},
+                        {dep_shape},
+                        {ind_shape},
+                        model_in.{self.model_arch_col}::TEXT,
+                        model_in.{self.compile_params_col}::TEXT,
+                        model_in.{self.fit_params_col}::TEXT,
+                        src.{self.dist_key_col},
+                        ARRAY{self.dist_key_mapping},
+                        src.{self.gp_segment_id_col},
+                        {self.segments_per_host},
+                        ARRAY{self.images_per_seg_train},
+                        {self.use_gpus}::BOOLEAN,
+                        ARRAY{self.accessible_gpus_for_seg},
+                        model_in.{self.model_weights_col}::BYTEA,
+                        {self.is_final_training_call}::BOOLEAN,
+                        {use_caching}::BOOLEAN,
+                        model_in.{self.object_map_col}::BYTEA
+                    )
+                END::BYTEA AS {self.model_weights_col},
+                model_in.{self.model_arch_col},
+                model_in.{self.compile_params_col},
+                model_in.{self.fit_params_col},
+                model_in.{self.object_map_col},
+                model_in.{self.dist_key_col}
+            FROM {self.model_input_tbl} model_in
+                FULL JOIN {source_table} src
+                USING ({self.dist_key_col}) 
+            DISTRIBUTED BY({self.dist_key_col})
+            """.format(dep_var_col=dep_var,
+                       indep_var_col=indep_var,
+                       dep_shape=dep_shape,
+                       ind_shape=ind_shape,
                        use_caching=self.use_caching,
-                       dist_key_col=dist_key_col,
-                       use_gpus=use_gpus,
                        source_table=source_table,
-                       where_clause=where_clause,
                        self=self
                        )
-        plpy.execute(uda_query)
+        )
+
+        try:
+            plpy.execute(udf_query)
+        except plpy.SPIError as e:
+            msg = e.message
+            if not 'TransAggDetail' in msg:
+                raise e
+            e.message, detail = msg.split('TransAggDetail')
+            # Extract Traceback from segment, add to
+            #  DETAIL of error message on coordinator
+            e.args = (e.message,)
+            spidata = list(e.spidata)
+            spidata[1] = detail
+            e.spidata = tuple(spidata)
+            raise e
+
+        DEBUG.print_timing("udf")
+
+        res = plpy.execute("""

Review comment:
       Same as the previous comment:
   
   `Is this plpy.execute only for the plpy.info commands? If so, maybe we should consider whether we really need to run this plpy.execute and the plpy.info prints.`
   

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -629,149 +794,187 @@ class FitMultipleModel():
         # Therefore we want to have queries that do not add motions and all the
         # sub-queries running Keras/tensorflow operations reuse the same slice(process)
         # that was used for initializing GPU memory.
-        use_gpus = self.use_gpus if self.use_gpus else False
-        mst_weights_query = """
-            CREATE {self.unlogged_table} TABLE {self.mst_weights_tbl} AS
-                SELECT mst_tbl.*, wgh_tbl.{self.model_weights_col},
-                       model_arch_tbl.{self.model_arch_col}
-                FROM
-                    {self.mst_current_schedule_tbl} mst_tbl
-                    LEFT JOIN {self.model_output_table} wgh_tbl
-                    ON mst_tbl.{self.mst_key_col} = wgh_tbl.{self.mst_key_col}
-                        LEFT JOIN {self.model_arch_table} model_arch_tbl
-                        ON mst_tbl.{self.model_id_col} = model_arch_tbl.{self.model_id_col}
-                DISTRIBUTED BY ({dist_key_col})
-        """.format(dist_key_col=dist_key_col,
-                   **locals())
-        plpy.execute(mst_weights_query)
-        use_gpus = self.use_gpus if self.use_gpus else False
-        dep_shape_col = self.dep_shape_col
-        ind_shape_col = self.ind_shape_col
+
+        DEBUG.start_timing("run_training")
+        if hop > 0:
+            DEBUG.print_mst_keys(self.model_output_tbl, 'before_hop')
+            DEBUG.start_timing("hop")
+            hop_query = """
+                CREATE {self.unlogged_table} TABLE {self.model_input_tbl} AS
+                    SELECT o.{self.mst_key_col},
+                           o.{self.model_weights_col},
+                           o.{self.model_arch_col},
+                           o.{self.compile_params_col},
+                           o.{self.fit_params_col},
+                           o.{self.object_map_col},
+                           s.{self.dist_key_col}
+                    FROM {self.model_output_tbl} o JOIN {self.schedule_tbl} s
+                        ON o.{self.dist_key_col} = s.{self.prev_dist_key_col}
+                    DISTRIBUTED BY ({self.dist_key_col});
+            """.format(self=self)
+
+            DEBUG.plpy.execute(hop_query)
+
+            DEBUG.print_timing("hop")
+            DEBUG.print_mst_keys(self.model_input_tbl, 'after_hop')
+
+            DEBUG.start_timing("truncate_output")
+            self.truncate_and_drop(self.model_output_tbl)
+            DEBUG.print_timing("truncate_output")
+        else:
+            # Skip hop if it's the first in an iteration, just rename
+            plpy.execute("""
+                ALTER TABLE {self.model_output_tbl}
+                    RENAME TO {self.model_input_tbl}
+            """.format(self=self))
+ 
+        ind_shape = self.ind_shape_col
+        dep_shape = self.dep_shape_col
         dep_var = mb_dep_var_col
         indep_var = mb_indep_var_col
         source_table = self.source_table
-        where_clause = "WHERE {self.mst_weights_tbl}.{self.mst_key_col} IS NOT NULL".format(self=self)
+
         if self.use_caching:
             # Caching populates the independent_var and dependent_var into the cache on the very first hop
             # For the very_first_hop, we want to run the transition function on all segments, including
-            # the one's where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
+            # the ones where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
             # on mst_key. Once the cache is populated, with the independent_var and dependent_var values
             # for all subsequent hops pass independent_var and dependent_var as NULL's and use a dummy src
             # table to join for referencing the dist_key
             if is_very_first_hop:
                 plpy.execute("""
                     DROP TABLE IF EXISTS {self.cached_source_table};
-                    CREATE TABLE {self.cached_source_table} AS SELECT {dist_key_col} FROM {self.source_table} GROUP BY {dist_key_col} DISTRIBUTED BY({dist_key_col});
-                    """.format(self=self, dist_key_col=dist_key_col))
+                    CREATE TABLE {self.cached_source_table} AS
+                        SELECT {self.dist_key_col} FROM {self.source_table}
+                            GROUP BY {self.dist_key_col}
+                                DISTRIBUTED BY({self.dist_key_col});
+                    """.format(self=self))
             else:
-                dep_shape_col = 'ARRAY[0]'
-                ind_shape_col = 'ARRAY[0]'
-                dep_var = 'NULL'
-                indep_var = 'NULL'
+                dep_shape = ind_shape = 'NULL'
+                dep_var = indep_var = 'NULL'
                 source_table = self.cached_source_table
-            if is_very_first_hop or self.is_final_training_call:
-                where_clause = ""
-
-        uda_query = """
-            CREATE {self.unlogged_table} TABLE {self.weights_to_update_tbl} AS
-            SELECT {self.schema_madlib}.fit_step_multiple_model({mb_dep_var_col},
-                {mb_indep_var_col},
-                {dep_shape_col},
-                {ind_shape_col},
-                {self.mst_weights_tbl}.{self.model_arch_col}::TEXT,
-                {self.mst_weights_tbl}.{self.compile_params_col}::TEXT,
-                {self.mst_weights_tbl}.{self.fit_params_col}::TEXT,
-                src.{dist_key_col},
-                ARRAY{self.dist_key_mapping},
-                src.{self.gp_segment_id_col},
-                {self.segments_per_host},
-                ARRAY{self.images_per_seg_train},
-                {use_gpus}::BOOLEAN,
-                ARRAY{self.accessible_gpus_for_seg},
-                {self.mst_weights_tbl}.{self.model_weights_col}::BYTEA,
-                {is_final_training_call}::BOOLEAN,
-                {use_caching}::BOOLEAN,
-                {self.mst_weights_tbl}.{self.object_map_col}::BYTEA
-                )::BYTEA AS {self.model_weights_col},
-                {self.mst_weights_tbl}.{self.mst_key_col} AS {self.mst_key_col}
-                ,src.{dist_key_col} AS {dist_key_col}
-            FROM {source_table} src JOIN {self.mst_weights_tbl}
-                USING ({dist_key_col})
-            {where_clause}
-            GROUP BY src.{dist_key_col}, {self.mst_weights_tbl}.{self.mst_key_col}
-            DISTRIBUTED BY({dist_key_col})
-            """.format(mb_dep_var_col=dep_var,
-                       mb_indep_var_col=indep_var,
-                       dep_shape_col=dep_shape_col,
-                       ind_shape_col=ind_shape_col,
-                       is_final_training_call=self.is_final_training_call,
+
+        res = plpy.execute("""
+            SELECT count(*)
+            FROM {self.model_input_tbl}
+        """.format(self=self))
+        if res:
+            DEBUG.plpy.info("rows in model_input table: {}".format(res[0]['count']))
+        else:
+            DEBUG.plpy.error("No rows in model_input table!")
+
+#TODO: prepare this statement once, then just fill in the params with execute()
+#      on all the rest of the hops / iterations
+
+        DEBUG.start_timing("udf")
+        udf_query = plpy.prepare("""
+            CREATE {self.unlogged_table} TABLE {self.model_output_tbl} AS
+            SELECT
+                model_in.{self.mst_key_col},
+                CASE WHEN model_in.{self.dist_key_col} > {self.max_dist_key}
+                THEN
+                    model_in.{self.model_weights_col}
+                ELSE
+                    {self.schema_madlib}.fit_transition_multiple_model(
+                        {dep_var_col},
+                        {indep_var_col},
+                        {dep_shape},
+                        {ind_shape},
+                        model_in.{self.model_arch_col}::TEXT,
+                        model_in.{self.compile_params_col}::TEXT,
+                        model_in.{self.fit_params_col}::TEXT,
+                        src.{self.dist_key_col},
+                        ARRAY{self.dist_key_mapping},
+                        src.{self.gp_segment_id_col},
+                        {self.segments_per_host},
+                        ARRAY{self.images_per_seg_train},
+                        {self.use_gpus}::BOOLEAN,
+                        ARRAY{self.accessible_gpus_for_seg},
+                        model_in.{self.model_weights_col}::BYTEA,
+                        {self.is_final_training_call}::BOOLEAN,
+                        {use_caching}::BOOLEAN,
+                        model_in.{self.object_map_col}::BYTEA
+                    )
+                END::BYTEA AS {self.model_weights_col},
+                model_in.{self.model_arch_col},
+                model_in.{self.compile_params_col},
+                model_in.{self.fit_params_col},
+                model_in.{self.object_map_col},
+                model_in.{self.dist_key_col}
+            FROM {self.model_input_tbl} model_in
+                FULL JOIN {source_table} src
+                USING ({self.dist_key_col}) 
+            DISTRIBUTED BY({self.dist_key_col})
+            """.format(dep_var_col=dep_var,
+                       indep_var_col=indep_var,
+                       dep_shape=dep_shape,
+                       ind_shape=ind_shape,
                        use_caching=self.use_caching,
-                       dist_key_col=dist_key_col,
-                       use_gpus=use_gpus,
                        source_table=source_table,
-                       where_clause=where_clause,
                        self=self
                        )
-        plpy.execute(uda_query)
+        )
+
+        try:
+            plpy.execute(udf_query)
+        except plpy.SPIError as e:
+            msg = e.message
+            if not 'TransAggDetail' in msg:
+                raise e
+            e.message, detail = msg.split('TransAggDetail')
+            # Extract Traceback from segment, add to
+            #  DETAIL of error message on coordinator
+            e.args = (e.message,)
+            spidata = list(e.spidata)
+            spidata[1] = detail
+            e.spidata = tuple(spidata)
+            raise e
+
+        DEBUG.print_timing("udf")
+
+        res = plpy.execute("""
+            SELECT {self.mst_key_col} AS mst_key, {self.model_weights_col} IS NOT NULL AS weights
+                FROM {self.model_output_tbl}
+        """.format(self=self))
+        if res:
+            null_msts = len([None for row in res if row['mst_key'] is None])
+            null_weights = len([None for row in res if row['weights'] is False])
+            DEBUG.plpy.info(
+                "{} rows total ({} mst_key=NULL and {} weights=NULL) in model_output table."\
+                    .format(res.nrows(), null_msts, null_weights))
+        else:
+            plpy.error("No rows in output of UDF!")
 
-        update_query = """
-            UPDATE {self.model_output_table}
-            SET {self.model_weights_col} = {self.weights_to_update_tbl}.{self.model_weights_col}
-            FROM {self.weights_to_update_tbl}
-            WHERE {self.model_output_table}.{self.mst_key_col} = {self.weights_to_update_tbl}.{self.mst_key_col}
-        """.format(self=self)
-        plpy.execute(update_query)
+        plpy.execute("DELETE FROM {self.model_output_tbl} WHERE model_weights IS NULL".format(self=self))

Review comment:
       Instead of running a DELETE command, can't we filter out the NULL model_weights when executing the UDF query?
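
       A rough sketch of that idea (placeholder table/column names; the inner SELECT stands in for the UDF query above): wrap the UDF SELECT in a subquery so the NULL-weight rows produced by the FULL JOIN never land in the output table:
   ```
   CREATE UNLOGGED TABLE model_output_example AS
       SELECT *
       FROM (
           SELECT model_in.mst_key,
                  model_in.model_weights,   -- stands in for the CASE / fit_transition_multiple_model(...) expression
                  model_in.__dist_key__
           FROM model_input_example model_in
               FULL JOIN source_example src USING (__dist_key__)
       ) q
       WHERE q.model_weights IS NOT NULL
       DISTRIBUTED BY (__dist_key__);
   ```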

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)

Review comment:
       In the current implementation of `DEBUG.plpy.execute`, the query is still executed; only the plan and the timing are skipped.
   It's a bit confusing to see the word DEBUG in front of plpy.execute and then find out that the query runs regardless of whether the debug flag is turned on.

   So I would suggest the following:
   1. Reduce the number of `DEBUG.plpy.execute` calls to only a couple, maybe just the hop and the UDF query.
   2. Don't use the word DEBUG; a name like `plpy_execute` would not confuse the reader.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -495,21 +500,45 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         b. keras session is cleared at the end of the final iteration,
         i.e, last row of last iteration.
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:
+        plpy.error("fit_transition called with no data")
+
+    if not prev_serialized_weights or not model_architecture:
         return state
+
     SD = kwargs['SD']
+
+    trans_enter_time = time.time()
+
+    trans_exit_time = None
+    if SD_STORE.TRANS_EXIT_TIME in SD:
+        trans_exit_time = SD[SD_STORE.TRANS_EXIT_TIME]
+        SD[SD_STORE.TRANS_EXIT_TIME] = None
+
     device_name = get_device_name_and_set_cuda_env(accessible_gpus_for_seg[current_seg_id], current_seg_id)
 
-    segment_model, sess = get_init_model_and_sess(SD, device_name,
-                                                  accessible_gpus_for_seg[current_seg_id],
-                                                  segments_per_host,
-                                                  model_architecture, compile_params,
-                                                  custom_function_map)
-    if not state:
+    with K.tf.device(device_name):

Review comment:
       We already wrap the code in `with K.tf.device(device_name):` inside the `get_init_model_and_sess` function, so we don't really need to do it here as well.
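
   A rough sketch of the simplification being suggested, assuming the call keeps the same arguments it had before this diff (illustrative only):

       # get_init_model_and_sess() already enters the K.tf.device(device_name)
       # context internally, so the caller does not need its own
       # `with K.tf.device(device_name):` block around this section.
       segment_model, sess = get_init_model_and_sess(SD, device_name,
                                                     accessible_gpus_for_seg[current_seg_id],
                                                     segments_per_host,
                                                     model_architecture, compile_params,
                                                     custom_function_map)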

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +377,307 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
 
-    def create_model_output_table_warm_start(self):
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)
+
+        plpy.execute(self.rotate_schedule_tbl_plan)
+
+        self.truncate_and_drop(self.schedule_tbl)
+        plpy.execute("""
+            ALTER TABLE {self.next_schedule_tbl}
+            RENAME TO {self.schedule_tbl}
+        """.format(self=self))
+
+    def load_warm_start_weights(self):
         """
-        For warm start, we need to copy the model output table to a temp table
-        because we call truncate on the model output table while training.
-        If the query gets aborted, we need to make sure that the user passed
-        model output table can be recovered.
+        For warm start, we need to copy any rows of the model output
+        table provided by the user whose mst keys appear in the
+        supplied model selection table.  We also copy over the 
+        compile & fit params from the model_selection_table, and
+        the dist_key's from the schedule table.
         """
-        plpy.execute("""
-            CREATE TABLE {self.model_output_table} (
-            LIKE {self.original_model_output_table} INCLUDING indexes);
-            """.format(self=self))
+        load_warm_start_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    o.{self.model_weights_col},
+                    o.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.original_model_output_tbl} o
+                        USING ({self.mst_key_col})
+        """.format(self=self)
+        DEBUG.plpy.execute(load_warm_start_weights_query)
 
-        plpy.execute("""INSERT INTO {self.model_output_table}
-            SELECT * FROM {self.original_model_output_table};
-            """.format(self=self))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(self.model_info_tbl))
 
-        plpy.execute(""" DELETE FROM {self.model_output_table}
-                WHERE {self.mst_key_col} NOT IN (
-                    SELECT {self.mst_key_col} FROM {self.model_selection_table})
-                """.format(self=self))
-        self.warm_start_msts = plpy.execute(
-            """ SELECT array_agg({0}) AS a FROM {1}
-            """.format(self.mst_key_col, self.model_output_table))[0]['a']
-        plpy.execute("DROP TABLE {0}".format(self.model_info_table))
-        self.initialize_model_output_and_info()
-
-    def initialize_model_output_and_info(self):
+    def load_xfer_learning_weights(self, warm_start=False):
+        """
+            Copy transfer learning weights from
+            model_arch table.  Ignore models with
+            no xfer learning weights, these will
+            be generated by keras and added one at a
+            time later.
+        """
+        load_xfer_learning_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    a.{self.model_weights_col},
+                    a.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.model_arch_table} a
+                        ON m.{self.model_id_col} = a.{self.model_id_col}
+                WHERE a.{self.model_weights_col} IS NOT NULL;
+        """.format(self=self)
+        DEBUG.plpy.execute(load_xfer_learning_weights_query)
+
+    def init_model_output_tbl(self):
+        DEBUG.start_timing('init_model_output_and_info')
+
+        output_table_create_query = """
+                                    CREATE TABLE {self.model_output_tbl}
+                                    ({self.mst_key_col} INTEGER,
+                                     {self.model_weights_col} BYTEA,
+                                     {self.model_arch_col} JSON,
+                                     {self.compile_params_col} TEXT,
+                                     {self.fit_params_col} TEXT,
+                                     {self.object_map_col} BYTEA,
+                                     {self.dist_key_col} INTEGER,
+                                     PRIMARY KEY ({self.mst_key_col}, {self.dist_key_col}))
+                                     DISTRIBUTED BY ({self.dist_key_col})
+                                    """.format(self=self)
+        plpy.execute(output_table_create_query)
+
+        if self.warm_start:

Review comment:
       1. Maybe consider renaming these two functions, since they load not only the weights but also the model_arch, compile_params, fit_params, etc.; see the sketch below for possible names.
   2. Also consider not including the keyword `xfer_learning` in the function name, since the function is also used in the case where model_weights are NULL in the model arch table.
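
   A rough naming sketch (the names here are only suggestions, and the bodies are elided):

       # Possible renames reflecting that these functions copy the weights plus
       # model_arch, compile_params, fit_params and the dist_key:
       def load_warm_start_models(self):       # was: load_warm_start_weights
           pass

       def load_initial_models(self):          # was: load_xfer_learning_weights
           pass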

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_wrapper.py_in
##########
@@ -81,6 +88,7 @@ def set_keras_session(device_name, gpu_count, segments_per_host):
     with K.tf.device(device_name):
         session = get_keras_session(device_name, gpu_count, segments_per_host)
         K.set_session(session)
+        enable_xla()

Review comment:
       enable_xla() is already called inside `get_keras_session`. Since `set_keras_session` calls `get_keras_session`, we don't really need to call enable_xla() here as well.
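
   A small sketch of what that would look like, assuming the function shape shown in this diff (illustrative only):

       # get_keras_session() already calls enable_xla(), so set_keras_session()
       # does not need a second enable_xla() call after K.set_session().
       def set_keras_session(device_name, gpu_count, segments_per_host):
           with K.tf.device(device_name):
               session = get_keras_session(device_name, gpu_count, segments_per_host)
               K.set_session(session)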

##########
File path: src/ports/postgres/modules/utilities/debug.py_in
##########
@@ -0,0 +1,145 @@
+import plpy as plpy_orig
+import time
+from deep_learning.madlib_keras_model_selection import ModelSelectionSchema
+from deep_learning.madlib_keras_helper import DISTRIBUTION_KEY_COLNAME
+
+mst_key_col = ModelSelectionSchema.MST_KEY
+dist_key_col = DISTRIBUTION_KEY_COLNAME
+
+start_times = dict()
+timings_enabled = False
+
+def start_timing(msg, force=False):
+    if timings_enabled or force:
+        start_times[msg] = time.time()
+        plpy_orig.info("|_{}_time_HDR|Elapsed (s)|Current|Current (s)|Start|Start (s)|".format(msg))
+
+def print_timing(msg, force=False):
+    if timings_enabled or force:
+        try:
+            start_time = start_times[msg]
+        except:
+            raise Exception(
+                "print_timing({msg}) called with no start_timing({msg})!".format(msg=msg)
+            )
+        current_time = time.time() 
+        plpy_orig.info(
+            '|_{0}_time|{1}|{2}|{3}|{4}|{5}'.format(
+                msg,
+                current_time - start_time,
+                time.ctime(current_time),
+                current_time,
+                time.ctime(start_time),
+                start_time
+            )
+        )
+
+mst_keys_enabled = False
+def print_mst_keys(table, label, force=False):
+    if not (mst_keys_enabled or force):
+        return
+
+    res = plpy_orig.execute("""
+        SELECT gp_segment_id AS seg_id,
+               {mst_key_col},
+               {dist_key_col}
+        FROM {table} ORDER BY {dist_key_col}
+    """.format(dist_key_col=dist_key_col,
+               table=table,
+               mst_key_col=mst_key_col))
+
+    plpy_orig.info("|_MST_KEYS_{label}_HDR|mst_key|seg_id|dist_key|table".format(**locals()))
+    if not res:
+        plpy_orig.error("{table} is empty!  Aborting".format(table=table))
+
+    for r in res:
+        seg_id = r['seg_id']
+        mst_key = r['mst_key']
+        dist_key = r[dist_key_col]
+        plpy_orig.info("|_MST_KEYS_{label}|{mst_key}|{seg_id}|{dist_key}|{table}".format(**locals()))
+
+plpy_execute_enabled = False
+def plpy_execute(*args, **kwargs):
+    """ debug.plpy.execute(sql, ..., force=False)
+
+        Replace plpy.execute(sql, ...) with
+        debug.plpy.execute(sql, ...) to debug
+        a query.  Shows the query itself, the
+        EXPLAIN of it, and how long the query
+        takes to execute.
+    """
+
+    force = False
+    if 'force' in kwargs:
+        force = kwargs['force']
+        del kwargs['force']
+
+    plpy = plpy_orig # override global plpy,
+                     # to avoid infinite recursion
+
+    if not (plpy_execute_enabled or force):
+        return plpy.execute(*args, **kwargs)
+
+    if len(args) > 0:
+        sql = args[0]
+    else:
+        raise TypeError('debug.plpy.execute() takes at least 1 parameter, 0 passed')
+
+    if type(sql) == str: # can't print if a PLyPlan object
+        plpy.info(sql)
+
+        # Print EXPLAIN of sql command
+        res = plpy.execute("EXPLAIN " + sql, *args[1:], **kwargs)
+        for r in res:
+            plpy.info(r['QUERY PLAN'])
+
+    # Run actual sql command, with timing
+    start = time.time()
+    res = plpy.execute(*args, **kwargs)
+
+    # Print how long execution of query took
+    plpy.info("Query took {0}s".format(time.time() - start))
+    if res:
+        plpy.info("Query returned {} row(s)".format(len(res)))
+    else:
+        plpy.info("Query returned 0 rows")
+    return res
+
+plpy_info_enabled = False
+def plpy_info(*args, **kwargs):
+    """ plpy_info(..., force=False)
+
+      plpy.info() if enabled, otherwise do nothing   
+    """
+
+    force = False
+    if 'force' in kwargs:
+        force = kwargs['force']
+        del kwargs['force']
+
+    if plpy_info_enabled or force:
+        plpy_orig.info(*args, **kwargs)
+
+plpy_debug_enabled = False
+def plpy_debug(*args, **kwargs):
+    """ debug.plpy.debug(..., force=False)
+
+        Behaves like plpy.debug() if disabled (printing only
+        if DEBUG level is set high enough), but becomes a
+        plpy.info() if enabled.
+    """
+
+    force = False
+    if 'force' in kwargs:
+        force = kwargs['force']
+        del kwargs['force']
+
+    if plpy_debug_enabled or force:
+        plpy_orig.info(*args, **kwargs)
+    else:
+        plpy_orig.debug(*args, **kwargs)
+
+class plpy:
+    execute = staticmethod(plpy_execute)
+    info = staticmethod(plpy_info)
+    debug = staticmethod(plpy_debug)

Review comment:
       Since this function is never used, do you think it makes sense to delete this line and the `plpy_debug` function as well (given that developers can still use `plpy_info` for printing purposes)?

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)
+
+        DEBUG.plpy.execute(self.rotate_schedule_tbl_plan)
+
+        self.truncate_and_drop(self.schedule_tbl)
+        plpy.execute("""
+            ALTER TABLE {self.next_schedule_tbl}
+            RENAME TO {self.schedule_tbl}
+        """.format(self=self))
 
-    def create_model_output_table_warm_start(self):
+    def load_warm_start_weights(self):
         """
-        For warm start, we need to copy the model output table to a temp table
-        because we call truncate on the model output table while training.
-        If the query gets aborted, we need to make sure that the user passed
-        model output table can be recovered.
+        For warm start, we need to copy any rows of the model output
+        table provided by the user whose mst keys appear in the
+        supplied model selection table.  We also copy over the 
+        compile & fit params from the model_selection_table, and
+        the dist_key's from the schedule table.
         """
-        plpy.execute("""
-            CREATE TABLE {self.model_output_table} (
-            LIKE {self.original_model_output_table} INCLUDING indexes);
-            """.format(self=self))
+        load_warm_start_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    o.{self.model_weights_col},
+                    o.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.original_model_output_tbl} o
+                        USING ({self.mst_key_col})
+        """.format(self=self)
+        DEBUG.plpy.execute(load_warm_start_weights_query)
 
-        plpy.execute("""INSERT INTO {self.model_output_table}
-            SELECT * FROM {self.original_model_output_table};
-            """.format(self=self))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(self.model_info_tbl))
 
-        plpy.execute(""" DELETE FROM {self.model_output_table}
-                WHERE {self.mst_key_col} NOT IN (
-                    SELECT {self.mst_key_col} FROM {self.model_selection_table})
-                """.format(self=self))
-        self.warm_start_msts = plpy.execute(
-            """ SELECT array_agg({0}) AS a FROM {1}
-            """.format(self.mst_key_col, self.model_output_table))[0]['a']
-        plpy.execute("DROP TABLE {0}".format(self.model_info_table))
-        self.initialize_model_output_and_info()
-
-    def initialize_model_output_and_info(self):
+    def load_xfer_learning_weights(self, warm_start=False):
+        """
+            Copy transfer learning weights from
+            model_arch table.  Ignore models with
+            no xfer learning weights, these will
+            be generated by keras and added one at a
+            time later.
+        """
+        load_xfer_learning_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    a.{self.model_weights_col},
+                    a.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.model_arch_table} a
+                        ON m.{self.model_id_col} = a.{self.model_id_col}
+                WHERE a.{self.model_weights_col} IS NOT NULL;
+        """.format(self=self)
+        DEBUG.plpy.execute(load_xfer_learning_weights_query)
+
+    def init_model_output_tbl(self):
+        DEBUG.start_timing('init_model_output_and_info')
+
+        output_table_create_query = """
+                                    CREATE TABLE {self.model_output_tbl}
+                                    ({self.mst_key_col} INTEGER,
+                                     {self.model_weights_col} BYTEA,
+                                     {self.model_arch_col} JSON,
+                                     {self.compile_params_col} TEXT,
+                                     {self.fit_params_col} TEXT,
+                                     {self.object_map_col} BYTEA,
+                                     {self.dist_key_col} INTEGER,
+                                     PRIMARY KEY ({self.dist_key_col}, {self.mst_key_col})
+                                    )
+                                    DISTRIBUTED BY ({self.dist_key_col})
+                                    """.format(self=self)
+        plpy.execute(output_table_create_query)
+
+        if self.warm_start:
+            self.load_warm_start_weights()
+        else:  # Note:  We only support xfer learning when warm_start=False
+            self.load_xfer_learning_weights()
+
+        res = DEBUG.plpy.execute("""
+            SELECT {self.mst_key_col} AS mst_keys FROM {self.model_output_tbl}
+        """.format(self=self))
+       
+        if res:
+            initialized_msts = set([ row['mst_keys'] for row in res ])
+        else:
+            initialized_msts = set()
+
+        DEBUG.plpy.info("Pre-initialized mst keys: {}".format(initialized_msts))

Review comment:
       We should try to reduce the number of `DEBUG.plpy.info` calls, since they can be feature specific and having a lot of them makes the code slightly harder to read.
   Any developer working on a feature can add their own `DEBUG.plpy.info` calls as needed.







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r558631652



##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. We mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overridden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(
+    dependent_var               BYTEA,
+    independent_var             BYTEA,
+    dependent_var_shape         INTEGER[],
+    independent_var_shape       INTEGER[],
+    model_architecture          TEXT,
+    compile_params              TEXT,
+    fit_params                  TEXT,
+    dist_key                    INTEGER,
+    dist_key_mapping            INTEGER[],
+    current_seg_id              INTEGER,
+    segments_per_host           INTEGER,
+    images_per_seg              INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
+    serialized_weights          BYTEA,
+    is_final_training_call      BOOLEAN,
+    use_caching                 BOOLEAN,
+    custom_function_map         BYTEA
+) RETURNS BYTEA AS
+$$
+    param_keys = [ 'compile_params', 'accessible_gpus_for_seg', 'dependent_var_shape', 'dist_key_mapping',
+                   'current_seg_id', 'segments_per_host', 'custom_function_map', 'is_final_training_call',
+                   'dist_key', 'serialized_weights', 'images_per_seg', 'model_architecture', 'fit_params',
+                   'independent_var_shape', 'use_caching' ]
+
+    num_calls = 1
+    if 'transition_function_params' in GD:
+        if dist_key in GD['transition_function_params']:
+            if not 'reset' in GD['transition_function_params'][dist_key]:
+                num_calls = GD['transition_function_params'][dist_key]['num_calls']
+                num_calls += 1
+
+    g = globals()
+    params = dict()
+
+    for k in param_keys:
+        params[k] = g[k]
+
+    params['dependent_var'] = len(dependent_var) if dependent_var else 0
+    params['independent_var'] = len(independent_var) if independent_var else 0
+    params['num_calls'] = num_calls
+
+    if not 'transition_function_params' in GD:
+        GD['transition_function_params'] = dict()
+    GD['transition_function_params'][dist_key] = params
+
+    # compute simulated seg_id ( current_seg_id is the actual seg_id )
+    seg_id = dist_key_mapping.index( dist_key )
+
+    if dependent_var_shape and dependent_var_shape[0] * num_calls < images_per_seg [ seg_id ]:
+        return None
+    else:
+        GD['transition_function_params'][dist_key]['reset'] = True
+        return serialized_weights
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION validate_transition_function_params(
+    current_seg_id                       INTEGER,
+    segments_per_host                    INTEGER,
+    images_per_seg                       INTEGER[],
+    expected_num_calls                   INTEGER,
+    expected_dist_key                    INTEGER,
+    expected_is_final_training_call      BOOLEAN,
+    expected_dist_key_mapping            INTEGER[],
+    dependent_var_len                    INTEGER,
+    independent_var_len                  INTEGER,
+    use_caching                          BOOLEAN
+) RETURNS TEXT AS
+$$
+    err_msg = "transition function was not called on segment "
+
+    if 'transition_function_params' not in GD:
+        return err_msg.format(current_seg_id)
+    elif expected_dist_key not in GD['transition_function_params']:
+        return err_msg + " for __dist_key__ = {}".format(expected_dist_key)
+    actual = GD['transition_function_params'][expected_dist_key]
+
+    err_msg = """Incorrect value for {} param passed to fit_transition_multiple_model:
+       Actual={}, Expected={}"""
+
+    validation_map = {
+        'current_seg_id'         : current_seg_id,
+        'segments_per_host'      : segments_per_host,
+        'num_calls'              : expected_num_calls,
+        'is_final_training_call' : expected_is_final_training_call,
+        'dist_key'               : expected_dist_key,
+        'dependent_var'          : dependent_var_len,
+        'independent_var'        : independent_var_len,
+        'use_caching'            : use_caching
+    }
+
+    for param, expected in validation_map.items():
+        if actual[param] != expected:
+            return err_msg.format(
+                param,
+                actual[param],
+                expected
+            )
+
+    return 'PASS'  # actual params match expected params
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Helper to rotate an array of int's
+CREATE OR REPLACE FUNCTION rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[-1:] + keys[:-1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION reverse_rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[1:] + keys[:1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION setup_model_tables(
+    input_table TEXT,
+    output_table TEXT,
+    cached_source_table TEXT
+) RETURNS TEXT AS
+$$ 
+    fit_mult = GD['fit_mult']
+
+    fit_mult.model_input_tbl = input_table
+    fit_mult.model_output_tbl = output_table
+    fit_mult.cached_source_table = cached_source_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(output_table))
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(cached_source_table))
+    fit_mult.init_model_output_tbl()
+    q = """
+        UPDATE {model_out} -- Reduce size of model for faster tests
+            SET ( model_weights, model_arch, compile_params, fit_params )
+                  = ( mst_key::TEXT::BYTEA,
+                      ( '{{ "a" : ' || mst_key::TEXT || ' }}' )::JSON,
+                      'c' || mst_key::TEXT,
+                      'f' || mst_key::TEXT 
+                    )
+        WHERE mst_key IS NOT NULL;
+    """.format(model_out=fit_mult.model_output_tbl)
+    plpy.execute(q) 
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Updates dist keys in src table and internal fit_mult class variables
+--    num_data_segs can be larger than actual number of segments, since this
+--    is just for simulated testing.  This will also write to expected_distkey_mappings_tbl
+--    which can be used for validating dist key mappings and images per seg later.
+CREATE OR REPLACE FUNCTION update_dist_keys(
+    src_table TEXT,
+    num_data_segs INTEGER,
+    num_models INTEGER,
+    expected_distkey_mappings_tbl TEXT
+) RETURNS VOID AS
+$$ 
+    redist_cmd = """
+        UPDATE {src_table}
+            SET __dist_key__ = (buffer_id % {num_data_segs})
+    """.format(**globals())
+    plpy.execute(redist_cmd)
+
+    fit_mult = GD['fit_mult']
+
+    q = """
+        SELECT SUM(independent_var_shape[1]) AS image_count,
+            __dist_key__
+        FROM {src_table}
+        GROUP BY __dist_key__
+        ORDER BY __dist_key__
+    """.format(**globals())
+    res = plpy.execute(q)
+
+    images_per_seg = [ int(r['image_count']) for r in res ]
+    dist_keys = [ int(r['__dist_key__']) for r in res ]
+    num_dist_keys = len(dist_keys)
+
+    fit_mult.source_table = src_table
+    fit_mult.max_dist_key = sorted(dist_keys)[-1]
+    fit_mult.images_per_seg_train = images_per_seg
+    fit_mult.dist_key_mapping = fit_mult.dist_keys = dist_keys
+    fit_mult.accessible_gpus_per_seg = [0] * num_dist_keys
+    fit_mult.segments_per_host = num_data_segs
+
+    fit_mult.msts_for_schedule = fit_mult.msts[:num_models]
+    if num_models < num_dist_keys:
+        fit_mult.msts_for_schedule += [None] * \
+                                 (num_dist_keys - num_models)
+    fit_mult.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                              for mst in fit_mult.msts_for_schedule ]
+    fit_mult.num_msts = num_models
+
+    fit_mult.extra_dist_keys = []
+    for i in range(num_models - num_dist_keys):
+        fit_mult.extra_dist_keys.append(fit_mult.max_dist_key + 1 + i)
+    fit_mult.all_dist_keys = fit_mult.dist_key_mapping + fit_mult.extra_dist_keys
+
+    create_distkey_map_tbl_cmd = """
+        DROP TABLE IF EXISTS {exp_table};
+        CREATE TABLE {exp_table} AS
+        SELECT
+            ARRAY(  -- map of dist_keys to seg_ids from source table
+                SELECT __dist_key__
+                FROM {fm.source_table}
+                GROUP BY __dist_key__
+                ORDER BY __dist_key__  -- This would be gp_segment_id if it weren't a simulation
+            ) AS expected_dist_key_mapping,
+            ARRAY{fm.images_per_seg_train} AS expected_images_per_seg,
+            {num_data_segs} AS segments_per_host,
+            __dist_key__
+        FROM {fm.source_table}
+        GROUP BY __dist_key__
+        DISTRIBUTED BY (__dist_key__);
+    """.format(
+            fm=fit_mult,
+            num_data_segs=num_data_segs,
+            exp_table=expected_distkey_mappings_tbl
+        )
+    plpy.execute(create_distkey_map_tbl_cmd)
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION test_run_training(
+    source_table TEXT,
+    hop INTEGER,
+    is_very_first_hop BOOLEAN,
+    is_final_training_call BOOLEAN,
+    use_caching BOOLEAN
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    # Each time we start a new test, clear out stats
+    #   like num_calls from GD so we don't end up validating
+    #   against old results
+    if 'transition_function_params' in GD:
+        del GD['transition_function_params']
+
+    fit_mult.source_tbl = source_table
+    fit_mult.is_very_first_hop = is_very_first_hop
+    fit_mult.is_final_training_call = is_final_training_call
+    if use_caching != fit_mult.use_caching:
+        fit_mult.udf_plan = None  # Otherwise it will execute the wrong
+                                  # query when use_caching changes!
+    fit_mult.use_caching = use_caching
+
+    fit_mult.run_training(hop=hop, is_very_first_hop=is_very_first_hop)
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+m4_ifelse(m4_eval(__DBMS_VERSION_MAJOR__ <= 6),1,<!
+
+-- format() function for dynamic SQL was introduced in gpdb6,
+--   so for gpdb5 we need to define a simple version of it
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC TEXT[])
+RETURNS TEXT AS
+$$
+    import re
+    from internal.db_utils import quote_nullable
+    from utilities.utilities import quote_ident
+
+    seps = re.findall(r'%.', s)
+    pieces = re.split(r'%.', s)
+
+    res = pieces[0]
+    for i in range(len(seps)):
+        if vars[i] is None:
+            raise ValueError('Not enough params passed to format({})'.format(s))
+        else:
+            c = seps[i][1]
+            if c == '%':
+                quoted = '%'
+            elif c == 's':
+                quoted = vars[i]
+            elif c == 'L':
+                quoted = quote_nullable(vars[i])
+            elif c == 'I':
+                quoted = quote_ident(vars[i])
+            else:
+                raise ValueError('{} in format({}) not recognized'.format(seps[i],s))
+            res += quoted
+        res += pieces[i+1]
+    return res
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION format(s TEXT,
+                                  v1 INTEGER[],
+                                  v2 INTEGER[],
+                                  v3 TEXT,
+                                  v4 TEXT,
+                                  v5 INTEGER[],
+                                  v6 TEXT,
+                                  v7 INTEGER[]
+)
+RETURNS TEXT AS
+$$
+    SELECT format($1,
+               $2::TEXT,
+               $3::TEXT,
+               $4,
+               $5,
+               $6::TEXT,
+               $7,
+               $8::TEXT
+        );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC ANYARRAY)
+RETURNS TEXT AS
+$$
+    --SELECT format($1, $2::TEXT[])
+    SELECT $2;
+$$ LANGUAGE sql IMMUTABLE;
+
+!>) -- m4_endif DBMS_VERSION_MAJOR <= 5
+
+CREATE OR REPLACE FUNCTION validate_mst_key_order(output_tbl TEXT, expected_tbl TEXT)
+RETURNS VOID AS
+$$
+DECLARE
+    actual INTEGER[];
+    expected INTEGER[];
+    res RECORD;
+BEGIN
+    EXECUTE format(
+        'SELECT ARRAY(' ||
+            'SELECT mst_key FROM %I ' ||
+                'ORDER BY __dist_key__)',
+        output_tbl
+    ) INTO actual;
+
+    EXECUTE format(
+        'SELECT mst_keys FROM %I',
+        expected_tbl
+    ) INTO expected;
+
+    EXECUTE format(
+        'SELECT assert(' ||
+            '%L = %L, ' ||
+            '%L || ' ||
+            '%L || %L::TEXT || ' ||
+            '%L || %L::TEXT' ||
+        ')',
+        actual,
+        expected,
+        'mst keys found in wrong order / wrong segments!',
+        E'\nActual: ',
+        actual,
+        E'\nExpected: ',
+        expected
+    ) INTO res;
+END
+$$ LANGUAGE PLpgSQL VOLATILE;
+
+-- Create mst table
+DROP TABLE IF EXISTS iris_mst_table, iris_mst_table_summary;
+SELECT load_model_selection_table(
+    'iris_model_arch',
+    'iris_mst_table',
+    ARRAY[1],
+    ARRAY[
+        $$loss='categorical_crossentropy',optimizer='Adam(lr=0.1)',metrics=['accuracy']$$,
+        $$loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy']$$,
+        $$loss='categorical_crossentropy',optimizer='Adam(lr=0.001)',metrics=['accuracy']$$
+    ],
+    ARRAY[
+        $$batch_size=5,epochs=1$$,
+        $$batch_size=10,epochs=1$$
+    ]
+);
+
+-- Create FitMultiple object for running test functions
+SELECT init_fit_mult('iris_data_15buf_packed', 'iris_mst_table');
+
+CREATE TABLE src_3segs AS
+    SELECT * FROM iris_data_15buf_packed
+    DISTRIBUTED BY (__dist_key__);
+
+-- Simulate 6 models on 3 segments --
+SELECT update_dist_keys('src_3segs', 3, 6, 'expected_dist_key_mappings');
+
+--=== Test init_schedule_tbl() ===--
+-- ====================================================================
+-- ===========  Enough setup, now for the actual tests! ===============
+-- ====================================================================
+
+SELECT test_init_schedule('current_schedule');
+SELECT assert(
+    s.mst_key IS NOT NULL AND m.mst_key IS NOT NULL,
+    'mst_keys in schedule table created by test_init_schedule() does not match keys in mst_table'
+) FROM current_schedule s FULL JOIN iris_mst_table m USING (mst_key);
+
+-- Save order of mst keys in schedule for tracking 
+DROP TABLE IF EXISTS expected_order;
+CREATE TABLE expected_order AS SELECT ARRAY(SELECT mst_key FROM current_schedule ORDER BY __dist_key__) mst_keys;
+
+--=== Test rotate_schedule() ===--
+SELECT test_rotate_schedule('current_schedule');
+UPDATE expected_order SET mst_keys=rotate_keys(mst_keys);
+SELECT validate_mst_key_order('current_schedule', 'expected_order');
+UPDATE expected_order SET mst_keys=reverse_rotate_keys(mst_keys);  -- Undo for later
+
+-- Initialize model_output table, and set model_input & cached_src table names
+SELECT setup_model_tables('model_input', 'model_output', 'cached_src');
+SELECT validate_mst_key_order('model_output', 'expected_order');
+
+-- Order of params in run_training test function (for reference below):
+--
+--  test_run_training(src_tbl, hop, is_v_first_hop, is_final_call, use_caching)
+
+--=== Test first hop of an iteration - no caching (# msts > # segs) ===--
+SELECT test_run_training('src_3segs', 0, False, False, False);

Review comment:
       This is the first hop of an iteration, but not the "very first hop", i.e., the first hop of the first iteration.







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r544747487



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -289,7 +327,14 @@ class FitMultipleModel():
             images_per_seg = self.images_per_seg_valid
             self.info_str += "\n\tValidation set after iteration {0}:".format(epoch)
         for mst in self.msts:
-            model_arch, _ = get_model_arch_weights(self.model_arch_table, mst[self.model_id_col])
+            DEBUG.start_timing('eval_query_weights')
+            weights = query_weights(self.model_output_tbl, self.model_weights_col,

Review comment:
       We just need to query the model_arch here; we don't need to query the weights, right?
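
   Something like the following sketch would fetch just the architecture (column and table names follow the self.* attributes used in this diff; the exact query is illustrative):

       # Query only the model architecture for evaluation and skip the much
       # larger serialized-weights column.  plpy is provided by plpython.
       res = plpy.execute("""
           SELECT {arch_col} AS model_arch
           FROM {out_tbl}
           WHERE {mst_key_col} = {mst_key}
       """.format(arch_col=self.model_arch_col,
                  out_tbl=self.model_output_tbl,
                  mst_key_col=self.mst_key_col,
                  mst_key=mst[self.mst_key_col]))
       model_arch = res[0]['model_arch'] if res else None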







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r545551642



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -289,7 +327,14 @@ class FitMultipleModel():
             images_per_seg = self.images_per_seg_valid
             self.info_str += "\n\tValidation set after iteration {0}:".format(epoch)
         for mst in self.msts:
-            model_arch, _ = get_model_arch_weights(self.model_arch_table, mst[self.model_id_col])
+            DEBUG.start_timing('eval_query_weights')
+            weights = query_weights(self.model_output_tbl, self.model_weights_col,

Review comment:
       Right, good catch!  Took it out.







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r556085247



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:

Review comment:
       Maybe the safest option would be to check all 4 of them?  If any one of them is NULL while another of them is non-NULL, that's a sign that something is seriously wrong, and we should probably just stop and error out immediately.
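
       A minimal sketch of what that combined check could look like inside fit_transition (a sketch only; it assumes the fourth data argument is named independent_var_shape, and the error messages are illustrative):

           # Hypothetical helper: the four data columns of a row should be either
           # all present or all NULL; a partially NULL row means something is
           # seriously wrong, so stop immediately rather than silently skipping it.
           def _validate_transition_row(dependent_var, independent_var,
                                        dependent_var_shape, independent_var_shape):
               args = [dependent_var, independent_var,
                       dependent_var_shape, independent_var_shape]
               nulls = [a is None for a in args]
               if any(nulls) and not all(nulls):
                   plpy.error("fit_transition: row has a mix of NULL and non-NULL data columns")
               if all(nulls):
                   plpy.error("fit_transition called with no data")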







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r556079234



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:

Review comment:
       Re: checking dependent_var vs dependent_var_shape, I don't have a strong preference.  I think I went with the shape column out of a habit of avoiding potentially large data structures in favor of small ones when all we need is a boolean.  But I don't think it actually matters in this case; if you prefer dependent_var, I can change it.







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r537838544



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +377,307 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
 
-    def create_model_output_table_warm_start(self):
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)
+
+        plpy.execute(self.rotate_schedule_tbl_plan)
+
+        self.truncate_and_drop(self.schedule_tbl)
+        plpy.execute("""
+            ALTER TABLE {self.next_schedule_tbl}
+            RENAME TO {self.schedule_tbl}
+        """.format(self=self))
+
+    def load_warm_start_weights(self):
         """
-        For warm start, we need to copy the model output table to a temp table
-        because we call truncate on the model output table while training.
-        If the query gets aborted, we need to make sure that the user passed
-        model output table can be recovered.
+        For warm start, we need to copy any rows of the model output
+        table provided by the user whose mst keys appear in the
+        supplied model selection table.  We also copy over the 
+        compile & fit params from the model_selection_table, and
+        the dist_key's from the schedule table.
         """
-        plpy.execute("""
-            CREATE TABLE {self.model_output_table} (
-            LIKE {self.original_model_output_table} INCLUDING indexes);
-            """.format(self=self))
+        load_warm_start_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    o.{self.model_weights_col},
+                    o.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.original_model_output_tbl} o
+                        USING ({self.mst_key_col})
+        """.format(self=self)
+        DEBUG.plpy.execute(load_warm_start_weights_query)
 
-        plpy.execute("""INSERT INTO {self.model_output_table}
-            SELECT * FROM {self.original_model_output_table};
-            """.format(self=self))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(self.model_info_tbl))
 
-        plpy.execute(""" DELETE FROM {self.model_output_table}
-                WHERE {self.mst_key_col} NOT IN (
-                    SELECT {self.mst_key_col} FROM {self.model_selection_table})
-                """.format(self=self))
-        self.warm_start_msts = plpy.execute(
-            """ SELECT array_agg({0}) AS a FROM {1}
-            """.format(self.mst_key_col, self.model_output_table))[0]['a']
-        plpy.execute("DROP TABLE {0}".format(self.model_info_table))
-        self.initialize_model_output_and_info()
-
-    def initialize_model_output_and_info(self):
+    def load_xfer_learning_weights(self, warm_start=False):
+        """
+            Copy transfer learning weights from
+            model_arch table.  Ignore models with
+            no xfer learning weights, these will
+            be generated by keras and added one at a
+            time later.
+        """
+        load_xfer_learning_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    a.{self.model_weights_col},
+                    a.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.model_arch_table} a
+                        ON m.{self.model_id_col} = a.{self.model_id_col}
+                WHERE a.{self.model_weights_col} IS NOT NULL;
+        """.format(self=self)
+        DEBUG.plpy.execute(load_xfer_learning_weights_query)
+
+    def init_model_output_tbl(self):
+        DEBUG.start_timing('init_model_output_and_info')
+
+        output_table_create_query = """
+                                    CREATE TABLE {self.model_output_tbl}
+                                    ({self.mst_key_col} INTEGER,
+                                     {self.model_weights_col} BYTEA,
+                                     {self.model_arch_col} JSON,
+                                     {self.compile_params_col} TEXT,
+                                     {self.fit_params_col} TEXT,
+                                     {self.object_map_col} BYTEA,
+                                     {self.dist_key_col} INTEGER,
+                                     PRIMARY KEY ({self.mst_key_col}, {self.dist_key_col}))
+                                     DISTRIBUTED BY ({self.dist_key_col})
+                                    """.format(self=self)
+        plpy.execute(output_table_create_query)
+
+        if self.warm_start:

Review comment:
       Re: 1, I'll think about it.  Any suggestions?
   
   2. It's only for transfer learning; that's actually the important part of the name.  Notice the IS NOT NULL part of the query, which makes sure it only copies the transfer learning weights.
   
   The basic logic for initializing the model output table is this:
   
   1.  Bulk load any warm start weights from the previous model output table.
   2.  Bulk load any transfer learning weights from the model arch table.
   3.  Iterate over all remaining mst_keys that haven't already been loaded, initialize them with random weights, and write them one by one with INSERTs to the model output table (see the sketch below).
   
   We could drop the IS NOT NULL condition in 2, but since there are no weights, it would just be copying the tiny stuff (compile params, fit params, etc.), so bulk loading doesn't really help there anyway.  And it would make things more difficult later, since we'd have to INSERT the weights in a different way for rows that already have some of the columns initialized but not others.  (Or just write over them, in which case nothing useful was accomplished by copying them the first time.)
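
   A condensed sketch of that flow (names follow the quoted diff; this is a summary, not the exact implementation):

       # 1. Bulk load any user-supplied weights.
       if self.warm_start:
           self.load_warm_start_weights()
       else:
           self.load_xfer_learning_weights()

       # 2. See which mst_keys the bulk loads covered.
       res = plpy.execute("SELECT {0} AS mst_key FROM {1}".format(
           self.mst_key_col, self.model_output_tbl))
       initialized = {row['mst_key'] for row in res}

       # 3. Everything else gets keras-generated random weights, one INSERT each.
       for mst in self.msts_for_schedule:
           if mst is None or mst['mst_key'] in initialized:
               continue
           model_arch = get_model_arch(self.model_arch_table, mst[self.model_id_col])
           serialized_weights = get_initial_weights(None, model_arch, None, False,
                                                    self.accessible_gpus_for_seg)
           # ... INSERT (mst_key, serialized_weights, ...) INTO self.model_output_tbl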







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r538651733



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})

Review comment:
       We would if we wanted the weights to stay fixed... but the purpose of this query is to redistribute the weights from the previous distribution to the next distribution.  The output has to be distributed by __dist_key__ since that's what will be used as the JOIN key with the source table's __dist_key__ in the next query.
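
       To make that concrete, here is a rough sketch (not the exact query from the PR) of the kind of hop those two schedule columns enable: models are matched on the segment they are currently on (the join on __prev_dist_key__) and written out distributed by the segment they are moving to (__dist_key__):

           hop_query = """
               CREATE TABLE {self.model_input_tbl} AS
                   SELECT o.{self.mst_key_col},
                          o.{self.model_weights_col},
                          s.{self.dist_key_col}
                   FROM {self.model_output_tbl} o
                       JOIN {self.schedule_tbl} s
                           ON o.{self.dist_key_col} = s.{self.prev_dist_key_col}
                   DISTRIBUTED BY ({self.dist_key_col})
           """.format(self=self)
           plpy.execute(hop_query)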







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r552332253



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -134,57 +140,56 @@ class FitMultipleModel():
         self.info_str = ""
         self.dep_shape_col = add_postfix(mb_dep_var_col, "_shape")
         self.ind_shape_col = add_postfix(mb_indep_var_col, "_shape")
-        self.use_gpus = use_gpus
+        self.use_gpus = use_gpus if use_gpus else False
         self.segments_per_host = get_segments_per_host()
+        self.model_input_tbl = unique_string('model_input')
+        self.model_output_tbl = unique_string('model_output')
+        self.schedule_tbl = unique_string('schedule')

Review comment:
       While testing locally, I found out that there is a leftover schedule table after a successful run of fit_multiple.
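
       One way to address that (a sketch only, not necessarily how the PR resolves it) would be to explicitly drop the per-run temp tables, including the schedule table, at the end of a successful run:

           def cleanup_temp_tables(self):
               # Hypothetical cleanup: remove the per-run temp tables once
               # training has finished, so nothing is left behind on success.
               for tbl in (self.model_input_tbl, self.model_output_tbl, self.schedule_tbl):
                   plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl))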







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r553567263



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -391,12 +422,14 @@ def get_initial_weights(model_table, model_arch, serialized_weights, warm_start,
         will only be used for segment nodes.
         @args:
             @param model_table: Output model table passed in to fit.
-            @param model_arch_result: Dict containing model architecture info.
+            @param model_arch: Dict containing model architecture info.
             @param warm_start: Boolean flag indicating warm start or not.
     """
     if is_platform_pg():
+        # Use GPU's if they are enabled
         _ = get_device_name_and_set_cuda_env(accessible_gpus_for_seg[0], None)
     else:
+        # We are on master, so never use GPU's

Review comment:
       This comment is only valid for gpdb. For postgres there is no master/coordinator node, so we always use GPUs if they are available, whereas on gpdb we use GPUs only on the segment nodes. We should update the comment to make this clearer.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:
+        plpy.error("fit_transition called with no data")
+
+    if not prev_serialized_weights or not model_architecture:
         return state

Review comment:
       Shouldn't we error out if either prev_serialized_weights or model_architecture is None?
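
       A sketch of the stricter check being suggested, replacing the early return quoted above (the message text is illustrative):

           if not prev_serialized_weights or not model_architecture:
               plpy.error("fit_transition called without model weights or architecture")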

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -663,17 +717,20 @@ def get_state_to_return(segment_model, is_last_row, is_multiple_model, agg_image
     :param is_last_row: boolean to indicate if last row for that hop
     :param is_multiple_model: boolean
     :param agg_image_count: aggregated image count per hop
-    :param total_images: total images per segment
+    :param total_images: total images per segment (only used for madlib_keras_fit() )
     :return:
     """
-    if is_last_row:
-        updated_model_weights = segment_model.get_weights()
-        if is_multiple_model:
+    if is_multiple_model:
+        if is_last_row:
+            updated_model_weights = segment_model.get_weights()
             new_state = madlib_keras_serializer.serialize_nd_weights(updated_model_weights)
         else:
-            updated_model_weights = [total_images * w for w in updated_model_weights]
-            new_state = madlib_keras_serializer.serialize_state_with_nd_weights(
-                agg_image_count, updated_model_weights)
+            new_state = None
+    elif is_last_row:

Review comment:
       Now that we don't get `agg_image_count` from the state, do we still need to set `new_state = float(agg_image_count)` at line 678/735?  This might help simplify the if/elif condition as well.
   

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -338,183 +372,307 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
 
-    def create_model_output_table_warm_start(self):
+        create_sched_query = """
+            CREATE {self.unlogged_table} TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        plpy_execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if self.rotate_schedule_tbl_plan is None:
+            rotate_schedule_tbl_query = """
+                CREATE {self.unlogged_table} TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl}
+                DISTRIBUTED BY ({self.prev_dist_key_col})
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)
+
+        plpy.execute(self.rotate_schedule_tbl_plan)
+
+        self.truncate_and_drop(self.schedule_tbl)
+        plpy.execute("""
+            ALTER TABLE {self.next_schedule_tbl}
+            RENAME TO {self.schedule_tbl}
+        """.format(self=self))
+
+    def load_warm_start_weights(self):
         """
-        For warm start, we need to copy the model output table to a temp table
-        because we call truncate on the model output table while training.
-        If the query gets aborted, we need to make sure that the user passed
-        model output table can be recovered.
+        For warm start, we need to copy any rows of the model output
+        table provided by the user whose mst keys appear in the
+        supplied model selection table.  We also copy over the 
+        compile & fit params from the model_selection_table, and
+        the dist_key's from the schedule table.
         """
-        plpy.execute("""
-            CREATE TABLE {self.model_output_table} (
-            LIKE {self.original_model_output_table} INCLUDING indexes);
-            """.format(self=self))
+        load_warm_start_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    o.{self.model_weights_col},
+                    o.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.original_model_output_tbl} o
+                        USING ({self.mst_key_col})
+        """.format(self=self)
+        plpy_execute(load_warm_start_weights_query)
 
-        plpy.execute("""INSERT INTO {self.model_output_table}
-            SELECT * FROM {self.original_model_output_table};
-            """.format(self=self))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(self.model_info_tbl))
 
-        plpy.execute(""" DELETE FROM {self.model_output_table}
-                WHERE {self.mst_key_col} NOT IN (
-                    SELECT {self.mst_key_col} FROM {self.model_selection_table})
-                """.format(self=self))
-        self.warm_start_msts = plpy.execute(
-            """ SELECT array_agg({0}) AS a FROM {1}
-            """.format(self.mst_key_col, self.model_output_table))[0]['a']
-        plpy.execute("DROP TABLE {0}".format(self.model_info_table))
-        self.initialize_model_output_and_info()
-
-    def initialize_model_output_and_info(self):
+    def load_xfer_learning_weights(self, warm_start=False):
+        """
+            Copy transfer learning weights from
+            model_arch table.  Ignore models with
+            no xfer learning weights, these will
+            be generated by keras and added one at a
+            time later.
+        """
+        load_xfer_learning_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    a.{self.model_weights_col},
+                    a.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.model_arch_table} a
+                        ON m.{self.model_id_col} = a.{self.model_id_col}
+                WHERE a.{self.model_weights_col} IS NOT NULL;
+        """.format(self=self)
+        plpy_execute(load_xfer_learning_weights_query)
+
+    def init_model_output_tbl(self):
+        DEBUG.start_timing('init_model_output_and_info')
+
+        output_table_create_query = """
+                                    CREATE {self.unlogged_table} TABLE {self.model_output_tbl}
+                                    ({self.mst_key_col} INTEGER,
+                                     {self.model_weights_col} BYTEA,
+                                     {self.model_arch_col} JSON,
+                                     {self.compile_params_col} TEXT,
+                                     {self.fit_params_col} TEXT,
+                                     {self.object_map_col} BYTEA,
+                                     {self.dist_key_col} INTEGER,
+                                     PRIMARY KEY ({self.dist_key_col})
+                                    )
+                                    DISTRIBUTED BY ({self.dist_key_col})
+                                    """.format(self=self)
+        plpy.execute(output_table_create_query)
+
+        if self.warm_start:
+            self.load_warm_start_weights()
+        else:  # Note:  We only support xfer learning when warm_start=False
+            self.load_xfer_learning_weights()
+
+        res = plpy.execute("""
+            SELECT {self.mst_key_col} AS mst_keys FROM {self.model_output_tbl}
+        """.format(self=self))
+       
+        if res:
+            initialized_msts = set([ row['mst_keys'] for row in res ])
+        else:
+            initialized_msts = set()
+
+        DEBUG.plpy.info("Pre-initialized mst keys: {}".format(initialized_msts))
+
+        # We've already bulk loaded all of the models with user-specified weights.
+        #  For the rest of the models, we need to generate the weights for each
+        #  by initializing them with keras and adding them one row at a time.
+        #
+        # TODO:  In the future, we should probably move the weight initialization
+        #  into the transition function on the segments.  Here, we would just
+        #  bulk load everything with a single query (or 2, for the warm start case),
+        #  and leave the weights column as NULL for any model whose weights need
+        #  to be randomly initialized.  Then in fit_transition, if prev_weights is
+        #  NULL, and there is nothing in GD, it should just skip the call to
+        #  set_weights(), and keras will automatically initialize them during
+        #  model.from_json(model_arch).
+        #
+        #  This would be a very easy change for fit_multiple(), but might require
+        #   some more work to support fit().  All of the segments there need to
+        #   start with the same weights, so we'd at least have to pass a random
+        #   seed to the transition function for keras to use.  Or generate a seed
+        #   on the segments in some deterministic way that's the same for all.
+        for index, mst in enumerate(self.msts_for_schedule):
+            if mst is None:
+                continue
+
+            if mst['mst_key'] in initialized_msts:
+                continue  # skip if we've already loaded this mst
+
+            num_dist_keys = len(self.dist_keys)
+
+            if index < num_dist_keys:
+                dist_key = self.dist_keys[index]
+            else:  # For models that won't be trained on first hop
+                dist_key = self.extra_dist_keys[index - num_dist_keys]
+
+            model_arch = get_model_arch(self.model_arch_table, mst[self.model_id_col])
+            serialized_weights = get_initial_weights(None, model_arch, None, False,
+                                                     self.accessible_gpus_for_seg)
+
+            DEBUG.plpy.info(

Review comment:
       For code readability, I think we can remove this debug plpy.info statement. We can add it back for debugging as and when needed
   

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -197,72 +202,104 @@ class FitMultipleModel():
             self.dist_key_mapping_valid, self.images_per_seg_valid = \
                 get_image_count_per_seg_for_minibatched_data_from_db(
                     self.validation_table)
-        self.mst_weights_tbl = unique_string(desp='mst_weights')
-        self.mst_current_schedule_tbl = unique_string(desp='mst_current_schedule')
 
-        self.dist_keys = query_dist_keys(self.source_table, dist_key_col)
-        if len(self.msts) < len(self.dist_keys):
+        self.dist_keys = query_dist_keys(self.source_table, self.dist_key_col)
+        self.max_dist_key = sorted(self.dist_keys)[-1]
+        self.extra_dist_keys = []
+
+        num_msts = len(self.msts)
+        num_dist_keys = len(self.dist_keys)
+
+        if num_msts < num_dist_keys:
             self.msts_for_schedule = self.msts + [None] * \
-                                     (len(self.dist_keys) - len(self.msts))
+                                     (num_dist_keys - num_msts)
         else:
             self.msts_for_schedule = self.msts
+            if num_msts > num_dist_keys:
+                for i in range(num_msts - num_dist_keys):
+                    self.extra_dist_keys.append(self.max_dist_key + 1 + i)
+
+        DEBUG.plpy.info('dist_keys : {}'.format(self.dist_keys))

Review comment:
       For code readability, I think we can remove these two debug plpy.info statements. We can add them back for debugging as and when needed

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -62,23 +68,29 @@ class GD_STORE:
     def clear(GD):
         del GD[GD_STORE.SEGMENT_MODEL]
         del GD[GD_STORE.SESS]
+        if GD_STORE.AGG_IMAGE_COUNT in GD:
+            del GD[GD_STORE.AGG_IMAGE_COUNT]
 
 def get_init_model_and_sess(GD, device_name, gpu_count, segments_per_host,
                                model_architecture, compile_params, custom_function_map):
     # If a live session is present, re-use it. Otherwise, recreate it.
-    if GD_STORE.SESS in GD:
+
+    if GD_STORE.SESS in GD :
+        if GD_STORE.AGG_IMAGE_COUNT not in GD:

Review comment:
       The logic to initialize agg_image_count is also in the `fit_transition` and `fit_multiple_transition_caching` functions. We may not have to initialize it here as well

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:
+        plpy.error("fit_transition called with no data")
+
+    if not prev_serialized_weights or not model_architecture:
         return state
+
     GD = kwargs['GD']
+
+    trans_enter_time = time.time()
+
     device_name = get_device_name_and_set_cuda_env(accessible_gpus_for_seg[current_seg_id], current_seg_id)
 
     segment_model, sess = get_init_model_and_sess(GD, device_name,
-                                                  accessible_gpus_for_seg[current_seg_id],
-                                                  segments_per_host,
-                                                  model_architecture, compile_params,
-                                                  custom_function_map)
-    if not state:
-        agg_image_count = 0
-        set_model_weights(segment_model, prev_serialized_weights)
+        accessible_gpus_for_seg[current_seg_id],
+        segments_per_host,
+        model_architecture, compile_params,
+        custom_function_map)
+
+    if GD_STORE.AGG_IMAGE_COUNT in GD:
+        agg_image_count = GD[GD_STORE.AGG_IMAGE_COUNT]
     else:
-        agg_image_count = float(state)
+        agg_image_count = 0
+        GD[GD_STORE.AGG_IMAGE_COUNT] = agg_image_count
+
+    DEBUG.plpy_info("agg_image_count={}".format(agg_image_count))

Review comment:
       I think we can remove this debug plpy.info statement. We can add it back for debugging as and when needed

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -43,16 +47,17 @@ from utilities.utilities import is_platform_pg
 from utilities.utilities import get_seg_number
 from utilities.utilities import get_segments_per_host
 from utilities.utilities import rename_table
+import utilities.debug as DEBUG
+from utilities.debug import plpy_prepare
+from utilities.debug import plpy_execute
 
-import json
-from collections import defaultdict
-import random
-import datetime
+DEBUG.timings_enabled = True

Review comment:
       We should set this to false before merging the code

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -27,6 +27,7 @@ from madlib_keras_model_selection import ModelSelectionSchema
 from keras_model_arch_table import ModelArchSchema
 from utilities.validate_args import table_exists, drop_tables, input_tbl_valid
 from utilities.validate_args import quote_ident
+from madlib_keras_helper import DISTRIBUTION_KEY_COLNAME

Review comment:
       Changes to the automl code are now part of PR #526, right?

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:

Review comment:
       Isn't it better to check the actual data column rather than the shape column to assert for missing data?
   
   Also, previously we supported the case where one of the rows had missing x/y, but now it will error out. Is this intentional?

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -196,72 +195,110 @@ class FitMultipleModel():
             self.dist_key_mapping_valid, self.images_per_seg_valid = \
                 get_image_count_per_seg_for_minibatched_data_from_db(
                     self.validation_table)
-        self.mst_weights_tbl = unique_string(desp='mst_weights')
-        self.mst_current_schedule_tbl = unique_string(desp='mst_current_schedule')
+        self.model_input_tbl = unique_string(desp='model_input')
+        self.schedule_tbl = unique_string(desp='schedule')
 
-        self.dist_keys = query_dist_keys(self.source_table, dist_key_col)
-        if len(self.msts) < len(self.dist_keys):
+        self.dist_keys = query_dist_keys(self.source_table, self.dist_key_col)
+        DEBUG.plpy.info("init_dist_keys = {0}".format(self.dist_keys))
+        self.max_dist_key = sorted(self.dist_keys)[-1]
+        DEBUG.plpy.info("sorted_dist_keys = {0}".format(sorted(self.dist_keys)))
+        DEBUG.plpy.info("max_dist_key = {0}".format(self.max_dist_key))
+        self.extra_dist_keys = []
+
+        num_msts = len(self.msts)
+        num_dist_keys = len(self.dist_keys)
+
+        if num_msts < num_dist_keys:
             self.msts_for_schedule = self.msts + [None] * \
-                                     (len(self.dist_keys) - len(self.msts))
+                                     (num_dist_keys - num_msts)
         else:
             self.msts_for_schedule = self.msts
+            if num_msts > num_dist_keys:
+                for i in range(num_msts - num_dist_keys):
+                    self.extra_dist_keys.append(self.max_dist_key + 1 + i)
+
+        DEBUG.plpy.info('dist_keys : {}'.format(self.dist_keys))
+        DEBUG.plpy.info('extra_dist_keys : {}'.format(self.extra_dist_keys))
+
         random.shuffle(self.msts_for_schedule)
-        self.grand_schedule = self.generate_schedule(self.msts_for_schedule)
-        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
-        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
-        if self.warm_start:
-            self.create_model_output_table_warm_start()
-        else:
-            self.create_model_output_table()
+        # Comma-separated list of the mst_keys, including NULL's
+        #  This will be used to pass the mst keys to the db as
+        #  a sql ARRAY[]
+        self.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                for mst in self.msts_for_schedule ]
 
-        self.weights_to_update_tbl = unique_string(desp='weights_to_update')
-        self.fit_multiple_model()
+        # List of all dist_keys, including any extra dist keys beyond
+        #  the # segments we'll be training on--these represent the
+        #  segments models will rest on while not training, which
+        #  may overlap with the ones that will have training on them.
+        self.all_dist_keys = self.dist_keys + self.extra_dist_keys
 
-        # Update and cleanup metadata tables
-        self.insert_info_table()
-        self.create_model_summary_table()
-        if self.warm_start:
-            self.cleanup_for_warm_start()
-        reset_cuda_env(original_cuda_env)
+        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
+        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
     def fit_multiple_model(self):
+        self.init_schedule_tbl()
+        self.init_model_output_tbl()
+        self.init_model_info_tbl()
+
         # WARNING: set orca off to prevent unwanted redistribution
         with OptimizerControl(False):
             self.start_training_time = datetime.datetime.now()
             self.metrics_elapsed_start_time = time.time()
             self.train_multiple_model()
             self.end_training_time = datetime.datetime.now()
 
-    def cleanup_for_warm_start(self):
+        # Update and cleanup metadata tables
+        self.insert_info_table()
+        self.create_model_summary_table()
+        self.write_final_model_output_tbl()
+        reset_cuda_env(self.original_cuda_env)
+
+    def write_final_model_output_tbl(self):
         """
-        1. drop original model table
+        1. drop original model table if exists
         2. rename temp to original
         :return:
         """
-        drop_query = "DROP TABLE IF EXISTS {}".format(
-            self.original_model_output_table)
-        plpy.execute(drop_query)
-        rename_table(self.schema_madlib, self.model_output_table,
-                     self.original_model_output_table)
+        final_output_table_create_query = """
+                                    DROP TABLE IF EXISTS {self.original_model_output_tbl};
+                                    CREATE TABLE {self.original_model_output_tbl} AS

Review comment:
       Sure, makes sense. In that case, drop and create looks good.







+        """.format(self=self)
+        DEBUG.plpy.execute(load_xfer_learning_weights_query)
+
+    def init_model_output_tbl(self):
+        DEBUG.start_timing('init_model_output_and_info')
+
+        output_table_create_query = """
+                                    CREATE TABLE {self.model_output_tbl}
+                                    ({self.mst_key_col} INTEGER,
+                                     {self.model_weights_col} BYTEA,
+                                     {self.model_arch_col} JSON,
+                                     {self.compile_params_col} TEXT,
+                                     {self.fit_params_col} TEXT,
+                                     {self.object_map_col} BYTEA,
+                                     {self.dist_key_col} INTEGER,
+                                     PRIMARY KEY ({self.mst_key_col}, {self.dist_key_col}))
+                                     DISTRIBUTED BY ({self.dist_key_col})
+                                    """.format(self=self)
+        plpy.execute(output_table_create_query)
+
+        if self.warm_start:

Review comment:
       Re: 1, I'll think about it.  Any suggestions?
   
   2. Actually it's only for transfer learning; that's the important part of the name.  Notice the IS NOT NULL condition in the query, which makes sure only the transfer learning weights get copied.
   
   The basic logic for initializing the model output table is this:
   
   1.  Bulk load any warm start weights from the previous model output table.
   2.  Bulk load any transfer learning weights from the model arch table.
   3.  Iterate over any remaining mst_keys that haven't been loaded yet, initialize them with random weights, and write them one at a time with INSERTs into the model output table.
   
   We could drop the IS NOT NULL condition in step 2, but since those rows have no weights, it would just be copying the tiny stuff (compile params, fit params, etc.), so bulk loading doesn't really help there anyway.  It would also make things harder later, since we'd have to INSERT the weights differently for rows that already have some of the columns initialized but not others.  (Or just overwrite them, in which case nothing useful was accomplished by copying them the first time.)
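
   For illustration only, here is a minimal Python sketch of that three-way split.  The function and parameter names (partition_msts_for_init, warm_start_keys, xfer_weights_by_model_id) are hypothetical stand-ins, not part of the PR; the sketch just shows which loading path each mst would take.

    ```python
    # Hypothetical sketch of the three loading paths described above (not the PR's code).
    def partition_msts_for_init(msts, warm_start_keys, xfer_weights_by_model_id):
        # Split model selection rows into warm-start, transfer-learning, and random-init groups.
        warm_start, xfer_learning, random_init = [], [], []
        for mst in msts:
            if mst['mst_key'] in warm_start_keys:
                warm_start.append(mst)       # step 1: bulk INSERT ... SELECT from the old output table
            elif xfer_weights_by_model_id.get(mst['model_id']) is not None:
                xfer_learning.append(mst)    # step 2: bulk INSERT ... SELECT where weights IS NOT NULL
            else:
                random_init.append(mst)      # step 3: keras generates weights, INSERTed one row at a time
        return warm_start, xfer_learning, random_init

    msts = [{'mst_key': 1, 'model_id': 7},
            {'mst_key': 2, 'model_id': 8},
            {'mst_key': 3, 'model_id': 7}]
    print(partition_msts_for_init(msts, warm_start_keys={1},
                                  xfer_weights_by_model_id={8: b'pretrained'}))
    ```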







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r560533501



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:
+        plpy.error("fit_transition called with no data")
+
+    if not prev_serialized_weights or not model_architecture:
         return state

Review comment:
       makes sense 







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r556811701



##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. we mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overriden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(
+    dependent_var               BYTEA,
+    independent_var             BYTEA,
+    dependent_var_shape         INTEGER[],
+    independent_var_shape       INTEGER[],
+    model_architecture          TEXT,
+    compile_params              TEXT,
+    fit_params                  TEXT,
+    dist_key                    INTEGER,
+    dist_key_mapping            INTEGER[],
+    current_seg_id              INTEGER,
+    segments_per_host           INTEGER,
+    images_per_seg              INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
+    serialized_weights          BYTEA,
+    is_final_training_call      BOOLEAN,
+    use_caching                 BOOLEAN,
+    custom_function_map         BYTEA
+) RETURNS BYTEA AS
+$$
+    param_keys = [ 'compile_params', 'accessible_gpus_for_seg', 'dependent_var_shape', 'dist_key_mapping',
+                   'current_seg_id', 'segments_per_host', 'custom_function_map', 'is_final_training_call',
+                   'dist_key', 'serialized_weights', 'images_per_seg', 'model_architecture', 'fit_params',
+                   'independent_var_shape', 'use_caching' ]
+
+    num_calls = 1
+    if 'transition_function_params' in GD:
+        if dist_key in GD['transition_function_params']:
+            if not 'reset' in GD['transition_function_params'][dist_key]:
+                num_calls = GD['transition_function_params'][dist_key]['num_calls']
+                num_calls += 1
+
+    g = globals()
+    params = dict()
+
+    for k in param_keys:
+        params[k] = g[k]
+
+    params['dependent_var'] = len(dependent_var) if dependent_var else 0
+    params['independent_var'] = len(independent_var) if independent_var else 0
+    params['num_calls'] = num_calls
+
+    if not 'transition_function_params' in GD:
+        GD['transition_function_params'] = dict()
+    GD['transition_function_params'][dist_key] = params
+
+    # compute simulated seg_id ( current_seg_id is the actual seg_id )
+    seg_id = dist_key_mapping.index( dist_key )
+
+    if dependent_var_shape and dependent_var_shape[0] * num_calls < images_per_seg [ seg_id ]:
+        return None
+    else:
+        GD['transition_function_params'][dist_key]['reset'] = True
+        return serialized_weights
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION validate_transition_function_params(
+    current_seg_id                       INTEGER,
+    segments_per_host                    INTEGER,
+    images_per_seg                       INTEGER[],
+    expected_num_calls                   INTEGER,
+    expected_dist_key                    INTEGER,
+    expected_is_final_training_call      BOOLEAN,
+    expected_dist_key_mapping            INTEGER[],
+    dependent_var_len                    INTEGER,
+    independent_var_len                  INTEGER,
+    use_caching                          BOOLEAN
+) RETURNS TEXT AS
+$$
+    err_msg = "transition function was not called on segment "
+
+    if 'transition_function_params' not in GD:
+        return err_msg.format(current_seg_id)

Review comment:
       the err_msg variable will need a `{}` at the end for this format to work
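
   A quick illustration of the point (not the PR's code): str.format() only substitutes where a placeholder exists, so without a `{}` the segment id is silently dropped.

    ```python
    # Hypothetical snippet showing why the trailing {} matters.
    err_msg = "transition function was not called on segment "
    assert err_msg.format(7) == err_msg                    # no placeholder: the 7 is dropped

    err_msg = "transition function was not called on segment {}"
    assert err_msg.format(7).endswith("segment 7")         # placeholder: the seg id appears
    ```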

##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. we mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overriden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(
+    dependent_var               BYTEA,
+    independent_var             BYTEA,
+    dependent_var_shape         INTEGER[],
+    independent_var_shape       INTEGER[],
+    model_architecture          TEXT,
+    compile_params              TEXT,
+    fit_params                  TEXT,
+    dist_key                    INTEGER,
+    dist_key_mapping            INTEGER[],
+    current_seg_id              INTEGER,
+    segments_per_host           INTEGER,
+    images_per_seg              INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
+    serialized_weights          BYTEA,
+    is_final_training_call      BOOLEAN,
+    use_caching                 BOOLEAN,
+    custom_function_map         BYTEA
+) RETURNS BYTEA AS
+$$
+    param_keys = [ 'compile_params', 'accessible_gpus_for_seg', 'dependent_var_shape', 'dist_key_mapping',
+                   'current_seg_id', 'segments_per_host', 'custom_function_map', 'is_final_training_call',
+                   'dist_key', 'serialized_weights', 'images_per_seg', 'model_architecture', 'fit_params',
+                   'independent_var_shape', 'use_caching' ]
+
+    num_calls = 1
+    if 'transition_function_params' in GD:
+        if dist_key in GD['transition_function_params']:
+            if not 'reset' in GD['transition_function_params'][dist_key]:
+                num_calls = GD['transition_function_params'][dist_key]['num_calls']
+                num_calls += 1
+
+    g = globals()
+    params = dict()
+
+    for k in param_keys:
+        params[k] = g[k]
+
+    params['dependent_var'] = len(dependent_var) if dependent_var else 0
+    params['independent_var'] = len(independent_var) if independent_var else 0
+    params['num_calls'] = num_calls
+
+    if not 'transition_function_params' in GD:
+        GD['transition_function_params'] = dict()
+    GD['transition_function_params'][dist_key] = params
+
+    # compute simulated seg_id ( current_seg_id is the actual seg_id )
+    seg_id = dist_key_mapping.index( dist_key )
+
+    if dependent_var_shape and dependent_var_shape[0] * num_calls < images_per_seg [ seg_id ]:
+        return None
+    else:
+        GD['transition_function_params'][dist_key]['reset'] = True
+        return serialized_weights
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION validate_transition_function_params(
+    current_seg_id                       INTEGER,
+    segments_per_host                    INTEGER,
+    images_per_seg                       INTEGER[],
+    expected_num_calls                   INTEGER,
+    expected_dist_key                    INTEGER,
+    expected_is_final_training_call      BOOLEAN,
+    expected_dist_key_mapping            INTEGER[],
+    dependent_var_len                    INTEGER,
+    independent_var_len                  INTEGER,
+    use_caching                          BOOLEAN
+) RETURNS TEXT AS
+$$
+    err_msg = "transition function was not called on segment "
+
+    if 'transition_function_params' not in GD:
+        return err_msg.format(current_seg_id)
+    elif expected_dist_key not in GD['transition_function_params']:
+        return err_msg + " for __dist_key__ = {}".format(expected_dist_key)
+    actual = GD['transition_function_params'][expected_dist_key]
+
+    err_msg = """Incorrect value for {} param passed to fit_transition_multiple_model:
+       Actual={}, Expected={}"""
+
+    validation_map = {
+        'current_seg_id'         : current_seg_id,
+        'segments_per_host'      : segments_per_host,
+        'num_calls'              : expected_num_calls,
+        'is_final_training_call' : expected_is_final_training_call,
+        'dist_key'               : expected_dist_key,
+        'dependent_var'          : dependent_var_len,
+        'independent_var'        : independent_var_len,
+        'use_caching'            : use_caching
+    }
+
+    for param, expected in validation_map.items():
+        if actual[param] != expected:
+            return err_msg.format(
+                param,
+                actual[param],
+                expected
+            )
+
+    return 'PASS'  # actual params match expected params
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Helper to rotate an array of int's
+CREATE OR REPLACE FUNCTION rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[-1:] + keys[:-1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION reverse_rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[1:] + keys[:1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION setup_model_tables(
+    input_table TEXT,
+    output_table TEXT,
+    cached_source_table TEXT
+) RETURNS TEXT AS
+$$ 
+    fit_mult = GD['fit_mult']
+
+    fit_mult.model_input_tbl = input_table
+    fit_mult.model_output_tbl = output_table
+    fit_mult.cached_source_table = cached_source_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(output_table))
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(cached_source_table))
+    fit_mult.init_model_output_tbl()
+    q = """
+        UPDATE {model_out} -- Reduce size of model for faster tests
+            SET ( model_weights, model_arch, compile_params, fit_params )
+                  = ( mst_key::TEXT::BYTEA,
+                      ( '{{ "a" : ' || mst_key::TEXT || ' }}' )::JSON,
+                      'c' || mst_key::TEXT,
+                      'f' || mst_key::TEXT 
+                    )
+        WHERE mst_key IS NOT NULL;
+    """.format(model_out=fit_mult.model_output_tbl)
+    plpy.execute(q) 
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Updates dist keys in src table and internal fit_mult class variables
+--    num_data_segs can be larger than actual number of segments, since this
+--    is just for simulated testing.  This will also write to expected_distkey_mappings_tbl
+--    which can be used for validating dist key mappings and images per seg later.
+CREATE OR REPLACE FUNCTION update_dist_keys(
+    src_table TEXT,
+    num_data_segs INTEGER,
+    num_models INTEGER,
+    expected_distkey_mappings_tbl TEXT
+) RETURNS VOID AS
+$$ 
+    redist_cmd = """
+        UPDATE {src_table}
+            SET __dist_key__ = (buffer_id % {num_data_segs})
+    """.format(**globals())
+    plpy.execute(redist_cmd)
+
+    fit_mult = GD['fit_mult']
+
+    q = """
+        SELECT SUM(independent_var_shape[1]) AS image_count,
+            __dist_key__
+        FROM {src_table}
+        GROUP BY __dist_key__
+        ORDER BY __dist_key__
+    """.format(**globals())
+    res = plpy.execute(q)
+
+    images_per_seg = [ int(r['image_count']) for r in res ]
+    dist_keys = [ int(r['__dist_key__']) for r in res ]
+    num_dist_keys = len(dist_keys)
+
+    fit_mult.source_table = src_table
+    fit_mult.max_dist_key = sorted(dist_keys)[-1]
+    fit_mult.images_per_seg_train = images_per_seg
+    fit_mult.dist_key_mapping = fit_mult.dist_keys = dist_keys
+    fit_mult.accessible_gpus_per_seg = [0] * num_dist_keys
+    fit_mult.segments_per_host = num_data_segs
+
+    fit_mult.msts_for_schedule = fit_mult.msts[:num_models]
+    if num_models < num_dist_keys:
+        fit_mult.msts_for_schedule += [None] * \
+                                 (num_dist_keys - num_models)
+    fit_mult.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                              for mst in fit_mult.msts_for_schedule ]
+    fit_mult.num_msts = num_models
+
+    fit_mult.extra_dist_keys = []
+    for i in range(num_models - num_dist_keys):
+        fit_mult.extra_dist_keys.append(fit_mult.max_dist_key + 1 + i)
+    fit_mult.all_dist_keys = fit_mult.dist_key_mapping + fit_mult.extra_dist_keys
+
+    create_distkey_map_tbl_cmd = """
+        DROP TABLE IF EXISTS {exp_table};
+        CREATE TABLE {exp_table} AS
+        SELECT
+            ARRAY(  -- map of dist_keys to seg_ids from source table
+                SELECT __dist_key__
+                FROM {fm.source_table}
+                GROUP BY __dist_key__
+                ORDER BY __dist_key__  -- This would be gp_segment_id if it weren't a simulation
+            ) AS expected_dist_key_mapping,
+            ARRAY{fm.images_per_seg_train} AS expected_images_per_seg,
+            {num_data_segs} AS segments_per_host,
+            __dist_key__
+        FROM {fm.source_table}
+        GROUP BY __dist_key__
+        DISTRIBUTED BY (__dist_key__);
+    """.format(
+            fm=fit_mult,
+            num_data_segs=num_data_segs,
+            exp_table=expected_distkey_mappings_tbl
+        )
+    plpy.execute(create_distkey_map_tbl_cmd)
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION test_run_training(
+    source_table TEXT,
+    hop INTEGER,
+    is_very_first_hop BOOLEAN,
+    is_final_training_call BOOLEAN,
+    use_caching BOOLEAN
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    # Each time we start a new test, clear out stats
+    #   like num_calls from GD so we don't end up validating
+    #   against old results
+    if 'transition_function_params' in GD:
+        del GD['transition_function_params']
+
+    fit_mult.source_tbl = source_table
+    fit_mult.is_very_first_hop = is_very_first_hop
+    fit_mult.is_final_training_call = is_final_training_call
+    if use_caching != fit_mult.use_caching:
+        fit_mult.udf_plan = None  # Otherwise it will execute the wrong
+                                  # query when use_caching changes!
+    fit_mult.use_caching = use_caching
+
+    fit_mult.run_training(hop=hop, is_very_first_hop=is_very_first_hop)
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+m4_ifelse(m4_eval(__DBMS_VERSION_MAJOR__ <= 6),1,<!
+
+-- format() function for dynamic SQL was introduced in gpdb6,
+--   so for gpdb5 we need to define a simple version of it
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC TEXT[])
+RETURNS TEXT AS
+$$
+    import re
+    from internal.db_utils import quote_nullable
+    from utilities.utilities import quote_ident
+
+    seps = re.findall(r'%.', s)
+    pieces = re.split(r'%.', s)
+
+    res = pieces[0]
+    for i in range(len(seps)):
+        if vars[i] is None:
+            raise ValueError('Not enough params passed to format({})'.format(s))
+        else:
+            c = seps[i][1]
+            if c == '%':
+                quoted = '%'
+            elif c == 's':
+                quoted = vars[i]
+            elif c == 'L':
+                quoted = quote_nullable(vars[i])
+            elif c == 'I':
+                quoted = quote_ident(vars[i])
+            else:
+                raise ValueError('{} in format({}) not recognized'.format(seps[i],s))
+            res += quoted
+        res += pieces[i+1]
+    return res
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION format(s TEXT,
+                                  v1 INTEGER[],
+                                  v2 INTEGER[],
+                                  v3 TEXT,
+                                  v4 TEXT,
+                                  v5 INTEGER[],
+                                  v6 TEXT,
+                                  v7 INTEGER[]
+)
+RETURNS TEXT AS
+$$
+    SELECT format($1,
+               $2::TEXT,
+               $3::TEXT,
+               $4,
+               $5,
+               $6::TEXT,
+               $7,
+               $8::TEXT
+        );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC ANYARRAY)
+RETURNS TEXT AS
+$$
+    --SELECT format($1, $2::TEXT[])
+    SELECT $2;
+$$ LANGUAGE sql IMMUTABLE;
+
+!>) -- m4_endif DBMS_VERSION_MAJOR <= 5
+
+CREATE OR REPLACE FUNCTION validate_mst_key_order(output_tbl TEXT, expected_tbl TEXT)
+RETURNS VOID AS
+$$
+DECLARE
+    actual INTEGER[];
+    expected INTEGER[];
+    res RECORD;
+BEGIN
+    EXECUTE format(
+        'SELECT ARRAY(' ||
+            'SELECT mst_key FROM %I ' ||
+                'ORDER BY __dist_key__)',
+        output_tbl
+    ) INTO actual;
+
+    EXECUTE format(
+        'SELECT mst_keys FROM %I',
+        expected_tbl
+    ) INTO expected;
+
+    EXECUTE format(
+        'SELECT assert(' ||
+            '%L = %L, ' ||
+            '%L || ' ||
+            '%L || %L::TEXT || ' ||
+            '%L || %L::TEXT' ||
+        ')',
+        actual,
+        expected,
+        'mst keys found in wrong order / wrong segments!',
+        E'\nActual: ',
+        actual,
+        E'\nExpected: ',
+        expected
+    ) INTO res;
+END
+$$ LANGUAGE PLpgSQL VOLATILE;
+
+-- Create mst table
+DROP TABLE IF EXISTS iris_mst_table, iris_mst_table_summary;
+SELECT load_model_selection_table(
+    'iris_model_arch',
+    'iris_mst_table',
+    ARRAY[1],
+    ARRAY[
+        $$loss='categorical_crossentropy',optimizer='Adam(lr=0.1)',metrics=['accuracy']$$,
+        $$loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy']$$,
+        $$loss='categorical_crossentropy',optimizer='Adam(lr=0.001)',metrics=['accuracy']$$
+    ],
+    ARRAY[
+        $$batch_size=5,epochs=1$$,
+        $$batch_size=10,epochs=1$$
+    ]
+);
+
+-- Create FitMultiple object for running test functions
+SELECT init_fit_mult('iris_data_15buf_packed', 'iris_mst_table');
+
+CREATE TABLE src_3segs AS
+    SELECT * FROM iris_data_15buf_packed
+    DISTRIBUTED BY (__dist_key__);
+
+-- Simulate 6 models on 3 segments --
+SELECT update_dist_keys('src_3segs', 3, 6, 'expected_dist_key_mappings');
+
+--=== Test init_schedule_tbl() ===--
+-- ====================================================================
+-- ===========  Enough setup, now for the actual tests! ===============
+-- ====================================================================
+
+SELECT test_init_schedule('current_schedule');
+SELECT assert(
+    s.mst_key IS NOT NULL AND m.mst_key IS NOT NULL,
+    'mst_keys in schedule table created by test_init_schedule() does not match keys in mst_table'
+) FROM current_schedule s FULL JOIN iris_mst_table m USING (mst_key);
+
+-- Save order of mst keys in schedule for tracking 
+DROP TABLE IF EXISTS expected_order;
+CREATE TABLE expected_order AS SELECT ARRAY(SELECT mst_key FROM current_schedule ORDER BY __dist_key__) mst_keys;
+
+--=== Test rotate_schedule() ===--
+SELECT test_rotate_schedule('current_schedule');
+UPDATE expected_order SET mst_keys=rotate_keys(mst_keys);
+SELECT validate_mst_key_order('current_schedule', 'expected_order');
+UPDATE expected_order SET mst_keys=reverse_rotate_keys(mst_keys);  -- Undo for later
+
+-- Initialize model_output table, and set model_input & cached_src table names
+SELECT setup_model_tables('model_input', 'model_output', 'cached_src');
+SELECT validate_mst_key_order('model_output', 'expected_order');
+
+-- Order of params in run_training test function (for reference below):
+--
+--  test_run_training(src_tbl, hop, is_v_first_hop, is_final_call, use_caching)
+
+--=== Test first hop of an iteration - no caching (# msts > # segs) ===--
+SELECT test_run_training('src_3segs', 0, False, False, False);

Review comment:
       If we are testing the very first hop, shouldn't the third argument be True?

##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. we mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overriden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(
+    dependent_var               BYTEA,
+    independent_var             BYTEA,
+    dependent_var_shape         INTEGER[],
+    independent_var_shape       INTEGER[],
+    model_architecture          TEXT,
+    compile_params              TEXT,
+    fit_params                  TEXT,
+    dist_key                    INTEGER,
+    dist_key_mapping            INTEGER[],
+    current_seg_id              INTEGER,
+    segments_per_host           INTEGER,
+    images_per_seg              INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
+    serialized_weights          BYTEA,
+    is_final_training_call      BOOLEAN,
+    use_caching                 BOOLEAN,
+    custom_function_map         BYTEA
+) RETURNS BYTEA AS
+$$
+    param_keys = [ 'compile_params', 'accessible_gpus_for_seg', 'dependent_var_shape', 'dist_key_mapping',
+                   'current_seg_id', 'segments_per_host', 'custom_function_map', 'is_final_training_call',
+                   'dist_key', 'serialized_weights', 'images_per_seg', 'model_architecture', 'fit_params',
+                   'independent_var_shape', 'use_caching' ]
+
+    num_calls = 1
+    if 'transition_function_params' in GD:
+        if dist_key in GD['transition_function_params']:
+            if not 'reset' in GD['transition_function_params'][dist_key]:
+                num_calls = GD['transition_function_params'][dist_key]['num_calls']
+                num_calls += 1
+
+    g = globals()
+    params = dict()
+
+    for k in param_keys:
+        params[k] = g[k]
+
+    params['dependent_var'] = len(dependent_var) if dependent_var else 0
+    params['independent_var'] = len(independent_var) if independent_var else 0
+    params['num_calls'] = num_calls
+
+    if not 'transition_function_params' in GD:
+        GD['transition_function_params'] = dict()
+    GD['transition_function_params'][dist_key] = params
+
+    # compute simulated seg_id ( current_seg_id is the actual seg_id )
+    seg_id = dist_key_mapping.index( dist_key )
+
+    if dependent_var_shape and dependent_var_shape[0] * num_calls < images_per_seg [ seg_id ]:
+        return None
+    else:
+        GD['transition_function_params'][dist_key]['reset'] = True
+        return serialized_weights
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION validate_transition_function_params(
+    current_seg_id                       INTEGER,
+    segments_per_host                    INTEGER,
+    images_per_seg                       INTEGER[],
+    expected_num_calls                   INTEGER,
+    expected_dist_key                    INTEGER,
+    expected_is_final_training_call      BOOLEAN,
+    expected_dist_key_mapping            INTEGER[],
+    dependent_var_len                    INTEGER,
+    independent_var_len                  INTEGER,
+    use_caching                          BOOLEAN
+) RETURNS TEXT AS
+$$
+    err_msg = "transition function was not called on segment "
+
+    if 'transition_function_params' not in GD:
+        return err_msg.format(current_seg_id)
+    elif expected_dist_key not in GD['transition_function_params']:
+        return err_msg + " for __dist_key__ = {}".format(expected_dist_key)
+    actual = GD['transition_function_params'][expected_dist_key]
+
+    err_msg = """Incorrect value for {} param passed to fit_transition_multiple_model:
+       Actual={}, Expected={}"""
+
+    validation_map = {
+        'current_seg_id'         : current_seg_id,
+        'segments_per_host'      : segments_per_host,
+        'num_calls'              : expected_num_calls,
+        'is_final_training_call' : expected_is_final_training_call,
+        'dist_key'               : expected_dist_key,
+        'dependent_var'          : dependent_var_len,
+        'independent_var'        : independent_var_len,
+        'use_caching'            : use_caching
+    }
+
+    for param, expected in validation_map.items():
+        if actual[param] != expected:
+            return err_msg.format(
+                param,
+                actual[param],
+                expected
+            )
+
+    return 'PASS'  # actual params match expected params
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Helper to rotate an array of int's
+CREATE OR REPLACE FUNCTION rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[-1:] + keys[:-1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION reverse_rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[1:] + keys[:1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION setup_model_tables(
+    input_table TEXT,
+    output_table TEXT,
+    cached_source_table TEXT
+) RETURNS TEXT AS
+$$ 
+    fit_mult = GD['fit_mult']
+
+    fit_mult.model_input_tbl = input_table
+    fit_mult.model_output_tbl = output_table
+    fit_mult.cached_source_table = cached_source_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(output_table))
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(cached_source_table))
+    fit_mult.init_model_output_tbl()
+    q = """
+        UPDATE {model_out} -- Reduce size of model for faster tests
+            SET ( model_weights, model_arch, compile_params, fit_params )
+                  = ( mst_key::TEXT::BYTEA,
+                      ( '{{ "a" : ' || mst_key::TEXT || ' }}' )::JSON,
+                      'c' || mst_key::TEXT,
+                      'f' || mst_key::TEXT 
+                    )
+        WHERE mst_key IS NOT NULL;
+    """.format(model_out=fit_mult.model_output_tbl)
+    plpy.execute(q) 
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Updates dist keys in src table and internal fit_mult class variables
+--    num_data_segs can be larger than actual number of segments, since this
+--    is just for simulated testing.  This will also write to expected_distkey_mappings_tbl
+--    which can be used for validating dist key mappings and images per seg later.
+CREATE OR REPLACE FUNCTION update_dist_keys(
+    src_table TEXT,
+    num_data_segs INTEGER,
+    num_models INTEGER,
+    expected_distkey_mappings_tbl TEXT
+) RETURNS VOID AS
+$$ 
+    redist_cmd = """
+        UPDATE {src_table}
+            SET __dist_key__ = (buffer_id % {num_data_segs})
+    """.format(**globals())
+    plpy.execute(redist_cmd)
+
+    fit_mult = GD['fit_mult']
+
+    q = """
+        SELECT SUM(independent_var_shape[1]) AS image_count,
+            __dist_key__
+        FROM {src_table}
+        GROUP BY __dist_key__
+        ORDER BY __dist_key__
+    """.format(**globals())
+    res = plpy.execute(q)
+
+    images_per_seg = [ int(r['image_count']) for r in res ]
+    dist_keys = [ int(r['__dist_key__']) for r in res ]
+    num_dist_keys = len(dist_keys)
+
+    fit_mult.source_table = src_table
+    fit_mult.max_dist_key = sorted(dist_keys)[-1]
+    fit_mult.images_per_seg_train = images_per_seg
+    fit_mult.dist_key_mapping = fit_mult.dist_keys = dist_keys
+    fit_mult.accessible_gpus_per_seg = [0] * num_dist_keys
+    fit_mult.segments_per_host = num_data_segs
+
+    fit_mult.msts_for_schedule = fit_mult.msts[:num_models]
+    if num_models < num_dist_keys:
+        fit_mult.msts_for_schedule += [None] * \
+                                 (num_dist_keys - num_models)
+    fit_mult.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                              for mst in fit_mult.msts_for_schedule ]
+    fit_mult.num_msts = num_models
+
+    fit_mult.extra_dist_keys = []
+    for i in range(num_models - num_dist_keys):
+        fit_mult.extra_dist_keys.append(fit_mult.max_dist_key + 1 + i)
+    fit_mult.all_dist_keys = fit_mult.dist_key_mapping + fit_mult.extra_dist_keys
+
+    create_distkey_map_tbl_cmd = """
+        DROP TABLE IF EXISTS {exp_table};
+        CREATE TABLE {exp_table} AS
+        SELECT
+            ARRAY(  -- map of dist_keys to seg_ids from source table
+                SELECT __dist_key__
+                FROM {fm.source_table}
+                GROUP BY __dist_key__
+                ORDER BY __dist_key__  -- This would be gp_segment_id if it weren't a simulation
+            ) AS expected_dist_key_mapping,
+            ARRAY{fm.images_per_seg_train} AS expected_images_per_seg,
+            {num_data_segs} AS segments_per_host,
+            __dist_key__
+        FROM {fm.source_table}
+        GROUP BY __dist_key__
+        DISTRIBUTED BY (__dist_key__);
+    """.format(
+            fm=fit_mult,
+            num_data_segs=num_data_segs,
+            exp_table=expected_distkey_mappings_tbl
+        )
+    plpy.execute(create_distkey_map_tbl_cmd)
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION test_run_training(
+    source_table TEXT,
+    hop INTEGER,
+    is_very_first_hop BOOLEAN,
+    is_final_training_call BOOLEAN,
+    use_caching BOOLEAN
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    # Each time we start a new test, clear out stats
+    #   like num_calls from GD so we don't end up validating
+    #   against old results
+    if 'transition_function_params' in GD:
+        del GD['transition_function_params']
+
+    fit_mult.source_tbl = source_table
+    fit_mult.is_very_first_hop = is_very_first_hop
+    fit_mult.is_final_training_call = is_final_training_call
+    if use_caching != fit_mult.use_caching:
+        fit_mult.udf_plan = None  # Otherwise it will execute the wrong
+                                  # query when use_caching changes!
+    fit_mult.use_caching = use_caching
+
+    fit_mult.run_training(hop=hop, is_very_first_hop=is_very_first_hop)
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+m4_ifelse(m4_eval(__DBMS_VERSION_MAJOR__ <= 6),1,<!
+
+-- format() function for dynamic SQL was introduced in gpdb6,
+--   so for gpdb5 we need to define a simple version of it
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC TEXT[])
+RETURNS TEXT AS
+$$
+    import re
+    from internal.db_utils import quote_nullable
+    from utilities.utilities import quote_ident
+
+    seps = re.findall(r'%.', s)
+    pieces = re.split(r'%.', s)
+
+    res = pieces[0]
+    for i in range(len(seps)):
+        if vars[i] is None:
+            raise ValueError('Not enough params passed to format({})'.format(s))
+        else:
+            c = seps[i][1]
+            if c == '%':
+                quoted = '%'
+            elif c == 's':
+                quoted = vars[i]
+            elif c == 'L':
+                quoted = quote_nullable(vars[i])
+            elif c == 'I':
+                quoted = quote_ident(vars[i])
+            else:
+                raise ValueError('{} in format({}) not recognized'.format(seps[i],s))
+            res += quoted
+        res += pieces[i+1]
+    return res
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION format(s TEXT,
+                                  v1 INTEGER[],
+                                  v2 INTEGER[],
+                                  v3 TEXT,
+                                  v4 TEXT,
+                                  v5 INTEGER[],
+                                  v6 TEXT,
+                                  v7 INTEGER[]
+)
+RETURNS TEXT AS
+$$
+    SELECT format($1,
+               $2::TEXT,
+               $3::TEXT,
+               $4,
+               $5,
+               $6::TEXT,
+               $7,
+               $8::TEXT
+        );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC ANYARRAY)
+RETURNS TEXT AS
+$$
+    --SELECT format($1, $2::TEXT[])
+    SELECT $2;
+$$ LANGUAGE sql IMMUTABLE;
+
+!>) -- m4_endif DBMS_VERSION_MAJOR <= 5
+
+CREATE OR REPLACE FUNCTION validate_mst_key_order(output_tbl TEXT, expected_tbl TEXT)
+RETURNS VOID AS
+$$
+DECLARE
+    actual INTEGER[];
+    expected INTEGER[];
+    res RECORD;
+BEGIN
+    EXECUTE format(
+        'SELECT ARRAY(' ||
+            'SELECT mst_key FROM %I ' ||
+                'ORDER BY __dist_key__)',
+        output_tbl
+    ) INTO actual;
+
+    EXECUTE format(
+        'SELECT mst_keys FROM %I',
+        expected_tbl
+    ) INTO expected;
+
+    EXECUTE format(

Review comment:
       Do we need to call the format() function for any of these queries? Can't we format it ourselves? Something like:
   ```
   create table foo (i int, j int);
   insert into foo values(1,1);
   insert into foo values(2,2);
   insert into foo values(3,4);
   create or replace function test()
   returns void as
   $$
   DECLARE
       actual INTEGER[];
       expected INTEGER[];
   BEGIN
   execute 'select ARRAY(select i from foo order by i)' into actual;
   execute 'select ARRAY(select j from foo order by j)' into expected;
   PERFORM madlib.assert(actual = expected, 'mst keys found in wrong order / wrong segments! ' || E'\nActual: ' || actual::text || E'\nExpected: ' || expected::text);
   END
   $$ language PLpgSQL;
   
   madlib=# select test();
   ERROR:  Failed assertion: mst keys found in wrong order / wrong segments!
   DETAIL:
   Actual: {1,2,3}
   Expected: {1,2,4}
   CONTEXT:  SQL statement "SELECT madlib.assert(actual = expected, 'mst keys found in wrong order / wrong segments! ' || E'\nActual: ' || actual::text || E'\nExpected: ' || expected::text)"
   PL/pgSQL function test() line 8 at PERFORM
   ```







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r558735883



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -62,23 +68,29 @@ class GD_STORE:
     def clear(GD):
         del GD[GD_STORE.SEGMENT_MODEL]
         del GD[GD_STORE.SESS]
+        if GD_STORE.AGG_IMAGE_COUNT in GD:
+            del GD[GD_STORE.AGG_IMAGE_COUNT]
 
 def get_init_model_and_sess(GD, device_name, gpu_count, segments_per_host,
                                model_architecture, compile_params, custom_function_map):
     # If a live session is present, re-use it. Otherwise, recreate it.
-    if GD_STORE.SESS in GD:
+
+    if GD_STORE.SESS in GD :
+        if GD_STORE.AGG_IMAGE_COUNT not in GD:

Review comment:
       ok







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r538651733



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})

Review comment:
       We would if we wanted the weights to stay fixed... but the purpose of this query is to redistribute the weights from the previous distribution to the next distribution.  The output has to be distributed by __dist_key__ since that's what will be used as the JOIN key with the source table's __dist_key__ in the next query.
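   To make the routing concrete, here is a minimal sketch of the hop query being described (table and column names simplified from the diff; not the exact generated SQL).  Each schedule row maps a model's previous location to its next one, and the result is re-distributed by the new __dist_key__ so the training query can join it to the source table without further motion:

       -- Sketch: move each model from the segment it was on (__prev_dist_key__)
       -- to the segment it is scheduled for next (__dist_key__).
       CREATE UNLOGGED TABLE model_input AS
           SELECT o.mst_key,
                  o.model_weights,
                  s.__dist_key__
           FROM model_output o
               JOIN schedule s
               ON o.__dist_key__ = s.__prev_dist_key__
           DISTRIBUTED BY (__dist_key__);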







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r556101604



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -663,17 +717,20 @@ def get_state_to_return(segment_model, is_last_row, is_multiple_model, agg_image
     :param is_last_row: boolean to indicate if last row for that hop
     :param is_multiple_model: boolean
     :param agg_image_count: aggregated image count per hop
-    :param total_images: total images per segment
+    :param total_images: total images per segment (only used for madlib_keras_fit() )
     :return:
     """
-    if is_last_row:
-        updated_model_weights = segment_model.get_weights()
-        if is_multiple_model:
+    if is_multiple_model:
+        if is_last_row:
+            updated_model_weights = segment_model.get_weights()
             new_state = madlib_keras_serializer.serialize_nd_weights(updated_model_weights)
         else:
-            updated_model_weights = [total_images * w for w in updated_model_weights]
-            new_state = madlib_keras_serializer.serialize_state_with_nd_weights(
-                agg_image_count, updated_model_weights)
+            new_state = None
+    elif is_last_row:

Review comment:
       The `new_state = float(agg_image_count)` line is only executed for `is_fit_multiple=False`.  For `is_fit_multiple=True`, we do set `new_state = None`.
   
   As it stands, changing the fit branch to also return NULL would cause the `madlib_keras_fit()` tests to fail.  Since this was only supposed to be a refactor of fit_multiple, I didn't want to mess with that code.  But maybe we could get rid of it for `fit()` as well.







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r560506480



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:

Review comment:
       You are right that x or y being NULL should probably raise an exception, since our minibatch code should have minibatched everything.
   For dependent_var vs dependent_var_shape, I'll leave it up to you to choose which one to check (I'm leaning towards dependent_var).

##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. We mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overridden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(
+    dependent_var               BYTEA,
+    independent_var             BYTEA,
+    dependent_var_shape         INTEGER[],
+    independent_var_shape       INTEGER[],
+    model_architecture          TEXT,
+    compile_params              TEXT,
+    fit_params                  TEXT,
+    dist_key                    INTEGER,
+    dist_key_mapping            INTEGER[],
+    current_seg_id              INTEGER,
+    segments_per_host           INTEGER,
+    images_per_seg              INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
+    serialized_weights          BYTEA,
+    is_final_training_call      BOOLEAN,
+    use_caching                 BOOLEAN,
+    custom_function_map         BYTEA
+) RETURNS BYTEA AS
+$$
+    param_keys = [ 'compile_params', 'accessible_gpus_for_seg', 'dependent_var_shape', 'dist_key_mapping',
+                   'current_seg_id', 'segments_per_host', 'custom_function_map', 'is_final_training_call',
+                   'dist_key', 'serialized_weights', 'images_per_seg', 'model_architecture', 'fit_params',
+                   'independent_var_shape', 'use_caching' ]
+
+    num_calls = 1
+    if 'transition_function_params' in GD:
+        if dist_key in GD['transition_function_params']:
+            if not 'reset' in GD['transition_function_params'][dist_key]:
+                num_calls = GD['transition_function_params'][dist_key]['num_calls']
+                num_calls += 1
+
+    g = globals()
+    params = dict()
+
+    for k in param_keys:
+        params[k] = g[k]
+
+    params['dependent_var'] = len(dependent_var) if dependent_var else 0
+    params['independent_var'] = len(independent_var) if independent_var else 0
+    params['num_calls'] = num_calls
+
+    if not 'transition_function_params' in GD:
+        GD['transition_function_params'] = dict()
+    GD['transition_function_params'][dist_key] = params
+
+    # compute simulated seg_id ( current_seg_id is the actual seg_id )
+    seg_id = dist_key_mapping.index( dist_key )
+
+    if dependent_var_shape and dependent_var_shape[0] * num_calls < images_per_seg [ seg_id ]:
+        return None
+    else:
+        GD['transition_function_params'][dist_key]['reset'] = True
+        return serialized_weights
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION validate_transition_function_params(
+    current_seg_id                       INTEGER,
+    segments_per_host                    INTEGER,
+    images_per_seg                       INTEGER[],
+    expected_num_calls                   INTEGER,
+    expected_dist_key                    INTEGER,
+    expected_is_final_training_call      BOOLEAN,
+    expected_dist_key_mapping            INTEGER[],
+    dependent_var_len                    INTEGER,
+    independent_var_len                  INTEGER,
+    use_caching                          BOOLEAN
+) RETURNS TEXT AS
+$$
+    err_msg = "transition function was not called on segment "
+
+    if 'transition_function_params' not in GD:
+        return err_msg.format(current_seg_id)
+    elif expected_dist_key not in GD['transition_function_params']:
+        return err_msg + " for __dist_key__ = {}".format(expected_dist_key)
+    actual = GD['transition_function_params'][expected_dist_key]
+
+    err_msg = """Incorrect value for {} param passed to fit_transition_multiple_model:
+       Actual={}, Expected={}"""
+
+    validation_map = {
+        'current_seg_id'         : current_seg_id,
+        'segments_per_host'      : segments_per_host,
+        'num_calls'              : expected_num_calls,
+        'is_final_training_call' : expected_is_final_training_call,
+        'dist_key'               : expected_dist_key,
+        'dependent_var'          : dependent_var_len,
+        'independent_var'        : independent_var_len,
+        'use_caching'            : use_caching
+    }
+
+    for param, expected in validation_map.items():
+        if actual[param] != expected:
+            return err_msg.format(
+                param,
+                actual[param],
+                expected
+            )
+
+    return 'PASS'  # actual params match expected params
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Helper to rotate an array of int's
+CREATE OR REPLACE FUNCTION rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[-1:] + keys[:-1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION reverse_rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[1:] + keys[:1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION setup_model_tables(
+    input_table TEXT,
+    output_table TEXT,
+    cached_source_table TEXT
+) RETURNS TEXT AS
+$$ 
+    fit_mult = GD['fit_mult']
+
+    fit_mult.model_input_tbl = input_table
+    fit_mult.model_output_tbl = output_table
+    fit_mult.cached_source_table = cached_source_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(output_table))
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(cached_source_table))
+    fit_mult.init_model_output_tbl()
+    q = """
+        UPDATE {model_out} -- Reduce size of model for faster tests
+            SET ( model_weights, model_arch, compile_params, fit_params )
+                  = ( mst_key::TEXT::BYTEA,
+                      ( '{{ "a" : ' || mst_key::TEXT || ' }}' )::JSON,
+                      'c' || mst_key::TEXT,
+                      'f' || mst_key::TEXT 
+                    )
+        WHERE mst_key IS NOT NULL;
+    """.format(model_out=fit_mult.model_output_tbl)
+    plpy.execute(q) 
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Updates dist keys in src table and internal fit_mult class variables
+--    num_data_segs can be larger than actual number of segments, since this
+--    is just for simulated testing.  This will also write to expected_distkey_mappings_tbl
+--    which can be used for validating dist key mappings and images per seg later.
+CREATE OR REPLACE FUNCTION update_dist_keys(
+    src_table TEXT,
+    num_data_segs INTEGER,
+    num_models INTEGER,
+    expected_distkey_mappings_tbl TEXT
+) RETURNS VOID AS
+$$ 
+    redist_cmd = """
+        UPDATE {src_table}
+            SET __dist_key__ = (buffer_id % {num_data_segs})
+    """.format(**globals())
+    plpy.execute(redist_cmd)
+
+    fit_mult = GD['fit_mult']
+
+    q = """
+        SELECT SUM(independent_var_shape[1]) AS image_count,
+            __dist_key__
+        FROM {src_table}
+        GROUP BY __dist_key__
+        ORDER BY __dist_key__
+    """.format(**globals())
+    res = plpy.execute(q)
+
+    images_per_seg = [ int(r['image_count']) for r in res ]
+    dist_keys = [ int(r['__dist_key__']) for r in res ]
+    num_dist_keys = len(dist_keys)
+
+    fit_mult.source_table = src_table
+    fit_mult.max_dist_key = sorted(dist_keys)[-1]
+    fit_mult.images_per_seg_train = images_per_seg
+    fit_mult.dist_key_mapping = fit_mult.dist_keys = dist_keys
+    fit_mult.accessible_gpus_per_seg = [0] * num_dist_keys
+    fit_mult.segments_per_host = num_data_segs
+
+    fit_mult.msts_for_schedule = fit_mult.msts[:num_models]
+    if num_models < num_dist_keys:
+        fit_mult.msts_for_schedule += [None] * \
+                                 (num_dist_keys - num_models)
+    fit_mult.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                              for mst in fit_mult.msts_for_schedule ]
+    fit_mult.num_msts = num_models
+
+    fit_mult.extra_dist_keys = []
+    for i in range(num_models - num_dist_keys):
+        fit_mult.extra_dist_keys.append(fit_mult.max_dist_key + 1 + i)
+    fit_mult.all_dist_keys = fit_mult.dist_key_mapping + fit_mult.extra_dist_keys
+
+    create_distkey_map_tbl_cmd = """
+        DROP TABLE IF EXISTS {exp_table};
+        CREATE TABLE {exp_table} AS
+        SELECT
+            ARRAY(  -- map of dist_keys to seg_ids from source table
+                SELECT __dist_key__
+                FROM {fm.source_table}
+                GROUP BY __dist_key__
+                ORDER BY __dist_key__  -- This would be gp_segment_id if it weren't a simulation
+            ) AS expected_dist_key_mapping,
+            ARRAY{fm.images_per_seg_train} AS expected_images_per_seg,
+            {num_data_segs} AS segments_per_host,
+            __dist_key__
+        FROM {fm.source_table}
+        GROUP BY __dist_key__
+        DISTRIBUTED BY (__dist_key__);
+    """.format(
+            fm=fit_mult,
+            num_data_segs=num_data_segs,
+            exp_table=expected_distkey_mappings_tbl
+        )
+    plpy.execute(create_distkey_map_tbl_cmd)
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION test_run_training(
+    source_table TEXT,
+    hop INTEGER,
+    is_very_first_hop BOOLEAN,
+    is_final_training_call BOOLEAN,
+    use_caching BOOLEAN
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    # Each time we start a new test, clear out stats
+    #   like num_calls from GD so we don't end up validating
+    #   against old results
+    if 'transition_function_params' in GD:
+        del GD['transition_function_params']
+
+    fit_mult.source_tbl = source_table
+    fit_mult.is_very_first_hop = is_very_first_hop
+    fit_mult.is_final_training_call = is_final_training_call
+    if use_caching != fit_mult.use_caching:
+        fit_mult.udf_plan = None  # Otherwise it will execute the wrong
+                                  # query when use_caching changes!
+    fit_mult.use_caching = use_caching
+
+    fit_mult.run_training(hop=hop, is_very_first_hop=is_very_first_hop)
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+m4_ifelse(m4_eval(__DBMS_VERSION_MAJOR__ <= 6),1,<!
+
+-- format() function for dynamic SQL was introduced in gpdb6,
+--   so for gpdb5 we need to define a simple version of it
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC TEXT[])
+RETURNS TEXT AS
+$$
+    import re
+    from internal.db_utils import quote_nullable
+    from utilities.utilities import quote_ident
+
+    seps = re.findall(r'%.', s)
+    pieces = re.split(r'%.', s)
+
+    res = pieces[0]
+    for i in range(len(seps)):
+        if vars[i] is None:
+            raise ValueError('Not enough params passed to format({})'.format(s))
+        else:
+            c = seps[i][1]
+            if c == '%':
+                quoted = '%'
+            elif c == 's':
+                quoted = vars[i]
+            elif c == 'L':
+                quoted = quote_nullable(vars[i])
+            elif c == 'I':
+                quoted = quote_ident(vars[i])
+            else:
+                raise ValueError('{} in format({}) not recognized'.format(seps[i],s))
+            res += quoted
+        res += pieces[i+1]
+    return res
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION format(s TEXT,
+                                  v1 INTEGER[],
+                                  v2 INTEGER[],
+                                  v3 TEXT,
+                                  v4 TEXT,
+                                  v5 INTEGER[],
+                                  v6 TEXT,
+                                  v7 INTEGER[]
+)
+RETURNS TEXT AS
+$$
+    SELECT format($1,
+               $2::TEXT,
+               $3::TEXT,
+               $4,
+               $5,
+               $6::TEXT,
+               $7,
+               $8::TEXT
+        );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC ANYARRAY)
+RETURNS TEXT AS
+$$
+    --SELECT format($1, $2::TEXT[])
+    SELECT $2;
+$$ LANGUAGE sql IMMUTABLE;
+
+!>) -- m4_endif DBMS_VERSION_MAJOR <= 5
+
+CREATE OR REPLACE FUNCTION validate_mst_key_order(output_tbl TEXT, expected_tbl TEXT)
+RETURNS VOID AS
+$$
+DECLARE
+    actual INTEGER[];
+    expected INTEGER[];
+    res RECORD;
+BEGIN
+    EXECUTE format(
+        'SELECT ARRAY(' ||
+            'SELECT mst_key FROM %I ' ||
+                'ORDER BY __dist_key__)',
+        output_tbl
+    ) INTO actual;
+
+    EXECUTE format(
+        'SELECT mst_keys FROM %I',
+        expected_tbl
+    ) INTO expected;
+
+    EXECUTE format(
+        'SELECT assert(' ||
+            '%L = %L, ' ||
+            '%L || ' ||
+            '%L || %L::TEXT || ' ||
+            '%L || %L::TEXT' ||
+        ')',
+        actual,
+        expected,
+        'mst keys found in wrong order / wrong segments!',
+        E'\nActual: ',
+        actual,
+        E'\nExpected: ',
+        expected
+    ) INTO res;
+END
+$$ LANGUAGE PLpgSQL VOLATILE;
+
+-- Create mst table
+DROP TABLE IF EXISTS iris_mst_table, iris_mst_table_summary;
+SELECT load_model_selection_table(
+    'iris_model_arch',
+    'iris_mst_table',
+    ARRAY[1],
+    ARRAY[
+        $$loss='categorical_crossentropy',optimizer='Adam(lr=0.1)',metrics=['accuracy']$$,
+        $$loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy']$$,
+        $$loss='categorical_crossentropy',optimizer='Adam(lr=0.001)',metrics=['accuracy']$$
+    ],
+    ARRAY[
+        $$batch_size=5,epochs=1$$,
+        $$batch_size=10,epochs=1$$
+    ]
+);
+
+-- Create FitMultiple object for running test functions
+SELECT init_fit_mult('iris_data_15buf_packed', 'iris_mst_table');
+
+CREATE TABLE src_3segs AS
+    SELECT * FROM iris_data_15buf_packed
+    DISTRIBUTED BY (__dist_key__);
+
+-- Simulate 6 models on 3 segments --
+SELECT update_dist_keys('src_3segs', 3, 6, 'expected_dist_key_mappings');
+
+--=== Test init_schedule_tbl() ===--
+-- ====================================================================
+-- ===========  Enough setup, now for the actual tests! ===============
+-- ====================================================================
+
+SELECT test_init_schedule('current_schedule');
+SELECT assert(
+    s.mst_key IS NOT NULL AND m.mst_key IS NOT NULL,
+    'mst_keys in schedule table created by test_init_schedule() does not match keys in mst_table'
+) FROM current_schedule s FULL JOIN iris_mst_table m USING (mst_key);
+
+-- Save order of mst keys in schedule for tracking 
+DROP TABLE IF EXISTS expected_order;
+CREATE TABLE expected_order AS SELECT ARRAY(SELECT mst_key FROM current_schedule ORDER BY __dist_key__) mst_keys;
+
+--=== Test rotate_schedule() ===--
+SELECT test_rotate_schedule('current_schedule');
+UPDATE expected_order SET mst_keys=rotate_keys(mst_keys);
+SELECT validate_mst_key_order('current_schedule', 'expected_order');
+UPDATE expected_order SET mst_keys=reverse_rotate_keys(mst_keys);  -- Undo for later
+
+-- Initialize model_output table, and set model_input & cached_src table names
+SELECT setup_model_tables('model_input', 'model_output', 'cached_src');
+SELECT validate_mst_key_order('model_output', 'expected_order');
+
+-- Order of params in run_training test function (for reference below):
+--
+--  test_run_training(src_tbl, hop, is_v_first_hop, is_final_call, use_caching)
+
+--=== Test first hop of an iteration - no caching (# msts > # segs) ===--
+SELECT test_run_training('src_3segs', 0, False, False, False);

Review comment:
       I see, that makes sense. Do you think it would make sense to add a test case for "very first hop" with caching disabled?
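   For reference, such a case would just be one more invocation of the helper above with is_very_first_hop set; a hypothetical example, using the same table and parameter order as the existing calls:

       --=== Hypothetical: very first hop, caching disabled ===--
       SELECT test_run_training('src_3segs', 0, True, False, False);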







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r538674096



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})

Review comment:
       Oh, I responded at first thinking this was the hop query.
   
   The distribution of this table shouldn't matter much, since there are no weights in it... but yes, I think you're right, I'll change it to prev dist key.
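   Concretely, the suggested change would only touch the DISTRIBUTED BY clause of the query quoted above.  A sketch with placeholder mst_key and dist_key values (the real code substitutes the class attributes):

       -- Distribute the schedule by the key the weights are currently on, so the
       -- hop query's join on __prev_dist_key__ is co-located with model_output.
       CREATE TABLE schedule AS
           WITH map AS
               (SELECT unnest(ARRAY[1,2,3])    AS mst_key,
                       unnest(ARRAY[10,20,30]) AS __dist_key__)
           SELECT map.mst_key,
                  model_id,
                  map.__dist_key__ AS __prev_dist_key__,
                  map.__dist_key__
           FROM map LEFT JOIN model_selection_table
               USING (mst_key)
           DISTRIBUTED BY (__prev_dist_key__);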







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r537797983



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -196,72 +195,110 @@ class FitMultipleModel():
             self.dist_key_mapping_valid, self.images_per_seg_valid = \
                 get_image_count_per_seg_for_minibatched_data_from_db(
                     self.validation_table)
-        self.mst_weights_tbl = unique_string(desp='mst_weights')
-        self.mst_current_schedule_tbl = unique_string(desp='mst_current_schedule')
+        self.model_input_tbl = unique_string(desp='model_input')
+        self.schedule_tbl = unique_string(desp='schedule')
 
-        self.dist_keys = query_dist_keys(self.source_table, dist_key_col)
-        if len(self.msts) < len(self.dist_keys):
+        self.dist_keys = query_dist_keys(self.source_table, self.dist_key_col)
+        DEBUG.plpy.info("init_dist_keys = {0}".format(self.dist_keys))
+        self.max_dist_key = sorted(self.dist_keys)[-1]
+        DEBUG.plpy.info("sorted_dist_keys = {0}".format(sorted(self.dist_keys)))
+        DEBUG.plpy.info("max_dist_key = {0}".format(self.max_dist_key))
+        self.extra_dist_keys = []
+
+        num_msts = len(self.msts)
+        num_dist_keys = len(self.dist_keys)
+
+        if num_msts < num_dist_keys:
             self.msts_for_schedule = self.msts + [None] * \
-                                     (len(self.dist_keys) - len(self.msts))
+                                     (num_dist_keys - num_msts)
         else:
             self.msts_for_schedule = self.msts
+            if num_msts > num_dist_keys:
+                for i in range(num_msts - num_dist_keys):
+                    self.extra_dist_keys.append(self.max_dist_key + 1 + i)
+
+        DEBUG.plpy.info('dist_keys : {}'.format(self.dist_keys))
+        DEBUG.plpy.info('extra_dist_keys : {}'.format(self.extra_dist_keys))
+
         random.shuffle(self.msts_for_schedule)
-        self.grand_schedule = self.generate_schedule(self.msts_for_schedule)
-        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
-        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
-        if self.warm_start:
-            self.create_model_output_table_warm_start()
-        else:
-            self.create_model_output_table()
+        # Comma-separated list of the mst_keys, including NULL's
+        #  This will be used to pass the mst keys to the db as
+        #  a sql ARRAY[]
+        self.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                for mst in self.msts_for_schedule ]
 
-        self.weights_to_update_tbl = unique_string(desp='weights_to_update')
-        self.fit_multiple_model()
+        # List of all dist_keys, including any extra dist keys beyond
+        #  the # segments we'll be training on--these represent the
+        #  segments models will rest on while not training, which
+        #  may overlap with the ones that will have training on them.
+        self.all_dist_keys = self.dist_keys + self.extra_dist_keys
 
-        # Update and cleanup metadata tables
-        self.insert_info_table()
-        self.create_model_summary_table()
-        if self.warm_start:
-            self.cleanup_for_warm_start()
-        reset_cuda_env(original_cuda_env)
+        self.gp_segment_id_col = '0' if is_platform_pg() else GP_SEGMENT_ID_COLNAME
+        self.unlogged_table = "UNLOGGED" if is_platform_gp6_or_up() else ''
 
     def fit_multiple_model(self):
+        self.init_schedule_tbl()
+        self.init_model_output_tbl()
+        self.init_model_info_tbl()
+
         # WARNING: set orca off to prevent unwanted redistribution
         with OptimizerControl(False):
             self.start_training_time = datetime.datetime.now()
             self.metrics_elapsed_start_time = time.time()
             self.train_multiple_model()
             self.end_training_time = datetime.datetime.now()
 
-    def cleanup_for_warm_start(self):
+        # Update and cleanup metadata tables
+        self.insert_info_table()
+        self.create_model_summary_table()
+        self.write_final_model_output_tbl()
+        reset_cuda_env(self.original_cuda_env)
+
+    def write_final_model_output_tbl(self):
         """
-        1. drop original model table
+        1. drop original model table if exists
         2. rename temp to original
         :return:
         """
-        drop_query = "DROP TABLE IF EXISTS {}".format(
-            self.original_model_output_table)
-        plpy.execute(drop_query)
-        rename_table(self.schema_madlib, self.model_output_table,
-                     self.original_model_output_table)
+        final_output_table_create_query = """
+                                    DROP TABLE IF EXISTS {self.original_model_output_tbl};
+                                    CREATE TABLE {self.original_model_output_tbl} AS

Review comment:
       I did try that, but it gets a bit messy.
   
   There are 3 differences between the schemas:
      1. The temporary model table has 3 extra columns:  compile_params, fit_params, and object_map
      2. The temporary model table has a PRIMARY KEY constraint (a combination of __dist_key__ and mst_key), which means it also has a SEQUENCE relation that exists in the same schema with a similar name.
      3. The temporary table is an unlogged table, while presumably the user expects the output to be an ordinary logged table.
   
   The first we could fix by DROP'ing the 3 extra columns after the ALTER TABLE RENAME.  But even then, there's still a minor issue with it:  the user is left with a table which is probably fragmented on disk in a weird way... with the extra columns probably still there but marked as deleted.  So to clean that up, we'd probably want to run VACUUM or something on it, to consolidate the table a bit so it's not taking up a larger footprint than it otherwise would.
   
   The second one isn't too bad, but also requires extra work:  if you do the table rename, it doesn't rename the SEQUENCE, and gets errors when you try to access it.  So you either have to do another RENAME on the SEQUENCE, or just DROP it.
   
   The third one I see as the most problematic, unless there is an easy way to change it from UNLOGGED to LOGGED.  Seems like an undesirable outcome the user might not expect.
   
   I didn't get as far as looking up how to fix the 3rd one, since at that point, it just seemed simpler and cleaner to DROP the temp table and create a nice fresh table as output that we know isn't going to have any issues.
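   For illustration, a rough sketch of the copy-into-a-fresh-table approach described here, with simplified table and column names (the real query is built from the class attributes):

       -- Copy only the user-facing columns out of the unlogged working table into
       -- a fresh, ordinary (logged) table under the user's output name, then drop
       -- the working table instead of renaming it.
       DROP TABLE IF EXISTS model_out;            -- user-facing output name
       CREATE TABLE model_out AS
           SELECT mst_key, model_weights, model_arch
           FROM model_out_working                 -- unlogged temp table
           WHERE mst_key IS NOT NULL;
       DROP TABLE model_out_working;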







[GitHub] [madlib] reductionista commented on pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-732492520


   Thanks for testing it out Frank!
   
   You're the first to test this with tensorflow 1.13, and based on one of the errors you're seeing, I'm guessing you're also the first to test it with gpdb5.  I only tested it on gpdb6 and postgres.
   
   > WARNING:  This version of tensorflow does not support XLA auto-cluster JIT optimization.  HINT:  upgrading tensorflow may improve performance.  (seg1 slice1 10.128.0.41:40001 pid=6271)
   > CONTEXT:  PL/Python function "fit_transition"
   > ```
   > 
   > What does user need to do to enable XLA? I am on TF 1.13.1 currently.
   
   Nothing, probably just need to upgrade to TF 1.14.  I should add the specific minimum version requirement in the warning message.  I think it was added in 1.14 but I've only used 1.10 and 1.14, so I'll need to confirm that.  Could also be that it's been in there for a while, and they just changed how you access it.  I'll look both of those up, and also run some more comprehensive performance tests to see how much of a difference it actually makes and in what kind of setups.  If it looks like it's not helping much anyway (even though the TF docs say it should) then we can get rid of this message... or maybe even not bother to enable it.  I figure if it does give better performance consistently across models, then we may as well let the user know there's an easy way to get that, without having to tune any parameters or upgrade hardware.  Would also be important to know if there are some cases where it could hurt performance.
    
   > ERROR:  plpy.Error: madlib_keras_fit_multiple_model error: No GPUs configured on hosts. (plpython.c:5038)
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
   >     fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
   >   PL/Python function "madlib_keras_fit_multiple_model", line 147, in __init__
   >   PL/Python function "madlib_keras_fit_multiple_model", line 295, in get_accessible_gpus_for_seg
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > so it looks like `use gpus=NULL` is now defaulting to `TRUE` but it should default to `FALSE` i.e., CPUs. It used to default to CPUs.
   
   Good catch!  For some reason, I got a different error when I tried to reproduce this (about using BOOLEAN::None in a query).  But I think I see the cause of it--fixing.  I thought we had some dev-check tests for this; will add one if we don't.
   
   > ```
   > ERROR:  plpy.SPIError: PRIMARY KEY and DISTRIBUTED BY definitions incompatible
   > HINT:  When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
   >     fit_obj.fit_multiple_model()
   >   PL/Python function "madlib_keras_fit_multiple_model", line 241, in fit_multiple_model
   >   PL/Python function "madlib_keras_fit_multiple_model", line 509, in init_model_output_tbl
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > I have also seen this error when running on a single segment.
   
   Interesting!  Looks like this was due to an unreasonably strict check in the gpdb5 server code, where it required a specific order for the keys in the PRIMARY KEY constraint.  The server team has fixed that in gpdb6:
   https://github.com/greenplum-db/gpdb/commit/e572af54e4255d501d261d51d32e1f3d3cf4b9c9
   
   I can't recall for sure if we decided we care about supporting deep learning in gpdb5 for this release.  But since it's an easy fix, I'll just reverse the order of the keys so that it works on both.
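   To make the gpdb5 constraint concrete, a minimal sketch with a simplified column list (the actual output table carries more columns):

       -- gpdb5 requires the DISTRIBUTED BY column(s) to be equal to, or a
       -- left-subset of, the PRIMARY KEY; listing __dist_key__ first in the
       -- constraint satisfies both gpdb5 and gpdb6.
       CREATE UNLOGGED TABLE model_output_tmp (
           __dist_key__   INTEGER,
           mst_key        INTEGER,
           model_weights  BYTEA,
           model_arch     JSON,
           PRIMARY KEY (__dist_key__, mst_key)
       ) DISTRIBUTED BY (__dist_key__);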





[GitHub] [madlib] khannaekta commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
khannaekta commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r556801794



##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. We mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overridden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(

Review comment:
       I am wondering if the CREATE OR REPLACE here for this function would cause issues.
   Does this function get dropped once this test is over? If not, then for the tests that run after this, whenever we call `fit_transition_multiple_model()`, won't it refer to this mock instead of the actual madlib.fit_transition_multiple_model()?
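   If cleanup does turn out to be needed, a hypothetical teardown at the end of the test file could drop the mock by its full argument signature, e.g.:

       -- Hypothetical teardown: drop the mock explicitly so it cannot shadow the
       -- real function in anything that runs after this test file.
       DROP FUNCTION IF EXISTS madlib_installcheck_deep_learning.fit_transition_multiple_model(
           BYTEA, BYTEA, INTEGER[], INTEGER[], TEXT, TEXT, TEXT, INTEGER, INTEGER[],
           INTEGER, INTEGER, INTEGER[], INTEGER[], BYTEA, BOOLEAN, BOOLEAN, BYTEA
       );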
   







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r537825924



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -629,149 +794,187 @@ class FitMultipleModel():
         # Therefore we want to have queries that do not add motions and all the
         # sub-queries running Keras/tensorflow operations reuse the same slice(process)
         # that was used for initializing GPU memory.
-        use_gpus = self.use_gpus if self.use_gpus else False
-        mst_weights_query = """
-            CREATE {self.unlogged_table} TABLE {self.mst_weights_tbl} AS
-                SELECT mst_tbl.*, wgh_tbl.{self.model_weights_col},
-                       model_arch_tbl.{self.model_arch_col}
-                FROM
-                    {self.mst_current_schedule_tbl} mst_tbl
-                    LEFT JOIN {self.model_output_table} wgh_tbl
-                    ON mst_tbl.{self.mst_key_col} = wgh_tbl.{self.mst_key_col}
-                        LEFT JOIN {self.model_arch_table} model_arch_tbl
-                        ON mst_tbl.{self.model_id_col} = model_arch_tbl.{self.model_id_col}
-                DISTRIBUTED BY ({dist_key_col})
-        """.format(dist_key_col=dist_key_col,
-                   **locals())
-        plpy.execute(mst_weights_query)
-        use_gpus = self.use_gpus if self.use_gpus else False
-        dep_shape_col = self.dep_shape_col
-        ind_shape_col = self.ind_shape_col
+
+        DEBUG.start_timing("run_training")
+        if hop > 0:
+            DEBUG.print_mst_keys(self.model_output_tbl, 'before_hop')
+            DEBUG.start_timing("hop")
+            hop_query = """
+                CREATE {self.unlogged_table} TABLE {self.model_input_tbl} AS
+                    SELECT o.{self.mst_key_col},
+                           o.{self.model_weights_col},
+                           o.{self.model_arch_col},
+                           o.{self.compile_params_col},
+                           o.{self.fit_params_col},
+                           o.{self.object_map_col},
+                           s.{self.dist_key_col}
+                    FROM {self.model_output_tbl} o JOIN {self.schedule_tbl} s
+                        ON o.{self.dist_key_col} = s.{self.prev_dist_key_col}
+                    DISTRIBUTED BY ({self.dist_key_col});
+            """.format(self=self)
+
+            DEBUG.plpy.execute(hop_query)
+
+            DEBUG.print_timing("hop")
+            DEBUG.print_mst_keys(self.model_input_tbl, 'after_hop')
+
+            DEBUG.start_timing("truncate_output")
+            self.truncate_and_drop(self.model_output_tbl)
+            DEBUG.print_timing("truncate_output")
+        else:
+            # Skip hop if it's the first in an iteration, just rename
+            plpy.execute("""
+                ALTER TABLE {self.model_output_tbl}
+                    RENAME TO {self.model_input_tbl}
+            """.format(self=self))
+ 
+        ind_shape = self.ind_shape_col
+        dep_shape = self.dep_shape_col
         dep_var = mb_dep_var_col
         indep_var = mb_indep_var_col
         source_table = self.source_table
-        where_clause = "WHERE {self.mst_weights_tbl}.{self.mst_key_col} IS NOT NULL".format(self=self)
+
         if self.use_caching:
             # Caching populates the independent_var and dependent_var into the cache on the very first hop
             # For the very_first_hop, we want to run the transition function on all segments, including
-            # the one's where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
+            # the ones where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
             # on mst_key. Once the cache is populated, with the independent_var and dependent_var values
             # for all subsequent hops pass independent_var and dependent_var as NULL's and use a dummy src
             # table to join for referencing the dist_key
             if is_very_first_hop:
                 plpy.execute("""
                     DROP TABLE IF EXISTS {self.cached_source_table};
-                    CREATE TABLE {self.cached_source_table} AS SELECT {dist_key_col} FROM {self.source_table} GROUP BY {dist_key_col} DISTRIBUTED BY({dist_key_col});
-                    """.format(self=self, dist_key_col=dist_key_col))
+                    CREATE TABLE {self.cached_source_table} AS
+                        SELECT {self.dist_key_col} FROM {self.source_table}
+                            GROUP BY {self.dist_key_col}
+                                DISTRIBUTED BY({self.dist_key_col});
+                    """.format(self=self))
             else:
-                dep_shape_col = 'ARRAY[0]'
-                ind_shape_col = 'ARRAY[0]'
-                dep_var = 'NULL'
-                indep_var = 'NULL'
+                dep_shape = ind_shape = 'NULL'
+                dep_var = indep_var = 'NULL'
                 source_table = self.cached_source_table
-            if is_very_first_hop or self.is_final_training_call:
-                where_clause = ""
-
-        uda_query = """
-            CREATE {self.unlogged_table} TABLE {self.weights_to_update_tbl} AS
-            SELECT {self.schema_madlib}.fit_step_multiple_model({mb_dep_var_col},
-                {mb_indep_var_col},
-                {dep_shape_col},
-                {ind_shape_col},
-                {self.mst_weights_tbl}.{self.model_arch_col}::TEXT,
-                {self.mst_weights_tbl}.{self.compile_params_col}::TEXT,
-                {self.mst_weights_tbl}.{self.fit_params_col}::TEXT,
-                src.{dist_key_col},
-                ARRAY{self.dist_key_mapping},
-                src.{self.gp_segment_id_col},
-                {self.segments_per_host},
-                ARRAY{self.images_per_seg_train},
-                {use_gpus}::BOOLEAN,
-                ARRAY{self.accessible_gpus_for_seg},
-                {self.mst_weights_tbl}.{self.model_weights_col}::BYTEA,
-                {is_final_training_call}::BOOLEAN,
-                {use_caching}::BOOLEAN,
-                {self.mst_weights_tbl}.{self.object_map_col}::BYTEA
-                )::BYTEA AS {self.model_weights_col},
-                {self.mst_weights_tbl}.{self.mst_key_col} AS {self.mst_key_col}
-                ,src.{dist_key_col} AS {dist_key_col}
-            FROM {source_table} src JOIN {self.mst_weights_tbl}
-                USING ({dist_key_col})
-            {where_clause}
-            GROUP BY src.{dist_key_col}, {self.mst_weights_tbl}.{self.mst_key_col}
-            DISTRIBUTED BY({dist_key_col})
-            """.format(mb_dep_var_col=dep_var,
-                       mb_indep_var_col=indep_var,
-                       dep_shape_col=dep_shape_col,
-                       ind_shape_col=ind_shape_col,
-                       is_final_training_call=self.is_final_training_call,
+
+        res = plpy.execute("""
+            SELECT count(*)
+            FROM {self.model_input_tbl}
+        """.format(self=self))
+        if res:
+            DEBUG.plpy.info("rows in model_input table: {}".format(res[0]['count']))
+        else:
+            DEBUG.plpy.error("No rows in model_input table!")
+
+#TODO: prepare this statement once, then just fill in the params with execute()
+#      on all the rest of the hops / iterations
+
+        DEBUG.start_timing("udf")
+        udf_query = plpy.prepare("""
+            CREATE {self.unlogged_table} TABLE {self.model_output_tbl} AS
+            SELECT
+                model_in.{self.mst_key_col},
+                CASE WHEN model_in.{self.dist_key_col} > {self.max_dist_key}
+                THEN
+                    model_in.{self.model_weights_col}
+                ELSE
+                    {self.schema_madlib}.fit_transition_multiple_model(
+                        {dep_var_col},
+                        {indep_var_col},
+                        {dep_shape},
+                        {ind_shape},
+                        model_in.{self.model_arch_col}::TEXT,
+                        model_in.{self.compile_params_col}::TEXT,
+                        model_in.{self.fit_params_col}::TEXT,
+                        src.{self.dist_key_col},
+                        ARRAY{self.dist_key_mapping},
+                        src.{self.gp_segment_id_col},
+                        {self.segments_per_host},
+                        ARRAY{self.images_per_seg_train},
+                        {self.use_gpus}::BOOLEAN,
+                        ARRAY{self.accessible_gpus_for_seg},
+                        model_in.{self.model_weights_col}::BYTEA,
+                        {self.is_final_training_call}::BOOLEAN,
+                        {use_caching}::BOOLEAN,
+                        model_in.{self.object_map_col}::BYTEA
+                    )
+                END::BYTEA AS {self.model_weights_col},
+                model_in.{self.model_arch_col},
+                model_in.{self.compile_params_col},
+                model_in.{self.fit_params_col},
+                model_in.{self.object_map_col},
+                model_in.{self.dist_key_col}
+            FROM {self.model_input_tbl} model_in
+                FULL JOIN {source_table} src
+                USING ({self.dist_key_col}) 
+            DISTRIBUTED BY({self.dist_key_col})
+            """.format(dep_var_col=dep_var,
+                       indep_var_col=indep_var,
+                       dep_shape=dep_shape,
+                       ind_shape=ind_shape,
                        use_caching=self.use_caching,
-                       dist_key_col=dist_key_col,
-                       use_gpus=use_gpus,
                        source_table=source_table,
-                       where_clause=where_clause,
                        self=self
                        )
-        plpy.execute(uda_query)
+        )
+
+        try:
+            plpy.execute(udf_query)
+        except plpy.SPIError as e:
+            msg = e.message
+            if not 'TransAggDetail' in msg:
+                raise e
+            e.message, detail = msg.split('TransAggDetail')
+            # Extract Traceback from segment, add to
+            #  DETAIL of error message on coordinator
+            e.args = (e.message,)
+            spidata = list(e.spidata)
+            spidata[1] = detail
+            e.spidata = tuple(spidata)
+            raise e
+
+        DEBUG.print_timing("udf")
+
+        res = plpy.execute("""
+            SELECT {self.mst_key_col} AS mst_key, {self.model_weights_col} IS NOT NULL AS weights
+                FROM {self.model_output_tbl}
+        """.format(self=self))
+        if res:
+            null_msts = len([None for row in res if row['mst_key'] is None])
+            null_weights = len([None for row in res if row['weights'] is False])
+            DEBUG.plpy.info(
+                "{} rows total ({} mst_key=NULL and {} weights=NULL) in model_output table."\
+                    .format(res.nrows(), null_msts, null_weights))
+        else:
+            plpy.error("No rows in output of UDF!")
 
-        update_query = """
-            UPDATE {self.model_output_table}
-            SET {self.model_weights_col} = {self.weights_to_update_tbl}.{self.model_weights_col}
-            FROM {self.weights_to_update_tbl}
-            WHERE {self.model_output_table}.{self.mst_key_col} = {self.weights_to_update_tbl}.{self.mst_key_col}
-        """.format(self=self)
-        plpy.execute(update_query)
+        plpy.execute("DELETE FROM {self.model_output_tbl} WHERE model_weights IS NULL".format(self=self))

Review comment:
       I was also thinking at first that we could do it that way, using a WHERE clause.  But unfortunately, that can only be used to filter the input rows of a query, not the output rows.  The only way I can think of to filter the output rows is to make this query a sub-query of an outer query that does the filtering.
   
   In terms of performance, adding a subquery would be pretty similar to the way it is now, I think, except that the filtering operation would run as a separate slice... one that mostly has to wait for the first slice to complete before it can run.  And I think we want to avoid any extra slices for this one, in case it messes up SD/GD.
   
   I also wonder if we could use a GROUP BY with a HAVING clause.  But that might also add another slice, and/or slightly reduce performance... possibly for the same reason the UDA ran slightly slower.
   
   The DELETE operation is really fast, since it just scans through the rows and marks some as deleted.  The only undesirable effect is that it leaves the rows fragmented on disk, but since we're about to TRUNCATE and DROP the table anyway, it shouldn't matter.
   
   We could also change the hop query to filter them out with a WHERE clause when it's copying from `model_output_tbl` to `model_input_tbl`.  But then I guess we'd have to do something similar for eval.  And essentially, this would just be another way of accomplishing the same thing as the DELETE: telling both of them to "ignore these rows".
   
   If you have any other ideas, let me know and I can try it.  (Perhaps using the WITH keyword?  My impression has always been that WITH is just another way to introduce a separate sub-query / slice, but I'm not sure.)
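   
   For reference, a rough sketch of the two approaches being compared (the DELETE matches the patch; the subquery alternative is shown schematically rather than repeating the full UDF training query):
   
   ```
   -- Current approach: run the training query, then mark the rows with no
   -- weights as deleted before the table is truncated and dropped later.
   DELETE FROM model_output_tbl WHERE model_weights IS NULL;
   
   -- Subquery alternative discussed above: filter the output rows in an outer
   -- query, at the cost of an extra slice.
   -- CREATE TABLE model_output_tbl AS
   --   SELECT * FROM ( /* full UDF training query goes here */ ) AS trained
   --   WHERE trained.model_weights IS NOT NULL
   --   DISTRIBUTED BY (__dist_key__);
   ```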







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r556087855



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -27,6 +27,7 @@ from madlib_keras_model_selection import ModelSelectionSchema
 from keras_model_arch_table import ModelArchSchema
 from utilities.validate_args import table_exists, drop_tables, input_tbl_valid
 from utilities.validate_args import quote_ident
+from madlib_keras_helper import DISTRIBUTION_KEY_COLNAME

Review comment:
       Right







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r556075238



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:

Review comment:
       As I understand it, we require the input table to be minibatched--so every input row must contain an array, and the sizes of all but the first dimension of the arrays must match across rows.  A NULL value for a class should be okay, and maybe even a NULL image somewhere in the array (although I'm not sure what that would mean), but if either the x or y array itself is NULL, or the shape is NULL, that's invalid input, since the table wasn't minibatched properly.
   
   I'm not sure if we explicitly error out in the validation code during initialization in that case.  I figured that if we were allowing it before, it was probably unintentional.  Is there a reason why we would need to support that?
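   
   A validation query along these lines could catch improperly packed rows during initialization.  This is purely illustrative: `source_table_packed` is an assumed name for the minibatched input, and the column names follow the fit_transition parameters above.
   
   ```
   -- Count minibatched rows whose arrays or shapes are NULL; any such row
   -- means the input table was not minibatched properly.
   SELECT count(*) AS invalid_rows
   FROM source_table_packed
   WHERE dependent_var IS NULL
      OR independent_var IS NULL
      OR dependent_var_shape IS NULL
      OR independent_var_shape IS NULL;
   ```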







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r537838544



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +377,307 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
 
-    def create_model_output_table_warm_start(self):
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)
+
+        plpy.execute(self.rotate_schedule_tbl_plan)
+
+        self.truncate_and_drop(self.schedule_tbl)
+        plpy.execute("""
+            ALTER TABLE {self.next_schedule_tbl}
+            RENAME TO {self.schedule_tbl}
+        """.format(self=self))
+
+    def load_warm_start_weights(self):
         """
-        For warm start, we need to copy the model output table to a temp table
-        because we call truncate on the model output table while training.
-        If the query gets aborted, we need to make sure that the user passed
-        model output table can be recovered.
+        For warm start, we need to copy any rows of the model output
+        table provided by the user whose mst keys appear in the
+        supplied model selection table.  We also copy over the 
+        compile & fit params from the model_selection_table, and
+        the dist_key's from the schedule table.
         """
-        plpy.execute("""
-            CREATE TABLE {self.model_output_table} (
-            LIKE {self.original_model_output_table} INCLUDING indexes);
-            """.format(self=self))
+        load_warm_start_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    o.{self.model_weights_col},
+                    o.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.original_model_output_tbl} o
+                        USING ({self.mst_key_col})
+        """.format(self=self)
+        DEBUG.plpy.execute(load_warm_start_weights_query)
 
-        plpy.execute("""INSERT INTO {self.model_output_table}
-            SELECT * FROM {self.original_model_output_table};
-            """.format(self=self))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(self.model_info_tbl))
 
-        plpy.execute(""" DELETE FROM {self.model_output_table}
-                WHERE {self.mst_key_col} NOT IN (
-                    SELECT {self.mst_key_col} FROM {self.model_selection_table})
-                """.format(self=self))
-        self.warm_start_msts = plpy.execute(
-            """ SELECT array_agg({0}) AS a FROM {1}
-            """.format(self.mst_key_col, self.model_output_table))[0]['a']
-        plpy.execute("DROP TABLE {0}".format(self.model_info_table))
-        self.initialize_model_output_and_info()
-
-    def initialize_model_output_and_info(self):
+    def load_xfer_learning_weights(self, warm_start=False):
+        """
+            Copy transfer learning weights from
+            model_arch table.  Ignore models with
+            no xfer learning weights, these will
+            be generated by keras and added one at a
+            time later.
+        """
+        load_xfer_learning_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    a.{self.model_weights_col},
+                    a.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.model_arch_table} a
+                        ON m.{self.model_id_col} = a.{self.model_id_col}
+                WHERE a.{self.model_weights_col} IS NOT NULL;
+        """.format(self=self)
+        DEBUG.plpy.execute(load_xfer_learning_weights_query)
+
+    def init_model_output_tbl(self):
+        DEBUG.start_timing('init_model_output_and_info')
+
+        output_table_create_query = """
+                                    CREATE TABLE {self.model_output_tbl}
+                                    ({self.mst_key_col} INTEGER,
+                                     {self.model_weights_col} BYTEA,
+                                     {self.model_arch_col} JSON,
+                                     {self.compile_params_col} TEXT,
+                                     {self.fit_params_col} TEXT,
+                                     {self.object_map_col} BYTEA,
+                                     {self.dist_key_col} INTEGER,
+                                     PRIMARY KEY ({self.mst_key_col}, {self.dist_key_col}))
+                                     DISTRIBUTED BY ({self.dist_key_col})
+                                    """.format(self=self)
+        plpy.execute(output_table_create_query)
+
+        if self.warm_start:

Review comment:
       Re: 1, I'll think about it.  Any suggestions?
   
   2. It's only for transfer learning; that's actually the important part of the name.  Notice the IS NOT NULL part of the query, which makes sure it's only copying the transfer learning weights.
   
   The steps for initializing the model output table are:
   
   1.  Bulk load any warm start weights from previous model output table
   2.  Bulk load any transfer learning weights from model arch table
   3.  Iterate over all remaining mst_keys that haven't already been loaded, and initialize them with random weights, writing them one by one with INSERTs to the model output table (a rough sketch of this per-model INSERT is below).
   
   We could drop the IS NOT NULL condition in step 2, but since there are no weights, it would just be copying the tiny stuff (compile params, fit params, etc.)... so bulk loading doesn't really help there anyway.  And it would make things more difficult later, since we'd have to INSERT the weights in a different way for rows that already have some of the columns initialized but not others.  (Or just write over them, in which case nothing useful was accomplished by copying them the first time.)
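   
   A minimal sketch of the step-3 INSERT, with table names kept illustrative (`schedule_tbl`, `mst_table`, `model_arch_tbl`) rather than the exact names in the patch; `$1` stands for the serialized weights passed through a prepared statement after keras generates them:
   
   ```
   -- Insert one not-yet-initialized model, reusing the arch and params from the
   -- mst / model-arch tables and the dist key from the schedule table.
   INSERT INTO model_output_tbl
       (mst_key, model_weights, model_arch, compile_params, fit_params,
        object_map, __dist_key__)
   SELECT s.mst_key, $1, a.model_arch, m.compile_params, m.fit_params,
          NULL, s.__dist_key__
   FROM   schedule_tbl s
          JOIN mst_table m USING (mst_key)
          JOIN model_arch_tbl a ON (m.model_id = a.model_id)
   WHERE  s.mst_key = 42;  -- the single model being initialized in this pass
   ```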







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r558695581



##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. we mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overriden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(
+    dependent_var               BYTEA,
+    independent_var             BYTEA,
+    dependent_var_shape         INTEGER[],
+    independent_var_shape       INTEGER[],
+    model_architecture          TEXT,
+    compile_params              TEXT,
+    fit_params                  TEXT,
+    dist_key                    INTEGER,
+    dist_key_mapping            INTEGER[],
+    current_seg_id              INTEGER,
+    segments_per_host           INTEGER,
+    images_per_seg              INTEGER[],
+    accessible_gpus_for_seg     INTEGER[],
+    serialized_weights          BYTEA,
+    is_final_training_call      BOOLEAN,
+    use_caching                 BOOLEAN,
+    custom_function_map         BYTEA
+) RETURNS BYTEA AS
+$$
+    param_keys = [ 'compile_params', 'accessible_gpus_for_seg', 'dependent_var_shape', 'dist_key_mapping',
+                   'current_seg_id', 'segments_per_host', 'custom_function_map', 'is_final_training_call',
+                   'dist_key', 'serialized_weights', 'images_per_seg', 'model_architecture', 'fit_params',
+                   'independent_var_shape', 'use_caching' ]
+
+    num_calls = 1
+    if 'transition_function_params' in GD:
+        if dist_key in GD['transition_function_params']:
+            if not 'reset' in GD['transition_function_params'][dist_key]:
+                num_calls = GD['transition_function_params'][dist_key]['num_calls']
+                num_calls += 1
+
+    g = globals()
+    params = dict()
+
+    for k in param_keys:
+        params[k] = g[k]
+
+    params['dependent_var'] = len(dependent_var) if dependent_var else 0
+    params['independent_var'] = len(independent_var) if independent_var else 0
+    params['num_calls'] = num_calls
+
+    if not 'transition_function_params' in GD:
+        GD['transition_function_params'] = dict()
+    GD['transition_function_params'][dist_key] = params
+
+    # compute simulated seg_id ( current_seg_id is the actual seg_id )
+    seg_id = dist_key_mapping.index( dist_key )
+
+    if dependent_var_shape and dependent_var_shape[0] * num_calls < images_per_seg [ seg_id ]:
+        return None
+    else:
+        GD['transition_function_params'][dist_key]['reset'] = True
+        return serialized_weights
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION validate_transition_function_params(
+    current_seg_id                       INTEGER,
+    segments_per_host                    INTEGER,
+    images_per_seg                       INTEGER[],
+    expected_num_calls                   INTEGER,
+    expected_dist_key                    INTEGER,
+    expected_is_final_training_call      BOOLEAN,
+    expected_dist_key_mapping            INTEGER[],
+    dependent_var_len                    INTEGER,
+    independent_var_len                  INTEGER,
+    use_caching                          BOOLEAN
+) RETURNS TEXT AS
+$$
+    err_msg = "transition function was not called on segment "
+
+    if 'transition_function_params' not in GD:
+        return err_msg.format(current_seg_id)
+    elif expected_dist_key not in GD['transition_function_params']:
+        return err_msg + " for __dist_key__ = {}".format(expected_dist_key)
+    actual = GD['transition_function_params'][expected_dist_key]
+
+    err_msg = """Incorrect value for {} param passed to fit_transition_multiple_model:
+       Actual={}, Expected={}"""
+
+    validation_map = {
+        'current_seg_id'         : current_seg_id,
+        'segments_per_host'      : segments_per_host,
+        'num_calls'              : expected_num_calls,
+        'is_final_training_call' : expected_is_final_training_call,
+        'dist_key'               : expected_dist_key,
+        'dependent_var'          : dependent_var_len,
+        'independent_var'        : independent_var_len,
+        'use_caching'            : use_caching
+    }
+
+    for param, expected in validation_map.items():
+        if actual[param] != expected:
+            return err_msg.format(
+                param,
+                actual[param],
+                expected
+            )
+
+    return 'PASS'  # actual params match expected params
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Helper to rotate an array of int's
+CREATE OR REPLACE FUNCTION rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[-1:] + keys[:-1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION reverse_rotate_keys(
+    keys    INTEGER[]
+) RETURNS INTEGER[]
+AS $$
+   return keys[1:] + keys[:1]
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION setup_model_tables(
+    input_table TEXT,
+    output_table TEXT,
+    cached_source_table TEXT
+) RETURNS TEXT AS
+$$ 
+    fit_mult = GD['fit_mult']
+
+    fit_mult.model_input_tbl = input_table
+    fit_mult.model_output_tbl = output_table
+    fit_mult.cached_source_table = cached_source_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(output_table))
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(cached_source_table))
+    fit_mult.init_model_output_tbl()
+    q = """
+        UPDATE {model_out} -- Reduce size of model for faster tests
+            SET ( model_weights, model_arch, compile_params, fit_params )
+                  = ( mst_key::TEXT::BYTEA,
+                      ( '{{ "a" : ' || mst_key::TEXT || ' }}' )::JSON,
+                      'c' || mst_key::TEXT,
+                      'f' || mst_key::TEXT 
+                    )
+        WHERE mst_key IS NOT NULL;
+    """.format(model_out=fit_mult.model_output_tbl)
+    plpy.execute(q) 
+$$ LANGUAGE plpythonu VOLATILE;
+
+-- Updates dist keys in src table and internal fit_mult class variables
+--    num_data_segs can be larger than actual number of segments, since this
+--    is just for simulated testing.  This will also write to expected_distkey_mappings_tbl
+--    which can be used for validating dist key mappings and images per seg later.
+CREATE OR REPLACE FUNCTION update_dist_keys(
+    src_table TEXT,
+    num_data_segs INTEGER,
+    num_models INTEGER,
+    expected_distkey_mappings_tbl TEXT
+) RETURNS VOID AS
+$$ 
+    redist_cmd = """
+        UPDATE {src_table}
+            SET __dist_key__ = (buffer_id % {num_data_segs})
+    """.format(**globals())
+    plpy.execute(redist_cmd)
+
+    fit_mult = GD['fit_mult']
+
+    q = """
+        SELECT SUM(independent_var_shape[1]) AS image_count,
+            __dist_key__
+        FROM {src_table}
+        GROUP BY __dist_key__
+        ORDER BY __dist_key__
+    """.format(**globals())
+    res = plpy.execute(q)
+
+    images_per_seg = [ int(r['image_count']) for r in res ]
+    dist_keys = [ int(r['__dist_key__']) for r in res ]
+    num_dist_keys = len(dist_keys)
+
+    fit_mult.source_table = src_table
+    fit_mult.max_dist_key = sorted(dist_keys)[-1]
+    fit_mult.images_per_seg_train = images_per_seg
+    fit_mult.dist_key_mapping = fit_mult.dist_keys = dist_keys
+    fit_mult.accessible_gpus_per_seg = [0] * num_dist_keys
+    fit_mult.segments_per_host = num_data_segs
+
+    fit_mult.msts_for_schedule = fit_mult.msts[:num_models]
+    if num_models < num_dist_keys:
+        fit_mult.msts_for_schedule += [None] * \
+                                 (num_dist_keys - num_models)
+    fit_mult.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
+                              for mst in fit_mult.msts_for_schedule ]
+    fit_mult.num_msts = num_models
+
+    fit_mult.extra_dist_keys = []
+    for i in range(num_models - num_dist_keys):
+        fit_mult.extra_dist_keys.append(fit_mult.max_dist_key + 1 + i)
+    fit_mult.all_dist_keys = fit_mult.dist_key_mapping + fit_mult.extra_dist_keys
+
+    create_distkey_map_tbl_cmd = """
+        DROP TABLE IF EXISTS {exp_table};
+        CREATE TABLE {exp_table} AS
+        SELECT
+            ARRAY(  -- map of dist_keys to seg_ids from source table
+                SELECT __dist_key__
+                FROM {fm.source_table}
+                GROUP BY __dist_key__
+                ORDER BY __dist_key__  -- This would be gp_segment_id if it weren't a simulation
+            ) AS expected_dist_key_mapping,
+            ARRAY{fm.images_per_seg_train} AS expected_images_per_seg,
+            {num_data_segs} AS segments_per_host,
+            __dist_key__
+        FROM {fm.source_table}
+        GROUP BY __dist_key__
+        DISTRIBUTED BY (__dist_key__);
+    """.format(
+            fm=fit_mult,
+            num_data_segs=num_data_segs,
+            exp_table=expected_distkey_mappings_tbl
+        )
+    plpy.execute(create_distkey_map_tbl_cmd)
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION test_run_training(
+    source_table TEXT,
+    hop INTEGER,
+    is_very_first_hop BOOLEAN,
+    is_final_training_call BOOLEAN,
+    use_caching BOOLEAN
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    # Each time we start a new test, clear out stats
+    #   like num_calls from GD so we don't end up validating
+    #   against old results
+    if 'transition_function_params' in GD:
+        del GD['transition_function_params']
+
+    fit_mult.source_tbl = source_table
+    fit_mult.is_very_first_hop = is_very_first_hop
+    fit_mult.is_final_training_call = is_final_training_call
+    if use_caching != fit_mult.use_caching:
+        fit_mult.udf_plan = None  # Otherwise it will execute the wrong
+                                  # query when use_caching changes!
+    fit_mult.use_caching = use_caching
+
+    fit_mult.run_training(hop=hop, is_very_first_hop=is_very_first_hop)
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+m4_ifelse(m4_eval(__DBMS_VERSION_MAJOR__ <= 6),1,<!
+
+-- format() function for dynamic SQL was introduced in gpdb6,
+--   so for gpdb5 we need to define a simple version of it
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC TEXT[])
+RETURNS TEXT AS
+$$
+    import re
+    from internal.db_utils import quote_nullable
+    from utilities.utilities import quote_ident
+
+    seps = re.findall(r'%.', s)
+    pieces = re.split(r'%.', s)
+
+    res = pieces[0]
+    for i in range(len(seps)):
+        if vars[i] is None:
+            raise ValueError('Not enough params passed to format({})'.format(s))
+        else:
+            c = seps[i][1]
+            if c == '%':
+                quoted = '%'
+            elif c == 's':
+                quoted = vars[i]
+            elif c == 'L':
+                quoted = quote_nullable(vars[i])
+            elif c == 'I':
+                quoted = quote_ident(vars[i])
+            else:
+                raise ValueError('{} in format({}) not recognized'.format(seps[i],s))
+            res += quoted
+        res += pieces[i+1]
+    return res
+$$ LANGUAGE plpythonu IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION format(s TEXT,
+                                  v1 INTEGER[],
+                                  v2 INTEGER[],
+                                  v3 TEXT,
+                                  v4 TEXT,
+                                  v5 INTEGER[],
+                                  v6 TEXT,
+                                  v7 INTEGER[]
+)
+RETURNS TEXT AS
+$$
+    SELECT format($1,
+               $2::TEXT,
+               $3::TEXT,
+               $4,
+               $5,
+               $6::TEXT,
+               $7,
+               $8::TEXT
+        );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION format(s TEXT, vars VARIADIC ANYARRAY)
+RETURNS TEXT AS
+$$
+    --SELECT format($1, $2::TEXT[])
+    SELECT $2;
+$$ LANGUAGE sql IMMUTABLE;
+
+!>) -- m4_endif DBMS_VERSION_MAJOR <= 5
+
+CREATE OR REPLACE FUNCTION validate_mst_key_order(output_tbl TEXT, expected_tbl TEXT)
+RETURNS VOID AS
+$$
+DECLARE
+    actual INTEGER[];
+    expected INTEGER[];
+    res RECORD;
+BEGIN
+    EXECUTE format(
+        'SELECT ARRAY(' ||
+            'SELECT mst_key FROM %I ' ||
+                'ORDER BY __dist_key__)',
+        output_tbl
+    ) INTO actual;
+
+    EXECUTE format(
+        'SELECT mst_keys FROM %I',
+        expected_tbl
+    ) INTO expected;
+
+    EXECUTE format(

Review comment:
       You're right, this is simpler and more readable.
   
   I probably would have just done it this way initially, if I'd realized gpdb5 didn't have format().  After a series of refactors, it ended up a lot more complicated than it needed to be.







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r558637008



##########
File path: src/ports/postgres/modules/deep_learning/test/madlib_keras_fit_multiple.sql_in
##########
@@ -0,0 +1,845 @@
+m4_include(`SQLCommon.m4')
+m4_changequote(<<<,>>>)
+m4_ifdef(<<<__POSTGRESQL__>>>, -- Skip all fit multiple tests for postgres
+,<<<
+m4_changequote(<!,!>)
+
+-- =================== Setup & Initialization for FitMultiple tests ========================
+--
+--  For fit multiple, we test end-to-end functionality along with performance elsewhere.
+--  They take a long time to run.  Including similar tests here would probably not be worth
+--  the extra time added to dev-check.
+--
+--  Instead, we just want to unit test different python functions in the FitMultiple class.
+--  However, most of the important behavior we need to test requires access to an actual
+--  Greenplum database... mostly, we want to make sure that the models hop around to the
+--  right segments in the right order.  Therefore, the unit tests are here, as a part of
+--  dev-check. we mock fit_transition() and some validation functions in FitMultiple, but
+--  do NOT mock plpy, since most of the code we want to test is embedded SQL and needs to
+--  get through to gpdb. We also want to mock the number of segments, so we can test what
+--  the model hopping behavior will be for a large cluster, even though dev-check should be
+--  able to run on a single dev host.
+
+\i m4_regexp(MODULE_PATHNAME,
+             <!\(.*\)libmadlib\.so!>,
+            <!\1../../modules/deep_learning/test/madlib_keras_iris.setup.sql_in!>
+)
+
+-- Mock version() function to convince the InputValidator this is the real madlib schema
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.version() RETURNS VARCHAR AS
+$$
+    SELECT MADLIB_SCHEMA.version();
+$$ LANGUAGE sql IMMUTABLE;
+
+-- Call this first to initialize the FitMultiple object, before anything else happens.
+-- Pass a real mst table and source table, rest of FitMultipleModel() constructor params
+--  are filled in.  They can be overriden later, before test functions are called, if necessary.
+CREATE OR REPLACE FUNCTION init_fit_mult(
+    source_table            VARCHAR,
+    model_selection_table   VARCHAR
+) RETURNS VOID AS
+$$
+    import sys
+    from mock import Mock, patch
+
+    PythonFunctionBodyOnlyNoSchema(deep_learning,madlib_keras_fit_multiple_model)
+    schema_madlib = 'madlib_installcheck_deep_learning'
+
+    GD['fit_mult'] = madlib_keras_fit_multiple_model.FitMultipleModel(
+        schema_madlib,
+        source_table,
+        'orig_model_out',
+        model_selection_table,
+        1
+    )
+    
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_init_schedule(
+    schedule_table VARCHAR
+) RETURNS BOOLEAN AS
+$$
+    fit_mult = GD['fit_mult']
+    fit_mult.schedule_tbl = schedule_table
+
+    plpy.execute('DROP TABLE IF EXISTS {}'.format(schedule_table))
+    if fit_mult.init_schedule_tbl():
+        err_msg = None
+    else:
+        err_msg = 'FitMultiple.init_schedule_tbl() returned False'
+
+    return err_msg
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+CREATE OR REPLACE FUNCTION test_rotate_schedule(
+    schedule_table          VARCHAR
+) RETURNS VOID AS
+$$
+    fit_mult = GD['fit_mult']
+
+    if fit_mult.schedule_tbl != schedule_table:
+        fit_mult.init_schedule_tbl()
+
+    fit_mult.rotate_schedule_tbl()
+
+$$ LANGUAGE plpythonu VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__',MODIFIES SQL DATA);
+
+-- Mock fit_transition function, for testing
+--  madlib_keras_fit_multiple_model() python code
+CREATE OR REPLACE FUNCTION madlib_installcheck_deep_learning.fit_transition_multiple_model(

Review comment:
       Great point!
   
   In MADlib it gets called as `{self.schema_madlib}.fit_transition_multiple_model()`, so I think it should be okay?  Unless some other tests call it without the schema qualification.  But I agree, leaving this function lying around afterwards seems very risky... I'll add a DROP FUNCTION at the end.
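   
   A minimal sketch of that cleanup (the argument types have to match the mock's signature above exactly):
   
   ```
   -- Drop the mocked transition function so it can't shadow the real one
   -- in any later tests.
   DROP FUNCTION IF EXISTS madlib_installcheck_deep_learning.fit_transition_multiple_model(
       BYTEA, BYTEA, INTEGER[], INTEGER[], TEXT, TEXT, TEXT, INTEGER, INTEGER[],
       INTEGER, INTEGER, INTEGER[], INTEGER[], BYTEA, BOOLEAN, BOOLEAN, BYTEA
   );
   ```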







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r535759055



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)
+
+        DEBUG.plpy.execute(self.rotate_schedule_tbl_plan)
+
+        self.truncate_and_drop(self.schedule_tbl)
+        plpy.execute("""
+            ALTER TABLE {self.next_schedule_tbl}
+            RENAME TO {self.schedule_tbl}
+        """.format(self=self))
 
-    def create_model_output_table_warm_start(self):
+    def load_warm_start_weights(self):
         """
-        For warm start, we need to copy the model output table to a temp table
-        because we call truncate on the model output table while training.
-        If the query gets aborted, we need to make sure that the user passed
-        model output table can be recovered.
+        For warm start, we need to copy any rows of the model output
+        table provided by the user whose mst keys appear in the
+        supplied model selection table.  We also copy over the 
+        compile & fit params from the model_selection_table, and
+        the dist_key's from the schedule table.
         """
-        plpy.execute("""
-            CREATE TABLE {self.model_output_table} (
-            LIKE {self.original_model_output_table} INCLUDING indexes);
-            """.format(self=self))
+        load_warm_start_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    o.{self.model_weights_col},
+                    o.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.original_model_output_tbl} o
+                        USING ({self.mst_key_col})
+        """.format(self=self)
+        DEBUG.plpy.execute(load_warm_start_weights_query)
 
-        plpy.execute("""INSERT INTO {self.model_output_table}
-            SELECT * FROM {self.original_model_output_table};
-            """.format(self=self))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(self.model_info_tbl))
 
-        plpy.execute(""" DELETE FROM {self.model_output_table}
-                WHERE {self.mst_key_col} NOT IN (
-                    SELECT {self.mst_key_col} FROM {self.model_selection_table})
-                """.format(self=self))
-        self.warm_start_msts = plpy.execute(
-            """ SELECT array_agg({0}) AS a FROM {1}
-            """.format(self.mst_key_col, self.model_output_table))[0]['a']
-        plpy.execute("DROP TABLE {0}".format(self.model_info_table))
-        self.initialize_model_output_and_info()
-
-    def initialize_model_output_and_info(self):
+    def load_xfer_learning_weights(self, warm_start=False):
+        """
+            Copy transfer learning weights from
+            model_arch table.  Ignore models with
+            no xfer learning weights, these will
+            be generated by keras and added one at a
+            time later.
+        """
+        load_xfer_learning_weights_query = """
+            INSERT INTO {self.model_output_tbl}
+                SELECT s.{self.mst_key_col},
+                    a.{self.model_weights_col},
+                    a.{self.model_arch_col},
+                    m.{self.compile_params_col},
+                    m.{self.fit_params_col},
+                    NULL AS {self.object_map_col}, -- Fill in later
+                    s.{self.dist_key_col}
+                FROM {self.schedule_tbl} s
+                    JOIN {self.model_selection_table} m
+                        USING ({self.mst_key_col})
+                    JOIN {self.model_arch_table} a
+                        ON m.{self.model_id_col} = a.{self.model_id_col}
+                WHERE a.{self.model_weights_col} IS NOT NULL;
+        """.format(self=self)
+        DEBUG.plpy.execute(load_xfer_learning_weights_query)
+
+    def init_model_output_tbl(self):
+        DEBUG.start_timing('init_model_output_and_info')
+
+        output_table_create_query = """
+                                    CREATE TABLE {self.model_output_tbl}
+                                    ({self.mst_key_col} INTEGER,
+                                     {self.model_weights_col} BYTEA,
+                                     {self.model_arch_col} JSON,
+                                     {self.compile_params_col} TEXT,
+                                     {self.fit_params_col} TEXT,
+                                     {self.object_map_col} BYTEA,
+                                     {self.dist_key_col} INTEGER,
+                                     PRIMARY KEY ({self.dist_key_col}, {self.mst_key_col})
+                                    )
+                                    DISTRIBUTED BY ({self.dist_key_col})
+                                    """.format(self=self)
+        plpy.execute(output_table_create_query)
+
+        if self.warm_start:
+            self.load_warm_start_weights()
+        else:  # Note:  We only support xfer learning when warm_start=False
+            self.load_xfer_learning_weights()
+
+        res = DEBUG.plpy.execute("""
+            SELECT {self.mst_key_col} AS mst_keys FROM {self.model_output_tbl}
+        """.format(self=self))
+       
+        if res:
+            initialized_msts = set([ row['mst_keys'] for row in res ])
+        else:
+            initialized_msts = set()
+
+        DEBUG.plpy.info("Pre-initialized mst keys: {}".format(initialized_msts))

Review comment:
       We should try to reduce the frequency of `DEBUG.plpy.info` lines, since they can be feature-specific, and having a lot of them makes the code slightly harder to read.
   Any developer working on a feature can add their own `DEBUG.plpy.info` calls as needed.
   
   I think we can remove this `DEBUG.plpy.info` statement; we can add it back for debugging as and when needed.







[GitHub] [madlib] reductionista edited a comment on pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista edited a comment on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-732492520


   Thanks for testing it out, Frank!
   
   You're the first to test this with tensorflow 1.13, and based on one of the errors you're seeing, I'm guessing you're also the first to test it with gpdb5.  I only tested it on gpdb6 and postgres.
   
   > WARNING:  This version of tensorflow does not support XLA auto-cluster JIT optimization.  HINT:  upgrading tensorflow may improve performance.  (seg1 slice1 10.128.0.41:40001 pid=6271)
   > CONTEXT:  PL/Python function "fit_transition"
   > ```
   > 
   > What does user need to do to enable XLA? I am on TF 1.13.1 currently.
   
   Nothing, probably just need to upgrade to TF 1.14.  I should add the specific minimum version requirement in the warning message.  I think it was added in 1.14 but I've only used 1.10 and 1.14, so I'll need to confirm that.  Could also be that it's been in there for a while, and they just changed how you access it.  I'll look both of those up, and also run some more comprehensive performance tests to see how much of a difference it actually makes and in what kind of setups.  If it looks like it's not helping much anyway (even though the TF docs say it should) then we can get rid of this message... or maybe even not bother to enable it.  I figure if it does give better performance consistently across models, then we may as well let the user know there's an easy way to get that, without having to tune any parameters or upgrade hardware.  Would also be important to know if there are any cases where it might hurt performance.
    
   > ERROR:  plpy.Error: madlib_keras_fit_multiple_model error: No GPUs configured on hosts. (plpython.c:5038)
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
   >     fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
   >   PL/Python function "madlib_keras_fit_multiple_model", line 147, in __init__
   >   PL/Python function "madlib_keras_fit_multiple_model", line 295, in get_accessible_gpus_for_seg
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > so it looks like `use gpus=NULL` is now defaulting to `TRUE` but it should default to `FALSE` i.e., CPUs. It used to default to CPUs.
   
   Good catch!  For some reason, I got a different error when I tried to reproduce this (about using BOOLEAN::None in a query).  But I think I see the cause of it--fixing.  I thought we had some dev-check tests for this; I'll add one if we don't.
   
   > ```
   > ERROR:  plpy.SPIError: PRIMARY KEY and DISTRIBUTED BY definitions incompatible
   > HINT:  When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
   >     fit_obj.fit_multiple_model()
   >   PL/Python function "madlib_keras_fit_multiple_model", line 241, in fit_multiple_model
   >   PL/Python function "madlib_keras_fit_multiple_model", line 509, in init_model_output_tbl
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > I have also seen this error when running on a single segment.
   
   Interesting!  Looks like this was due to an unnecessarily strict check in the gpdb5 server code, where it required a specific order for the keys in the PRIMARY KEY constraint.  The server team has fixed that in gpdb6:
   https://github.com/greenplum-db/gpdb/commit/e572af54e4255d501d261d51d32e1f3d3cf4b9c9
   
   I can't recall for sure if we decided we care about supporting deep learning in gpdb5 for this release.  But since it's an easy fix, I'll just reverse the order of the keys so that it works on both.
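   
   In other words, the table definition just needs the distribution key listed first in the constraint.  A sketch, with the column list abbreviated for illustration:
   
   ```
   CREATE TABLE model_output_tbl (
       mst_key        INTEGER,
       model_weights  BYTEA,
       __dist_key__   INTEGER,
       -- gpdb5 requires the DISTRIBUTED BY column(s) to be a left-subset of the
       -- PRIMARY KEY, so __dist_key__ comes first in the key.
       PRIMARY KEY (__dist_key__, mst_key)
   )
   DISTRIBUTED BY (__dist_key__);
   ```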





[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r537883975



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)

Review comment:
       Yes; any time the same query is going to be executed many times in a loop, best practice is to prepare the statement once and then execute it repeatedly, rather than re-preparing it on every pass.
   
   That reminds me: I was hoping to do the same for the Hop and UDA queries but never got to it.  I did include a TODO comment next to the UDA.  If I get a chance, I'll update those; it might make more of a difference there.
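
   A minimal sketch of the prepare-once, execute-many pattern being described (table and column names below are placeholders, not the module's actual identifiers):
   ```python
    # Prepare the plan a single time; plpy caches the parsed and planned query.
    plan = plpy.prepare(
        "UPDATE schedule_tbl SET dist_key = $1 WHERE mst_key = $2",
        ["INTEGER", "INTEGER"])

    # Illustrative (dist_key, mst_key) pairs.
    assignments = [(2, 1), (3, 2), (1, 3)]
    for dist_key, mst_key in assignments:
        # Each call only binds fresh parameters; no re-parsing or re-planning.
        plpy.execute(plan, [dist_key, mst_key])
   ```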







[GitHub] [madlib] reductionista commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r558628129



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -520,21 +544,35 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
         and only gets cleared in eval transition at the last row of the last iteration.
 
     """
-    if not independent_var or not dependent_var:
+    if not dependent_var_shape:
+        plpy.error("fit_transition called with no data")
+
+    if not prev_serialized_weights or not model_architecture:
         return state

Review comment:
       When # msts < # segs, any segments not currently training a model will be called with model_weights = model_arch = NULL.  Since those segments have nothing to do, we just return early and wait for the other segments to finish training.
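
   Restating those guards as a sketch with the reasoning attached (same logic as the hunk above, comments added for clarity):
   ```python
    if not dependent_var_shape:
        # Every segment should receive data rows; an empty shape means the
        # query routed no data here, which indicates a bug.
        plpy.error("fit_transition called with no data")

    if not prev_serialized_weights or not model_architecture:
        # With fewer msts than segments, idle segments are handed NULL
        # weights and NULL architecture; there is nothing to train, so
        # return early and wait for the busy segments to finish.
        return state
   ```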







[GitHub] [madlib] reductionista merged pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
reductionista merged pull request #525:
URL: https://github.com/apache/madlib/pull/525


   





[GitHub] [madlib] khannaekta commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
khannaekta commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r533783387



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};
+            """.format(self=self)
+            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_schedule_tbl_query)

Review comment:
       Is storing this prepared plan here mainly to cache the plan used to create the schedule table?

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +377,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):

Review comment:
       The attribute should be `rotate_schedule_tbl_plan` instead of `rotate_schedule_plan`; otherwise this `if` check will always be true and the plan will be re-prepared on every call.
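
   A sketch of the intended memoization once the names match (the query body is a placeholder, and the final execute step is assumed since it isn't visible in this hunk):
   ```python
    def rotate_schedule_tbl(self):
        # Cache the prepared plan under the same name that hasattr() tests,
        # so it really is prepared only once.
        if not hasattr(self, 'rotate_schedule_tbl_plan'):
            self.next_schedule_tbl = unique_string('next_schedule')
            rotate_query = """
                CREATE TABLE {self.next_schedule_tbl} AS
                    SELECT * FROM {self.schedule_tbl}
            """.format(self=self)  # placeholder body; the real query is shown above
            self.rotate_schedule_tbl_plan = plpy.prepare(rotate_query)
        # Reuse of the cached plan is assumed to follow here.
        plpy.execute(self.rotate_schedule_tbl_plan)
   ```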

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})
+        """.format(self=self, mst_key_list=mst_key_list)
+        DEBUG.plpy.execute(create_sched_query)
+
+    def rotate_schedule_tbl(self):
+        if not hasattr(self, 'rotate_schedule_plan'):
+            self.next_schedule_tbl = unique_string('next_schedule')
+            rotate_schedule_tbl_query = """
+                CREATE TABLE {self.next_schedule_tbl} AS
+                    SELECT
+                        {self.mst_key_col},
+                        {self.model_id_col},
+                        {self.dist_key_col} AS {self.prev_dist_key_col},
+                        COALESCE(
+                            LEAD({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col}),
+                            FIRST_VALUE({self.dist_key_col})
+                                OVER(ORDER BY {self.dist_key_col})
+                        ) AS {self.dist_key_col}
+                    FROM {self.schedule_tbl};

Review comment:
       This should also have a `DISTRIBUTED BY` clause, the same as when we initialize the table in `init_schedule_tbl()`.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -337,183 +376,308 @@ class FitMultipleModel():
             local_loss = compile_dict['loss'].lower() if 'loss' in compile_dict else None
             local_metric = compile_dict['metrics'].lower()[2:-2] if 'metrics' in compile_dict else None
             if local_loss and (local_loss not in [a.lower() for a in builtin_losses]):
-                custom_fn_names.append(local_loss)
-                custom_fn_mst_idx.append(mst_idx)
+                custom_fn_names.add(local_loss)
+                custom_msts.append(mst)
             if local_metric and (local_metric not in [a.lower() for a in builtin_metrics]):
-                custom_fn_names.append(local_metric)
-                custom_fn_mst_idx.append(mst_idx)
-
-        if len(custom_fn_names) > 0:
-            # Pass only unique custom_fn_names to query from object table
-            custom_fn_object_map = query_custom_functions_map(self.object_table, list(set(custom_fn_names)))
-            for mst_idx in custom_fn_mst_idx:
-                self.msts[mst_idx][self.object_map_col] = custom_fn_object_map
-
-    def create_mst_schedule_table(self, mst_row):
-        mst_temp_query = """
-                         CREATE {self.unlogged_table} TABLE {self.mst_current_schedule_tbl}
-                                ({self.model_id_col} INTEGER,
-                                 {self.compile_params_col} VARCHAR,
-                                 {self.fit_params_col} VARCHAR,
-                                 {dist_key_col} INTEGER,
-                                 {self.mst_key_col} INTEGER,
-                                 {self.object_map_col} BYTEA)
-                         """.format(dist_key_col=dist_key_col, **locals())
-        plpy.execute(mst_temp_query)
-        for mst, dist_key in zip(mst_row, self.dist_keys):
-            if mst:
-                model_id = mst[self.model_id_col]
-                compile_params = mst[self.compile_params_col]
-                fit_params = mst[self.fit_params_col]
-                mst_key = mst[self.mst_key_col]
-                object_map = mst[self.object_map_col]
-            else:
-                model_id = "NULL"
-                compile_params = "NULL"
-                fit_params = "NULL"
-                mst_key = "NULL"
-                object_map = None
-            mst_insert_query = plpy.prepare(
-                               """
-                               INSERT INTO {self.mst_current_schedule_tbl}
-                                   VALUES ({model_id},
-                                           $madlib${compile_params}$madlib$,
-                                           $madlib${fit_params}$madlib$,
-                                           {dist_key},
-                                           {mst_key},
-                                           $1)
-                                """.format(**locals()), ["BYTEA"])
-            plpy.execute(mst_insert_query, [object_map])
-
-    def create_model_output_table(self):
-        output_table_create_query = """
-                                    CREATE TABLE {self.model_output_table}
-                                    ({self.mst_key_col} INTEGER PRIMARY KEY,
-                                     {self.model_weights_col} BYTEA,
-                                     {self.model_arch_col} JSON)
-                                    """.format(self=self)
-        plpy.execute(output_table_create_query)
-        self.initialize_model_output_and_info()
+                custom_fn_names.add(local_metric)
+                custom_msts.append(mst)
+
+        self.custom_fn_object_map = query_custom_functions_map(self.object_table, custom_fn_names)
+
+        for mst in custom_msts:
+            mst[self.object_map_col] = self.custom_fn_object_map
+
+        self.custom_mst_keys = { mst['mst_key'] for mst in custom_msts }
+
+    def init_schedule_tbl(self):
+        self.prev_dist_key_col = '__prev_dist_key__'
+        mst_key_list = '[' + ','.join(self.all_mst_keys) + ']'
+
+        create_sched_query = """
+            CREATE TABLE {self.schedule_tbl} AS
+                WITH map AS
+                    (SELECT
+                        unnest(ARRAY{mst_key_list}) {self.mst_key_col},
+                        unnest(ARRAY{self.all_dist_keys}) {self.dist_key_col}
+                    )
+                SELECT
+                    map.{self.mst_key_col},
+                    {self.model_id_col},
+                    map.{self.dist_key_col} AS {self.prev_dist_key_col},
+                    map.{self.dist_key_col}
+                FROM map LEFT JOIN {self.model_selection_table}
+                    USING ({self.mst_key_col})
+            DISTRIBUTED BY ({self.dist_key_col})

Review comment:
       Shouldn't we distribute this by `prev_dist_key_col`, since that is what we use as the JOIN key when creating the `model_input_tbl`?

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -518,111 +567,130 @@ def fit_transition(state, dependent_var, independent_var, dependent_var_shape,
     # Fit segment model on data
     #TODO consider not doing this every time
     fit_params = parse_and_validate_fit_params(fit_params)
-    segment_model.fit(x_train, y_train, **fit_params)
+    with K.tf.device(device_name):
+        segment_model.fit(x_train, y_train, **fit_params)
 
     # Aggregating number of images, loss and accuracy
     agg_image_count += len(x_train)
+    SD[SD_STORE.AGG_IMAGE_COUNT] = agg_image_count
     total_images = get_image_count_per_seg_from_array(dist_key_mapping.index(dist_key),
                                                       images_per_seg)
     is_last_row = agg_image_count == total_images
     return_state = get_state_to_return(segment_model, is_last_row, is_multiple_model,
                                        agg_image_count, total_images)
+
     if is_last_row:
+        SD[SD_STORE.AGG_IMAGE_COUNT] = 0  # Must be reset after each pass through images
         if is_final_iteration or is_multiple_model:
             SD_STORE.clear_SD(SD)
             clear_keras_session(sess)
 
+    trans_exit_time = time.time()
+    DEBUG.plpy.info("|_fit_transition_time_|{}|".format(trans_exit_time - trans_enter_time))
+
+    SD[SD_STORE.TRANS_EXIT_TIME] = trans_exit_time
     return return_state
 
-def fit_multiple_transition_caching(state, dependent_var, independent_var, dependent_var_shape,
-                             independent_var_shape, model_architecture,
-                             compile_params, fit_params, dist_key, dist_key_mapping,
-                             current_seg_id, segments_per_host, images_per_seg, use_gpus,
-                             accessible_gpus_for_seg, prev_serialized_weights,
-                             is_final_training_call, custom_function_map=None, **kwargs):
+def fit_multiple_transition_caching(
+    dependent_var, independent_var, dependent_var_shape, independent_var_shape,
+    model_architecture, compile_params, fit_params, dist_key, dist_key_mapping,
+    current_seg_id, segments_per_host, images_per_seg, use_gpus, accessible_gpus_for_seg,
+    serialized_weights, is_final_training_call, custom_function_map=None, **kwargs):
     """
     This transition function is called when caching is called for
     madlib_keras_fit_multiple_model().
-    The input params: dependent_var, independent_var are passed in
-    as None and dependent_var_shape, independent_var_shape as [0]
-    for all hops except the very first hop
+    The input params: dependent_var, independent_var,
+    dependent_var_shape and independent_var_shape are passed
+    in as None for all hops except the very first hop
     Some things to note in this function are:
-    - prev_serialized_weights can be passed in as None for the
+    - weights can be passed in as None for the

Review comment:
       Let's also specify when this is possible:
   ```
    weights can be passed in as None only
    in the case where # msts < # segments
   ```

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -75,8 +79,7 @@ segment.
 Note that this function is disabled for Postgres.
 """
 
-@MinWarning("warning")

Review comment:
       Shouldn't we have `@MinWarning("warning")` here?  There are a couple of NOTICEs that get printed when creating the next schedule table and the model output summary table:
   
   ```
   NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'mst_key' as the Greenplum Database data distribution key for this table.
   HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
   CONTEXT:  SQL statement "
                   CREATE TABLE __madlib_temp_next_schedule210606_1606863675_3469755__ AS
                       SELECT
                           mst_key,
                           model_id,
                           __dist_key__ AS __prev_dist_key__,
                           COALESCE(
                               LEAD(__dist_key__)
                                   OVER(ORDER BY __dist_key__),
                               FIRST_VALUE(__dist_key__)
                                   OVER(ORDER BY __dist_key__)
                           ) AS __dist_key__
                       FROM __madlib_temp_schedule9385959_1606863670_2396302__;
               "
   ```

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -616,8 +780,8 @@ class FitMultipleModel():
             self.update_info_table(mst, True)
             if self.validation_table:
                 self.update_info_table(mst, False)
-
-    def run_training(self, mst_idx, is_very_first_hop):
+    
+    def run_training(self, hop, is_very_first_hop):

Review comment:
       I think it's worth adding a comment explaining the flow in this function, since we refer to it later.
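
   For instance, a short docstring along these lines (wording is only a suggestion, summarizing the flow visible in the hunk below):
   ```python
    def run_training(self, hop, is_very_first_hop):
        """Perform one hop followed by one round of training.

        1. HOP: for hop > 0, build model_input_tbl by joining model_output_tbl
           with schedule_tbl on the previous dist key, so each model lands on
           the segment it is scheduled to train on next; for hop == 0, just
           rename model_output_tbl to model_input_tbl.
        2. TRAIN: run fit_transition_multiple_model() as a UDF over the source
           table joined with model_input_tbl, writing the updated weights into
           a fresh model_output_tbl.
        3. CLEANUP: truncate and drop model_input_tbl to release disk space.
        """
   ```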

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
##########
@@ -629,149 +793,187 @@ class FitMultipleModel():
         # Therefore we want to have queries that do not add motions and all the
         # sub-queries running Keras/tensorflow operations reuse the same slice(process)
         # that was used for initializing GPU memory.
-        use_gpus = self.use_gpus if self.use_gpus else False
-        mst_weights_query = """
-            CREATE {self.unlogged_table} TABLE {self.mst_weights_tbl} AS
-                SELECT mst_tbl.*, wgh_tbl.{self.model_weights_col},
-                       model_arch_tbl.{self.model_arch_col}
-                FROM
-                    {self.mst_current_schedule_tbl} mst_tbl
-                    LEFT JOIN {self.model_output_table} wgh_tbl
-                    ON mst_tbl.{self.mst_key_col} = wgh_tbl.{self.mst_key_col}
-                        LEFT JOIN {self.model_arch_table} model_arch_tbl
-                        ON mst_tbl.{self.model_id_col} = model_arch_tbl.{self.model_id_col}
-                DISTRIBUTED BY ({dist_key_col})
-        """.format(dist_key_col=dist_key_col,
-                   **locals())
-        plpy.execute(mst_weights_query)
-        use_gpus = self.use_gpus if self.use_gpus else False
-        dep_shape_col = self.dep_shape_col
-        ind_shape_col = self.ind_shape_col
+
+        DEBUG.start_timing("run_training")
+        if hop > 0:
+            DEBUG.print_mst_keys(self.model_output_tbl, 'before_hop')
+            DEBUG.start_timing("hop")
+            hop_query = """
+                CREATE {self.unlogged_table} TABLE {self.model_input_tbl} AS
+                    SELECT o.{self.mst_key_col},
+                           o.{self.model_weights_col},
+                           o.{self.model_arch_col},
+                           o.{self.compile_params_col},
+                           o.{self.fit_params_col},
+                           o.{self.object_map_col},
+                           s.{self.dist_key_col}
+                    FROM {self.model_output_tbl} o JOIN {self.schedule_tbl} s
+                        ON o.{self.dist_key_col} = s.{self.prev_dist_key_col}
+                    DISTRIBUTED BY ({self.dist_key_col});
+            """.format(self=self)
+
+            DEBUG.plpy.execute(hop_query)
+
+            DEBUG.print_timing("hop")
+            DEBUG.print_mst_keys(self.model_input_tbl, 'after_hop')
+
+            DEBUG.start_timing("truncate_output")
+            self.truncate_and_drop(self.model_output_tbl)
+            DEBUG.print_timing("truncate_output")
+        else:
+            # Skip hop if it's the first in an iteration, just rename
+            plpy.execute("""
+                ALTER TABLE {self.model_output_tbl}
+                    RENAME TO {self.model_input_tbl}
+            """.format(self=self))
+ 
+        ind_shape = self.ind_shape_col
+        dep_shape = self.dep_shape_col
         dep_var = mb_dep_var_col
         indep_var = mb_indep_var_col
         source_table = self.source_table
-        where_clause = "WHERE {self.mst_weights_tbl}.{self.mst_key_col} IS NOT NULL".format(self=self)
+
         if self.use_caching:
             # Caching populates the independent_var and dependent_var into the cache on the very first hop
             # For the very_first_hop, we want to run the transition function on all segments, including
-            # the one's where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
+            # the ones where the mst_key is NULL (for #mst < #seg), therefore we remove the NOT NULL check
             # on mst_key. Once the cache is populated, with the independent_var and dependent_var values
             # for all subsequent hops pass independent_var and dependent_var as NULL's and use a dummy src
             # table to join for referencing the dist_key
             if is_very_first_hop:
                 plpy.execute("""
                     DROP TABLE IF EXISTS {self.cached_source_table};
-                    CREATE TABLE {self.cached_source_table} AS SELECT {dist_key_col} FROM {self.source_table} GROUP BY {dist_key_col} DISTRIBUTED BY({dist_key_col});
-                    """.format(self=self, dist_key_col=dist_key_col))
+                    CREATE TABLE {self.cached_source_table} AS
+                        SELECT {self.dist_key_col} FROM {self.source_table}
+                            GROUP BY {self.dist_key_col}
+                                DISTRIBUTED BY({self.dist_key_col});
+                    """.format(self=self))
             else:
-                dep_shape_col = 'ARRAY[0]'
-                ind_shape_col = 'ARRAY[0]'
-                dep_var = 'NULL'
-                indep_var = 'NULL'
+                dep_shape = ind_shape = 'NULL'
+                dep_var = indep_var = 'NULL'
                 source_table = self.cached_source_table
-            if is_very_first_hop or self.is_final_training_call:
-                where_clause = ""
-
-        uda_query = """
-            CREATE {self.unlogged_table} TABLE {self.weights_to_update_tbl} AS
-            SELECT {self.schema_madlib}.fit_step_multiple_model({mb_dep_var_col},
-                {mb_indep_var_col},
-                {dep_shape_col},
-                {ind_shape_col},
-                {self.mst_weights_tbl}.{self.model_arch_col}::TEXT,
-                {self.mst_weights_tbl}.{self.compile_params_col}::TEXT,
-                {self.mst_weights_tbl}.{self.fit_params_col}::TEXT,
-                src.{dist_key_col},
-                ARRAY{self.dist_key_mapping},
-                src.{self.gp_segment_id_col},
-                {self.segments_per_host},
-                ARRAY{self.images_per_seg_train},
-                {use_gpus}::BOOLEAN,
-                ARRAY{self.accessible_gpus_for_seg},
-                {self.mst_weights_tbl}.{self.model_weights_col}::BYTEA,
-                {is_final_training_call}::BOOLEAN,
-                {use_caching}::BOOLEAN,
-                {self.mst_weights_tbl}.{self.object_map_col}::BYTEA
-                )::BYTEA AS {self.model_weights_col},
-                {self.mst_weights_tbl}.{self.mst_key_col} AS {self.mst_key_col}
-                ,src.{dist_key_col} AS {dist_key_col}
-            FROM {source_table} src JOIN {self.mst_weights_tbl}
-                USING ({dist_key_col})
-            {where_clause}
-            GROUP BY src.{dist_key_col}, {self.mst_weights_tbl}.{self.mst_key_col}
-            DISTRIBUTED BY({dist_key_col})
-            """.format(mb_dep_var_col=dep_var,
-                       mb_indep_var_col=indep_var,
-                       dep_shape_col=dep_shape_col,
-                       ind_shape_col=ind_shape_col,
-                       is_final_training_call=self.is_final_training_call,
+
+        res = plpy.execute("""
+            SELECT count(*)
+            FROM {self.model_input_tbl}
+        """.format(self=self))
+        if res:
+            DEBUG.plpy.info("rows in model_input table: {}".format(res[0]['count']))
+        else:
+            DEBUG.plpy.error("No rows in model_input table!")
+
+#TODO: prepare this statement once, then just fill in the params with execute()
+#      on all the rest of the hops / iterations
+
+        DEBUG.start_timing("udf")
+        udf_query = plpy.prepare("""
+            CREATE {self.unlogged_table} TABLE {self.model_output_tbl} AS
+            SELECT
+                model_in.{self.mst_key_col},
+                CASE WHEN model_in.{self.dist_key_col} > {self.max_dist_key}
+                THEN
+                    model_in.{self.model_weights_col}
+                ELSE
+                    {self.schema_madlib}.fit_transition_multiple_model(
+                        {dep_var_col},
+                        {indep_var_col},
+                        {dep_shape},
+                        {ind_shape},
+                        model_in.{self.model_arch_col}::TEXT,
+                        model_in.{self.compile_params_col}::TEXT,
+                        model_in.{self.fit_params_col}::TEXT,
+                        src.{self.dist_key_col},
+                        ARRAY{self.dist_key_mapping},
+                        src.{self.gp_segment_id_col},
+                        {self.segments_per_host},
+                        ARRAY{self.images_per_seg_train},
+                        {self.use_gpus}::BOOLEAN,
+                        ARRAY{self.accessible_gpus_for_seg},
+                        model_in.{self.model_weights_col}::BYTEA,
+                        {self.is_final_training_call}::BOOLEAN,
+                        {use_caching}::BOOLEAN,
+                        model_in.{self.object_map_col}::BYTEA
+                    )
+                END::BYTEA AS {self.model_weights_col},
+                model_in.{self.model_arch_col},
+                model_in.{self.compile_params_col},
+                model_in.{self.fit_params_col},
+                model_in.{self.object_map_col},
+                model_in.{self.dist_key_col}
+            FROM {self.model_input_tbl} model_in
+                FULL JOIN {source_table} src
+                USING ({self.dist_key_col}) 
+            DISTRIBUTED BY({self.dist_key_col})
+            """.format(dep_var_col=dep_var,
+                       indep_var_col=indep_var,
+                       dep_shape=dep_shape,
+                       ind_shape=ind_shape,
                        use_caching=self.use_caching,
-                       dist_key_col=dist_key_col,
-                       use_gpus=use_gpus,
                        source_table=source_table,
-                       where_clause=where_clause,
                        self=self
                        )
-        plpy.execute(uda_query)
+        )
+
+        try:
+            plpy.execute(udf_query)
+        except plpy.SPIError as e:
+            msg = e.message
+            if not 'TransAggDetail' in msg:
+                raise e
+            e.message, detail = msg.split('TransAggDetail')
+            # Extract Traceback from segment, add to
+            #  DETAIL of error message on coordinator
+            e.args = (e.message,)
+            spidata = list(e.spidata)
+            spidata[1] = detail
+            e.spidata = tuple(spidata)
+            raise e
+
+        DEBUG.print_timing("udf")
+
+        res = plpy.execute("""
+            SELECT {self.mst_key_col} AS mst_key, {self.model_weights_col} IS NOT NULL AS weights
+                FROM {self.model_output_tbl}
+        """.format(self=self))
+        if res:
+            null_msts = len([None for row in res if row['mst_key'] is None])
+            null_weights = len([None for row in res if row['weights'] is False])
+            DEBUG.plpy.info(
+                "{} rows total ({} mst_key=NULL and {} weights=NULL) in model_output table."\
+                    .format(res.nrows(), null_msts, null_weights))
+        else:
+            plpy.error("No rows in output of UDF!")
 
-        update_query = """
-            UPDATE {self.model_output_table}
-            SET {self.model_weights_col} = {self.weights_to_update_tbl}.{self.model_weights_col}
-            FROM {self.weights_to_update_tbl}
-            WHERE {self.model_output_table}.{self.mst_key_col} = {self.weights_to_update_tbl}.{self.mst_key_col}
-        """.format(self=self)
-        plpy.execute(update_query)
+        plpy.execute("DELETE FROM {self.model_output_tbl} WHERE model_weights IS NULL".format(self=self))
 
-        self.truncate_and_drop_tables()
+        DEBUG.start_timing("truncate_input")
+        self.truncate_and_drop(self.model_input_tbl)
+        DEBUG.print_timing("truncate_input")
+        DEBUG.print_timing("run_training")
 
-    def truncate_and_drop_tables(self):
+    def truncate_and_drop(self, table):
         """
-        Context: UPDATE statements in postgres are not in-place replacements but
-        the row to be updated is marked for deletion(note that the disk space for
-        this row doesn't get released until vaccuum is called) and a new row in
-        inserted.
-
-        This function will clear out the disk space used by the model_output_table
-        and also drop all the other intermediate tables.
-        If available, set the `` guc so that the truncate command can release the
-        disk space. The disk space will be released immediately and hence the
-        model_output table won't grow in size with each UPDATE statement.
+        This function truncates and drops one of the intermediate tables used
+        during an iteration (mostly, model_input_tbl and model_output_tbl).

Review comment:
       Also the schedule table gets truncated and dropped with this function.







[GitHub] [madlib] kaknikhil commented on a change in pull request #525: DL: Model Hopper Refactor

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #525:
URL: https://github.com/apache/madlib/pull/525#discussion_r560525168



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.py_in
##########
@@ -663,17 +717,20 @@ def get_state_to_return(segment_model, is_last_row, is_multiple_model, agg_image
     :param is_last_row: boolean to indicate if last row for that hop
     :param is_multiple_model: boolean
     :param agg_image_count: aggregated image count per hop
-    :param total_images: total images per segment
+    :param total_images: total images per segment (only used for madlib_keras_fit() )
     :return:
     """
-    if is_last_row:
-        updated_model_weights = segment_model.get_weights()
-        if is_multiple_model:
+    if is_multiple_model:
+        if is_last_row:
+            updated_model_weights = segment_model.get_weights()
             new_state = madlib_keras_serializer.serialize_nd_weights(updated_model_weights)
         else:
-            updated_model_weights = [total_images * w for w in updated_model_weights]
-            new_state = madlib_keras_serializer.serialize_state_with_nd_weights(
-                agg_image_count, updated_model_weights)
+            new_state = None
+    elif is_last_row:

Review comment:
       Yeah, I think we should get rid of it for fit as well, but you're right, it's better to do that in a future PR.



