Posted to commits@submarine.apache.org by GitBox <gi...@apache.org> on 2021/08/06 08:12:46 UTC

[GitHub] [submarine] featherchen opened a new pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

featherchen opened a new pull request #700:
URL: https://github.com/apache/submarine/pull/700


   ### What is this PR for?
   Add a TensorFlow distributed training example that uses the Submarine SDK.
   ### What type of PR is it?
   Improvement 
   
   ### Todos
   * [ ] - None
   
   ### What is the Jira issue?
   https://issues.apache.org/jira/browse/SUBMARINE-901?filter=-1
   ### How should this be tested?
   ### Screenshots (if appropriate)
   ![Screenshot from 2021-08-06 15-16-45](https://user-images.githubusercontent.com/57944334/128478746-7e8510c0-1aeb-492a-98a5-d9f699347b38.png)
   ![Screenshot from 2021-08-06 15-16-33](https://user-images.githubusercontent.com/57944334/128478783-db64d01b-5bb3-4e97-9106-881bd7f0a9ae.png)
   ![Screenshot from 2021-08-06 15-16-23](https://user-images.githubusercontent.com/57944334/128478817-f9d9d5a8-084d-4054-899b-90e818d6cfed.png)
   
   ### Questions:
   * Do the license files need updating? No
   * Are there breaking changes for older versions? No
   * Does this need new documentation? No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@submarine.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [submarine] asfgit closed pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #700:
URL: https://github.com/apache/submarine/pull/700


   





[GitHub] [submarine] featherchen commented on a change in pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
featherchen commented on a change in pull request #700:
URL: https://github.com/apache/submarine/pull/700#discussion_r694225018



##########
File path: dev-support/examples/dp-tf/readme.md
##########
@@ -0,0 +1,23 @@
+# TF Distributed Example
+
+## Usage
+
+This is a simple example of how to track metrics and parameters with submarine-sdk. Basically, the SDK detects which experiment and worker ID you are running in, and logs your data to the corresponding place.

Review comment:
       @ByronHsu  Thanks, I have updated the README, please check.







[GitHub] [submarine] featherchen commented on a change in pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
featherchen commented on a change in pull request #700:
URL: https://github.com/apache/submarine/pull/700#discussion_r696732784



##########
File path: dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py
##########
@@ -0,0 +1,92 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+import os
+import random
+import tensorflow as tf
+import json
+from tensorflow.keras.layers.experimental import preprocessing
+import tensorflow_datasets as tfds
+import tensorboard
+
+print(tf.__version__)
+
+TF_CONFIG = os.environ.get('TF_CONFIG', '')
+NUM_PS = len(json.loads(TF_CONFIG)['cluster']['ps'])
+cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
+
+variable_partitioner = (
+    tf.distribute.experimental.partitioners.MinSizePartitioner(
+        min_shard_bytes=(256 << 10),
+        max_shards=NUM_PS))
+
+strategy = tf.distribute.experimental.ParameterServerStrategy(
+    cluster_resolver,
+    variable_partitioner=variable_partitioner)
+
+def dataset_fn(input_context):
+  global_batch_size = 64
+  batch_size = input_context.get_per_replica_batch_size(global_batch_size)
+
+  x = tf.random.uniform((10, 10))
+  y = tf.random.uniform((10,))
+
+  dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(10).repeat()
+  dataset = dataset.shard(
+      input_context.num_input_pipelines,
+      input_context.input_pipeline_id)
+  dataset = dataset.batch(batch_size)
+  dataset = dataset.prefetch(2)
+
+  return dataset
+
+dc = tf.keras.utils.experimental.DatasetCreator(dataset_fn)
+
+with strategy.scope():
+  model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
+
+model.compile(tf.keras.optimizers.SGD(), loss='mse', steps_per_execution=10)
+
+working_dir = '/tmp/my_working_dir'
+log_dir = os.path.join(working_dir, 'log')
+ckpt_filepath = os.path.join(working_dir, 'ckpt')
+backup_dir = os.path.join(working_dir, 'backup')
+
+callbacks = [
+    tf.keras.callbacks.TensorBoard(log_dir=log_dir),
+    tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_filepath),
+    tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=backup_dir),
+]
+
+model.fit(dc, epochs=5, steps_per_epoch=20, callbacks=callbacks)
+if __name__ == "__main__":
+  modelClient = ModelsClient()
+  with modelClient.start() as run:
+    EPOCHS = 5
+    hist = model.fit(dc, epochs=EPOCHS, steps_per_epoch=20, callbacks=callbacks)
+    for i in range(EPOCHS):
+      modelClient.log_metric("val_loss", hist.history['loss'][i])
+      modelClient.log_metric("Val_accuracy", hist.history['accuracy'][i])
+    model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
+    # eval_loss, eval_acc = model.evaluate(eval_dataset)
+    # print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
+    # modelClient.log_param("loss", eval_loss)
+    # modelClient.log_param("acc", eval_acc)

Review comment:
       Yes, I will delete them. But the PS strategy still gets stuck in the initial stage.
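
One possible explanation for the hang, offered as an assumption and not confirmed anywhere in this thread: with `ParameterServerStrategy` in TensorFlow 2.x, only the coordinator (the `chief` task) is meant to run the training script, while `worker` and `ps` tasks start a `tf.distribute.Server` and block on it. The sketch below uses only the standard library to show that dispatch on `TF_CONFIG`. The function name `resolve_role` and the host names are hypothetical and do not come from the PR.

```python
import json


def resolve_role(tf_config_json):
    """Return (role, num_ps) for the task described by a TF_CONFIG string.

    role is "coordinator" for a chief task and "server" for worker/ps tasks.
    An empty or missing TF_CONFIG is treated as a single-process local run,
    which avoids calling json.loads on an empty string.
    """
    if not tf_config_json:
        return "local", 0
    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    task_type = config.get("task", {}).get("type", "chief")
    num_ps = len(cluster.get("ps", []))
    role = "server" if task_type in ("worker", "ps") else "coordinator"
    return role, num_ps


# Hypothetical cluster layout; the host names are placeholders.
example = json.dumps({
    "cluster": {
        "chief": ["chief-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "ps", "index": 0},
})
print(resolve_role(example))
```

Under this assumption, a task whose role is `server` would start a `tf.distribute.Server` and call `join()` on it instead of reaching `model.fit`. The guard on an empty string also matters for the quoted diff, where `json.loads(TF_CONFIG)` raises if `TF_CONFIG` is unset.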







[GitHub] [submarine] ByronHsu commented on a change in pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
ByronHsu commented on a change in pull request #700:
URL: https://github.com/apache/submarine/pull/700#discussion_r684558359



##########
File path: dev-support/examples/dp-tf/readme.md
##########
@@ -0,0 +1,23 @@
+# TF Distributed Example
+
+## Usage
+
+This is a simple example of how to track metrics and parameters with submarine-sdk. Basically, the SDK detects which experiment and worker ID you are running in, and logs your data to the corresponding place.

Review comment:
       This should explain what the TF code does.

##########
File path: dev-support/examples/dp-tf/readme.md
##########
@@ -0,0 +1,23 @@
+# TF Distributed Example

Review comment:
       Why is the folder named `dp-tf`?

##########
File path: dev-support/examples/nn-pytorch/model.py
##########
@@ -32,4 +32,4 @@ def forward(self, x):
 if __name__ == "__main__":
     client = ModelsClient()
     net = LinearNNModel()
-    client.log_model("simple-nn-model", net)

Review comment:
       Should the comments be removed?

##########
File path: dev-support/examples/dp-tf/distribution.py
##########
@@ -0,0 +1,109 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+
+from submarine import ModelsClient

Review comment:
       I think the file name `distribution.py` is confusing. It could be changed to `dis-tf.py` or something similar.







[GitHub] [submarine] pingsutw commented on a change in pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
pingsutw commented on a change in pull request #700:
URL: https://github.com/apache/submarine/pull/700#discussion_r684956891



##########
File path: dev-support/examples/dp-tf/build.sh
##########
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -euxo pipefail
+
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="distribution:${SUBMARINE_VERSION}"

Review comment:
       ```suggestion
   SUBMARINE_IMAGE_NAME="apache/submarine:distribution-${SUBMARINE_VERSION}"
   ```
   You can update this [file](https://github.com/apache/submarine/blob/master/.github/workflows/deploy_docker_images.yml); GitHub Actions will then publish this image to the Apache Docker Hub.

##########
File path: dev-support/examples/dp-tf/distribution.py
##########
@@ -0,0 +1,109 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+
+from submarine import ModelsClient
+import tensorflow_datasets as tfds
+import tensorflow as tf
+import os
+import tensorboard
+
+print(tf.__version__)
+
+datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+mnist_train, mnist_test = datasets['train'], datasets['test']
+
+strategy = tf.distribute.MirroredStrategy()

Review comment:
       I think we should use a different distribution strategy.
   > tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine.
   https://www.tensorflow.org/guide/distributed_training#TF_CONFIG
   So if you launch 4 workers, those workers will not communicate with each other. 
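
To make the point concrete: multi-worker strategies such as `tf.distribute.MultiWorkerMirroredStrategy` learn about their peers from the `TF_CONFIG` environment variable, which carries a `cluster` map and a `task` entry for the current process. The sketch below builds that value for four workers using only the standard library. The helper name `make_tf_config` and the host names are hypothetical; in practice the job launcher would normally set `TF_CONFIG` for each pod.

```python
import json


def make_tf_config(worker_hosts, index):
    """Build the TF_CONFIG value for worker `index` in a worker-only cluster."""
    if not 0 <= index < len(worker_hosts):
        raise ValueError("index must refer to one of the worker hosts")
    return json.dumps({
        # Every worker sees the same cluster description...
        "cluster": {"worker": list(worker_hosts)},
        # ...and differs only in its own task index.
        "task": {"type": "worker", "index": index},
    })


# Placeholder host names for a four-worker job.
hosts = ["worker-0:2222", "worker-1:2222", "worker-2:2222", "worker-3:2222"]
configs = [make_tf_config(hosts, i) for i in range(len(hosts))]
for config in configs:
    print(config)
```

`MirroredStrategy` does not use this cluster description, which is why four workers launched with it would each train independently.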
   

##########
File path: dev-support/examples/dp-tf/distribution.py
##########
@@ -0,0 +1,109 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+
+from submarine import ModelsClient

Review comment:
       The name `mnist_keras_distributed.py` would be better, and the directory name could change to `mnist-tensorflow`. Then we could later add `mnist_estimator_distributed.py`, which uses the `estimator` API, or `mnist_distributed.py`, which uses the native TensorFlow API.







[GitHub] [submarine] featherchen commented on a change in pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
featherchen commented on a change in pull request #700:
URL: https://github.com/apache/submarine/pull/700#discussion_r694223674



##########
File path: dev-support/examples/dp-tf/distribution.py
##########
@@ -0,0 +1,109 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+
+from submarine import ModelsClient
+import tensorflow_datasets as tfds
+import tensorflow as tf
+import os
+import tensorboard
+
+print(tf.__version__)
+
+datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+mnist_train, mnist_test = datasets['train'], datasets['test']
+
+strategy = tf.distribute.MirroredStrategy()

Review comment:
       @pingsutw  Thanks for the review. I have made some updates. Could you give me some advice on the parameter server and HDFS? I still can't find a way to put them together.
   







[GitHub] [submarine] featherchen commented on a change in pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
featherchen commented on a change in pull request #700:
URL: https://github.com/apache/submarine/pull/700#discussion_r696731283



##########
File path: dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py
##########
@@ -0,0 +1,98 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+from submarine import ModelsClient
+import json
+import os
+import sys
+import tensorflow as tf
+import numpy as np
+import tensorflow_datasets as tfds
+
+BUFFER_SIZE = 10000
+BATCH_SIZE = 32
+
+strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
+
+def make_datasets_unbatched():
+  #Scaling MNIST data from (0, 255] to (0., 1.]
+  def scale(image, label):
+    image = tf.cast(image, tf.float32)
+    image /= 255
+    return image, label
+
+  datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+  return datasets['train'].map(scale, num_parallel_calls=tf.data.experimental.AUTOTUNE).cache().shuffle(BUFFER_SIZE)
+
+def build_and_compile_cnn_model():
+  model = tf.keras.Sequential([
+      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+      tf.keras.layers.MaxPooling2D(),
+      tf.keras.layers.Flatten(),
+      tf.keras.layers.Dense(64, activation='relu'),
+      tf.keras.layers.Dense(10)
+  ])
+  model.compile(
+      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
+      metrics=['accuracy'])
+  return model
+  
+#single_worker_model = build_and_compile_cnn_model()
+#single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)

Review comment:
       Yes, thanks for the reminder.







[GitHub] [submarine] jeff-901 commented on pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
jeff-901 commented on pull request #700:
URL: https://github.com/apache/submarine/pull/700#issuecomment-894210859


   @featherchen Why are four runs (worker-0) in the first picture not working? And why does the model load weights from checkpoint_dir? I can't find where you save the checkpoint.
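
For context on the checkpoint question: the ParameterServerStrategy diff quoted elsewhere in this thread defines `ckpt_filepath` for the `ModelCheckpoint` callback but then calls `tf.train.latest_checkpoint(checkpoint_dir)` with a name that is never defined. `tf.train.latest_checkpoint` expects the directory that holds the checkpoint files, so one way to keep the two in step is to derive both from a single base. This is a standard-library sketch only; the helper `checkpoint_paths` and the `weights-{epoch:02d}` prefix are hypothetical, not taken from the PR.

```python
import posixpath


def checkpoint_paths(working_dir):
    """Derive the checkpoint directory and file prefix from one base path."""
    checkpoint_dir = posixpath.join(working_dir, "ckpt")
    return {
        # Directory to pass when looking up the latest checkpoint.
        "checkpoint_dir": checkpoint_dir,
        # Per-epoch prefix to pass as the callback's `filepath`.
        "ckpt_filepath": posixpath.join(checkpoint_dir, "weights-{epoch:02d}"),
        "log_dir": posixpath.join(working_dir, "log"),
        "backup_dir": posixpath.join(working_dir, "backup"),
    }


paths = checkpoint_paths("/tmp/my_working_dir")
print(paths["checkpoint_dir"])
print(paths["ckpt_filepath"].format(epoch=3))
```

Whether the saved files can then be reloaded with `load_weights` also depends on how the callback is configured to save, which this sketch does not address.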





[GitHub] [submarine] featherchen commented on pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
featherchen commented on pull request #700:
URL: https://github.com/apache/submarine/pull/700#issuecomment-894116086


   @ByronHsu  @jeff-901 Please review this PR for me.





[GitHub] [submarine] pingsutw commented on a change in pull request #700: SUBMARINE-901. Tensorflow distributed training example with model management API

Posted by GitBox <gi...@apache.org>.
pingsutw commented on a change in pull request #700:
URL: https://github.com/apache/submarine/pull/700#discussion_r695522589



##########
File path: dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py
##########
@@ -0,0 +1,92 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+import os
+import random
+import tensorflow as tf
+import json
+from tensorflow.keras.layers.experimental import preprocessing
+import tensorflow_datasets as tfds
+import tensorboard
+
+print(tf.__version__)
+
+TF_CONFIG = os.environ.get('TF_CONFIG', '')
+NUM_PS = len(json.loads(TF_CONFIG)['cluster']['ps'])
+cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
+
+variable_partitioner = (
+    tf.distribute.experimental.partitioners.MinSizePartitioner(
+        min_shard_bytes=(256 << 10),
+        max_shards=NUM_PS))
+
+strategy = tf.distribute.experimental.ParameterServerStrategy(
+    cluster_resolver,
+    variable_partitioner=variable_partitioner)
+
+def dataset_fn(input_context):
+  global_batch_size = 64
+  batch_size = input_context.get_per_replica_batch_size(global_batch_size)
+
+  x = tf.random.uniform((10, 10))
+  y = tf.random.uniform((10,))
+
+  dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(10).repeat()
+  dataset = dataset.shard(
+      input_context.num_input_pipelines,
+      input_context.input_pipeline_id)
+  dataset = dataset.batch(batch_size)
+  dataset = dataset.prefetch(2)
+
+  return dataset
+
+dc = tf.keras.utils.experimental.DatasetCreator(dataset_fn)
+
+with strategy.scope():
+  model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
+
+model.compile(tf.keras.optimizers.SGD(), loss='mse', steps_per_execution=10)
+
+working_dir = '/tmp/my_working_dir'
+log_dir = os.path.join(working_dir, 'log')
+ckpt_filepath = os.path.join(working_dir, 'ckpt')
+backup_dir = os.path.join(working_dir, 'backup')
+
+callbacks = [
+    tf.keras.callbacks.TensorBoard(log_dir=log_dir),
+    tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_filepath),
+    tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=backup_dir),
+]
+
+model.fit(dc, epochs=5, steps_per_epoch=20, callbacks=callbacks)
+if __name__ == "__main__":
+  modelClient = ModelsClient()
+  with modelClient.start() as run:
+    EPOCHS = 5
+    hist = model.fit(dc, epochs=EPOCHS, steps_per_epoch=20, callbacks=callbacks)
+    for i in range(EPOCHS):
+      modelClient.log_metric("val_loss", hist.history['loss'][i])
+      modelClient.log_metric("Val_accuracy", hist.history['accuracy'][i])
+    model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
+    # eval_loss, eval_acc = model.evaluate(eval_dataset)
+    # print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
+    # modelClient.log_param("loss", eval_loss)
+    # modelClient.log_param("acc", eval_acc)

Review comment:
       Could we remove these lines?

##########
File path: dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py
##########
@@ -0,0 +1,98 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+from submarine import ModelsClient
+import json
+import os
+import sys
+import tensorflow as tf
+import numpy as np
+import tensorflow_datasets as tfds
+
+BUFFER_SIZE = 10000
+BATCH_SIZE = 32
+
+strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
+
+def make_datasets_unbatched():
+  #Scaling MNIST data from (0, 255] to (0., 1.]
+  def scale(image, label):
+    image = tf.cast(image, tf.float32)
+    image /= 255
+    return image, label
+
+  datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+  return datasets['train'].map(scale, num_parallel_calls=tf.data.experimental.AUTOTUNE).cache().shuffle(BUFFER_SIZE)
+
+def build_and_compile_cnn_model():
+  model = tf.keras.Sequential([
+      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+      tf.keras.layers.MaxPooling2D(),
+      tf.keras.layers.Flatten(),
+      tf.keras.layers.Dense(64, activation='relu'),
+      tf.keras.layers.Dense(10)
+  ])
+  model.compile(
+      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
+      metrics=['accuracy'])
+  return model
+  
+#single_worker_model = build_and_compile_cnn_model()
+#single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)

Review comment:
       Could we remove these lines?



