Posted to dev@submarine.apache.org by pi...@apache.org on 2022/08/29 02:59:44 UTC

[submarine] branch master updated: SUBMARINE-1312. Fix submarine-sdk not connecting to the database

This is an automated email from the ASF dual-hosted git repository.

pingsutw pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git


The following commit(s) were added to refs/heads/master by this push:
     new 806fcb21 SUBMARINE-1312. Fix submarine-sdk not connecting to the database
806fcb21 is described below

commit 806fcb2181e5cef5f40f4e57aa941fae53ee5217
Author: cdmikechen <cd...@hotmail.com>
AuthorDate: Sat Aug 27 20:23:38 2022 +0800

    SUBMARINE-1312. Fix submarine-sdk not connecting to the database
    
    ### What is this PR for?
    The Istio sidecar proxy intercepts part of the traffic to the database, so the `submarine-sdk` running inside the pod cannot connect to it.
    The main purpose of this PR is to disable the Istio sidecar for the database pod.
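    
    For context, this is roughly the path through which the training script reaches the database. The following is a minimal sketch reconstructed from the traceback further below: the callback name, the `log_metric` call, and the `fit` invocation are taken from the trace; everything else is an assumption.
    
    ```python
    import tensorflow as tf
    import submarine  # submarine-sdk tracking client
    
    class MyCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # Each log_metric call opens a SQLAlchemy/PyMySQL connection to the
            # metric store. If the istio sidecar intercepts that traffic, this
            # call blocks and the whole training run hangs.
            submarine.log_metric("loss", logs["loss"], epoch)
    
    # Used in train.py (per the traceback) as:
    # multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70,
    #                        callbacks=[MyCallback()])
    ```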
    
    ### What type of PR is it?
    Bug Fix
    
    ### Todos
    * [x] - Add `sidecar.istio.io/inject: "false"` to submarine-database
    * [x] - Reduce the quickstart image size
    
    ### What is the Jira issue?
    https://issues.apache.org/jira/browse/SUBMARINE-1312
    
    ### How should this be tested?
    This PR can be tested with the quickstart image; the log below shows a run of `train.py` inside that image.
    
    ```
    roottest:/opt# python train.py
    OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
    2022-08-27 07:22:54.266967: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
    2022-08-27 07:22:54.267033: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    WARNING:tensorflow:From train.py:61: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
    Instructions for updating:
    use distribute.MultiWorkerMirroredStrategy instead
    From train.py:61: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
    Instructions for updating:
    use distribute.MultiWorkerMirroredStrategy instead
    2022-08-27 07:22:56.427482: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
    2022-08-27 07:22:56.427512: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
    2022-08-27 07:22:56.427560: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (test): /proc/driver/nvidia/version does not exist
    2022-08-27 07:22:56.428097: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO
    Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO
    Generating dataset mnist (~/tensorflow_datasets/mnist/3.0.1)
    Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to ~/tensorflow_datasets/mnist/3.0.1...
    Dl Completed...: 0 url [00:00, ? url/s]          Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-images-idx3-ubytedDnaEPiC58ZczHNOp6ks9L4_JLids_rpvUj38kJNGMc.gz.tmp.b8ba5e1c295746a7947775f54e76fe5b...
    Dl Completed...:   0%|                           Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-labels-idx1-ubyte4Mqf5UL1fRrpd5pIeeAh8c8ZzsY2gbIPBuKwiyfSD_I.gz.tmp.0d6323e9d8684664b864a0d8e0d53cbc...
    Dl Completed...:   0%|                           Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-images-idx3-ubyteJAsxAi0QnOBEygBw_XW2X7zp-LBZAIqqYSHN8ru4ZO4.gz.tmp.63b70936c91b4e1da998240c6c546a8e...
    Dl Completed...:   0%|                           Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-labels-idx1-ubytedcDWkl3FO9T-WMEH1f1Xt51eIRmePRIMAk6X147Qw8w.gz.tmp.98a1eec4eaa249f9a15a57f584149cdc...
    Extraction completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.38s/ file]
    Dl Size...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00,  1.82 MiB/s]
    Dl Completed...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.38s/ url]
    Generating splits...:   0%|                                                                                                                                                                                                                           | 0/2 [00:00<?, ? splits/sDone writing ~/tensorflow_datasets/mnist/3.0.1.incompleteMGWHS0/mnist-train.tfrecord*. Number of examples: 60000 (shards: [60000])
    Generating splits...:  50%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                                         | 1/2 [00:15<00:15, 15.13s/ splitsDone writing ~/tensorflow_datasets/mnist/3.0.1.incompleteMGWHS0/mnist-test.tfrecord*. Number of examples: 10000 (shards: [10000])
    Dataset mnist downloaded and prepared to ~/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
    Constructing tf.data.Dataset mnist for split None, from ~/tensorflow_datasets/mnist/3.0.1
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    conv2d (Conv2D)              (None, 26, 26, 32)        320
    _________________________________________________________________
    max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
    _________________________________________________________________
    conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496
    _________________________________________________________________
    max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0
    _________________________________________________________________
    conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928
    _________________________________________________________________
    flatten (Flatten)            (None, 576)               0
    _________________________________________________________________
    dense (Dense)                (None, 64)                36928
    _________________________________________________________________
    dense_1 (Dense)              (None, 10)                650
    =================================================================
    Total params: 93,322
    Trainable params: 93,322
    Non-trainable params: 0
    _________________________________________________________________
    2022-08-27 07:23:20.046835: W tensorflow/core/framework/dataset.cc:679] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
    2022-08-27 07:23:20.048953: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
    Epoch 1/10
    2022-08-27 07:23:30.542246: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 7111 of 10000
    2022-08-27 07:23:34.450357: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:228] Shuffle buffer filled.
    70/70 [==============================] - 16s 20ms/step - loss: 1.8979 - accuracy: 0.3429
    {'loss': 1.8978824615478516, 'accuracy': 0.34285715222358704}
    ^CTraceback (most recent call last):
      File "train.py", line 88, in <module>
        main()
      File "train.py", line 84, in main
        multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])
      File "/usr/local/lib/python3.7/site-packages/keras/engine/training.py", line 1230, in fit
        callbacks.on_epoch_end(epoch, epoch_logs)
      File "/usr/local/lib/python3.7/site-packages/keras/callbacks.py", line 413, in on_epoch_end
        callback.on_epoch_end(epoch, logs)
      File "train.py", line 81, in on_epoch_end
        submarine.log_metric("loss", logs["loss"], epoch)
      File "/usr/local/lib/python3.7/site-packages/submarine/tracking/fluent.py", line 54, in log_metric
        SubmarineClient().log_metric(job_id, key, value, worker_index, datetime.now(), step or 0)
      File "/usr/local/lib/python3.7/site-packages/submarine/tracking/client.py", line 58, in __init__
        self.store = utils.get_tracking_sqlalchemy_store(self.db_uri)
      File "/usr/local/lib/python3.7/site-packages/submarine/tracking/utils.py", line 93, in get_tracking_sqlalchemy_store
        return SqlAlchemyStore(store_uri)
      File "/usr/local/lib/python3.7/site-packages/submarine/store/tracking/sqlalchemy_store.py", line 58, in __init__
        insp = sqlalchemy.inspect(self.engine)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/inspection.py", line 64, in inspect
        ret = reg(subject)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 182, in _engine_insp
        return Inspector._construct(Inspector._init_engine, bind)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 117, in _construct
        init(self, bind)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 128, in _init_engine
        engine.connect().close()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3315, in connect
        return self._connection_cls(self, close_with_result=close_with_result)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
        else engine.raw_connection()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection
        return self._wrap_pool_connect(self.pool.connect, _connection)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
        return fn()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 310, in connect
        return _ConnectionFairy._checkout(self)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
        fairy = _ConnectionRecord.checkout(pool)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
        rec = pool._do_get()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 146, in _do_get
        self._dec_overflow()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
        with_traceback=exc_tb,
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
        raise exception
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
        return self._create_connection()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 256, in _create_connection
        return _ConnectionRecord(self)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 371, in __init__
        self.__connect()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
        self.dbapi_connection = connection = pool._invoke_creator(self)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/create.py", line 578, in connect
        return dialect.connect(*cargs, **cparams)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 597, in connect
        return self.dbapi.connect(*cargs, **cparams)
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 353, in __init__
        self.connect()
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 632, in connect
        self._get_server_information()
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1055, in _get_server_information
        packet = self._read_packet()
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 692, in _read_packet
        packet_header = self._read_bytes(4)
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 732, in _read_bytes
        data = self._rfile.read(num_bytes)
      File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
        return self._sock.recv_into(b)
    KeyboardInterrupt
    ```
    I opened a quickstart pod on its own and ran `train.py`. When it finally went to write a metric, it hung while connecting to the database (interrupted with `^C` above), which shows that the root cause was the istio-proxy in front of the database.
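    
    A quick way to confirm this kind of hang from inside the pod, independent of the SDK, is a bare PyMySQL connection with connect/read timeouts. This is a minimal sketch only; the host and credentials are placeholders, not the real cluster values.
    
    ```python
    import pymysql
    
    # The traceback shows the SDK blocking while reading the server handshake
    # (_get_server_information -> _read_packet -> recv_into). Timeouts turn
    # that silent hang into an explicit error we can observe.
    conn = pymysql.connect(
        host="submarine-database",  # placeholder: in-cluster MySQL service
        user="submarine",           # placeholder credentials
        password="password",
        connect_timeout=5,          # bounds the TCP connect
        read_timeout=5,             # bounds the handshake/packet reads
    )
    print(conn.get_server_info())
    conn.close()
    ```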
    
    ### Screenshots (if appropriate)
    No
    
    ### Questions:
    * Do the license files need updating? No
    * Are there breaking changes for older versions? No
    * Does this need new documentation? No
    
    Author: cdmikechen <cd...@hotmail.com>
    
    Signed-off-by: Kevin <pi...@apache.org>
    
    Closes #991 from cdmikechen/SUBMARINE-1312 and squashes the following commits:
    
    82ec839d [cdmikechen] Remove istio sidecar and reduce image size
---
 dev-support/examples/quickstart/Dockerfile                     | 6 +++---
 submarine-cloud-v2/artifacts/submarine/submarine-database.yaml | 2 ++
 submarine-cloud-v3/artifacts/submarine-database.yaml           | 2 ++
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/dev-support/examples/quickstart/Dockerfile b/dev-support/examples/quickstart/Dockerfile
index a5a35a00..ad6db3d2 100644
--- a/dev-support/examples/quickstart/Dockerfile
+++ b/dev-support/examples/quickstart/Dockerfile
@@ -17,8 +17,8 @@ FROM python:3.7
 MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
 
 ADD ./tmp/submarine-sdk /opt/
-# install submarine-sdk locally
-RUN pip install /opt/pysubmarine/.[tf2]
-RUN pip install tensorflow_datasets packaging
+# install submarine-sdk locally with no cache
+RUN pip install --no-cache-dir /opt/pysubmarine/.[tf2]
+RUN pip install --no-cache-dir tensorflow_datasets packaging
 
 ADD ./train.py /opt/
diff --git a/submarine-cloud-v2/artifacts/submarine/submarine-database.yaml b/submarine-cloud-v2/artifacts/submarine/submarine-database.yaml
index 023e4e6d..f91a8194 100644
--- a/submarine-cloud-v2/artifacts/submarine/submarine-database.yaml
+++ b/submarine-cloud-v2/artifacts/submarine/submarine-database.yaml
@@ -59,6 +59,8 @@ spec:
     metadata:
       labels:
         app: "submarine-database"
+      annotations:
+        sidecar.istio.io/inject: "false"
     spec:
       serviceAccountName: "submarine-storage"
       containers:
diff --git a/submarine-cloud-v3/artifacts/submarine-database.yaml b/submarine-cloud-v3/artifacts/submarine-database.yaml
index 9800a2d4..55d016bb 100644
--- a/submarine-cloud-v3/artifacts/submarine-database.yaml
+++ b/submarine-cloud-v3/artifacts/submarine-database.yaml
@@ -56,6 +56,8 @@ spec:
     metadata:
       labels:
         app: "submarine-database"
+      annotations:
+        sidecar.istio.io/inject: "false"
     spec:
       serviceAccountName: "submarine-storage"
       containers:

