Posted to commits@sdap.apache.org by ea...@apache.org on 2020/06/25 23:10:35 UTC

[incubator-sdap-ingester] branch dev updated: SDAP-237 Dockerize Collection Manager (#4)

This is an automated email from the ASF dual-hosted git repository.

eamonford pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/incubator-sdap-ingester.git


The following commit(s) were added to refs/heads/dev by this push:
     new 12d9101  SDAP-237 Dockerize Collection Manager (#4)
12d9101 is described below

commit 12d9101240eab44c2f2ef3d3127734296bd85129
Author: Eamon Ford <ea...@gmail.com>
AuthorDate: Thu Jun 25 16:10:27 2020 -0700

    SDAP-237 Dockerize Collection Manager (#4)
    
    Co-authored-by: Eamon Ford <ea...@jpl.nasa.gov>
---
 collection_manager/README.md                       | 132 ++++++++-------------
 collection_manager/collection_manager/main.py      |  33 +++---
 collection_manager/containers/docker/Dockerfile    |  11 --
 .../containers/kubernetes/data-volume.yml          |  35 ------
 collection_manager/containers/kubernetes/job.yml   |  39 ------
 .../containers/kubernetes/sdap_ingester_config.yml |  38 ------
 collection_manager/docker/Dockerfile               |  16 +++
 collection_manager/docker/entrypoint.sh            |  10 ++
 collection_manager/setup.py                        |   8 +-
 granule_ingester/granule_ingester/main.py          |   3 +-
 10 files changed, 100 insertions(+), 225 deletions(-)

diff --git a/collection_manager/README.md b/collection_manager/README.md
index cbaf1fb..9d00cbb 100644
--- a/collection_manager/README.md
+++ b/collection_manager/README.md
@@ -1,103 +1,75 @@
-# SDAP manager for ingestion of datasets
+# SDAP Collection Manager
 
-## Prerequisites
-
-### python 3
-
-Install anaconda for python 3. From the graphic install for example for macos:
-
-https://www.anaconda.com/distribution/#macos
-
-### git lfs (for development)
-
-Git lfs for the deployment from git, see https://git-lfs.github.com/
-
-If not available you have to get netcdf files for test, if you do need the tests.
-
-### Deployed nexus on kubernetes cluster
+The SDAP Collection Manager is a service that watches a YAML file (the [Collections
+Configuration](#the-collections-configuration-file) file) stored on the filesystem, and all the directories listed in that
+file. Whenever new granules are added to any of the watched directories, the Collection
+Manager service will publish a message to RabbitMQ to be picked up by the Granule Ingester
+(`/granule_ingester` in this repository), which will then ingest the new granules.
 
-See project https://github.com/apache/incubator-sdap-nexus
 
-    $ helm install nexus .  --namespace=sdap --dependency-update -f ~/overridden-nexus-values.yml 
-
-For development purpose, you might want to expose solr port outside kubernetes
-
-   kubectl port-forward solr-set-0 8983:8983 -n sdap 
+## Prerequisites
 
- 
-## For developers
+Python 3.7
 
-### deploy project
+## Building the service
+From `incubator-sdap-ingester/collection_manager`, run:
 
-    $ bash
-    $ git clone ...
-    $ cd sdap_ingest_manager
-    $ python -m venv venv
-    $ source ./venv/bin/activate
-    $ pip install .
-    $ pytest -s
+    $ python setup.py install
     
-Note the command pip install -e . does not work as it does not deploy the configuration files.
-
-### Update the project
-
-Update the code and the test with your favorite IDE (e.g. pyCharm).
-
-### Launch for development/tests
-
-### Prerequisite
 
-Deploy a local rabbitmq service, for example with docker.
+## Running the service
+From `incubator-sdap-ingester/collection_manager`, run:
 
-    docker run -d --hostname localhost -p 5672:5672 --name rabbitmq rabbitmq:3
-   
-   
-### Launch the service
+    $ python collection_manager/main.py -h
+    
+### The Collections Configuration File
 
+A path to a collections configuration file must be passed in to the Collection Manager
+at startup via the `--collections-path` parameter. Below is an example of what the 
+collections configuration file should look like:
 
-The service reads the collection configuration and submit granule ingestion messages to the message broker (rabbitmq).
-For each collection, 2 ingestion priority levels are proposed: the nominal priority, the priority for forward processing (newer files), usually higher. 
-An history of the ingested granules is managed so that the ingestion can stop and re-start anytime.
+```yaml
+# collections.yaml
 
-    cd collection_manager
-    python main.py -h
-    python main.py --collections ../tests/resources/data/collections.yml --history-path=/tmp
+collections:
 
-# Containerization
+    # The identifier for the dataset as it will appear in NEXUS.
+  - id: TELLUS_GRACE_MASCON_CRI_GRID_RL05_V2_LAND 
 
-TO BE UPDATED
+    # The local path to watch for NetCDF granule files to be associated with this dataset. 
+    # Supports glob-style patterns.
+    path: /opt/data/grace/*land*.nc 
 
-## Docker
+    # The name of the NetCDF variable to read when ingesting granules into NEXUS for this dataset.
+    variable: lwe_thickness 
 
-    docker build . -f containers/docker/config-operator/Dockerfile --no-cache --tag tloubrieu/sdap-ingest-manager:latest
-        
-To publish the docker image on dockerhub do (step necessary for kubernetes deployment):
+    # An integer priority level to use when publishing messages to RabbitMQ for historical data. 
+    # Higher number = higher priority.
+    priority: 1 
 
-    docker login
-    docker push tloubrieu/sdap-ingest-manager:latest
-    
-## Kubernetes
-    
-### Launch the service
+    # An integer priority level to use when publishing messages to RabbitMQ for forward-processing data.
+    # Higher number = higher priority.
+    forward-processing-priority: 5 
 
-    kubectl apply -f containers/kubernetes/job.yml -n sdap
-    
-Delete the service: 
+  - id: TELLUS_GRACE_MASCON_CRI_GRID_RL05_V2_OCEAN
+    path: /opt/data/grace/*ocean*.nc
+    variable: lwe_thickness
+    priority: 2
+    forward-processing-priority: 6
 
-    kubectl delete jobs --all -n sdap
-    
-    
+  - id: AVHRR_OI-NCEI-L4-GLOB-v2.0
+    path: /opt/data/avhrr/*.nc
+    variable: analysed_sst
+    priority: 1
 
-    
+```
+## Running the tests
+From `incubator-sdap-ingester/collection_manager`, run:
 
+    $ pip install pytest
+    $ pytest
     
-    
-    
- 
-    
-    
-
-
-
-
+## Building the Docker image
+From `incubator-sdap-ingester/collection_manager`, run:
 
+    $ docker build . -f docker/Dockerfile -t nexusjpl/collection-manager
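
For reference, a complete invocation combining the renamed flags introduced in this commit might look like the sketch below; the collections file and history directory paths are illustrative assumptions, not part of the commit:

    $ python collection_manager/main.py \
        --collections-path /opt/sdap/collections.yaml \
        --history-path /opt/sdap/history \
        --rabbitmq-host localhost \
        --rabbitmq-queue nexus
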
diff --git a/collection_manager/collection_manager/main.py b/collection_manager/collection_manager/main.py
index bc2d356..d8d2a5a 100644
--- a/collection_manager/collection_manager/main.py
+++ b/collection_manager/collection_manager/main.py
@@ -8,7 +8,7 @@ from collection_manager.services.history_manager import SolrIngestionHistoryBuil
 
 logging.basicConfig(level=logging.INFO)
 logging.getLogger("pika").setLevel(logging.WARNING)
-logger = logging.getLogger("collection_manager")
+logger = logging.getLogger(__name__)
 
 
 def check_path(path) -> str:
@@ -18,34 +18,35 @@ def check_path(path) -> str:
 
 
 def get_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Run ingestion for a list of collection ingestion streams")
-    parser.add_argument("--refresh",
-                        help="refresh interval in seconds to check for new or updated granules",
-                        default=300)
-    parser.add_argument("--collections",
+    parser = argparse.ArgumentParser(description="Watch the filesystem for new granules, and publish messages to "
+                                                 "RabbitMQ whenever they become available.")
+    parser.add_argument("--collections-path",
                         help="Absolute path to collections configuration file",
+                        metavar="PATH",
                         required=True)
-    parser.add_argument('--rabbitmq_host',
+    history_group = parser.add_mutually_exclusive_group(required=True)
+    history_group.add_argument("--history-path",
+                               metavar="PATH",
+                               help="Absolute path to ingestion history local directory")
+    history_group.add_argument("--history-url",
+                               metavar="URL",
+                               help="URL to ingestion history solr database")
+    parser.add_argument('--rabbitmq-host',
                         default='localhost',
                         metavar='HOST',
                         help='RabbitMQ hostname to connect to. (Default: "localhost")')
-    parser.add_argument('--rabbitmq_username',
+    parser.add_argument('--rabbitmq-username',
                         default='guest',
                         metavar='USERNAME',
                         help='RabbitMQ username. (Default: "guest")')
-    parser.add_argument('--rabbitmq_password',
+    parser.add_argument('--rabbitmq-password',
                         default='guest',
                         metavar='PASSWORD',
                         help='RabbitMQ password. (Default: "guest")')
-    parser.add_argument('--rabbitmq_queue',
+    parser.add_argument('--rabbitmq-queue',
                         default="nexus",
                         metavar="QUEUE",
                        help='Name of the RabbitMQ queue to publish to. (Default: "nexus")')
-    history_group = parser.add_mutually_exclusive_group(required=True)
-    history_group.add_argument("--history-path",
-                               help="Absolute path to ingestion history local directory")
-    history_group.add_argument("--history-url",
-                               help="URL to ingestion history solr database")
 
     return parser.parse_args()
 
@@ -65,7 +66,7 @@ def main():
         publisher.connect()
         collection_processor = CollectionProcessor(message_publisher=publisher,
                                                    history_manager_builder=history_manager_builder)
-        collection_watcher = CollectionWatcher(collections_path=options.collections,
+        collection_watcher = CollectionWatcher(collections_path=options.collections_path,
                                                collection_updated_callback=collection_processor.process_collection,
                                                granule_updated_callback=collection_processor.process_granule)
 
diff --git a/collection_manager/containers/docker/Dockerfile b/collection_manager/containers/docker/Dockerfile
deleted file mode 100644
index 3ba8da7..0000000
--- a/collection_manager/containers/docker/Dockerfile
+++ /dev/null
@@ -1,11 +0,0 @@
-FROM python:3
-
-# Add kubernetes client to create other pods (ingester)
-RUN apt-get update && apt-get install -y apt-transport-https gnupg2
-RUN curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
-RUN echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | tee -a /etc/apt/sources.list.d/kubernetes.list
-RUN apt-get update && apt-get install -y kubectl
-
-RUN pip install https://github.com/tloubrieu-jpl/incubator-sdap-nexus-ingestion-manager/releases/download/0.4.0+dev/sdap_ingest_manager-0.4.0+dev-py3-none-any.whl
-
-CMD bash
diff --git a/collection_manager/containers/kubernetes/data-volume.yml b/collection_manager/containers/kubernetes/data-volume.yml
deleted file mode 100644
index b2d3815..0000000
--- a/collection_manager/containers/kubernetes/data-volume.yml
+++ /dev/null
@@ -1,35 +0,0 @@
-apiVersion: v1
-kind: PersistentVolume
-metadata:
-  name: data-volume
-  labels:
-    name: data-volume
-spec:
-  capacity:
-    storage: 3Gi
-  volumeMode: Filesystem
-  accessModes:
-  - ReadWriteOnce
-  persistentVolumeReclaimPolicy: Delete
-  storageClassName: hostpath
-  hostPath:
-    path: /Users/loubrieu/PycharmProjects/sdap_ingest_manager/sdap_ingest_manager/ingestion_order_executor/history_manager/data
-    type: Directory
-
----
-
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: data-volume-claim
-spec:
-  accessModes:
-  - ReadWriteOnce
-  volumeMode: Filesystem
-  resources:
-    requests:
-      storage: 3Gi
-  storageClassName: hostpath
-  selector:
-    matchLabels:
-      name: "data-volume"
diff --git a/collection_manager/containers/kubernetes/job.yml b/collection_manager/containers/kubernetes/job.yml
deleted file mode 100644
index 1d8bc16..0000000
--- a/collection_manager/containers/kubernetes/job.yml
+++ /dev/null
@@ -1,39 +0,0 @@
-apiVersion: batch/v1
-kind: Job
-metadata:
-  name: collection-ingester
-spec:
-  template:
-    spec:
-      containers:
-        - name: collections-ingester
-          image: tloubrieu/sdap-ingest-manager:latest
-          imagePullPolicy: IfNotPresent
-          command: ["run_collections", "--config=/opt/sdap_ingester_config/"]
-          volumeMounts:
-            - name: config-vol
-              mountPath: /opt/sdap_ingester_config/
-            - name: data-volume-for-collection-ingester
-              mountPath: /data
-              readOnly: true
-      volumes:
-        - name: config-vol
-          configMap:
-            name: collection-ingester-config
-        - name: data-volume-for-collection-ingester
-          #hostPath:
-          #  path: /Users/loubrieu/PycharmProjects/sdap_ingest_manager/sdap_ingest_manager/ingestion_order_executor/history_manager/data
-          #  type: Directory
-          persistentVolumeClaim:
-            claimName: data-volume-claim
-
-      restartPolicy: Never
-  backoffLimit: 4
-
----
-
-
-
-
-
-
diff --git a/collection_manager/containers/kubernetes/sdap_ingester_config.yml b/collection_manager/containers/kubernetes/sdap_ingester_config.yml
deleted file mode 100644
index 425b687..0000000
--- a/collection_manager/containers/kubernetes/sdap_ingester_config.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-apiVersion: v1
-data:
-  collections.yml: |+
-    # collection id with only letter and -
-    # path: regular expression matching the netcdf files which compose the collection
-    # variable: netcdf variable to be ingested (only one per dataset)
-    # priority: order in which collections will be processed, the smaller numbers first.
-    avhrr-oi-analysed-sst:
-      path: /data/avhrr_oi/*.nc
-      variable: analysed_sst
-      priority: 2
-
-  sdap_ingest_manager.ini: |+
-    [COLLECTIONS_YAML_CONFIG]
-    # config_path is the value sent as argument to the run_collection command, default is /opt/sdap_ingester_config
-    yaml_file = %(config_path)s/collections.yml
-
-    [OPTIONS]
-    # set to False to actually call the ingestion command for each granule
-    # relative path starts at {sys.prefix}/.sdap_ingest_manager
-    dry_run = False
-    # set to True to automatically list the granules as seen on the nfs server when they are mounted on the local file system.
-    deconstruct_nfs = False
-    # number of parallel ingestion pods on kubernetes (1 per granule)
-    parallel_pods = 8
-
-    [INGEST]
-    # kubernetes namespace where the sdap cluster is deployed
-    kubernetes_namespace = sdap
-
-
-kind: ConfigMap
-metadata:
-  creationTimestamp: "2020-04-17T00:11:46Z"
-  name: collection-ingester-config
-  resourceVersion: "2398917"
-  selfLink: /api/v1/namespaces/default/configmaps/collection-ingester
-  uid: b914e302-736c-4c25-9943-ebc33db418ce
diff --git a/collection_manager/docker/Dockerfile b/collection_manager/docker/Dockerfile
new file mode 100644
index 0000000..ce1b577
--- /dev/null
+++ b/collection_manager/docker/Dockerfile
@@ -0,0 +1,16 @@
+FROM python:3
+
+RUN apt-get update && apt-get install -y apt-transport-https gnupg2
+RUN curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
+RUN echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | tee -a /etc/apt/sources.list.d/kubernetes.list
+RUN apt-get update && apt-get install -y kubectl
+
+COPY /collection_manager /collection_manager/collection_manager
+COPY /setup.py /collection_manager/setup.py
+COPY /requirements.txt /collection_manager/requirements.txt
+COPY /README.md /collection_manager/README.md
+COPY /docker/entrypoint.sh /entrypoint.sh
+
+RUN cd /collection_manager && python setup.py install
+
+ENTRYPOINT ["/bin/bash", "/entrypoint.sh"]
diff --git a/collection_manager/docker/entrypoint.sh b/collection_manager/docker/entrypoint.sh
new file mode 100644
index 0000000..eb88f75
--- /dev/null
+++ b/collection_manager/docker/entrypoint.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+python /collection_manager/collection_manager/main.py \
+  $([[ ! -z "$COLLECTIONS_PATH" ]] && echo --collections-path=$COLLECTIONS_PATH) \
+  $([[ ! -z "$RABBITMQ_HOST" ]] && echo --rabbitmq-host=$RABBITMQ_HOST) \
+  $([[ ! -z "$RABBITMQ_USERNAME" ]] && echo --rabbitmq-username=$RABBITMQ_USERNAME) \
+  $([[ ! -z "$RABBITMQ_PASSWORD" ]] && echo --rabbitmq-password=$RABBITMQ_PASSWORD) \
+  $([[ ! -z "$RABBITMQ_QUEUE" ]] && echo --rabbitmq-queue=$RABBITMQ_QUEUE) \
+  $([[ ! -z "$HISTORY_URL" ]] && echo --history-url=$HISTORY_URL) \
+  $([[ ! -z "$HISTORY_PATH" ]] && echo --history-path=$HISTORY_PATH)
diff --git a/collection_manager/setup.py b/collection_manager/setup.py
index 49b0d75..1542486 100644
--- a/collection_manager/setup.py
+++ b/collection_manager/setup.py
@@ -1,9 +1,7 @@
-import setuptools
-import os
-import subprocess
-import sys
 import re
 
+import setuptools
+
 PACKAGE_NAME = "sdap_collection_manager"
 
 with open("./collection_manager/__init__.py") as fi:
@@ -24,7 +22,7 @@ setuptools.setup(
     description="a helper to ingest data in sdap",
     long_description=long_description,
     long_description_content_type="text/markdown",
-    url="https://github.com/tloubrieu-jpl/incubator-sdap-nexus-ingestion-manager",
+    url="https://github.com/apache/incubator-sdap-ingester",
     packages=setuptools.find_packages(),
     classifiers=[
         "Programming Language :: Python :: 3",
diff --git a/granule_ingester/granule_ingester/main.py b/granule_ingester/granule_ingester/main.py
index 29795f7..5a8fc2d 100644
--- a/granule_ingester/granule_ingester/main.py
+++ b/granule_ingester/granule_ingester/main.py
@@ -45,7 +45,8 @@ async def run_health_checks(dependencies: List[HealthCheck]):
 
 
 async def main():
-    parser = argparse.ArgumentParser(description='Process some integers.')
+    parser = argparse.ArgumentParser(description='Listen to RabbitMQ for granule ingestion instructions, and process '
+                                                 'and ingest a granule for each message that comes through.')
     parser.add_argument('--rabbitmq_host',
                         default='localhost',
                         metavar='HOST',