You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sdap.apache.org by GitBox <gi...@apache.org> on 2020/06/25 18:51:23 UTC

[GitHub] [incubator-sdap-ingester] tloubrieu-jpl commented on a change in pull request #4: SDAP-237 Dockerize Collection Manager

tloubrieu-jpl commented on a change in pull request #4:
URL: https://github.com/apache/incubator-sdap-ingester/pull/4#discussion_r445769115



##########
File path: collection_manager/containers/docker/entrypoint.sh
##########
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+python /collection_manager/collection_manager/main.py \

Review comment:
       No sure if this is relevant, you will see.

##########
File path: collection_manager/containers/docker/entrypoint.sh
##########
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+python /collection_manager/collection_manager/main.py \

Review comment:
       The setup.py file can also generate a command line this first line shorted (for example collection-manager):
   
   entry_points={
           'console_scripts': ['collection-manager=collection_manager.collection_manager:main']
       }

##########
File path: collection_manager/README.md
##########
@@ -1,103 +1,75 @@
-# SDAP manager for ingestion of datasets
+# SDAP Collection Manager
 
-## Prerequisites
-
-### python 3
-
-Install anaconda for python 3. From the graphic install for example for macos:
-
-https://www.anaconda.com/distribution/#macos
-
-### git lfs (for development)
-
-Git lfs for the deployment from git, see https://git-lfs.github.com/
-
-If not available you have to get netcdf files for test, if you do need the tests.
-
-### Deployed nexus on kubernetes cluster
+The SDAP Collection Manager is a service that watches a YAML file (the [Collections
+Configuration](#the-collections-configuration-file) file) stored on the filesystem, and all the directories listed in that
+file. Whenever new granules are added to any of the watched directories, the Collection
+Manager service will publish a message to RabbitMQ to be picked up by the Granule Ingester
+(`/granule_ingester` in this repository), which will then ingest the new granules.
 
-See project https://github.com/apache/incubator-sdap-nexus
 
-    $ helm install nexus .  --namespace=sdap --dependency-update -f ~/overridden-nexus-values.yml 
-
-For development purpose, you might want to expose solr port outside kubernetes
-
-   kubectl port-forward solr-set-0 8983:8983 -n sdap 
+## Prerequisites
 
- 
-## For developers
+Python 3.7
 
-### deploy project
+## Building the service
+From `incubator-sdap-ingester/collection_manager`, run:
 
-    $ bash
-    $ git clone ...
-    $ cd sdap_ingest_manager
-    $ python -m venv venv
-    $ source ./venv/bin/activate
-    $ pip install .
-    $ pytest -s
+    $ python setup.py install
     
-Note the command pip install -e . does not work as it does not deploy the configuration files.
-
-### Update the project
-
-Update the code and the test with your favorite IDE (e.g. pyCharm).
-
-### Launch for development/tests
-
-### Prerequisite
 
-Deploy a local rabbitmq service, for example with docker.
+## Running the service
+From `incubator-sdap-ingester/collection_manager`, run:
 
-    docker run -d --hostname localhost -p 5672:5672 --name rabbitmq rabbitmq:3
-   
-   
-### Launch the service
+    $ python collection_manager/main.py -h
+    
+### The Collections Configuration File
 
+A path to a collections configuration file must be passed in to the Collection Manager
+at startup via the `--collections-path` parameter. Below is an example of what the 
+collections configuration file should look like:
 
-The service reads the collection configuration and submit granule ingestion messages to the message broker (rabbitmq).
-For each collection, 2 ingestion priority levels are proposed: the nominal priority, the priority for forward processing (newer files), usually higher. 
-An history of the ingested granules is managed so that the ingestion can stop and re-start anytime.
+```yaml
+# collections.yaml
 
-    cd collection_manager
-    python main.py -h
-    python main.py --collections ../tests/resources/data/collections.yml --history-path=/tmp
+collections:
 
-# Containerization
+    # The identifier for the dataset as it will appear in NEXUS.
+  - id: TELLUS_GRACE_MASCON_CRI_GRID_RL05_V2_LAND 
 
-TO BE UPDATED
+    # The local path to watch for NetCDF granule files to be associated with this dataset. 
+    # Supports glob-style patterns.
+    path: /opt/data/grace/*land*.nc 
 
-## Docker
+    # The name of the NetCDF variable to read when ingesting granules into NEXUS for this dataset.
+    variable: lwe_thickness 
 
-    docker build . -f containers/docker/config-operator/Dockerfile --no-cache --tag tloubrieu/sdap-ingest-manager:latest
-        
-To publish the docker image on dockerhub do (step necessary for kubernetes deployment):
+    # An integer priority level to use when publishing messages to RabbitMQ for historical data. 
+    # Higher number = higher priority.
+    priority: 1 
 
-    docker login
-    docker push tloubrieu/sdap-ingest-manager:latest
-    
-## Kubernetes
-    
-### Launch the service
+    # An integer priority level to use when publishing messages to RabbitMQ for forward-processing data.
+    # Higher number = higher priority.
+    forward-processing-priority: 5 
 
-    kubectl apply -f containers/kubernetes/job.yml -n sdap
-    
-Delete the service: 
+  - id: TELLUS_GRACE_MASCON_CRI_GRID_RL05_V2_OCEAN
+    path: /opt/data/grace/*ocean*.nc
+    variable: lwe_thickness
+    priority: 2
+    forward-processing-priority: 6
 
-    kubectl delete jobs --all -n sdap
-    
-    
+  - id: AVHRR_OI-NCEI-L4-GLOB-v2.0
+    path: /opt/data/avhrr/*.nc
+    variable: analysed_sst
+    priority: 1
 
-    
+```
+## Running the tests
+From `incubator-sdap-ingester/collection_manager`, run:
 
+    $ pip install pytest
+    $ pytest
     
-    
-    
- 
-    
-    
-
-
-
-
+## Building the Docker image
+From `incubator-sdap-ingester/collection_manager`, run:
 
+    $ docker build . -f docker/Dockerfile -t nexusjpl/collection-manager

Review comment:
       The 'containers' subdirectory is missing in the command line.
   
   I have a lot of warning when I run it, it doesn't need for you to change something but I want to make sure you're having the same warnings:
   In file included from ext/_yaml.c:596:
   ext/_yaml.h:10: warning: "PyString_CheckExact" redefined
    #define PyString_CheckExact PyBytes_CheckExact
    
   ext/_yaml.c:486: note: this is the location of the previous definition
      #define PyString_CheckExact          PyUnicode_CheckExact
    
   ext/_yaml.c: In function ‘__pyx_pf_5_yaml_get_version_string’:
   ext/_yaml.c:1891:17: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
      __pyx_v_value = yaml_get_version_string();
                    ^
   
   The docker containers is building on my side.

##########
File path: collection_manager/containers/docker/entrypoint.sh
##########
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+python /collection_manager/collection_manager/main.py \
+  $([[ ! -z "$COLLECTIONS_PATH" ]] && echo --collections-path=$COLLECTIONS_PATH) \
+  $([[ ! -z "$RABBITMQ_HOST" ]] && echo --rabbitmq-host=$RABBITMQ_HOST) \

Review comment:
       Do we need these lines for the kubernetes/helm integration ?
   
   

##########
File path: collection_manager/collection_manager/main.py
##########
@@ -18,34 +18,35 @@ def check_path(path) -> str:
 
 
 def get_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Run ingestion for a list of collection ingestion streams")
-    parser.add_argument("--refresh",
-                        help="refresh interval in seconds to check for new or updated granules",
-                        default=300)
-    parser.add_argument("--collections",
+    parser = argparse.ArgumentParser(description="Watch the filesystem for new granules, and publish messages to "
+                                                 "RabbitMQ whenever they become available.")
+    parser.add_argument("--collections-path",
                         help="Absolute path to collections configuration file",
+                        metavar="PATH",
                         required=True)
-    parser.add_argument('--rabbitmq_host',
+    history_group = parser.add_mutually_exclusive_group(required=True)
+    history_group.add_argument("--history-path",
+                               metavar="PATH",
+                               help="Absolute path to ingestion history local directory")
+    history_group.add_argument("--history-url",
+                               metavar="URL",
+                               help="URL to ingestion history solr database")
+    parser.add_argument('--rabbitmq-host',
                         default='localhost',
                         metavar='HOST',
                         help='RabbitMQ hostname to connect to. (Default: "localhost")')
-    parser.add_argument('--rabbitmq_username',
+    parser.add_argument('--rabbitmq-username',
                         default='guest',
                         metavar='USERNAME',
                         help='RabbitMQ username. (Default: "guest")')
-    parser.add_argument('--rabbitmq_password',
+    parser.add_argument('--rabbitmq-password',
                         default='guest',
                         metavar='PASSWORD',
                         help='RabbitMQ password. (Default: "guest")')
-    parser.add_argument('--rabbitmq_queue',
+    parser.add_argument('--rabbitmq-queue',

Review comment:
       Yes that looks nicer. I think arg parse converts into a proper attribute name on its own. 

##########
File path: collection_manager/collection_manager/main.py
##########
@@ -8,7 +8,7 @@
 
 logging.basicConfig(level=logging.INFO)
 logging.getLogger("pika").setLevel(logging.WARNING)
-logger = logging.getLogger("collection_manager")
+logger = logging.getLogger()

Review comment:
       I am usually doing:
   logger = logging.getLogger(__name__)
   
   Does it behave the same without the __name__ argument ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org