Posted to commits@sdap.apache.org by nc...@apache.org on 2023/05/11 23:49:36 UTC

[incubator-sdap-in-situ-data-services] branch master updated: Update README and Deployment-in-AWS according to latest code changes (#21)

This is an automated email from the ASF dual-hosted git repository.

nchung pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-sdap-in-situ-data-services.git


The following commit(s) were added to refs/heads/master by this push:
     new 3f9690f  Update README and Deployment-in-AWS according to latest code changes (#21)
3f9690f is described below

commit 3f9690f09c19310d0152ad027acd9d55920cf962
Author: Jason Min-Liang Kang <ja...@gmail.com>
AuthorDate: Thu May 11 16:49:31 2023 -0700

    Update README and Deployment-in-AWS according to latest code changes (#21)
    
    * Update README and Deployment-in-AWS according to latest code changes
    
    * Update changelog with SDAP ticket number
    
    * Make correction to Deployment-in-AWS
---
 CHANGELOG.md         |   1 +
 Deployment-in-AWS.md | 133 +++++++++++++++++++++++++++------------------------
 README.md            |  14 ++++--
 3 files changed, 81 insertions(+), 67 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index ad3b640..e1ad6e0 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 ### Added
+- SDAP-464: Updated AWS deployment guide
 ### Changed
 - Updated Elasticsearch *query_by_id* method to accept an *index* as argument
 - SDAP-462: Updated query logic so that depth -99999 is treated as surface (i.e. depth 0)
diff --git a/Deployment-in-AWS.md b/Deployment-in-AWS.md
index c18fdb0..59786d1 100644
--- a/Deployment-in-AWS.md
+++ b/Deployment-in-AWS.md
@@ -1,77 +1,80 @@
-## AWS Resources
-- while there is a plan to have a terraform setup for this setup, it is still in development
-- Currently, they are setup manually.
+# About
+This is a detailed guide to deploying SDAP In-Situ to AWS. A terraform setup is planned but still under development, so deployment currently requires the manual steps below.
 
+# Prerequisites/Preparation
+## AWS Resource Provisioning
 ### S3
-- 1 bucket which stores Parquet data
-### DynamoDB
-- 1 table for Parquet metadata mapped to S3 file metadata
-- Table setting
-
-        Partition key: s3_url (String)
-        Sort key: -
-        Capacity mode: On-demand
-### IAM for long term tokens or IAM role permissions
-- S3 bucket full access
-- Read permission for Other S3 buckets where in-situ json file are stored.
-- DDB: full access for the mentioned table 
-### EKS setup
-- This was setup by the SA. The instruction will be added in the future.
-- Note that IAM roles can be setup for EKS, but SAs have not set that up yet. 
-- Long term tokens were created and given to the system which are used instead of IAM roles for EKS
-
-### Create Namespace
-- for this deployment, we are using namespace `bitnami-spark`
-
-        kubectl create namespace bitnami-spark
+- 1 bucket to store Parquet data
+- additional bucket(s), as necessary, to store the in-situ JSON files to be ingested
+### OpenSearch
+- create a domain named `sdap-in-situ` ([AWS OpenSearch guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gsg.html))
+- under `sdap-in-situ`, create 2 indices named `entry_file_record` and `parquet_stats`
+- for `entry_file_record`, create an alias named `entry_file_records_alias`
+- for `parquet_stats`, create an alias named `parquet_stats_alias` (see the sketch after this list)
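+- a minimal sketch of creating one index and its alias through the OpenSearch REST API, assuming fine-grained access control with basic auth; `{domain-endpoint}`, `{user}`, and `{password}` are placeholders (IAM-signed requests would differ):
+
+        # create the index, then attach its alias
+        curl -XPUT 'https://{domain-endpoint}/entry_file_record' -u '{user}:{password}'
+        curl -XPOST 'https://{domain-endpoint}/_aliases' -u '{user}:{password}' \
+            -H 'Content-Type: application/json' \
+            -d '{"actions": [{"add": {"index": "entry_file_record", "alias": "entry_file_records_alias"}}]}'
+
+- repeat both calls for `parquet_stats` and `parquet_stats_alias`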
+### EKS
+- set up EKS in AWS environment ([AWS EKS guide](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html))
+- make sure [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/) is able to communicate with EKS
+- create a namespace called `nexus-in-situ`
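+- a sketch of the above steps, assuming the AWS CLI is configured; `{aws-region}` and `{eks-cluster-name}` are placeholders:
+
+        # point kubectl at the EKS cluster, verify connectivity, then create the namespace
+        aws eks update-kubeconfig --region {aws-region} --name {eks-cluster-name}
+        kubectl get nodes
+        kubectl create namespace nexus-in-situ
+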
+### IAM Access Key
+- create a user called `nexus-in-situ-user`
+- create an access key for `nexus-in-situ-user` to be used by SDAP In-Situ
+- `nexus-in-situ-user` should have the following permissions
+    - S3 bucket full access
+    - OpenSearch full access
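+- a hedged AWS CLI sketch of the steps above; the managed policies used here are broader than necessary, and a custom policy scoped to the provisioned bucket and domain is preferable:
+
+        # create the user and grant S3 and OpenSearch access
+        aws iam create-user --user-name nexus-in-situ-user
+        aws iam attach-user-policy --user-name nexus-in-situ-user \
+            --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
+        aws iam attach-user-policy --user-name nexus-in-situ-user \
+            --policy-arn arn:aws:iam::aws:policy/AmazonOpenSearchServiceFullAccess
+        # the key and secret in the output are used by SDAP In-Situ
+        aws iam create-access-key --user-name nexus-in-situ-user
+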
+## Prepare SDAP In-Situ For Deployment
+### Required software
+- [docker cli](https://docs.docker.com/engine/install/) (installed as part of Docker Engine)
+- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/)
+- [helm cli](https://helm.sh/docs/intro/quickstart/)
+### Build Parquet Flask container
+- build docker container using [parquet.spark.3.2.0.r44.Dockerfile](docker/parquet.spark.3.2.0.r44.Dockerfile)
 
+        docker build -f docker/parquet.spark.3.2.0.r44.Dockerfile -t {your-tag} .
 
-### Build Parquet Flask container
-- build it using this [Dockerfile](k8s_spark/parquet.spark.3.2.0.r44.Dockerfile)
-- Note that the image name and tag needs to be updated. 
+### Push docker container
+- push docker container to a docker container registry
 
-        docker build -f ../docker/parquet.spark.3.2.0.r44.Dockerfile -t waiphyojpl/cdms.parquet.flask:t7 ..
+        docker push {your-tag}
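+
+- if the registry is Amazon ECR, a hedged login sketch; `{account-id}` and `{aws-region}` are placeholders:
+
+        # authenticate docker against the private ECR registry before pushing
+        aws ecr get-login-password --region {aws-region} | \
+            docker login --username AWS --password-stdin {account-id}.dkr.ecr.{aws-region}.amazonaws.com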
 
-### Deploy Parquet Flask container and Spark Cluster to kubernetes namespace: `bitnami-spark`
-- We are nesting the bitnami spark helm chart as a dependency within our parquet-spark-helm helm chart.
-- Spark custom values are maintained in `values.yaml` within the `bitnami-spark` YAML block.
-- With the default values, Spark master can be accessed inside the `VPC` as the `NodePort` is being used.
-- Spark AWS Loadbalancer setting should only be used if internal load-balancers are used on private subnets, which would allow intra-VPC access only and block public access.
-- This is the sample values.yaml with explanations
+# Deployment
+## Edit `values.yaml`
+- the Bitnami Spark helm chart is a dependency of SDAP In-Situ; its values can be overridden under the `bitnami-spark` YAML block
+- update values in [values.yaml](k8s_spark/parquet.spark.helm/values.yaml) to match the AWS resources provisioned above
+- a sample `values.yaml` with explanations follows (more options and explanations can be found inside `values.yaml` itself); a filled-in example appears after the sample
 
         flask_env:
-          parquet_file_name: "S3 URL for the bucket created in the first step. Note that `s3a` needs to be used. s3a://cdms-dev-in-situ-parquet/CDMS_insitu.parquet"
-          spark_app_name: "any name of the spark app name. example: parquet_flask_demo"
-          log_level: "python3 log level. DEBUG, INFO, and etc."
-          parquet_metadata_tbl: "DynamoDB table name: cdms_parquet_meta_dev_v1"
-          spark_config_dict: {"spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"} Change to a `org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider` if IAM based credentials are used. But it is not tested at this moment.
-        
-        aws_creds: 'Will create a secret and related env vars on the deployment. Unnecessary if AWS IRSA is being used.
-          awskey: 'long term aws Key. leave an empty string if using IAM'
-          awssecret: 'long term aws secret. leave an empty string if using IAM'
-          awstoken: 'aws session if using locally for a short lived aws credentials. leave an empty string if using IAM'
+            parquet_file_name: "S3 URL for the bucket created in the first step. Note that `s3a` needs to be used, e.g. s3a://cdms-dev-in-situ-parquet/CDMS_insitu.parquet"
+            spark_app_name: "any name for the Spark app, e.g. parquet_flask"
+            log_level: "python3 log level, e.g. DEBUG or INFO"
+            flask_prefix: "prefix for the SDAP In-Situ API"
+            es_url: "URL of the AWS OpenSearch domain"
+            es_port: "AWS OpenSearch port number (defaults to 443)"
+            spark_config_dict: {"spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"} # change to `org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider` if IAM-based credentials are used (not yet tested)
+
+        aws_creds:
+            awskey: "long-term AWS key. leave an empty string if using IAM"
+            awssecret: "long-term AWS secret. leave an empty string if using IAM"
+            awstoken: "AWS session token, for short-lived credentials when running locally. leave an empty string if using IAM"
 
         serviceAccount:
-          create: true
-          annotations: {} 'Can be used for [AWS IRSA](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-enable-IAM.html)'.  Uncomment the line below and specify an IAM Role to enable.
-            # eks.amazonaws.com/role-arn: 'arn:aws:iam::xxxxxxxxxxxxxx:role/parquet-spark'
+            create: true
+            annotations: {} # can be used for [AWS IRSA](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-enable-IAM.html); uncomment the line below and specify an IAM role to enable it
+                # eks.amazonaws.com/role-arn: 'arn:aws:iam::xxxxxxxxxxxxxx:role/parquet-spark'
 
         image:
-          repository: "the name value from `Build Parquet Flask container` step. example: waiphyojpl/cdms.parquet.flask"
-          pullPolicy: IfNotPresent
-          # Overrides the image tag whose default is the chart appVersion.
-          tag: "the tag value from `Build Parquet Flask container` step. example: t7"
-
-        bitnami-spark: 'Default values for the Bitnami Spark helm chart.'
-- Update values in `values.yaml` to match your environment (dynamodb table name, s3 bucket name, etc.)
+            repository: "the name value from `Build Parquet Flask container` step. example: waiphyojpl/cdms.parquet.flask"
+            pullPolicy: IfNotPresent
+            # Overrides the image tag whose default is the chart appVersion.
+            tag: "the tag value from `Build Parquet Flask container` step. example: t7"
 
-- Create the `bitnami-spark` namespace
+        bitnami-spark: 'Default values for the Bitnami Spark helm chart.'
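+
+- as an illustration only (the bucket, prefix, and endpoint values below are placeholders, not defaults), a filled-in `flask_env` block might look like:
+
+        flask_env:
+            parquet_file_name: "s3a://cdms-dev-in-situ-parquet/CDMS_insitu.parquet"
+            spark_app_name: "parquet_flask"
+            log_level: "INFO"
+            flask_prefix: "1.0"
+            es_url: "https://search-sdap-in-situ-abc123.us-west-2.es.amazonaws.com"
+            es_port: "443"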
 
-        $ kubectl create namespace bitnami-spark
+## Deploy SDAP In-Situ and Spark Cluster to k8s namespace: `nexus-in-situ`
 - From the [helm folder](k8s_spark/parquet.spark.helm), run this command
 
-        $ helm install parquet-t1 . -n bitnami-spark --dependency-update
-- After deploying, this is what kubernetes should look like: `kubectl get all -n bitnami-spark`
+        helm install -n nexus-in-situ --dependency-update parquet-t1 .
+
+- after deploying, `kubectl get all -n nexus-in-situ` should show something like this:
 
         NAME                                                 READY   STATUS    RESTARTS   AGE
         pod/parquet-t1-bitnami-spark-master-0                1/1     Running   0          29d
@@ -95,15 +98,21 @@
         NAME                                               READY   AGE
         statefulset.apps/parquet-t1-bitnami-spark-master   1/1     29d
         statefulset.apps/parquet-t1-bitnami-spark-worker   4/4     29d
-- The parquet port can be forwarded if needed. There are plans to use `ingress` or `AWS Loadbalancer`, but they are still in development as SA hasn't approved it yet.
+
+- if needed, the SDAP In-Situ port can be forwarded locally:
 
-        kubectl port-forward service/parquet-t1-parquet-spark-helm -n bitnami-spark 9801:9801
+        kubectl port-forward service/parquet-t1-parquet-spark-helm -n nexus-in-situ 9801:9801
 
 ### Querying Parquet via Flask
 - Example command:
         
-        time curl 'http://localhost:30801/1.0/query_data_doms?startIndex=3&itemsPerPage=20&minDepth=-99&variable=wind_speed&columns=air_pressure&maxDepth=-1&startTime=2019-02-14T00:00:00Z&endTime=2021-02-16T00:00:00Z&platform=3B&bbox=-111,11,111,99'
+        curl 'http://localhost:30801/1.0/query_data_doms?startIndex=3&itemsPerPage=20&minDepth=-99&variable=wind_speed&columns=air_pressure&maxDepth=-1&startTime=2019-02-14T00:00:00Z&endTime=2021-02-16T00:00:00Z&platform=3B&bbox=-111,11,111,99'
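+
+- note: the example above assumes the service is exposed on NodePort `30801`; when using the `kubectl port-forward` command above instead, replace `localhost:30801` with `localhost:9801`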
 
-### Documentation and Sources
-- [Bitnami Spark Helm Chart](https://github.com/bitnami/charts/tree/master/bitnami/spark)
+# References
+- [Bitnami Spark Helm Chart values](https://github.com/bitnami/charts/tree/master/bitnami/spark)
 - [AWS IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-enable-IAM.html)
+- [AWS EKS guide](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html)
+- [kind quick start guide](https://kind.sigs.k8s.io/docs/user/quick-start/)
+- [docker cli](https://docs.docker.com/engine/install/)
+- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/)
+- [helm cli](https://helm.sh/docs/intro/quickstart/)
\ No newline at end of file
diff --git a/README.md b/README.md
index d08c868..178f0f9 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,11 @@
-# Insitu Data in Parquet format stored in S3
+# About
+Ingest in-situ data (in JSON) into AWS S3 as Parquet object files.
 
-### How to ingest a insitu json file to Parquet
+# Deployment
+Follow [this guide](Deployment-in-AWS.md) to deploy SDAP In-Situ to AWS cloud.
+
+# Ingestion
+## How to ingest an in-situ JSON file to Parquet
 - Assumption: K8s is successfully deployed
 - Download this repo
 - (optional) create different python3.6 environment
@@ -30,8 +35,8 @@
               --BUCKET_NAME cdms-dev-ncar-in-situ-stage  \
               --KEY_PREFIX cdms_icoads_2017-01-01.json
   
-### Ref:
-- how to replace parquet file partially
+# Useful Commands
+- how to partially overwrite a Parquet file's partitions:
 ```
 https://stackoverflow.com/questions/38487667/overwrite-specific-partitions-in-spark-dataframe-write-method?noredirect=1&lq=1
 > Finally! This is now a feature in Spark 2.3.0: SPARK-20236
@@ -40,5 +45,4 @@ https://stackoverflow.com/questions/38487667/overwrite-specific-partitions-in-sp
 
 spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
 data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
-
 ```
\ No newline at end of file