Posted to commits@pirk.apache.org by ja...@apache.org on 2016/09/20 15:08:53 UTC

[1/2] incubator-pirk git commit: PIRK-64: Added new Cloud Instructions page with details on how to use AWS, GCP, and (in the future) Azure with Pirk

Repository: incubator-pirk
Updated Branches:
  refs/heads/gh-pages 14dbf7ade -> 3af67efa4


PIRK-64: Added new Cloud Instructions page with details on how to use AWS, GCP, and (in the future) Azure with Pirk


Project: http://git-wip-us.apache.org/repos/asf/incubator-pirk/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-pirk/commit/e347cfb6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-pirk/tree/e347cfb6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-pirk/diff/e347cfb6

Branch: refs/heads/gh-pages
Commit: e347cfb6a3575c70fc598413ca5e87bee451f738
Parents: 14dbf7a
Author: Jacob Wilder <ja...@apache.org>
Authored: Tue Sep 20 00:09:10 2016 -0400
Committer: Jacob Wilder <ja...@apache.org>
Committed: Tue Sep 20 00:11:18 2016 -0400

----------------------------------------------------------------------
 _includes/nav.html    |   1 +
 cloud_instructions.md | 143 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 144 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-pirk/blob/e347cfb6/_includes/nav.html
----------------------------------------------------------------------
diff --git a/_includes/nav.html b/_includes/nav.html
index 84275f6..eab369f 100755
--- a/_includes/nav.html
+++ b/_includes/nav.html
@@ -17,6 +17,7 @@
         <ul class="dropdown-menu">
           <li id="nav_users"><a href="{{ site.baseurl }}/for_users">For Users</a></li>
           <li id="nav_developers"><a href="{{ site.baseurl }}/for_developers">For Developers</a></li>
+          <li id="nav_developers"><a href="{{ site.baseurl }}/cloud_instructions">Cloud instructions</a></li>
           <li id="nav_papers"><a href="{{ site.baseurl }}/papers">Papers &amp Presentations</a></li>
           <li class="nav_faq"><a href="{{ site.baseurl }}/faq">FAQ</a></li>
           <li class="divider"></li>

http://git-wip-us.apache.org/repos/asf/incubator-pirk/blob/e347cfb6/cloud_instructions.md
----------------------------------------------------------------------
diff --git a/cloud_instructions.md b/cloud_instructions.md
new file mode 100644
index 0000000..00340d4
--- /dev/null
+++ b/cloud_instructions.md
@@ -0,0 +1,143 @@
+---
+title: Running Pirk in Cloud Environments (GCP, AWS, Azure)
+nav: nav_commercial_cloud
+---
+
+## Google Cloud Platform (GCP)
+This guide walks through the steps to get Pirk running on Google Cloud Platform.
+
+
+### Steps
+ 1. Create a Google Cloud account at [https://cloud.google.com/](https://cloud.google.com/). You may be eligible for a free-trial credit ($300, valid for the first 60 days).
+ 2. Install the [`gcloud` command line tool](https://cloud.google.com/sdk/) and run `gcloud init`. This will let you authorize the `gcloud` tool against your Google account.
+ 3. Create a new project, for example `pirkongcpexample`.
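+    One way to do this from the CLI is sketched below (assumes a `gcloud` version that includes the `projects create` command; otherwise create the project in the web console, and `pirkongcpexample` is just the example name from above):
+
+        gcloud projects create pirkongcpexample
+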
+ 4. Enable billing for that project (under the Billing section of the console). If you are a free trial user you may not need to change anything in the billing settings.
+ 5. Enable the Dataproc API. At the time of writing, [this page](https://console.developers.google.com/apis/api/dataproc/overview) was where you enabled it (if you have more than one project, you may need to switch to the correct one using the picker next to the Google logo). Ignore any warnings you see about a need to get credentials.
+ 6. Spin up a cluster (replace $PROJECTNAME with the project name you used above):  
+`gcloud dataproc clusters create cluster-1 --zone us-east1-b --master-machine-type n1-standard-2 --master-boot-disk-size 150 --num-workers 3 --worker-machine-type n1-highmem-2 --worker-boot-disk-size 25 --project `**`$PROJECTNAME`**` --properties yarn:yarn.log-aggregation-enable=true,yarn:yarn.scheduler.maximum-allocation-mb=10240,yarn:yarn.nodemanager.resource.memory-mb=10240`
+
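+    Once the command returns, you can confirm the cluster came up (a sketch; `gcloud dataproc clusters list` is the standard listing command and `--project` is a global flag):
+
+        gcloud dataproc clusters list --project $PROJECTNAME
+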
+ 7. Once this completes, run `gcloud compute config-ssh`. This adds entries to your `~/.ssh/config` that let you connect to your cluster nodes; look at `~/.ssh/config` to see the full list. An example:
+
+        Host cluster-1-m.us-east1-b.pirkongcpexample
+            HostName 105.126.210.81
+            IdentityFile /Users/pirkdev/.ssh/google_ed25519
+            UserKnownHostsFile=/Users/pirkdev/.ssh/google_compute_known_hosts
+            HostKeyAlias=compute.1295540156620891161
+            IdentitiesOnly=yes
+            CheckHostIP=no
+
+    To SSH to this node, type `ssh -D 10010 cluster-1-m.us-east1-b.pirkongcpexample` (the `-m` indicates the master). The `-D 10010` flag is optional; it opens a SOCKS proxy that you can point a web browser at to see the [web interfaces](https://cloud.google.com/dataproc/concepts/cluster-web-interfaces).
+  
+ 8. On GCP the default property `spark.home=/usr` is incorrect. Since `/root/` isn't accessible
+ (and thus putting an additional properties file in `local.pirk.properties.dir=/root/` isn't viable), one solution is to modify the
+ compiled-in `pirk.properties` to set `local.pirk.properties.dir=/home/pirkdev/share` instead. (Change `pirkdev` to your username on the node.)
+ Then, at `/home/pirkdev/share/sparkfix.properties`, put a file containing `spark.home=/usr/lib/spark/bin`.
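+    A minimal sketch of creating that file on the master node (assumes your username is `pirkdev`):
+
+        mkdir -p /home/pirkdev/share
+        echo 'spark.home=/usr/lib/spark/bin' > /home/pirkdev/share/sparkfix.properties
+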
+ 9. Transfer your compiled jar to the cluster: e.g. `scp apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar cluster-1-m.us-east1-b.pirkongcpexample:~`
+ 10. Run your jar. For example: 
+ 
+            hadoop jar apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar org.apache.pirk.test.distributed.DistributedTestDriver -j $PWD/apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar
+            
+            spark-submit --class org.apache.pirk.test.distributed.DistributedTestDriver  $PWD/apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar -j $PWD/apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar -t 1:JS
+
+ 11. When you want to stop your cluster: 
+ 
+            [pirkdev:~] 2 % gcloud compute instances list
+            NAME           ZONE        MACHINE_TYPE    …
+            cluster-1-m    us-east1-b  n1-standard-2   …
+            cluster-1-w-0  us-east1-b  n1-highmem-2    …
+            cluster-1-w-1  us-east1-b  n1-highmem-2    …
+            cluster-1-w-2  us-east1-b  n1-highmem-2    …
+            [pirkdev:~] 2 % gcloud compute instances stop cluster-1-m cluster-1-w-0 cluster-1-w-1 cluster-1-w-2
+
+ Stop your instances to save money. [To quote Google](https://cloud.google.com/compute/docs/instances/stopping-or-deleting-an-instance):
+
+> A stopped instance does not incur charges, but all of the resources that are attached to 
+> the instance will still be charged. For example, you are charged for persistent disks and 
+> external IP addresses according to the price sheet, even if an instance is stopped. To stop 
+> being charged for attached resources, you can reconfigure a stopped instance to not use 
+> those resources, and then delete the resources.
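+
+If you are completely done with the cluster, you can delete it rather than just stopping it (a sketch; `gcloud dataproc clusters delete` is the standard teardown command, using the example names from above):
+
+    gcloud dataproc clusters delete cluster-1 --project $PROJECTNAME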
+
+
+## Microsoft Azure
+Right now Pirk can't be run on Microsoft's Azure HDInsight Hadoop platform because HDInsight only supports Java 7 (Pirk requires Java 8). Committer Jacob Wilder emailed a Microsoft engineer who works on HDInsight and learned that Java 8 support is on Microsoft's roadmap for the end of September or October 2016.
+
+### Steps that will likely eventually work
+
+These directions are based on the [basic CLI instructions](https://azure.microsoft.com/en-us/documentation/articles/xplat-cli-install/) and the article [Create Linux-based clusters in HDInsight using the Azure CLI](https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-create-linux-clusters-azure-cli/).
+
+A note on HDInsight pricing: 
+> HDInsight clusters billing is pro-rated per minute, whether you are using them or not. Please be sure to delete your cluster after you have finished using it. For information on deleting a cluster, see [How to delete an HDInsight cluster](https://azure.microsoft.com/en-us/documentation/articles/hdinsight-delete-cluster/).
+
+1. Create a Microsoft Azure account and either add billing information or get some credit for it. 
+2. If you haven't used Azure before, deploy [this template](https://github.com/Azure/azure-quickstart-templates/blob/master/101-acs-mesos/docs/CommonIssues.md). It will set up your account with licenses for the right resources. Don't forget to delete it after deploying.
+3. Install the Azure CLI and run `azure login` to authenticate.
+4. Enter resource manager mode with `azure config mode arm`.
+5. Pick a location from the location list (`azure location list`), e.g. `eastus`. The remainder of this tutorial uses `eastus`, but you can substitute another location.
+6. Create the resource group for the cluster. This example uses the name `pirkcluster1`; you can pick a different one.  `azure group create `**`pirkcluster1`**` eastus`
+7. Create a storage account for the cluster: `azure storage account create -g `**`pirkcluster1`**` --sku-name RAGRS -l eastus --kind Storage `**`pirkstorage1`**
+8. Get one of the access keys for the storage account. `key1` is fine. 
+
+        % azure storage account keys list -g pirkcluster1 pirkstorage1
+        info:    Executing command storage account keys list
+        + Getting storage account keys                                
+        data:    Name  Key                                Permissions
+        data:    ----  ----------------------------------------------
+        data:    key1  [a bunch of base64, save THIS]     Full  
+        data:    key2  another bunch of base 64           Full       
+        info:    storage account keys list command OK
+
+9. Register for the Azure HDInsight provider: `azure provider register Microsoft.HDInsight`
+10. Create the cluster. Replace the boldfaced values. You already have `key1_from_above_command` from the previous step. Select your own SSH and HTTP passwords. In this example **`pirkhdinsight1`** is the name that will be used to SSH into the cluster and manage it.  
+    `azure hdinsight cluster create -g pirkcluster1 -l eastus -y Linux --clusterType Hadoop --defaultStorageAccountName pirkstorage1.blob.core.windows.net --defaultStorageAccountKey `**`key1_from_above_command`**` --defaultStorageContainer `**`pirkhdinsight1`**` --workerNodeCount 2 --userName admin --password `**`httppassword`**` --sshUserName sshuser --sshPassword `**`sshpassword`**` `**`pirkhdinsight1`**
+    
+11. This command takes about 15 minutes. Once it finishes, you can log into your cluster using `ssh sshuser@`**`pirkhdinsight1`**`-ssh.azurehdinsight.net`, where you've replaced **`pirkhdinsight1`** with the name of your cluster.
+12. You may choose to install your SSH keys at this point using a command like `ssh-copy-id -i ~/.ssh/azure_ed25519 -o PubkeyAuthentication=no sshuser@pirkhdinsight1-ssh.azurehdinsight.net`.
+13. At this point you can't go any further, since HDInsight doesn't yet support Java 8. Delete your cluster and wait for Azure HDInsight to add Java 8 support.
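+
+    A sketch of the teardown from the CLI (the `delete` subcommand syntax is assumed by analogy with `create` above; verify it against the linked article on deleting HDInsight clusters):
+
+        azure hdinsight cluster delete -g pirkcluster1 pirkhdinsight1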
+
+
+## Amazon Web Services EC2 EMR
+1. You'll need to have an AWS account with credit or billing information. 
+2. Either create a key pair within the AWS user interface or make one locally and upload the public key. Note the exact name of the key pair in the AWS interface, because it is an argument to later commands. The key pair used in this tutorial is called `amazon_rsa` in the Amazon user interface, and the private key is located on the local machine at `~/.ssh/amazon_rsa`.
+3. Install the [AWS CLI](https://aws.amazon.com/cli/) (for example with `pip install awscli`), run `aws configure`, and input the required Access Key ID and Secret Access Key associated with your account.
+4. Run `aws emr create-default-roles`. 
+5. Before you can create a cluster, you need to make a JSON file locally. Call it (for example) `aws-cluster-conf.json`, with these contents:
+  
+        [
+          {
+            "Classification": "yarn-site",
+            "Properties": {
+              "yarn.nodemanager.aux-services": "mapreduce_shuffle",
+              "yarn.nodemanager.aux-services.mapreduce_shuffle.class": "org.apache.hadoop.mapred.ShuffleHandler"
+            }
+          }
+        ]
+    
+    This configuration file sets some YARN options that, if left unset,
+    prevent distributed Pirk from running.
+
+6. Create the cluster:
+
+        aws emr create-cluster \
+           --name "Spark Cluster" \
+           --release-label emr-5.0.0 \
+           --applications Name=Spark \
+           --ec2-attributes KeyName=amazon_rsa \
+           --instance-type m3.xlarge \
+           --instance-count 3 \
+           --use-default-roles \
+           --configurations file://./aws-cluster-conf.json
+    
+    Make note of the ClusterId it returns. For the remainder of these steps, assume that **`$cid`** has been set equal to the cluster ID (you may find it convenient to do this using `export cid=YOURCLUSTERID`; see the sketch below).
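+
+    You can also capture the ID in one step by adding the standard CLI `--query`/`--output` flags to the command above (a sketch; the other arguments are identical):
+
+        export cid=$(aws emr create-cluster \
+           --name "Spark Cluster" \
+           --release-label emr-5.0.0 \
+           --applications Name=Spark \
+           --ec2-attributes KeyName=amazon_rsa \
+           --instance-type m3.xlarge \
+           --instance-count 3 \
+           --use-default-roles \
+           --configurations file://./aws-cluster-conf.json \
+           --query 'ClusterId' --output text)
+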
+7. Wait for your cluster to be ready. You might find this command helpful: `watch -n 60 aws emr describe-cluster --output json --cluster-id `**`$cid`**
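+
+    Alternatively, the AWS CLI ships a built-in waiter that blocks until the cluster is running (a sketch):
+
+        aws emr wait cluster-running --cluster-id $cid
+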
+8. Once your cluster is ready, go into the [AWS console in your browser](https://console.aws.amazon.com/ec2/v2/home) and add a firewall rule enabling SSH access. Select the correct region in the upper corner, then click on Security Groups in the left hand column. Find the row with the Group Name "ElasticMapReduce-master", select the Inbound tab in the lower pane, click Edit, and then add a Rule for SSH (in the drop down menu) with Source "My IP" (change this to another value if desired).
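+
+    If you prefer the CLI, a sketch of an equivalent rule (assumes a default-VPC account where `--group-name` resolves, otherwise use `--group-id` with the group's ID; replace `203.0.113.5/32` with your own public IP):
+
+        aws ec2 authorize-security-group-ingress --group-name ElasticMapReduce-master \
+            --protocol tcp --port 22 --cidr 203.0.113.5/32
+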
+9. Upload the jar file (underneath the covers this is running `scp`):  
+`aws emr put --cluster-id `**`$cid`**` --key-pair-file `**`~/.ssh/amazon_rsa`**` --src apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar`
+10. SSH in using  
+`aws emr ssh --cluster-id `**`$cid`**` --key-pair-file `**`~/.ssh/amazon_rsa`**  
+If you want to SSH in and set up a SOCKS proxy to access the [web interfaces](https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html) (as in the GCP instructions), copy the output of the SSH command and add the `-D $SOCKSPORTNUM` flag, as in the sketch below. The YARN resource manager is on port 8088 of the master node.
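+
+    A sketch of the resulting command (the master hostname here is hypothetical; take the real one from the output of `aws emr ssh`, and note that the login user on EMR masters is `hadoop`):
+
+        ssh -D 10010 -i ~/.ssh/amazon_rsa hadoop@ec2-203-0-113-10.compute-1.amazonaws.com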
+
+11. Now on the cluster, you can run the distributed tests:  
+`hadoop jar apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar org.apache.pirk.test.distributed.DistributedTestDriver -j $PWD/apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar -t 1:J`
+12. When you are done working with your cluster, terminate it:  
+`aws emr terminate-clusters --cluster-ids `**`$cid`**
\ No newline at end of file


[2/2] incubator-pirk git commit: Updated the instructions on updating the website

Posted by ja...@apache.org.
Updated the instructions on updating the website


Project: http://git-wip-us.apache.org/repos/asf/incubator-pirk/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-pirk/commit/3af67efa
Tree: http://git-wip-us.apache.org/repos/asf/incubator-pirk/tree/3af67efa
Diff: http://git-wip-us.apache.org/repos/asf/incubator-pirk/diff/3af67efa

Branch: refs/heads/gh-pages
Commit: 3af67efa423a10b572cf43012804413193616754
Parents: e347cfb
Author: Jacob Wilder <ja...@apache.org>
Authored: Tue Sep 20 10:55:56 2016 -0400
Committer: Jacob Wilder <ja...@apache.org>
Committed: Tue Sep 20 10:55:56 2016 -0400

----------------------------------------------------------------------
 website_updates.md | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-pirk/blob/3af67efa/website_updates.md
----------------------------------------------------------------------
diff --git a/website_updates.md b/website_updates.md
index 1dfe07f..d80f2fd 100644
--- a/website_updates.md
+++ b/website_updates.md
@@ -3,10 +3,14 @@ title: Updating the Website
 nav: nav_website_updates
 ---
 
-Pirk's web site is developed using [Jekyll](https://jekyllrb.com). The Jekyll source is contained in the [gh-pages](https://git-wip-us.apache.org/repos/asf?p=incubator-pirk.git;a=tree;h=gh-pages) branch and the [asf-site](https://git-wip-us.apache.org/repos/asf?p=incubator-pirk.git;a=tree;h=asf-site) branch contains the HTML content.
+Pirk's web site is developed using [Jekyll](https://jekyllrb.com). The Jekyll source is contained in the 
+[gh-pages](https://git-wip-us.apache.org/repos/asf?p=incubator-pirk.git;a=tree;h=gh-pages) branch and the 
+[asf-site](https://git-wip-us.apache.org/repos/asf?p=incubator-pirk.git;a=tree;h=asf-site) branch contains 
+the HTML content.
 
 Website development is
-performed by editing the contents of the [gh-pages](https://git-wip-us.apache.org/repos/asf?p=incubator-pirk.git;a=tree;h=gh-pages) branch, either
+performed by editing the contents of the 
+[gh-pages](https://git-wip-us.apache.org/repos/asf?p=incubator-pirk.git;a=tree;h=gh-pages) branch, either
 directly by a committer, with a pull request to [GitHub](https://github.com/apache/incubator-pirk), or a patch
 submitted to [JIRA](https://issues.apache.org/jira/browse/PIRK). The rendered site can be previewed locally or on
 [GitHub](https://apache.github.io/incubator-pirk/), and the rendered site (in the `_site` directory) will be
@@ -37,7 +41,12 @@ branch. You can also run this command manually:
     ./_devtools/git-hooks/post-commit
 
 To automatically run this post-commit hook in your local repository, copy
-the given file into your `.git/hook` directory:
+the given file into your `.git/hooks` directory, make the file executable, and run it:
 
     cp ./_devtools/git-hooks/post-commit .git/hooks/
-    
\ No newline at end of file
+    chmod +x .git/hooks/post-commit
+    ./.git/hooks/post-commit
+
+Finally, ensure you have pushed the changes in both the `asf-site` and `gh-pages` branches. The `gh-pages` branch 
+contains the source used by the `post-commit` script to generate the static HTML which lives in the
+`asf-site` branch and is actually served.
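+
+For example, a minimal sketch (assumes `origin` points at the Apache repository and both branches exist locally):
+
+    git push origin gh-pages asf-site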
\ No newline at end of file