Posted to commits@zeppelin.apache.org by ah...@apache.org on 2017/02/28 04:12:11 UTC

zeppelin git commit: [MINOR] add pig wiki page to pig doc

Repository: zeppelin
Updated Branches:
  refs/heads/master 8bb37c2c2 -> a26bd2d76


[MINOR] add pig wiki page to pig doc

### What is this PR for?
Add the Pig wiki page to the Pig doc

### What type of PR is it?
[Documentation]

### Todos
* [ ] - Task

### What is the Jira issue?
No jira created

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: Jeff Zhang <zj...@apache.org>

Closes #2004 from zjffdu/pig_doc and squashes the following commits:

e5a564a [Jeff Zhang] rename zeppelin to Zeppelin and pig to Pig
65458ff [Jeff Zhang] address comments and minor update on pig tutorial
c6cb5ff [Jeff Zhang] update pig tutorial
b8542de [Jeff Zhang] [MINOR] add pig wiki page to pig doc


Project: http://git-wip-us.apache.org/repos/asf/zeppelin/repo
Commit: http://git-wip-us.apache.org/repos/asf/zeppelin/commit/a26bd2d7
Tree: http://git-wip-us.apache.org/repos/asf/zeppelin/tree/a26bd2d7
Diff: http://git-wip-us.apache.org/repos/asf/zeppelin/diff/a26bd2d7

Branch: refs/heads/master
Commit: a26bd2d76a50b8911ed39e99054d71a641cce8c9
Parents: 8bb37c2
Author: Jeff Zhang <zj...@apache.org>
Authored: Mon Feb 27 14:42:19 2017 +0800
Committer: ahyoungryu <ah...@apache.org>
Committed: Tue Feb 28 13:12:02 2017 +0900

----------------------------------------------------------------------
 .../zeppelin/img/pig_zeppelin_tutorial.png      | Bin 0 -> 280450 bytes
 docs/interpreter/pig.md                         |  62 ++++++++++++++-----
 notebook/2C57UKYWR/note.json                    |  32 +++++-----
 3 files changed, 63 insertions(+), 31 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/zeppelin/blob/a26bd2d7/docs/assets/themes/zeppelin/img/pig_zeppelin_tutorial.png
----------------------------------------------------------------------
diff --git a/docs/assets/themes/zeppelin/img/pig_zeppelin_tutorial.png b/docs/assets/themes/zeppelin/img/pig_zeppelin_tutorial.png
new file mode 100644
index 0000000..b90b982
Binary files /dev/null and b/docs/assets/themes/zeppelin/img/pig_zeppelin_tutorial.png differ

http://git-wip-us.apache.org/repos/asf/zeppelin/blob/a26bd2d7/docs/interpreter/pig.md
----------------------------------------------------------------------
diff --git a/docs/interpreter/pig.md b/docs/interpreter/pig.md
index ad2e80a..d1f18fa 100644
--- a/docs/interpreter/pig.md
+++ b/docs/interpreter/pig.md
@@ -15,14 +15,16 @@ group: manual
 [Apache Pig](https://pig.apache.org/) is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
 
 ## Supported interpreter type
-  - `%pig.script` (default)
+  - `%pig.script` (default Pig interpreter, so you can use `%pig`)
     
-    All the pig script can run in this type of interpreter, and display type is plain text.
+    `%pig.script` is like the Pig grunt shell. Anything you can run in the Pig grunt shell can be run in the `%pig.script` interpreter. It is used for running Pig scripts where you don't need to visualize the data, which makes it suitable for data munging. 
   
   - `%pig.query`
  
-    Almost the same as `%pig.script`. The only difference is that you don't need to add alias in the last statement. And the display type is table.   
-
+    `%pig.query` differs slightly from `%pig.script`. It is used for exploratory data analysis via Pig Latin, where you can leverage Zeppelin's visualization ability. There are 2 minor differences in the last statement between `%pig.script` and `%pig.query`:
+    - No Pig alias in the last statement in `%pig.query` (read the examples below).
+    - The last statement must be on a single line in `%pig.query`.
+    
 ## Supported runtime mode
   - Local
   - MapReduce
@@ -52,8 +54,8 @@ group: manual
 ### How to configure interpreter
 
 At the Interpreters menu, you have to create a new Pig interpreter. Pig interpreter has below properties by default.
-And you can set any pig properties here which will be passed to pig engine. (like tez.queue.name & mapred.job.queue.name).
-Besides, we use paragraph title as job name if it exists, else use the last line of pig script. So you can use that to find app running in YARN RM UI.
+You can also set any Pig properties here, which will be passed to the Pig engine (like tez.queue.name & mapred.job.queue.name).
+Besides, we use the paragraph title as the job name if it exists, else the last line of the Pig script, so you can use that to find the app running in the YARN RM UI.
 
 <table class="table-configuration">
     <tr>
@@ -95,22 +97,52 @@ Besides, we use paragraph title as job name if it exists, else use the last line
 ```
 %pig
 
-raw_data = load 'dataset/sf_crime/train.csv' using PigStorage(',') as (Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y);
-b = group raw_data all;
-c = foreach b generate COUNT($1);
-dump c;
+bankText = load 'bank.csv' using PigStorage(';');
+bank = foreach bankText generate $0 as age, $1 as job, $2 as marital, $3 as education, $5 as balance; 
+bank = filter bank by age != '"age"';
+bank = foreach bank generate (int)age, REPLACE(job,'"','') as job, REPLACE(marital, '"', '') as marital, (int)(REPLACE(balance, '"', '')) as balance;
+store bank into 'clean_bank.csv' using PigStorage(';'); -- this statement is optional; it just shows that most of the time %pig.script is used for data munging before querying the data.
 ```
 
 ##### pig.query
 
+Get the number of records for each age where age is less than 30.
+
+```
+%pig.query
+ 
+bank_data = filter bank by age < 30;
+b = group bank_data by age;
+foreach b generate group, COUNT($1);
+```
+
+The same as above, but using a dynamic text form so that the user can specify the variable maxAge in a textbox (see the screenshot below). Dynamic form is a very cool feature of Zeppelin; refer to this [link](../manual/dynamicform.html) for details.
+
 ```
 %pig.query
+ 
+bank_data = filter bank by age < ${maxAge=40};
+b = group bank_data by age;
+foreach b generate group, COUNT($1) as count;
+```
+
+Get the number of records for each age for a specific marital type, also using a dynamic form. The user can choose the marital type in the dropdown list (see the screenshot below).
 
-b = foreach raw_data generate Category;
-c = group b by Category;
-foreach c generate group as category, COUNT($1) as count;
+```
+%pig.query
+ 
+bank_data = filter bank by marital=='${marital=single,single|divorced|married}';
+b = group bank_data by age;
+foreach b generate group, COUNT($1) as count;
 ```
 
+The above examples are in the Pig tutorial note in Zeppelin; you can check that for details. Here's the screenshot.
+
+<img class="img-responsive" width="1024px" style="margin:0 auto; padding: 26px;" src="../assets/themes/zeppelin/img/pig_zeppelin_tutorial.png" />
+
+
 Data is shared between `%pig` and `%pig.query`, so that you can do some common work in `%pig`, and do different kinds of query based on the data of `%pig`. 
-Besides, we recommend you to specify alias explicitly so that the visualization can display the column name correctly. Here, we name `COUNT($1)` as `count`, if you don't do this,
-then we will name it using position, here we will use `col_1` to represent `COUNT($1)` if you don't specify alias for it. There's one pig tutorial note in zeppelin for your reference.
+Besides, we recommend that you specify an alias explicitly so that the visualization can display the column name correctly. In examples 2 and 3 of `%pig.query` above, we name `COUNT($1)` as `count`. If you don't do this,
+then we will name it by position. E.g. in the first `%pig.query` example above, we will use `col_1` in the chart to represent `COUNT($1)`.
+
+
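
The positional naming rule described at the end of the pig.md change can be illustrated with a small sketch. This is not Zeppelin's implementation; the function `column_names` is a hypothetical helper, and it only assumes what the doc states: a column without an explicit alias is displayed as `col_<position>`.

```python
def column_names(aliases):
    """Return display names: the alias if one is given, else col_<position>.

    Hypothetical helper for illustration only, not part of Zeppelin.
    """
    return [alias if alias else f"col_{i}" for i, alias in enumerate(aliases)]


# First %pig.query example: `group` has a name, COUNT($1) has no alias.
print(column_names(["group", None]))
# Second and third examples: COUNT($1) is aliased as `count`.
print(column_names(["group", "count"]))
```

The first call yields the `group` / `col_1` header and the second yields `group` / `count`, matching the table headers before and after the alias was added in the note.json change.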

http://git-wip-us.apache.org/repos/asf/zeppelin/blob/a26bd2d7/notebook/2C57UKYWR/note.json
----------------------------------------------------------------------
diff --git a/notebook/2C57UKYWR/note.json b/notebook/2C57UKYWR/note.json
index 21d1231..22afb2a 100644
--- a/notebook/2C57UKYWR/note.json
+++ b/notebook/2C57UKYWR/note.json
@@ -115,7 +115,7 @@
     {
       "text": "%pig\n\nbankText \u003d load \u0027bank.csv\u0027 using PigStorage(\u0027;\u0027);\nbank \u003d foreach bankText generate $0 as age, $1 as job, $2 as marital, $3 as education, $5 as balance; \nbank \u003d filter bank by age !\u003d \u0027\"age\"\u0027;\nbank \u003d foreach bank generate (int)age, REPLACE(job,\u0027\"\u0027,\u0027\u0027) as job, REPLACE(marital, \u0027\"\u0027, \u0027\u0027) as marital, (int)(REPLACE(balance, \u0027\"\u0027, \u0027\u0027)) as balance;\n\n-- The following statement is optional, it depends on whether your needs.\n-- store bank into \u0027clean_bank.csv\u0027 using PigStorage(\u0027;\u0027);\n\n\n",
       "user": "anonymous",
-      "dateUpdated": "Jan 22, 2017 12:49:11 PM",
+      "dateUpdated": "Feb 24, 2017 5:08:08 PM",
       "config": {
         "colWidth": 12.0,
         "editorMode": "ace/mode/pig",
@@ -138,15 +138,15 @@
       "jobName": "paragraph_1483277250237_-466604517",
       "id": "20161228-140640_1560978333",
       "dateCreated": "Jan 1, 2017 9:27:30 PM",
-      "dateStarted": "Jan 22, 2017 12:49:11 PM",
-      "dateFinished": "Jan 22, 2017 12:49:13 PM",
+      "dateStarted": "Feb 24, 2017 5:08:08 PM",
+      "dateFinished": "Feb 24, 2017 5:08:11 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },
     {
       "text": "%pig.query\n\nbank_data \u003d filter bank by age \u003c 30;\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1);\n\n",
       "user": "anonymous",
-      "dateUpdated": "Jan 22, 2017 12:49:16 PM",
+      "dateUpdated": "Feb 24, 2017 5:08:13 PM",
       "config": {
         "colWidth": 4.0,
         "editorMode": "ace/mode/pig",
@@ -183,15 +183,15 @@
       "jobName": "paragraph_1483277250238_-465450270",
       "id": "20161228-140730_1903342877",
       "dateCreated": "Jan 1, 2017 9:27:30 PM",
-      "dateStarted": "Jan 22, 2017 12:49:16 PM",
-      "dateFinished": "Jan 22, 2017 12:49:30 PM",
+      "dateStarted": "Feb 24, 2017 5:08:13 PM",
+      "dateFinished": "Feb 24, 2017 5:08:26 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },
     {
-      "text": "%pig.query\n\nbank_data \u003d filter bank by age \u003c ${maxAge\u003d40};\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1);",
+      "text": "%pig.query\n\nbank_data \u003d filter bank by age \u003c ${maxAge\u003d40};\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1) as count;",
       "user": "anonymous",
-      "dateUpdated": "Jan 22, 2017 12:49:18 PM",
+      "dateUpdated": "Feb 24, 2017 5:08:14 PM",
       "config": {
         "colWidth": 4.0,
         "editorMode": "ace/mode/pig",
@@ -228,7 +228,7 @@
         "msg": [
           {
             "type": "TABLE",
-            "data": "group\tcol_1\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n30\t150\n31\t199\n32\t224\n33\t186\n34\t231\n35\t180\n"
+            "data": "group\tcount\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n30\t150\n31\t199\n32\t224\n33\t186\n34\t231\n35\t180\n"
           }
         ]
       },
@@ -236,15 +236,15 @@
       "jobName": "paragraph_1483277250239_-465835019",
       "id": "20161228-154918_1551591203",
       "dateCreated": "Jan 1, 2017 9:27:30 PM",
-      "dateStarted": "Jan 22, 2017 12:49:18 PM",
-      "dateFinished": "Jan 22, 2017 12:49:32 PM",
+      "dateStarted": "Feb 24, 2017 5:08:14 PM",
+      "dateFinished": "Feb 24, 2017 5:08:29 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },
     {
-      "text": "%pig.query\n\nbank_data \u003d filter bank by marital\u003d\u003d\u0027${marital\u003dsingle,single|divorced|married}\u0027;\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1) as c;\n\n\n",
+      "text": "%pig.query\n\nbank_data \u003d filter bank by marital\u003d\u003d\u0027${marital\u003dsingle,single|divorced|married}\u0027;\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1) as count;\n\n\n",
       "user": "anonymous",
-      "dateUpdated": "Jan 22, 2017 12:49:20 PM",
+      "dateUpdated": "Feb 24, 2017 5:08:15 PM",
       "config": {
         "colWidth": 4.0,
         "editorMode": "ace/mode/pig",
@@ -292,7 +292,7 @@
         "msg": [
           {
             "type": "TABLE",
-            "data": "group\tc\n23\t3\n24\t11\n25\t11\n26\t18\n27\t26\n28\t23\n29\t37\n30\t56\n31\t104\n32\t105\n33\t103\n34\t142\n35\t109\n36\t117\n37\t100\n38\t99\n39\t88\n40\t105\n41\t97\n42\t91\n43\t79\n44\t68\n45\t76\n46\t82\n47\t78\n48\t91\n49\t87\n50\t74\n51\t63\n52\t66\n53\t75\n54\t56\n55\t68\n56\t50\n57\t78\n58\t67\n59\t56\n60\t36\n61\t15\n62\t5\n63\t7\n64\t6\n65\t4\n66\t7\n67\t5\n68\t1\n69\t5\n70\t5\n71\t5\n72\t4\n73\t6\n74\t2\n75\t3\n76\t1\n77\t5\n78\t2\n79\t3\n80\t6\n81\t1\n83\t2\n86\t1\n87\t1\n"
+            "data": "group\tcount\n23\t3\n24\t11\n25\t11\n26\t18\n27\t26\n28\t23\n29\t37\n30\t56\n31\t104\n32\t105\n33\t103\n34\t142\n35\t109\n36\t117\n37\t100\n38\t99\n39\t88\n40\t105\n41\t97\n42\t91\n43\t79\n44\t68\n45\t76\n46\t82\n47\t78\n48\t91\n49\t87\n50\t74\n51\t63\n52\t66\n53\t75\n54\t56\n55\t68\n56\t50\n57\t78\n58\t67\n59\t56\n60\t36\n61\t15\n62\t5\n63\t7\n64\t6\n65\t4\n66\t7\n67\t5\n68\t1\n69\t5\n70\t5\n71\t5\n72\t4\n73\t6\n74\t2\n75\t3\n76\t1\n77\t5\n78\t2\n79\t3\n80\t6\n81\t1\n83\t2\n86\t1\n87\t1\n"
           }
         ]
       },
@@ -300,8 +300,8 @@
       "jobName": "paragraph_1483277250240_-480070728",
       "id": "20161228-142259_575675591",
       "dateCreated": "Jan 1, 2017 9:27:30 PM",
-      "dateStarted": "Jan 22, 2017 12:49:30 PM",
-      "dateFinished": "Jan 22, 2017 12:49:34 PM",
+      "dateStarted": "Feb 24, 2017 5:08:27 PM",
+      "dateFinished": "Feb 24, 2017 5:08:31 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },