Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2011/08/26 20:37:45 UTC

[Pig Wiki] Update of "FAQ" by daijy

The "FAQ" page has been changed by daijy:
http://wiki.apache.org/pig/FAQ?action=diff&rev1=7&rev2=8

- '''Q: How can I load data using Unicode control characters as delimiters?''' 
+ This page has been moved to [[https://cwiki.apache.org/confluence/display/PIG/FAQ|Confluence]]
  
- The first parameter to !PigStorage is the dataset name; the second is a regular expression that describes the delimiter. We used `String.split(regex, -1)` to extract fields from lines. See [[http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html|java.util.regex.Pattern]] for more information on using special characters in a regex. 
- 
- If you are loading a file which contains Ctrl+A as separators, you can specify this to !PigStorage using the Unicode notation.
- 
- {{{
- A = LOAD 'input.dat' USING PigStorage('\u0001') AS (x,y,z);
- }}}
- 
- '''Q: How do I make my jobs run on multiple machines?'''
- 
- Use the PARALLEL clause:
- 
- {{{
- C = JOIN A by url, B by url PARALLEL 50;
- }}}
- 
- '''Q: How do I make my Pig jobs run on a specified number of reducers?'''
- 
- You can achieve this with the PARALLEL clause. For example: 
- {{{
- C = JOIN A by url, B by url PARALLEL 50;
- }}}
- 
- Besides the PARALLEL clause, you can also use the "set default_parallel" statement in a Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism. If none of these values is set, Pig will use only 1 reducer. (In Pig 0.8, the default number of reducers changed from 1 to a number calculated by a simple heuristic, as a safeguard.)
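- 
- A minimal sketch of the script-level setting (the input paths, field names, and the value 20 are illustrative assumptions, not taken from the original page):
- 
- {{{
- -- Illustrative default for reduce-side operators that have no PARALLEL clause of their own
- set default_parallel 20;
- A = LOAD 'a.dat' AS (url, rank);
- B = LOAD 'b.dat' AS (url, visits);
- -- No PARALLEL clause here, so the default set above applies
- C = JOIN A BY url, B BY url;
- -- An explicit PARALLEL clause takes precedence for this operator
- D = GROUP A BY url PARALLEL 50;
- }}}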
- 
- '''Q: Can I do a numerical comparison while filtering?'''
- 
- Yes, you can choose between numerical and string comparison. For numerical comparisons use the operators ==, !=, <, etc., and for string comparisons use eq, neq, etc. See the format of [[#CondS|Conditions]].
- 
- 
- 
- 
- '''Q: Does Pig support regular expressions?'''
- 
- Pig does support regular expression matching via the `matches` keyword. It uses [[http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html|java.util.regex]] matches which means your pattern has to match the entire string (e.g. if your string is `"hi fred"` and you want to find `"fred"` you have to give a pattern of `".*fred"` not `"fred"`).
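- 
- A minimal sketch of the whole-string rule in a filter (the file name and field name are illustrative assumptions):
- 
- {{{
- A = LOAD 'names.txt' AS (name:chararray);
- -- Keeps a record such as "hi fred": the pattern covers the entire string
- B = FILTER A BY name matches '.*fred';
- -- Keeps only records whose whole value is exactly "fred"
- C = FILTER A BY name matches 'fred';
- }}}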
- 
- '''Q: How do I prevent failure if some records don't have the needed number of columns?'''
- 
- You can filter away those records by including the following in your Pig program:
- 
- {{{
- A = LOAD 'foo' USING PigStorage('\t');
- B = FILTER A BY ARITY(*) >= 5;
- .....
- }}}
- 
- This code would drop all records that have fewer than five (5) columns.
- 
- '''Q: Is there any difference between `==` and `eq` for numeric comparisons?'''
- 
- There is no difference when using integers. However, `11.0` and `11` will be equal with `==` but not with `eq`. 
- 
- '''Q: Is it possible to use PIG with a regular Hadoop cluster (not HOD)?'''
- 
- Yes. Set the `hod.server` property to the empty string:
- 
- {{{
- hod.server=""
- }}}
- 
- '''Q: Is there an easy way for me to figure out how many rows exist in a dataset from its alias?'''
- 
- You can run the following set of commands, which are equivalent to `SELECT COUNT(*)` in SQL:
- 
- {{{
- a = LOAD 'mytestfile.txt';
- b = GROUP a ALL;
- c = FOREACH b GENERATE COUNT(a.$0);
- }}}
- 
- 
- '''Q: Does Pig allow grouping on expressions?'''
- 
- Yes, Pig allows grouping on expressions. For example:
- 
- {{{
- grunt> a = LOAD 'mytestfile.txt' AS (x,y,z);
- grunt> DUMP a;
- (1,2,3)
- (4,2,1)
- (4,3,4)
- (4,3,4)
- (7,2,5)
- (8,4,3)
- 
- b = GROUP a BY (x+y);
- (3.0,{(1,2,3)})
- (6.0,{(4,2,1)})
- (7.0,{(4,3,4),(4,3,4)})
- (9.0,{(7,2,5)})
- (12.0,{(8,4,3)})
- }}}
- 
- If the grouping is based on constants, the result is the same as GROUP ALL except the group-id is replaced by the constant.
- {{{
- grunt> b = GROUP a BY 4;
- (4,{(1,2,3),(4,2,1),(4,3,4),(4,3,4),(7,2,5),(8,4,3)})
- }}}
- '''Q: Is there a way to check if a map is empty?'''
- 
- In Pig 2.0 you can test the existence of values in a map using the null construct: `m#'key' is not null`
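- 
- A minimal sketch of that test inside a filter, assuming a hypothetical input whose single field `m` is declared as a map (the file name and key are illustrative):
- 
- {{{
- A = LOAD 'data.txt' AS (m:map[]);
- -- Keep only records whose map holds a non-null value under 'key'
- B = FILTER A BY m#'key' is not null;
- }}}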
- 
- '''Q: How can I specify the number of nodes Pig allocates?'''
- 
- {{{
- > pig -Dhod.param='-m 3' my_script.pig
- }}}
- 
- Three (3) nodes is the minimum.
- 
- '''Q: How can I ask Pig to use an already allocated HOD cluster?''' 
- 
- Suppose you allocated a cluster:
- {{{
- $ mkdir -p ~/hod-clusters/test
- $ hod allocate -d ~/hod-clusters/test -n 5
- $ setenv CLUSTERDIR ~/hod-clusters/test
- }}}
-  
- You can then use the following command, using either -Dhod.server='' or -Dhod.server="":
- {{{
- $ pig -cp $CLUSTERDIR -Dhod.server='' myscript.pig 
- }}}
-  
-