Posted to user@pig.apache.org by Olga Natkovich <ol...@yahoo-inc.com> on 2008/06/21 00:35:06 UTC

pig tutorial is ready

Hi,
 
If you are new to Pig, the best place to start is by trying out our
brand new tutorial: http://wiki.apache.org/pig/PigTutorial.
 
We hope that you find it useful and informative!
 
As always, your feedback is welcome!
 
Olga

Re: pig tutorial is ready

Posted by Tanton Gibbs <ta...@gmail.com>.
First, I really like the formatting and the explanation behind the
tutorial.  I thought it was well written and gave a lot of useful
information.

Now, for the potential improvements:

1) I'm a bit annoyed at how many UDFs I need to "write" to do the work
done by example 1.  This is something of a turn-off.

UDFs
1. NonURLDetector - determine if query field is empty or a URL (by the
way, shouldn't this be a URLDetector if it decides that it is a URL?
The negative of a negative may get confusing).
2. toLower - lowercase
3. extractHour - pull out hour from datetime stamp
4. NGramGenerator - generate ngrams
5. ScoreGenerator - generate scores

It seems to me that the first three are just simple regexes or
substitutions.  Could there not just be a replace or match function
that takes the place of all of these "custom" UDFs?  I'm not saying it
should be built into the language, just a UDF that does match or
replace.  The last two are the actual "logic" and are what someone
would expect to have to write.
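
To make the point concrete, here is a rough sketch in plain Java of how
the first three UDFs might collapse into one generic match and one
generic replace.  This is not the tutorial's code and it does not use
Pig's UDF API; the class name RegexSketch, the URL pattern, and the
YYMMDDHHMMSS timestamp layout are all assumptions made for
illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class: not part of Pig, PiggyBank, or the tutorial.
public class RegexSketch {

    // Generic "match": true only if the whole input matches the regex.
    public static boolean match(String input, String regex) {
        return input != null && Pattern.matches(regex, input);
    }

    // Generic "replace": regex substitution, with $1-style group references.
    public static String replace(String input, String regex, String replacement) {
        return input == null ? null : input.replaceAll(regex, replacement);
    }

    // Stand-in for NonURLDetector, phrased positively.
    // Assumption: a URL is anything starting with http:// or https://.
    public static boolean isEmptyOrUrl(String query) {
        if (query == null || query.trim().isEmpty()) {
            return true;
        }
        return match(query.trim(), "(?i)https?://\\S+");
    }

    // Stand-in for extractHour.
    // Assumption: the timestamp is 12 digits laid out as YYMMDDHHMMSS.
    public static String extractHour(String timestamp) {
        return replace(timestamp, "^\\d{6}(\\d{2})\\d{4}$", "$1");
    }

    // Stand-in for toLower; a plain library call, no regex needed.
    public static String toLower(String input) {
        return input == null ? null : input.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(isEmptyOrUrl("http://www.apache.org"));
        System.out.println(isEmptyOrUrl("pig tutorial"));
        System.out.println(extractHour("970916072134"));
        System.out.println(toLower("Pig Tutorial"));
    }
}
```

With something like match and replace available as ordinary UDFs, only
NGramGenerator and ScoreGenerator would remain as code a tutorial
reader has to think about.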

2) The source code for the UDFs doesn't come in the .tar.gz file.  Of
course, users who have the svn repository checked out can get to
it, but it would be nice to include it in the original
download...perhaps you could just put it in the tutorial.jar?

3) The web page never shows the "full" Pig scripts, only the
decomposed ones that you have commented on.  It might be nice to see,
after the description, the final script.

4) On the web page, I don't see any examples.  Of course, I can
download it and run it to see what it does, but it would be nice to
have a few records done via "illustrate" on the web page.  That way I
could see how the records change after each pig statement.  The user
comments are nice, but nothing helps like an example.

If you think any of these ideas are worthwhile, I'd be happy to do
them, just let me know.

Finally, the size of a language's standard library is a determining
factor in its success.  PiggyBank looks to be a good start, but I
think you're going to need to put some thought into what UDFs are
packaged as "standard" with Pig.  These functions will need to be of a
higher quality than those allowed in the PiggyBank.  Things like
match, replace, the math functions, etc. would make good candidates.
Of course there are many, many more.  I imagine, though, that there
could be a promotion path from the PiggyBank into the standard
library.

Thanks!
Tanton
