Posted to dev@pig.apache.org by "Woody Anderson (JIRA)" <ji...@apache.org> on 2010/02/04 05:26:28 UTC

[jira] Commented: (PIG-928) UDFs in scripting languages

    [ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829432#action_12829432 ] 

Woody Anderson commented on PIG-928:
------------------------------------

slight error in the js_wc.js script:
change line 9 to:
X = foreach a GENERATE spig_split($0);
and, if you want schema info in the JS impl, change 'bag' to 'b:{tt:(t:chararray)}' on line 4.

setenv PIG_HEAPSIZE 2048
time pig -x local tokenize.pig
  41.724u 2.046s 0:30.52 143.3%	0+0k 0+16io 8pf+0w
time pig -x local js_wc.pig
  72.079u 2.905s 0:54.50 137.5%	0+0k 0+46io 14pf+0w
time pig -x local pjy_wc.pig
  41.588u 2.155s 0:33.58 130.2%	0+0k 0+6io 8pf+0w

so the testing indicates that with this implementation the jython is fairly on par with the java TOKENIZE impl (33.58s vs 30.52s elapsed), and js is just shy of twice as slow (54.50s elapsed).
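to make the comparison concrete, here is a small sketch that turns the `time` lines above into ratios against the java TOKENIZE run. it assumes the csh-style layout (user seconds, system seconds, elapsed m:ss); the dictionary and its labels are just illustrative names, not anything from the attached code.

```python
# Timings copied from the `time` output above: (user s, system s, elapsed s).
# The dictionary keys are illustrative labels only.
runs = {
    "tokenize (java)": (41.724, 2.046, 30.52),
    "js_wc (js)": (72.079, 2.905, 54.50),
    "pjy_wc (jython)": (41.588, 2.155, 33.58),
}

base_user, base_sys, base_elapsed = runs["tokenize (java)"]
base_cpu = base_user + base_sys

ratios = {}
for name, (user, system, elapsed) in runs.items():
    # Ratio > 1 means slower than the java baseline.
    ratios[name] = (elapsed / base_elapsed, (user + system) / base_cpu)

for name, (elapsed_ratio, cpu_ratio) in ratios.items():
    print("%-16s elapsed x%.2f  cpu x%.2f" % (name, elapsed_ratio, cpu_ratio))
```

by elapsed time js comes out at roughly 1.8x the java run and jython at roughly 1.1x; by cpu time (user + system) jython is essentially level with java.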

there are a lot of reasons that the performance of this implementation is startlingly better than the previous numbers, mostly to do with caching the functions, and jython 2.5.1 perhaps being better than whatever python variant was tried above.
this impl also adheres to the schema system for output data, which does cost some cpu, but is generally not too bad.

the scripter converter does not have a js handler, but it does convert inlined jython code (anything between @@ jython @@ and subsequent @@)
for example (taken from pjy_wc.pjy):
@@ jython @@
def split(a):
    """ @return b:{tt:(t:chararray)} """
    return a.split()
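
a rough sketch of what a pre-processor could do with such a block, for discussion only. this is not the attached scripter code: the regexes, the function names, and the sample script are all assumptions made for illustration. it pulls out the text between `@@ jython @@` and the next `@@`, then reads the output schema from the `@return` docstring.

```python
import re

# Assumed marker syntax, per the description above: a block opens with
# "@@ jython @@" and runs until the next "@@" (or the end of the input).
BLOCK_RE = re.compile(r"@@\s*jython\s*@@\n(.*?)(?:\n@@|\Z)", re.DOTALL)
# Assumed convention: the schema sits in a one-line '@return <schema>' docstring.
RETURN_RE = re.compile(r'"""\s*@return\s+(.*?)\s*"""')
DEF_RE = re.compile(r"\s*def\s+(\w+)\s*\(")


def extract_jython_blocks(script):
    """Return the source text of every inlined jython block in `script`."""
    return [m.group(1) for m in BLOCK_RE.finditer(script)]


def return_schemas(block):
    """Map each function defined in `block` to its declared @return schema."""
    schemas = {}
    current = None
    for line in block.splitlines():
        d = DEF_RE.match(line)
        if d:
            current = d.group(1)
            continue
        r = RETURN_RE.search(line)
        if r and current:
            schemas[current] = r.group(1)
    return schemas


# Illustrative input only; the surrounding pig statements are made up.
script = '''a = load 'input.txt';
@@ jython @@
def split(a):
    """ @return b:{tt:(t:chararray)} """
    return a.split()
@@
X = foreach a GENERATE jython.split($0);
'''

blocks = extract_jython_blocks(script)
schemas = return_schemas(blocks[0])
print(schemas)
```

a real handler would presumably also need multi-line docstrings and nested definitions; this only covers the single-line form shown in the example.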


anyway, i'd like to discuss these approaches moving into pig with more out-of-the-box support.
package: org/apache/pig/scripting is meant to be the harness that i'd like to see as part of pig (or something very like that package)
packages: org/apache/pig/scripting/js, org/apache/pig/scripting/jython are implementations that i think are pretty useful, but could be improved. distributing these with pig is certainly debatable, esp. as jython requires jython.jar to function, and the js implementation is really just a proof of concept for a second language impl (i didn't even make a FilterFunc yet)

the scripter functionality is something i'd like to see supported by the pig parser as much as possible, but i don't have a great idea of how to do that yet. perhaps a new statement that lets a user register a language pack jar could also hook it into the parser to handle file references etc., as manually handling the dependency graph is a major pita. the creation of a Code jar and the invocation of javac (the latter, in particular, may not be needed) are pretty arduous, so it'd be nice for a general system to make this work.
i tried to write the script so that you could add new language handlers to it, and it would process functions of the form {lang}.{function}(args) and convert them appropriately. but i only implemented jython, so the language separation may not be entirely complete, e.g. a language with a very different structure may require some other modifications to the script.
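
as a sketch of the {lang}.{function}(args) idea (again an illustration, not the attached script): a regex can pick out dotted calls, and only those whose prefix names a registered handler are treated as script calls. the `HANDLERS` set, the function name, and the sample statement are hypothetical.

```python
import re

# Hypothetical registry of language handlers; only jython was implemented.
HANDLERS = {"jython"}

# Matches "<prefix>.<name>(" so that the prefix can be checked against HANDLERS.
CALL_RE = re.compile(r"\b(\w+)\.(\w+)\s*\(")


def find_script_calls(statement):
    """Return (lang, function) pairs for calls whose prefix is a known handler."""
    return [
        (lang, func)
        for lang, func in CALL_RE.findall(statement)
        if lang in HANDLERS
    ]


# Illustrative statement: the second call has an unregistered prefix and is skipped.
calls = find_script_calls("X = foreach a GENERATE jython.split($0), other.foo($1);")
print(calls)
```

adding a language would then mean adding its name to the registry plus whatever rewriting that language needs, which is where a very differently structured language might not fit.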

i want to close by saying that the initial inspiration for this work and the idea of the pre-process script came from a blog post about a project called baconsnake (http://arnab.org/blog/baconsnake), by Arnab Nandi. That post put me on the track of using jython from java code for the first time, and gave me the idea of making the actual script-injecting language tolerable. many thanks.

> UDFs in scripting languages
> ---------------------------
>
>                 Key: PIG-928
>                 URL: https://issues.apache.org/jira/browse/PIG-928
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>         Attachments: package.zip, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python, ruby, etc.  This frees users from needing to compile Java, generate a jar, etc.  It also opens Pig to programmers who prefer scripting languages over Java.
