You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Jeremy Karn <jk...@mortardata.com> on 2013/08/23 22:19:52 UTC

Review Request 13781: Changes to add support for streaming_python udfs.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/
-----------------------------------------------------------

Review request for pig.


Repository: pig-git


Description
-------

Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)


Diffs
-----

  build.xml b20eb3d 
  src/org/apache/pig/PigToStream.java 7cc2950 
  src/org/apache/pig/StreamToPig.java ff24b27 
  src/org/apache/pig/builtin/PigStreaming.java 5467693 
  src/org/apache/pig/builtin/PigToStreamUDF.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamUDFToPig.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingDelimiters.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDF.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDFException.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDFOutputSchemaException.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/DefaultInputHandler.java 301bea3 
  src/org/apache/pig/impl/streaming/DefaultOutputHandler.java 1b46e7d 
  src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
  src/org/apache/pig/impl/streaming/InputHandler.java 690d94e 
  src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
  src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
  src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
  src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
  src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
  src/org/apache/pig/scripting/ScriptingIllustrateOutputCapturer.java PRE-CREATION 
  src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
  src/python/streaming/controller.py PRE-CREATION 
  src/python/streaming/pig_util.py PRE-CREATION 
  test/org/apache/pig/builtin/TestPigToStreamUDF.java PRE-CREATION 
  test/org/apache/pig/builtin/TestStreamUDFToPig.java PRE-CREATION 
  test/org/apache/pig/builtin/TestStreamingUDF.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
  test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
  test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
  test/org/apache/pig/test/TestStreaming.java 1eac5d2 
  test/python/streaming/test_controller.py PRE-CREATION 
  test/unit-tests d52ad9d 

Diff: https://reviews.apache.org/r/13781/diff/


Testing
-------


Thanks,

Jeremy Karn


Re: Review Request 13781: Changes to add support for streaming_python udfs.

Posted by Jeremy Karn <jk...@mortardata.com>.

> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamUDFToPig.java, line 74
> > <https://reviews.apache.org/r/13781/diff/1/?file=344447#file344447line74>
> >
> >     decoding string is actually quite costly as java strings are utf-16 and creating a string always copies the content of the byte array.
> >     Can't you just use the binary format of the type?

I'm not sure how to use the binary format without significantly rewriting the serialization logic on the python side to serialize the data in a byte format (using something like struct.pack).   Is that what you were thinking of, or were you thinking of something else?

The last profiling we did showed that the bottleneck is typically on the python side deserializing the input from Pig.  I think any performance enhancements at this point should be done in a new jira with solid profiling over a number of test cases.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamUDFToPig.java, line 98
> > <https://reviews.apache.org/r/13781/diff/1/?file=344447#file344447line98>
> >
> >     would StreamTokenizer help ?

I took a look, and I don't think its worth re-writing this logic using StreamTokenizer.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamUDFToPig.java, line 109
> > <https://reviews.apache.org/r/13781/diff/1/?file=344447#file344447line109>
> >
> >     if you know the size in advance you can pass it here and avoid resizing

Done.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamUDFToPig.java, line 123
> > <https://reviews.apache.org/r/13781/diff/1/?file=344447#file344447line123>
> >
> >     you probably want to do getInstance() once

Done.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamingDelimiters.java, line 74
> > <https://reviews.apache.org/r/13781/diff/1/?file=344448#file344448line74>
> >
> >     why do we need a hashmap to access those statically defined values?

Done.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamingUDF.java, line 51
> > <https://reviews.apache.org/r/13781/diff/1/?file=344449#file344449line51>
> >
> >     enum?

An enum seems a little more verbose without much benefit in this case.  I don't feel strongly about it though.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamingUDF.java, line 382
> > <https://reviews.apache.org/r/13781/diff/1/?file=344449#file344449line382>
> >
> >     if (info) log.info

Fixed.  This log statement was removed entirely.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/builtin/StreamingUDFException.java, line 23
> > <https://reviews.apache.org/r/13781/diff/1/?file=344450#file344450line23>
> >
> >     if you don't make the parent see the cause printStackTrace won't display the cause's stackTrace

Done.


> On Sept. 4, 2013, 11:36 p.m., Julien Le Dem wrote:
> > src/org/apache/pig/impl/util/StorageUtil.java, line 176
> > <https://reviews.apache.org/r/13781/diff/1/?file=344460#file344460line176>
> >
> >     why is that?

No good reason, so I added BigDecimal and BigInteger support.


- Jeremy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/#review25913
-----------------------------------------------------------


On Sept. 5, 2013, 8:23 p.m., Jeremy Karn wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13781/
> -----------------------------------------------------------
> 
> (Updated Sept. 5, 2013, 8:23 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)
> 
> 
> Diffs
> -----
> 
>   build.xml 7e22192 
>   src/org/apache/pig/PigToStream.java 7cc2950 
>   src/org/apache/pig/StreamToPig.java ff24b27 
>   src/org/apache/pig/builtin/PigStreaming.java 5467693 
>   src/org/apache/pig/builtin/PigToStreamUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamUDFToPig.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingDelimiters.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFException.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFOutputSchemaException.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/DefaultInputHandler.java 301bea3 
>   src/org/apache/pig/impl/streaming/DefaultOutputHandler.java 1b46e7d 
>   src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
>   src/org/apache/pig/impl/streaming/InputHandler.java 690d94e 
>   src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
>   src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
>   src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
>   src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
>   src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
>   src/org/apache/pig/scripting/ScriptingIllustrateOutputCapturer.java PRE-CREATION 
>   src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
>   src/python/streaming/controller.py PRE-CREATION 
>   src/python/streaming/pig_util.py PRE-CREATION 
>   test/org/apache/pig/builtin/TestPigToStreamUDF.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamUDFToPig.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamingUDF.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
>   test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
>   test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
>   test/org/apache/pig/test/TestStreaming.java 1eac5d2 
>   test/python/streaming/test_controller.py PRE-CREATION 
>   test/unit-tests d52ad9d 
> 
> Diff: https://reviews.apache.org/r/13781/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jeremy Karn
> 
>


Re: Review Request 13781: Changes to add support for streaming_python udfs.

Posted by Julien Le Dem <ju...@ledem.net>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/#review25913
-----------------------------------------------------------



src/org/apache/pig/builtin/StreamUDFToPig.java
<https://reviews.apache.org/r/13781/#comment50560>

    decoding string is actually quite costly as java strings are utf-16 and creating a string always copies the content of the byte array.
    Can't you just use the binary format of the type?



src/org/apache/pig/builtin/StreamUDFToPig.java
<https://reviews.apache.org/r/13781/#comment50566>

    would StreamTokenizer help ?



src/org/apache/pig/builtin/StreamUDFToPig.java
<https://reviews.apache.org/r/13781/#comment50562>

    if you know the size in advance you can pass it here and avoid resizing



src/org/apache/pig/builtin/StreamUDFToPig.java
<https://reviews.apache.org/r/13781/#comment50561>

    you probably want to do getInstance() once



src/org/apache/pig/builtin/StreamingDelimiters.java
<https://reviews.apache.org/r/13781/#comment50565>

    why do we need a hashmap to access those statically defined values?



src/org/apache/pig/builtin/StreamingUDF.java
<https://reviews.apache.org/r/13781/#comment50568>

    enum?



src/org/apache/pig/builtin/StreamingUDF.java
<https://reviews.apache.org/r/13781/#comment50569>

    chain exceptions



src/org/apache/pig/builtin/StreamingUDF.java
<https://reviews.apache.org/r/13781/#comment50571>

    if (info) log.info



src/org/apache/pig/builtin/StreamingUDFException.java
<https://reviews.apache.org/r/13781/#comment50574>

    if you don't make the parent see the cause printStackTrace won't display the cause's stackTrace



src/org/apache/pig/impl/util/StorageUtil.java
<https://reviews.apache.org/r/13781/#comment50575>

    why is that?


- Julien Le Dem


On Aug. 23, 2013, 8:19 p.m., Jeremy Karn wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13781/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 8:19 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)
> 
> 
> Diffs
> -----
> 
>   build.xml b20eb3d 
>   src/org/apache/pig/PigToStream.java 7cc2950 
>   src/org/apache/pig/StreamToPig.java ff24b27 
>   src/org/apache/pig/builtin/PigStreaming.java 5467693 
>   src/org/apache/pig/builtin/PigToStreamUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamUDFToPig.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingDelimiters.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFException.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFOutputSchemaException.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/DefaultInputHandler.java 301bea3 
>   src/org/apache/pig/impl/streaming/DefaultOutputHandler.java 1b46e7d 
>   src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
>   src/org/apache/pig/impl/streaming/InputHandler.java 690d94e 
>   src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
>   src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
>   src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
>   src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
>   src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
>   src/org/apache/pig/scripting/ScriptingIllustrateOutputCapturer.java PRE-CREATION 
>   src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
>   src/python/streaming/controller.py PRE-CREATION 
>   src/python/streaming/pig_util.py PRE-CREATION 
>   test/org/apache/pig/builtin/TestPigToStreamUDF.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamUDFToPig.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamingUDF.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
>   test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
>   test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
>   test/org/apache/pig/test/TestStreaming.java 1eac5d2 
>   test/python/streaming/test_controller.py PRE-CREATION 
>   test/unit-tests d52ad9d 
> 
> Diff: https://reviews.apache.org/r/13781/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jeremy Karn
> 
>


Re: Review Request 13781: Changes to add support for streaming_python udfs.

Posted by Jeremy Karn <jk...@mortardata.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/
-----------------------------------------------------------

(Updated Sept. 7, 2013, 3:39 p.m.)


Review request for pig.


Repository: pig-git


Description
-------

Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)


Diffs (updated)
-----

  build.xml 7e22192 
  src/org/apache/pig/impl/builtin/StreamingUDF.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
  src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
  src/org/apache/pig/impl/streaming/PigToStreamUDF.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamUDFToPig.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingDelimiters.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUDFException.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUDFInputHandler.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUDFOutputSchemaException.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
  src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
  src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
  src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
  src/org/apache/pig/scripting/ScriptingOutputCapturer.java PRE-CREATION 
  src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
  src/python/streaming/controller.py PRE-CREATION 
  src/python/streaming/pig_util.py PRE-CREATION 
  test/org/apache/pig/impl/builtin/TestStreamingUDF.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
  test/org/apache/pig/impl/streaming/TestPigToStreamUDF.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestStreamUDFToPig.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
  test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
  test/python/streaming/test_controller.py PRE-CREATION 
  test/unit-tests d52ad9d 

Diff: https://reviews.apache.org/r/13781/diff/


Testing
-------


Thanks,

Jeremy Karn


Re: Review Request 13781: Changes to add support for streaming_python udfs.

Posted by Daniel Dai <da...@gmail.com>.

> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > src/org/apache/pig/builtin/StreamingUDF.java, line 169
> > <https://reviews.apache.org/r/13781/diff/2/?file=348276#file348276line169>
> >
> >     Are those files only for illustrate? Why not just use a random tmp file? Why do we distinguish exectype in illustrate? It only run locally.
> 
> Jeremy Karn wrote:
>     I renamed the class to remove Illustrate from the name.  This is actually used both for illustrate (where we capture the output only from the last run) and in real runs (where we write the output to the user logs).  By putting it in the user logs you can get the logs from the task tracker web interface.  We could use temp files but I like the more human readable names.

Does the log file location hack also work in Hadoop2?


> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > test/unit-tests, line 73
> > <https://reviews.apache.org/r/13781/diff/2/?file=348302#file348302line73>
> >
> >     This is not true unit test, should remove from the list
> 
> Jeremy Karn wrote:
>     Can it go in commit-tests?

The goal for commit-tests to complete within 10 min. So we don't tend to include end-to-end unit test cases.


- Daniel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/#review25964
-----------------------------------------------------------


On Sept. 7, 2013, 3:39 p.m., Jeremy Karn wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13781/
> -----------------------------------------------------------
> 
> (Updated Sept. 7, 2013, 3:39 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)
> 
> 
> Diffs
> -----
> 
>   build.xml 7e22192 
>   src/org/apache/pig/impl/builtin/StreamingUDF.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
>   src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
>   src/org/apache/pig/impl/streaming/PigToStreamUDF.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamUDFToPig.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingDelimiters.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUDFException.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUDFInputHandler.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUDFOutputSchemaException.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
>   src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
>   src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
>   src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
>   src/org/apache/pig/scripting/ScriptingOutputCapturer.java PRE-CREATION 
>   src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
>   src/python/streaming/controller.py PRE-CREATION 
>   src/python/streaming/pig_util.py PRE-CREATION 
>   test/org/apache/pig/impl/builtin/TestStreamingUDF.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
>   test/org/apache/pig/impl/streaming/TestPigToStreamUDF.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestStreamUDFToPig.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
>   test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
>   test/python/streaming/test_controller.py PRE-CREATION 
>   test/unit-tests d52ad9d 
> 
> Diff: https://reviews.apache.org/r/13781/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jeremy Karn
> 
>


Re: Review Request 13781: Changes to add support for streaming_python udfs.

Posted by Jeremy Karn <jk...@mortardata.com>.

> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > src/org/apache/pig/builtin/StreamUDFToPig.java, line 47
> > <https://reviews.apache.org/r/13781/diff/2/?file=348274#file348274line47>
> >
> >     This might be affected by PIG-3255

Looks like it would be easy to make this patch work with PIG-3255.  If we could get that merged shortly I'd update this patch with the necessary changes.


> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > src/org/apache/pig/builtin/StreamUDFToPig.java, line 109
> > <https://reviews.apache.org/r/13781/diff/2/?file=348274#file348274line109>
> >
> >     How do we handle delimit escaping? Seems we don't handle it right now?

We don't handle it now.  We use a 3 character delimiter which avoids collisions in practice.  I would actually like to switch the serialization to a 'real' serialization library but we did a little bit of prototyping a few months ago and there wasn't anything out there that worked really well and had significant performance improvements.  Avro was the most promising but on the Python side it was missing a number of features we needed.


> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > src/org/apache/pig/builtin/StreamingUDF.java, line 169
> > <https://reviews.apache.org/r/13781/diff/2/?file=348276#file348276line169>
> >
> >     Are those files only for illustrate? Why not just use a random tmp file? Why do we distinguish exectype in illustrate? It only run locally.

I renamed the class to remove Illustrate from the name.  This is actually used both for illustrate (where we capture the output only from the last run) and in real runs (where we write the output to the user logs).  By putting it in the user logs you can get the logs from the task tracker web interface.  We could use temp files but I like the more human readable names.


> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > src/org/apache/pig/builtin/StreamingUDF.java, line 198
> > <https://reviews.apache.org/r/13781/diff/2/?file=348276#file348276line198>
> >
> >     Any reason not to create a StreamingUDFInputHandler?

I didn't really think it was necessary but I guess its more consistent.


> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > src/org/apache/pig/builtin/StreamingUDF.java, line 366
> > <https://reviews.apache.org/r/13781/diff/2/?file=348276#file348276line366>
> >
> >     Seems flush() is more proper to put inside inputHandler

I'm not sure if it would be unnecessary for some clients of InputHandler.  For the StreamingUDF we need to make sure we always flush or else we can get blocked on the reading process not getting the full input (and then never writing the output) but I'm not sure if its always necessary.


> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > src/org/apache/pig/impl/streaming/InputHandler.java, line 66
> > <https://reviews.apache.org/r/13781/diff/2/?file=348282#file348282line66>
> >
> >     Why we need this change?

Probably just left over from debugging where I wanted to know the value of the byte[].


> On Sept. 7, 2013, 12:04 a.m., Daniel Dai wrote:
> > test/unit-tests, line 73
> > <https://reviews.apache.org/r/13781/diff/2/?file=348302#file348302line73>
> >
> >     This is not true unit test, should remove from the list

Can it go in commit-tests?


- Jeremy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/#review25964
-----------------------------------------------------------


On Sept. 5, 2013, 8:23 p.m., Jeremy Karn wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13781/
> -----------------------------------------------------------
> 
> (Updated Sept. 5, 2013, 8:23 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)
> 
> 
> Diffs
> -----
> 
>   build.xml 7e22192 
>   src/org/apache/pig/PigToStream.java 7cc2950 
>   src/org/apache/pig/StreamToPig.java ff24b27 
>   src/org/apache/pig/builtin/PigStreaming.java 5467693 
>   src/org/apache/pig/builtin/PigToStreamUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamUDFToPig.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingDelimiters.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFException.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFOutputSchemaException.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/DefaultInputHandler.java 301bea3 
>   src/org/apache/pig/impl/streaming/DefaultOutputHandler.java 1b46e7d 
>   src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
>   src/org/apache/pig/impl/streaming/InputHandler.java 690d94e 
>   src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
>   src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
>   src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
>   src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
>   src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
>   src/org/apache/pig/scripting/ScriptingIllustrateOutputCapturer.java PRE-CREATION 
>   src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
>   src/python/streaming/controller.py PRE-CREATION 
>   src/python/streaming/pig_util.py PRE-CREATION 
>   test/org/apache/pig/builtin/TestPigToStreamUDF.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamUDFToPig.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamingUDF.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
>   test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
>   test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
>   test/org/apache/pig/test/TestStreaming.java 1eac5d2 
>   test/python/streaming/test_controller.py PRE-CREATION 
>   test/unit-tests d52ad9d 
> 
> Diff: https://reviews.apache.org/r/13781/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jeremy Karn
> 
>


Re: Review Request 13781: Changes to add support for streaming_python udfs.

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/#review25964
-----------------------------------------------------------



src/org/apache/pig/PigToStream.java
<https://reviews.apache.org/r/13781/#comment50670>

    No code change, remove this section from patch



src/org/apache/pig/StreamToPig.java
<https://reviews.apache.org/r/13781/#comment50671>

    No code change, remove this section from patch



src/org/apache/pig/builtin/PigStreaming.java
<https://reviews.apache.org/r/13781/#comment50672>

    No code change, remove this section from patch



src/org/apache/pig/builtin/PigToStreamUDF.java
<https://reviews.apache.org/r/13781/#comment50675>

    I feel those classes StreamingUDF should go to package org.apache.pig.impl.builtin, StreamingDelimiters, StreamUDFToPig, PigToStreamUDF, etc should go to org.apache.pig.impl.streaming



src/org/apache/pig/builtin/PigToStreamUDF.java
<https://reviews.apache.org/r/13781/#comment50673>

    I personally like to move the last element processing to the for loop:
            for (int i=0; i < sz; i++) {
                field = t.get(i);
                StorageUtil.putField(out, field, DELIMS, true);
                if (i!=sz-1) {
                    out.write(DELIMS.getParamDelim());
                }
            }



src/org/apache/pig/builtin/StreamUDFToPig.java
<https://reviews.apache.org/r/13781/#comment50684>

    This might be affected by PIG-3255



src/org/apache/pig/builtin/StreamUDFToPig.java
<https://reviews.apache.org/r/13781/#comment50686>

    How do we handle delimit escaping? Seems we don't handle it right now?



src/org/apache/pig/builtin/StreamingUDF.java
<https://reviews.apache.org/r/13781/#comment50685>

    Are those files only for illustrate? Why not just use a random tmp file? Why do we distinguish exectype in illustrate? It only run locally.



src/org/apache/pig/builtin/StreamingUDF.java
<https://reviews.apache.org/r/13781/#comment50680>

    Any reason not to create a StreamingUDFInputHandler?



src/org/apache/pig/builtin/StreamingUDF.java
<https://reviews.apache.org/r/13781/#comment50679>

    Can we use FileUtils.copyFile?



src/org/apache/pig/builtin/StreamingUDF.java
<https://reviews.apache.org/r/13781/#comment50683>

    Seems flush() is more proper to put inside inputHandler



src/org/apache/pig/impl/streaming/InputHandler.java
<https://reviews.apache.org/r/13781/#comment50682>

    Why we need this change?



test/unit-tests
<https://reviews.apache.org/r/13781/#comment50681>

    This is not true unit test, should remove from the list


Overall, the patch is non-intrusive. Except for minor refactory in ExecutableManager, existing features should not be affected. I am fine to check in code quickly, then improve the code over time. Before that, we need:
* address existing comments, especially conflict with PIG-3255
* add some e2e tests
* run regression tests

- Daniel Dai


On Sept. 5, 2013, 8:23 p.m., Jeremy Karn wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13781/
> -----------------------------------------------------------
> 
> (Updated Sept. 5, 2013, 8:23 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)
> 
> 
> Diffs
> -----
> 
>   build.xml 7e22192 
>   src/org/apache/pig/PigToStream.java 7cc2950 
>   src/org/apache/pig/StreamToPig.java ff24b27 
>   src/org/apache/pig/builtin/PigStreaming.java 5467693 
>   src/org/apache/pig/builtin/PigToStreamUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamUDFToPig.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingDelimiters.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDF.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFException.java PRE-CREATION 
>   src/org/apache/pig/builtin/StreamingUDFOutputSchemaException.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/DefaultInputHandler.java 301bea3 
>   src/org/apache/pig/impl/streaming/DefaultOutputHandler.java 1b46e7d 
>   src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
>   src/org/apache/pig/impl/streaming/InputHandler.java 690d94e 
>   src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
>   src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
>   src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
>   src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
>   src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
>   src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
>   src/org/apache/pig/scripting/ScriptingIllustrateOutputCapturer.java PRE-CREATION 
>   src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
>   src/python/streaming/controller.py PRE-CREATION 
>   src/python/streaming/pig_util.py PRE-CREATION 
>   test/org/apache/pig/builtin/TestPigToStreamUDF.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamUDFToPig.java PRE-CREATION 
>   test/org/apache/pig/builtin/TestStreamingUDF.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
>   test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
>   test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
>   test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
>   test/org/apache/pig/test/TestStreaming.java 1eac5d2 
>   test/python/streaming/test_controller.py PRE-CREATION 
>   test/unit-tests d52ad9d 
> 
> Diff: https://reviews.apache.org/r/13781/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jeremy Karn
> 
>


Re: Review Request 13781: Changes to add support for streaming_python udfs.

Posted by Jeremy Karn <jk...@mortardata.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/
-----------------------------------------------------------

(Updated Sept. 5, 2013, 8:23 p.m.)


Review request for pig.


Repository: pig-git


Description
-------

Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)


Diffs (updated)
-----

  build.xml 7e22192 
  src/org/apache/pig/PigToStream.java 7cc2950 
  src/org/apache/pig/StreamToPig.java ff24b27 
  src/org/apache/pig/builtin/PigStreaming.java 5467693 
  src/org/apache/pig/builtin/PigToStreamUDF.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamUDFToPig.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingDelimiters.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDF.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDFException.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDFOutputSchemaException.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/DefaultInputHandler.java 301bea3 
  src/org/apache/pig/impl/streaming/DefaultOutputHandler.java 1b46e7d 
  src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
  src/org/apache/pig/impl/streaming/InputHandler.java 690d94e 
  src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
  src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
  src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
  src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
  src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
  src/org/apache/pig/scripting/ScriptingIllustrateOutputCapturer.java PRE-CREATION 
  src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
  src/python/streaming/controller.py PRE-CREATION 
  src/python/streaming/pig_util.py PRE-CREATION 
  test/org/apache/pig/builtin/TestPigToStreamUDF.java PRE-CREATION 
  test/org/apache/pig/builtin/TestStreamUDFToPig.java PRE-CREATION 
  test/org/apache/pig/builtin/TestStreamingUDF.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
  test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
  test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
  test/org/apache/pig/test/TestStreaming.java 1eac5d2 
  test/python/streaming/test_controller.py PRE-CREATION 
  test/unit-tests d52ad9d 

Diff: https://reviews.apache.org/r/13781/diff/


Testing
-------


Thanks,

Jeremy Karn