Posted to issues@livy.apache.org by "Grant Wu (JIRA)" <ji...@apache.org> on 2019/07/16 22:05:00 UTC

[jira] [Comment Edited] (LIVY-322) JsonParseException on failure to parse text output from subprocess call to hadoop fs -rm

    [ https://issues.apache.org/jira/browse/LIVY-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886493#comment-16886493 ] 

Grant Wu edited comment on LIVY-322 at 7/16/19 10:04 PM:
---------------------------------------------------------

Hey - I think the {{fake_shell}} approach needs a fundamental redesign.  In addition to {{subprocess.call}} inserting raw stdout, there are many other ways for raw stdout to sneak in and cause parsing exceptions.  Here's a reproducing example using the REST API:
{code}
curl -X POST -H "Content-Type: application/json" http://livy-server:8998/sessions/8/statements -d '{"kind": "pyspark", "code":"import sys\nprint(\"[Source: 2019-07-16 15:15:37.349 INFO  ConnectionPool:72 | Created connection for pulsar://pulsar-broker.petuum-system:6650\", file=sys.__stdout__)"}'
{code}
Note that this example is rather artificial. But the breaking text is actual output from Apache Pulsar, which we want to use within Spark jobs. Pulsar's Python client is written in C++, and so the {{fake_shell}} approach of overwriting {{sys.stderr}} and {{sys.stdout}} doesn't work. I can't really think of a Python solution to this.
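To make the limitation concrete, here is a minimal self-contained sketch (simplified, not Livy's actual {{fake_shell}} code) of why swapping out {{sys.stdout}} is not enough: anything that writes to file descriptor 1 directly, such as a subprocess or a C/C++ extension like the Pulsar client, bypasses the replacement entirely.
{code:python}
import io
import os
import subprocess
import sys

# Replace sys.stdout the way an interpreter wrapper might (simplified).
captured = io.StringIO()
sys.stdout = captured

print("captured")                                # intercepted by the StringIO buffer
os.write(1, b"not captured\n")                   # written straight to the real fd 1
subprocess.call(["echo", "not captured either"]) # POSIX echo; the child inherits the real fd 1
print("still the real stream", file=sys.__stdout__)  # the original stream object is still live

sys.stdout = sys.__stdout__
print("buffer contents:", repr(captured.getvalue()))  # only the first print landed here
{code}
Everything except the first {{print}} ends up on the real stdout that the wrapper is trying to read, which is exactly how the Pulsar client's log line sneaks in.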

 


was (Author: grantwu):
Hey - I think the {{fake_shell}} approach needs a fundamental redesign.  In addition to `subprocess.call` inserting raw stdout, there are many other ways to sneak in raw stdout to cause parsing exceptions.  Here's a reproducing example using the REST API:

{code:bash}
curl -X POST -H "Content-Type:'application/json'" http://livy-server:8998/sessions/8/statements -d '{"kind": "pyspark", "code":"import sys\nprint(\"[Source: 2019-07-16 15:15:37.349 INFO  ConnectionPool:72 | Created connection for pulsar://pulsar-broker.petuum-system:6650\", file=sys.__stdout__)"}'
{code}

Note that this example is rather artificial.  But the breaking text is actual output from Apache Pulsar, which we want to use within Spark jobs.  Pulsar's Python client is written in C++, and so the {{fake_shell}} approach of overwriting {{sys.stderr}} and {{sys.stdout}} doesn't work.  I can't really think of a Python solution to this.

 

> JsonParseException on failure to parse text output from subprocess call to hadoop fs -rm
> ----------------------------------------------------------------------------------------
>
>                 Key: LIVY-322
>                 URL: https://issues.apache.org/jira/browse/LIVY-322
>             Project: Livy
>          Issue Type: Bug
>          Components: API, Interpreter
>    Affects Versions: 0.3
>            Reporter: Rick Bernotas
>            Priority: Major
>         Attachments: patch_LIVY-322_rickbernotas.patch
>
>
> In a pyspark session, if you run a subprocess.call() to do a "hadoop fs -rm" on a Hadoop 2.7 cluster, the response from "hadoop fs -rm" (a text message saying the file was moved to the .Trash folder in HDFS) causes a JsonParseException in Livy, and all subsequent statement executions in the session then misbehave.
> I suspect something in the hadoop fs response is tripping up Livy during the conversion to JSON, perhaps a reserved or special character that Livy is not filtering out, as the response is otherwise innocuous.
> Livy needs to parse the response correctly instead of throwing an exception, and if an exception is thrown anyway, the session should recover from it and keep running statements correctly.  After the JSON exception, even a print(1) statement fails to execute properly, forcing the user to obtain a new session.
> Example follows below.
> {code:bash}
> ### CREATE A NEW PYSPARK SESSION
> -bash-4.1$ curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions
> {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
> ### CHECK THE STATE OF SESSION 2 UNTIL IT GOES FROM "STARTING" STATE TO "IDLE" STATE
> -bash-4.1$ curl localhost:8998/sessions/2
> {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
> -bash-4.1$ curl localhost:8998/sessions/2
> {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"idle","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
> ### RUN THE PYSPARK CODE IN SESSION 2, "import subprocess"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"import subprocess"}'
> {"id":0,"state":-X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/0
> {"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":""}}}
> ### THE OUTPUT IS {"text/plain":""} WHICH IS EXPECTED AND CORRECT
> ### RUN THE PYSPARK CODE IN SESSION 2, "subprocess.call(["hadoop", "fs", "-touchz", "foo.tmp"])"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"subprocess.call([\"hadoop\", \"fs\", \"-touchz\", \"foo.tmp\"])"}'
> {"id":1,"state":"running","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/1
> {"id":1,"state":"available","output":{"status":"ok","execution_count":1,"data":{"text/plain":"0"}}}
> ### THE OUTPUT IS {"text/plain":"0"} WHICH IS EXPECTED OUTPUT THAT THE TOUCHZ COMPLETED WITH RETURN CODE 0.
> ### RUN THE PYSPARK CODE IN SESSION 2, "print(subprocess.check_output(["hadoop", "fs", "-ls", "foo.tmp"]))"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"print(subprocess.check_output([\"hadoop\", \"fs\", \"-ls\", \"foo.tmp\"]))"}'
> {"id":2,"state":"waiting","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/2
> {"id":2,"state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"-rw-------   3 username group          0 2017-02-23 19:26 foo.tmp"}}}
> ### THE OUTPUT IS {"text/plain":"-rw-------   3 username group          0 2017-02-23 19:26 foo.tmp"} WHICH IS EXPECTED OUTPUT OF DIRECTORY LISTING
> ### RUN THE PYSPARK CODE IN SESSION 2, "subprocess.call(["hadoop", "fs", "-rm", "foo.tmp"])"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"subprocess.call([\"hadoop\", \"fs\", \"-rm\", \"foo.tmp\"])"}'
> {"id":3,"state":"waiting","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/3
> {"id":3,"state":"available","output":{"status":"error","execution_count":3,"ename":"com.fasterxml.jackson.core.JsonParseException","evalue":"Unrecognized token 'Moved': was expecting ('true', 'false' or 'null')\n at [Source: Moved: 'foo.tmp' to trash at: .Trash/Current; line: 1, column: 6]","traceback":[]}}
> ### JSON EXCEPTION APPEARS HERE WHICH IS INCORRECT PARSING OF THE OUTPUT
> ### RUN THE PYSPARK CODE IN SESSION 2, "print(1)"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"print(1)"}'
> {"id":4,"state":"available","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/4
> {"id":4,"state":"available","output":{"status":"ok","execution_count":4,"data":{"text/plain":""}}}
> ### THE OUTPUT IS {"text/plain":""} WHICH IS EMPTY STRING, INDICATING OPERATION COMPLETED WITH NO OUTPUT, WHICH IS INCORRECT, IT SHOULD RETURN 1
> {code}
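> The exception above appears to come down to the plain-text "Moved: 'foo.tmp' to trash at: .Trash/Current" line from the hadoop subprocess landing on the same stdout stream whose contents Livy parses as JSON. A minimal stand-alone sketch of that failure mode (the JSON reply line is an illustrative stand-in, not Livy's actual wire format):
> {code:python}
> import json
>
> # Two lines arriving on the same stdout stream that the driver reads back:
> # the subprocess's plain-text message, followed by an illustrative JSON reply
> # of the kind the interpreter wrapper would emit for the statement result.
> lines = [
>     "Moved: 'foo.tmp' to trash at: .Trash/Current",    # raw hadoop fs -rm output
>     '{"status": "ok", "data": {"text/plain": "0"}}',   # made-up stand-in reply
> ]
>
> for line in lines:
>     try:
>         print(json.loads(line))
>     except ValueError as err:          # json.JSONDecodeError is a subclass of ValueError
>         print("parse failure:", err)   # fails on 'Moved', just like the Jackson error above
> {code}
> Because the first line is not JSON, the line-by-line parse fails the same way Jackson does above, and the actual reply that follows never reaches the client correctly.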



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)