Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/04/19 11:22:00 UTC
[jira] [Resolved] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input

     [ https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27507.
----------------------------------
    Resolution: Cannot Reproduce

{code}
Input length: 2264	Output length: 2264
Input length: 2265	Output length: 2265
Input length: 2667	Output length: 2667
Input length: 2666	Output length: 2666
Input length: 2668	Output length: 2668
Input length: 26000	Output length: 26000
{code}

I can't reproduce this on the current master: as shown above, the input and output lengths match in every case.

It would be great if we could identify which JIRA fixed this and check whether the fix can be backported. For now, I am leaving this resolved.
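
Matching lengths are only a necessary condition for the output being correct, as the reproduction script itself notes. A stricter check would compare the contents as well. The following is a minimal sketch of such a comparison in plain Python; the helper name `first_mismatch` is hypothetical and is not part of Spark or of the original script.

```python
def first_mismatch(expected, actual):
    """Return the index of the first differing character, or None if equal.

    Hypothetical helper for illustration; not part of Spark or the report.
    """
    # Walk both strings in parallel until a character differs.
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    # One string is a strict prefix of the other: the mismatch
    # starts where the shorter string ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Made-up strings: the "output" is missing two characters in the middle.
print(first_mismatch("abcdefgh", "abcfgh"))
```

In the reproduction script, this could be applied to the input string and the value returned by `collect()[0][0]` to see where the truncated output starts to diverge, assuming the output is returned as an ordinary Python string.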

> get_json_object fails somewhat arbitrarily on long input
> --------------------------------------------------------
>
>                 Key: SPARK-27507
>                 URL: https://issues.apache.org/jira/browse/SPARK-27507
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Michael Chirico
>            Priority: Major
>         Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png
>
>
> Some long JSON objects are parsed incorrectly by {{get_json_object}}.
> The specific string we noticed this on can't be shared, but here's some reproduction in Pyspark:
> {code:java}
> # v2.3.1
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> from string import ascii_lowercase
> # create a long string
> alpha_rep = ascii_lowercase*1000
> # create a simple query on a simple json object which contains this string
> test_q = '''
> select get_json_object('{{"a": "{}"}}', '$.a')
> '''
> def run_q(s):
>     return len(spark.sql(test_q.format(s)).collect()[0][0])
> def diagnose(s):
>     out_len = run_q(s)
>     # input & output should be identical (length match is a necessary condition)
>     print('Input length: %d\tOutput length: %d' % (len(s), out_len))
>     return True
> def test_l(n):
>     diagnose(alpha_rep[:n])
>     return True
> test_l(2264)
> test_l(2265)
> test_l(2667)
> test_l(2666)
> test_l(2668)
> test_l(len(alpha_rep))
> {code}
> With results on my instance:
> {code:java}
> Input length: 2264	Output length: 2264
> Input length: 2265	Output length: 2265
> Input length: 2667	Output length: 2660 <---- problematic!!
> Input length: 2666	Output length: 2666
> Input length: 2668	Output length: 2661 <---- problematic!!
> Input length: 26000	Output length: 26000
> {code}
> Strangely, the error triggers only for certain lengths, so it is apparently not simply a matter of the input being large.
>  
> More details from a {{pandas}} exploration:
> {code:java}
> import pandas as pd
> DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})
> N = DF.shape[0]
> # note -- takes about 20 minutes to run on my machine
> for ii in range(N):
>     DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
>     if ii % 520 == 0:
>         print("%.0f%% Done" % (100.0*ii/N))
> DF[DF['n'] != DF['m']].shape
> # (1326, 2)
> DF['miss'] = DF['n'] - DF['m']
> DF.plot('n', 'miss')
> {code}
> Plot attached
> So it appears to fail for a narrowly defined range of about 1300 characters before recovering and continuing to function as expected.
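
To pin down the failing range more precisely than a plot allows, the mismatching lengths can be grouped into contiguous intervals. The following is a minimal sketch in plain Python; the function name `failing_ranges` and the sample data are hypothetical and are not taken from the measurements in this report.

```python
def failing_ranges(pairs):
    """Group input lengths n whose output length m differs from n
    into contiguous, inclusive (start, end) ranges.

    Hypothetical helper for illustration; `pairs` is an iterable of (n, m).
    """
    ranges = []
    for n, m in sorted(pairs):
        if n == m:
            continue  # this input length round-tripped correctly
        if ranges and ranges[-1][1] == n - 1:
            ranges[-1][1] = n       # extend the current failing run
        else:
            ranges.append([n, n])   # start a new failing run
    return [tuple(r) for r in ranges]


# Made-up data for illustration only.
sample = [(1, 1), (2, 1), (3, 2), (4, 4), (5, 4), (6, 6)]
print(failing_ranges(sample))
```

With the DataFrame from the exploration above, something like `failing_ranges(zip(DF['n'], DF['m']))` should list the affected intervals, assuming the `m` column has been filled in for every row.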


