Posted to issues@spark.apache.org by "Michael Chirico (JIRA)" <ji...@apache.org> on 2019/04/18 11:15:00 UTC
[jira] [Created] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input
Michael Chirico created SPARK-27507:
---------------------------------------
Summary: get_json_object fails somewhat arbitrarily on long input
Key: SPARK-27507
URL: https://issues.apache.org/jira/browse/SPARK-27507
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.3.1
Reporter: Michael Chirico
Some long JSON objects are parsed incorrectly by {{get_json_object}}.
The specific string we noticed this on can't be shared, but here's a reproduction in PySpark:
{code:java}
# v2.3.1
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
from string import ascii_lowercase
# create a long string
alpha_rep = ascii_lowercase * 1000
# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''
def run_q(s):
    return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
    out_len = run_q(s)
    # input & output should be identical (length match is a necessary condition)
    print('Input length: %d\tOutput length: %d' % (len(s), out_len))
    return True

def test_l(n):
    diagnose(alpha_rep[:n])
    return True
test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep))
{code}
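As a sanity check on the reproduction above (not part of the original report), the same round-trip can be run through Python's stdlib {{json}} parser as a ground-truth oracle: the document is valid JSON and extracting the {{"a"}} field should return the input string unchanged at every one of these lengths, so the length mismatch is specific to {{get_json_object}}.

```python
import json
from string import ascii_lowercase

alpha_rep = ascii_lowercase * 1000

def oracle_len(s):
    # Round-trip through the stdlib json parser: build {"a": s},
    # parse it back, and measure the extracted field's length.
    doc = json.dumps({"a": s})
    return len(json.loads(doc)["a"])

# Every length tested above round-trips losslessly outside Spark.
for n in (2264, 2265, 2666, 2667, 2668, len(alpha_rep)):
    assert oracle_len(alpha_rep[:n]) == n
```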
With results on my instance:
{code:java}
Input length: 2264 Output length: 2264
Input length: 2265 Output length: 2265
Input length: 2667 Output length: 2660
Input length: 2666 Output length: 2666
Input length: 2668 Output length: 2661 <---- problematic!!
Input length: 26000 Output length: 26000
{code}
Strangely, the error triggers only for certain lengths; it is evidently not just a matter of the input being large, since the full 26,000-character input round-trips correctly.
More details from a {{pandas}} exploration:
{code:java}
import pandas as pd
DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})
N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
    DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
    if ii % 520 == 0:
        print("%.0f%% Done" % (100.0 * ii / N))
DF[DF['n'] != DF['m']].shape
# (1326, 2)
DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
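The failing window can also be summarized numerically from the {{miss}} column rather than read off the plot. A minimal sketch using a small synthetic stand-in DataFrame (the real sweep above takes ~20 minutes; the window position here is illustrative only):

```python
import pandas as pd

# Synthetic stand-in for the DF built above: lengths 1..100 with a
# simulated contiguous failure window at n in [40, 60], where the
# "parsed" length m comes back 7 characters short.
DF = pd.DataFrame({'n': range(1, 101)})
DF['m'] = DF['n'].where(~DF['n'].between(40, 60), DF['n'] - 7)
DF['miss'] = DF['n'] - DF['m']

# First failing length, last failing length, and how many lengths fail.
bad = DF[DF['miss'] > 0]
print(bad['n'].min(), bad['n'].max(), bad.shape[0])  # → 40 60 21
```

Applied to the real {{DF}}, the same three numbers pin down exactly where the window starts and ends and confirm the 1326-length count reported above.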
Plot here:
[https://imgur.com/vCPLNwy]
So it appears to fail on a narrow window of input lengths (1326 of them, per the exploration above) before recovering and behaving as expected for longer inputs.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)