Posted to issues@spark.apache.org by "Michael Chirico (JIRA)" <ji...@apache.org> on 2019/04/18 11:15:00 UTC
[jira] [Created] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input
Michael Chirico created SPARK-27507:
---------------------------------------
Summary: get_json_object fails somewhat arbitrarily on long input
Key: SPARK-27507
URL: https://issues.apache.org/jira/browse/SPARK-27507
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.3.1
Reporter: Michael Chirico
Some long JSON objects are parsed incorrectly by {{get_json_object}}.
The specific string we noticed this on can't be shared, but here's a reproduction in PySpark:
{code:java}
# v2.3.1
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
from string import ascii_lowercase
# create a long string
alpha_rep = ascii_lowercase * 1000
# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''
def run_q(s):
    return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
    out_len = run_q(s)
    # input & output should be identical (length match is a necessary condition)
    print('Input length: %d\tOutput length: %d' % (len(s), out_len))
    return True

def test_l(n):
    diagnose(alpha_rep[:n])
    return True
test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep))
{code}
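As a sanity check on the reproduction above (not part of the original report), the same round-trip can be run through Python's stdlib {{json}} parser as a ground-truth oracle: the document is valid JSON and extracting the {{"a"}} field should return the input string unchanged at every one of these lengths, so the length mismatch is specific to {{get_json_object}}.

```python
import json
from string import ascii_lowercase

alpha_rep = ascii_lowercase * 1000

def oracle_len(s):
    # Round-trip through the stdlib json parser: build {"a": s},
    # parse it back, and measure the extracted field's length.
    doc = json.dumps({"a": s})
    return len(json.loads(doc)["a"])

# Every length tested above round-trips losslessly outside Spark.
for n in (2264, 2265, 2666, 2667, 2668, len(alpha_rep)):
    assert oracle_len(alpha_rep[:n]) == n
```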
With results on my instance:
{code:java}
Input length: 2264 Output length: 2264
Input length: 2265 Output length: 2265
Input length: 2667 Output length: 2660
Input length: 2666 Output length: 2666
Input length: 2668 Output length: 2661 <---- problematic!!
Input length: 26000 Output length: 26000
{code}
Strangely, the error triggers only for certain lengths; it is evidently not just a matter of the input being large, since the full 26,000-character input round-trips correctly.
More details from a {{pandas}} exploration:
{code:java}
import pandas as pd
DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})
N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
    DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
    if ii % 520 == 0:
        print("%.0f%% Done" % (100.0 * ii / N))
DF[DF['n'] != DF['m']].shape
# (1326, 2)
DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
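The failing window can also be summarized numerically from the {{miss}} column rather than read off the plot. A minimal sketch using a small synthetic stand-in DataFrame (the real sweep above takes ~20 minutes; the window position here is illustrative only):

```python
import pandas as pd

# Synthetic stand-in for the DF built above: lengths 1..100 with a
# simulated contiguous failure window at n in [40, 60], where the
# "parsed" length m comes back 7 characters short.
DF = pd.DataFrame({'n': range(1, 101)})
DF['m'] = DF['n'].where(~DF['n'].between(40, 60), DF['n'] - 7)
DF['miss'] = DF['n'] - DF['m']

# First failing length, last failing length, and how many lengths fail.
bad = DF[DF['miss'] > 0]
print(bad['n'].min(), bad['n'].max(), bad.shape[0])  # → 40 60 21
```

Applied to the real {{DF}}, the same three numbers pin down exactly where the window starts and ends and confirm the 1326-length count reported above.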
Plot here:
[https://imgur.com/vCPLNwy]
So it appears to fail on a narrow window of input lengths (1326 of them, per the exploration above) before recovering and behaving as expected for longer inputs.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)