Posted to issues@spark.apache.org by "Michael Chirico (JIRA)" <ji...@apache.org> on 2019/04/18 11:17:00 UTC
[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input
[ https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Chirico updated SPARK-27507:
------------------------------------
Description:
Some long JSON objects are parsed incorrectly by {{get_json_object}}.
The specific string we noticed this on can't be shared, but here's a reproduction in PySpark:
{code:python}
# v2.3.1
from pyspark.sql import SparkSession
from string import ascii_lowercase

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# create a long string
alpha_rep = ascii_lowercase * 1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
    return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
    out_len = run_q(s)
    # input & output should be identical (length match is a necessary condition)
    print('Input length: %d\tOutput length: %d' % (len(s), out_len))
    return True

def test_l(n):
    diagnose(alpha_rep[:n])
    return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep))
{code}
With results on my instance:
{code}
Input length: 2264	Output length: 2264
Input length: 2265	Output length: 2265
Input length: 2667	Output length: 2660
Input length: 2666	Output length: 2666
Input length: 2668	Output length: 2661 <---- problematic!!
Input length: 26000	Output length: 26000
{code}
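As a sanity check that the harness itself is sound (i.e. that the length mismatch is not an artifact of the {{str.format}} templating), the same JSON document can be round-tripped through Python's standard {{json}} module, which preserves the value at every tested length. This check is illustrative and was not part of the original report:
{code:python}
import json
from string import ascii_lowercase

alpha_rep = ascii_lowercase * 1000

for n in (2264, 2265, 2666, 2667, 2668):
    s = alpha_rep[:n]
    # same template used in the Spark query above
    doc = '{{"a": "{}"}}'.format(s)
    # the Python parser round-trips the value at full length
    assert json.loads(doc)['a'] == s
print("all inputs are valid JSON of the expected length")
{code}
This suggests the constructed documents are well-formed and the truncation happens inside {{get_json_object}} itself.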
Strangely, the error triggers only at certain input lengths; it is not simply a matter of the input being large, since the full 26,000-character input round-trips correctly.
More details from a {{pandas}} exploration:
{code:python}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})
N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
    DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
    if ii % 520 == 0:
        print("%.0f%% Done" % (100.0 * ii / N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)
DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot attached
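For reference, the extent of the failing region can also be read directly off the frame rather than from the plot. The snippet below is a hypothetical illustration using a small synthetic stand-in for {{DF}} (real values would come from the 20-minute scan above):
{code:python}
import pandas as pd

# synthetic stand-in for the DF built in the scan above:
# n = input length, m = observed output length; lengths 4..7 "fail" here
DF = pd.DataFrame({'n': range(1, 11)})
DF['m'] = DF['n'].where(~DF['n'].between(4, 7), DF['n'] - 2)

bad = DF[DF['n'] != DF['m']]
print("failing lengths: %d..%d (%d values)"
      % (bad['n'].min(), bad['n'].max(), bad.shape[0]))
# -> failing lengths: 4..7 (4 values)
{code}
Run against the real {{DF}}, this prints the exact boundaries of the ~1300-value failing band.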
So {{get_json_object}} appears to fail for inputs in a narrow band of lengths, roughly 1,300 values wide, before recovering and functioning as expected for longer inputs.
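A note for anyone reproducing this: rather than the full linear scan, the lower edge of a failing region can be located in about 15 queries by bisection over the input length. The helper below is a hypothetical sketch built on the {{run_q}} function defined earlier; it assumes a single good-to-bad transition in the searched interval (which holds below the first failure, though not across the whole range, since the function later recovers):
{code:python}
def find_boundary(lo, hi, ok):
    """Bisect for the smallest n in (lo, hi] where ok(n) becomes False.

    Assumes ok(lo) is True and ok(hi) is False, i.e. exactly one
    transition point exists in the interval.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return hi

# hypothetical usage with the run_q helper defined above:
# ok = lambda n: run_q(alpha_rep[:n]) == n
# first_bad = find_boundary(1, len(alpha_rep), ok)
{code}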
> get_json_object fails somewhat arbitrarily on long input
> --------------------------------------------------------
>
> Key: SPARK-27507
> URL: https://issues.apache.org/jira/browse/SPARK-27507
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Michael Chirico
> Priority: Major
> Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org