Posted to issues@spark.apache.org by "Michael Chirico (JIRA)" <ji...@apache.org> on 2019/04/18 11:17:00 UTC

[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input

     [ https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Chirico updated SPARK-27507:
------------------------------------
    Description: 
Some long JSON objects are parsed incorrectly by {{get_json_object}}.

The specific string we noticed this on can't be shared, but here's some reproduction in Pyspark:
{code:python}
# Spark v2.3.1
from pyspark.sql import SparkSession
from string import ascii_lowercase

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# create a long string
alpha_rep = ascii_lowercase*1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
    return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
    out_len = run_q(s)
    # input & output should be identical (length match is a necessary condition)
    print('Input length: %d\tOutput length: %d' % (len(s), out_len))
    return True

def test_l(n):
    diagnose(alpha_rep[:n])
    return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep)){code}
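(For clarity: the doubled braces in {{test_q}} are literal braces escaped for Python's {{str.format}}. A standalone check of the generated SQL, using a hypothetical 3-character payload, no Spark needed:)
{code:python}
# How test_q is built: {{ and }} become literal braces, {} takes the payload.
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

sql = test_q.format('abc')
print(sql)  # select get_json_object('{"a": "abc"}', '$.a')
{code}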
With results on my instance:
{code}
Input length: 2264	Output length: 2264
Input length: 2265	Output length: 2265
Input length: 2667	Output length: 2660
Input length: 2666	Output length: 2666
Input length: 2668	Output length: 2661 <---- problematic!!
Input length: 26000	Output length: 26000
{code}
It's strange that the error triggers only for certain input lengths; apparently the problem is not simply that the input is large.
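For comparison (a quick sanity check added here, not part of the original report): Python's built-in {{json}} module round-trips strings of these exact lengths without loss, which suggests the truncation is specific to Spark's {{get_json_object}} path rather than anything about the JSON itself:
{code:python}
import json
from string import ascii_lowercase

alpha_rep = ascii_lowercase * 1000

# The lengths that misbehaved in Spark above round-trip fine here.
for n in (2264, 2667, 2668, len(alpha_rep)):
    payload = alpha_rep[:n]
    parsed = json.loads('{"a": "%s"}' % payload)['a']
    assert len(parsed) == n
{code}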

 

More details from a {{pandas}} exploration:
{code:python}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})

N = DF.shape[0]
# note -- takes about 20 minutes to run on my machine
for ii in range(N):
    DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
    if ii % 520 == 0:
        print("%.0f%% Done" % (100.0*ii/N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)

DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot attached

So {{get_json_object}} appears to fail only within a narrow band of input lengths (roughly 1,300 characters wide) before recovering and behaving as expected for longer inputs.
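(To pin down the exact boundaries of that band, one could extract the first and last mismatching lengths from the scan's {{DF}}. A sketch of that extraction, shown here with synthetic data standing in for the real 20-minute scan:)
{code:python}
import pandas as pd

# Synthetic stand-in for the scan above: pretend lengths 2668..2670
# come back 7 characters short, everything else round-trips exactly.
DF = pd.DataFrame({'n': range(2660, 2680)})
DF['m'] = DF['n'].where(~DF['n'].between(2668, 2670), DF['n'] - 7)

# Locate the failing band: first/last mismatching length and band width.
bad = DF[DF['n'] != DF['m']]
print(bad['n'].min(), bad['n'].max(), bad.shape[0])  # 2668 2670 3
{code}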


> get_json_object fails somewhat arbitrarily on long input
> --------------------------------------------------------
>
>                 Key: SPARK-27507
>                 URL: https://issues.apache.org/jira/browse/SPARK-27507
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Michael Chirico
>            Priority: Major
>         Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
