Posted to jira@arrow.apache.org by "Jacob Wujciak-Jens (Jira)" <ji...@apache.org> on 2022/04/08 12:44:00 UTC

[jira] [Updated] (ARROW-13314) [Python] JSON parsing segment fault on long records (block_size) dependent

     [ https://issues.apache.org/jira/browse/ARROW-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Wujciak-Jens updated ARROW-13314:
---------------------------------------
    Summary: [Python] JSON parsing segment fault on long records (block_size) dependent  (was: [Pyhton] JSON parsing segment fault on long records (block_size) dependent)

> [Python] JSON parsing segment fault on long records (block_size) dependent
> --------------------------------------------------------------------------
>
>                 Key: ARROW-13314
>                 URL: https://issues.apache.org/jira/browse/ARROW-13314
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Guido Muscioni
>            Priority: Major
>
> Hello,
>  
> I have a large JSON file (~300 MB) with complex records (nested JSON objects, nested lists of JSON objects). When I try to read it with pyarrow I get a segmentation fault. I then tried a couple of the read options; please see the code below (I developed it against the example file attached to https://issues.apache.org/jira/browse/ARROW-9612):
>  
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> import tqdm
>
> if __name__ == '__main__':
>     source = 'wiki_04.jsonl'
>     ro = ReadOptions(block_size=2**20)
>     # Grow a copy of the file one line at a time and re-parse it,
>     # to find the point at which parsing starts to fail.
>     with open(source, 'r') as file:
>         for i, line in tqdm.tqdm(enumerate(file)):
>             with open('temp_file_arrow_3.ndjson', 'a') as file2:
>                 file2.write(line)
>             json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
> {code}
> For both the example file and my file, this code raises the straddling-object exception (or segfaults) once the file reaches the block_size; increasing block_size only makes it fail later.
> I then tried providing an explicit schema for my file:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import json
> from pyarrow.json import ParseOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     # Infer a schema once, via pandas, then pin it for the Arrow reader.
>     df = pd.read_json(source, lines=True)
>     table_schema = pa.Table.from_pandas(df).schema
>
>     # Note: explicit_schema is a ParseOptions field, not a ReadOptions field.
>     po = ParseOptions(explicit_schema=table_schema)
>     table = json.read_json(source, parse_options=po)
> {code}
> This works, which suggests that this issue (and the linked one) only appears when no explicit schema is provided. The following also works:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>
>     # A block larger than the whole file, so everything is parsed in one block.
>     ro = ReadOptions(block_size=2**30)
>     table = json.read_json(source, read_options=ro)
> {code}
> In this case block_size is larger than my entire file. Is it possible that the schema is inferred from the first block, and that a segfault occurs when a later block does not match it?
> I cannot share my JSON file, but I hope someone can shed some light on what I am seeing and perhaps suggest a workaround.
> Thank you,
>  Guido



--
This message was sent by Atlassian Jira
(v8.20.1#820001)