Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2021/07/19 14:33:00 UTC

[jira] [Commented] (ARROW-13314) JSON parsing segment fault on long records (block_size) dependent

    [ https://issues.apache.org/jira/browse/ARROW-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383377#comment-17383377 ] 

Alessandro Molina commented on ARROW-13314:
-------------------------------------------

I was able to reproduce the issue locally. Incidentally, I seem to get the abort/segfault only when Arrow is built in debug mode; otherwise the process seems to freeze, waiting on some thread.

This is the exception mentioned above:
{code}
Traceback (most recent call last):
  File "/home/amol/ARROW/tries/read.py", line 5, in <module>
    json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
  File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)
{code}
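
As the error message itself suggests, one workaround is to raise block_size in ReadOptions (the reporter confirms further down that a large enough block works). A minimal sketch, assuming the same file name as in the traceback:

{code:python}
from pyarrow import json
from pyarrow.json import ReadOptions

# 16 MiB blocks instead of the 1 MiB used in the reproduction; the block
# must be large enough to hold the longest single record in the file.
ro = ReadOptions(block_size=2**24)
table = json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
{code}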

In debug mode I also get these two extra errors:
{code}
pure virtual method called
terminate called without an active exception
{code}

and the backtrace I could get from gdb looks like:
{code}
#4  0x00007ffff39a5567 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff39a62e5 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff5ed13f0 in arrow::json::ChunkedStructArrayBuilder::InsertChildren (this=0xb89ae0, block_index=0, 
    unconverted=...) at src/arrow/json/chunked_builder.cc:396
#7  0x00007ffff5ed0321 in arrow::json::ChunkedStructArrayBuilder::Insert (this=0xb89ae0, block_index=0, 
    unconverted=std::shared_ptr<arrow::Array> (use count 1, weak count 0) = {...})
    at src/arrow/json/chunked_builder.cc:320
#8  0x00007ffff5f2ba61 in arrow::json::TableReaderImpl::ParseAndInsert (this=0xc489b0, 
    partial=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, 
    completion=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, 
    whole=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, block_index=0)
    at src/arrow/json/reader.cc:158
#9  0x00007ffff5f2a331 in arrow::json::TableReaderImpl::Read()::{lambda()#1}::operator()() const (__closure=0xca6cb8)
    at src/arrow/json/reader.cc:104
...
{code}
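
As a side note, for release builds where only a hard crash is visible, the standard-library faulthandler module can at least confirm the Python-side location of the fault without a debug build or gdb (a hedged debugging aid, not part of the original reproduction):

{code:python}
import faulthandler
faulthandler.enable()  # dump the Python traceback on SIGSEGV/SIGABRT

from pyarrow import json
from pyarrow.json import ReadOptions

ro = ReadOptions(block_size=2**20)
json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
{code}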

> JSON parsing segment fault on long records (block_size) dependent
> -----------------------------------------------------------------
>
>                 Key: ARROW-13314
>                 URL: https://issues.apache.org/jira/browse/ARROW-13314
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Guido Muscioni
>            Priority: Major
>
> Hello,
>  
> I have a big JSON file (~300 MB) with complex records (nested JSON objects, nested lists of JSON objects). When I try to read it with pyarrow I get a segmentation fault. I then tried a couple of things with the read options; please see the code below (I developed this code against the example file attached to https://issues.apache.org/jira/browse/ARROW-9612):
>  
> {code:python}
> import tqdm
> from pyarrow import json
> from pyarrow.json import ReadOptions
> 
> if __name__ == '__main__':
>     source = 'wiki_04.jsonl'
>     ro = ReadOptions(block_size=2**20)
>     with open(source, 'r') as file:
>         for i, line in tqdm.tqdm(enumerate(file)):
>             # Grow the temp file one record at a time and re-parse it after
>             # each append, so the failure surfaces as soon as the file
>             # crosses the block size.
>             with open('temp_file_arrow_3.ndjson', 'a') as file2:
>                 file2.write(line)
>             json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
> {code}
> For both the example file and my file, this code raises the straddling-object exception (or seg faults) once the file reaches the block_size; increasing the block_size only makes it fail later.
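> To make the failure mode concrete, here is a minimal sketch with synthetic data (hypothetical file name): a single NDJSON record that crosses more than one block boundary is exactly what the straddling-object error describes, and per this report it can also crash instead of raising cleanly:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> 
> # One record deliberately larger than the 1 MiB block size used below.
> with open('one_long_record.ndjson', 'w') as f:
>     f.write('{"text": "%s"}\n' % ('x' * 2**21))
> 
> ro = ReadOptions(block_size=2**20)
> json.read_json('one_long_record.ndjson', read_options=ro)  # fails here
> {code}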
> Then I tried, on my file, providing an explicit schema:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import json
> from pyarrow.json import ParseOptions
> 
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     # Infer a schema once via pandas, then hand it to the Arrow reader.
>     df = pd.read_json(source, lines=True)
>     table_schema = pa.Table.from_pandas(df).schema
> 
>     po = ParseOptions(explicit_schema=table_schema)
>     table = json.read_json(source, parse_options=po)
> {code}
> This works, which may suggest that this issue, and the one in the linked JIRA issue, only appear when an explicit schema is not provided. Additionally, the following code works as well:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> 
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     # A block size larger than the whole file, so it is read as one block.
>     ro = ReadOptions(block_size=2**30)
>     table = json.read_json(source, read_options=ro)
> {code}
> In this case the block_size is bigger than my file. Is it possible that the schema is inferred from the first block, and that I get a seg fault when the schema changes in a later block?
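> If so, a stopgap along the same lines (a sketch, not a real fix) is to size the block to the file itself so everything lands in a single block:
> {code:python}
> import os
> 
> from pyarrow import json
> from pyarrow.json import ReadOptions
> 
> source = 'my_file.jsonl'
> # One block that covers the entire file.
> ro = ReadOptions(block_size=os.path.getsize(source) + 1)
> table = json.read_json(source, read_options=ro)
> {code}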
> I cannot share my JSON file; however, I hope someone can shed some light on what I am seeing and perhaps suggest a workaround.
> Thank you,
>  Guido


