Posted to jira@arrow.apache.org by "Ryan Stalets (Jira)" <ji...@apache.org> on 2021/07/12 20:45:00 UTC

[jira] [Created] (ARROW-13318) kMaxParserNumRows Value Increase/Removal

Ryan Stalets created ARROW-13318:
------------------------------------

             Summary: kMaxParserNumRows Value Increase/Removal
                 Key: ARROW-13318
                 URL: https://issues.apache.org/jira/browse/ARROW-13318
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Ryan Stalets


I'm a new PyArrow user and have been investigating occasional errors related to the Python exception "ArrowInvalid: Exceeded maximum rows" when parsing JSON-lines files using pyarrow.json.read_json(). Digging in, the original source of this exception appears to be cpp/src/arrow/json/parser.cc on line 703, which returns the error when the number of rows parsed exceeds kMaxParserNumRows.

 
{code:cpp}
    for (; num_rows_ < kMaxParserNumRows; ++num_rows_) {
      auto ok = reader.Parse<parse_flags>(json, handler);
      switch (ok.Code()) {
        case rj::kParseErrorNone:
          // parse the next object
          continue;
        case rj::kParseErrorDocumentEmpty:
          // parsed all objects, finish
          return Status::OK();
        case rj::kParseErrorTermination:
          // handler emitted an error
          return handler.Error();
        default:
          // rj emitted an error
          return ParseError(rj::GetParseError_En(ok.Code()), " in row ", num_rows_);
      }
    }
    return Status::Invalid("Exceeded maximum rows");
  }{code}

This constant appears to be defined in cpp/src/arrow/json/parser.h on line 53 and has had this value since that file's initial commit.

 
{code:cpp}
constexpr int32_t kMaxParserNumRows = 100000;{code}

There does not appear to be a comment in the code, the commit, or the associated PR explaining this maximum number of rows.

 

I'm wondering what the reason for this maximum might be, and whether it could be removed, increased, or made overridable from the C++ API and the upstream Python bindings. It is common to need to process JSON files of arbitrary length (application logs, third-party vendor exports, etc.) where the consumer of the data has no control over the size of the file.
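
For context, here is a minimal sketch of the kind of input that seems to hit the limit, along with the only mitigation I have found so far: shrinking ReadOptions.block_size so each parsed block holds fewer rows. The file name, row count, and block sizes below are illustrative assumptions rather than values from a real workload, and my reading that the cap applies per parsed block is inferred from the code above, not from documentation.

{code:python}
# Sketch only: file name, row count, and block sizes are illustrative assumptions.
import pyarrow.json as pj

# Write ~200,000 very small JSON records (roughly 10 bytes each, ~2 MB total).
with open("tiny_rows.jsonl", "w") as f:
    for i in range(200_000):
        f.write('{"a":%d}\n' % i)

# A block size larger than the whole file keeps every row in a single parser
# block, which appears to drive num_rows_ past kMaxParserNumRows and surface
# pyarrow.lib.ArrowInvalid: Exceeded maximum rows.
big_blocks = pj.ReadOptions(block_size=16 << 20)  # 16 MiB
# pj.read_json("tiny_rows.jsonl", read_options=big_blocks)  # raises ArrowInvalid

# Shrinking block_size so each block holds well under 100,000 rows appears to
# avoid the error, but the right value depends on the typical row size.
small_blocks = pj.ReadOptions(block_size=256 * 1024)  # 256 KiB
table = pj.read_json("tiny_rows.jsonl", read_options=small_blocks)
print(table.num_rows)
{code}

Tuning block_size per input works as a stopgap, but it has to be re-derived whenever row sizes change, which is why a larger, removable, or configurable limit would be preferable.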



--
This message was sent by Atlassian Jira
(v8.3.4#803005)