Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/18 02:27:45 UTC

[GitHub] [arrow] mofeiatwork opened a new issue #12459: Any easy way to convert parquet to JSON and access nested structures?

mofeiatwork opened a new issue #12459:
URL: https://github.com/apache/arrow/issues/12459


   I'm working on a project that converts Parquet to JSON through Apache Arrow, but I'm confused about how to do it.
   
   One intuitive way is to iterate over the various `Array` types directly and construct a JSON object from them. For `PrimitiveArray` that's easy, but it seems very hard for nested arrays, since they do not provide any `Iterator` interface that yields a `Struct`, `List`, or `Map` value, only the underlying stored array.
   
   I have looked through the examples and documentation for a solution but found nothing.
   
   To sum up: is there an interface for iterating over nested structures in a row-oriented style, instead of operating on the underlying array? Or is there an existing interface for converting an Arrow array to JSON?
   
   Thanks!
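For readers following along, a rough illustration of what "only the underlying stored array" means: in Arrow's columnar layout a list column is an offsets buffer plus one flattened child array, and row `i` is the child range `[offsets[i], offsets[i+1])`. The sketch below uses a hypothetical `ToyListColumn` struct (plain standard C++, not an Arrow class) to show how a row-oriented view can be derived from that layout. With the real library, `arrow::ListArray::value_offset(i)`, `value_length(i)`, and `values()` should expose the same pieces, but check the exact calls against the API docs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a list<int64> column. NOT an Arrow class;
// it only mirrors the offsets + flattened-values layout. Nulls are ignored.
struct ToyListColumn {
  std::vector<int32_t> offsets;  // rows + 1 entries, non-decreasing
  std::vector<int64_t> values;   // child values of all rows, concatenated

  std::size_t length() const { return offsets.size() - 1; }
};

// Render row `i` as a JSON array by slicing the flattened child values.
std::string RowToJson(const ToyListColumn& col, std::size_t i) {
  assert(i + 1 < col.offsets.size());
  const int32_t begin = col.offsets[i];
  const int32_t end = col.offsets[i + 1];
  std::string out = "[";
  for (int32_t j = begin; j < end; ++j) {
    if (j != begin) out += ",";
    out += std::to_string(col.values[static_cast<std::size_t>(j)]);
  }
  out += "]";
  return out;
}
```

With `offsets = {0, 2, 2, 5}` and `values = {1, 2, 3, 4, 5}`, rows 0, 1, and 2 render as `[1,2]`, `[]`, and `[3,4,5]`.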


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1043940744


   > I think making the API for accessing nested structures more user-friendly is a general requirement for Arrow?
   
   Two thoughts on this:
   
    1. Arrow is mostly designed to excel at processing data in a columnar way, so that's where most attention has focused.
    2. A lot more effort has gone to making the higher-level bindings (Python, Ruby, R) easy, because that's where most of our users are. See related discussion in ARROW-8709.
   
   I don't see any existing JIRA issues for creating more convenient iterator interfaces to nested columns, so if you or someone wanted to create one and/or work on it, that would be welcomed.





[GitHub] [arrow] westonpace commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1048257370


   A two stage approach is what I had in mind as well.  So what you are describing sounds correct to me.  Although I had thought the stages would be reversed.  That being said, I have not given this problem a whole lot of thought and I'm not sure I understand the details of your proposal.
   
   If you submit a PR I will be happy to provide further feedback.
   
   If you run into trouble then I think a row-based converter would still be a good starting point and I would be happy to provide feedback on a PR for a row-based converter too.





[GitHub] [arrow] mofeiatwork commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
mofeiatwork commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1045610272


   After reading the Arrow documentation and code, I have implemented a basic row-based converter from Arrow arrays to JSON, which handles most primitive types and the nested types struct, list, and map.
   
   Of course it's less efficient than a columnar-style builder, so I will keep improving it.
   
   I think the key question is how to convert nested types into tree-structured JSON through a row-oriented JSON builder, since most JSON builder implementations (like rapidjson and simdjson) are row-oriented and build JSON documents one at a time. A basic idea is to iterate over the nested data recursively and build the JSON tree at the same time. The limitation is that iterating over nested data involves a lot of branching, which reduces performance.
   
   So my idea is to separate JSON building into two stages:
   - Schema building: build the JSON tree structure according to the Arrow schema
   - Leaf filling: fill the leaf nodes of the JSON tree from the Arrow arrays
   
   In this way, most branches could be eliminated, and access to the Arrow arrays would be cache-friendly.
   
   What do you think about it?
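One possible reading of the two-stage idea above, as a minimal sketch rather than the actual implementation discussed in this thread: stage one walks the schema once and precomputes the constant JSON text around each leaf; stage two fills in leaf values one column at a time. `ToyField`, `JsonTemplate`, `BuildTemplate`, and `FillLeaves` are hypothetical names, not Arrow APIs. The sketch assumes struct-only nesting with int64 leaves and no nulls; lists and maps have a variable shape per row and would need more than a fixed template.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical schema node: a field with children is a struct,
// a field without children is an int64 leaf.
struct ToyField {
  std::string name;
  std::vector<ToyField> children;
};

// fragments[k] is the constant text emitted before leaf k (depth-first order);
// the last fragment closes the document. Field names are not escaped here.
struct JsonTemplate {
  std::vector<std::string> fragments;
};

// Stage 1 helper: append the text for one struct level to the template.
void BuildStruct(const std::vector<ToyField>& fields, JsonTemplate* t) {
  t->fragments.back() += "{";
  for (std::size_t i = 0; i < fields.size(); ++i) {
    if (i > 0) t->fragments.back() += ",";
    t->fragments.back() += "\"" + fields[i].name + "\":";
    if (fields[i].children.empty()) {
      t->fragments.emplace_back();  // a leaf value goes between two fragments
    } else {
      BuildStruct(fields[i].children, t);
    }
  }
  t->fragments.back() += "}";
}

// Stage 1: schema building. All structural branching happens here, once.
JsonTemplate BuildTemplate(const std::vector<ToyField>& fields) {
  JsonTemplate t;
  t.fragments.emplace_back();
  BuildStruct(fields, &t);
  return t;
}

// Stage 2: leaf filling. Each leaf column is read sequentially.
std::vector<std::string> FillLeaves(
    const JsonTemplate& t, const std::vector<std::vector<int64_t>>& leaves) {
  assert(t.fragments.size() == leaves.size() + 1);
  const std::size_t rows = leaves.empty() ? 0 : leaves[0].size();
  std::vector<std::string> out(rows);
  for (std::size_t k = 0; k < leaves.size(); ++k) {
    assert(leaves[k].size() == rows);
    for (std::size_t r = 0; r < rows; ++r) {
      out[r] += t.fragments[k];
      out[r] += std::to_string(leaves[k][r]);
    }
  }
  for (std::size_t r = 0; r < rows; ++r) out[r] += t.fragments.back();
  return out;
}
```

For a schema `{id, pos: {x, y}}` and leaf columns `{1, 2}`, `{10, 20}`, `{30, 40}`, the first row comes out as `{"id":1,"pos":{"x":10,"y":30}}`. The trade-off is that every row's string is revisited once per leaf column, so whether this beats a recursive row-by-row walk would need measuring.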





[GitHub] [arrow] westonpace commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1045068724


   There is an open request (ARROW-5033) for a line-delimited JSON writer (see https://jsonlines.org/), so if you do come up with a general solution and submit a PR, that would be helpful.  Converting values to strings in a row-based fashion would be rather inefficient.  A better approach would be to cast all leaf arrays to string and then construct the JSON lines in a row-based fashion.  The CSV writer can give a general idea of how to do this, but the CSV writer doesn't have to work with nested data types (there is no way to represent those in CSV).
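A minimal sketch of the "cast leaves to string first, then assemble rows" approach suggested above, for flat (non-nested) columns only. Everything here is a hypothetical toy in plain standard C++, not the Arrow CSV writer or any Arrow API: `CastToJson` stands in for a columnar cast to string, and `WriteJsonLines` for the row-based assembly step. The string escaping covers only quotes, backslashes, and newlines, which is less than a complete JSON encoder needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Quote a string as a JSON string literal (partial escaping only).
std::string EscapeJson(const std::string& s) {
  std::string out = "\"";
  for (char c : s) {
    if (c == '"' || c == '\\') {
      out += '\\';
      out += c;
    } else if (c == '\n') {
      out += "\\n";
    } else {
      out += c;
    }
  }
  out += '"';
  return out;
}

// Step 1: columnar "casts". Each column is converted in one tight loop.
std::vector<std::string> CastToJson(const std::vector<int64_t>& column) {
  std::vector<std::string> out;
  out.reserve(column.size());
  for (int64_t v : column) out.push_back(std::to_string(v));
  return out;
}

std::vector<std::string> CastToJson(const std::vector<std::string>& column) {
  std::vector<std::string> out;
  out.reserve(column.size());
  for (const std::string& v : column) out.push_back(EscapeJson(v));
  return out;
}

// Step 2: row-based assembly. Only string concatenation is left to do here,
// with no per-value type dispatch.
std::string WriteJsonLines(const std::vector<std::string>& names,
                           const std::vector<std::vector<std::string>>& cols) {
  assert(names.size() == cols.size());
  const std::size_t rows = cols.empty() ? 0 : cols[0].size();
  std::string out;
  for (std::size_t r = 0; r < rows; ++r) {
    out += "{";
    for (std::size_t c = 0; c < cols.size(); ++c) {
      assert(cols[c].size() == rows);
      if (c > 0) out += ",";
      out += EscapeJson(names[c]) + ":" + cols[c][r];
    }
    out += "}\n";
  }
  return out;
}
```

Calling `WriteJsonLines({"id", "name"}, {CastToJson(ids), CastToJson(labels)})` on an int64 column `ids` and a string column `labels` yields one JSON object per line. Nested types are exactly where this stops being enough, as noted above for the CSV writer.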





[GitHub] [arrow] wjones127 commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1043761323


   Hi @mofeiatwork, thanks for your question. Could you clarify which language / library you are using? (We have quite a few different ones supported :smile:) That will make it easier to find people who can answer.





[GitHub] [arrow] mofeiatwork commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
mofeiatwork commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1043772515


   I'm using the Arrow C++ library to work with Parquet files.





[GitHub] [arrow] wjones127 commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1043804251


   Have you seen the example at https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html?
   I think `ColumnarTableToVector` is relevant; it shows an example of handling a list column.





[GitHub] [arrow] mofeiatwork commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
mofeiatwork commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1043882756


   This example uses a simple, static schema, which can be constructed easily.
   
   But in more complicated circumstances, such as a dynamic schema (with no prior knowledge of it) or multi-level nested structures, converting from columnar to row format is not easy.
   
   I think making the API for accessing nested structures more user-friendly is a general requirement for Arrow?





[GitHub] [arrow] mofeiatwork commented on issue #12459: Any easy way to convert parquet to JSON and access nested structures?

Posted by GitBox <gi...@apache.org>.
mofeiatwork commented on issue #12459:
URL: https://github.com/apache/arrow/issues/12459#issuecomment-1043958801


   Got it. [ARROW-8709](https://issues.apache.org/jira/browse/ARROW-8709) explains it clearly.
   
   I'll find a workaround for my needs. If it turns out to be a general solution, I'll submit a proposal and discuss it with you.

