Posted to jira@arrow.apache.org by "Piotr Żelasko (Jira)" <ji...@apache.org> on 2021/05/04 19:59:00 UTC

[jira] [Comment Edited] (ARROW-12588) Expose JSON schema inference to Python API

    [ https://issues.apache.org/jira/browse/ARROW-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339272#comment-17339272 ] 

Piotr Żelasko edited comment on ARROW-12588 at 5/4/21, 7:58 PM:
----------------------------------------------------------------

Thank you, you solved my issue :) Previously, I had read in the documentation that pa.array() works in "simple cases", so I assumed it wouldn't work for mine. But it did!

In case it's of interest: I have lists of JSON manifests representing different objects in Lhotse [https://github.com/lhotse-speech/lhotse], a library for speech data pipelines that I'm developing, where Arrow currently helps me handle cases where the metadata itself is massive, e.g. for terabyte-sized speech datasets. Some objects can compose others, so the schema is quite complex; for example, these two items can be held in the same manifest:

Item #1

{'id': 'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions': [{'id': 'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0, 'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0, 'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames': 100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start': 0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant', 'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources': [{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate': 16000, 'num_samples': 160000, 'duration': 10.0}, 'type': 'Cut'}

Item #2

{'id': '3693dee0-1ac8-4f5a-a8c1-d6b4f6f80fbb', 'tracks': [{'cut': {'id': 'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions': [{'id': 'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0, 'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0, 'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames': 100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start': 0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant', 'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources': [{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate': 16000, 'num_samples': 160000, 'duration': 10.0}}, 'offset': 0.0}, {'cut': {'id': 'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions': [{'id': 'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0, 'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0, 'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames': 100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start': 0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant', 'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources': [{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate': 16000, 'num_samples': 160000, 'duration': 10.0}}, 'offset': 5.0, 'snr': 8}], 'type': 'MixedCut'}

It turns out that pa.array works just fine with a list of those.



> Expose JSON schema inference to Python API
> ------------------------------------------
>
>                 Key: ARROW-12588
>                 URL: https://issues.apache.org/jira/browse/ARROW-12588
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Piotr Żelasko
>            Priority: Minor
>
> When using `pyarrow.json.read_json()`, the schema is automatically inferred. It would be useful to be able to infer a schema from JSON data that is already loaded in memory (e.g. a list of dicts in Python).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)