Posted to issues@arrow.apache.org by "christiangiessleraracom (via GitHub)" <gi...@apache.org> on 2023/06/14 08:37:00 UTC

[GitHub] [arrow] christiangiessleraracom opened a new issue, #36060: Unexpected Behaviour of nullable fields

christiangiessleraracom opened a new issue, #36060:
URL: https://github.com/apache/arrow/issues/36060

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I have the following problem:
   If an optional field is not present in the JSON but is declared in the schema, it is still created in the PyArrow table (with null values), including all nested fields declared in the schema.
   Is this the intended behaviour, or is there an option so that fields missing from the JSON are also left out of the table?
   
   
   Here is a data example. Our production data schema is of course much more complex and more deeply nested, but this illustrates what I am doing:
   
   Schema (all fields are nullable):
   ```
   field1: struct<subfield1: double, subfield2: double>
   field2: timestamp[ms]
   field3: double
   
   ```
   
   json file:
   ```json
   {
     "field3": 123.4
   }
   ```
   
   Python code handling the data:
   
   ```python
   import pyarrow.json as pajson
   import pyarrow.parquet as pq

   # pa_schema, tmp_file_name, hashvalue and dataset_path are defined elsewhere.
   read_options = pajson.ReadOptions(block_size=1600000000)

   parse_options = pajson.ParseOptions(
       explicit_schema=pa_schema,
       unexpected_field_behavior="ignore",
   )

   # Read the JSON file against the explicit schema.
   table = pajson.read_json(
       tmp_file_name, read_options=read_options, parse_options=parse_options
   )

   # Write the table out as a Parquet dataset.
   pq.write_to_dataset(
       table=table,
       root_path=dataset_path,
       basename_template=hashvalue + ".parquet",
       existing_data_behavior="overwrite_or_ignore",
       schema=pa_schema,
   )
   ```
   
   Table debug output from evaluating the table in PyCharm:
   ```
   column_names: ['field1', 'field2', 'field3']
   columns:
   [
     -- is_valid:
         [
         false
       ]
     -- child 0 type: double
       [
         null
       ]
     -- child 1 type: double
       [
         null
       ]
   ]
   [
     [
       null
     ]
   ]
   [
     [
       123.4
     ]
   ]
   
   ```
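   
   For completeness, here is a minimal self-contained sketch that reproduces the behaviour without the temp file or the Parquet writing (the schema and JSON are taken from the example above; `io.BytesIO` stands in for the file on disk):
   
   ```python
   import io
   
   import pyarrow as pa
   import pyarrow.json as pajson
   
   # Schema from the example above; fields are nullable by default.
   pa_schema = pa.schema([
       ("field1", pa.struct([("subfield1", pa.float64()),
                             ("subfield2", pa.float64())])),
       ("field2", pa.timestamp("ms")),
       ("field3", pa.float64()),
   ])
   
   # In-memory stand-in for the JSON file.
   data = io.BytesIO(b'{"field3": 123.4}\n')
   
   parse_options = pajson.ParseOptions(
       explicit_schema=pa_schema,
       unexpected_field_behavior="ignore",
   )
   table = pajson.read_json(data, parse_options=parse_options)
   
   print(table.column_names)                 # ['field1', 'field2', 'field3']
   print(table.column("field1").null_count)  # 1 -> field1 exists but is entirely null
   ```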
   
   ### Component(s)
   
   Python


[GitHub] [arrow] js8544 commented on issue #36060: Unexpected Behaviour of nullable fields

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #36060:
URL: https://github.com/apache/arrow/issues/36060#issuecomment-1590759938

   Hi, this is expected because all columns in a table object must have the same length.
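   
   A minimal sketch of that constraint (not from the thread; error type assumed to be `pa.ArrowInvalid`): because every column must be as long as the table, a field missing from the JSON has to be materialised as an all-null column of the right length.
   
   ```python
   import pyarrow as pa
   
   # Equal-length columns build fine.
   pa.table({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
   
   # Columns of different lengths are rejected at construction time.
   try:
       pa.table({"a": [1, 2, 3], "b": [4.0, 5.0]})
   except pa.ArrowInvalid as exc:
       print(exc)  # reports the column length mismatch
   ```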


[GitHub] [arrow] christiangiessleraracom commented on issue #36060: Unexpected Behaviour of nullable fields

Posted by "christiangiessleraracom (via GitHub)" <gi...@apache.org>.
christiangiessleraracom commented on issue #36060:
URL: https://github.com/apache/arrow/issues/36060#issuecomment-1590861582

   > If you don't pass an explicit_schema to read_json, the resulting table won't contain any null columns. If you must set a schema, you can use this piece of code to drop all null columns after the table is created:
   > 
   > ```python
   > all_null_columns = []
   > for col, name in zip(table.columns, table.column_names):
   >     if col.null_count == col.length():
   >         all_null_columns.append(name)
   > table = table.drop_columns(all_null_columns)
   > ```
   > 
   > Skipping all null columns at parsing time just doesn't feel right, because if an explicit_schema is set, a table with that schema should be expected, right?
   
   Thanks. Yes, I have to pass a schema; I'll try your code.


[GitHub] [arrow] js8544 commented on issue #36060: Unexpected Behaviour of nullable fields

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #36060:
URL: https://github.com/apache/arrow/issues/36060#issuecomment-1590855868

   If you don't pass an explicit_schema to read_json, the resulting table won't contain any null columns. If you must set a schema, you can use this piece of code to drop all null columns after the table is created:
   ```python
   all_null_columns = []
   for col, name in zip(table.columns, table.column_names):
       if col.null_count == col.length():
           all_null_columns.append(name)
   table = table.drop_columns(all_null_columns)
   ```
   
   Skipping all null columns at parsing time just doesn't feel right, because if an explicit_schema is set, a table with that schema should be expected, right?
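   
   For what it's worth, the snippet above could be wrapped in a small helper and applied right after `read_json`. A sketch (the helper name is mine, not a pyarrow API):
   
   ```python
   import pyarrow as pa
   
   def drop_all_null_columns(table: pa.Table) -> pa.Table:
       # Collect every column whose values are all null, then drop them.
       all_null = [
           name
           for name, col in zip(table.column_names, table.columns)
           if col.null_count == len(col)
       ]
       return table.drop_columns(all_null)
   ```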


[GitHub] [arrow] christiangiessleraracom commented on issue #36060: Unexpected Behaviour of nullable fields

Posted by "christiangiessleraracom (via GitHub)" <gi...@apache.org>.
christiangiessleraracom commented on issue #36060:
URL: https://github.com/apache/arrow/issues/36060#issuecomment-1590776094

   > Hi, this is expected because all columns in a table object must have the same length.
   
   Yes, the length requirement is clear. But is there a way to keep columns that contain no values at all out of the table? If a field is not there, it has no length either.

