Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/15 20:31:11 UTC
[GitHub] [arrow] mishbahr opened a new issue #11967: Parquet schema / data type for entire null object DataFrame columns
mishbahr opened a new issue #11967:
URL: https://github.com/apache/arrow/issues/11967
I'm writing a DataFrame to binary Parquet format with one or more entirely null object columns.
If I then load the parquet dataset with `use_legacy_dataset=False`:
```python
parquet_dataset = pq.ParquetDataset(root_path, use_legacy_dataset=False, **kwargs)
type(parquet_dataset)
pyarrow.parquet._ParquetDatasetV2
```
It returns a `_ParquetDatasetV2` instance, and when I check the schema:
```python
type(parquet_dataset.schema)
pyarrow.lib.Schema
```
If I load the same file but with `use_legacy_dataset=True`:
```python
parquet_dataset2 = pq.ParquetDataset(root_path, use_legacy_dataset=True, **kwargs)
```
The schema for the file is an instance of `ParquetSchema`:
```python
type(parquet_dataset2.schema)
pyarrow._parquet.ParquetSchema
```
This is as I would expect, and I'm aware that I can get the Arrow schema like this:
```python
arrow_schema = parquet_dataset2.schema.to_arrow_schema()
type(arrow_schema)
pyarrow.lib.Schema
```
i.e. the same format as when I use `use_legacy_dataset=False`.
For an instance of `ParquetSchema`, I can get details of any column, e.g.:
```python
parquet_dataset2.schema[13]
<ParquetColumnSchema>
name: col13
path: col13
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
```
Here the "physical_type" for this column is INT96.
```python
parquet_dataset2.schema[13].physical_type
'INT96'
```
For an instance of `pyarrow.lib.Schema`, if I get the "data type" for the same column:
```python
parquet_dataset.schema.field("col13").type
DataType(null)
```
i.e. there is no information about what the data type is supposed to be.
This information is available in the Parquet file, but how do I access it?
Is there a way to convert an instance of `pyarrow.lib.Schema` to `pyarrow._parquet.ParquetSchema`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns
westonpace commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995263886
Copying my question from the gist:
What is `parquet_file.schema[0].logical_type`? For me, if I do not specify a schema, it is Null (which is different from None). In your first snippet the logical type is None, so I assume you are specifying the schema when writing.
Perhaps you have some files with a Null logical type and some with a None logical type? That could explain the behavior, as the new datasets API infers the schema from a single file (picked more or less at random). So if it picked one of the null ones, you may end up with the behavior you are describing.
[GitHub] [arrow] mishbahr commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns
mishbahr commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995287905
```python
parquet_file.schema[0]
```
```python
<ParquetColumnSchema>
name: x
path: x
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Null
converted_type (legacy): NONE
```
Just in case there is some issue with my data source, I'm now using `df = pd.DataFrame(data={"x": [None, ]})` as input.
[GitHub] [arrow] mishbahr closed issue #11967: Parquet schema / data type for entire null object DataFrame columns
mishbahr closed issue #11967:
URL: https://github.com/apache/arrow/issues/11967
[GitHub] [arrow] jorisvandenbossche commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns
jorisvandenbossche commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995708589
> Where does pyarrow get `INT32` as the "physical_type" when the column is completely empty (only null values)?
For Parquet, you need to distinguish the "physical_type" and "logical_type" (as shown in the output of the `ParquetColumnSchema`, this is "INT32" vs "Null" for this column of all nulls).
Parquet only has a limited set of physical types (see https://github.com/apache/parquet-format#types), and "Null" is not among them; it is only a logical type, and a logical type always "annotates" some actual physical type.
So when Arrow saves a "null" column (in Arrow this is an actual, proper type) to Parquet, it can use a "Null" logical type, but it still needs to choose some physical type for the column in the Parquet file. And by default, Arrow uses INT32 for the physical type.
That explains where the "INT32" physical type is coming from. But in general, I think you don't need to care about this and can ignore it. When reading the Parquet file into an Arrow table, we will correctly notice the "Null" logical type and create a "null" column in the resulting Arrow table (basically ignoring the INT32 physical type of the Parquet field).
[GitHub] [arrow] mishbahr commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns
mishbahr commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995718849
Ok, thanks for the explanation.
I think I am going to change my code to use the v2 dataset API (`use_legacy_dataset=False`)
and then throw an error when `field.type == pa.null()`, asking the user to cast the DataFrame column to a specific type,
e.g. `df['x'] = df['x'].astype('string')`.
[GitHub] [arrow] mishbahr commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns
mishbahr commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995697816
Ignoring my initial code sample, where I was using a real Parquet file as the source, this is a fresh example using `df = pd.DataFrame(data={"col1": [None, ], "col2": ["foo1", ]})` as the starting point.
Where does pyarrow get `INT32` as the "physical_type" when the column is completely empty (only null values)?
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(data={"col1": [None, ], "col2": ["foo1", ]})
table = pa.Table.from_pandas(df)
pq.write_table(table, '/tmp/data.parquet')

legacy_dataset = pq.ParquetDataset('/tmp/data.parquet', use_legacy_dataset=True)
dataset = pq.ParquetDataset('/tmp/data.parquet', use_legacy_dataset=False)
```
```python
legacy_dataset.schema
<pyarrow._parquet.ParquetSchema object at 0x7efc1dc51a40>
required group field_id=-1 schema {
  optional int32 field_id=-1 col1 (Null);
  optional binary field_id=-1 col2 (String);
}
```
```python
legacy_dataset.schema[0]
<ParquetColumnSchema>
name: col1
path: col1
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Null
converted_type (legacy): NONE
```
```python
dataset.schema[0].type
DataType(null)
```
[GitHub] [arrow] mishbahr commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns
mishbahr commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995242509
![screenshot-parquet-dataset](https://user-images.githubusercontent.com/1264089/146270547-2a24ec3b-7338-4372-b257-35ec63d01cdd.png)
I recreated the error with a simple test DataFrame.
I think there is a bug in the legacy datasets API.
Where is it getting the `INT32` physical type from?
[GitHub] [arrow] westonpace commented on issue #11967: Parquet schema / data type for entire null object DataFrame columns
westonpace commented on issue #11967:
URL: https://github.com/apache/arrow/issues/11967#issuecomment-995297427
This is a null array represented in parquet:
```
<ParquetColumnSchema>
name: x
path: x
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Null
converted_type (legacy): NONE
```
The original post was not a null array:
```
<ParquetColumnSchema>
name: col13
path: col13
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
```
In particular, the `logical_type` differs between the two. I'm not actually sure what data type the original is (maybe a timestamp?). What is different about how the data is generated? Do all of your files look like the latter (`logical_type: None`), or is it possible that some of your files look like the former?