Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/02/08 12:50:01 UTC
[jira] [Commented] (ARROW-10067) [Python] Manual dataset with timestamp partition type error

    [ https://issues.apache.org/jira/browse/ARROW-10067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281015#comment-17281015 ] 

Joris Van den Bossche commented on ARROW-10067:
-----------------------------------------------

[~josham] sorry for the _very_ late response, but thanks for the report!

Indeed, it seems that currently, when creating a FileSystemDataset manually, you need to ensure that the type used in the partition expression matches the type in the schema.
A workaround for now is to manually ensure that your partition expressions use an explicitly typed scalar, for example microsecond resolution ({{ds.field("date") == pa.scalar(pd.Timestamp("2018-01-01"), pa.timestamp("us"))}}), chosen to match the schema. Since this is a lower-level function, I am not sure how flexible we want to be here, or whether we would rather be more strict in matching input types.

We have work to do in general to be more flexible regarding schema evolution, see ARROW-11003 for the umbrella issue.

> [Python] Manual dataset with timestamp partition type error
> -----------------------------------------------------------
>
>                 Key: ARROW-10067
>                 URL: https://issues.apache.org/jira/browse/ARROW-10067
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Josh
>            Priority: Minor
>
> Following the docs [https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset], but using date partitioning instead: if you create the partitions using pandas Timestamps, you get timestamp[ns] vs timestamp[us] type errors.
>  
> {code:python}
> import tempfile
> import pathlib
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> from pyarrow import fs
> base = pathlib.Path(tempfile.gettempdir())
> table = pa.table({"col1": range(3), "col2": np.random.randn(3)})
> (base / "parquet_dataset_manual").mkdir(exist_ok=True)
> pq.write_table(table, base / "parquet_dataset_manual" / "data_20180101.parquet")
> pq.write_table(table, base / "parquet_dataset_manual" / "data_20180102.parquet")
> schema = pa.schema([("date", pa.timestamp("ns")), ("col1", pa.int64()), ("col2", pa.float64())])
> dataset = ds.FileSystemDataset.from_paths(
>     ["data_20180101.parquet", "data_20180102.parquet"],
>     schema=schema,
>     format=ds.ParquetFileFormat(),
>     filesystem=fs.SubTreeFileSystem(str(base / "parquet_dataset_manual"), fs.LocalFileSystem()),
>     partitions=[ds.field("date") == pd.Timestamp("2018-01-01"), ds.field("date") == pd.Timestamp("2018-01-02")],
> )
> print(dataset.to_table().to_pandas())
> # pyarrow.lib.ArrowTypeError: field date: timestamp[ns] cannot be materialized from scalar of type timestamp[us]
> print(dataset.to_table(filter=ds.field("date") == pd.Timestamp("2018-01-01")).to_pandas())
> # ../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: timestamp[ns] vs timestamp[us]
> dataset = ds.FileSystemDataset.from_paths(
>     ["data_20180101.parquet", "data_20180102.parquet"],
>     schema=schema,
>     format=ds.ParquetFileFormat(),
>     filesystem=fs.SubTreeFileSystem(str(base / "parquet_dataset_manual"), fs.LocalFileSystem()),
>     partitions=[
>         ds.field("date") == pa.scalar(pd.Timestamp("2018-01-01"), pa.timestamp("ns")),
>         ds.field("date") == pa.scalar(pd.Timestamp("2018-01-02"), pa.timestamp("ns")),
>     ],
> )
> print(dataset.to_table().to_pandas())
> print(dataset.to_table(filter=ds.field("date") == pd.Timestamp("2018-01-01")).to_pandas())
> {code}


