Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/09/20 09:34:00 UTC
[jira] [Commented] (ARROW-6623) [CI][Python] Dask docker integration test broken perhaps by statistics-related change
[ https://issues.apache.org/jira/browse/ARROW-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934241#comment-16934241 ]
Joris Van den Bossche commented on ARROW-6623:
----------------------------------------------
I opened an issue on the dask tracker: https://github.com/dask/dask/issues/5418
There are actually two errors that happen (when rerunning the test multiple times). At least one of them is due to the recent {{schema}} changes I made (the index now needs to be included in the specified schema).
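To illustrate that first error (a minimal sketch, not the actual dask fix; the {{index}} field name is an assumption about how the default index gets written, the other fields are taken from the failing test below):
{code}
import pyarrow as pa

# Hypothetical schema for the test's dataframe: after the recent changes,
# a schema passed on write also has to cover the index column.
schema = pa.schema(
    [
        ("index", pa.int64()),  # assumed name of the written index column
        ("x", pa.float64()),
        ("timestamp", pa.timestamp("ns")),
        ("id", pa.int64()),
        ("name", pa.string()),
        ("y", pa.float64()),
    ]
)
# e.g. ddf2.to_parquet(tmp_path, schema=schema, engine="pyarrow")
{code}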
The other might be related to the change in statistics for null columns: https://github.com/apache/arrow/pull/5403 (ARROW-6623)
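To illustrate that second error (a minimal sketch with made-up values; the real ones come from row-group statistics): if the min/max statistics of different row groups come back with mismatched types, dask's {{sorted_columns}} ends up ordering an ndarray against a str, which fails exactly as in the traceback below:
{code}
import numpy as np

# Made-up stand-ins for min/max statistics of mismatched types
max_value = "Alice"                       # str reported by one row group
min_value = np.array([0], dtype="int64")  # ndarray reported by another

# Both operands return NotImplemented for the ordering comparison, so
# Python raises:
# TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'str'
min_value >= max_value
{code}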
> [CI][Python] Dask docker integration test broken perhaps by statistics-related change
> -------------------------------------------------------------------------------------
>
> Key: ARROW-6623
> URL: https://issues.apache.org/jira/browse/ARROW-6623
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.15.0
>
>
> See the new failure:
> https://circleci.com/gh/ursa-labs/crossbow/3027?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
> {code}
> =================================== FAILURES ===================================
> ___________________ test_timeseries_nulls_in_schema[pyarrow] ___________________
>
> tmpdir = local('/tmp/pytest-of-root/pytest-0/test_timeseries_nulls_in_schem0')
> engine = 'pyarrow'
>
>     def test_timeseries_nulls_in_schema(tmpdir, engine):
>         tmp_path = str(tmpdir)
>         ddf2 = (
>             dask.datasets.timeseries(start="2000-01-01", end="2000-01-03", freq="1h")
>             .reset_index()
>             .map_partitions(lambda x: x.loc[:5])
>         )
>         ddf2 = ddf2.set_index("x").reset_index().persist()
>         ddf2.name = ddf2.name.where(ddf2.timestamp == "2000-01-01", None)
>
>         ddf2.to_parquet(tmp_path, engine=engine)
>         ddf_read = dd.read_parquet(tmp_path, engine=engine)
>
>         assert_eq(ddf_read, ddf2, check_divisions=False, check_index=False)
>
>         # Can force schema validation on each partition in pyarrow
>         if engine == "pyarrow":
>             # The schema mismatch should raise an error
>             with pytest.raises(ValueError):
>                 ddf_read = dd.read_parquet(
>                     tmp_path, dataset={"validate_schema": True}, engine=engine
>                 )
>             # There should be no error if you specify a schema on write
>             schema = pa.schema(
>                 [
>                     ("x", pa.float64()),
>                     ("timestamp", pa.timestamp("ns")),
>                     ("id", pa.int64()),
>                     ("name", pa.string()),
>                     ("y", pa.float64()),
>                 ]
>             )
>             ddf2.to_parquet(tmp_path, schema=schema, engine=engine)
>             assert_eq(
> >               dd.read_parquet(tmp_path, dataset={"validate_schema": True}, engine=engine),
>                 ddf2,
>                 check_divisions=False,
>                 check_index=False,
>             )
>
> opt/conda/lib/python3.6/site-packages/dask/dataframe/io/tests/test_parquet.py:1964:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> opt/conda/lib/python3.6/site-packages/dask/dataframe/io/parquet/core.py:190: in read_parquet
>     out = sorted_columns(statistics)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>
> statistics = ({'columns': [{'max': -0.25838390663957256, 'min': -0.979681447427093, 'name': 'x', 'null_count': 0}, {'max': Timestam...ull_count': 0}, {'max': 0.8978352477516438, 'min': -0.7218571212693894, 'name': 'y', 'null_count': 0}], 'num-rows': 7})
>
>     def sorted_columns(statistics):
>         """ Find sorted columns given row-group statistics
>
>         This finds all columns that are sorted, along with appropriate divisions
>         values for those columns
>
>         Returns
>         -------
>         out: List of {'name': str, 'divisions': List[str]} dictionaries
>         """
>         if not statistics:
>             return []
>
>         out = []
>         for i, c in enumerate(statistics[0]["columns"]):
>             if not all(
>                 "min" in s["columns"][i] and "max" in s["columns"][i] for s in statistics
>             ):
>                 continue
>             divisions = [c["min"]]
>             max = c["max"]
>             success = True
>             for stats in statistics[1:]:
>                 c = stats["columns"][i]
> >               if c["min"] >= max:
> E               TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'str'
>
> opt/conda/lib/python3.6/site-packages/dask/dataframe/io/parquet/core.py:570: TypeError
> {code}