Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/06/07 16:03:00 UTC

[jira] [Reopened] (ARROW-12988) [CI] The kartothek nightly integration build is failing (test_update_dataset_from_ddf_empty)

     [ https://issues.apache.org/jira/browse/ARROW-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reopened ARROW-12988:
-------------------------------------------

Will leave this open as a reminder for myself to revert this fix before 5.0.0 (this can be done once the next dask version with the bugfix is released: https://github.com/dask/dask/pull/7769)
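
For context, the underlying dask bug can be reproduced with a minimal sketch along these lines (assuming a dask version that does not yet include the fix from https://github.com/dask/dask/pull/7769):

{code}
import dask.dataframe as dd

# In the affected dask versions, from_delayed() accesses dfs[0] to infer the
# parent metadata before checking whether the list of partitions is empty,
# so passing an empty list raises IndexError instead of a clearer error.
ddf = dd.from_delayed([], meta=(("a", int),))  # IndexError: list index out of range
{code}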

> [CI] The kartothek nightly integration build is failing (test_update_dataset_from_ddf_empty)
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12988
>                 URL: https://issues.apache.org/jira/browse/ARROW-12988
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Continuous Integration, Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 5.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The nightly "kartothek" integration builds are failing.
> More specifically, the {{test_update_dataset_from_ddf_empty}} test is failing with:
> {code}
> =================================== FAILURES ===================================
> ___________________ test_update_dataset_from_ddf_empty[True] ___________________
> store_factory = functools.partial(<function get_store_from_url at 0x7f1434733050>, 'hfs:///tmp/pytest-of-root/pytest-0/test_update_dataset_from_ddf_e0/store')
> shuffle = True
>     @pytest.mark.parametrize("shuffle", [True, False])
>     def test_update_dataset_from_ddf_empty(store_factory, shuffle):
>         with pytest.raises(ValueError, match="Cannot store empty datasets"):
>             update_dataset_from_ddf(
> >               dask.dataframe.from_delayed([], meta=(("a", int),)),
>                 store_factory,
>                 dataset_uuid="output_dataset_uuid",
>                 table="core",
>                 shuffle=shuffle,
>                 partition_on=["a"],
>             ).compute()
> tests/io/dask/dataframe/test_update.py:57: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> dfs = [], meta = (('a', <class 'int'>),), divisions = None
> prefix = 'from-delayed', verify_meta = True
>     @insert_meta_param_description
>     def from_delayed(
>         dfs, meta=None, divisions=None, prefix="from-delayed", verify_meta=True
>     ):
>         """Create Dask DataFrame from many Dask Delayed objects
>     
>         Parameters
>         ----------
>         dfs : list of Delayed
>             An iterable of ``dask.delayed.Delayed`` objects, such as come from
>             ``dask.delayed`` These comprise the individual partitions of the
>             resulting dataframe.
>         $META
>         divisions : tuple, str, optional
>             Partition boundaries along the index.
>             For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions
>             For string 'sorted' will compute the delayed values to find index
>             values.  Assumes that the indexes are mutually sorted.
>             If None, then won't use index information
>         prefix : str, optional
>             Prefix to prepend to the keys.
>         verify_meta : bool, optional
>             If True check that the partitions have consistent metadata, defaults to True.
>         """
>         from dask.delayed import Delayed
>     
>         if isinstance(dfs, Delayed):
>             dfs = [dfs]
>         dfs = [
>             delayed(df) if not isinstance(df, Delayed) and hasattr(df, "key") else df
>             for df in dfs
>         ]
>         for df in dfs:
>             if not isinstance(df, Delayed):
>                 raise TypeError("Expected Delayed object, got %s" % type(df).__name__)
>     
> >       parent_meta = delayed(make_meta)(dfs[0]).compute()
> E       IndexError: list index out of range
> /opt/conda/envs/arrow/lib/python3.7/site-packages/dask/dataframe/io/io.py:591: IndexError
> {code}
> (from https://github.com/ursacomputing/crossbow/runs/2756067090)
> Not immediately sure whether this is a kartothek issue or a pyarrow issue, but I have also created an issue on their side: https://github.com/JDASoftwareGroup/kartothek/issues/475



--
This message was sent by Atlassian Jira
(v8.3.4#803005)