Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/06/07 16:03:00 UTC
[jira] [Reopened] (ARROW-12988) [CI] The kartothek nightly integration build is failing (test_update_dataset_from_ddf_empty)
[ https://issues.apache.org/jira/browse/ARROW-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reopened ARROW-12988:
-------------------------------------------
I will leave this open as a reminder to myself to revert this fix before 5.0.0. The revert can be done once the next dask release includes the bugfix: https://github.com/dask/dask/pull/7769
> [CI] The kartothek nightly integration build is failing (test_update_dataset_from_ddf_empty)
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-12988
> URL: https://issues.apache.org/jira/browse/ARROW-12988
> Project: Apache Arrow
> Issue Type: Bug
> Components: Continuous Integration, Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Minor
> Labels: pull-request-available
> Fix For: 5.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> The nightly "kartothek" integration builds are failing.
> More specifically, the {{test_update_dataset_from_ddf_empty}} test is failing with:
> {code}
> =================================== FAILURES ===================================
> ___________________ test_update_dataset_from_ddf_empty[True] ___________________
> store_factory = functools.partial(<function get_store_from_url at 0x7f1434733050>, 'hfs:///tmp/pytest-of-root/pytest-0/test_update_dataset_from_ddf_e0/store')
> shuffle = True
>     @pytest.mark.parametrize("shuffle", [True, False])
>     def test_update_dataset_from_ddf_empty(store_factory, shuffle):
>         with pytest.raises(ValueError, match="Cannot store empty datasets"):
>             update_dataset_from_ddf(
> >               dask.dataframe.from_delayed([], meta=(("a", int),)),
>                 store_factory,
>                 dataset_uuid="output_dataset_uuid",
>                 table="core",
>                 shuffle=shuffle,
>                 partition_on=["a"],
>             ).compute()
> tests/io/dask/dataframe/test_update.py:57:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> dfs = [], meta = (('a', <class 'int'>),), divisions = None
> prefix = 'from-delayed', verify_meta = True
>     @insert_meta_param_description
>     def from_delayed(
>         dfs, meta=None, divisions=None, prefix="from-delayed", verify_meta=True
>     ):
>         """Create Dask DataFrame from many Dask Delayed objects
>
>         Parameters
>         ----------
>         dfs : list of Delayed
>             An iterable of ``dask.delayed.Delayed`` objects, such as come from
>             ``dask.delayed`` These comprise the individual partitions of the
>             resulting dataframe.
>         $META
>         divisions : tuple, str, optional
>             Partition boundaries along the index.
>             For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions
>             For string 'sorted' will compute the delayed values to find index
>             values. Assumes that the indexes are mutually sorted.
>             If None, then won't use index information
>         prefix : str, optional
>             Prefix to prepend to the keys.
>         verify_meta : bool, optional
>             If True check that the partitions have consistent metadata, defaults to True.
>         """
>         from dask.delayed import Delayed
>
>         if isinstance(dfs, Delayed):
>             dfs = [dfs]
>         dfs = [
>             delayed(df) if not isinstance(df, Delayed) and hasattr(df, "key") else df
>             for df in dfs
>         ]
>         for df in dfs:
>             if not isinstance(df, Delayed):
>                 raise TypeError("Expected Delayed object, got %s" % type(df).__name__)
>
> >       parent_meta = delayed(make_meta)(dfs[0]).compute()
> E       IndexError: list index out of range
> /opt/conda/envs/arrow/lib/python3.7/site-packages/dask/dataframe/io/io.py:591: IndexError
> {code}
> (from https://github.com/ursacomputing/crossbow/runs/2756067090)
> I am not yet sure whether this is a kartothek issue or a pyarrow issue, but I also opened an issue on the kartothek side: https://github.com/JDASoftwareGroup/kartothek/issues/475
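For readers of the traceback above: the test expects a ValueError, but the quoted code indexes dfs[0] while dfs is an empty list, so an IndexError escapes before any "empty dataset" check can run. The sketch below reproduces only that failure mode in plain Python. Both functions are hypothetical stand-ins, not dask or kartothek APIs, and the guarded variant is an assumed illustration of a defensive check, not the actual change in the linked dask pull request.

```python
# Hypothetical stand-ins for illustration only; these are NOT dask functions.


def infer_meta_unguarded(partitions):
    """Mirror the failing pattern: index the first partition unconditionally."""
    return type(partitions[0]).__name__


def infer_meta_guarded(partitions, meta=None):
    """Assumed defensive variant: fall back to caller-supplied meta when empty."""
    if not partitions:
        if meta is None:
            raise ValueError("Cannot infer metadata from an empty list of partitions")
        return meta
    return type(partitions[0]).__name__


def demo():
    """Return what each variant does with an empty list of partitions."""
    try:
        infer_meta_unguarded([])
        unguarded = "no error"
    except IndexError as exc:
        unguarded = f"IndexError: {exc}"
    guarded = infer_meta_guarded([], meta=(("a", int),))
    return unguarded, guarded


if __name__ == "__main__":
    print(demo())
```

With the guarded variant, an empty input no longer fails on the index lookup, so a downstream check such as the "Cannot store empty datasets" ValueError that the test expects would get a chance to run.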
--
This message was sent by Atlassian Jira
(v8.3.4#803005)