Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/06/07 13:30:00 UTC

[jira] [Created] (ARROW-12988) [CI] The kartothek nightly integration build is failing (test_update_dataset_from_ddf_empty)

Joris Van den Bossche created ARROW-12988:
---------------------------------------------

             Summary: [CI] The kartothek nightly integration build is failing (test_update_dataset_from_ddf_empty)
                 Key: ARROW-12988
                 URL: https://issues.apache.org/jira/browse/ARROW-12988
             Project: Apache Arrow
          Issue Type: Bug
          Components: Continuous Integration, Python
            Reporter: Joris Van den Bossche


The nightly "kartothek" integration builds are failing.

More specifically, the {{test_update_dataset_from_ddf_empty}} test is failing with:

{code}
=================================== FAILURES ===================================
___________________ test_update_dataset_from_ddf_empty[True] ___________________

store_factory = functools.partial(<function get_store_from_url at 0x7f1434733050>, 'hfs:///tmp/pytest-of-root/pytest-0/test_update_dataset_from_ddf_e0/store')
shuffle = True

    @pytest.mark.parametrize("shuffle", [True, False])
    def test_update_dataset_from_ddf_empty(store_factory, shuffle):
        with pytest.raises(ValueError, match="Cannot store empty datasets"):
            update_dataset_from_ddf(
>               dask.dataframe.from_delayed([], meta=(("a", int),)),
                store_factory,
                dataset_uuid="output_dataset_uuid",
                table="core",
                shuffle=shuffle,
                partition_on=["a"],
            ).compute()

tests/io/dask/dataframe/test_update.py:57: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

dfs = [], meta = (('a', <class 'int'>),), divisions = None
prefix = 'from-delayed', verify_meta = True

    @insert_meta_param_description
    def from_delayed(
        dfs, meta=None, divisions=None, prefix="from-delayed", verify_meta=True
    ):
        """Create Dask DataFrame from many Dask Delayed objects
    
        Parameters
        ----------
        dfs : list of Delayed
            An iterable of ``dask.delayed.Delayed`` objects, such as come from
            ``dask.delayed`` These comprise the individual partitions of the
            resulting dataframe.
        $META
        divisions : tuple, str, optional
            Partition boundaries along the index.
            For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions
            For string 'sorted' will compute the delayed values to find index
            values.  Assumes that the indexes are mutually sorted.
            If None, then won't use index information
        prefix : str, optional
            Prefix to prepend to the keys.
        verify_meta : bool, optional
            If True check that the partitions have consistent metadata, defaults to True.
        """
        from dask.delayed import Delayed
    
        if isinstance(dfs, Delayed):
            dfs = [dfs]
        dfs = [
            delayed(df) if not isinstance(df, Delayed) and hasattr(df, "key") else df
            for df in dfs
        ]
        for df in dfs:
            if not isinstance(df, Delayed):
                raise TypeError("Expected Delayed object, got %s" % type(df).__name__)
    
>       parent_meta = delayed(make_meta)(dfs[0]).compute()
E       IndexError: list index out of range

/opt/conda/envs/arrow/lib/python3.7/site-packages/dask/dataframe/io/io.py:591: IndexError
{code}

(from https://github.com/ursacomputing/crossbow/runs/2756067090)
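
Based on the traceback, the failure can seemingly be reproduced without kartothek involved at all: {{from_delayed}} derives the parent metadata from {{dfs[0]}}, which fails for an empty list of delayed objects. A minimal sketch of the offending call (my own illustration based on the test above, not taken from the CI logs):

{code}
# Minimal reproduction sketch, independent of kartothek.
import dask.dataframe

# from_delayed() computes parent_meta from dfs[0], so an empty list of
# delayed objects hits dfs[0] and raises
# "IndexError: list index out of range" before kartothek's
# "Cannot store empty datasets" ValueError can ever be reached.
ddf = dask.dataframe.from_delayed([], meta=(("a", int),))
{code}

If that holds, the regression may come from a change to {{from_delayed}} in dask rather than from pyarrow itself.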

It is not directly clear whether this is a kartothek issue or a pyarrow issue, so I also created an issue on their side: https://github.com/JDASoftwareGroup/kartothek/issues/475


