Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2021/01/15 10:20:00 UTC
[jira] [Comment Edited] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

    [ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265869#comment-17265869 ] 

Lance Dacey edited comment on ARROW-11250 at 1/15/21, 10:19 AM:
----------------------------------------------------------------

{code:java}
selected_files1 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True)
selected_files2 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True)
selected_files1 == selected_files2

True{code}
I am able to run the above cell over and over again.

 

Now when I use fs.info() without a final slash:
{code:java}
fs.info("dev/test-split")
{'name': 'dev/test-split/', 'size': 0, 'type': 'directory'}{code}
If I add a slash to the folder name, the slash is removed from the name that fs.info() returns. Will this impact anything?
{code:java}
fs.info("dev/test-split/")
{'name': 'dev/test-split', 'size': 0, 'type': 'directory'}
{code}
 
{code:java}
selected_files3 = fs.info("dev/test-split")
selected_files4 = fs.info("dev/test-split/")
selected_files3 == selected_files4

False{code}
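
For comparisons like the one above, here is a minimal sketch that ignores the trailing slash, assuming the only difference between the two results is the final "/" in 'name'. The helper normalize_info is hypothetical and not part of fsspec or adlfs; the two dicts are copied from the output above.

```python
def normalize_info(info):
    """Return a copy of an fs.info()-style dict with any trailing '/' stripped from 'name'."""
    normalized = dict(info)
    # Keep a bare root path ("/") intact instead of reducing it to an empty string.
    normalized["name"] = normalized["name"].rstrip("/") or "/"
    return normalized


# The two results reported above, copied verbatim.
selected_files3 = {"name": "dev/test-split/", "size": 0, "type": "directory"}
selected_files4 = {"name": "dev/test-split", "size": 0, "type": "directory"}

same_raw = selected_files3 == selected_files4
same_normalized = normalize_info(selected_files3) == normalize_info(selected_files4)
print(same_raw, same_normalized)  # prints: False True
```

This only hides the difference for comparison purposes; it does not explain why the returned name changes.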
 

 

Edit: running fs.info() on the same path more than once fails unless I change the path by adding a slash or reset my kernel. Even if I delete the fs variable and create a new filesystem, it does not work.
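
One possible explanation for the last point, not confirmed in this thread: fsspec caches filesystem instances keyed on their constructor arguments, so creating a "new" filesystem with the same arguments can return the same object, including any listing cache it holds. The sketch below is a toy model of that pattern in plain Python, not fsspec's actual code; ToyFS and toy_filesystem are hypothetical names, and whether a stale cache is really the cause here is an assumption.

```python
_instances = {}  # maps constructor arguments to the instance built from them


class ToyFS:
    """Toy stand-in for a filesystem object that keeps a directory-listing cache."""

    def __init__(self, account_name):
        self.account_name = account_name
        self.dircache = {}


def toy_filesystem(account_name, skip_instance_cache=False):
    """Return a cached ToyFS for these arguments unless the cache is bypassed."""
    if skip_instance_cache:
        return ToyFS(account_name)
    if account_name not in _instances:
        _instances[account_name] = ToyFS(account_name)
    return _instances[account_name]


fs = toy_filesystem("myaccount")
fs.dircache["dev/test-split"] = "stale listing"

# Deleting the variable only drops one reference; the cache still holds the object.
del fs
fs = toy_filesystem("myaccount")
reused = "dev/test-split" in fs.dircache  # the old listing cache came back

# Bypassing the instance cache yields an object with an empty listing cache.
fresh = toy_filesystem("myaccount", skip_instance_cache=True)
is_fresh = fresh.dircache == {}
print(reused, is_fresh)  # prints: True True
```

In fsspec itself, the corresponding things to try would be fs.invalidate_cache() on the existing object, or passing skip_instance_cache=True to fsspec.filesystem() in versions that support it.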


> [Python] Inconsistent behavior calling ds.dataset()
> ---------------------------------------------------
>
>                 Key: ARROW-11250
>                 URL: https://issues.apache.org/jira/browse/ARROW-11250
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
> adal                      1.2.5              pyh9f0ad1d_0    conda-forge
> adlfs                     0.5.9              pyhd8ed1ab_0    conda-forge
> apache-airflow            1.10.14                  pypi_0    pypi
> azure-common              1.1.24                     py_0    conda-forge
> azure-core                1.9.0              pyhd3deb0d_0    conda-forge
> azure-datalake-store      0.0.51             pyh9f0ad1d_0    conda-forge
> azure-identity            1.5.0              pyhd8ed1ab_0    conda-forge
> azure-nspkg               3.0.2                      py_0    conda-forge
> azure-storage-blob        12.6.0             pyhd3deb0d_0    conda-forge
> azure-storage-common      2.1.0            py37hc8dfbb8_3    conda-forge
> fsspec                    0.8.5              pyhd8ed1ab_0    conda-forge
> jupyterlab_pygments       0.1.2              pyh9f0ad1d_0    conda-forge
> pandas                    1.2.0            py37ha9443f7_0
> pyarrow                   2.0.0           py37h4935f41_6_cpu    conda-forge
>            Reporter: Lance Dacey
>            Priority: Minor
>              Labels: azureblob, dataset,, python
>             Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a dataset which certainly exists on Azure Blob.
>  
> {code:java}
> import fsspec
> fs = fsspec.filesystem("abfs", account_name=account_name, account_key=account_key)
> {code}
>  
> One example of this is reading a dataset in one cell:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>  
> Then in another cell I try to read the same dataset:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---------------------------------------------------------------------------
> FileNotFoundError                         Traceback (most recent call last)
> <ipython-input-514-bf63585a0c1b> in <module>
> ----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
>     669     # TODO(kszucs): support InMemoryDataset for a table input
>     670     if _is_path_like(source):
> --> 671         return _filesystem_dataset(source, **kwargs)
>     672     elif isinstance(source, (tuple, list)):
>     673         if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
>     426         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
>     427     else:
> --> 428         fs, paths_or_selector = _ensure_single_source(source, filesystem)
>     429 
>     430     options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
>     402         paths_or_selector = [path]
>     403     else:
> --> 404         raise FileNotFoundError(path)
>     405 
>     406     return filesystem, paths_or_selector
> FileNotFoundError: dev/test-split
> {code}
>  
> If I reset the kernel, it works again. It also works if I change the path slightly, like adding a "/" at the end (so basically it just does not work if I read the same dataset twice with the same path):
>  
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>  
>  
> The other strange behavior I have noticed is that reading a dataset inside my Jupyter notebook is fast:
>  
> {code:java}
> %%time
> dataset = ds.dataset(
>     "dev/test-split",
>     partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive"),
>     filesystem=fs,
>     exclude_invalid_files=False,
> )
> CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s
> Wall time: 2.58 s{code}
>  
> However, when I run the same code against the same dataset in Airflow on the exact same server, it takes over 3 minutes (based on the timestamps in my logs from right before I read the dataset and immediately after the dataset is available to filter):
> {code:java}
> [2021-01-14 03:52:04,011] INFO - Reading dev/test-split
> [2021-01-14 03:55:17,360] INFO - Processing dataset in batches
> {code}
> This is probably not a pyarrow issue, but what are some potential causes that I can look into? I have one example where it takes 9 seconds to read the dataset in Jupyter, but 11 *minutes* in Airflow. I don't really know what to investigate. As I mentioned, the Jupyter notebook and Airflow are on the same server and both are deployed using Docker. Airflow is using the CeleryExecutor.
>  


