Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2022/04/05 14:27:00 UTC

[jira] [Comment Edited] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path

    [ https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517478#comment-17517478 ] 

Lance Dacey edited comment on ARROW-16077 at 4/5/22 2:26 PM:
-------------------------------------------------------------

I am not sure about any public datasets. Locally I use [azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio] for testing, which can be installed directly or run as a Docker container. Note that I only use Azure Blob, not Azure Data Lake, so there may be differences I am not aware of.
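For reference, the Docker route is a one-liner (image name and port are the defaults from the Microsoft docs; the blob endpoint then matches the URL used in the test below):

{code:bash}
docker run -p 10000:10000 mcr.microsoft.com/azure-storage/azurite \
    azurite-blob --blobHost 0.0.0.0
{code}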

I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read Parquet data from Azure. I ran a couple of tests with double slashes in the path. Perhaps I misunderstood the original issue, but it looks like I can read the data with pq.read_table() and with pandas using fs.open() and storage_options. My quick tests are pasted below.



{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal


URL = "http://127.0.0.1:10000"
# well-known azurite development account and key (publicly documented defaults,
# not a real secret)
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"


@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }


def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    fs.mkdir("resource")  # "resource" is the blob container
    path = "resource/path/to//parquet/files/part-001.parquet"  # note the deliberate double slash
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR})
    assert_frame_equal(df, df2)

{code}
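The ds.dataset() route mentioned above is not covered by the test; here is a minimal untested sketch reusing the fs instance and the file written by test_double_slashes(). Directory discovery is where the reported ArrowInvalid surfaces, so this is the call most likely to reproduce it if anything does:

{code:python}
import pyarrow.dataset as ds

# sketch only: assumes fs and the data written by test_double_slashes() above
dataset = ds.dataset(
    "resource/path/to//parquet/files",  # directory path containing the double slash
    filesystem=fs,
    format="parquet",
)
print(dataset.to_table().num_rows)
{code}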





> [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16077
>                 URL: https://issues.apache.org/jira/browse/ARROW-16077
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Jon Rosenberg
>            Priority: Major
>
> Reading a partitioned parquet from adlfs with pyarrow through pandas throws unnecessary exceptions when the forward slashes in the file paths listed by adlfs do not match the base dir, e.g.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to/parquet/files/'{code}
>  
> and testing with a modified adlfs method that prepends slashes to all returned files, we still end up with an error on file paths that would otherwise be handled correctly when there is a double slash where there should be a single one, e.g.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path '/path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to//parquet/files/' {code}
> In both cases the ls from adlfs has returned correctly, given that it discovered the file part-0001.parquet, but the pyarrow exception stops what could otherwise be successful processing.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)