Posted to issues@arrow.apache.org by "Simon Lidberg (JIRA)" <ji...@apache.org> on 2019/06/24 06:57:00 UTC

[jira] [Comment Edited] (ARROW-5647) [Python] Accessing a file from Databricks using pandas read_parquet using the pyarrow engine fails with : Passed non-file path: /mnt/aa/example.parquet

    [ https://issues.apache.org/jira/browse/ARROW-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870857#comment-16870857 ] 

Simon Lidberg edited comment on ARROW-5647 at 6/24/19 6:56 AM:
---------------------------------------------------------------

I have now tested accessing the file as /dbfs/mnt/aa/example2.parquet; it still fails, but this time with a different error:

---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<command-4042920808160098> in <module>()
----> 1 pddf2 = pd.read_parquet("/dbfs/mnt/aa/example2.parquet", engine='pyarrow')
      2 display(pddf2)

/databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280
    281     impl = get_engine(engine)
--> 282     return impl.read(path, columns=columns, **kwargs)

/databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    127         kwargs['use_pandas_metadata'] = True
    128         result = self.api.parquet.read_table(path, columns=columns,
--> 129                                              **kwargs).to_pandas()
    130         if should_close:
    131             try:

/databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
   1150         return fs.read_parquet(path, columns=columns,
   1151                                use_threads=use_threads, metadata=metadata,
-> 1152                                use_pandas_metadata=use_pandas_metadata)
   1153
   1154     pf = ParquetFile(source, metadata=metadata)

/databricks/python/lib/python3.5/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
    177         from pyarrow.parquet import ParquetDataset
    178         dataset = ParquetDataset(path, schema=schema, metadata=metadata,
--> 179                                  filesystem=self)
    180         return dataset.read(columns=columns, use_threads=use_threads,
    181                             use_pandas_metadata=use_pandas_metadata)

/databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map)
    956
    957         if validate_schema:
--> 958             self.validate_schemas()
    959
    960         if filters is not None:

/databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in validate_schemas(self)
    967                 self.schema = self.common_metadata.schema
    968             else:
--> 969                 self.schema = self.pieces[0].get_metadata().schema
    970         elif self.schema is None:
    971             self.schema = self.metadata.schema

/databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in get_metadata(self, open_file_func)
    500             f = self._open(open_file_func)
    501         else:
--> 502             f = self.open()
    503         return f.metadata
    504

/databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in open(self)
    518         Returns instance of ParquetFile
    519         """
--> 520         reader = self.open_file_func(self.path)
    521         if not isinstance(reader, ParquetFile):
    522             reader = ParquetFile(reader)

/databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in open_file(path, meta)
   1054                 return ParquetFile(path, metadata=meta,
   1055                                    memory_map=self.memory_map,
-> 1056                                    common_metadata=self.common_metadata)
   1057             else:
   1058                 def open_file(path, meta=None):

/databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata, memory_map)
    128                  memory_map=True):
    129         self.reader = ParquetReader()
--> 130         self.reader.open(source, use_memory_map=memory_map, metadata=metadata)
    131         self.common_metadata = common_metadata
    132         self._nested_paths_by_prefix = self._build_nested_paths()

/databricks/python/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so in pyarrow._parquet.ParquetReader.open()

/databricks/python/lib/python3.5/site-packages/pyarrow/lib.cpython-35m-x86_64-linux-gnu.so in pyarrow.lib.check_status()

ArrowIOError: Invalid parquet file. Corrupt footer.
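
One thing worth checking (a suggestion on my part, not something verified in this ticket): df.write.parquet() produces a directory rather than a single file, and on Databricks that directory typically holds zero-byte marker files (_SUCCESS, plus _started_*/_committed_* files from the commit protocol) next to the part files. A file with no parquet footer, or a FUSE mount that misbehaves under the memory mapping the traceback shows is enabled by default (memory_map=True), could each produce this "Corrupt footer" error. A minimal diagnostic sketch, assuming the same /dbfs mount as above:

import os

# List every entry Spark wrote into the output "file" (really a directory)
# together with its size; zero-byte entries cannot contain a parquet footer.
path = "/dbfs/mnt/aa/example2.parquet"
for name in sorted(os.listdir(path)):
    print(name, os.path.getsize(os.path.join(path, name)))

If the listing shows only healthy part files, the FUSE/memory-map interaction becomes the more likely suspect.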

My entire test code is as follows:

import pandas as pd
from pyspark.sql import *

# Create test data
row1 = Row(id='1', name='Alpha')
row2 = Row(id='2', name='Beta')
row3 = Row(id='3', name='Gamma')
row4 = Row(id='4', name='Delta')

df = spark.createDataFrame([row1, row2, row3, row4])
display(df)

# Write the data to the mount point
df.write.parquet("/mnt/aa/example2.parquet")

# Try reading using pandas read_parquet
pddf2 = pd.read_parquet("/dbfs/mnt/aa/example2.parquet", engine='pyarrow')
display(pddf2) 

The same data can be read successfully using Spark:
df2 = spark.read.parquet("/mnt/aa/example2.parquet")

pddf = df2.toPandas()
type(pddf)
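
If the directory contents look healthy, a possible workaround (again my sketch, not verified on this cluster) is to open a single part file directly, bypassing the dataset-level directory scan, and to disable memory mapping, which read_table exposes as a keyword in 0.13 (see the traceback above):

import glob
import pyarrow.parquet as pq

# The glob assumes Spark's default part-file naming (part-*.snappy.parquet
# also matches this pattern); adjust it if the cluster writes other names.
part = sorted(glob.glob("/dbfs/mnt/aa/example2.parquet/part-*.parquet"))[0]
table = pq.read_table(part, memory_map=False)
pddf2 = table.to_pandas()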



> [Python] Accessing a file from Databricks using pandas read_parquet using the pyarrow engine fails with : Passed non-file path: /mnt/aa/example.parquet 
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-5647
>                 URL: https://issues.apache.org/jira/browse/ARROW-5647
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Azure Databricks
>            Reporter: Simon Lidberg
>            Priority: Major
>         Attachments: arrow_error.txt
>
>
> When trying to access a file through a mount point that points to an Azure Blob Storage account, the code fails with the following error:
> OSError: Passed non-file path: /mnt/aa/example.parquet
> ---------------------------------------------------------------------------
> OSError                                  Traceback (most recent call last)
> <command-1848295812523966> in <module>()
> ----> 1 pddf2 = pd.read_parquet("/mnt/aa/example.parquet", engine='pyarrow')
>       2 display(pddf2)
> /databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
>     280
>     281     impl = get_engine(engine)
> --> 282     return impl.read(path, columns=columns, **kwargs)
> /databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
>     127         kwargs['use_pandas_metadata'] = True
>     128         result = self.api.parquet.read_table(path, columns=columns,
> --> 129                                              **kwargs).to_pandas()
>     130         if should_close:
>     131             try:
> /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153
>    1154     pf = ParquetFile(source, metadata=metadata)
> /databricks/python/lib/python3.5/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     177         from pyarrow.parquet import ParquetDataset
>     178         dataset = ParquetDataset(path, schema=schema, metadata=metadata,
> --> 179                                  filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
>     181                             use_pandas_metadata=use_pandas_metadata)
> /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map)
>     933             self.metadata_path) = _make_manifest(
>     934                 path_or_paths, self.fs, metadata_nthreads=metadata_nthreads,
> --> 935                 open_file_func=self._open_file_func)
>     936
>     937         if self.common_metadata_path is not None:
> /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
>    1108         if not fs.isfile(path):
>    1109             raise IOError('Passed non-file path: {0}'
> -> 1110                           .format(path))
>    1111         piece = ParquetDatasetPiece(path, open_file_func=open_file_func)
>    1112         pieces.append(piece)
> OSError: Passed non-file path: /mnt/aa/example.parquet
>  
> I am using the following code from a Databricks notebook to reproduce the issue:
> %sh
> sudo apt-get -y install python3-pip
> /databricks/python3/bin/pip3 uninstall pandas -y
> /databricks/python3/bin/pip3 uninstall numpy -y
> /databricks/python3/bin/pip3 uninstall pyarrow -y
>  
>  
> %sh
> /databricks/python3/bin/pip3 install numpy==1.14.0
> /databricks/python3/bin/pip3 install pandas==0.24.1
> /databricks/python3/bin/pip3 install pyarrow==0.13.0
>  
> dbutils.fs.mount(
>   source = "wasbs://<mycontainer>@<mystorageaccount>.blob.core.windows.net",
>   mount_point = "/mnt/aa",
>   extra_configs = {"fs.azure.account.key.<mystorageaccount>.blob.core.windows.net": dbutils.secrets.get(scope = "storage", key = "blob_key")})
>  
> pddf2 = pd.read_parquet("/mnt/aa/example.parquet", engine='pyarrow')
> display(pddf2)
>  
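
For reference, the "Passed non-file path" error in the description comes from the fs.isfile() check in _make_manifest (visible in the traceback above): pandas and pyarrow resolve paths against the driver's local filesystem, where the DBFS mount is only visible under the /dbfs FUSE prefix. A minimal sketch of the two spellings, assuming the mount from the reproduction code:

import pandas as pd

# Fails: /mnt/aa/... exists only inside DBFS, not on the local filesystem,
# so fs.isfile() returns False and ParquetDataset rejects the path.
# pddf2 = pd.read_parquet("/mnt/aa/example.parquet", engine='pyarrow')

# The local-filesystem spelling of the same location via the FUSE mount:
pddf2 = pd.read_parquet("/dbfs/mnt/aa/example.parquet", engine='pyarrow')

As the edited comment above shows, the /dbfs spelling then fails differently ("Corrupt footer") when the target is a Spark-written directory, so the two errors appear to have separate causes.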



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)