Posted to user@arrow.apache.org by Robin Kåveland Hansen <ka...@gmail.com> on 2020/05/01 11:49:28 UTC

[Python] Accessing Azure Blob storage using arrow

Hi!

Hadoop has built-in support for several so-called HDFS-compatible file
systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
and Azure Data Lake Storage Gen2. Using these with hdfs commands requires
a little bit of setup in core-site.xml; one of the simplest possible
examples is:

<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>YOUR ACCESS KEY</value>
</property>

At that point, you can issue commands like:

hdfs dfs -ls wasbs://containername@youraccount.blob.core.windows.net

I currently use Spark to access a bunch of Azure storage accounts, so I
already have core-site.xml set up, and I thought I'd leverage
pyarrow.fs.HadoopFileSystem to interact directly with these file systems
instead of having to put things on local storage first. I'm working with
Hive-partitioned datasets, so there's an annoying amount of "double work"
in downloading only the necessary partitions.

Creating a pyarrow.fs.HadoopFileSystem works fine, but whenever it is
given one of the configured paths that doesn't match fs.defaultFS, it
fails with an exception like:

IllegalArgumentException: Wrong FS: wasbs://..., expected:
hdfs://localhost:port
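
A minimal sketch of the failure, with the same placeholders and API
assumptions as above:

from pyarrow import fs

hdfs = fs.HadoopFileSystem("default")  # connecting via fs.defaultFS works fine

# Touching any wasbs:// path fails, because the Java side checks the
# scheme and authority against fs.defaultFS:
path = ("wasbs://containername@youraccount.blob.core.windows.net"
        "/warehouse/events/date=2020-05-01/part-0000.parquet")
hdfs.open_input_file(path)
# IllegalArgumentException: Wrong FS: wasbs://..., expected: hdfs://localhost:port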

Is there any way of making this work? It looks like this validation
happens on the Java side of the connection, so maybe there's nothing
that can be done in Arrow?

The other option I looked into was subclassing pyarrow.fs.FileSystem to
write a class built on the Azure Storage SDK, but after reading the
pyarrow code that seems non-trivial, since the file system object gets
passed back to C++ under the hood. I'm also seeing some type checking
that seems to indicate you're not supposed to extend this API.

That leaves the option of doing this in C++ using an SDK like
https://github.com/Azure/azure-storage-cpplite which is unfortunately a
lot more involved than I was hoping for when I started tumbling down
this particular rabbit hole.

-- 
Kind regards,
Robin Kåveland

Re: [Python] Accessing Azure Blob storage using arrow

Posted by Wes McKinney <we...@gmail.com>.
I just commented about this in

https://issues.apache.org/jira/browse/ARROW-2034

Our preferred path forward would almost certainly be to build a C++
implementation of the arrow::fs::FileSystem interface that deals with
Azure; that would then be straightforward to hook up with the Datasets
API.


Re: [Python] Accessing Azure Blob storage using arrow

Posted by Robin Kåveland Hansen <ka...@gmail.com>.
Hi,

You're right, I want the dataset functionality. I'm able to read
individual files into memory and pass them to Arrow just fine, like the
example from the documentation.
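
For context, what I do today is roughly this (a sketch assuming the older
azure-storage-blob 2.x SDK; account, container and blob names are
placeholders):

import io

import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService

# Download a single blob into memory, then hand the buffer to Arrow.
service = BlockBlobService(account_name="youraccount",
                           account_key="YOUR ACCESS KEY")
stream = io.BytesIO()
service.get_blob_to_stream(
    container_name="containername",
    blob_name="warehouse/events/date=2020-05-01/part-0000.parquet",
    stream=stream,
)
stream.seek(0)
table = pq.read_table(stream)

This works fine one file at a time, but it's where the "double work" comes
from: I end up re-implementing the partition discovery and filtering that
the datasets API already handles.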

--
Kind regards,
Robin Kåveland

Re: [Python] Accessing Azure Blob storage using arrow

Posted by Micah Kornfield <em...@gmail.com>.
Hi Robin,
I'm not an expert in this area and there has been a lot of change since I
looked into this, but there was an old PR that looked to add a Python
implementation [1]; as you noted, it was closed in favor of targeting a
C++ implementation. It sounds like you may want more dataset-like
functionality, but does the example given for reading from Azure in the
documentation work for you [2]? I think there are similar APIs for
reading other file types.

Hope this helps.

-Micah

[1] https://github.com/apache/arrow/pull/4121
[2] https://arrow.apache.org/docs/python/parquet.html#reading-a-parquet-file-from-azure-blob-storage


On Fri, May 1, 2020 at 4:49 AM Robin Kåveland Hansen <ka...@gmail.com>
wrote:

> Hi!
>
> Hadoop has builtin support for several so-called hdfs-compatible file
> systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
> and Azure Data Lake gen2. Using these with hdfs commands requires a
> little bit of setup in core-site.xml, one of the simplest possible
> examples being:
>
> <property>
>   <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
>   <value>YOUR ACCESS KEY</value>
> </property>
>
> At that point, you can issue commands like:
>
> hdfs dfs -ls wasbs://containername@youraccount.blob.core.windows.net
>
> I currently use spark to access a bunch of azure storage accounts, so I
> already have the core-site.xml setup and thought to leverage
> pyarrow.fs.HadoopFileSystem to be able to interact directly with these
> file systems instead of having to put things on local storage first. I'm
> working with hive-partitioned datasets, so there's an annoying amount of
> "double work" in downloading only the necessary partitions.
>
> Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an
> exception like:
>
> IllegalArgumentException: Wrong FS: wasbs://..., expected:
> hdfs://localhost:port
>
> whenever given one of the configured paths that aren't fs.defaultFS.
>
> Is there any way of making this work? Looks like this validation is
> happening on the java side of the connection, so maybe there's nothing
> that can be done in arrow?
>
> The other option I checked out was to extend pyarrow.fs.FileSystem to
> write a class built on the Azure Storage SDK, but after reading the
> pyarrow code, that seems non-trivial, since it's being passed back to
> C++ under the hood. I'm also seeing some typechecking that seems to
> indicate that you're not supposed to extend this API.
>
> That leaves the option of doing this in C++ using some SDK like
> https://github.com/Azure/azure-storage-cpplite which is unfortunately a
> lot more involved for me than I was hoping for when I started tumbling
> down this particular rabbithole.
>
> --
> Kind regards,
> Robin Kåveland
>
>