Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/30 20:03:55 UTC

[GitHub] [arrow] Mokubyow opened a new issue #10634: pq.write_to_dataset() and ds.write_dataset() both throw InvalidLocationConstraint when using S3FileSystem

Mokubyow opened a new issue #10634:
URL: https://github.com/apache/arrow/issues/10634


   I've been having some issues with writing a dataset or table to S3. I'm using pyarrow 4.0.1 with Python and am just wondering if I'm missing something, because I keep getting the following error:
   
   ```
   import pyarrow.dataset as ds
   from pyarrow import fs
   import pyarrow.parquet as pq
   
   source_uri = "s3://my-existing-bucket/old_prefix/"
   dataset = ds.dataset(source_uri)
   
   target_uri = "s3://my-existing-bucket/new_prefix/"       
   filesystem, path = fs.FileSystem.from_uri(target_uri)
   
   # both of these write methods produce the error below
   
   # First write type
   ds.write_dataset(dataset, path, filesystem=filesystem, format="parquet")
   
   # Second write type 
   pq.write_to_dataset(dataset.to_table(), path, filesystem=filesystem)
   ```
   
   **Error**
   ```
   OSError: When creating bucket 'my-existing-bucket': AWS Error [code 100]: Unable to parse ExceptionName: InvalidLocationConstraint Message: The specified location-constraint is not valid
   ```
   
   I've checked and my region is set to `us-east-1`, which is correct for me. Additionally, I was able to write a single parquet file using `pq.write_table` (shown below), so I know my credentials are in working order.
   
   ```
   pq.write_table(dataset.to_table(), path, filesystem=filesystem)
   ```
   
   Finally, the target bucket was also created in `us-east-1`, so I don't think it's a region issue, but that is about all I could find online about this.
   

[GitHub] [arrow] wesm closed issue #10634: pq.write_to_dataset() and ds.write_dataset() both throw InvalidLocationConstraint when using S3FileSystem

Posted by GitBox <gi...@apache.org>.
wesm closed issue #10634:
URL: https://github.com/apache/arrow/issues/10634


[GitHub] [arrow] westonpace commented on issue #10634: pq.write_to_dataset() and ds.write_dataset() both throw InvalidLocationConstraint when using S3FileSystem

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10634:
URL: https://github.com/apache/arrow/issues/10634#issuecomment-872502967


   Ok, yes, I looked a little further. I did not realize that the dataset code calls CreateDir even if the bucket already exists (it uses CreateDir to test whether the bucket exists). So this is ARROW-13228. If you are able to wait for version 5.0.0 (~a month out), you can get the fix there. Alternatively, you can use the latest nightly build or build from source.
   
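   For illustration, a minimal sketch (assuming pyarrow 4.0.1 and an existing bucket) of the probe described above, which fails the same way on its own:
   
   ```
   from pyarrow import fs
   
   filesystem, path = fs.FileSystem.from_uri("s3://my-existing-bucket/new_prefix/")
   
   # On affected versions this ends up attempting CreateBucket, which fails
   # with InvalidLocationConstraint in us-east-1 even though the bucket
   # already exists.
   filesystem.create_dir(path, recursive=True)
   ```
   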
   Another workaround may be to use s3fs with pyarrow's `PyFileSystem` and `FSSpecHandler`:
   
   ```
   import s3fs
   import pyarrow.fs
   
   # Wrap an fsspec S3 filesystem in a pyarrow-compatible adapter so that
   # dataset writes go through s3fs rather than Arrow's native S3FileSystem.
   s3fs_instance = s3fs.S3FileSystem()
   filesystem = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3fs_instance))
   
   # Then, e.g.: ds.write_dataset(dataset, path, filesystem=filesystem, format="parquet")
   ```
   
   A final workaround could be to create your own filesystem implementation that wraps a `pyarrow.fs.S3FileSystem` instance (e.g. proxy pattern) and whose `create_dir` is simply a no-op.
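   
   A rough sketch of that idea, reusing `FSSpecHandler` from the snippet above rather than proxying `pyarrow.fs.S3FileSystem` directly (the handler name here is made up):
   
   ```
   import s3fs
   import pyarrow.fs
   
   class SkipCreateDirHandler(pyarrow.fs.FSSpecHandler):
       # Make create_dir a no-op: S3 has no real directories, and the
       # target bucket already exists, so there is nothing to create.
       def create_dir(self, path, recursive):
           pass
   
   filesystem = pyarrow.fs.PyFileSystem(SkipCreateDirHandler(s3fs.S3FileSystem()))
   ```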


[GitHub] [arrow] Mokubyow commented on issue #10634: pq.write_to_dataset() and ds.write_dataset() both throw InvalidLocationConstraint when using S3FileSystem

Posted by GitBox <gi...@apache.org>.
Mokubyow commented on issue #10634:
URL: https://github.com/apache/arrow/issues/10634#issuecomment-872305153


   @westonpace I'm 100% sure I'm using the exact same `path` variable for each of these write methods. I've annotated what works and what doesn't below. I can also confirm that the `path` variable returned from `fs.FileSystem.from_uri(target_uri)` does indeed start with the bucket name, like so: `my-existing-bucket/new_prefix/`.
   
   ```
   filesystem, path = fs.FileSystem.from_uri(target_uri)
   
   # Throws InvalidLocationConstraint error
   ds.write_dataset(dataset, path, filesystem=filesystem, format="parquet")
   
   # Throws InvalidLocationConstraint error
   pq.write_to_dataset(dataset.to_table(), path, filesystem=filesystem)
   
   # Writes single file
   pq.write_table(dataset.to_table(), path, filesystem=filesystem)
   ```
   
   Are there any solutions you can think of for writing a large dataset as many parquet files from pyarrow, to unblock me while we debug?


[GitHub] [arrow] westonpace commented on issue #10634: pq.write_to_dataset() and ds.write_dataset() both throw InvalidLocationConstraint when using S3FileSystem

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10634:
URL: https://github.com/apache/arrow/issues/10634#issuecomment-871906909


   It appears `us-east-1` has special rules regarding the LocationConstraint. See https://github.com/boto/boto3/issues/125 for details. I've created ARROW-13228 to track this.
   
   Are you sure you are using `my-existing-bucket` in both cases? Note that the error you are getting says "When creating bucket". Both dataset write methods will create a bucket if it does not exist, so my guess is that `path` does not start with `my-existing-bucket`. If you are intentionally creating a new bucket, then you can work around this error by manually creating the destination bucket outside of Arrow.
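   
   If you do go the manual route, a minimal example with boto3 (bucket name hypothetical) that illustrates the `us-east-1` quirk from the linked issue:
   
   ```
   import boto3
   
   s3 = boto3.client("s3", region_name="us-east-1")
   
   # In us-east-1 the CreateBucketConfiguration must be omitted entirely;
   # passing LocationConstraint="us-east-1" is rejected as invalid.
   s3.create_bucket(Bucket="my-new-bucket")
   
   # In any other region the constraint is required, e.g.:
   # s3.create_bucket(
   #     Bucket="my-new-bucket",
   #     CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
   # )
   ```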


[GitHub] [arrow] Mokubyow edited a comment on issue #10634: pq.write_to_dataset() and ds.write_dataset() both throw InvalidLocationConstraint when using S3FileSystem

Posted by GitBox <gi...@apache.org>.
Mokubyow edited a comment on issue #10634:
URL: https://github.com/apache/arrow/issues/10634#issuecomment-872305153


   @westonpace I'm 100% sure I'm using the exact same `path` variable for each of these write methods. I've annotated what works and what doesn't below. I can also confirm that the `path` variable returned from `fs.FileSystem.from_uri(target_uri)` does indeed start with the bucket name: `my-existing-bucket/new_prefix/`. Additionally, I'm not creating a new bucket; this bucket already exists in my AWS account and I'm just looking to write to it, so I'm not sure why this error is being thrown in the first place.
   
   ```
   filesystem, path = fs.FileSystem.from_uri(target_uri)
   
   # Throws InvalidLocationConstraint error
   ds.write_dataset(dataset, path, filesystem=filesystem, format="parquet")
   
   # Throws InvalidLocationConstraint error
   pq.write_to_dataset(dataset.to_table(), path, filesystem=filesystem)
   
   # Writes single file
   pq.write_table(dataset.to_table(), path, filesystem=filesystem)
   ```
   
   Are there any solutions you can think of for writing a large dataset as many parquet files from pyarrow, to unblock me while we debug?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org