Posted to commits@beam.apache.org by "Vilhelm von Ehrenheim (JIRA)" <ji...@apache.org> on 2017/05/22 12:18:04 UTC
[jira] [Created] (BEAM-2338) Directory filepattern wildcard broken in python SDK
Vilhelm von Ehrenheim created BEAM-2338:
-------------------------------------------
Summary: Directory filepattern wildcard broken in python SDK
Key: BEAM-2338
URL: https://issues.apache.org/jira/browse/BEAM-2338
Project: Beam
Issue Type: Bug
Components: beam-model
Affects Versions: 2.0.0
Reporter: Vilhelm von Ehrenheim
Assignee: Frances Perry
Validation of file patterns containing a wildcard (`*`) in a directory component does not work if the filename is fully specified.
Some kinds of patterns generate an error here:
https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168
I've tried a few different FileSystems.match calls, and the results confuse me a bit.
Full path works:
{noformat}
>>> from apache_beam.io.filesystems import FileSystems
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
{noformat}
A glob star in place of the directory does not:
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[]
{noformat}
Adding a star at the file level as well, so that we search for any TIF file, works (although it matches a different file, which is fine):
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
{noformat}
OK, here comes an even stranger case.
Looking for the file we just found, but with a star in place of the directory, we do find it:
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
{noformat}
Looking at the first case again, we also get a match if the star is placed late enough in the pattern that the directory prefix is unique:
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
{noformat}
but not if the star is placed earlier in the directory name:
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[]
{noformat}
My guess is that some directories are dropped from the list of matched directories, or something similar.
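To make that guess concrete, here is a minimal pure-Python sketch. It is not taken from Beam's code and does not claim to show the real cause. It only demonstrates that applying the limit to the prefix listing before the glob filter would reproduce every result above. All names (`OBJECTS`, `prefix_of`, `truncating_match`, `filtering_match`) are hypothetical, and the `_B2.TIF` object is invented for illustration.

```python
import fnmatch

# Stand-in for the bucket listing. The two _B1.TIF paths appear in the report;
# the _B2.TIF entry is an invented extra object.
OBJECTS = [
    "LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF",
    "LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B2.TIF",
    "LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF",
]


def prefix_of(pattern):
    """Return the literal part of the pattern before the first wildcard."""
    positions = [pattern.find(ch) for ch in "*?[" if ch in pattern]
    return pattern[:min(positions)] if positions else pattern


def truncating_match(pattern, limit):
    """Hypothesised behaviour: cut the prefix listing to `limit`, then filter."""
    prefix = prefix_of(pattern)
    listed = [name for name in sorted(OBJECTS) if name.startswith(prefix)]
    listed = listed[:limit]  # candidates beyond the limit are never examined
    return [name for name in listed if fnmatch.fnmatchcase(name, pattern)]


def filtering_match(pattern, limit):
    """Expected behaviour: filter the whole listing, then apply the limit."""
    matched = [name for name in sorted(OBJECTS)
               if fnmatch.fnmatchcase(name, pattern)]
    return matched[:limit]
```

With a limit of 1, `truncating_match` returns an empty list for `LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF` and for the `LC8044034201*` variant, because the only listed candidate is the first 2013 object. It finds the file for the full path, for `*/*.TIF`, for `*/LC80440342013106LGN01_B1.TIF`, and for the `LC80440342016259LGN*` variant. `filtering_match` finds the file in every case.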