Posted to issues@beam.apache.org by "Pablo Estrada (Jira)" <ji...@apache.org> on 2019/09/24 18:43:00 UTC

[jira] [Commented] (BEAM-7998) MatchesFiles or MatchAll seems to return several time the same element

    [ https://issues.apache.org/jira/browse/BEAM-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937092#comment-16937092 ] 

Pablo Estrada commented on BEAM-7998:
-------------------------------------

I ran the following pipeline locally and had no problems :/ but if the problem you describe is happening somehow, it would be pretty serious. Are you only running locally, or are you also using Dataflow? If you're using Dataflow, it makes sense to file a support ticket to get this resolved.
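{code:python}
import json

import apache_beam as beam
from apache_beam.io import fileio


def run():
  with beam.Pipeline() as p:
    # Match every JSON file in the bucket, open each match, and pair
    # the file path with its parsed contents.
    pairs = (p
        | fileio.MatchFiles('gs://my-bucket/*.json')
        | fileio.ReadMatches()
        | beam.Map(lambda f: (f.metadata.path,
                              json.loads(f.read_utf8()))))

    # Print each (path, contents) pair so repeated paths are easy to spot.
    pairs | beam.Map(lambda x: print(x))
{code}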
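
To confirm whether the same file really is emitted more than once, one option is to gather the matched paths (for example from the printed output of the pipeline above) and count them. This is only a minimal sketch in plain Python, assuming the paths have already been collected into a list; the helper name duplicated_paths and the sample paths are hypothetical and used purely for illustration.

```python
from collections import Counter


def duplicated_paths(paths):
    """Return {path: count} for every path that appears more than once."""
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical output of a run in which one file was matched twice.
observed = [
    'gs://my-bucket/a.json',
    'gs://my-bucket/b.json',
    'gs://my-bucket/a.json',
]
print(duplicated_paths(observed))  # {'gs://my-bucket/a.json': 2}
```

An empty result would suggest the matches are unique. Inside a pipeline, a similar check could probably be done by mapping each match to its path and applying beam.combiners.Count.PerElement(), then filtering for counts greater than 1.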

> MatchesFiles or MatchAll seems to return several time the same element
> ----------------------------------------------------------------------
>
>                 Key: BEAM-7998
>                 URL: https://issues.apache.org/jira/browse/BEAM-7998
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-files
>    Affects Versions: 2.14.0
>         Environment: GCP for storage, DirectRunner and DataflowRunner both have the problem. PyCharm on Win10 for IDE and dev environment.
>            Reporter: Jerome MASSOT
>            Assignee: Pablo Estrada
>            Priority: Major
>              Labels: ccoss2019
>
> Hi team,
> When I use MatchFiles with a wildcard on files located in a GCP bucket, the MatchFiles transform returns the same file several times (at least twice).
> I have tried to follow the stack, and I can see that MatchAll is called twice when I run the pipeline on a debug project where a single element is present in the bucket.
> But I am not good enough to say more than that. Sorry.
> Best regards
> Jerome


