Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/07/11 18:33:00 UTC

[jira] [Commented] (BEAM-2586) Accommodate custom delimiters in TextIO

    [ https://issues.apache.org/jira/browse/BEAM-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082716#comment-16082716 ] 

ASF GitHub Bot commented on BEAM-2586:
--------------------------------------

GitHub user ChristophHebert opened a pull request:

    https://github.com/apache/beam/pull/3543

    [BEAM-2586] Accommodate custom delimiters in TextIO

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ChristophHebert/beam modifiedTextIO

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3543
    
----
commit 6277f69120da04e69189baf7b20940fc414dbbb8
Author: Chris Hebert <ch...@digitalreasoning.com>
Date:   2017-07-11T17:54:56Z

    Accommodate custom delimiters in TextIO

----


> Accommodate custom delimiters in TextIO
> ---------------------------------------
>
>                 Key: BEAM-2586
>                 URL: https://issues.apache.org/jira/browse/BEAM-2586
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Christopher Hebert
>            Assignee: Davor Bonaci
>            Priority: Minor
>
> We frequently process text files whose records are delimited by something other than newlines, including files where the only delimiter is end of file.
> First option:
> To delimit by commas (or some other string), we could read line by line with TextIO and apply a transform that splits each line on commas. To treat a whole file as one record, we could combine the elements of TextIO's output PCollection that come from the same file into a single element.
> Second option:
> Rather than complicating (and slowing) our pipelines with the methods above, we could write a custom FileBasedSource for each use case.
> Third option:
> Preferably, we'd like to generalize TextIO to accept delimiters other than the defaults: \n, \r, and \r\n.
> I'll attach a pull request showing how we envision this generalization of TextIO.
> If this is not the direction Beam would like to go with TextIO, we'll keep maintaining our own TextIO or our own FileBasedSources to achieve this functionality.
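
For illustration only, the record-splitting behavior requested above can be sketched in plain Java, independent of Beam. The class and method names here (DelimitedRecords, split) are hypothetical and are not taken from the pull request or from TextIO; the sketch assumes the whole input fits in memory and treats a null or empty delimiter as "delimit only by end of file".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Beam: splits text into records on an arbitrary delimiter. */
final class DelimitedRecords {
    private DelimitedRecords() {}

    /**
     * Splits text on the given delimiter string. A null or empty delimiter
     * means "delimit only by end of input", so the whole text is one record.
     */
    static List<String> split(String text, String delimiter) {
        List<String> records = new ArrayList<>();
        if (delimiter == null || delimiter.isEmpty()) {
            // Whole-file case: emit the entire input as a single record.
            if (!text.isEmpty()) {
                records.add(text);
            }
            return records;
        }
        int start = 0;
        int idx;
        // Emit one record per delimiter occurrence, scanning left to right.
        while ((idx = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, idx));
            start = idx + delimiter.length();
        }
        // Emit a trailing record that is not terminated by a delimiter, if any.
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }
}
```

Calling DelimitedRecords.split(contents, ",") yields comma-delimited records, while passing null yields the whole input as one record. A real source would additionally have to handle delimiters that straddle buffer or split boundaries, which this sketch does not attempt.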



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)