Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2021/11/08 19:37:00 UTC

[jira] [Commented] (BEAM-13189) Add escapechar to Python TextIO reads

    [ https://issues.apache.org/jira/browse/BEAM-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440710#comment-17440710 ] 

Kenneth Knowles commented on BEAM-13189:
----------------------------------------

I added you to the "Contributors" role so you can be assigned Jira tickets. It seems you are doing this one. Thanks!

> Add escapechar to Python TextIO reads
> -------------------------------------
>
>                 Key: BEAM-13189
>                 URL: https://issues.apache.org/jira/browse/BEAM-13189
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-py-common, io-py-files
>            Reporter: Eugene Nikolaiev
>            Assignee: Eugene Nikolaiev
>            Priority: P2
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The existing TextIO connector can be used to split CSV or tab-delimited files into lines, because it can read large files in parallel and rebalance the work. Each line can then be parsed separately with the {{csv}} library. This works only if no line delimiters occur inside the lines; otherwise the lines are split incorrectly.
> One tab-delimited dialect uses an escape character (usually a backslash) to escape the line and column delimiters instead of quoting the columns. Such data can be parsed with the Python {{csv}} library using the [escapechar|https://docs.python.org/3/library/csv.html#csv.Dialect.escapechar] dialect parameter.
> The escape character itself can also be escaped, which allows a literal escape character to appear immediately before a line delimiter.
> An example of this file format in use: [Adobe Analytics Data Feed|https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-spec-chars.html?lang=en]
> It would be nice if the TextIO transforms {{ReadFromText}} and {{ReadAllFromText}} supported {{escapechar}} as follows:
>  
> {code:python}
> import csv
> import tempfile
> import apache_beam as beam
> with tempfile.NamedTemporaryFile('w') as temp_file:
>   # Write tab-delimited lines; the first one contains an escaped line terminator
>   temp_file.write('a\\\na\taa\n')
>   temp_file.write('bb\tbb\n')
>   temp_file.flush()
>   # Read and print lines
>   with beam.Pipeline() as pipeline:
>     (
>       pipeline
>       | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
>       | beam.Map(lambda x: print(repr(x)))
>     )
>   # Read lines, parse and print TSV rows
>   with beam.Pipeline() as pipeline:
>     (
>       pipeline
>       | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
>       | beam.Map(lambda x: next(csv.reader([x], escapechar='\\', delimiter='\t')))
>       | beam.Map(lambda x: print(repr(x)))
>     )
> {code}
> This would print:
> {code:java}
> 'a\\\na\taa'
> 'bb\tbb'
> ['a\na', 'aa']
> ['bb', 'bb']
> {code}
>  
>  
>  
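The splitting rule described in the issue can be sketched in plain Python, without Beam. This is only an illustration under stated assumptions: split_escaped_lines and parse_row are hypothetical helpers, not part of Apache Beam and not the implementation behind the proposed escapechar parameter, and the sketch assumes a single-character escape character and line delimiter. A line delimiter is treated as escaped when it is preceded by an odd number of escape characters, which also covers the case where the escape character itself is escaped.

```python
import csv


def split_escaped_lines(text, escapechar='\\', delimiter='\n'):
    """Split text on delimiter, skipping delimiters that are escaped.

    Hypothetical helper for illustration only; not part of Apache Beam.
    Assumes escapechar and delimiter are single characters.
    """
    lines = []
    start = 0
    pos = text.find(delimiter)
    while pos != -1:
        # Count the run of escape characters directly before the delimiter.
        run = 0
        while pos - run - 1 >= start and text[pos - run - 1] == escapechar:
            run += 1
        if run % 2 == 0:
            # Even run: each escape character is itself escaped,
            # so this delimiter really ends the line.
            lines.append(text[start:pos])
            start = pos + len(delimiter)
            pos = text.find(delimiter, start)
        else:
            # Odd run: the delimiter is escaped and stays inside the line.
            pos = text.find(delimiter, pos + len(delimiter))
    if start < len(text):
        lines.append(text[start:])  # last line without a trailing delimiter
    return lines


def parse_row(line):
    """Parse one escaped, tab-delimited line with the csv module."""
    return next(csv.reader([line], escapechar='\\', delimiter='\t'))


# Same content as the temporary file in the issue description.
data = 'a\\\na\taa\nbb\tbb\n'

# Naive splitting breaks the first record into two lines.
naive = data.split('\n')[:-1]
print(naive)    # ['a\\', 'a\taa', 'bb\tbb']

# Escape-aware splitting keeps the escaped line terminator inside the line.
escaped = split_escaped_lines(data)
print(escaped)  # ['a\\\na\taa', 'bb\tbb']

rows = [parse_row(line) for line in escaped]
print(rows)     # [['a\na', 'aa'], ['bb', 'bb']]
```

The naive split yields three lines for two records, which is the incorrect behavior the issue describes. The escape-aware split yields the two lines shown in the expected output above, and each of them parses into the expected row.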



--
This message was sent by Atlassian Jira
(v8.20.1#820001)