You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "Etienne Chauchot (JIRA)" <ji...@apache.org> on 2017/08/25 12:52:00 UTC

[jira] [Comment Edited] (BEAM-2802) TextIO should allow specifying a custom delimiter

    [ https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141566#comment-16141566 ] 

Etienne Chauchot edited comment on BEAM-2802 at 8/25/17 12:51 PM:
------------------------------------------------------------------

I have read the PR above and I agree with your comments  in particular the ones on performance and maintenance (complex generic code). Comparing to the above PR what I have started coding:
- performance: I only add one loop in {{findSeparatorBounds()}} to iterate over the bytes of the custom delimiter if it is set. If it is not set, the current code (\r\n) is called.
- maintenance: I decided not to make the code generic to support parsing both {{\r\n}} and the custom delimiter but rather to keep the code path as simple as possible with simple {{if (customDelimiter != null)}} (see above)
- no {{Set}} of delimiters but rather a {{byte[]}} delimiter that is either set or not set. 
- there is only one delimiter at a time, either {{\r\n}} or custom delimiter because we do not want to split twice if there is both new lines and custom separator in the text file in particular if a record is spread across more than one line

One could argue that the custom delimiter is not needed because we could split using new lines and then use a {{ParDo}} to split once again using the custom delimiter. But if a record is spread between multi-line, then this approach will generate 2 records in the output Collection.
 



was (Author: echauchot):
I have read the PR above and I agree with your comments  in particular the ones on performance and maintenance (complex generic code). Comparing to the above PR what I have started coding:
- performance: I only add one loop in {{findSeparatorBounds()}} to iterate over the bytes of the custom delimiter if it is set. If it is not set, the current code (\r\n) is called.
- maintenance: I decided not to make the code generic to support parsing both {{\r\n}} and the custom delimiter but rather to keep the code path as simple as possible with simple {{if (customDelimiter != null)}} (see above)
- no set of delimiters but rather a {{byte[]}} delimiter that is either set or not set. 
- there is only one delimiter at a time, either {{\r\n}} or custom delimiter because we do not want to split twice if there is both new lines and custom separator in the text file in particular if a record is spread across more than one line

One could argue that the custom delimiter is not needed because we could split using new lines and then use a {{ParDo}} to split once again using the custom delimiter. But if a record is spread between multi-line, then this approach will generate 2 records in the output Collection.
 


> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
>                 Key: BEAM-2802
>                 URL: https://issues.apache.org/jira/browse/BEAM-2802
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
>
> Currently TextIO use {{\r}} {{\n}} or {{\r\n}} or a mix of the two to split a text file into PCollection elements. It might happen that a record is spread across more than one line. In that case we should be able to specify a custom record delimiter to be used in place of the default ones.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)