You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by rekhajoshm <gi...@git.apache.org> on 2016/06/01 20:23:58 UTC

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

GitHub user rekhajoshm opened a pull request:

    https://github.com/apache/flink/pull/2060

    [FLINK-3921] StringParser encoding

    Corrected StringParser encoding

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rekhajoshm/flink FLINK-3921

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2060.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2060
    
----
commit 1e1ce9efefccf5a5585be802af4200e9d1ae7a98
Author: Rekha Joshi <re...@gmail.com>
Date:   2016-06-01T20:03:50Z

    Merge pull request #1 from apache/master
    
    Apache Flink master pull

commit 675b6a44e76ae71901bc6d4eaea1d09b6f789ff6
Author: Joshi <re...@gmail.com>
Date:   2016-06-01T20:26:47Z

    [FLINK-3921] StringParser encoding

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2060#discussion_r78203358
  
    --- Diff: flink-core/src/main/java/org/apache/flink/types/parser/FieldParser.java ---
    @@ -75,8 +76,30 @@
     		/** Invalid Boolean value **/
     		BOOLEAN_INVALID
     	}
    -	
    +
    +	private Charset charset = Charset.forName("US-ASCII");
    +
     	private ParseErrorState errorState = ParseErrorState.NONE;
    +
    +	/**
    +	 * Parses the value of a field from the byte array.
    +	 * The start position within the byte array and the array's valid length is given.
    +	 * The content of the value is delimited by a field delimiter.
    +	 *
    +	 * @param bytes The byte array that holds the value.
    +	 * @param startPos The index where the field starts
    +	 * @param limit The limit unto which the byte contents is valid for the parser. The limit is the
    +	 *              position one after the last valid byte.
    +	 * @param delim The field delimiter character
    +	 * @param reuse An optional reusable field to hold the value
    +	 * @param charset The charset to parse with
    +	 *
    +	 * @return The index of the next delimiter, if the field was parsed correctly. A value less than 0 otherwise.
    +	 */
    +	public int parseField(byte[] bytes, int startPos, int limit, byte[] delim, T reuse, Charset charset){
    +		this.charset = charset;
    --- End diff --
    
    Is this method needed? `GenericCsvInputFormat.open` can `setCharset` on each newly instantiated `FieldParser`, and in the case where a user decided to change charset on an open file `GenericCsvInputFormat.setCharset` could go through and `setCharset` on the list of `FieldParser`s.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    @rekhajoshm have you had a chance to look again at this PR? The 1.2 feature freeze will be in a few weeks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2060#discussion_r78199594
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class<?>[] fieldTypes) {
     		this.fieldIncluded = includedMask;
     	}
     
    +	/**
    +	 * Gets the Charset for the parser.Default is set to UTF-8
    +	 *
    +	 * @return The charset for the parser.
    +	 */
    +	public Charset getCharset() {
    +		return this.charset;
    +	}
    +
    +	/**
    +	 * Sets the charset of the parser. Called by subclasses of the parser to set the type of charset
    +	 * when doing a parse.
    +	 *
    +	 * @param charset The charset  to set.
    +	 */
    +	protected void setCharset(Charset charset){
    --- End diff --
    
    Add space between `)` and `{` (and I count two more instances).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Thanks @greghogan for your review. Please have a look. thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2060#discussion_r78199365
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class<?>[] fieldTypes) {
     		this.fieldIncluded = includedMask;
     	}
     
    +	/**
    +	 * Gets the Charset for the parser.Default is set to UTF-8
    +	 *
    +	 * @return The charset for the parser.
    +	 */
    +	public Charset getCharset() {
    +		return this.charset;
    +	}
    +
    +	/**
    +	 * Sets the charset of the parser. Called by subclasses of the parser to set the type of charset
    +	 * when doing a parse.
    +	 *
    +	 * @param charset The charset  to set.
    +	 */
    +	protected void setCharset(Charset charset){
    +		this.charset = charset != null ? charset : Charset.forName("UTF-8");
    --- End diff --
    
    Use `Preconditions.checkNotNull`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2060#discussion_r78199868
  
    --- Diff: flink-core/src/main/java/org/apache/flink/types/parser/FieldParser.java ---
    @@ -75,8 +76,30 @@
     		/** Invalid Boolean value **/
     		BOOLEAN_INVALID
     	}
    -	
    +
    +	private Charset charset = Charset.forName("US-ASCII");
    --- End diff --
    
    Should this also default to `UFT-8`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Agree @greghogan , we started this just for StringParser encoding :-)
    I am in tight schedule, however did a quick update for comments, please have a look.thank you.
    
    - Removed unused methods on setCommentPrefix
    - Added charset functionality for user to set charset encoding on CsvReader
    - Updated test 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2060#discussion_r71137136
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -106,6 +106,11 @@ protected GenericCsvInputFormat() {
     	protected GenericCsvInputFormat(Path filePath) {
     		super(filePath);
     	}
    +
    +	protected GenericCsvInputFormat(Path filePath, Charset charset) {
    +		super(filePath);
    +		this.charset = charset != null ? charset : Charset.forName("UTF-8");
    --- End diff --
    
    Would this be better as `this.charset = Preconditions.checkNotNull(charset);` since the user should use `GenericCsvInputFormat(Path)` when using the default charset?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    `StringParser` and `StringValueParser` have the method `enableQuotedStringParsing` which is called by `GenericCsvInputFormat.open` only for these types. We should consider doing the same for `setCharset` rather than adding it to `FieldParser`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/2060


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2060#discussion_r78199262
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class<?>[] fieldTypes) {
     		this.fieldIncluded = includedMask;
     	}
     
    +	/**
    +	 * Gets the Charset for the parser.Default is set to UTF-8
    +	 *
    +	 * @return The charset for the parser.
    +	 */
    +	public Charset getCharset() {
    +		return this.charset;
    +	}
    +
    +	/**
    +	 * Sets the charset of the parser. Called by subclasses of the parser to set the type of charset
    +	 * when doing a parse.
    +	 *
    +	 * @param charset The charset  to set.
    --- End diff --
    
    Double space.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2060#discussion_r78199216
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class<?>[] fieldTypes) {
     		this.fieldIncluded = includedMask;
     	}
     
    +	/**
    +	 * Gets the Charset for the parser.Default is set to UTF-8
    --- End diff --
    
    Should we spell out `charset` as "character set" in the documentation? Also, space after period and missing period at end.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Let's make this configurable on `GenericCsvInputFormat`, with the default of "UTF-8" (this is what we use in other places as the default, too).
    
    The charset affects the
      - delimiter
      - comments
      - parsers
    
    I think the `FieldParser` should have a `setCharset()` method. That way, we need not pass the charset to every method call.
    
    This would also need a test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Hi @greghogan sorry, lost thread on this, updated for Preconditions check now. Unless i am missing something, the enableQuotedStringParsing() is present in subclasses as it is slightly different than the parent and involves if-else-casting to be invoked.But in case of setCharset it is same for all possible subclasses and does not need to be.Hope you agree.Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Sorry for delay, got busy. 
    Thank you @StephanEwen for your review.updated.Please have a look.thanks cc @greghogan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Apologies for the long delay. I'd like to attempt to summarize this ticket and pull request to validate my understanding.
    
    Previously StringParser was using the system encoding and `GenericCsvInputFormat` was using UTF-8 for the delimiter and an overloadable UTF-8 for the comment prefix.
    
    StringParser's quoteCharacter remains a `byte` with no encoding.
    
    Now GenericCsvInputFormat can be configured with a charset which is used for the delimiter, comment prefix, and field parsers (only used in StringParser).
    
    Should `setCommentPrefix(String commentPrefix, Charset charset)` and `setCommentPrefix(String commentPrefix, String charsetName)` be removed from `GenericCsvInputFormat`? Would different encodings be used on the same file?
    
    Allow the user to set the character encoding in `CsvReader` which would be applied in `CsvReader.configureInputFormat`?
    
    Are the new tests checking the encoding? The test strings are using using characters common to UTF-8 and ASCII. We could instead use one of the UTF-16 encodings from https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    The only internal usage of `StringParser` is from `GenericCsvInputFormat`. Should we make the encoding configurable in `GenericCsvInputFormat` with a default of US-ASCII? This could then be overridden in an additional constructor of `StringParser`.
    
    Should the same changes be made to `StringValueParser`?
    
    @rekhajoshm @fhueske @StephanEwen :+1: :-1:?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    I am working on merging this ..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by rekhajoshm <gi...@git.apache.org>.
Github user rekhajoshm commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Thanks @greghogan for your inputs. updated. Please have a look. 
    Please note though the configurable Charset will have no impact on StringValueParser(or other parsers) except StringParser.In SVParser we explicitly call setValueAscii(), and hence works with ascii only. @StephanEwen @fhueske 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Hi @rekhajoshm have you had a chance to look at the last two comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---