You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by Cbro <gi...@git.apache.org> on 2014/12/02 10:40:14 UTC

[GitHub] incubator-flink pull request: Mk amulti char delim

GitHub user Cbro opened a pull request:

    https://github.com/apache/incubator-flink/pull/247

    Mk amulti char delim

    [FLINK-1168] Added support for multi-char delimiters + added new tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Cbro/incubator-flink MKAmultiCharDelim

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-flink/pull/247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #247
    
----
commit 21394aae4912d49ea53696bab449f2e5ee766632
Author: Fabian Hueske <fh...@apache.org>
Date:   2014-10-20T13:18:20Z

    [FLINK-1168] Added support for multi-char delimiters. Missing: Tests with multi-char delims, API

commit 16c6dbade208e5feca8789347d52f3efb1f33650
Author: Fabian Hueske <fh...@apache.org>
Date:   2014-10-20T13:18:20Z

    [FLINK-1168] Added support for multi-char delimiters. Missing: Tests with multi-char delims, API
    
    Modified the i+1 to 1+field_delim.length

commit 33fef529b12d8bf834fdf7f1d636bce67a34e10d
Author: Cbro <ma...@gmail.com>
Date:   2014-11-29T15:13:56Z

    [JIRA 1168] Modified code to allow multiple character field delimiters in CSV fields.

commit 2360c0b9744b58d924cdce5e40ca52263a215b81
Author: Cbro <ma...@gmail.com>
Date:   2014-11-29T17:38:02Z

    [JIRA 1168] Modified to obey style guide. If statement had missing braces.

commit 51fd765caafc66f2eeefb2c5dc44e3c18e57408e
Author: Cbro <ma...@gmail.com>
Date:   2014-12-01T14:47:54Z

    [JIRA 1168] Added new tests.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-flink pull request: Mk amulti char delim

Posted by fhueske <gi...@git.apache.org>.
Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/incubator-flink/pull/247#discussion_r21404017
  
    --- Diff: flink-core/src/test/java/org/apache/flink/types/parser/ParserTestBase.java ---
    @@ -307,4 +310,75 @@ public void testStaticParseMethodWithInvalidValues() {
     		
     		return result;
     	}
    +
    +
    +	private static byte[] concatenateMulti(String[] values, char[] delimiter, boolean delimiterAtEnd) {
    +		int len = 0;
    +		for (String s : values) {
    +			len += s.length() + delimiter.length;
    +		}
    +
    +		if (!delimiterAtEnd) {
    +			len -= delimiter.length;
    +		}
    +
    +		int currPos = 0;
    +		byte[] result = new byte[len];
    +
    +		for (int i = 0; i < values.length; i++) {
    +			String s = values[i];
    +
    +			byte[] bytes = s.getBytes();
    +			int numBytes = bytes.length;
    +			System.arraycopy(bytes, 0, result, currPos, numBytes);
    +			currPos += numBytes;
    +
    +			if (delimiterAtEnd || i < values.length-1) {
    +				for(int k=0; k< delimiter.length; k++)
    +					result[currPos++] = (byte) delimiter[k];
    +			}
    +		}
    +		return result;
    +	}
    +
    +
    +	@Test
    +	public void testConcatenatedMultiCharDelim() {
    +		try {
    +			String[] testValues = getValidTestValues();
    +			T[] results = getValidTestResults();
    +
    +			byte[] allBytesWithDelimiter = concatenateMulti(testValues,  new char[] {'|','*','|'}, true);
    +			byte[] allBytesNoDelimiterEnd = concatenateMulti(testValues, new char[] {'|','*','|'}, false);
    +
    +			FieldParser<T> parser1 = getParser();
    +			FieldParser<T> parser2 = getParser();
    +
    +			T val1 = parser1.createValue();
    +			T val2 = parser2.createValue();
    +
    +			int pos1 = 0;
    +			int pos2 = 0;
    +
    +			for (int i = 0; i < results.length; i++) {
    +				pos1 = parser1.parseField(allBytesWithDelimiter, pos1,  allBytesWithDelimiter.length,   new char[] {'|','*','|'}, val1);
    +				System.out.println(parser1.getLastResult().toString());
    +				assertTrue("1. Parser declared the valid value " + testValues[i] + " as invalid.", pos1 != -1);
    +				T result1 = parser1.getLastResult();
    +				assertEquals("1. Parser parsed wrong.", results[i], result1);
    +
    +				pos2 = parser2.parseField(allBytesNoDelimiterEnd, pos2, allBytesNoDelimiterEnd.length,  new char[] {'|','*','|'}, val2);
    +				System.out.println(parser2.getLastResult().toString());
    --- End diff --
    
    Please remove print statement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-flink pull request: Mk amulti char delim

Posted by fhueske <gi...@git.apache.org>.
Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/incubator-flink/pull/247#discussion_r21404005
  
    --- Diff: flink-core/src/test/java/org/apache/flink/types/parser/ParserTestBase.java ---
    @@ -307,4 +310,75 @@ public void testStaticParseMethodWithInvalidValues() {
     		
     		return result;
     	}
    +
    +
    +	private static byte[] concatenateMulti(String[] values, char[] delimiter, boolean delimiterAtEnd) {
    +		int len = 0;
    +		for (String s : values) {
    +			len += s.length() + delimiter.length;
    +		}
    +
    +		if (!delimiterAtEnd) {
    +			len -= delimiter.length;
    +		}
    +
    +		int currPos = 0;
    +		byte[] result = new byte[len];
    +
    +		for (int i = 0; i < values.length; i++) {
    +			String s = values[i];
    +
    +			byte[] bytes = s.getBytes();
    +			int numBytes = bytes.length;
    +			System.arraycopy(bytes, 0, result, currPos, numBytes);
    +			currPos += numBytes;
    +
    +			if (delimiterAtEnd || i < values.length-1) {
    +				for(int k=0; k< delimiter.length; k++)
    +					result[currPos++] = (byte) delimiter[k];
    +			}
    +		}
    +		return result;
    +	}
    +
    +
    +	@Test
    +	public void testConcatenatedMultiCharDelim() {
    +		try {
    +			String[] testValues = getValidTestValues();
    +			T[] results = getValidTestResults();
    +
    +			byte[] allBytesWithDelimiter = concatenateMulti(testValues,  new char[] {'|','*','|'}, true);
    +			byte[] allBytesNoDelimiterEnd = concatenateMulti(testValues, new char[] {'|','*','|'}, false);
    +
    +			FieldParser<T> parser1 = getParser();
    +			FieldParser<T> parser2 = getParser();
    +
    +			T val1 = parser1.createValue();
    +			T val2 = parser2.createValue();
    +
    +			int pos1 = 0;
    +			int pos2 = 0;
    +
    +			for (int i = 0; i < results.length; i++) {
    +				pos1 = parser1.parseField(allBytesWithDelimiter, pos1,  allBytesWithDelimiter.length,   new char[] {'|','*','|'}, val1);
    +				System.out.println(parser1.getLastResult().toString());
    --- End diff --
    
    Please remove print statement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-flink pull request: Mk amulti char delim

Posted by fhueske <gi...@git.apache.org>.
Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/incubator-flink/pull/247#discussion_r21403969
  
    --- Diff: flink-core/src/main/java/org/apache/flink/types/parser/StringValueParser.java ---
    @@ -38,15 +38,16 @@
     	private StringValue result;
     	
     	@Override
    -	public int parseField(byte[] bytes, int startPos, int length, char delim, StringValue reusable) {
    +	public int parseField(byte[] bytes, int startPos, int length, char[] delim, StringValue reusable) {
     		
     		this.result = reusable;
    -		
     		int i = startPos;
    -		
    -		final byte delByte = (byte) delim;
     		byte current;
    -		
    +
    +
    +		String s = new String(bytes);
    +		System.out.println("[StringValueParser] Input text:  "+ s);
    --- End diff --
    
    Please remove debugging statements like these.
    Unit tests should not write to the console.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-flink pull request: Mk amulti char delim

Posted by fhueske <gi...@git.apache.org>.
Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/incubator-flink/pull/247#discussion_r21409043
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -358,7 +359,8 @@ protected int skipFields(byte[] bytes, int startPos, int limit, char delim) {
     				i++; // the quote
     				
     				// skip trailing whitespace characters 
    -				while (i < limit && (current = bytes[i]) != delByte) {
    +				while (i <= limit-delim.length && !FieldParser.delimiterNext(bytes, i, delim)) {
    +					current = bytes[i];
     					if (current == ' ' || current == '\t') {
     						i++;
     					}
    --- End diff --
    
    The `return (i == limit ? limit : i+1)` statement does not seem to be right. This case should be checked with a test case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-flink pull request: Mk amulti char delim

Posted by fhueske <gi...@git.apache.org>.
Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/incubator-flink/pull/247#discussion_r21409066
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java ---
    @@ -375,7 +377,7 @@ protected int skipFields(byte[] bytes, int startPos, int limit, char delim) {
     		}
     		else {
     			// unquoted field
    -			while (i < limit && bytes[i] != delByte) {
    +			while (i <= limit-delim.length && !FieldParser.delimiterNext(bytes, i, delim)) {
    --- End diff --
    
    The `return (i == limit ? limit : i+1)` statement does not seem to be right. This case should be checked with a test case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---