You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nifi.apache.org by joemeszaros <gi...@git.apache.org> on 2015/09/10 13:32:57 UTC

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

GitHub user joemeszaros opened a pull request:

    https://github.com/apache/nifi/pull/87

    NIFI 944: Support tab & special delimiter chars in csv2avro converter

    The original ConvertCSVToAvro implementation was unable to process tab separated files, because "\t" as delimiter character was invalid, coming from the UI, entered by the user. The validator for this property only allowed single characters and the "\t" input was not unescaped before validation. The current version allows special, escaped characters like \t and \f (form-feed) for delimiter, quote and escape character. 
    
    ![nifi_csv2avro](https://cloud.githubusercontent.com/assets/1064211/9787146/4d87a338-57c0-11e5-88a6-6fe55a38bcfe.png)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ImpressTV/nifi NIFI-944

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/87.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #87
    
----
commit a079e38a47ab9fd9e91d57668b7fbd9f55b82467
Author: Joe <jo...@impresstv.com>
Date:   2015-09-08T15:13:14Z

    Tab & special delimiter chars in csv2avro converter
    
    The original ConvertCSVToAvro implementation was unable to process tab separated files, because "\t" as delimiter character was invalid. The validator for this property only allowed single characters and the "\t" input was not unescaped before validation. The current version allows special, escaped characters like \t and \f (form-feed).

commit b3165b08fa44c6c99d3cc859b52c5af905272f8f
Author: Joe <jo...@impresstv.com>
Date:   2015-09-09T08:34:29Z

    Set allowable values for bool properties in csv2avro converter

commit f7d9547db46c8b9381261280bd92e3260e19e78d
Author: Joe <jo...@impresstv.com>
Date:   2015-09-10T11:15:20Z

    NIFI-944: Csv2avro converter unit test with tab separated file

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

Posted by joemeszaros <gi...@git.apache.org>.
Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/87#issuecomment-149570327
  
    Updated the code according to your comment. 
    
    For your second issue: I do not think, that manual restriction of special characters is a good idea, because we can not take into account many-many use cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

Posted by trkurc <gi...@git.apache.org>.
Github user trkurc commented on the pull request:

    https://github.com/apache/nifi/pull/87#issuecomment-148938140
  
    out of curiosity, I noticed you moved unescapeString to a method between pull requests - is there a reason you pulled it out?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/nifi/pull/87


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

Posted by trkurc <gi...@git.apache.org>.
Github user trkurc commented on the pull request:

    https://github.com/apache/nifi/pull/87#issuecomment-149768870
  
    @joemeszaros I am going to open a bug ticket which sanity checks across the properties. I don't believe your pr exacerbates the issue, so should not be blocked by it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

Posted by trkurc <gi...@git.apache.org>.
Github user trkurc commented on the pull request:

    https://github.com/apache/nifi/pull/87#issuecomment-149767800
  
    I wrote a test which exhausts the unicode code points. \u0000 is the most problematic as it threw exceptions
    ```
    java.lang.UnsupportedOperationException: The separator character must be defined!
            at au.com.bytecode.opencsv.CSVParser.<init>(CSVParser.java:149) ~[na:na]
            at au.com.bytecode.opencsv.CSVReader.<init>(CSVReader.java:194) ~[na:na]
            at org.kitesdk.data.spi.filesystem.CSVUtil.newReader(CSVUtil.java:54) ~[na:na]
            at org.kitesdk.data.spi.filesystem.CSVFileReader.initialize(CSVFileReader.java:120) ~[na:na]
            at org.apache.nifi.processors.kite.ConvertCSVToAvro$2.process(ConvertCSVToAvro.java:244) ~[na:na]
            at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2155) ~[nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.processors.kite.ConvertCSVToAvro.onTrigger(ConvertCSVToAvro.java:239) ~[na:na]
            at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1077) ~[nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:127) [nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:49) [nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:119) [nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
            at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
            at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
    ```
    Also noted is a problem existing in the current implementation in 0.3.0, the processor doesn't validate that the escape character, delimiter and quote characters are different. 
    
    ```
    java.lang.UnsupportedOperationException: The separator, quote, and escape characters must be different!
            at au.com.bytecode.opencsv.CSVParser.<init>(CSVParser.java:146) ~[na:na]
            at au.com.bytecode.opencsv.CSVReader.<init>(CSVReader.java:194) ~[na:na]
            at org.kitesdk.data.spi.filesystem.CSVUtil.newReader(CSVUtil.java:54) ~[na:na]
            at org.kitesdk.data.spi.filesystem.CSVFileReader.initialize(CSVFileReader.java:120) ~[na:na]
            at org.apache.nifi.processors.kite.ConvertCSVToAvro$2.process(ConvertCSVToAvro.java:244) ~[na:na]
            at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2155) ~[nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.processors.kite.ConvertCSVToAvro.onTrigger(ConvertCSVToAvro.java:239) ~[na:na]
            at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1077) ~[nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:127) [nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:49) [nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:119) [nifi-framework-core-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
            at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
            at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
    ```
    
    I had a very challenging time programatically creating test data between \ud800 and \udffff of all things due to TestUtil when I did this (and delimiterCodePoint was in that range):
    
    ```java
                    s.setLength(0);
                    s.append("1").appendCodePoint(delimiterCodePoint).append("green").append('\n');
                    s.appendCodePoint(delimiterCodePoint).append("blue").appendCodePoint(delimiterCodePoint).append('\n');
                    s.append("2").appendCodePoint(delimiterCodePoint).append("grey").appendCodePoint(delimiterCodePoint).append("12.95");
                    
                    logger.info(StringEscapeUtils.escapeJava(s.toString()));
                    runner.enqueue(streamFor(s.toString()));
    ```
    and got this stack trace:
    
    ```
    java.nio.charset.MalformedInputException: Input length = 1
    	at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    	at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
    	at org.apache.nifi.processors.kite.TestUtil.bytesFor(TestUtil.java:97)
    	at org.apache.nifi.processors.kite.TestUtil.streamFor(TestUtil.java:91)
    	at org.apache.nifi.processors.kite.TestUtil.streamFor(TestUtil.java:87)
    	at org.apache.nifi.processors.kite.TestCSVToAvroProcessor.testAllCharacterSeparatedConversion(TestCSVToAvroProcessor.java:130)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:497)
    	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
    	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
    	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
    	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
    	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

Posted by trkurc <gi...@git.apache.org>.
Github user trkurc commented on the pull request:

    https://github.com/apache/nifi/pull/87#issuecomment-149617912
  
    @joemeszaros  - I'm not sure I disagree with you... however I was a bit troubled by the \n case when testing, and unicode is full of control characters (e.g. LRM (U+200E)) and I'm not sure what the behavior should be if one of those is chosen. My concern is having behavior that we're not expecting, but someone builds a flow that depends on the unexpected behavior and when we "fix" it, their flow breaks. We could write a unit/regression test that exhausts the character set for consistent behavior, but I'm not sure how I feel about that either.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nifi pull request: NIFI 944: Support tab & special delimiter chars...

Posted by trkurc <gi...@git.apache.org>.
Github user trkurc commented on the pull request:

    https://github.com/apache/nifi/pull/87#issuecomment-148940719
  
    It looks good, with the exception of a couple issues. First off, adding the allowableValues on the HAS_HEADER property descriptor actually could break flows already using this processor - e.g. If I have the value "False", which validated previously, it breaks with this change. I added a ticket (NIFI-1044) which sort of addresses this issue as a whole, if there is a compatibility break, we may want to hold that change. 
    
    Secondly, there should probably be some sanity checking on the delimiter value now. It is pretty cool that I can use \t, or even \u0009, and I used \u0555 which totally worked. However, I can also use \n, which probably shouldn't be allowed (I'm not even sure what would be expected)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---