Posted to issues@nifi.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2019/09/19 15:03:00 UTC

[jira] [Commented] (NIFI-6640) Schema Inference of UNION/CHOICE types not handled correctly

    [ https://issues.apache.org/jira/browse/NIFI-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933474#comment-16933474 ] 

ASF subversion and git services commented on NIFI-6640:
-------------------------------------------------------

Commit 34112519c2dde19d704ef624e62e51b399cf1ce7 in nifi's branch refs/heads/master from Tamas Palfy
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=3411251 ]

NIFI-6640 - UNION/CHOICE types not handled correctly
3 important changes:
1. FieldTypeInference had a bug when dealing with multiple datatypes for
 the same field where some (but not all) were in a wider-than-the-other
 relationship.
 Before: Some datatypes could be lost. String was wider than any other.
 After: Consistent behaviour. String is NOT wider than any other.
2. Choosing a datatype for a value from a ChoiceDataType:
 Before: it chose the first compatible datatype as the basis of conversion.
 After: it tries to find the most suitable datatype.
3. Conversion of a value of avro union type:
 Before: it chose the first compatible datatype as the basis of conversion.
 After: it tries to find the most suitable datatype.

Change: In the RecordFieldType enum moved TIMESTAMP ahead of DATE.

This closes #3724.

Signed-off-by: Mark Payne <ma...@hotmail.com>

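To illustrate change 1, here is a minimal, self-contained sketch (hypothetical names, not NiFi's actual FieldTypeInference or DataTypeUtils code) of inference where numeric widening still applies but STRING is no longer treated as wider than the other types, so a column that mixes ints, floats and a string keeps STRING as a CHOICE sub-type:

{code}
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, standalone sketch of the widening rule from change 1 above
// (not NiFi's actual inference code).
public class WideningSketch {

    enum SimpleType { INT, FLOAT, STRING }

    // FLOAT is wider than INT; STRING is deliberately NOT wider than anything.
    static boolean isWiderThan(SimpleType wider, SimpleType narrower) {
        return wider == SimpleType.FLOAT && narrower == SimpleType.INT;
    }

    // Naive per-value type detection, good enough for the sketch.
    static SimpleType typeOf(String raw) {
        try { Integer.parseInt(raw); return SimpleType.INT; } catch (NumberFormatException e) { /* not an int */ }
        try { Float.parseFloat(raw); return SimpleType.FLOAT; } catch (NumberFormatException e) { /* not a float */ }
        return SimpleType.STRING;
    }

    public static void main(String[] args) {
        // The "Value" column from the CSV in the issue description below.
        String[] valueColumn = {"3", "3.75", "3.85", "8", "2.0", "4.0", "some_string"};

        Set<SimpleType> inferred = new LinkedHashSet<>();
        for (String raw : valueColumn) {
            SimpleType candidate = typeOf(raw);
            // Skip the candidate if an already-inferred type is wider than it;
            // otherwise add it and drop any narrower types it now covers.
            boolean covered = inferred.stream().anyMatch(t -> isWiderThan(t, candidate));
            if (!covered) {
                inferred.removeIf(t -> isWiderThan(candidate, t));
                inferred.add(candidate);
            }
        }
        // Prints [FLOAT, STRING]: INT is widened into FLOAT, while STRING is kept
        // as a separate sub-type instead of swallowing the numeric types.
        System.out.println("Inferred CHOICE sub-types: " + inferred);
    }
}
{code}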

> Schema Inference of UNION/CHOICE types not handled correctly
> ------------------------------------------------------------
>
>                 Key: NIFI-6640
>                 URL: https://issues.apache.org/jira/browse/NIFI-6640
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Tamas Palfy
>            Assignee: Tamas Palfy
>            Priority: Major
>              Labels: Record, inference, schema
>         Attachments: NIFI-6640.template.xml
>
>          Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> When reading the following CSV:
> {code}
> Id|Value
> 1|3
> 2|3.75
> 3|3.85
> 4|8
> 5|2.0
> 6|4.0
> 7|some_string
> {code}
> and routing it through a {{ConvertRecord}} processor, the following exception is thrown:
> {code}
> 2019-09-06 18:25:48,936 ERROR [Timer-Driven Process Thread-2] o.a.n.processors.standard.ConvertRecord ConvertRecord[id=07635c71-016d-1000-3847-ff916164b32a] Failed to process StandardFlowFileRecord[uuid=4b4ab01a-b349-4f83-9b25-6a58d0b297c1,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1567786888281-1, container=default, section=1], offset=326669, length=56],offset=0,name=4b4ab01a-b349-4f83-9b25-6a58d0b297c1,size=56]; will route to failure: org.apache.nifi.processor.exception.ProcessException: Could not parse incoming data
> org.apache.nifi.processor.exception.ProcessException: Could not parse incoming data
>         at org.apache.nifi.processors.standard.AbstractRecordProcessor$1.process(AbstractRecordProcessor.java:170)
>         at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2925)
>         at org.apache.nifi.processors.standard.AbstractRecordProcessor.onTrigger(AbstractRecordProcessor.java:122)
>         at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
>         at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
>         at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:205)
>         at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.nifi.serialization.MalformedRecordException: Error while getting next record. Root cause: org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value [some_string] of type class java.lang.String for field Value to any of the following available Sub-Types for a Choice: [FLOAT, INT]
>         at org.apache.nifi.csv.CSVRecordReader.nextRecord(CSVRecordReader.java:119)
>         at org.apache.nifi.serialization.RecordReader.nextRecord(RecordReader.java:50)
>         at org.apache.nifi.processors.standard.AbstractRecordProcessor$1.process(AbstractRecordProcessor.java:156)
>         ... 13 common frames omitted
> Caused by: org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value [some_string] of type class java.lang.String for field Value to any of the following available Sub-Types for a Choice: [FLOAT, INT]
>         at org.apache.nifi.serialization.record.util.DataTypeUtils.convertType(DataTypeUtils.java:166)
>         at org.apache.nifi.serialization.record.util.DataTypeUtils.convertType(DataTypeUtils.java:116)
>         at org.apache.nifi.csv.AbstractCSVRecordReader.convert(AbstractCSVRecordReader.java:86)
>         at org.apache.nifi.csv.CSVRecordReader.nextRecord(CSVRecordReader.java:105)
>         ... 15 common frames omitted
> {code}
> The problem is that {{FieldTypeInference}} keeps both a list of {{possibleDataTypes}} and a {{singleDataType}}. As long as an added dataType is not in a "wider" relationship with the previously seen types, it is added to {{possibleDataTypes}}; but once a "wider" type is added, it only gets set as the {{singleDataType}} while {{possibleDataTypes}} remains intact.
> However, when the actual dataType is determined, {{possibleDataTypes}} is used whenever it is not null and the {{singleDataType}} is _ignored_.
> So in our example a {{FieldTypeInference}} with (FLOAT, INT) as {{possibleDataTypes}} and STRING as {{singleDataType}} is created, FLOAT or INT is chosen, and the writer then tries to write "some_string" as a float or integer.
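> In pseudo-Java, the bookkeeping described above looks roughly like this (a simplified sketch of the described behaviour, not the actual {{FieldTypeInference}} source):
> {code}
> import java.util.LinkedHashSet;
> import java.util.Set;
> 
> // Simplified sketch of the bookkeeping described above (not the real class).
> class FieldTypeInferenceSketch {
>     private final Set<String> possibleDataTypes = new LinkedHashSet<>();
>     private String singleDataType;
> 
>     // widerThanAllPrevious stands in for the real "wider" comparison.
>     void addPossibleDataType(String dataType, boolean widerThanAllPrevious) {
>         if (widerThanAllPrevious) {
>             // The "wider" type only replaces singleDataType;
>             // possibleDataTypes keeps its old entries.
>             singleDataType = dataType;
>         } else {
>             possibleDataTypes.add(dataType);
>         }
>     }
> 
>     String toDataType() {
>         // A non-empty possibleDataTypes wins and singleDataType is ignored.
>         return possibleDataTypes.isEmpty()
>                 ? singleDataType
>                 : "CHOICE" + possibleDataTypes;
>     }
> 
>     public static void main(String[] args) {
>         FieldTypeInferenceSketch inference = new FieldTypeInferenceSketch();
>         inference.addPossibleDataType("INT", false);
>         inference.addPossibleDataType("FLOAT", false);
>         inference.addPossibleDataType("STRING", true); // STRING treated as "wider"
>         System.out.println(inference.toDataType());    // CHOICE[INT, FLOAT] - STRING is lost
>     }
> }
> {code}
> With the fix in the commit above, STRING is no longer treated as "wider", so it lands in {{possibleDataTypes}} and the resolved CHOICE keeps it.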
> ----
> Also there is an issue with the handling of multiple datatypes when _writing_ data.
> When multiple datatypes are possible, a so-called CHOICE datatype is assigned in the inferred schema. This contains the possible datatypes in a list.
> However, most (if not all) of the time, when a concrete datatype is chosen for a given value during writing (tested with the JSON and Avro writers), the first matching type is selected from the list. And in the current implementation all number types match all numbers, so 3.75 may be written as an INT, resulting in data loss.
> The problem is that the type list is not in any particular order _and_ the first matching type is chosen.
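> As a sketch of the difference (hypothetical helper methods, not the actual writer or {{DataTypeUtils}} code): with a CHOICE of [INT, FLOAT], a first-match strategy coerces 3.75 to INT and loses the fraction, while a strategy that prefers the most suitable sub-type keeps FLOAT:
> {code}
> import java.util.List;
> 
> // Hypothetical sketch of picking a sub-type from a CHOICE (not NiFi code).
> public class ChoiceSelectionSketch {
> 
>     enum SubType { INT, FLOAT }
> 
>     // In this sketch every numeric string "matches" every numeric sub-type,
>     // mirroring the lenient compatibility check described above.
>     static boolean isCompatible(String value, SubType type) {
>         try { Double.parseDouble(value); return true; } catch (NumberFormatException e) { return false; }
>     }
> 
>     // Stricter check: an INT sub-type is only a fit if no fraction is lost.
>     static boolean isExactFit(String value, SubType type) {
>         double d = Double.parseDouble(value);
>         return type != SubType.INT || d == Math.rint(d);
>     }
> 
>     public static void main(String[] args) {
>         List<SubType> choice = List.of(SubType.INT, SubType.FLOAT);
>         String value = "3.75";
> 
>         // First-match strategy: INT matches first, so 3.75 would be written as 3.
>         SubType firstMatch = choice.stream()
>                 .filter(t -> isCompatible(value, t))
>                 .findFirst().orElseThrow();
> 
>         // Most-suitable strategy: prefer a sub-type that loses no information,
>         // falling back to the first compatible one; here it picks FLOAT.
>         SubType mostSuitable = choice.stream()
>                 .filter(t -> isExactFit(value, t))
>                 .findFirst()
>                 .orElse(firstMatch);
> 
>         System.out.println("first match:   " + firstMatch);    // INT
>         System.out.println("most suitable: " + mostSuitable);  // FLOAT
>     }
> }
> {code}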



--
This message was sent by Atlassian Jira
(v8.3.4#803005)