Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/10 12:19:39 UTC

[GitHub] [druid] FrankChen021 opened a new issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

FrankChen021 opened a new issue #10259:
URL: https://github.com/apache/druid/issues/10259


   ### Affected Version
   
   - 0.17
   - 0.18
   - 0.19
   
   ### Description
   
   One topic in our Kafka cluster contains messages in **pretty-printed** JSON format, as shown below. The latest release, 0.19, fails to parse these messages as JSON objects, while 0.16 handles them correctly.
   
   JSON example
   
   ```
   {
           "byteCount":0,
           "partition":0,
           "recordAge":0,
           "recordCount":0,
           "replicationLatency":0,
           "targetCluster":"dst",
           "timestamp":1597045440490,
           "topic":"test"
   }
   ```
   
   0.16
   
   ![image](https://user-images.githubusercontent.com/6525742/89770984-28592500-db32-11ea-8389-88d53458f909.png)
   ![image](https://user-images.githubusercontent.com/6525742/89771027-3eff7c00-db32-11ea-97d1-a5484178120a.png)
   
   
   0.19
   ![image](https://user-images.githubusercontent.com/6525742/89771163-7b32dc80-db32-11ea-887f-d235c1da76fb.png)
   ![image](https://user-images.githubusercontent.com/6525742/89771277-a4536d00-db32-11ea-9277-1bf6e91c18c7.png)
   
   After changing `Input Format` from the default `Regex` to `Json`, the following error appears.
   
   ![image](https://user-images.githubusercontent.com/6525742/89771008-3313ba00-db32-11ea-9339-f70cc98db878.png)
   
   
   ### Reason
   
   After comparing the code between 0.16 and 0.19, I found that the problem is caused by `JsonReader`, which was introduced in 0.17 by #8823.
   
   The new `JsonReader` inherits from `TextReader`, which uses `LineIterator` to split the input string and returns the text line by line instead of as a whole.
   
   As a result, multi-line JSON text cannot be parsed as a single JSON object.
   
   https://github.com/apache/druid/blob/e2487bcc30c5ac0f4281ddd2dcf8906dcd00cba8/core/src/main/java/org/apache/druid/data/input/TextReader.java#L57
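
To make the failure mode concrete, here is a minimal sketch in plain Java. It does not use Druid's `TextReader` or `LineIterator`: the class `LineSplitDemo` and its helpers are hypothetical, the message is shortened to three fields, and the brace check is only a crude stand-in for a real JSON parser. It shows that once a pretty-printed message is split on line breaks, no single line is a complete JSON object, whereas the untouched payload is.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Druid.
class LineSplitDemo {
    // The pretty-printed message from the issue, shortened to three fields.
    static final String PRETTY = String.join("\n",
        "{",
        "        \"byteCount\":0,",
        "        \"timestamp\":1597045440490,",
        "        \"topic\":\"test\"",
        "}");

    // Crude stand-in for a JSON parser (assumption): a complete object
    // must at least start with '{' and end with '}'.
    static boolean looksLikeCompleteObject(String text) {
        String t = text.trim();
        return t.length() >= 2 && t.startsWith("{") && t.endsWith("}");
    }

    // Mimics a line-oriented reader: every line becomes its own record.
    static List<String> splitIntoLines(String payload) {
        List<String> lines = new ArrayList<>();
        for (String line : payload.split("\n")) {
            lines.add(line);
        }
        return lines;
    }

    // Counts how many of the given records look like complete objects.
    static long countComplete(List<String> records) {
        return records.stream().filter(LineSplitDemo::looksLikeCompleteObject).count();
    }

    public static void main(String[] args) {
        List<String> lines = splitIntoLines(PRETTY);
        System.out.println("records after line splitting: " + lines.size());
        System.out.println("complete objects among them: " + countComplete(lines));
        System.out.println("whole payload is complete: " + looksLikeCompleteObject(PRETTY));
    }
}
```

A newline-delimited payload such as `{"a":1}` followed by `{"a":2}` would pass the same per-line check, which is why the line-oriented reader works for files but not for pretty-printed Kafka messages.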
   
   ### How to fix
   
   Maybe `JsonReader` should override the `intermediateRowIterator` function defined in `TextReader` to return an iterator over a single string containing the whole input.
   
   
   @jihoonson please take a look at this bug when convenient :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-684116519


   @FrankChen021 thank you for reporting. `InputEntity` and `InputSource` are designed to be independent from how data is formatted. Does `InputEntity` have to know if `inputFormat` is `lineSplittable`? Or can we add `lineSplittable` to `JsonInputFormat` instead?




[GitHub] [druid] gianm commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-690521098


   We should be able to determine automatically if a Kafka message or a text file is line-delimited JSON or not. Check out the Jackson ObjectMapper.readValues method: it might already do what we need:
   
   ```
        * Sequence can be either wrapped or unwrapped root-level sequence:
        * wrapped means that the elements are enclosed in JSON Array;
        * and unwrapped that elements are directly accessed at main level.
        * Assumption is that iff the first token of the document is
        * <code>START_ARRAY</code>, we have a wrapped sequence; otherwise
        * unwrapped. For wrapped sequences, leading <code>START_ARRAY</code>
        * is skipped, so that for both cases, underlying {@link JsonParser}
        * will point to what is expected to be the first token of the first
        * element.
   ```
   
   It sounds like it could read a JSON array of objects ("wrapped"), newline delimited objects ("unwrapped"), pretty-printed objects, or even concatenated pretty-printed objects.
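
The idea of reading a sequence of root-level values, regardless of line breaks, can be illustrated without any library. The sketch below is not Jackson and not Druid code: `RootObjectSplitter` is a hypothetical helper that finds top-level JSON objects by tracking brace depth outside string literals, and it assumes well-formed input. It only shows why such a reader handles newline-delimited, pretty-printed, concatenated, and array-wrapped objects alike; whether `readValues` behaves the same way in every case would still need to be verified.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a hand-rolled stand-in, not Jackson's readValues.
class RootObjectSplitter {
    // Returns the text of each top-level JSON object found in the payload.
    // Assumes well-formed JSON whose root is either a bare sequence of
    // objects or a single array of objects.
    static List<String> split(String payload) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // current nesting depth of '{' ... '}'
        int start = -1;           // index where the current top-level object began
        boolean inString = false; // true while inside a JSON string literal
        boolean escaped = false;  // true right after a backslash inside a string
        for (int i = 0; i < payload.length(); i++) {
            char c = payload.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    objects.add(payload.substring(start, i + 1));
                }
            }
            // Anything else outside an object ('[', ']', ',', whitespace) is skipped.
        }
        return objects;
    }

    public static void main(String[] args) {
        // Newline-delimited objects ("unwrapped").
        System.out.println(split("{\"a\":1}\n{\"a\":2}").size());
        // Array of objects ("wrapped").
        System.out.println(split("[{\"a\":1},{\"a\":2}]").size());
        // One pretty-printed object whose string value contains braces.
        System.out.println(split("{\n  \"a\": \"}{\"\n}").size());
    }
}
```

Because the splitter never looks at line boundaries, the caller does not have to declare in advance whether the payload is line-delimited.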
   
   To make this work, we might need to have JsonReader extend InputEntityReader directly, instead of TextReader.
   
   @FrankChen021 @jihoonson what do you think?




[GitHub] [druid] jihoonson commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-686676250


   @FrankChen021 thanks!




[GitHub] [druid] jihoonson closed issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
jihoonson closed issue #10259:
URL: https://github.com/apache/druid/issues/10259


   




[GitHub] [druid] gianm edited a comment on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
gianm edited a comment on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-690521098


   We should be able to determine automatically if a Kafka message or a text file is line-delimited JSON or not. Check out the Jackson ObjectMapper.readValues method: it might already do what we need:
   
   ```
        * Sequence can be either wrapped or unwrapped root-level sequence:
        * wrapped means that the elements are enclosed in JSON Array;
        * and unwrapped that elements are directly accessed at main level.
        * Assumption is that iff the first token of the document is
        * <code>START_ARRAY</code>, we have a wrapped sequence; otherwise
        * unwrapped. For wrapped sequences, leading <code>START_ARRAY</code>
        * is skipped, so that for both cases, underlying {@link JsonParser}
        * will point to what is expected to be the first token of the first
        * element.
   ```
   
   It sounds like it could read a JSON array of objects ("wrapped"), newline delimited objects ("unwrapped"), a pretty-printed object, or even concatenated objects.
   
   To make this work, we might need to have JsonReader extend InputEntityReader directly, instead of TextReader.
   
   @FrankChen021 @jihoonson what do you think?




[GitHub] [druid] FrankChen021 commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-690822298


   @gianm Wow, that greatly simplifies the solution to this issue. I will make a new patch based on your suggestion.




[GitHub] [druid] FrankChen021 commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-686190666


   In practice, a single message rarely contains multiple JSON objects. But I think you are right that the implementation should not assume that one message represents one JSON object. I'll take your suggestion and make a patch for this issue.




[GitHub] [druid] gianm edited a comment on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
gianm edited a comment on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-690521098


   We should be able to determine automatically if a Kafka message or a text file is line-delimited JSON or not. Check out the Jackson ObjectMapper.readValues method: it might already do what we need:
   
   ```
        * Sequence can be either wrapped or unwrapped root-level sequence:
        * wrapped means that the elements are enclosed in JSON Array;
        * and unwrapped that elements are directly accessed at main level.
        * Assumption is that iff the first token of the document is
        * <code>START_ARRAY</code>, we have a wrapped sequence; otherwise
        * unwrapped. For wrapped sequences, leading <code>START_ARRAY</code>
        * is skipped, so that for both cases, underlying {@link JsonParser}
        * will point to what is expected to be the first token of the first
        * element.
   ```
   
   It sounds like it could read a JSON array of objects ("wrapped"), newline delimited objects ("unwrapped"), a pretty-printed object, or even concatenated objects.
   
   To make this work, we probably need to have JsonReader extend InputEntityReader directly, instead of TextReader.
   
   @FrankChen021 @jihoonson what do you think?




[GitHub] [druid] FrankChen021 commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-685213851


   @jihoonson Adding this new property to InputFormat is also one way to resolve this bug. I thought about it.
   
   But since InputFormat is given in a taskSpec, it's up to the user to decide its value. For Kafka ingestion, the frontend would have to set its value to false by default. I don't know whether that is better or not.




[GitHub] [druid] jihoonson commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-685279290


   Even in Kafka ingestion, a single message can have multiple JSON objects delimited by line feed. It makes sense to me that users need to set the flag properly when they submit their ingestion task since they are the people who know how their data is formatted. 






[GitHub] [druid] FrankChen021 commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-683518040


   After an attempt to resolve this problem, I found that it's a little tricky to fix the issue in the way described above.
   
   All the InputSources except Kafka assume that each line of input text is a JSON object. Overriding `intermediateRowIterator` in JsonReader to parse the input text as a whole would break this assumption and cause these input sources, such as local text files, to be parsed incorrectly.
   
   To handle these two different needs, another feasible and simple fix is:
   
   1. Add a boolean property, called `lineSplittable` for example, to `InputEntity` to indicate whether the text should be read line by line or as a whole. The default value is true, meaning line by line, because only Kafka records need to be treated as a whole.
   
   2. `ByteEntity`, which inherits from `InputEntity` and is used by the Kafka input source `RecordSupplierInputSource`, provides a constructor that allows higher-level code to pass a value for `lineSplittable`.
   
   3. `createReader` of `JsonInputFormat` checks this property on `InputEntity`. If it's true, it creates an instance of the current `JsonReader` class; if not, it creates a new JsonReader that reads the input text as a whole.
   
   4. `RecordSupplierInputSource` passes `false` to `ByteEntity` to indicate that the JSON input text should be read as a whole rather than line by line.
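
The four steps above could be sketched roughly as follows. This is not Druid code: `SketchEntity` and `SketchReaderFactory` are hypothetical stand-ins for `InputEntity`/`ByteEntity` and for the reader selection in `JsonInputFormat`, reduced to plain strings so that the example runs on its own.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an input entity that carries the proposed flag.
class SketchEntity {
    private final String text;
    private final boolean lineSplittable;

    // Step 1: the flag defaults to true (one record per line).
    SketchEntity(String text) {
        this(text, true);
    }

    // Step 2: a second constructor lets higher-level code override the flag.
    SketchEntity(String text, boolean lineSplittable) {
        this.text = text;
        this.lineSplittable = lineSplittable;
    }

    String text() {
        return text;
    }

    boolean isLineSplittable() {
        return lineSplittable;
    }
}

// Hypothetical stand-in for the reader selection described in step 3.
class SketchReaderFactory {
    // Returns one record per line, or the whole payload as a single record,
    // depending on the entity's flag.
    static List<String> readRecords(SketchEntity entity) {
        if (entity.isLineSplittable()) {
            return Arrays.asList(entity.text().split("\n"));
        }
        return Collections.singletonList(entity.text());
    }

    public static void main(String[] args) {
        String fileText = "{\"a\":1}\n{\"a\":2}";
        String kafkaMessage = "{\n  \"a\": 1\n}";
        // A file-like source keeps the default and gets one record per line.
        System.out.println(readRecords(new SketchEntity(fileText)).size());
        // Step 4: a Kafka-like source passes false and gets the message whole.
        System.out.println(readRecords(new SketchEntity(kafkaMessage, false)).size());
    }
}
```

The sketch also makes the trade-off visible: the source, not the format, decides how the text is split, which couples the entity to a formatting detail.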
   
   @jihoonson Do you have any better ideas?




[GitHub] [druid] jihoonson commented on issue #10259: Kafka ingestion fails to parse multiple-line messages in 0.19

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #10259:
URL: https://github.com/apache/druid/issues/10259#issuecomment-690555205


   Oh, good point. I ran a quick test and it does work. Yeah, `JsonReader` should extend `InputEntityReader` directly.

