Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/03/05 12:45:10 UTC

[GitHub] [druid] harnasz opened a new issue #9460: Issue with CONCAT expression when using Kafka streaming ingestion.

URL: https://github.com/apache/druid/issues/9460
 
 
   We are seeing an issue when using Kafka streaming ingestion with the CONCAT expression: it prepends `["` and appends `"]` to the result. We believe this could be because the affected rows are still in heap memory and have not yet been persisted to segments.
   
   See below for more detail.
   
   ## Affected Version
   
   `0.16` and `0.17`
   
   ## Description
   
   #### Cluster size
   
   A single server, using the quickstart `bin/start-single-server-small`
   
   #### Configurations in use
   
   Using the default configuration located here:
   
   `conf/druid/single-server/small`
   
   However, we are using MySQL for the metadata storage and have enabled globally cached lookups.
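   
   Concretely, these changes amount to roughly the following additions to the `_common/common.runtime.properties` under that config directory (the connection details below are illustrative, not our real credentials):
   
   ```
   # illustrative values; host, database name and credentials differ in our setup
   druid.extensions.loadList=["druid-kafka-indexing-service", "mysql-metadata-storage", "druid-lookups-cached-global"]
   
   druid.metadata.storage.type=mysql
   druid.metadata.storage.connector.connectURI=jdbc:mysql://localhost:3306/druid
   druid.metadata.storage.connector.user=druid
   druid.metadata.storage.connector.password=diurd
   ```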
   
   ## Steps to reproduce the problem
   
   **Setup a Kafka Supervisor**
   
   *Note `"maxRowsPerSegment":3` in `tuningConfig`, we will refer to this later* 
   
   ```
   {
     "type":"kafka",
     "dataSchema":{
       "dataSource":"items",
       "parser":{
         "type":"string",
         "parseSpec":{
           "format":"csv",
           "timestampSpec":{
             "column":"time",
             "format":"iso"
           },
           "columns":[
             "time",
             "currency",
             "value"
           ],
           "dimensionsSpec":{
             "dimensions":[
               "currency"
             ]
           }
         }
       },
       "metricsSpec":[
         {
           "name":"count",
           "type":"count"
         },
         {
           "name":"sum_value",
           "type":"doubleSum",
           "fieldName":"value"
         }
       ],
       "granularitySpec":{
         "type":"uniform",
         "segmentGranularity":"WEEK",
         "queryGranularity":"NONE"
       }
     },
     "tuningConfig":{
       "type":"kafka",
       "maxRowsPerSegment":3
     },
     "ioConfig":{
       "topic":"debugitems",
       "consumerProperties":{
         "bootstrap.servers":"localhost:9092"
       },
       "taskCount":1,
       "replicas":1,
       "taskDuration":"PT1H"
     }
   }
   ```
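   
   Save the spec to a file, e.g. `supervisor-spec.json` (the filename here is arbitrary), and submit it to the router, which proxies it to the Overlord:
   
   ```
   # 8888 is the quickstart default router port; adjust if yours differs
   curl -X POST -H 'Content-Type: application/json' \
     -d @supervisor-spec.json \
     http://localhost:8888/druid/indexer/v1/supervisor
   ```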
   
   
   **Produce the Messages**
   
   Using kafkacat, execute the following *two* commands to produce the messages:
    
   ```
   echo "2020-01-14T11:11:00.000Z,GBP,30.12" | kafkacat -b  127.0.0.1:9092  -t debugitems
   echo "2020-01-15T11:11:00.000Z,EUR,30.12" | kafkacat -b  127.0.0.1:9092  -t debugitems
   ```
   
   Then run the following query:
   
   ```
   SELECT __time,  sum_value, CONCAT(TIME_FORMAT(__time, 'yyyy-MM-dd'), ':', currency) as "concat_expression" FROM items
   ```
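   
   (The same query can also be issued over the SQL HTTP API; with the router on its quickstart default port that is roughly the following:)
   
   ```
   # 8888 is the quickstart default router port
   curl -X POST -H 'Content-Type: application/json' \
     -d "{\"query\":\"SELECT __time, sum_value, CONCAT(TIME_FORMAT(__time, 'yyyy-MM-dd'), ':', currency) AS concat_expression FROM items\"}" \
     http://localhost:8888/druid/v2/sql
   ```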
   
   Either way, you will see the results below:
   
   ```
   (Query 1 Result)
   +--------------------------+-----------+--------------------+
   |          __time          | sum_value | concat_expression  |
   +--------------------------+-----------+--------------------+
   | 2020-01-14T11:11:00.000Z |     30.12 | ["2020-01-14:GBP"] |
   | 2020-01-15T11:11:00.000Z |     30.12 | ["2020-01-15:EUR"] |
   +--------------------------+-----------+--------------------+
   ```
   
   If you then produce another message of:
   
   ```
   echo "2020-01-16T11:11:00.000Z,GBP,30.12" | kafkacat -b  127.0.0.1:9092  -t debugitems
   ```
   
   And then rerun the above query you will see:
   
   ```
   (Query 2 Result)
   +--------------------------+-----------+-------------------+
   |          __time          | sum_value | concat_expression |
   +--------------------------+-----------+-------------------+
   | 2020-01-14T11:11:00.000Z |     30.12 | 2020-01-14:GBP    |
   | 2020-01-15T11:11:00.000Z |     30.12 | 2020-01-15:EUR    |
   | 2020-01-16T11:11:00.000Z |     30.12 | 2020-01-16:GBP    |
   +--------------------------+-----------+-------------------+
   ```
   
   ## The Issue
   
   In the results of Query 1, the values in the `concat_expression` column are wrapped with `["` and `"]`. We tried the LTRIM and RTRIM functions to strip the `["` and `"]`, but they have no effect on the value returned by the expression.
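   
   For example, a trim attempt roughly along these lines (our exact query may have differed slightly) still returns the wrapped value:
   
   ```
   -- illustrative attempt; trims any leading '[' / '"' and trailing '"' / ']'
   SELECT
     __time,
     sum_value,
     RTRIM(LTRIM(CONCAT(TIME_FORMAT(__time, 'yyyy-MM-dd'), ':', currency), '["'), '"]') AS "concat_expression"
   FROM items
   ```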
   
   In the results of Query 2, after 3 rows have been ingested, the values in the `concat_expression` column are *no* longer wrapped with `["` and `"]`.
   
   We believe this is because, while only two rows have been ingested, they are held in heap memory; once the third row arrives the data is persisted to a segment, because `"maxRowsPerSegment"` is set to `3`.
   
   We see no issues with batch ingestion, only with streaming ingestion. We use the result of the CONCAT expression to invoke a lookup, but because the expression result is wrapped in `["` and `"]`, the value cannot be found in the lookup.
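   
   For illustration, the lookup call is along these lines, where `daily_rates` stands in for our real lookup name:
   
   ```
   -- 'daily_rates' is a placeholder for the actual registered lookup
   SELECT
     __time,
     sum_value,
     LOOKUP(CONCAT(TIME_FORMAT(__time, 'yyyy-MM-dd'), ':', currency), 'daily_rates') AS looked_up_value
   FROM items
   ```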
   
   ## Any debugging that you have already done
   
   The only debugging carried out so far has been changing the tuning config values, reducing `intermediatePersistPeriod` or `maxRowsPerSegment` so that the rows are persisted to segments sooner.
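   
   For example, a `tuningConfig` along the lines of the one below (the `PT10S` period is just an illustration) makes the in-memory rows persist much sooner:
   
   ```
   "tuningConfig":{
     "type":"kafka",
     "maxRowsPerSegment":3,
     "intermediatePersistPeriod":"PT10S"
   }
   ```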


[GitHub] [druid] Synforge commented on issue #9460: Issue with result from CONCAT expression when using Kafka streaming ingestion.

URL: https://github.com/apache/druid/issues/9460#issuecomment-605441813
 
 
   I've done a bit of digging on this, and the bug applies to all string dimension columns in the `IncrementalIndexStorageAdapter`. Regardless of whether a multi-value row was ever inserted into a column, this storage adapter reports every string column as multi-value.
   
   E.g. for the example above, while the data has not yet been persisted, a segment metadata query returns this:
   
   ```
   "currency": {
       "cardinality": 2,
       "errorMessage": null,
       "hasMultipleValues": true,
       "maxValue": "GBP",
       "minValue": "EUR",
       "size": 0,
       "type": "STRING"
   }
   ```
   
   The persisted data, on the other hand, correctly returns `hasMultipleValues` as false. This leads to inconsistencies when using any kind of string function against a dimension column that has not yet been persisted versus data that has, so I think this problem is bigger than just the report above.
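   
   The check can be reproduced with a segmentMetadata query roughly like the one below (the interval is illustrative), POSTed to the broker or router at `/druid/v2`:
   
   ```
   {
     "queryType":"segmentMetadata",
     "dataSource":"items",
     "intervals":["2020-01-13/2020-01-20"]
   }
   ```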
   
   I verified this by amending the following to return false, and the query then correctly returns a plain string value instead of an array. However, I'm aware this may break multi-value columns on ingestion?
   
   https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java#L166
   
   Happy to look into this further if anyone can offer advice on how to tackle the problem. I believe @gianm wrote some of this code; I'm hoping you might be able to offer some guidance?
   
   Thanks
   
