Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/10/06 22:57:43 UTC
[GitHub] [druid] sergioferragut commented on issue #13174: Add example for nested columns with streaming

sergioferragut commented on issue #13174:
URL: https://github.com/apache/druid/issues/13174#issuecomment-1270806973

   Just tested this by following the Kafka tutorial, replacing the Wikipedia data with the KTTM nested data.
   Steps:
   
   Create the topic:
   `./bin/kafka-topics.sh --create --topic kttm_nested --bootstrap-server localhost:9092`
   
   Get the nested data from the KTTM nested example:
   ```
   curl https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz -o kttm-nested-data.json.gz
   gunzip -c kttm-nested-data.json.gz > kttm-nested-data.json
   ```
   
   Publish to the topic:
   ```
   export KAFKA_OPTS="-Dfile.encoding=UTF-8"
   ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kttm_nested < kttm-nested-data.json
   ```
   
   The "Load Data" UI does not automatically detect the nested JSON columns in the parsing step.
   In the "Configure Schema" step, you can use "Add dimension", enter the column name, and choose type "json".
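   
   To find out which columns need the "json" type, it can help to look at one record first. A rough sketch, assuming the file is newline-delimited JSON as downloaded above; `list_object_fields` is a hypothetical helper name, and the pattern match is only a heuristic (it also reports objects nested below the top level):
   ```shell
   # Hypothetical helper: print the names of fields whose value is a JSON object.
   # Heuristic only: it matches any "key": { pattern in the text, so it does not
   # distinguish top-level objects from more deeply nested ones.
   list_object_fields() {
     grep -o '"[A-Za-z0-9_]*" *: *{' | sed 's/^"\([^"]*\)".*/\1/' | sort -u
   }
   
   # Inspect the first record of the downloaded file, if it is present.
   if [ -f kttm-nested-data.json ]; then
     head -n 1 kttm-nested-data.json | list_object_fields
   fi
   ```
   Each name it prints is a candidate for a "json" dimension in the "Configure Schema" step.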
   
   The resulting Ingestion Spec:
   ```
   {
     "type": "kafka",
     "spec": {
       "ioConfig": {
         "type": "kafka",
         "consumerProperties": {
           "bootstrap.servers": "localhost:9092"
         },
         "topic": "kttm_nested",
         "inputFormat": {
           "type": "json"
         },
         "useEarliestOffset": true
       },
       "tuningConfig": {
         "type": "kafka"
       },
       "dataSchema": {
         "dataSource": "kttm_nested",
         "timestampSpec": {
           "column": "timestamp",
           "format": "iso"
         },
         "dimensionsSpec": {
           "dimensions": [
             "session",
             "number",
             "client_ip",
             "language",
             "adblock_list",
             "app_version",
             "path",
             "loaded_image",
             "referrer",
             "referrer_host",
             "server_ip",
             "screen",
             "window",
             {
               "type": "long",
               "name": "session_length"
             },
             "timezone",
             "timezone_offset",
             {
               "type": "json",
               "name": "event"
             },
             {
               "type": "json",
               "name": "agent"
             },
             {
               "type": "json",
               "name": "geo_ip"
             }
           ]
         },
         "granularitySpec": {
           "queryGranularity": "none",
           "rollup": false,
           "segmentGranularity": "hour"
         }
       }
     }
   }
   ```
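   
   The spec can probably also be submitted without the UI by posting it to the supervisor API. A sketch, assuming the spec above is saved as `kttm-nested-supervisor.json` (a hypothetical file name) and the Overlord listens on `localhost:8081` as in the quickstart; `submit_supervisor` is a hypothetical helper name, and I have not run this variant:
   ```shell
   # Hypothetical helper: POST a supervisor spec file to the Overlord.
   # $1 = path to the spec file
   # $2 = Overlord base URL (assumed quickstart default if omitted)
   submit_supervisor() {
     spec_file="$1"
     overlord_url="${2:-http://localhost:8081}"
     curl -X POST -H 'Content-Type: application/json' \
       -d @"$spec_file" \
       "$overlord_url/druid/indexer/v1/supervisor"
   }
   
   # Usage (requires a running Druid cluster):
   #   submit_supervisor kttm-nested-supervisor.json
   ```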
   
   @techdocsmith, this example works, but it requires the Kafka setup steps, so I'm not sure it fits in the nested columns docs page as is. Perhaps the Kafka tutorial could be adjusted to use this source instead? Let me know how else I can help.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: commits-help@druid.apache.org