Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/05/15 11:58:01 UTC

[GitHub] [incubator-druid] max-schmidt54321 opened a new issue #7664: Re-indexing Segments that contain thetaSketches

URL: https://github.com/apache/incubator-druid/issues/7664
 
 
   Is it possible to re-index segments and use thetaSketches? 
   Currently I have a datasource that is being ingested with the following supervisor spec (uid_sketch is working properly here):
   
   ```json
       {
         "type": "kafka",
         "dataSchema": {
           "dataSource": "pageview",
           "parser": {
             "type": "avro_stream",
             "avroBytesDecoder": {
               "type": "schema_registry",
               "url": "http://schema-registry:8081"
             },
             "parseSpec": {
               "format": "avro",
               "flattenSpec": {
                 "useFieldDiscovery": "true",
                 "fields": [
                   "articleId",
                   "uid",
                   "some-fields-that-need-to-be-flattened..."
                 ]
               },
               "timestampSpec": {
                 "column": "timestamp",
                 "format": "auto"
               },
               "dimensionsSpec": {
                 "dimensions": [
                   {
                     "type": "long",
                     "name": "articleId"
                   },
                   "some-other-dimensions..."
                 ],
                 "dimensionExclusions": [
                   "uid"
                 ]
               }
             }
           },
           "metricsSpec": [
             {
               "type": "count",
               "name": "count"
             },
             {
               "type": "thetaSketch",
               "name": "uid_sketch",
               "fieldName": "uid",
               "size": 4096
             }
           ],
           "granularitySpec": {
             "type": "uniform",
             "segmentGranularity": "DAY",
             "queryGranularity": "minute",
             "rollup": true,
             "intervals": null
           },
           "transformSpec": {
             "filter": null,
             "transforms": []
           }
         },
         "ioConfig": {
           "topic": "kafka-topic",
           "replicas": 1,
           "taskCount": 1,
           "taskDuration": "PT18000S",
           "consumerProperties": {
             "bootstrap.servers": "kafka:9092"
           }
         }
       }
   ```
   
   
   I am trying to re-index the "pageview" datasource into a new datasource "pageview-reindexed" for a certain interval, keeping only the dimension "articleId" and the metric "uid_sketch", with a queryGranularity of "ten_minute".
   
   
   The re-indexing task looks like this:
   
   ```json
       {
           "type": "index",
           "spec": {
               "dataSchema": {
                   "dataSource": "pageview-reindexed",
                   "parser": {
                       "parseSpec": {
                           "flattenSpec": {
                               "useFieldDiscovery": "true",
                               "fields": [
                                   "articleId",
                                   "uid",
                                   "some-fields-that-need-to-be-flattened..."
                               ]
                           },
                           "timestampSpec": {
                               "column": "timestamp",
                               "format": "auto"
                           },
                          
                           "dimensionsSpec": {
                               "dimensions": [
                                   {
                                       "type": "long",
                                       "name": "articleId"
                                   }
                               ]
                           }
                       }
                   },
                   "metricsSpec": [
                       {
                           "type": "thetaSketch",
                           "name": "uid_sketch",
                           "fieldName": "uid",
                           "size": 4096
                       }
                   ],
                   "granularitySpec": {
                       "type": "uniform",
                       "segmentGranularity": "DAY",
                       "queryGranularity": "ten_minute",
                       "rollup": true,
                       "intervals": null
                   }
               },
               "ioConfig": {
                   "type": "index",
                   "firehose": {
                       "type": "ingestSegment",
                       "dataSource": "pageview",
                       "interval": "2019-05-12/2019-05-15"
                   },
                   "appendToExisting": false
               }
           }
       }
   ```
   
   The indexing task finishes successfully and the dimension "articleId" seems to be ingested properly, but the thetaSketches are "null" for every entry.
   Executing the following query 
   
   ```json
       {
         "queryType": "select",
         "dataSource": "pageview-reindexed",
         "dimensions": [],
         "metrics": [],
         "intervals": [
            "2019-05-13T10:00:00.000Z/P1D"
         ],
         "granularity": "all",
         "pagingSpec": {
           "pagingIdentifiers": {},
           "threshold": 100
         }
       }
   ```
   
   will give results where the metric "uid_sketch" is always null.
   
   ```json
       [
           {
               "timestamp": "2019-05-13T10:00:00.000Z",
               "result": {
                   "pagingIdentifiers": {
                       "pageview-raw-reindexed_2019-05-13T10:00:00.000Z_2019-05-14T00:00:00.000Z_2019-05-15T10:13:47.886Z": 99
                   },
                   "dimensions": [
                       "articleId"
                   ],
                   "metrics": [
                       "uid_sketch"
                   ],
                   "events": [
                       {
                           "segmentId": "pageview-raw-reindexed_2019-05-13T10:00:00.000Z_2019-05-14T00:00:00.000Z_2019-05-15T10:13:47.886Z",
                           "offset": 0,
                           "event": {
                               "timestamp": "2019-05-13T10:00:00.000Z",
                               "articleId": 219,
                               "uid_sketch": null
                           }
                       },
       ...
   ```
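   
   A possible cross-check, sketched under assumptions and not taken from the run above: a select query returns raw column values, so it may not be the most reliable way to inspect a sketch column. A timeseries query with a "thetaSketch" aggregator over "uid_sketch" would report the estimated number of distinct uids instead, and an estimate of 0 would suggest that the sketches really are empty. The output name "unique_uids" is a hypothetical choice; the datasource and interval are the ones used above.
   
   ```json
       {
         "queryType": "timeseries",
         "dataSource": "pageview-reindexed",
         "intervals": [
           "2019-05-13T10:00:00.000Z/P1D"
         ],
         "granularity": "all",
         "aggregations": [
           {
             "type": "thetaSketch",
             "name": "unique_uids",
             "fieldName": "uid_sketch",
             "size": 4096
           }
         ]
       }
   ```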
   
   Am I doing something wrong or is it simply not possible to use thetaSketches when re-indexing?
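   
   One untested variation that may be worth trying: the source segments presumably no longer contain a raw "uid" column, because the supervisor spec lists "uid" under "dimensionExclusions" and stores only the aggregated "uid_sketch" metric. If that is the case, a metric with "fieldName": "uid" would find nothing to aggregate during re-indexing. A minimal sketch of the "metricsSpec" part of the re-indexing task, assuming the aggregator can read the existing sketch column directly:
   
   ```json
       {
         "metricsSpec": [
           {
             "type": "thetaSketch",
             "name": "uid_sketch",
             "fieldName": "uid_sketch",
             "size": 4096
           }
         ]
       }
   ```
   
   Only "fieldName" differs from the task above; the rest of the spec would stay unchanged. Whether this resolves the null values has not been verified here.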
