Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/03/17 02:46:05 UTC

[GitHub] [druid] cswarth opened a new issue #11003: Doc question: Can protobuf extension be used with "index_parallel"?

cswarth opened a new issue #11003:
URL: https://github.com/apache/druid/issues/11003


   The [Protobuf extension documentation](https://druid.apache.org/docs/latest/development/extensions-core/protobuf.html) demonstrates use of the extension to decode Kafka events.
   Can the protobuf extension also be used to decode files, or is it only suitable for streaming input?
   
   I tried making an example "index_parallel" task definition that uses protobuf, but it gets rejected:
   ```
   {"error":"Cannot construct instance of `org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIngestionSpec`,
     problem: Cannot use parser and inputSource together. Try using inputFormat instead of parser.
    at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 77, column: 1] 
   (through reference chain: org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask[\"spec\"])"
   }
   ```
   
   Task definition:
   ```
   curl -v http://localhost:8888/druid/indexer/v1/task -H 'Content-Type: application/json' -d '
   {
     "type": "index_parallel",
     "spec": {
       "ioConfig": {
         "type": "index_parallel",
         "inputSource": {
           "type": "local",
           "filter": "metrics.bin",
           "baseDir": "./"
         }
       },
       "tuningConfig": {
         "type": "index_parallel",
         "partitionsSpec": {
           "type": "dynamic"
         }
       },
       "dataSchema": {
         "dataSource": "metrics",
         "parser": {
           "type": "protobuf",
           "descriptor": "file:///tmp/metrics.desc",
           "protoMessageType": "Metrics",
           "parseSpec": {
             "format": "json",
             "timestampSpec": {
               "column": "timestamp",
               "format": "auto"
             },
             "dimensionsSpec": {
               "dimensions": [
                 "unit",
                 "http_method",
                 "http_code",
                 "page",
                 "metricType",
                 "server"
               ],
               "dimensionExclusions": [
                 "timestamp",
                 "value"
               ]
             }
           }
         },
         "metricsSpec": [
           {
             "name": "count",
             "type": "count"
           },
           {
             "name": "value_sum",
             "fieldName": "value",
             "type": "doubleSum"
           },
           {
             "name": "value_min",
             "fieldName": "value",
             "type": "doubleMin"
           },
           {
             "name": "value_max",
             "fieldName": "value",
             "type": "doubleMax"
           }
         ],
         "granularitySpec": {
           "type": "uniform",
           "segmentGranularity": "HOUR",
           "queryGranularity": "NONE"
         }
       }
     }
   }
   '
   ```
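   
   For reference, the spec shape the error message points at pairs the `inputSource` with an `inputFormat` inside `ioConfig` and drops `parser` entirely. A minimal sketch of that shape, shown here with the built-in `json` format purely for illustration (whether a protobuf `inputFormat` type exists is exactly the question above):
   ```
   "ioConfig": {
     "type": "index_parallel",
     "inputSource": {
       "type": "local",
       "baseDir": "./",
       "filter": "metrics.bin"
     },
     "inputFormat": {
       "type": "json"
     }
   }
   ```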



[GitHub] [druid] clintropolis commented on issue #11003: Doc question: Can protobuf extension be used with "index_parallel"?

Posted by GitBox <gi...@apache.org>.
clintropolis commented on issue #11003:
URL: https://github.com/apache/druid/issues/11003#issuecomment-800791722


   I am not familiar enough with protobuf-encoded files to know if this will work, but the error you are seeing comes from trying to use `inputSource` with a parser. To avoid it you need to use the older 'parser'-based ingestion spec; see https://druid.apache.org/docs/latest/ingestion/index.html#parser-deprecated. (Parser-based specs take no `inputSource` or `inputFormat`; "firehoses" are used in place of the input source, see https://druid.apache.org/docs/latest/ingestion/native-batch.html#firehoses-deprecated.)
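   
   A rough sketch of that older shape, assuming the simple `index` task type and the `local` firehose from the docs linked above (untested, so treat it as a starting point rather than a known-good spec):
   ```
   {
     "type": "index",
     "spec": {
       "ioConfig": {
         "type": "index",
         "firehose": {
           "type": "local",
           "baseDir": "./",
           "filter": "metrics.bin"
         }
       },
       "dataSchema": { ...same dataSource, protobuf parser, metricsSpec, and granularitySpec as in the original task... }
     }
   }
   ```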
   
   The protobuf parser depends on getting byte chunks of encoded proto messages, so any file reader would need to pull individual binary message blobs out of the underlying file to feed to the parser, which is the part that makes me unsure that protobuf files would work correctly with batch. For example, the CSV parser is fed single lines from an underlying text file, where each line is expected to be a CSV row. A protobuf file parser would need to do something similar with the binary message blobs in the file, and I'm not sure that just having the message schema is enough for that to work.
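   
   To make the framing problem concrete: a bare concatenation of encoded proto messages is not self-delimiting, so a file reader needs some convention to recover message boundaries. A minimal sketch using protobuf-java's length-delimited helpers, where `Metrics` is a hypothetical generated class standing in for the message type in the spec above:
   ```
   import java.io.FileInputStream;
   import java.io.FileOutputStream;
   import java.io.InputStream;
   import java.io.OutputStream;

   public class DelimitedProtoFile {
     public static void main(String[] args) throws Exception {
       // Write: each message is prefixed with its varint-encoded length, so a
       // reader can find the boundary between one message and the next.
       // Metrics is assumed to be a protobuf-generated class with these fields.
       try (OutputStream out = new FileOutputStream("metrics.bin")) {
         Metrics.newBuilder().setUnit("ms").setValue(42.0).build().writeDelimitedTo(out);
         Metrics.newBuilder().setUnit("ms").setValue(7.5).build().writeDelimitedTo(out);
       }

       // Read: parseDelimitedFrom consumes one length-prefixed message at a
       // time and returns null at end of stream. Each blob recovered this way
       // is the kind of byte chunk the protobuf parser expects to be handed.
       try (InputStream in = new FileInputStream("metrics.bin")) {
         Metrics m;
         while ((m = Metrics.parseDelimitedFrom(in)) != null) {
           System.out.println(m);
         }
       }
     }
   }
   ```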
   
   I think even if a protobuf `inputFormat` did exist, it might still have this issue: a file-based protobuf decoder might need to be a specialized `InputFormat` implementation, separate from a format that processes individual streaming messages (again, I'm not very familiar with the file side of things).
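   
   Purely as illustration, if the existing parser were ported, such an `inputFormat` might carry over the same fields; this is hypothetical (no protobuf `inputFormat` ships today), not a working config:
   ```
   "inputFormat": {
     "type": "protobuf",
     "descriptor": "file:///tmp/metrics.desc",
     "protoMessageType": "Metrics"
   }
   ```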



[GitHub] [druid] cswarth commented on issue #11003: Doc question: Can protobuf extension be used with "index_parallel"?

Posted by GitBox <gi...@apache.org>.
cswarth commented on issue #11003:
URL: https://github.com/apache/druid/issues/11003#issuecomment-800780744


   I suspect the answer is "No, protobuf cannot be used to parse file contents", based on this comment:
   https://github.com/apache/druid/pull/10839#issuecomment-790505170
   
   > If you are interested in doing a follow-up PR or two, the protobuf extension is one of the few ingestion formats that has not been ported to the newer InputFormat interface which has replaced parsers. There are a handful of implementations that could be used as reference (JSON, Avro though it is missing an InputFormat for streaming support and using schema registry itself, and many others). I think this should be relatively straight-forward with the protobuf extension since most of the work is being done in the decoder implementations, which could also be re-used by the InputFormat implementation.
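   
   To sketch what that port might look like: a very rough Java skeleton against the `InputFormat` interface shape in the current codebase (two methods, `isSplittable` and `createReader`; treat the exact signatures as assumptions, and the decoder interface below is a stand-in for the extension's existing decoder classes mentioned in the quote):
   ```
   import java.io.File;
   import java.util.Map;

   import org.apache.druid.data.input.InputEntity;
   import org.apache.druid.data.input.InputEntityReader;
   import org.apache.druid.data.input.InputFormat;
   import org.apache.druid.data.input.InputRowSchema;

   public class ProtobufInputFormat implements InputFormat
   {
     // Stand-in for the existing decoders: bytes of one message in, flat map out.
     interface BytesDecoder
     {
       Map<String, Object> decode(byte[] messageBytes);
     }

     private final BytesDecoder decoder;

     public ProtobufInputFormat(BytesDecoder decoder)
     {
       this.decoder = decoder;
     }

     @Override
     public boolean isSplittable()
     {
       // A raw stream of proto messages has no safe split points.
       return false;
     }

     @Override
     public InputEntityReader createReader(InputRowSchema inputRowSchema, InputEntity source, File temporaryDirectory)
     {
       // A real implementation would return a reader that frames message blobs
       // out of the entity's stream and feeds each blob to the decoder; that
       // framing step is the open question discussed earlier in the thread.
       throw new UnsupportedOperationException("reader not sketched");
     }
   }
   ```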
   
   
   


