Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/03/05 02:00:16 UTC

Apache Pinot Daily Email Digest (2021-03-04)

### _#general_

  
 **@amommendes:** @amommendes has joined the channel  

###  _#random_

  
 **@amommendes:** @amommendes has joined the channel  

###  _#troubleshooting_

  
 **@mohammedgalalen056:** Hi, I faced this error when trying to do batch ingestion from the local file system: `Failed to generate Pinot segment for file - file:data/orders.csv`, `java.lang.NumberFormatException: For input string: "2019-05-02 17:49:53"`. Here is the dateTimeFieldSpecs section of the schema file:
```
"dateTimeFieldSpecs": [
  { "dataType": "STRING", "name": "start_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
  { "dataType": "STRING", "name": "end_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
  { "dataType": "STRING", "name": "created_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
  { "dataType": "STRING", "name": "updated_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }
]
```
**@ken:** What’s the full schema? Looks like you’ve got a numeric (metrics or
dimensions) field, but the data in your input file is a date.  
**@mohammedgalalen056:**
```
{
  "schemaName": "orders",
  "metricFieldSpecs": [
    { "dataType": "DOUBLE", "name": "total" },
    { "dataType": "FLOAT", "name": "percentage" }
  ],
  "dimensionFieldSpecs": [
    { "dataType": "INT", "name": "id" },
    { "dataType": "STRING", "name": "user_id" },
    { "dataType": "STRING", "name": "worker_id" },
    { "dataType": "INT", "name": "job_id" },
    { "dataType": "DOUBLE", "name": "lat" },
    { "dataType": "DOUBLE", "name": "lng" },
    { "dataType": "INT", "name": "work_place" },
    { "dataType": "STRING", "name": "note" },
    { "dataType": "STRING", "name": "address" },
    { "dataType": "STRING", "name": "canceled_by" },
    { "dataType": "INT", "name": "status" },
    { "dataType": "STRING", "name": "canceled_message" }
  ],
  "dateTimeFieldSpecs": [
    { "dataType": "STRING", "name": "start_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
    { "dataType": "STRING", "name": "end_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
    { "dataType": "STRING", "name": "created_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
    { "dataType": "STRING", "name": "updated_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }
  ]
}
```
**@ken:** I’d take a few rows of your input data and dump them into Excel, to
confirm that the order and number of columns match what you’ve defined in your
schema.  
**@mohammedgalalen056:** I've fixed the error; the raw data was corrupted  
 **@fabricio.dutra87:** Hi all, I'm trying to ingest data from Kafka using a topic that doesn't have a datetime column, and I'm receiving this error:
```
{"code":400,"error":"Schema should not be null for REALTIME table"}
```
I'm using this spec:
```
curl -X POST "" -H "accept: application/json" -H "Content-Type: application/json" -d '{
  "tableName": "realtime_strimzi_dev_acks",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "segmentPushType": "REFRESH",
    "schemaName": "sch_strimzi_acks",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": [ "column1" ],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "producer-test-strimzi-dev-acks-0",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "edh-kafka-brokers.ingestion.svc.Cluster.local:9092",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": { "customConfigs": {} }
}'
```
Is there a way to create a realtime table that autofills/creates a datetime column?  
**@g.kishore:** did you upload the schema first?  
**@fabricio.dutra87:** yes, but I had the same error message  
**@npawar:** Can you paste the schema here?  
**@fabricio.dutra87:** I'm not including a timeFieldSpec, as I don't have one on
my Kafka topic. So it would be nice if there was a way to autofill a datetime
column in Pinot. Here's the schema:
```
{
  "schemaName": "sch_strimzi_ack",
  "dimensionFieldSpecs": [
    { "name": "column1", "dataType": "STRING" }
  ]
}
```
**@chinmay.cerebro:** Auto-creating a timestamp column is not supported as of
now. Do you have any column in Kafka that we can derive a timestamp from?  
**@g.kishore:** You can probably use the now() UDF  
**@fabricio.dutra87:** Hmm, ok. We'll try to implement the workaround by
including the datetime column in that topic, then. Thanks guys!!  
**@npawar:** also, it's failing in the first place because the schema name
doesn't match what you’ve put in the table config  
**@npawar:** ```sch_strimzi_ack``` vs ```"schemaName": "sch_strimzi_acks"``` (plural)  
**@npawar:** hence the schema not found exception  
**@npawar:** we can make that exception clearer. Do you mind creating an issue
on github?  
**@fabricio.dutra87:** thanks Neha, the error was clearer when I fixed the
name: ```{"code":400,"error":"'timeColumnName' cannot be null in REALTIME
table config"}```  
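For reference, here is a minimal sketch of the workaround discussed in this thread, with both fixes applied: the schema (note the plural `sch_strimzi_acks`, matching the table config) gains an epoch-millis datetime column, and the table config points `timeColumnName` at it and fills it at consumption time with the `now()` function suggested above. The column name `ingestion_ts` is illustrative, and wiring `now()` through `ingestionConfig.transformConfigs` is an assumption about how to apply that suggestion, not a config verified in the thread.
```
{
  "schemaName": "sch_strimzi_acks",
  "dimensionFieldSpecs": [
    { "name": "column1", "dataType": "STRING" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "ingestion_ts", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }
  ]
}
```
The corresponding table-config fragments (all other settings as in the curl spec above):
```
{
  "segmentsConfig": {
    "timeColumnName": "ingestion_ts",
    "schemaName": "sch_strimzi_acks",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "ingestionConfig": {
    "transformConfigs": [
      { "columnName": "ingestion_ts", "transformFunction": "now()" }
    ]
  }
}
```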
 **@falexvr:** Hey guys, for some reason every query I send to Pinot returns at
most 10 records; it only brings back more than 10 if I specify a limit. Is there
something I have to do to get the full set of records?  
**@g.kishore:** yes, the default limit is 10  
**@g.kishore:** you can specify limit 1000 to get more records  
**@g.kishore:** or 10000  
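For example (table name illustrative), the first query below returns at most 10 rows under the default limit, while the second returns up to 1000:
```
SELECT * FROM orders
SELECT * FROM orders LIMIT 1000
```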
 **@amommendes:** @amommendes has joined the channel  

###  _#aggregators_

  
 **@ita.pai:** @ita.pai has joined the channel  
 **@ita.pai:** @ita.pai has left the channel  

###  _#pinot-dev_

  
 **@ken:** Currently `DistinctCountHLL` only works for single-value fields. It
seems like a simple change in
`DistinctCountHLLAggregationFunction.aggregate()` to check whether the
`BlockValSet` is multi-valued, and if so call `BlockValSet.getXXXMV()`
and do a sub-iteration on the secondary array it returns. Does that make
sense?  
**@g.kishore:** Surprised that it’s not supported as of now  
**@ken:** If you run this query on an MV field, you get:
```
"message": "QueryExecutionError:
java.lang.UnsupportedOperationException
    at org.apache.pinot.core.segment.index.readers.ForwardIndexReader.readDictIds(ForwardIndexReader.java:84)
    at org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readStringValues(DataFetcher.java:439)
    at org.apache.pinot.core.common.DataFetcher.fetchStringValues(DataFetcher.java:146)
    at org.apache.pinot.core.common.DataBlockCache.getStringValuesForSVColumn(DataBlockCache.java:194)
    at org.apache.pinot.core.operator.docvalsets.ProjectionBlockValSet.getStringValuesSV(ProjectionBlockValSet.java:94)
    at org.apache.pinot.core.query.aggregation.function.DistinctCountHLLAggregationFunction.aggregate(DistinctCountHLLAggregationFunction.java:103)
    at org.apache.pinot.core.query.aggregation.DefaultAggregationExecutor.aggregate(DefaultAggregationExecutor.java:47)
    at org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:66)
    at org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:35)
    at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)
    at org.apache.pinot.core.operator.combine.BaseCombineOperator$1.runJob(BaseCombineOperator.java:94)
    at org.apache.pinot.core.util.trace.TraceRunnable.run(TraceRunnable.java:40)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)"
```
**@ken:** I’ll file an issue and generate a PR  
**@mayanks:** @ken can you try `distinctCountHLLMV`?  
**@mayanks:** Aggregation functions on MV columns have an `MV` suffix in the
name.  
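For example, with a hypothetical multi-value column `tags`, the MV variant would be used like this, while `distinctCountHLL` remains the function for single-value columns:
```
SELECT distinctCountHLLMV(tags) FROM myTable
```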
**@ken:** @mayanks Thanks for clarifying, I was confused by seeing
`aggregate`, `aggregateGroupBySV`, and `aggregateGroupByMV`. Made me think
there was a missing `aggregateMV` function. I see now that the `BySV` and
`ByMV` methods are for doing aggregations when the grouping column is SV vs.
MV.  
**@mayanks:** :+1:  
**@ken:** @mayanks But why does there need to be a different function? In the
implementations the function signatures are the same, and (I assume) the
`BlockValSet` could be used to determine whether to handle it as an SV or an
MV column.  
**@mayanks:** Yeah, in the future we might merge the two.  
**@ken:** OK, I’ll change my issue description :slightly_smiling_face:  
**@mayanks:** sounds good  

###  _#community_

  
 **@amommendes:** @amommendes has joined the channel  

###  _#announcements_

  
 **@amommendes:** @amommendes has joined the channel  

###  _#getting-started_

  
 **@phuchdh:** @phuchdh has joined the channel  