Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/05/14 02:00:17 UTC

Apache Pinot Daily Email Digest (2021-05-13)

### _#general_

  
 **@humengyuk18:** How do I cast a string value to int or long using SQL in Pinot?  
**@jackie.jxt:** Are you planning to aggregate on it? If so, it should be
auto-casted  
**@jackie.jxt:** E.g. `select sum(numericStringCol) from myTable`  
**@humengyuk18:** I see, but this is a multi-value column. I got it working by using a Groovy UDF.  
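For reference, a minimal sketch of this kind of cast expressed as a Groovy ingestion transform in the table config (the column names `valuesAsLong` and `valuesAsString` are hypothetical, and this is only one way to do it):
```{ "ingestionConfig": { "transformConfigs": [ { "columnName": "valuesAsLong", "transformFunction": "Groovy({valuesAsString*.toLong()}, valuesAsString)" } ] } }```  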
 **@patidar.rahul8392:** ```{ "ID":"I", "ct_ts":"2021-05-12 08:13:09.000190", "time_ts":"2021-05-12T08:13:16.152000", "Pid":"111111", "Task":{ "STARTDATETIME":"20210512081308", "Id1":"111111111111", "Id2":null } }```  
 **@patidar.rahul8392:** Hi everyone, kindly suggest how to write the JSON schema for this sample data. I am able to see the data for the first 3 columns in Pinot, but cannot see any data for STARTDATETIME, Id1 and Id2.  
 **@patidar.rahul8392:** Kindly suggest how to access data for the remaining 3
fields.  
 **@patidar.rahul8392:** Hi, I am following @npawar's tutorial and trying to build my JSON; it works fine if we have data like this, but in my data all of the remaining fields are nested inside one Task object. Kindly suggest.  
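For data shaped like the sample above, one sketch is to flatten the nested `Task` object with JSONPath transform functions in the table's ingestionConfig (the Pinot column names `startDateTime`, `id1` and `id2` here are hypothetical; the top-level fields that already show up need no transform):
```{ "ingestionConfig": { "transformConfigs": [ { "columnName": "startDateTime", "transformFunction": "jsonPathString(Task, '$.STARTDATETIME')" }, { "columnName": "id1", "transformFunction": "jsonPathString(Task, '$.Id1')" }, { "columnName": "id2", "transformFunction": "jsonPathString(Task, '$.Id2')" } ] } }```  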
 **@kmvb.tau:** Does Pinot support partitioning only for realtime tables? For the offline table, all partition data is written to the same segment file. Segment metadata.properties: `column.RELATEDID.partitionFunction = Murmur` `column.RELATEDID.numPartitions = 10` `column.RELATEDID.partitionValues = 0,1,2,3,4,5,6,7,8,9` Note: running data ingestion using pinot-admin.sh LaunchDataIngestionJob  
**@mayanks:** For offline, Pinot expects the data to be already partitioned. In the future we will add the capability for Pinot to sort and partition the data during offline ingestion.  
**@kmvb.tau:** Does Pinot support partitioning on multiple columns?  
**@mayanks:** No  
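For reference, single-column partitioning like the metadata above corresponds roughly to this block in the table config (a sketch using the partition function and count shown; adjust the column name as needed):
```{ "tableIndexConfig": { "segmentPartitionConfig": { "columnPartitionMap": { "RELATEDID": { "functionName": "Murmur", "numPartitions": 10 } } } } }```  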
 **@riteshrathi:** @riteshrathi has joined the channel  
 **@humengyuk18:** Why is the datetimeconvert transform function much slower than time_floor in Druid? We are migrating our Druid table to Pinot, but found that the datetime transform and granularity are very slow compared to Druid.  
**@g.kishore:** what is the conversion? simple date format to epoch?  
**@humengyuk18:** Epoch to epoch with 15 minutes granularity.  
**@g.kishore:** that should be fast, what's the query?  
**@humengyuk18:** ```SELECT DATETIMECONVERT(datetimeconvert(__time,
'1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES'),
'1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES'), COUNT(*) FROM
product_log WHERE datetimeconvert(__time, '1:MILLISECONDS:EPOCH',
'1:MILLISECONDS:EPOCH', '30:MINUTES') >= 1620830640000 AND
datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH',
'30:MINUTES') < 1620917040000 AND method =
'DeviceInternalService.CheckDeviceInSameGroup' AND container_name = 'whale-
device' AND error > '0' GROUP BY DATETIMECONVERT(datetimeconvert(__time,
'1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES'),
'1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') ORDER BY COUNT(*)
DESC```  
**@humengyuk18:** ```SELECT datetimeconvert(__time, '1:MILLISECONDS:EPOCH',
'1:MILLISECONDS:EPOCH', '30:MINUTES'), COUNT(*) FROM product_log WHERE
datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH',
'30:MINUTES') >= 1620830760000 AND datetimeconvert(__time,
'1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') < 1620917160000
AND method = 'DeviceInternalService.CheckDeviceInSameGroup' AND container_name
= 'whale-device' AND error > '0' GROUP BY datetimeconvert(__time,
'1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') ORDER BY
COUNT(*) DESC``` The cost is similar if I deselect the 1 minute time grain in
Superset.  
**@humengyuk18:** Druid returned the result in under 1 second, but Pinot took 5-6 seconds.  
**@g.kishore:** can you paste the response  
**@g.kishore:** I think segment pruning is not happening,  
**@humengyuk18:** ```{ "resultTable": { "dataSchema": { "columnDataTypes": [
"LONG", "LONG" ], "columnNames": [
"datetimeconvert(__time,'1:MILLISECONDS:EPOCH','1:MILLISECONDS:EPOCH','30:MINUTES')",
"count(*)" ] }, "rows": [ [ 1620873000000, 180 ], [ 1620869400000, 179 ], [
1620871200000, 178 ], [ 1620894600000, 172 ], [ 1620892800000, 166 ], [
1620874800000, 164 ], [ 1620876600000, 163 ], [ 1620896400000, 163 ], [
1620867600000, 162 ], [ 1620885600000, 161 ] ] }, "exceptions": [],
"numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 41,
"numSegmentsProcessed": 41, "numSegmentsMatched": 12,
"numConsumingSegmentsQueried": 3, "numDocsScanned": 7706,
"numEntriesScannedInFilter": 195554753, "numEntriesScannedPostFilter": 7706,
"numGroupsLimitReached": false, "totalDocs": 165272282, "timeUsedMs": 2335,
"segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs":
1620917392724 }```  
**@g.kishore:** ```"numSegmentsProcessed": 41, "numSegmentsMatched": 12,
"numConsumingSegmentsQueried": 3, "numDocsScanned": 7706,
"numEntriesScannedInFilter": 195554753, "numEntriesScannedPostFilter":
7706,```  
**@g.kishore:** this should be processed in milliseconds  
**@g.kishore:** @jackie.jxt looks like if the UDF is in the predicate, the pruning is not efficient  
**@g.kishore:** @humengyuk18 can you file an issue  
**@humengyuk18:** ok, I will file an issue for this.  
**@g.kishore:** ```SELECT datetimeconvert(__time, '1:MILLISECONDS:EPOCH',
'1:MILLISECONDS:EPOCH', '30:MINUTES'), COUNT(*) FROM product_log WHERE __time
>= 1620830760000 AND __time < 1620917160000 AND method =
'DeviceInternalService.CheckDeviceInSameGroup' AND container_name = 'whale-
device' AND error > '0' GROUP BY datetimeconvert(__time,
'1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') ORDER BY
COUNT(*) DESC```  
**@g.kishore:** what's the time for this query?  
**@humengyuk18:** ```{ "resultTable": { "dataSchema": { "columnDataTypes": [
"LONG", "LONG" ], "columnNames": [
"datetimeconvert(__time,'1:MILLISECONDS:EPOCH','1:MILLISECONDS:EPOCH','30:MINUTES')",
"count(*)" ] }, "rows": [ [ 1620873000000, 180 ], [ 1620869400000, 179 ], [
1620871200000, 178 ], [ 1620894600000, 172 ], [ 1620892800000, 166 ], [
1620874800000, 164 ], [ 1620876600000, 163 ], [ 1620896400000, 163 ], [
1620867600000, 162 ], [ 1620865800000, 161 ] ] }, "exceptions": [],
"numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 41,
"numSegmentsProcessed": 12, "numSegmentsMatched": 12,
"numConsumingSegmentsQueried": 3, "numDocsScanned": 7770,
"numEntriesScannedInFilter": 68503679, "numEntriesScannedPostFilter": 7770,
"numGroupsLimitReached": false, "totalDocs": 165381107, "timeUsedMs": 647,
"segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs":
1620917833431 }``` This one only takes 647ms to finish.  
 **@kelvin:** @kelvin has joined the channel  
 **@pedro.cls93:** Hello, if I update an existing realtime table with a new transformConfig, is the transformed field automatically computed for existing segments, or is there some endpoint I need to call to force that computation? The same question, but for altering an existing transformConfig.  
**@mayanks:** Afaik auto recompute of transforms is not supported.  
**@pedro.cls93:** So I need to recreate the table to force the update?  
**@mayanks:** Yeah. @npawar can you also confirm  
**@npawar:** You have to call reload segments api  
**@mayanks:** @npawar Is that supported for realtime also?  
**@npawar:** It should work same for the completed segments  
**@mayanks:** I see. I stand corrected then.  
**@mayanks:** @pedro.cls93  
**@npawar:** Note that it will only work if the arguments to the new transform
function are already present  
**@mayanks:** Could we document this @npawar  
**@pedro.cls93:** Perfect, sounds just like my use-case thank you both!  
 **@pedro.cls93:** What are the performance implications of defining a dimension field in a schema as a string with a max length of Integer.MaxValue (2GB if all space is fully used)?  
**@pedro.cls93:** This field is stringified JSON, usually between 20 chars and 4000 chars (~20% of the time). What is the impact of JSON indexing it?  
**@mayanks:** Will there be filtering on this column? If so, may be better to
use JSON if it is indeed JSON.  
**@pedro.cls93:** There could potentially be some filtering, yes, though right
now the use-case is simply to extract fields within this json field. What do
you mean by "use JSON if it is indeed JSON"?  
**@mayanks:** Use JSON indexing  
**@mayanks:**  
**@pedro.cls93:** I had it initially, but by default Pinot's strings are set to 512 chars in length. For the index to truly be useful I need to set this field's length to Integer.MaxValue (2^31-1 chars). I'm trying to understand if having fields this large should be a concern.  
**@mayanks:** What's the max size?  
**@mayanks:** Raw strings don't have an inv index at the moment, which is why
I suggested JSON indexing  
**@pedro.cls93:** 2³¹ -1, realistically I think the largest I've seen was
around ~20k chars  
**@mayanks:** Yeah, it will have a perf impact. Do you really need the entire 2GB JSON blob? Or are there specific fields you can extract out of it?  
**@pedro.cls93:** Can you quantify (very roughly) what the perf impact would
be?  
**@pedro.cls93:** The issue is that the JSON blob is dynamic; there is no fixed schema. In some cases we have text translations in the payload, which can be quite large.  
**@mayanks:** I don't have any concrete numbers for that. I'll have to see the
overall data size, and query  
**@mayanks:** Do you need those text translations?  
**@mayanks:** If not, maybe filter them out.  
**@pedro.cls93:** My consumers do, yes.  
**@mayanks:** I see  
**@mayanks:** What's your overall latency SLA requirement?  
**@pedro.cls93:** < 1 second for queries which access json blob. <100ms for
top-level concrete fields.  
**@mayanks:** Ok, may have to do some quick test for the largest json blob.  
**@pedro.cls93:** Preliminary results tell me the largest json blob is 40500
chars long.  
**@mayanks:** I meant like actually querying  
**@pedro.cls93:** I am, it just takes a while to get long-term historical
data.  
**@mayanks:** I'd say just do one day maybe? Then we can extrapolate?  
**@pedro.cls93:** Largest in the past month was 40k  
**@mayanks:** Of course, with actual data, it will be more accurate.  
**@pedro.cls93:** Unfortunately I cannot share the actual data; it may contain PII.  
**@mayanks:** Oh I was suggesting for you to run it  
**@pedro.cls93:** Ah I see now what you meant, I'll try that tomorrow
probably.  
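For reference, the JSON-indexing setup discussed above looks roughly like this sketch (`payload` is a hypothetical column name). In the schema, the dimension field's `maxLength` is raised:
```{ "dimensionFieldSpecs": [ { "name": "payload", "dataType": "STRING", "maxLength": 2147483647 } ] }```
and in the table config the column is listed for JSON indexing:
```{ "tableIndexConfig": { "jsonIndexColumns": [ "payload" ] } }```  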
 **@kr.vikas:** @kr.vikas has joined the channel  
 **@avasudevan:** I have added a few new columns to the Pinot table and Pinot schema; in order for the new columns to be populated I did `Reload All Segments` for the table in the UI. 2 questions here: • I see `Reload All Segments` re-indexes the data; is this the right approach to re-populate new columns? • I don't see the progress of the `Reload All Segments`. I see this PR is completed -  which release is this part of?  
**@mayanks:** Yes, reload all segments is the right approach  
**@mayanks:** The PR  should be part of 0.7.1 (latest)  
**@avasudevan:** I see `GET /segments/{tableName}/reload-status` as the API endpoint in the PR documentation, but I don't see that endpoint in my Swagger. Could you help me with the right endpoint to call to get the status of `Reload All Segments`?  
**@avasudevan:** @mayanks ^^  
**@mayanks:** Hmm, I only see `segments/{tableName}/metadata` to fetch all segment metadata. I didn't find the reload status in the PR.  

###  _#random_

  
 **@riteshrathi:** @riteshrathi has joined the channel  
 **@kelvin:** @kelvin has joined the channel  
 **@kr.vikas:** @kr.vikas has joined the channel  

###  _#troubleshooting_

  
 **@riteshrathi:** @riteshrathi has joined the channel  
 **@kelvin:** @kelvin has joined the channel  
 **@kr.vikas:** @kr.vikas has joined the channel  

###  _#docs_

  
 **@keweishang:** @keweishang has joined the channel  
 **@keweishang:** @keweishang has left the channel  

###  _#pinot-dev_

  
 **@keweishang:** @keweishang has joined the channel  