Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/04/25 02:00:26 UTC

Apache Pinot Daily Email Digest (2022-04-24)

### _#general_

  
 **@fding:** added an integration to this channel:  
**@nutbiz:** @nutbiz has joined the channel  
 **@g.kishore:** removed an integration from this channel:  
**@aroopganguly:** @aroopganguly has joined the channel  

###  _#random_

  
 **@nutbiz:** @nutbiz has joined the channel  
 **@aroopganguly:** @aroopganguly has joined the channel  

###  _#troubleshooting_

  
 **@diogo.baeder:** Quick question: is it possible to run a query on a certain
table, but with an IN_SUBQUERY where the subquery is run against another
table?  
**@mayanks:** Not at the moment, no  
**@diogo.baeder:** Thanks. Are there any plans to support it though?  
**@mayanks:** Likely yes, but no concrete timeline at the moment. This is
venturing into full support for nested queries and joins  
**@diogo.baeder:** Got it, makes sense. Thanks man! :slightly_smiling_face:  
**@g.kishore:** @diogo.baeder can you please try the query? I don't remember
restricting IN_SUBQUERY to the same table.  
**@mayanks:** Yes, please try it out, @diogo.baeder - I might have missed
something here.  
**@diogo.baeder:** Ah, nice, I'll try it out then :slightly_smiling_face:  
**@diogo.baeder:** Well my friends it just so happens that... YES, it does
support subqueries to different tables!!! Yay!!! \o/  
**@diogo.baeder:** For reference, this is my query:
```
select weight, caseid as weights_caseid
from weights
where in_subquery(caseid, 'select id_set(caseid) from brands_metrics where brand_id = 1000226 and date_ > 20150501 and date_ < 20150515 limit 1000000') = 1
  and in_subquery(caseid, 'select id_set(caseid) from filters where name = ''gender'' and values_ = 2 and date_ > 20150501 and date_ < 20150515 limit 1000000') = 1
  and date_ > 20150501
  and date_ < 20150515
limit 1000000
```
and I get results! Guys, this is amazing news: the fact that this works means
that I don't even need to use Trino or anything like that, because I can just
do the aggregation of the data in the Pinot Broker itself! And it runs much
faster than in Trino, actually :slightly_smiling_face:  
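The general shape of that pattern, with placeholder table and column names
(not actual tables from this setup): the inner query builds an ID_SET over the
join key on one table, and the outer query keeps only the rows whose key is a
member of that set.
```
select col_a
from table_x
where in_subquery(
        join_key,
        'select id_set(join_key) from table_y where some_filter = 1'
      ) = 1
```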
**@mayanks:** Thanks for confirming @diogo.baeder, we will add this to the
doc. cc: @mark.needham  
**@diogo.baeder:** Dude, I am *SO* happy with this... it's a game-changer for
me, really! :heart: Next up: I'll change our structure and do some more
complex stuff with Groovy in order to have some calculations done over JSON
columns. And let me say this again: Pinot is ridiculously fast, OMG!!!  
**@diogo.baeder:** By the way: the query above ran in 28ms on my computer,
with 1 year's worth of data, where each segment contains data for 1 day only
- in total I have 202 segments (only the days where we actually have data). If
I were doing monthly merge roll-ups, this would be even faster. I can already
see this awesome database becoming our next data source for our main data
backend! :slightly_smiling_face:  
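On the roll-up idea above: Pinot's Minion-based MergeRollupTask can handle
that kind of monthly merge. A rough sketch of the table-config fragment,
assuming a Minion is running; the level name and the bucket/buffer periods
below are illustrative values, not tested against this cluster:
```
// Illustrative fragment: goes in the top-level "task" section of the table config.
"task": {
  "taskTypeConfigsMap": {
    "MergeRollupTask": {
      "1month.mergeType": "rollup",
      "1month.bucketTimePeriod": "30d",
      "1month.bufferTimePeriod": "30d"
    }
  }
}
```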
**@mayanks:** Glad you are able to use Pinot to solve your use cases.  
**@diogo.baeder:** Yep. We already use it for internal analytics, which is
important but not mission-critical. Now this experiment I'm doing is to serve
our main product, so if everything works, we'll put up another cluster, but a
much more robust one, with more and beefier nodes.  
 **@nutbiz:** @nutbiz has joined the channel  
 **@diogo.baeder:** Me again. New issue, not sure if it's me or if it's a
genuine bug: I'm trying to ingest JSON data for one of my columns, but I keep
getting an error for that column: `Cannot read single-value from Collection:`.
More on this thread.  
**@diogo.baeder:** This is a part of the exception, which I believe should
contain the necessary info to investigate it:
```
Caused by: java.lang.IllegalStateException: Cannot read single-value from Collection: [1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 9, 1, 1] for column: brands_responses
    at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:721) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.standardizeCollection(DataTypeTransformer.java:176) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.standardize(DataTypeTransformer.java:119) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:63) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    ... 13 more
2022/04/25 00:19:39.735 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - file:/sensitive-data/outputs/cases/br/20150501.json
```
**@diogo.baeder:** This is a JSON file I'm trying to consume:
```
[
  {
    "brands_responses": {
      "first_1000226": 2,
      "second_1000226": 1,
      "third_1000226": 1,
      "fourth_1000226": 1,
      "fifth_1000226": 9,
      "sixth_1000226": 2,
      "seventh_1000226": 1,
      "eighth_1000226": 1,
      "ninth_1000226": 1,
      "tenth_1000226": 1,
      "eleventh_1000226": 1,
      "twelfth_1000226": 1,
      "thirteenth_1000226": 1
    },
    "caseid": 251214750,
    "date_": 20150501,
    "pmxid": 52735743,
    "region": "br",
    "sector_id": 1010,
    "uuid": "6702e33a-e961-4f62-b9df-2d65e4fe3fd5",
    "weight": 0.935066
  }
]
```
**@diogo.baeder:** This is my schema:
```
{
  "schemaName": "cases_schema",
  "dimensionFieldSpecs": [
    { "name": "brands_responses", "dataType": "JSON", "maxLength": 2147483647 },
    { "name": "caseid", "dataType": "INT" },
    { "name": "pmxid", "dataType": "INT" },
    { "name": "region", "dataType": "STRING" },
    { "name": "sector_id", "dataType": "INT" },
    { "name": "uuid", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "weight", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "date_", "dataType": "INT", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd", "granularity": "1:DAYS" }
  ]
}
```
**@diogo.baeder:** Sorry, while gathering the information I think I found what
the issue is, hold on...  
**@diogo.baeder:** Nah, it's still erroring. I had forgotten to add the fields
to `noDictionaryColumns`, but even with that added it still errors out  
**@diogo.baeder:** This is the table definition I send to Pinot when creating
it:
```
{
  "tableName": "cases",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "cases_schema",
    "timeColumnName": "date_",
    "timeType": "DAYS",
    "replicasPerPartition": "1",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "noDictionaryColumns": [ "brands_responses" ],
    "jsonIndexColumns": [],
    "invertedIndexColumns": [],
    "nullHandlingEnabled": true,
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "region": { "functionName": "Murmur", "numPartitions": 400 }
      }
    }
  },
  "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
  "metadata": { "customConfigs": {} },
  "routing": {
    "instanceSelectorType": "balanced",
    "segmentPrunerTypes": [ "partition", "time" ]
  },
  "transformConfigs": [
    { "columnName": "brands_responses", "transformFunction": "jsonFormat(\"brands_responses\")" }
  ]
}
```
**@diogo.baeder:** I noticed, however, that `transformConfigs` is missing from
the table definition when looking at how the table got created  
**@diogo.baeder:** This is the table config I see in the incubator UI:
```
{
  "OFFLINE": {
    "tableName": "cases_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeType": "DAYS",
      "schemaName": "cases_schema",
      "replication": "1",
      "timeColumnName": "date_",
      "allowNullTimeValue": false,
      "replicasPerPartition": "1"
    },
    "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [ "brands_responses" ],
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "region": { "functionName": "Murmur", "numPartitions": 400 }
        }
      },
      "rangeIndexVersion": 2,
      "jsonIndexColumns": [],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "loadMode": "MMAP",
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": true
    },
    "metadata": { "customConfigs": {} },
    "routing": {
      "segmentPrunerTypes": [ "partition", "time" ],
      "instanceSelectorType": "balanced"
    },
    "isDimTable": false
  }
}
```
**@diogo.baeder:** Alright, I figured out that I was missing an `ingestionConfig`
as part of the table config - the documentation about JSON indexing is wrong;
it doesn't mention this field. But even with this field it doesn't work: if I
send the correct payload I get:
```
{
  "code": 400,
  "error": "Arguments of a transform function '[brands_responses]' cannot contain the destination column 'brands_responses'"
}
```
**@diogo.baeder:** Alright, it's working now. I figured out that the source
column from which the JSON is transformed cannot have the same name as the
destination column. I wish Pinot was able to just do the transformation for us
without all this extra configuration, though.  
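Based on the messages above, the working arrangement looks roughly like the
fragment below: the transform lives under a top-level `ingestionConfig` in the
table config, and the destination schema column gets a name that differs from
the source field in the raw JSON. The `brands_responses_json` name is a
hypothetical example of the required renaming, not the exact final config.
```
// Illustrative fragment of the table config: "brands_responses_json" is a
// hypothetical schema column; the raw JSON field keeps the name "brands_responses".
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "brands_responses_json",
      "transformFunction": "jsonFormat(brands_responses)"
    }
  ]
}
```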
**@diogo.baeder:** Fixed on my side. I just opened a ticket to propose changes
to the docs to improve them:  
**@aroopganguly:** @aroopganguly has joined the channel  
 **@very312:** Hi, does schema evolution only work for batch tables? We wanted
to add a new column to a hybrid table (offline table + realtime streaming
table) which includes upsert columns. When we added the new column to the
schema config, we also needed to change the table config since the new column
needs to be overwritten on upsert. That meant we had to delete the original
realtime table and re-generate it. Is that the right way of creating new
columns in a streaming upsert table?  only shows batch tables  
**@mayanks:** Schema evolution works for hybrid tables as well. As for upsert
cc: @yupeng  
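For reference, the usual schema-evolution flow avoids dropping and re-creating
the table: update the existing schema in place, then trigger a segment reload.
A sketch against the controller REST API; the host/port, schema name, and table
name below are assumptions, and whether this fully covers the upsert OVERWRITE
part is the question deferred to @yupeng above.
```
# Update the existing schema in place (schema name must match the existing one).
curl -X PUT -H "Content-Type: application/json" \
  -d @a_schema.json "http://localhost:9000/schemas/a_schema"

# Reload segments so existing segments pick up the new column with default values.
curl -X POST "http://localhost:9000/segments/a_table/reload"
```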
**@very312:** When I tried the steps below in order, here is the error msg I
got. cc. @yupeng
• schema a' add (i.e. modify schema a)
• table A' modify (i.e. modify table A - since we needed to mark the
test_evolution column as "OVERWRITE" in the table config)
• run `bin/pinot-admin.sh AddTable -schemaFile a_schema.json -tableConfigFile
a_table.json -exec;`
• send a new msg
```
[
  {
    "message": "MergeResponseError:\nData schema mismatch between merged block: [card_alert(STRING),card_bin(STRING),card_id(STRING),card_issuer(STRING),card_number(STRING),card_type(STRING),create_time(TIMESTAMP),delete_time(TIMESTAMP),test_evolution(INT),update_time(TIMESTAMP),user_id(STRING)] and block to merge: [card_alert(STRING),card_bin(STRING),card_id(STRING),card_issuer(STRING),card_number(STRING),card_type(STRING),create_time(TIMESTAMP),delete_time(TIMESTAMP),update_time(TIMESTAMP),user_id(STRING)], drop block to merge",
    "errorCode": 500
  }
]
```
**@very312:** Can you tell us the process of adding a new column which would be
overwritten, using the meetupRsvp example?  

### _#tmp_

  
 **@fding:** added an integration to this channel:  
**@g.kishore:** removed an integration from this channel:  

### _#getting-started_

  
 **@nutbiz:** @nutbiz has joined the channel  
 **@aroopganguly:** @aroopganguly has joined the channel  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org