Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/05/07 02:00:28 UTC

Apache Pinot Daily Email Digest (2022-05-06)

### _#general_

  
 **@zaikhan:** Hi, I have started `PinotController`, `PinotBroker` and
`PinotServer` using the `multi_stage_query_engine` git branch, but join
queries are still not working. Do I need to do something else?  
**@kharekartik:** @walterddr  
**@walterddr:** Yes, there are some configurations that need to be set to
enable it. I will create a new PR to enable it by default  
**@zaikhan:** Another thing I noticed: these components start fine, but when I
do *Build Project* in IntelliJ, some classes are not found, e.g.
```
java: cannot find symbol
  symbol:   class Plan
  location: package org.apache.pinot.common.proto
```  
**@walterddr:** Yes. Some of that code is generated, so you will have to run
`mvn install` first and tag those directories as generated sources in IntelliJ.  
**@walterddr:** Thanks for the feedback. Will also include this as part of the
PR with an instruction section  
**@zaikhan:** @walterddr Could you DM me the config that I need to enable? You
could create the PR later.  
**@walterddr:** yes. we created a quickstart, but it is still on my branch  you
can give it a try  
 **@haiylin:** @haiylin has joined the channel  
 **@jinal.panchal:** Hello, I didn't quite get the concept of dimension
columns in Pinot. If the columns already have well-defined data types, then
what's the significance of the Pinot field specifications, like
metricFieldSpecs, dimensionFieldSpecs, etc.?  
**@mayanks:** Metrics are things you count/sum/avg/etc. Dimensions are ones
you slice/dice (filter/group) by.  
**@mayanks:** While that is the idea, Pinot does not enforce these as strict
rules. Think of them as hints that allow Pinot to do internal optimizations
(for example, metrics may end up being stored without a dictionary, may have a
different default null value, etc.)  
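For context, a Pinot schema declares these roles explicitly. A minimal sketch
(all names here are made up for illustration):
```
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {"name": "country", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "clicks", "dataType": "LONG"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "ts", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}
```
Here `country` would typically be filtered/grouped on, while `clicks` would be
aggregated, so Pinot may choose different storage strategies for each.  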
 **@ashutosh25apr:** @ashutosh25apr has joined the channel  
 **@ashutosh25apr:** :wave: Hi everyone!  
**@mayanks:** Hi Ashutosh, welcome to the Pinot community.  
**@mitchellh:** Welcome!  
 **@diogo.baeder:** So, I just created a table with >40k rows, but with daily
segments, 318 segments in total (not ideal; I want to roll up to monthly
segments later), and defined a JSON index for my main columns, which contain
dynamic data (data that just can't be defined as static columns). Even trying
to brutalize this thing by querying all the data with a limit that surpasses
the number of rows, I still get ~600ms queries! Geez, this thing is fast!
:slightly_smiling_face:  
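For reference, a JSON index is declared in the table config; a minimal sketch
(the column name is a placeholder):
```
"tableIndexConfig": {
  "jsonIndexColumns": ["dynamic_data"]
}
```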
**@mayanks:** Yes, it is fast :slightly_smiling_face:. In your case, the data
size seems small as well.  
**@diogo.baeder:** It's quite small, yes: 1 year of data, ~215 MB total size.
A single segment could easily hold a month of data; for our larger data
regions that will be a good size.  
 **@diogo.baeder:** Can't wait to test monthly rolled-up segments though.
Might make things even better.  
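If that rollup ends up being done with Pinot's minion-based MergeRollupTask
(an assumption; the thread doesn't say how), the relevant table config section
would look roughly like this sketch, where the level name and period values
are placeholders:
```
"task": {
  "taskTypeConfigsMap": {
    "MergeRollupTask": {
      "1month.mergeType": "rollup",
      "1month.bucketTimePeriod": "30d",
      "1month.bufferTimePeriod": "7d"
    }
  }
}
```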
 **@rajat.taya:** @rajat.taya has joined the channel  
 **@ryan.persaud:** @ryan.persaud has joined the channel  
 **@mathieu.druart:** Hi! This PR:  removed the Pulsar plug-in from the Pinot
build because of this issue: . Now that the issue is marked as closed, does
anyone know if the plug-in will be added back to the build? Thank you!  
**@mayanks:** Pinot-pulsar plugin does exist already cc: @kharekartik  
**@mayanks:** `pinot-stream-ingestion/pinot-pulsar`  
**@mathieu.druart:** @mayanks yes, the plugin exists, but the assembly file
doesn't add the plugin jar to the plugins folder (the lines are commented out):  
**@mayanks:** Hmm I thought that was resolved. @kharekartik any insights  
**@mathieu.druart:** we have to build a custom docker image to add the plugin
for now  
 **@ysuo:** Hi team, I have a question and don't know how to solve it. How can
I extract `numOfStas.Policy` from a Kafka message and save it to a Pinot table
field? When I use *transformFunction*, it doesn't work:
```
{
  "columnName": "stas_policy",
  "transformFunction": "jsonPathString(stats, '$.text_body.fields.numOfStas.Policy')"
}
```
*And a sample Kafka message is like this:*
```
{
  "name": "telemetry_signal_gfw_api_usage",
  "stats": {
    "text_body": {
      "fields": {
        "numOfStas": 0,
        "numOfStas.Policy": 21
      }
    }
  }
}
```  
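One thing worth checking (an assumption, not something confirmed in this
thread): JsonPath treats `.` as a path separator, so a key that itself
contains a dot usually has to be addressed with bracket notation, e.g.
```
{
  "columnName": "stas_policy",
  "transformFunction": "jsonPathString(stats, '$.text_body.fields[\"numOfStas.Policy\"]')"
}
```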

###  _#random_

  
 **@haiylin:** @haiylin has joined the channel  
 **@ashutosh25apr:** @ashutosh25apr has joined the channel  
 **@rajat.taya:** @rajat.taya has joined the channel  
 **@ryan.persaud:** @ryan.persaud has joined the channel  

###  _#troubleshooting_

  
 **@xuhongkun1103:** Hi @xiangfu0, could you please help me fix this issue
with Presto in the workflow? Link:  
**@xiangfu0:** For the Presto fix, you need to make sure the pom file changes
for the pinot-spi, pinot-common, etc. modules are also reflected in
pinot-spi-jdk8 and pinot-common-jdk8  
**@xiangfu0:** Those modules are under pinot-connectors/prestodb-pinot-
dependencies  
**@xuhongkun1103:** @xiangfu0 Thanks for your prompt reply. Do you mean that
if I add a dependency in pinot-common, I also have to add it to the
pinot-common-jdk8 pom file?  
**@xiangfu0:** yes  
**@xiangfu0:** We made symlinks for the source code  
**@xiangfu0:** But not for the pom files  
**@xiangfu0:** So you need to keep the dependencies aligned on both sides  
**@xiangfu0:** We saw this issue when trying to release for both jdk8 and jdk11  
**@xuhongkun1103:** Got it, thanks!  
 **@haiylin:** @haiylin has joined the channel  
 **@diogo.baeder:** Hi folks! The  doesn't explain how to configure that; how
can that be done?  
**@mayanks:**  
**@diogo.baeder:** I'm already creating inverted indexes, and I understood that
if a column is sorted, Pinot will create a sorted index for it, but what about
"sorted inverted" ones? Do I just have to make sure a column is sorted before
ingestion for it to get an inverted index? There's no specific configuration
for "sorted inverted", at least I didn't find one there  
**@mayanks:** That is just saying that a sorted forward index also doubles as
an inverted index. There isn't an additional sorted inverted index. It is a
bit confusing; could we make it more readable, @mark.needham?  
**@diogo.baeder:** Ah, ok then. So if the column is already sorted I don't
need to do anything, just ingest it, right?  
**@mayanks:** For real-time tables I recommend specifying it as sorted in the
table config  
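A minimal sketch of that table config section (the column name is a
placeholder):
```
"tableIndexConfig": {
  "sortedColumn": ["myTimeColumn"]
}
```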
**@diogo.baeder:** It's for offline ingestion  
**@mayanks:** Ok, then if data is already sorted, that is enough  
**@diogo.baeder:** Ah, cool. Thanks man!  
**@mark.needham:** will edit the docs  
**@mark.needham:** but while I was understanding sorted indexes I wrote this -  
**@diogo.baeder:** I'll take a look, thanks man. I'll use for offline tables
though.  
 **@ashutosh25apr:** @ashutosh25apr has joined the channel  
 **@rblau:** hello! we’re trying to batch ingest segments into our pinot
instance, but we are finding that some segments are in a bad state. the stack
trace we see from the debug/tables/{tablename} endpoint is like so:
```
java.lang.IllegalArgumentException: newLimit > capacity: (604 > 28)
    at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
    at java.base/java.nio.Buffer.limit(Buffer.java:346)
    at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
    at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
    at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
    at org.apache.pinot.segment.spi.memory.PinotByteBuffer.view(PinotByteBuffer.java:303)
    at org.apache.pinot.segment.spi.memory.PinotDataBuffer.view(PinotDataBuffer.java:379)
    at org.apache.pinot.segment.local.segment.index.readers.forward.BaseChunkSVForwardIndexReader.<init>(BaseChunkSVForwardIndexReader.java:97)
    at org.apache.pinot.segment.local.segment.index.readers.forward.FixedByteChunkSVForwardIndexReader.<init>(FixedByteChunkSVForwardIndexReader.java:37)
    at org.apache.pinot.segment.local.segment.index.readers.DefaultIndexReaderProvider.newForwardIndexReader(DefaultIndexReaderProvider.java:97)
    at org.apache.pinot.segment.spi.index.IndexingOverrides$Default.newForwardIndexReader(IndexingOverrides.java:184)
    at org.apache.pinot.segment.local.segment.index.column.PhysicalColumnIndexContainer.<init>(PhysicalColumnIndexContainer.java:166)
    at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:181)
    at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:121)
    at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:91)
    at org.apache.pinot.core.data.manager.offline.OfflineTableDataManager.addSegment(OfflineTableDataManager.java:52)
    at org.apache.pinot.core.data.manager.BaseTableDataManager.addOrReplaceSegment(BaseTableDataManager.java:373)
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:355)
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:162)
    at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404)
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331)
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
```
@luisfernandez and I were wondering what this `capacity` value (28, according
to the trace) might be? thanks!  
**@richard892:** hi, looks like an integer overflow  
**@richard892:** the default raw forward index format is v2, which only
supports 2GB per column  
**@richard892:** you can try v3 or v4 which support larger sizes  
**@luisfernandez:** how to do that :smile:  
**@luisfernandez:** and also, what does it mean? Does it mean one of the
values in the noDictionaryColumns is just too big?  
**@luisfernandez:** one of the values of the columns in noDictionaryColumns*  
**@steotia:** Is this column configured as a noDictionaryColumn ?  
**@steotia:** You can configure v3 as follows:
```
"fieldConfigList": [
  {
    "encodingType": "RAW",
    "name": "columnName",
    "properties": {
      "deriveNumDocsPerChunkForRawIndex": "true",
      "rawIndexWriterVersion": "3"
    }
  }
]
```  
**@luisfernandez:** yes, well, most of them are  
**@luisfernandez:** these columns are just counts  
**@luisfernandez:** ```"noDictionaryColumns": [ "click_count", "order_count",
"impression_count", "cost", "revenue" ],```  
**@steotia:** Also, make sure to add the column to the noDictionaryColumns
list in the indexingConfig section of the table config:
```
"noDictionaryColumns": [
  "columnName"
],
```
Ideally it should not be needed in both places, but yeah, config cleanup is needed  
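Putting the two pieces together, a sketch of how the relevant table config
sections might look (using one of the column names from this thread):
```
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["click_count"]
  },
  "fieldConfigList": [
    {
      "name": "click_count",
      "encodingType": "RAW",
      "properties": {
        "deriveNumDocsPerChunkForRawIndex": "true",
        "rawIndexWriterVersion": "3"
      }
    }
  ]
}
```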
**@steotia:** I think you just need to set up `fieldConfigList` then  
**@steotia:** What is the type of this column ?  
**@luisfernandez:** type is int  
**@luisfernandez:** for all those columns  
**@luisfernandez:** cool, thank you. We are just trying to understand what in
particular caused that exception, since it's a new one to us  
**@steotia:** The v3 format was especially introduced because we were hitting
the 2GB limit on STRING type columns. Since you are hitting this on an INT
column, it possibly means that you have ~500 million rows in a single segment
(2GB at 4 bytes per INT works out to roughly 536 million values)?  
**@steotia:** which may not necessarily be optimal  
**@steotia:** btw, v3 will work for both fixed and variable width columns.. I
am just curious that there is a need to use it on INT / fixed-width columns  
**@steotia:** cc @richard892  
**@rblau:** :eyes: I think we're seeing that the number of rows in our
segments is generally around 200k; I'd be pretty surprised if one segment had
>500 million rows  
**@richard892:** are any of these multi value?  
**@luisfernandez:** none of them  
**@steotia:** seems like a different problem to me then  
**@steotia:** in fact, the problem is happening during read / segment load,
which potentially implies there is no need to bump the version from v2 to v3;
if the 2GB limit were the issue, segment generation should have failed
initially, as the overflow would have resulted in a negative capacity (at
least that's what I have seen in the past whenever there was a need to go from
v2 to v3)  
**@richard892:** I will look in to this on Monday  
 **@prashant.pandey:** Hi team. What should be the value of `controller.host`
in the controller config for a k8s deployment? I am deploying Pinot to a new
env, and leaving this field empty results in an NPE during controller startup:
```
java.lang.NullPointerException: null
    at org.apache.pinot.common.utils.helix.HelixHelper.updateHostnamePort(HelixHelper.java:550) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
    at org.apache.pinot.controller.BaseControllerStarter.updateInstanceConfigIfNeeded(BaseControllerStarter.java:607) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
    at org.apache.pinot.controller.BaseControllerStarter.registerAndConnectAsHelixParticipant(BaseControllerStarter.java:583) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
    at org.apache.pinot.controller.BaseControllerStarter.setUpPinotController(BaseControllerStarter.java:382) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
```  
**@walterddr:** you can probably add this to the config:
```pinot.set.instance.id.to.hostname=true```  
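In other words, the controller config generally needs one of the following
(the explicit hostname below is a placeholder, not a value from this thread):
```
# Option 1: derive the instance hostname from the pod automatically
pinot.set.instance.id.to.hostname=true

# Option 2: set the host explicitly, e.g. the pod's headless-service FQDN
controller.host=pinot-controller-0.pinot-controller-headless.pinot.svc.cluster.local
```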
**@prashant.pandey:** Well, this is set, but I'm not sure why it's not picking
it up. Literally the same configs as in all other envs, and they work fine. I
am probably making a very stupid mistake somewhere.  
**@prashant.pandey:** Here's my controller config:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: pinot-controller
  namespace: pinot
data:
  pinot-controller.conf: |-
    controller.helix.cluster.name=myenv
    controller.port=9000
    controller.data.dir=/tmp/controller
    controller.zk.str=apache-pinot-zookeeper-bitnami-headless.pinot.svc.cluster.local:2181
    pinot.set.instance.id.to.hostname=true
    pinot.set.instance.id.to.hostname=true
```  
**@prashant.pandey:** Oh, I see it’s repeated. Let me try deleting the
duplicate line.  
**@walterddr:** did you restart the pod? it should automatically pick it up  
**@prashant.pandey:** Wow so I deleted that duplicate line and it picked it
up.  
**@prashant.pandey:** I’ll recheck this. Not sure why duplicate configs are
behaving like this. Might be a bug.  
**@prashant.pandey:** Thanks @walterddr  
**@walterddr:** np. glad i can help  
 **@rajat.taya:** @rajat.taya has joined the channel  
 **@ryan.persaud:** @ryan.persaud has joined the channel  
 **@ryan.persaud:** :wave: Hello, I am working through the QuickStart
Tutorial, and I started Pinot locally with the command: `./bin/pinot-admin.sh
QuickStart -type batch`. I can see a log entry for the table being added, and
no obvious errors:
```
Adding offline table: baseballStats
Executing command: AddTable -tableConfigFile /var/folders/jv/g99n5jcj3hz0lbbf90gykcc40000gq/T/1651874628141/baseballStats_1651874628195.config -schemaFile /var/folders/jv/g99n5jcj3hz0lbbf90gykcc40000gq/T/1651874604715/baseballStats/baseballStats_schema.json -controllerProtocol http -controllerHost localhost -controllerPort 9000 -user null -password [hidden] -exec
```
but I do not see the table via the UI (please see screenshot). Is there an
additional step that I need to take in order to see the table? Thanks! Not
sure if it's relevant, but here is some version information: Java: `openjdk
11.0.15 2022-04-19`, Pinot: `pinot-0.10.0`  
**@xiaobing:** if the cmd went well, the table should show up in the UI  
**@xiaobing:** trying this on my side  
**@ryan.persaud:** Since I'm not seeing it, I'm guessing there was an issue
adding the table. Is there anywhere else to check for logging? I looked in
`logs/pinot-all.log` as well, but I only see `44293 2022/05/06 16:03:48.195
INFO [BootstrapTableTool] [main] Adding offline table: baseballStats` and no
errors/exceptions.  
**@xiaobing:** hmm.. I just tried this quickstart on my side (but on the
latest master branch); things worked as expected, and the logs just emit to
the console  
**@xiaobing:** I can try it on `pinot-0.10.0` shortly  
**@xiaobing:** for a clean attempt, I downloaded the pinot-0.10.0 binary and
ran the cmd again. It went well: the sample queries returned results, and the
Pinot UI showed the table too. The logs simply emit to the console (pretty
verbose, actually):
```
➜  apache-pinot-0.10.0-bin bin/pinot-admin.sh QuickStart -type batch
...
Query : select playerName, runs, homeRuns from baseballStats order by yearID limit 10
Executing command: PostQuery -brokerProtocol http -brokerHost 192.168.0.101 -brokerPort 8000 -queryType sql -query select playerName, runs, homeRuns from baseballStats order by yearID limit 10
Processed requestId=5,table=baseballStats_OFFLINE,segments(queried/processed/matched/consuming)=1/1/1/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=33,resSerMs=0,totalTimeMs=34,minConsumingFreshnessMs=-1,broker=Broker_192.168.0.101_8000,numDocsScanned=97889,scanInFilter=0,scanPostFilter=97919,sched=FCFS,threadCpuTimeNs(total/thread/sysActivity/resSer)=0/0/0/0
requestId=5,table=baseballStats_OFFLINE,timeMs=40,docs=97889/97889,entries=0/97919,segments(queried/processed/matched/consuming/unavailable):1/1/1/0/0,consumingFreshnessTimeMs=0,servers=1/1,groupLimitReached=false,brokerReduceTimeMs=2,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs,RequestSentDelayMs);192.168.0.101_O=0,36,642,0,1,offlineThreadCpuTimeNs(total/thread/sysActivity/resSer):0/0/0/0,realtimeThreadCpuTimeNs(total/thread/sysActivity/resSer):0/0/0/0,query=select playerName, runs, homeRuns from baseballStats order by yearID limit 10
playerName          runs  homeRuns
Alfred L.           0     0
Charles Roscoe      66    0
Adrian Constantine  29    0
Robert              9     0
Arthur Algernon     28    0
Douglas L.          28    2
Francis Patterson   0     0
Robert Edward       30    0
Franklin Lee        13    0
William             1     0
***************************************************
You can always go to  to play around in the query console
...
➜  apache-pinot-0.10.0-bin java -version
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
```  
**@ryan.persaud:** Interesting, I don't think I'm getting the query output. Is
it after all of the bootstrapping has completed?  
**@xiaobing:** yes, after the Pinot components are started, the table is
created, and the sample data is ingested into the table.  
**@ryan.persaud:** Did you get an explicit message for the data being ingested
into the table?  
**@xiaobing:** e.g. there were logs like:
```
Executing command: AddTable -tableConfigFile /var/folders/_0/gctvc27x5795n3rb5zh52qm00000gn/T/1651877834470/baseballStats_1651877834521.config -schemaFile /var/folders/_0/gctvc27x5795n3rb5zh52qm00000gn/T/1651877789218/baseballStats/baseballStats_schema.json -controllerProtocol http -controllerHost localhost -controllerPort 9000 -user null -password [hidden] -exec
Adding schema: baseballStats with override: true
Added schema: baseballStats
...
{"status":"Table baseballStats_OFFLINE succesfully added"}
...
Uploading a segment baseballStats_OFFLINE_0 to table: baseballStats, push type SEGMENT, (Derived from API parameter)
...
Added segment: baseballStats_OFFLINE_0 to IdealState for table: baseballStats_OFFLINE
...
```  

###  _#getting-started_

  
 **@haiylin:** @haiylin has joined the channel  
 **@ashutosh25apr:** @ashutosh25apr has joined the channel  
 **@rajat.taya:** @rajat.taya has joined the channel  
 **@ryan.persaud:** @ryan.persaud has joined the channel  
 **@krishna080:** @krishna080 has joined the channel  

###  _#introductions_

  
 **@haiylin:** @haiylin has joined the channel  
 **@ashutosh25apr:** @ashutosh25apr has joined the channel  
 **@rajat.taya:** @rajat.taya has joined the channel  
 **@ryan.persaud:** @ryan.persaud has joined the channel  

###  _#linen_dev_

  
 **@slackbot:** removed an integration from this channel:  
**@slackbot:** removed an integration from this channel:  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org