Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/05/22 02:00:24 UTC

Apache Pinot Daily Email Digest (2021-05-21)

### _#general_

  
 **@onur.henden:** @onur.henden has joined the channel  
 **@gqian3:** @gqian3 has joined the channel  
 **@akash143shah:** @akash143shah has joined the channel  
 **@patidar.rahul8392:** How does Pinot load segments? i.e. if I am consuming
data from Kafka and new events arrive on the topic, will Pinot load the
entire segment again, or only the new events?  
**@mayanks:** New events are consumed in memory. For segments already
flushed to disk, Pinot memory-maps them, so only the part of the segment that
needs to be read is pulled in.  
**@patidar.rahul8392:** Okay @mayanks, thanks! Do we have any documents on
this, to understand how it works internally and how the controller stores
segments on the server?  
**@mayanks:**  
**@mayanks:** Also, there are intro-to-Pinot videos on the docs page  
 **@mohit.asingh:** @mohit.asingh has joined the channel  
 **@hamza.senoussi:** Hello, I'm new to Pinot and I'm trying to run some tests
on the tool using a GCP deployment. I have deployed it on GKE using the
official documentation and it seems to work well. The next step is to import
batch data from Google Cloud Storage in order to run some queries. The data on
Cloud Storage is 300 GB of csv files. *First thing:* I couldn't figure out
how to link Cloud Storage to Apache Pinot via the pinot plugin. *Second
thing:* How can I transform the csv files into Pinot tables? Can I have some
guidance on these two subjects? Thanks in advance :smile:  
**@mayanks:** You can check out:  
**@hamza.senoussi:** I have checked this page, but it isn't clear where I can
access the controller and server config  
**@mayanks:** These are config files used to start pinot components. For
example:  
**@mayanks:** @fx19880617 How do we customize these configs when using GKE?  
**@mayanks:** @hamza.senoussi here's another sample:  
**@hamza.senoussi:** I have also followed the second link and I can
successfully access the pinot UI via localhost:9000 while it's running on GKE  
**@hamza.senoussi:** But I can't figure out how and where to edit the config
files in order to enable the gcs plugin  
**@mayanks:** So these are config files you provide when starting Pinot
components. I'll check how these configs get hooked up when using GKE. Also
tagging @fx19880617 who did this part.  
**@fx19880617:** for gke part, if you deploy under k8s then you can edit the
configMap to add all the required configs  
**@mayanks:** Could we add this in the doc?  
**@fx19880617:** those things can be edited inside the `values.yaml` when
deploying with helm  
**@fx19880617:** some customization:  
**@hamza.senoussi:** Thank you ! That was helpful  
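A minimal sketch of what those `values.yaml` customizations could look like for enabling the GCS plugin, assuming the helm chart's `controller.extra.configs` / `server.extra.configs` blocks and the property names from Pinot's GCS deep-store docs (the project id and key path are placeholders):
```
controller:
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
      pinot.controller.storage.factory.gs.projectId=my-project
      pinot.controller.storage.factory.gs.gcpKey=/var/secrets/gcp/key.json
      pinot.controller.segment.fetcher.protocols=file,http,gs
      pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
server:
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      pinot.server.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
      pinot.server.storage.factory.gs.projectId=my-project
      pinot.server.storage.factory.gs.gcpKey=/var/secrets/gcp/key.json
      pinot.server.segment.fetcher.protocols=file,http,gs
      pinot.server.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```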
**@hamza.senoussi:** In order to launch an ingestion job I created a YAML file
using this template:
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: ''
outputDirURI: ''
overwriteOutput: true
pinotFSSpecs:
  - scheme: gs
    className: org.apache.pinot.plugin.filesystem.GcsPinotFS
    configs:
      projectId: 'my-project'
      gcpKey: 'path-to-gcp json key file'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'students'
pinotClusterSpecs:
  - controllerURI: ''
```  
**@hamza.senoussi:** I created the file under : `/incubator-
pinot/kubernetes/helm/pinot`  
**@fx19880617:** in k8s?  
**@fx19880617:** you can do that by changing the ```controllerURI: ''```  
**@fx19880617:** and for gcp Key  
**@fx19880617:** you need to add that into your k8s as a secret  
**@fx19880617:** then mount it to the container  
**@hamza.senoussi:** and what should I do to launch it ?  
**@fx19880617:** you can create a k8s batch job to launch it as a one time job  
**@fx19880617:** this is one example:  but without gcp  
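A rough sketch of those two steps under stated assumptions (the secret name, namespace, configMap holding the job spec, mount paths, and image tag below are illustrative, and the image's entrypoint is assumed to be `pinot-admin.sh`):
```
# 1) Store the GCP service-account key as a k8s secret (one-time):
#    kubectl create secret generic gcp-sa-key --from-file=key.json=/local/path/key.json -n pinot
#
# 2) One-time batch Job that mounts the key and runs the ingestion spec:
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-gcs-ingestion
  namespace: pinot
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pinot-ingestion-job
          image: apachepinot/pinot:latest
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/config/ingestion-job-spec.yaml"]
          volumeMounts:
            - name: gcp-sa-key
              mountPath: /var/secrets/gcp   # gcpKey in the job spec would point here
              readOnly: true
            - name: job-spec
              mountPath: /config
      volumes:
        - name: gcp-sa-key
          secret:
            secretName: gcp-sa-key
        - name: job-spec
          configMap:
            name: pinot-ingestion-job-spec
```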
**@hamza.senoussi:** Thank you I'll try to go in depth !  
 **@ken:** What’s the use case for running with multiple controllers? These
are stateless, and don’t have a lot of load (if using something like HDFS for
deep storage), right? So is it just zero downtime (assuming you have a LB in
front of them) in case one goes down?  
**@mayanks:** Fault tolerance for one.  
**@mayanks:** Also, when you get into the thousands-of-tables range, a single
controller might not cut it.  
**@ken:** Hmm, I thought the controller load with lots of tables was due to
synchronization load, but having two controllers doesn’t distribute that load,
right? So what’s the bottleneck with one controller and 1000s of tables?  
**@mayanks:** What do you mean by synchronization load?  
**@ken:** When viewing logs, I see a lot of Helix-related activity (what’s the
ideal vs. actual state). But maybe that’s just when we’re redeploying…  
**@mayanks:** The controller does run a lot of background jobs (e.g. retention)  
**@mayanks:** But you are right, with deep store + metadata push, a lot of the
network traffic is avoided. Some local IO + CPU is also avoided since the
controller won't need to untar/unzip the segments, etc.  
**@g.kishore:** if using metadata based push then it’s really for fault
tolerance.. even with thousands of tables one controller is probably enough  
 **@ken:** If I have one controller and two brokers, does the controller
distribute the query load across the two brokers? I thought it would, but the
Pinot in production page recommends “HTTP load balancers for spraying queries
across brokers (or other mechanism to balance queries)“. Or is that to
automatically route traffic away from a broker which has gone down?  
**@mayanks:** Are you using controller to route queries?  
**@ken:** Yes  
**@mayanks:** Hmm, you don't need to (rather shouldn't)  
**@mayanks:** You can have a load balancer over the brokers. Brokers provide a
REST API for querying  
**@mayanks:** We added query console to the controller so it can be one stop
for everything (query/ZK/admin/etc). In production, there's no need to go
through controller for querying.  
**@ken:** Yes, good point  
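For reference, a minimal sketch of querying a broker's REST endpoint directly (the default broker port is 8099; the host and table name below are placeholders):
```
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"sql": "SELECT COUNT(*) FROM myTable"}' \
  http://<broker-host>:8099/query/sql
```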

###  _#random_

  
 **@onur.henden:** @onur.henden has joined the channel  
 **@gqian3:** @gqian3 has joined the channel  
 **@akash143shah:** @akash143shah has joined the channel  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#feat-presto-connector_

  
 **@vbondugula:** @vbondugula has joined the channel  

###  _#pinot-helix_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#feat-better-schema-evolution_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#troubleshooting_

  
 **@elon.azoulay:** is there a way to represent an array literal in pinot sql?
I tried `ARRAY[1,2,3]` in a filter and a select and it didn't work.  
**@fx19880617:** Hmm, do you want exact match of 1,2,3?  
**@elon.azoulay:** sure  
**@elon.azoulay:** is there a way to also get array contains?  
**@elon.azoulay:** but exact match also  
**@fx19880617:** for array contains, you can do col IN (1,2,3). This will
match rows that contain either 1, 2, or 3.  
**@fx19880617:** You can do col=1 AND col=2 AND col=3 to ensure the row
contains all of 1, 2, 3  
**@fx19880617:** But it will also match 1,2,3,4,5  
**@elon.azoulay:** is there a way to represent an array literal ?  
**@elon.azoulay:** in filter or select?  
**@fx19880617:** I don’t think we have that right now  
**@elon.azoulay:** ok, thanks! good to know, I'll stop playing around with
that then:)  
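Summarizing the filter semantics above as queries (table and column names are hypothetical; `mvCol` is a multi-value column):
```
-- "contains any of 1, 2, 3"
SELECT * FROM myTable WHERE mvCol IN (1, 2, 3)

-- "contains all of 1, 2, 3" (a row holding 1,2,3,4,5 also matches)
SELECT * FROM myTable WHERE mvCol = 1 AND mvCol = 2 AND mvCol = 3
```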
 **@onur.henden:** @onur.henden has joined the channel  
 **@mags.carlin:** Hi there, can someone help with monitoring Apache Pinot? Is
there any API that Pinot exposes for metrics/dashboards by default? CC
@mohamed.sultan  
**@jackie.jxt:** Hi Manju, please read this page for Pinot monitoring:  
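One setup commonly described for this, sketched here with assumed agent jar path, config file, and port: expose Pinot's JMX metrics to Prometheus by adding the JMX exporter java agent to each component's JAVA_OPTS before starting it, e.g.:
```
# jar path, config path, and port 8008 are assumptions for illustration
export JAVA_OPTS="-javaagent:/opt/pinot/etc/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml"
bin/pinot-admin.sh StartServer -zkAddress localhost:2181
```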
**@valentin:** Hello, I’m having an issue with `LIMIT` on a table with 4.5
million rows. When I run this query: ```SELECT * FROM
datasource_609bc4f74e3c000300131110 ORDER BY "timestamp" ASC LIMIT
100000,10``` I get a result in ~2.5s, and I can see in the query
response stats `totalDocs=4794306`, which is fine. But when I run this one
(offset 1,000,000 instead of 100,000): ```SELECT * FROM
datasource_609bc4f74e3c000300131110 ORDER BY "timestamp" ASC LIMIT
1000000,10``` I get no rows (which isn’t the expected behavior) in ~10s,
and totalDocs is `569840` (which isn’t the expected behavior either?). I
have a hybrid table with segment pruning by time. Do you have any idea why I’m
having this kind of issue? Thank you  
**@jackie.jxt:** Hi, can you please check the other metadata within the query
response? I suspect some servers timed out because of the high offset  
**@jackie.jxt:** FYI, for offset queries, pinot has to gather all the records
even before the offset, i.e. `100010` records for the first query, and
`1000010` records for the second query  
 **@gqian3:** @gqian3 has joined the channel  
 **@akash143shah:** @akash143shah has joined the channel  
 **@syedakram93:** Hi,  
**@syedakram93:**
```
Exception caught: java.lang.NullPointerException: null
  at java.io.Reader.<init>(Reader.java:78) ~[?:1.8.0_275]
  at java.io.InputStreamReader.<init>(InputStreamReader.java:97) ~[?:1.8.0_275]
  at org.apache.pinot.tools.admin.command.AbstractBaseAdminCommand.readInputStream(AbstractBaseAdminCommand.java:106) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-9becc57eec981d71d5b45af5da7b720840d18f17]
  at org.apache.pinot.tools.admin.command.AbstractBaseAdminCommand.sendRequest(AbstractBaseAdminCommand.java:101) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-9becc57eec981d71d5b45af5da7b720840d18f17]
  at org.apache.pinot.tools.admin.command.PostQueryCommand.run(PostQueryCommand.java:144) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-9becc57eec981d71d5b45af5da7b720840d18f17]
  at org.apache.pinot.tools.admin.command.PostQueryCommand.execute(PostQueryCommand.java:150) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-9becc57eec981d71d5b45af5da7b720840d18f17]
  at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:164) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-9becc57eec981d71d5b45af5da7b720840d18f17]
  at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:184) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-9becc57eec981d71d5b45af5da7b720840d18f17]
```
I'm getting the above exception while running a query; what could be the cause?  
**@jackie.jxt:** Hi Syed, can you please check if you got any ERROR logged on
your broker?  
**@jackie.jxt:** What arguments do you put in the `PostQueryCommand`?  
 **@pedro.cls93:** Hello, how is someone meant to configure a JSON index on a
string field that may sometimes be missing? I'm getting the following
exception:
```
java.lang.IllegalStateException: Cannot flatten value node: null
  at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:518) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.spi.utils.JsonUtils.flatten(JsonUtils.java:246) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.realtime.impl.json.MutableJsonIndex.add(MutableJsonIndex.java:71) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.indexsegment.mutable.MutableSegmentImpl.addNewRow(MutableSegmentImpl.java:631) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.indexsegment.mutable.MutableSegmentImpl.index(MutableSegmentImpl.java:475) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:497) [pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:402) [pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
```
This happens because the field comes in with a null default value: ``` "inputForUiControls" : "null",``` Its schema definition is:
```
dimensionFieldSpecs: [..., {
  "name": "inputForUiControls",
  "dataType": "STRING",
  "maxLength": 2147483647
}]
```  
**@pedro.cls93:** Is the intended method to fix this situation to specify the
default value as: `"defaultNullValue": "{}"` ?  
**@jackie.jxt:** Hi Pedro, this issue has been fixed in the current master
branch  
**@jackie.jxt:** @fx19880617 Do we have a recent release that Pedro can try
with?  
**@fx19880617:** latest should have it  
**@pedro.cls93:** Thank you both, I will take a look next week  
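For reference, a sketch of the workaround raised above, giving the string column a JSON-object default in its field spec (whether it is still needed depends on the Pinot build, since the null-handling fix is in newer releases):
```
{
  "name": "inputForUiControls",
  "dataType": "STRING",
  "maxLength": 2147483647,
  "defaultNullValue": "{}"
}
```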
 **@mohit.asingh:** @mohit.asingh has joined the channel  
 **@pedro.cls93:** Hello, can anyone shed some light on the root causes
of the following exception:
```
2021/05/21 14:32:34.378 ERROR [SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread] Caught exception in state transition from OFFLINE -> ONLINE for resource: HitExecutionView_REALTIME, partition: HitExecutionView__12__21__20210520T1019Z
org.apache.pinot.core.segment.index.loader.V3UpdateIndexException: Default value indices for column: inputForUiControls cannot be updated for V3 format segment.
  at org.apache.pinot.core.segment.index.loader.defaultcolumn.V3DefaultColumnHandler.updateDefaultColumn(V3DefaultColumnHandler.java:53) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.segment.index.loader.defaultcolumn.BaseDefaultColumnHandler.updateDefaultColumns(BaseDefaultColumnHandler.java:144) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.segment.index.loader.SegmentPreProcessor.process(SegmentPreProcessor.java:104) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:99) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:283) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:138) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164) [pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source) ~[?:?]
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_282]
  at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_282]
  at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) [pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) [pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) [pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) [pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
```
I've changed the default value of a String dimension field from `"null"` to `"{}"`  
**@g.kishore:** can you file an issue for this.. looks like a bug  
**@pedro.cls93:** Will do  
**@pedro.cls93:**  
**@pedro.cls93:** Please let me know if any additional information is
required.  
**@jackie.jxt:** Changing an existing field in the schema is a backward-
incompatible change, which is what causes this exception. When the server
throws such an exception, it should drop the segment, download a fresh copy
from the deep storage, and then auto-recover  
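If a segment in this state does not recover on its own, one option (hedged; the exact endpoint is best confirmed in the controller's Swagger UI) is to trigger a reload through the controller REST API:
```
# controller host/port are placeholders
curl -X POST "http://<controller-host>:9000/segments/HitExecutionView_REALTIME/reload"
```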

###  _#pinot-k8s-operator_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#transform-functions_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#docs_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#pinot-dev_

  
 **@onur.henden:** @onur.henden has joined the channel  
 **@mohit.asingh:** @mohit.asingh has joined the channel  
 **@mohit.asingh:** Hi, I have started exploring Apache Pinot and have a few
questions regarding Pinot's *schema*. I want to understand how Pinot works
with a Kafka topic that has an AVRO schema (a schema that includes nested
objects, arrays of objects, etc.), because I didn't find any resource or
example showing how to ingest data from Kafka with an Avro schema attached.
As per my understanding, with Pinot we have to provide a flat schema, or
alternatively use transform functions for nested JSON objects. Avro schema:
```
{
  "namespace": "my.avro.ns",
  "name": "MyRecord",
  "type": "record",
  "fields": [
    {"name": "uid", "type": "int"},
    {"name": "somefield", "type": "string"},
    {"name": "options", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "lvl2_record",
        "fields": [
          {"name": "item1_lvl2", "type": "string"},
          {"name": "item2_lvl2", "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "lvl3_record",
              "fields": [
                {"name": "item1_lvl3", "type": "string"},
                {"name": "item2_lvl3", "type": "string"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
```
Kafka Avro message:
```
{
  "uid": 29153333,
  "somefield": "somevalue",
  "options": [
    {
      "item1_lvl2": "a",
      "item2_lvl2": [
        {"item1_lvl3": "x1", "item2_lvl3": "y1"},
        {"item1_lvl3": "x2", "item2_lvl3": "y2"}
      ]
    }
  ]
}
```  
**@mayanks:** By default you can map the top-level columns to the Pinot schema.
Or, as you mentioned, use transform functions or store the nested parts as JSON
objects in Pinot.  
**@mohit.asingh:** ok.. as per my understanding we have to use a combination
of top-level columns and, for nested objects, transform functions.
What about the message key, do we also store the Kafka message key?  
**@mayanks:** We don’t atm, there was a thread yesterday on
<#CDRCA57FC|general> , check it out  
**@mohit.asingh:** Thanks @mayanks. In my scenario I want to use the Kafka
message key as a unique key and have the data updated in Pinot whenever a new
message arrives on the Kafka topic with an existing key, i.e. upsert in Pinot
keyed on the Kafka message key  
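A rough sketch of the approach described above: map the top-level Avro fields (`uid`, `somefield`) directly, and store the nested `options` array as a JSON string via an ingestion transform in the table config (the `jsonFormat` function and `transformConfigs` keys follow Pinot's ingestion-transformation docs; the destination column name is made up here and would be declared as a STRING dimension, optionally with a JSON index):
```
"ingestionConfig": {
  "transformConfigs": [
    { "columnName": "options_json", "transformFunction": "jsonFormat(options)" }
  ]
}
```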
 **@hamza.senoussi:** @hamza.senoussi has joined the channel  
 **@vbondugula:** @vbondugula has joined the channel  

###  _#community_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#announcements_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#pinot-docs_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#discuss-validation_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#getting-started_

  
 **@vbondugula:** @vbondugula has joined the channel  

###  _#debug_upsert_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  

###  _#complex-type-support_

  
 **@mohit.asingh:** @mohit.asingh has joined the channel  