Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/12/11 02:00:18 UTC

Apache Pinot Daily Email Digest (2021-12-10)

### _#general_

  
 **@saulo.sobreiro:** @saulo.sobreiro has joined the channel  
 **@serhiish:** @serhiish has joined the channel  
 **@utkarsh.saxena:** @utkarsh.saxena has joined the channel  
 **@mapshen:** Is there a way or an API to get the latest offset consumed for
a real-time table/segment?  
**@npawar:** Under the Swagger APIs, look for the consumingSegmentsInfo API  
**@mapshen:** Ah sweet. The intention is to monitor whether the consuming segment
offset is in sync with the Kafka partition offset. Does Pinot expose such a
metric already?  
**@mapshen:** @npawar alternatively, what would be your suggestion on
monitoring this?  
**@npawar:** this API is the only way to monitor the exact offset consumed from
the Pinot side. One metric that is useful is LLC_PARTITION_CONSUMING. This is a
gauge which will be 0 if the partition is not consuming for any reason.
Monitoring this in a way such as “if 0 for more than 10 minutes, alert” would
be good  
**@npawar:** @mayanks just checking, we don't have any other ways to monitor
lag between the Kafka latest offset and the consumer offset, right?  
**@mayanks:** @npawar @mapshen yes, I am not aware of any other existing ways
to monitor.  
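
For reference, a minimal sketch of polling the consumingSegmentsInfo endpoint mentioned above, assuming the controller runs on its default port 9000 and a table named `myTable` (both placeholders):

```
# Hypothetical host/table - adjust to your cluster.
curl -s "http://localhost:9000/tables/myTable/consumingSegmentsInfo"
```

The response should report, per consuming segment, the serving server, the consumer state, and the offsets consumed so far, which can be compared against the latest Kafka partition offsets.
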
 **@priyenpatel2014:** @priyenpatel2014 has joined the channel  
 **@lars-kristian_svenoy:** Hey guys. Regarding the Log4j vulnerability,
when can we expect a release of Pinot to mitigate it? I see you just
recently merged a PR to deal with it:  
**@mayanks:** Hey @lars-kristian_svenoy, what version of the JVM are you using,
and is your Pinot accessible from the internet?  
**@mayanks:** ```JDK versions greater than 6u211, 7u201, 8u191, and 11.0.1 are
not affected by the LDAP attack vector. In these versions
com.sun.jndi.ldap.object.trustURLCodebase is set to false meaning JNDI cannot
load a remote codebase using LDAP.```  
**@mayanks:** In the interim, you can set `formatMsgNoLookups=true` as a workaround.  
**@lars-kristian_svenoy:** I am using the JDK 11 image of Pinot; is it built
with > 11.0.1?  
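
A minimal sketch of the workaround mentioned above, assuming your deployment starts Pinot via `pinot-admin.sh` and that the startup script honors the `JAVA_OPTS` environment variable (verify against your own setup):

```
# Assumption: the startup script picks up JAVA_OPTS.
# log4j2.formatMsgNoLookups=true disables log4j message lookups, the JNDI attack vector.
export JAVA_OPTS="$JAVA_OPTS -Dlog4j2.formatMsgNoLookups=true"
bin/pinot-admin.sh StartController -zkAddress localhost:2181
```
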
 **@j.vinodpatel:** @j.vinodpatel has joined the channel  

###  _#random_

  
 **@saulo.sobreiro:** @saulo.sobreiro has joined the channel  
 **@serhiish:** @serhiish has joined the channel  
 **@utkarsh.saxena:** @utkarsh.saxena has joined the channel  
 **@priyenpatel2014:** @priyenpatel2014 has joined the channel  
 **@j.vinodpatel:** @j.vinodpatel has joined the channel  

###  _#troubleshooting_

  
 **@saulo.sobreiro:** @saulo.sobreiro has joined the channel  
 **@tanmay.movva:** Hello. I am trying out the Pinot connector in Trino and I
am facing an error on a simple select query like ```select * from
pinot.default.table limit 10``` This is the stack trace of the error. Can
anyone please help? Has anyone faced a similar issue before?
```java.lang.NullPointerException: null value in entry: Server_server-2.server-headless.pinot.svc.cluster.local_8098=null
  at com.google.common.collect.CollectPreconditions.checkEntryNotNull(CollectPreconditions.java:32)
  at com.google.common.collect.SingletonImmutableBiMap.<init>(SingletonImmutableBiMap.java:42)
  at com.google.common.collect.ImmutableBiMap.of(ImmutableBiMap.java:72)
  at com.google.common.collect.ImmutableMap.of(ImmutableMap.java:119)
  at com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:454)
  at com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:433)
  at io.trino.plugin.pinot.PinotSegmentPageSource.queryPinot(PinotSegmentPageSource.java:221)
  at io.trino.plugin.pinot.PinotSegmentPageSource.fetchPinotData(PinotSegmentPageSource.java:182)
  at io.trino.plugin.pinot.PinotSegmentPageSource.getNextPage(PinotSegmentPageSource.java:150)
  at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:311)
  at io.trino.operator.Driver.processInternal(Driver.java:387)
  at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291)
  at io.trino.operator.Driver.tryWithLock(Driver.java:683)
  at io.trino.operator.Driver.processFor(Driver.java:284)
  at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
  at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
  at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
  at io.trino.$gen.Trino_362____20211126_004329_2.run(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:829)```  
**@tanmay.movva:** Our Pinot is deployed on Kubernetes and every component has
a headless service and a `ClusterIP` service. Also, I am able to query that
table directly from the Pinot UI, but the query doesn't work on Trino.  
**@mayanks:** What version of Pinot and Trino?  
**@mayanks:** Some possibilities - a) Setup issue b) Connectivity issue c)
Version mismatch  
**@tanmay.movva:** Trino - 362, Pinot - 0.9.0. Connectivity is there between
the two services; I am able to run metadata queries such as ```show tables from
pinot.default```  
**@mayanks:** I found a similar thing discussed a while back:  
**@mayanks:** Search for `PinotSegmentPageSource` in that link  
**@mayanks:** cc @elon.azoulay  
**@elon.azoulay:** Can you try trino 365? It is compatible with pinot 0.8.0  
**@tanmay.movva:** Sure. Will upgrade and let you know.  
**@elon.azoulay:** We are working on updating to be compatible with pinot
0.9.0  
**@tanmay.movva:** So we are on 0.9.0 for Pinot. Will it work with Trino 365
now? Or would I have to downgrade to Pinot 0.8.0?  
**@elon.azoulay:** It should - it looks like the APIs from 0.8.0 to 0.9.0
that the Trino connector uses are similar  
**@elon.azoulay:** I think pinot 0.9.0 has some really great features, no need
to downgrade.  
**@tanmay.movva:** Thanks, this is working.  
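
For anyone reproducing this, a minimal sketch of a Trino catalog file for the Pinot connector; the file path, host, and port below are placeholders, so check the Trino Pinot connector docs for your version:

```
# etc/catalog/pinot.properties on each Trino node (hypothetical values)
connector.name=pinot
pinot.controller-urls=pinot-controller.pinot.svc.cluster.local:9000
```
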
 **@kangren.chia:** Hi, just checking again, does anybody know how I can bypass
the 1 million limit on rows returned by the broker?  
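
One knob often pointed at for this is the broker's response-limit config; a sketch, assuming the property name is `pinot.broker.query.response.limit` (that name is an assumption here, so verify it against the configuration reference for your Pinot version):

```
# Assumed property name - verify in the Pinot configuration reference.
# Raises the broker-side cap on rows a single query may return.
pinot.broker.query.response.limit=5000000
```
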
 **@serhiish:** @serhiish has joined the channel  
 **@alihaydar.atil:** Hello everyone, I wonder whether setting the maxLength
property of STRING data types in a schema to high values would cause extra
memory allocation or performance degradation?  
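
For context, `maxLength` is set per string column in the schema and bounds how long a stored value may be; a minimal sketch with hypothetical schema and column names:

```
# Hypothetical schema excerpt - only the maxLength field is the point here.
"dimensionFieldSpecs": [
  {"name": "longText", "dataType": "STRING", "maxLength": 65535}
]
```
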
 **@utkarsh.saxena:** @utkarsh.saxena has joined the channel  
 **@falexvr:** Good morning guys. We set up a Pinot cluster a while ago, when
0.6.0 was the latest version, and recently spawned a new cluster with version
0.8.0 to test it before using it in production. Before we start streaming data
into this new cluster, I'd like to know whether having two clusters with
low-level Kafka consumers streaming data from the same topic would be an issue.
I ask because the current cluster doesn't seem to rely on Kafka consumer groups
to keep track of the offsets; on the other hand, in our Kafka provider I see an
empty-named consumer group consuming data from the topics, and it seems that
one belongs to Pinot  
**@xiangfu0:** I assume these two clusters are separated (on a new ZK cluster,
or the same ZK but a different Helix cluster name). Then you are fine; you can
create multiple tables consuming the same Kafka topic as well. Pinot internally
tracks offsets in ZK on a per-table basis.  
**@falexvr:** Yep, different zk clusters as well  
**@falexvr:** Great! Thanks  
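
To illustrate the point above: with the low-level consumer, Pinot checkpoints offsets in ZK per table and partition rather than through a Kafka consumer group, so two clusters reading the same topic do not interfere. A sketch of the relevant `streamConfigs` excerpt from a realtime table config (topic and broker values are hypothetical):

```
# Realtime table config excerpt (hypothetical values).
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "myTopic",
  "stream.kafka.broker.list": "kafka:9092",
  "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
}
```
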
 **@priyenpatel2014:** @priyenpatel2014 has joined the channel  
 **@jeff.moszuti:** Hello, I am kicking the tyres of Pinot (v0.9.0) by doing
the following tutorial. I load 4 records from a CSV file into an offline
table named transcript and I get 4 rows returned when executing the statement
`select * from transcript limit 10`. As soon as I upload a realtime table
config and schema, only 3 rows are returned when running the same SQL
statement. I do, however, see 4 rows if I query the offline table, e.g.
`select * from transcript_OFFLINE limit 10`. What could be the reason?  
**@g.kishore:** Time boundary..  
**@g.kishore:** The latest day's data will be pulled from real-time and not offline  
**@g.kishore:** The idea is to give enough time for the batch jobs to push to
offline tables and avoid inconsistency during the push  
**@jeff.moszuti:** At the moment I haven't pushed any real-time data yet, I
just created the realtime table config and schema. Let's say no real-time data
comes in for some time. At which point will selecting from transcript return
4 rows? Are there any settings that I can change to get a better
understanding of how the time boundary works?  
**@mayanks:** The expectation from a hybrid table is that data is flowing in
both, and that there’s data overlap.  
**@mayanks:** If you are only interested in the offline component, you can
query the offline table explicitly by appending the suffix _OFFLINE to the
table name in the query  
**@jeff.moszuti:** Thanks for the replies Kishore and Mayank. I've read the
documentation on time boundaries but I'm still a bit confused about how a
hybrid table is supposed to be used. In the test data of the tutorial, the
records for offline and real-time are unique, but there are a few records
which exist on both Oct 24 and Oct 28. The hybrid table shows the records as
in the diagram below - records from the real-time table on the 24th are not
visible and the records from the offline table on the 30th are also not
visible. I understand why this has been done. Given that the records are
unique, will the hybrid table show all records at some point later in time
(reconciled?). I'd like to be able to count the number of student transcripts
to get an accurate total.  
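
One concrete way to see the time boundary at work is to compare the hybrid view against each physical table through the broker's SQL endpoint; a sketch assuming the default broker port 8099:

```
# Hybrid view (the broker applies the time boundary), then each physical table.
curl -s -X POST "http://localhost:8099/query/sql" -H 'Content-Type: application/json' -d '{"sql":"select count(*) from transcript"}'
curl -s -X POST "http://localhost:8099/query/sql" -H 'Content-Type: application/json' -d '{"sql":"select count(*) from transcript_OFFLINE"}'
curl -s -X POST "http://localhost:8099/query/sql" -H 'Content-Type: application/json' -d '{"sql":"select count(*) from transcript_REALTIME"}'
```
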
 **@j.vinodpatel:** @j.vinodpatel has joined the channel  
 **@weixiang.sun:** Currently a hybrid table combines an offline table and a
realtime table. Is it possible to make a hybrid of an offline table and an
upsert table?  
**@mayanks:** Upsert is limited to realtime only. You can have a hybrid table
with an upsert-enabled realtime table. However, upserts will not apply to offline  
**@weixiang.sun:** Thanks @mayanks! When creating the hybrid table out of the
offline table and the upsert table, should I just follow the same process?  
**@mayanks:** Yes  
**@mayanks:** The upsert table is nothing but a real-time table with upsert
enabled  
**@mayanks:** Also, you want to ensure your app is functionally OK with the
fact that there won't be any upserts applied to time ranges served from offline  
**@mayanks:** Otherwise you will have incorrect results  
**@weixiang.sun:** @mayanks Thanks!  
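
For reference, a minimal sketch of what turns a realtime table into the upsert table discussed above; the column name below is hypothetical, and the schema must declare the primary key:

```
# Realtime table config excerpt: enables full upsert mode.
"upsertConfig": {"mode": "FULL"},

# Corresponding schema excerpt: the key that upserts are applied on.
"primaryKeyColumns": ["studentID"]
```
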

###  _#pinot-dev_

  
 **@serhiish:** @serhiish has joined the channel  
 **@richard892:** @kharekartik can you merge this please?  
 **@navina:** @navina has joined the channel  

###  _#getting-started_

  
 **@navina:** @navina has joined the channel  

###  _#pinot-docsrus_

  
 **@jeff.moszuti:** @jeff.moszuti has joined the channel  