Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/04/14 02:00:26 UTC

Apache Pinot Daily Email Digest (2022-04-13)

### _#general_

  
 **@liandycg_slack:** @liandycg_slack has joined the channel  
 **@nikhil.varma:** @nikhil.varma has joined the channel  
 **@janardhan.bodu:** @janardhan.bodu has joined the channel  
 **@lars-kristian_svenoy:** Hello team :wave: Any chance we could publish
linux/arm64 images for pinot? I see we've started doing that in 0.11.0, but
0.10 and below do not support that architecture. I'm running into problems
running pinot locally on the Mac M1 due to the chipset.  
**@kharekartik:** Hi, we can do that. For current needs, you can build Pinot
from source on M1. Add this to your `~/.m2/settings.xml`:
```
<settings>
  <activeProfiles>
    <activeProfile>apple-silicon</activeProfile>
  </activeProfiles>
  <profiles>
    <profile>
      <id>apple-silicon</id>
      <properties>
        <os.detected.classifier>osx-x86_64</os.detected.classifier>
      </properties>
    </profile>
  </profiles>
</settings>
```
and then run the following from the Pinot source directory: `mvn clean package
-DskipTests -Pbin-dist`  
**@navina:** @kharekartik can we document this in the pinot website?  
**@kharekartik:** yes, will add it  
 **@francois:** Hi :slightly_smiling_face: A little question as prod gets
closer and real data arrives :smile: A few things are going wrong. I've got
two tables reading the same Kafka topic. Both of them use a complexTypeConfig
to unnest 30-day arrays, and I'm getting an infinite recursion error:
```
java.lang.RuntimeException: shaded.com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: org.apache.pinot.spi.data.readers.GenericRow["fieldToValueMap"]->java.util.Collections$UnmodifiableMap["$MULTIPLE_RECORDS_KEY$"]->java.util.ArrayList[0]->org.apache.pinot.spi.data.readers.GenericRow["fieldTo>
    at org.apache.pinot.spi.data.readers.GenericRow.toString(GenericRow.java:247) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at java.util.Formatter$FormatSpecifier.printString(Formatter.java:3031) ~[?:?]
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:2908) ~[?:?]
    at java.util.Formatter.format(Formatter.java:2673) ~[?:?]
    at java.util.Formatter.format(Formatter.java:2609) ~[?:?]
    at java.lang.String.format(String.java:2897) ~[?:?]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:543) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:420) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:598) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at java.lang.Thread.run(Thread.java:829) [?:?]
```
The complexTypeConfig is as follows:
```
"complexTypeConfig": {
  "fieldsToUnnest": [ "data.attributes.regularTimes" ],
  "delimiter": ".",
  "collectionNotUnnestedToJson": "NON_PRIMITIVE"
}
```
The other table has the same complexTypeConfig but based on another field. Any
idea?  
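For readers unfamiliar with complexTypeConfig: flattening joins nested field
names with the delimiter, and each element of a field listed in
`fieldsToUnnest` becomes its own output row. A minimal sketch of what the
config above does, with a hypothetical payload (the `day`/`hours` field names
are made up for illustration):
```
// Input event
{ "data": { "attributes": { "regularTimes": [
    { "day": "2022-04-01", "hours": 8 },
    { "day": "2022-04-02", "hours": 6 } ] } } }

// Unnested output: one row per array element
{ "data.attributes.regularTimes.day": "2022-04-01", "data.attributes.regularTimes.hours": 8 }
{ "data.attributes.regularTimes.day": "2022-04-02", "data.attributes.regularTimes.hours": 6 }
```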
**@mayanks:** @jackie.jxt  
**@francois:** I will try to increase the Xss (thread stack size) on the
server side to avoid that :confused: It's getting 39k messages with at least
30 days in each to unnest; not sure it will like that :smile:  
**@francois:** Reducing the read rate did the trick, using
`"topic.consumption.rate.limit": "2"`  
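For reference, that property lives in the table's `streamConfigs` map; a
minimal sketch (the topic name and the other keys are placeholder context):
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "my-topic",
  "topic.consumption.rate.limit": "2"
}
```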
**@mayanks:** Hmm how big is each event and how many levels deep is the
nesting  
**@francois:** Big: 30 days per event and 27 cols  
**@mayanks:** What does 30 days per event mean?  
**@francois:** An array with 30 days in a single event that I'm unnesting  
**@mayanks:** Ok. What is the event rate?  
**@francois:** For now it's quite high because there's a backlog (39k
messages) in the Kafka queue. But it's expected to be 3 to 4 per second  
**@francois:** It keeps failing even with a slow message rate :(  
**@mayanks:** Any reason for having 30 days' worth of data in one event? That
seems like an anti-pattern  
**@francois:** Yes, the 30 days are linked to a top-level object summing a few
things  
**@francois:** It's a time report.  
**@francois:** Might using more partitions help? Only two partitions now :)  
**@mayanks:** Your ingestion rate is really low, not sure if that will help  
**@mayanks:** Can the upstream not flatten the events to be 1 row? Also, do I
understand it right, one event has an array of 30 elements, and each element
is a row of 27 columns?  
**@francois:** Yes you are right  
**@mayanks:** If so, it doesn't seem terribly bad. The root cause of the
infinite recursion might be something different, and it's worth filing an issue  
**@mayanks:** If you can provide a sample payload with the issue that can help
reproduce it, we should be able to identify the root cause, and hopefully fix
it  
**@mayanks:** Can you file an issue and paste a link here?  
**@francois:** I will look for a bad message and try to reproduce the issue on
my local instance.  
**@mayanks:** Sounds good, thanks  
 **@janardhan.bodu:** Hi team @mayanks @g.kishore @xiangfu0. I'm doing a POC
to use Pinot: we have some 70 tables (in an older database), and 10 to 15 of
them are currently queried with joins to get aggregations and analytics for
the system. Our data is growing at a fast pace and I wanted to check whether
Pinot satisfies our needs. I want to use Presto with Pinot, keep our existing
join queries, and minimize new data modelling for Pinot. I found some
benchmarks here() for Presto + Pinot, but they were only for a single table
(named complexWebsite, with a billion records). Merging/modelling all columns
from our tables into a single table (to avoid joins) is very difficult since
we have many field dependencies. Do you have any reference links where I can
find similar benchmarking with multiple tables (with joins) queried from
Presto to Pinot? Can anyone help me on how to proceed? Thanks in advance.  
**@mayanks:** The number of tables shouldn’t really impact join performance.
@yupeng for any data points  

### _#random_

  
 **@liandycg_slack:** @liandycg_slack has joined the channel  
 **@nikhil.varma:** @nikhil.varma has joined the channel  
 **@janardhan.bodu:** @janardhan.bodu has joined the channel  

###  _#feat-text-search_

  
 **@francois:** @francois has joined the channel  

###  _#troubleshooting_

  
 **@liandycg_slack:** @liandycg_slack has joined the channel  
 **@nikhil.varma:** @nikhil.varma has joined the channel  
 **@saumya2700:** Hi everyone, we have realtime tables with data ingestion
happening from Kafka, but our query performance is very poor even though we
only have around 13 lakhs (~1.3M) rows in total. Query time is 17 secs; we
have 1 tenant, 1 broker, and 2 servers. Do we need to create indexes
separately, or is that done by default on columns? I saw some indexes were
created. Also, is there an option to create segments per Kafka topic key? We
usually query on timestamp and id, and our Kafka topics have id as the key.  
**@mayanks:** do you have query response metadata you can share?  
**@saumya2700:** "numServersQueried": 2, "numServersResponded": 2,
"numSegmentsQueried": 65, "numSegmentsProcessed": 18, "numSegmentsMatched":
13, "numConsumingSegmentsQueried": 10, "numDocsScanned": 10603,
"numEntriesScannedInFilter": 391632, "numEntriesScannedPostFilter": 137839,
"numGroupsLimitReached": false, "totalDocs": 1220116, "timeUsedMs": 87,
"offlineThreadCpuTimeNs": 0, "realtimeThreadCpuTimeNs": 0,
"offlineSystemActivitiesCpuTimeNs": 0, "realtimeSystemActivitiesCpuTimeNs": 0,
"offlineResponseSerializationCpuTimeNs": 0,
"realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0,
"realtimeTotalCpuTimeNs": 0, "segmentStatistics": [], "traceInfo": {},
"numRowsResultSet": 5000, "minConsumingFreshnessTimeMs": 1649763344790  
**@mayanks:** Ok, one thing I can see is that too much data is being scanned
(`numEntriesScannedInFilter: 391632`), so you probably need to have some
indexing setup  
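Indexes in Pinot are declared per column in the table config rather than
derived from query patterns; a minimal sketch of an indexing setup, assuming
`id` is the main filter column and `timestamp` the sort column (both taken
from the question above, not the actual schema):
```
"tableIndexConfig": {
  "invertedIndexColumns": ["id"],
  "sortedColumn": ["timestamp"]
}
```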
**@mayanks:** But 17s is too much. So the next questions: a) What is the
query? b) What is the CPU/mem for the servers? c) How many segments?  
**@mayanks:** Wait `"timeUsedMs": 87,` this is the time Pinot used to compute
the query  
**@mayanks:** Where are you seeing 17s?  
**@mayanks:** If client side, then my guess is that the response is big and
your JSON deser is the bottleneck  
**@saumya2700:** Yes, I do have JSON fields. Also, can you please tell me
whether there is a way to put all data related to one key in the same segment?
In other words, can we create segments per Kafka topic key?  
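For reference, the closest Pinot feature to segments-per-key is
partition-aware segment assignment plus partition pruning at query time; a
minimal sketch in the table config, assuming `id` is the Kafka partition key
and the partition count is a placeholder:
```
"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "id": { "functionName": "Murmur", "numPartitions": 4 }
    }
  }
},
"routing": {
  "segmentPrunerTypes": ["partition"]
}
```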
**@mayanks:** No I am not talking about json fields. I am saying your query
took 87ms and not 17s  
**@saumya2700:** Yes Mayank, from the Pinot query console it takes a few ms,
but from SQLAlchemy in our Python app it takes 15 secs, even when we aren't
doing anything else, just querying the data.  
**@mayanks:** That would be a sqlalchemy issue. My guess is it is spending
time in deserializing the response.  
**@mayanks:** What is the Pinot client you are using?  
**@saumya2700:** pinotdb and sqlalchemy  
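For context, a minimal sketch of querying Pinot from Python through pinotdb
(the broker host, port, and table name are placeholders):
```
from pinotdb import connect

# Connect to the Pinot broker's SQL endpoint (host/port are placeholders)
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()
# Large result sets are deserialized client-side, which can dominate wall time
curs.execute("SELECT id, ts FROM myTable LIMIT 10")
for row in curs:
    print(row)
```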
**@saumya2700:** Mayank, thank you for your support. It turns out my local
network was messed up by the VPN; I ran the same code on the server and it
returns results in ms  
 **@zliu:** Hi everyone, how do I configure Kafka SSL for the Kafka clients in
Pinot?  
**@navina:** Here is an example of how to talk to kafka with ssl - did you
have a specific question?  
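Broadly, the Kafka client security properties go into the table's
`streamConfigs` map alongside the usual stream settings; a minimal sketch,
assuming a plain SSL listener, with placeholder topic name, paths, and
passwords:
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "my-topic",
  "security.protocol": "SSL",
  "ssl.truststore.location": "/path/to/truststore.jks",
  "ssl.truststore.password": "changeit",
  "ssl.keystore.location": "/path/to/keystore.jks",
  "ssl.keystore.password": "changeit"
}
```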
**@zliu:** thanks  
 **@janardhan.bodu:** @janardhan.bodu has joined the channel  
 **@erik.bergsten:** Hi! We are trying to use tiered storage with an NFS
volume mounted on "server-b". When we trigger the rebalance and segments move
from server-a to server-b, we get a lot of errors like:
```
Caused by: java.nio.file.FileSystemException: /var/pinot/server/data/index/environment_OFFLINE/environment_OFFLINE_1618208070664_1649743939567_7/v3/.nfs000000000134004000000058: Device or resource busy
```
in the logs from server-b. Could this be a problem with how the server is
implemented, or is it strictly an NFS problem on our end? The end result is
that some or all segments go into an error state and the data goes missing
during a rebalance.  
**@dlavoie:** Seems like a Linux mounting / NFS issue. Pinot could be
responsible for overloading the NFS service if it's not scaled as it needs to
be.  
**@mayanks:** Do server-a and server-b share the same NFS? And if so what’s
the dataDir specified in this server? Wondering if both are trying to
overwrite each other  
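For reference, the data directory in question is set per server instance in
the server config; if both servers point at the same path on a shared NFS
mount, they can step on each other's segment directories. A minimal sketch
(the paths are placeholders):
```
# server-a (pinot-server.conf)
pinot.server.instance.dataDir=/nfs/pinot/server-a/index

# server-b (pinot-server.conf)
pinot.server.instance.dataDir=/nfs/pinot/server-b/index
```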

###  _#pinot-dev_

  
 **@dadelcas:** Hey there, I've raised this issue, which I've already started
looking into. Once I have some code to show, I'll get back to you on this
channel  
**@dadelcas:** @mayanks  
**@mayanks:** Thanks @dadelcas  

### _#presto-pinot-connector_

  
 **@liandycg_slack:** @liandycg_slack has joined the channel  

###  _#pinot-perf-tuning_

  
 **@francois:** @francois has joined the channel  

###  _#getting-started_

  
 **@liandycg_slack:** @liandycg_slack has joined the channel  
 **@nikhil.varma:** @nikhil.varma has joined the channel  
 **@fizza.abid:** Hello, can anyone tell me how we can connect Spark Streaming
to Apache Pinot?  
**@mayanks:** The Spark connector to Pinot is not production-ready (it needs a
volunteer to take it to completion). May I ask what's the use case?  
 **@janardhan.bodu:** @janardhan.bodu has joined the channel  

###  _#releases_

  
 **@francois:** @francois has joined the channel  

###  _#complex-type-support_

  
 **@francois:** @francois has joined the channel  

###  _#pinot-docsrus_

  
 **@francois:** @francois has joined the channel  

###  _#pinot-trino_

  
 **@francois:** @francois has joined the channel  