Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/05/03 02:00:31 UTC

Apache Pinot Daily Email Digest (2022-05-02)

### _#general_

  
 **@laila.sabar098:** @laila.sabar098 has joined the channel  
 **@francois:** Hi. Is there any way from the REST API to retrieve monitoring
information, like the number of messages read by the consumer / the number of
messages indexed? The goal of my question is to monitor ingestion and ensure we
are not missing messages. I've found messages like that in pinot-all.log, but I
want them from the API if possible. Any recommended way?  
**@kharekartik:** @navi.trinity  
**@npawar:** there's no API as of now, but you could just monitor the metrics
emitted: ```REALTIME_ROWS_CONSUMED("rows", true),
INVALID_REALTIME_ROWS_DROPPED("rows", false),
REALTIME_CONSUMPTION_EXCEPTIONS("exceptions", true),
REALTIME_OFFSET_COMMITS("commits", true),
REALTIME_OFFSET_COMMIT_EXCEPTIONS("exceptions", false),
REALTIME_PARTITION_MISMATCH("mismatch", false),
ROWS_WITH_ERRORS("rows", false)``` If you want to see the same metrics that
you're seeing in the logs (which are just for the scope of the consuming
segment, not overall), that should be easy to add to the existing
/consumingSegmentsInfo API. Do you mind filing a GH issue?  
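
For reference, a minimal sketch of polling the `/tables/{tableName}/consumingSegmentsInfo`
controller endpoint mentioned above with plain `java.net.http`; the controller address and
table name are placeholders, and the exact response shape should be checked against your
Pinot version.
```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConsumingSegmentsInfoCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder controller address and table name; adjust for your cluster.
    String controller = "http://localhost:9000";
    String table = "myTable";

    HttpRequest request = HttpRequest.newBuilder(
            URI.create(controller + "/tables/" + table + "/consumingSegmentsInfo"))
        .GET()
        .build();

    // The JSON response describes each consuming segment (including consumer
    // offsets), which can be compared against the upstream Kafka offsets to
    // spot ingestion lag or missing messages.
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```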
 **@ghanta.vardhan:** Hey guys, I am trying to establish a JDBC connection to
execute queries on a Pinot cluster. The cluster is deployed in a production
environment, and I am connecting from local (port-forwarded Pinot controller)
to test the JDBC feature. I think that while executing the query, the
controller is resolving the broker by its name rather than its IP, and hence I
am getting an UnknownHostException. ```Caused by:
org.apache.pinot.client.PinotClientException:
java.util.concurrent.ExecutionException:
java.util.concurrent.ExecutionException: java.net.UnknownHostException:
pinot-broker-0.pinot-broker-headless.xxxxx-v2.svc.cluster.local: nodename nor
servname provided, or not known
    at org.apache.pinot.client.JsonAsyncHttpPinotClientTransport.executeQuery(JsonAsyncHttpPinotClientTransport.java:104)
    at org.apache.pinot.client.Connection.execute(Connection.java:127)
    at org.apache.pinot.client.Connection.execute(Connection.java:96)
    at org.apache.pinot.client.PinotStatement.executeQuery(PinotStatement.java:63)
    ... 1 more``` Is there a way I can avoid this error? The same might happen
when I move to production (the application is in a different k8s cluster). TIA  
**@kharekartik:** Hi, currently the broker hostname needs to be resolvable from
the machine on which the client is running  
**@mayanks:** Yes. Also, please query the broker directly in production. The
controller endpoint is only for the query console, and it calls the broker API
internally anyway  
**@kharekartik:** The question here is regarding the JDBC driver. It fetches
the broker list for the provided tenant from the controller. The queries are
sent to the brokers only. However, it can cause issues if the broker
hostname:port is not resolvable from the client machine. @xiangfu0 Is there a
solution for such cases?  
**@mayanks:** The brokers should be behind an LB, and the driver could just be
given that instead of fetching the list from the controller, right? I think it
was implemented this way due to the absence of an LB.  
**@xiangfu0:** So far there is no such option; one workaround is to init the
JDBC connection with just the broker LB name, so no hostname resolution is
required.  
**@mayanks:** The pinot-java-client does have the broker list config. If it is
missing from the JDBC client, we should add it.  
**@xiangfu0:** right  
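
As a concrete sketch of the broker-list option mentioned above: the
pinot-java-client can be pointed at a broker load balancer (or an explicit
broker host:port list) so that no per-broker hostname resolution through the
controller is needed; the broker address, table, and query are placeholders.
```
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class BrokerListQueryExample {
  public static void main(String[] args) {
    // Placeholder broker LB address (or an explicit list of broker host:port
    // pairs) that is resolvable from the client machine, so the client never
    // has to resolve the per-broker Kubernetes hostnames.
    Connection connection = ConnectionFactory.fromHostList("pinot-broker-lb.example.com:8099");

    ResultSetGroup result = connection.execute("SELECT COUNT(*) FROM myTable");
    System.out.println("count = " + result.getResultSet(0).getLong(0));

    connection.close();
  }
}
```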
 **@jinal.panchal:** @jinal.panchal has joined the channel  
 **@aswini.nellimarla:** Hi, can Apache Pinot directly talk to datastores like
Cassandra or Cosmos DB (NoSQL stores)?  
**@francois:** What do you mean by “talk”? Joining? Ingesting?  
**@mayanks:** I think you mean pull data from these data stores directly? If
so, not at the moment.  
**@aswini.nellimarla:** @francois Yes, can Pinot pull and ingest data from/to
NoSQL DBs like Cassandra?  
**@aswini.nellimarla:** @mayanks Understood, thanks for the reply. If we still
want to connect to these data stores, we can do so via the Trino integration,
am I right?  
**@mayanks:** Likely not. The Trino Pinot connector queries data that is
already in Pinot, via Pinot+Trino.  
**@aswini.nellimarla:** Excellent. Thanks for the confirmation Mayank :)  
 **@jinal.panchal:** Hello, I've started exploring Pinot. Is there any way to
define primary key and foreign key relationships so that we can maintain
mappings? Otherwise, how will it support joins without maintaining
relationships?  
**@mayanks:** Pinot only supports lookup join today  
**@jinal.panchal:** So there is no way to maintain relationships, right? We
have a use case with a student table and a subject table that have a foreign
key relationship on subjectID. Is there any way it supports Hibernate-ORM-like
functionality to update/modify the child (referenced) table based on
modifications to the parent (referencing) table?  
**@mayanks:** Not at the moment. You need to denormalize the tables upfront,
or can use presto/trino for joins.  
**@jinal.panchal:** Okay, so Pinot is not built for applications where we need
relations or a relational use case?  
**@mayanks:** It is not a relational database, it is an OLAP datastore  
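
For context, a rough sketch of the lookup join mentioned above, assuming the
subject table is configured as a Pinot dimension (lookup) table and using the
`lookUp` scalar function; the table, column, and broker names are hypothetical.
```
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class LookupJoinSketch {
  public static void main(String[] args) {
    // Hypothetical fact table "student" and dimension table "subject"; the
    // dimension table must be small enough to be replicated on every server.
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");

    // lookUp(dimTable, dimColumnToFetch, dimJoinKey, factJoinKeyValue) decorates
    // each student row with its subject name; note there is no foreign-key
    // enforcement, only a lookup at query time.
    ResultSetGroup result = connection.execute(
        "SELECT studentID, lookUp('subject', 'subjectName', 'subjectID', subjectID) "
            + "AS subjectName FROM student LIMIT 10");
    System.out.println(result.getResultSet(0).getRowCount() + " rows");
  }
}
```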
**@erik.bergsten:** We started using the "latest" tagged docker image so we
can use timestamp indexes, but in this version Kafka SASL/PLAIN authentication
doesn't work (class not found). Is it broken, or will we just have to wait for
an official release to get timestamp indexes and full Kafka support in one
image?  
**@mayanks:** Is there a GH issue for the Kafka problem you are seeing?  
**@erik.bergsten:** No, and it isn't an issue in 0.10.0. It just looks like the
Kafka plain SASL login module isn't packaged in the latest docker image  
**@mayanks:** @xiangfu0 ^^  
**@xiangfu0:** Can you try to use the shaded path:
```shaded.org.apache.kafka.xxxx```  
**@erik.bergsten:** @xiangfu0 it works! Will this be the standard path in 0.11
(and later)?  
**@xiangfu0:** Thanks for pointing this out. In short, we tried to package
multiple Kafka consumer libs (Kafka 0.9, 2.0, 3.0, etc.) together, so we need
to shade and relocate them separately. Let me rethink this problem and see if
we can make this experience seamless  
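
To make the shaded-path workaround above concrete, a hedged sketch of the Kafka
SASL/PLAIN settings with the login module class relocated under the `shaded.`
prefix; the property keys, security protocol, and credentials are illustrative
and should follow whatever your stream config already uses.
```
import java.util.Properties;

public class ShadedSaslPropsSketch {
  public static void main(String[] args) {
    // Illustrative only: the same SASL/PLAIN settings that work on the 0.10.0
    // image, but with the login module class name prefixed with "shaded." to
    // match the relocated Kafka classes in the "latest" image.
    Properties kafkaConsumerProps = new Properties();
    kafkaConsumerProps.put("security.protocol", "SASL_PLAINTEXT");
    kafkaConsumerProps.put("sasl.mechanism", "PLAIN");
    kafkaConsumerProps.put("sasl.jaas.config",
        "shaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"myUser\" password=\"myPassword\";");
    kafkaConsumerProps.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```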
 **@ysuo:** Hi team, I noticed the Timestamp Index is supported and tried to
use it. *But there is this error:* ```{"code":400,"error":"Cannot deserialize
value of type `org.apache.pinot.spi.config.table.FieldConfig$IndexType` from
String "TIMESTAMP": not one of the values accepted for Enum class: [INVERTED,
FST, JSON, H3, TEXT, SORTED, RANGE]
 at [Source: (String)"{"tableName":"test_time_index","tableType":"REALTIME","segmentsConfig":{"schemaName":"test_time_index","timeColumnName":"created_on","timeType":"MILLISECONDS","allowNullTimeValue":true,"replicasPerPartition":"1","retentionTimeUnit":"DAYS","retentionTimeValue":"30","segmentPushType":"APPEND","completionConfig":{"completionMode":"DOWNLOAD"}},"tenants":{},"fieldConfigList":[{"name":"timestamp","encodingType":"DICTIONARY","indexTypes":["TIMESTAMP"],"time"[truncated 3199 chars]; line: 1, column: 483]
 (through reference chain: org.apache.pinot.spi.config.table.TableConfig["fieldConfigList"]->java.util.ArrayList[0]->org.apache.pinot.spi.config.table.FieldConfig["indexTypes"]->java.util.ArrayList[0])"}```
*Part of my table schema is:* ```"dateTimeFieldSpecs": [
  {
    "name": "timestamp",
    "dataType": "TIMESTAMP",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }
]``` *And part of my table config is:* ```"fieldConfigList": [
  {
    "name": "timestamp",
    "encodingType": "DICTIONARY",
    "indexTypes": ["TIMESTAMP"],
    "timestampConfig": {
      "granularities": ["DAY", "WEEK", "MONTH"]
    }
  }
]``` Any idea how to fix it?  
**@mayanks:** @jackie.jxt  
**@mayanks:** What version of Pinot?  
**@ysuo:** I’m using Pinot 0.10.0 and I’m referring to this doc.  
**@jackie.jxt:** This feature is not released yet. We should add a note in the
documentation saying it will be available in the next release  
**@ysuo:** I see. Thanks.  
 **@mailtorahuljain:** @mailtorahuljain has joined the channel  
 **@pedro.j.santos:** @pedro.j.santos has joined the channel  
 **@ricardoruas88:** @ricardoruas88 has joined the channel  
 **@padma:** Hi all, I am working on improving the query latency for my
realtime time series table. There is no corresponding offline table; all the
data is realtime. It has about 61 billion records with 3.5 million unique ids
and a size of 2.7 TB. I have a range index on the timestamp and an inverted
index on the unique id. The incoming streaming data from Kafka is partitioned.
The segment assignment strategy is the default balanced segment assignment.
Stats say 2 servers queried, 34 segments queried, 34 segments processed, and
34 segments matched. I am getting a query response time of ~2 seconds and
sometimes 4 seconds, and repeated querying gives me 50 ms. Would the following
changes improve query performance?
1. Changing the segment assignment strategy to Partitioned Replica-Group
Segment Assignment
2. Bloom filter (does it improve performance for individual queries or only
aggregate queries?)
3. I am assuming the star-tree index helps with aggregations and not
individual records
4. We have partitioning set to murmur in the table config
5. How can I allocate/increase the hot/warm memory?
6. Tenants are set to DefaultTenant for both server and broker. Would changing
this help? If so, what should be changed?
7. Would enabling a default star-tree and dynamic star-tree creation help?
8. Would disabling null handling affect performance? It's currently set to
true, but I don't expect null values for the indexed id and timestamp fields
9. Should I set autoGeneratedInvertedIndex and
createInvertedIndexDuringSegmentGeneration to true? They are false currently  
**@mayanks:** A few questions: • What's the read QPS? • Broker/server VM
CPU/mem? • What are the JVM configurations?  
**@padma:** It's not much currently. Even with 1 query, we are getting this
low performance  
**@padma:** It's not being actively used; just testing performance against the
table on the query console  
**@padma:** Server memory is 32 GB with 8 CPUs; we have 42 servers  
**@padma:** Same configuration for the brokers, and we have 3 brokers  
**@padma:** Server JVM usage is around 9 GB on average across the servers  
**@padma:** Server CPU is about 20%  
**@mayanks:** Are the local disks attached to the servers SSDs?  
**@padma:** This is all set up on AWS  
**@mayanks:** Is the EBS volume SSD?  
**@mayanks:** Also, can you share the broker response metadata and the log
when the query takes 4s?  
**@padma:** Is there a way to check if the EBS is SSD?  
**@padma:** Broker latency is 1 second  
**@padma:** Let me share the log  
**@padma:** Do you need the broker log?  
**@mayanks:** Just the log line for the query request  
**@mayanks:** And also the response metadata returned by the broker  
**@padma:** It could be any of the broker instances, right?  
**@padma:** Should I look at each of the broker logs?  
**@padma:** ```[BaseBrokerRequestHandler] [jersey-server-managed-async-executor-204831]
requestId=17175209,table=xxx_REALTIME,timeMs=4407,docs=84901/508174840,entries=0/1358416,
segments(queried/processed/matched/consuming/unavailable):36/36/36/1/0,
consumingFreshnessTimeMs=1651536519201,servers=2/2,groupLimitReached=false,
brokerReduceTimeMs=20,exceptions=0,
serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs,RequestSentDelayMs);
pinot-server-34_R=0,4383,3116753,1,-1;
pinot-server-35_R=0,557,2997678,1,-1,
offlineThreadCpuTimeNs=0,realtimeThreadCpuTimeNs=0```  
**@padma:** Also, numEntriesScannedInFilter is 0 - what does it mean?  
**@padma:** And numEntriesScannedPostFilter is 1358416 while numDocsScanned is
84901  
**@padma:** That seems pretty high  
 **@padma:** Anything else you can suggest other than increasing the
resources?  
 **@brandon308:** @brandon308 has joined the channel  

###  _#random_

  
 **@laila.sabar098:** @laila.sabar098 has joined the channel  
 **@jinal.panchal:** @jinal.panchal has joined the channel  
 **@mailtorahuljain:** @mailtorahuljain has joined the channel  
 **@pedro.j.santos:** @pedro.j.santos has joined the channel  
 **@ricardoruas88:** @ricardoruas88 has joined the channel  
 **@brandon308:** @brandon308 has joined the channel  

###  _#troubleshooting_

  
 **@laila.sabar098:** @laila.sabar098 has joined the channel  
 **@jinal.panchal:** @jinal.panchal has joined the channel  
 **@jinal.panchal:** Hello, I've started exploring Pinot. Is there any way to
define primary key and foreign key relationships so that we can maintain
mappings?  
 **@diogo.baeder:** Hi folks, let me ask for your opinion on modeling tables
in Pinot. Suppose (just a fake case for simple illustration) that you had a
data source with users who have different objects at home, where the types and
names of these objects are dynamic, and you wanted to store them in such a way
that you could query them by object counts, like finding users that have 2
cars and 2 TVs. Considering that you don't know beforehand what objects would
be coming in, how would you model this? A JSON field for the objects, to keep
them in a single row that represents the individual user? Spreading the
objects across different rows and then aggregating and filtering on the
application side? How would you guys model this?  
**@g.kishore:** Model it as a JSON-type column, as long as the objects a
single user holds don't run into the hundreds of thousands  
**@g.kishore:** This will be the fastest and most efficient  
**@diogo.baeder:** Hmm... In some cases I might have about 100 items or so,
but I hope this doesn't turn out to be a problem  
**@mayanks:** Should be fine  
**@g.kishore:** One thing missing in Pinot right now is the ability to
configure indexes for each field within a JSON column  
**@g.kishore:** We only do an inverted index right now by default  
**@g.kishore:** It would be great if you could file an issue for this  
**@diogo.baeder:** I can, yes. Will do ASAP. Thanks again!
:slightly_smiling_face:  
**@mayanks:** Also, is there structure to it, or do you just want to do a text
match on a bunch of strings?  
**@diogo.baeder:** Something like, imagine a user has:
• TVs: 2
• Cars: 1
And another user has:
• TVs: 1
• Dogs: 4
So each user has a certain "thing" and then a certain amount of that thing.
Just one level, no complex structure really. But the problem is that users
would have different things, hence me not being able to define them as
columns.  
**@g.kishore:** yeah json is right  
**@diogo.baeder:** Initially I structured this as each "thing" being a
separate row, but it turned out to have an obvious problem: it would be
impossible to filter users that have "2 TVs and 1 car", for example.  
**@ysuo:** Maybe you can transform one JSON document into multiple records,
with TVs/Dogs/.. stored in a field named type and 2/1/.. stored in a field
named amount.  
**@g.kishore:** Yes, that's another commonly used idea. The only drawback is
if you want to get the count of users who have TVs and dogs: that will require
a distinctCount, versus a plain count with JSON.  
**@diogo.baeder:** Thanks for the hint, Alice! But I was doing that already,
and it didn't solve my problem because I ended up not being able to correlate
different rows in the same query (e.g. "users that have 2 TVs, and either 2
dogs or 2 cars")  
**@g.kishore:** How big is the dataset?  
**@diogo.baeder:** I don't know yet how big the dataset will be, in total it
will probably be in the order of a few terabytes.  
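
As a sketch of the JSON-column approach discussed above: with a JSON index on
the objects column, a filter like "2 TVs and 1 car" stays within a single row
per user and can be expressed with Pinot's JSON_MATCH predicate (the exact
filter syntax is worth double-checking against the docs); the table, column,
and broker address are hypothetical.
```
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class JsonObjectsFilterSketch {
  public static void main(String[] args) {
    // Hypothetical "users" table with a JSON column "objects" holding one
    // document per user, e.g. {"TVs": 2, "Cars": 1}, plus a JSON index on it.
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");

    // Each JSON_MATCH predicate filters on one key inside the document, so
    // conditions on several object types can be combined with AND/OR without
    // correlating multiple rows.
    ResultSetGroup result = connection.execute(
        "SELECT userId FROM users "
            + "WHERE JSON_MATCH(objects, '\"$.TVs\" = 2') "
            + "AND JSON_MATCH(objects, '\"$.Cars\" = 1')");
    System.out.println(result.getResultSet(0).getRowCount() + " matching users");
  }
}
```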
 **@mailtorahuljain:** @mailtorahuljain has joined the channel  
 **@pedro.j.santos:** @pedro.j.santos has joined the channel  
 **@ricardoruas88:** @ricardoruas88 has joined the channel  
 **@brandon308:** @brandon308 has joined the channel  

###  _#custom-aggregators_

  
 **@himanshu.rathore:** @himanshu.rathore has joined the channel  

###  _#query-latency_

  
 **@himanshu.rathore:** @himanshu.rathore has joined the channel  

###  _#pinot-dev_

  
 **@atri.sharma:** Is there a way to set the number of segments required when
creating a test data set for an integration test?  
**@amrish.k.lal:** Not sure if this is exactly what you are looking for, but
in one of my unit test cases I created a table over two segments in the
following way: ```@BeforeClass
public void setUp() throws Exception {
  FileUtils.deleteDirectory(INDEX_DIR);

  List<GenericRow> records1 = new ArrayList<>(NUM_RECORDS);
  records1.add(createRecord(120, 200.50F, "albert1", "albert", 1643666769000L));
  records1.add(createRecord(250, 32.50F, "martian1", "mouse", 1643666728000L));
  records1.add(createRecord(310, -44.50F, "martian2", "mouse", 1643666432000L));
  records1.add(createRecord(340, 11.50F, "donald1", "duck", 1643666726000L));
  records1.add(createRecord(110, 16, "goofy1", "goofy", 1643667762000L));
  records1.add(createRecord(150, 12, "goofy2", "goofy", 1643667762000L));
  records1.add(createRecord(100, -28, "daffy1", "daffy", 1643667092000L));
  records1.add(createRecord(120, -16, "pluto1", "dwag", 1643666712000L));
  records1.add(createRecord(120, -16, "zebra1", "zookeeper", 1643666712000L));
  records1.add(createRecord(220, -16, "zebra2", "zookeeper", 1643666712000L));
  createSegment(records1, SEGMENT_NAME_LEFT);
  ImmutableSegment immutableSegment1 =
      ImmutableSegmentLoader.load(new File(INDEX_DIR, SEGMENT_NAME_LEFT), ReadMode.mmap);

  List<GenericRow> records2 = new ArrayList<>(NUM_RECORDS);
  records2.add(createRecord(150, 10.50F, "alice1", "wonderland", 1650069985000L));
  records2.add(createRecord(200, 1.50F, "albert2", "albert", 1650050085000L));
  records2.add(createRecord(32, 10.0F, "mickey1", "mouse", 1650040085000L));
  records2.add(createRecord(-40, 250F, "minney2", "mouse", 1650043085000L));
  records2.add(createRecord(10, 4.50F, "donald2", "duck", 1650011085000L));
  records2.add(createRecord(5, 7.50F, "goofy3", "duck", 1650010085000L));
  records2.add(createRecord(5, 4.50F, "daffy2", "duck", 1650045085000L));
  records2.add(createRecord(10, 46.0F, "daffy3", "duck", 1650032085000L));
  records2.add(createRecord(20, 20.5F, "goofy4", "goofy", 1650011085000L));
  records2.add(createRecord(-20, 2.5F, "pluto2", "dwag", 1650052285000L));
  createSegment(records2, SEGMENT_NAME_RIGHT);
  ImmutableSegment immutableSegment2 =
      ImmutableSegmentLoader.load(new File(INDEX_DIR, SEGMENT_NAME_RIGHT), ReadMode.mmap);

  _indexSegment = null;
  _indexSegments = Arrays.asList(immutableSegment1, immutableSegment2);
}```  
 **@atri.sharma:** Rather than manually merging Avro files together?  
 **@dadelcas:** Hey there, can I get someone to review this PR?  I'll need it
to finish implementing timestamp and JSON support in the Trino connector. I've
left a comment regarding timestamps and time zones; I'll raise a separate
issue for that if there isn't one yet  
**@mayanks:** Thanks for your contribution, will review. cc: @jackie.jxt  

###  _#pinot-perf-tuning_

  
 **@himanshu.rathore:** @himanshu.rathore has joined the channel  

###  _#getting-started_

  
 **@laila.sabar098:** @laila.sabar098 has joined the channel  
 **@jinal.panchal:** @jinal.panchal has joined the channel  
 **@mailtorahuljain:** @mailtorahuljain has joined the channel  
 **@pedro.j.santos:** @pedro.j.santos has joined the channel  
 **@ricardoruas88:** @ricardoruas88 has joined the channel  
 **@brandon308:** @brandon308 has joined the channel  
 **@brandon308:** Hello, I'm just getting started and wondering if there is
any documentation on how to use the Pulsar plugin for stream ingestion?  
**@mayanks:** Seems like we need to add docs @kharekartik  
**@mayanks:** In the meanwhile  

### _#introductions_

  
 **@laila.sabar098:** @laila.sabar098 has joined the channel  
 **@jinal.panchal:** @jinal.panchal has joined the channel  
 **@mailtorahuljain:** @mailtorahuljain has joined the channel  
 **@pedro.j.santos:** @pedro.j.santos has joined the channel  
 **@ricardoruas88:** @ricardoruas88 has joined the channel  
 **@brandon308:** @brandon308 has joined the channel  