Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/09/02 02:00:23 UTC

Apache Pinot Daily Email Digest (2021-09-01)

### _#general_

  
 **@atri.sharma:** Are there examples of a Pinot client running multiple
concurrent queries?  
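A minimal sketch of one way to run concurrent queries with the Pinot Java client, assuming a broker reachable at `localhost:8099`; the table name, query, and pool sizes are placeholders:
```
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class ConcurrentQueries {
  public static void main(String[] args) throws Exception {
    // Broker address, table name and query are placeholders.
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");
    ExecutorService pool = Executors.newFixedThreadPool(8);

    List<Future<ResultSetGroup>> futures = new ArrayList<>();
    for (int i = 0; i < 32; i++) {
      // The same Connection is shared by all worker threads.
      futures.add(pool.submit(() -> connection.execute("SELECT COUNT(*) FROM myTable")));
    }

    for (Future<ResultSetGroup> f : futures) {
      // First result set, row 0, column 0 holds the count.
      System.out.println(f.get().getResultSet(0).getString(0, 0));
    }

    pool.shutdown();
    connection.close();
  }
}
```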
 **@david.cyze:** @david.cyze has joined the channel  
 **@simone.franzini:** @simone.franzini has joined the channel  
 **@qianbo.wang:** Hi Pinot experts, I'm new to this analytics realm with
Pinot and I have a general question: does Pinot support something like a "view",
as is common in OLTP? What I'm looking for is a way to optimize frequently
used queries that aggregate over data entries, e.g., the sum of total sales for
the past 30, 60, or 90 days, aggregated on a designated time column. Another
option I'm thinking of is to create a separate table for this aggregation,
derived from the fact table, and use a scheduled job to update it. Any ideas?
Thanks in advance!  
**@ken:** The standard Pinot approach would be to define a star tree index
with the time column as the dimension and the sales column as the aggregate.
That should get you very fast results for pretty much any date range.  
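A minimal sketch of what such a star-tree index could look like inside the table config, assuming a time column named `daysSinceEpoch` and a metric named `salesAmount` (both placeholders):
```
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["daysSinceEpoch"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["SUM__salesAmount"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```
An aggregation over the time column can then be served largely from pre-aggregated star-tree nodes rather than by scanning raw rows.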
**@qianbo.wang:** That is interesting. I will take a look at that. Thanks!  
**@mayanks:** @qianbo.wang I'd first suggest evaluating the out-of-the-box
performance for your queries. Only if performance needs to be improved further
should you explore StarTree for partial pre-materialization, as @ken
suggested.  
**@qianbo.wang:** Thanks. We will benchmark and see whether the StarTree index helps  

###  _#random_

  
 **@david.cyze:** @david.cyze has joined the channel  
 **@simone.franzini:** @simone.franzini has joined the channel  

###  _#troubleshooting_

  
 **@gonzalo:** Hi, I am trying to run the latest version of Pinot with Docker
(Mac) and the container suddenly stops. I don't see any errors in the log nor
are there any other containers running at that time.
```docker run \
  --network=pinot-demo \
  --name pinot-quickstart \
  -p 9000:9000 \
  apachepinot/pinot:latest QuickStart \
  -type batch```
Does anyone have any idea what might be going on? Please find attached logs  
**@david.cyze:** Not sure (and I'm a very novice Pinot user myself). The logs
stop after attempting to start the Swagger server. Maybe Swagger is trying to
start on a port that is unavailable, and the exception handling just crashes
with no further logs  
**@gonzalo:** thanks @david.cyze, but I think I got it. It was a memory issue.
Increasing the memory solved it  
**@david.cyze:** Sure thing :slightly_smiling_face: that was my second guess
if you believe it :stuck_out_tongue:  
**@gonzalo:** haha, I do  
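For anyone hitting the same symptom, the fix was giving Docker more memory (on Docker Desktop for Mac that is Preferences > Resources). As a further hedged sketch, the quickstart JVM heap can be capped so it fits the allocation, assuming the image honors the `JAVA_OPTS` environment variable (values are illustrative):
```docker run \
  --network=pinot-demo \
  --name pinot-quickstart \
  -p 9000:9000 \
  -e JAVA_OPTS="-Xms1G -Xmx4G" \
  apachepinot/pinot:latest QuickStart \
  -type batch```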
 **@david.cyze:** @david.cyze has joined the channel  
 **@david.cyze:** I'm tasked with doing a Pinot POC for my organization, as
we're considering switching to it as our primary data store for reporting
data. I followed the guide and was able to create a realtime table ingesting
streaming GitHub events. I'm now trying to set up my own realtime table
ingesting dummy data with a JSON column and UPSERTs enabled (this will be
required for our use case). I have successfully uploaded both a table config
and a schema to the Pinot controller, and I also created a little app to push
dummy data into a Kafka topic. *I confirmed that the data is successfully
being added to the topic; however, my table is not ingesting any records.* Can
someone help me troubleshoot why that may be happening? I will post the table
config and schema in this message's thread  
**@david.cyze:**  
**@mayanks:** Any errors in the controller or server logs?  
**@david.cyze:**  
**@david.cyze:** Just a moment @mayanks, I will give it a look. (Thought I was
already tailing them, but it turned out I was looking at the kafka server)  
**@mayanks:** Also, what release of Pinot are you using? You can try the debug
table API in Swagger with the latest 0.8.0  
**@david.cyze:** I am on 0.8.0. I was unaware of that API. I'll give that a
look too  
**@david.cyze:** ```org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
java.lang.RuntimeException: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata```  
**@david.cyze:** It appears my payloads to the kafka topic are malformed as
well. I will debug that and report back  
**@david.cyze:** ```2021/08/31 21:54:12.977 ERROR [JSONMessageDecoder]
[simplejson__0__1__20210831T2011Z] Caught exception while decoding row,
discarding row. Payload is
{"uid":"ad23a2ea-1fac-4a57-8d47-597d3b77a52a","attr_json": {"A": "{"type":
"numTickets", "val": 83}","B": "{"type": "numTickets", "val": 51}","C":
"{"type": "numTickets", "val": 61}"},"createdDateInEpoch":1570000000247}
shaded.com.fasterxml.jackson.core.JsonParseException: Unexpected character
('t' (code 116)): was expecting comma to separate Object entries at [Source:
(ByteArrayInputStream); line: 1, column: 70]```  
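For context, the decoder fails here because the nested JSON strings under `attr_json` contain unescaped quotes. A sketch of what a well-formed payload could look like if the nested values are sent as real JSON objects instead of strings (field names taken from the log above):
```
{
  "uid": "ad23a2ea-1fac-4a57-8d47-597d3b77a52a",
  "attr_json": {
    "A": {"type": "numTickets", "val": 83},
    "B": {"type": "numTickets", "val": 51},
    "C": {"type": "numTickets", "val": 61}
  },
  "createdDateInEpoch": 1570000000247
}
```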
**@npawar:** You're missing ingestion config in your table config  
**@npawar:** You need to set a transform function on attr_json  
**@npawar:** `"columnName":"attr_json_str",
"transformFunction":"jsonFormat(attr_json)"` and change the column name in
schema to attr_json_str  
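A minimal sketch of the `ingestionConfig` block being suggested, assuming the schema column is renamed to `attr_json_str` (the rest of the table config is omitted):
```
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "attr_json_str",
        "transformFunction": "jsonFormat(attr_json)"
      }
    ]
  }
}
```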
**@david.cyze:** Thank you both. After fixing my data seeding app and adding
an `ingestionConfig`, I'm now able to ingest data into the table with a JSON
column. I'm seeing some behavior I don't quite understand, however. Prior to
adding the `ingestionConfig`, I ingested some rows where `attr_json` was null.
After adding the config, I saw new rows where `attr_json` was populated. In my
schema, I have defined `uid` as the primary key column. I am seeding 1,000
rows at a time, so I would expect to see `(number of runs prior to
ingestionConfig * 1,000) + (n runs after config * 1,000)` rows. However, after
adding the `ingestionConfig` and seeding 1,000 more rows, my table now has
1,002 rows. My understanding of upserts is that only records sharing a primary
key are overwritten. *This being the case, how is it that so many of my rows
were overwritten / deleted? It is of course exceedingly unlikely that I managed
to generate 998 of the same UIDs during my second round of ingestion.* I'm
aware that Pinot does not support deletes; I'm using "Delete" here because I'm
not sure how else to explain my n(docs) going from 2,000 (prior to fixing the
ingestion config) to 1,002  
**@npawar:** @jackie.jxt @yupeng  
**@jackie.jxt:** @david.cyze Pinot overwrites records based on the primary key
only, and the record with the newer timestamp is preserved  
**@jackie.jxt:** So the expected behavior should be one record for each
different `uid`  
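For reference, a hedged sketch of the upsert wiring being discussed: the schema declares `"primaryKeyColumns": ["uid"]`, and the realtime table config enables upsert mode roughly like this (table name taken from the segment name in the log above; everything else omitted):
```
{
  "tableName": "simplejson",
  "tableType": "REALTIME",
  "upsertConfig": {
    "mode": "FULL"
  }
}
```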
**@david.cyze:** So there is no explanation for why so many records
disappeared? I had run two iterations of my faulty ingestion application
(i.e., before adding the config, thus generating null `attr_json` values).
There were 2,000 records before I ran ingestion with the fixed application.
That means the minimum number of records that should have been present is
2,000, even under the exceedingly unlikely possibility that every randomly
generated UID was a duplicate of a previously generated UID  
**@david.cyze:** Note too that if there were an error with the UID-generating
logic in my application (doubtful; I used Java's `UUID.randomUUID()`) such
that each run of my app produced identical `uid` values, the total number of
records should never have exceeded 1,000  
**@david.cyze:** When adding a `transformConfig`, does Pinot re-process all
records with the updated config? This could explain the record loss:
• 2k records exist where the JSON is malformed
• update the `transformConfig`
• Pinot re-processes these records; they fail the `transformFunction`; Pinot
writes a new segment with them excluded
• 0 records now
• ingest records with the fixed application
• 1k well-formed records are ingested (actually 1,001, as I have an off-by-one
"error" in my app and actually generate 1,001 records each run. This doesn't
explain why I saw 1,00*2* records, however)  
**@jackie.jxt:** No, pinot won't re-process the already consumed data  
**@jackie.jxt:** Since there is not much data, you may re-create the table to
get a fresh start  
**@david.cyze:** Thanks for the suggestion. As I mentioned, I'm doing a POC.
Unexplained data loss has me a bit worried, and I will continue to explore to
see if anything else pops up  
**@jackie.jxt:** Understood. Once the table is correctly configured, there
should be no data loss  
**@david.cyze:** Thank you all for your time and help. It is much appreciated
:slightly_smiling_face:  
 **@vibhor.jain:** Issue: Multiple issues seen with Pinot 0.8 integration with
PrestoSQL 350 (Trino).
*1. Selecting a BOOLEAN col in the projection list is a problem for both
real-time and offline tables. The query throws:*
```select hasVideo from table1 limit 10;
Query 20210901_071343_00190_p5w66 failed: Unable to create class org.apache.pinot.common.response.broker.BrokerResponseNative from JSON response: [{"resultTable":{"dataSchema":{"columnNames":["hasVideo"],"columnDataTypes":["BOOLEAN"]},"rows":[[false],[false],[false],[false],[false],[false],[false],[false],[true],[false]]},"exceptions":[],"numServersQueried":7,"numServersResponded":7,"numSegmentsQueried":7,"numSegmentsProcessed":7,"numSegmentsMatched":7,"numConsumingSegmentsQueried":0,"numDocsScanned":70,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":70,"numGroupsLimitReached":false,"totalDocs":70000,"timeUsedMs":5,"offlineThreadCpuTimeNs":3468272,"realtimeThreadCpuTimeNs":0,"segmentStatistics":[],"traceInfo":{},"numRowsResultSet":10,"minConsumingFreshnessTimeMs":0}]```
*2. Queries not working as expected for a DateTime col.* Pinot does not have a
direct DATETIME datatype and supports STRING, LONG, INT via
dateTimeFieldSpecs. We have a STRING col in the dateTimeFieldSpecs section,
but when using this col to query via PrestoSQL it's not working as expected.
*3. Alias feature is not working.* Executed a count(*) AS total_calls, but the
resultset shows the col name as count(*) only; the alias is not taking effect.
P.S.: We will be raising these concerns with the Trino community but thought of
sharing it here too.  
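For issue 2, a hedged sketch of what a STRING time column looks like in a Pinot schema's `dateTimeFieldSpecs` (column name and date pattern are illustrative):
```
{
  "dateTimeFieldSpecs": [
    {
      "name": "callStartTime",
      "dataType": "STRING",
      "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}
```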
 **@mayanks:** @elon.azoulay @xiangfu0 ^^  
 **@g.kishore:** Thanks for sharing Vibhor.. some of these might be related to
connector as well.  
**@elon.azoulay:** This is fixed in the new version of the connector which
will support pinot 0.8.0, aliases, boolean types and more function calls as
well.  
 **@elon.azoulay:** Already have it working locally, should be soon.  
 **@vibhor.jain:** Hi @elon.azoulay, can you point me to the link where I
could try it? I'm assuming it's not officially out.  
 **@simone.franzini:** @simone.franzini has joined the channel  
 **@elon.azoulay:** Right, still working on it and will push it soon, I'll
keep you updated.  

###  _#pinot-dev_

  
 **@steve.reed:** @steve.reed has joined the channel  

###  _#getting-started_

  
 **@luisfernandez:** hey friends, I have a need in my current project to do
stats for ads (impressions, click_count, click_spent, etc.). My client has
many dimensions they may want to look at this data by (locale, user_id, search
query, device, etc.). We currently track all of this data through Kafka, and I
was thinking about using Pinot to make it queryable. The user-facing dashboard
looks at this data by set time ranges and also custom time ranges, so I was
wondering if Pinot is a good candidate for the given problem. Right now I'm
working on a POC with Pinot, so I would appreciate any insights
:slightly_smiling_face: thank you!  
 **@steve.reed:** @steve.reed has joined the channel  
 **@tiger:** Is there a way to specify the SegmentPush job to only push a
single segment instead of a directory?  
**@npawar:** one way i can think of is setting “includePattern” in the yml
file. you can find that config in the doc  
**@tiger:** includePattern seems to only work for ingest during segment
creation. For push, is it correct to set outputDirURI to exactly the segment
to push?  
**@npawar:** ah you are talking about push only. Yes looking at the code, it
should work if you directly give segment path.  
**@npawar:** are you seeing different behaviour?  
**@tiger:** I just tried it by setting outputDirURI to that and it seems to
work. Just wanted to confirm that is a valid use case. Thanks!  
**@tiger:** On another note, I have a question about how the push works. I'm
currently using the Metadata push method. If I split up the creation and push
steps, I believe the push job has to download the segment, and then generate
the metadata right? If I use SegmentCreationAndMetadataPush, is it more
efficient in that it can just directly create the segment and generate the
metadata in one go? So it would save an extra download of the segment?  
**@npawar:** yes that is correct, separating the 2 phases will create an extra
download  
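For reference, a hedged sketch of the relevant parts of an ingestion job spec for this thread; URIs and patterns are placeholders, and other sections (executionFrameworkSpec, tableSpec, pinotClusterSpecs, recordReaderSpec) are omitted:
```
# Create segments and push their metadata in one pass (avoids the extra
# download of running a separate push job later).
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://my-bucket/rawdata/'
# Restricts which input files are read during segment creation.
includeFileNamePattern: 'glob:**/*.csv'
# For a standalone push job, outputDirURI can instead point directly at a
# single segment to push just that one.
outputDirURI: 's3://my-bucket/segments/myTable/'
```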