Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/06/01 03:36:02 UTC

Apache Pinot Daily Email Digest (2022-05-31)

### _#general_

  
 **@jacob.branch:** @jacob.branch has joined the channel  
 **@piercarlo.paltro:** @piercarlo.paltro has joined the channel  
 **@satyam.raj:** Hey guys, how should we handle the scenario where multiple kafka topics need to be ingested into pinot and joined to produce the final result? Should there be a pre-aggregate/lookup streaming job that consolidates the data from multiple topics into one topic that pinot ingests, or should we use Presto to do the joins?  
**@kharekartik:** Do all topics receive data in the same format?  
**@kharekartik:** @walterddr for joins  
**@satyam.raj:** the topics will have different data, like “app_install” can
be one kafka topic, and “app_open” can be another kafka topic. these two need
to be joined  
**@mayanks:** You likely need a stream processing (flink) job upstream for
this. Unless all you want to do is dimension lookup, in which case refer to  
**@satyam.raj:** But what if there are lots of such events, any two of which
can be joined at query time?  
**@mayanks:** In lookup join, the dimension table is static (periodic
refresh). If you are referring to flink, then that’s what it is made for.  
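A note on the dimension-lookup option mentioned above: lookup joins in Pinot work against a dimension table, which is an OFFLINE table flagged with `isDimTable` and given a storage quota so it can be replicated to all servers, and which is refreshed periodically by re-pushing segments. A minimal sketch of such a config (table/schema names are hypothetical; check the exact fields against the Pinot docs for your version):

```
{
  "tableName": "appMetadata",
  "tableType": "OFFLINE",
  "isDimTable": true,
  "segmentsConfig": {
    "schemaName": "appMetadata",
    "segmentPushType": "REFRESH",
    "replication": "1"
  },
  "quota": {
    "storage": "200M"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "metadata": {}
}
```

The join then happens at query time through Pinot's lookup UDF against this table, rather than through a stream-processing job.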
 **@arawat:** Hi Pinot team, our security team flagged our pinot deployment in
labs for security vulnerabilities, the majority coming from
`com.fasterxml.jackson`, and all of them are addressed in newer versions of the
dependencies. Any thoughts on how we should go about addressing these? Can
share the list with you if interested.  
**@mayanks:** Will dm  
 **@alex.gartner:** @alex.gartner has joined the channel  
 **@madison.s204:** @madison.s204 has joined the channel  
 **@alex.gartner:** Hi all, doing some testing with Pinot lately. Just
wondering, is there a "Kibana"-like tool for Pinot that can make it a little
bit easier to visualize data, without having to write an application that does
so?  
**@diogo.baeder:** Have you tried Apache Superset?  
**@mayanks:** Yes, you can refer to:  
**@alex.gartner:** thank you both! haven't checked superset but it looks
perfect  
**@mayanks:** Glad to assist  
 **@alex.gartner:** Another question I've been wondering about is this idea of
both realtime and offline tables being queried at once, via the same table
name. Does anyone have an interesting use case for when they've used this? I'm
trying to wrap my head around one  
**@mayanks:** Yes this is a very common pattern. What’s your question on this
one?  
**@alex.gartner:** really just trying to imagine where this would be useful.
in my case, our streaming data sources are usually so different from our
batched data, I'm wondering why I'd want to query them at the same time  
**@alex.gartner:** do you have an example of this in practice?  
**@mayanks:** By example you mean a config setup? Or just want to know who is
running it?  
**@mayanks:** If former  
**@alex.gartner:** latter, just a scenario in which it makes sense  
**@mayanks:** I think many of LinkedIn’s use cases follow that pattern. For
example, “who viewed my profile” that is powered by Pinot follows that  
**@mayanks:** Real-time ingestion gives you freshness. Offline gives you
opportunity to pre-aggregate, correct stream error etc  
**@mayanks:** So you get best of both worlds. Does that make sense?  
**@alex.gartner:** ahhh yeah totally. ty!  
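For readers new to the pattern discussed above: a hybrid table is simply an OFFLINE table config and a REALTIME table config that share the same raw table name, so a query against that name spans both, with the broker using a time boundary to avoid double counting. A bare sketch of the two configs (hypothetical table name; schema, stream configs and other required fields omitted):

```
{"tableName": "events", "tableType": "OFFLINE",  "segmentsConfig": {"timeColumnName": "ts", "replication": "1"}}
{"tableName": "events", "tableType": "REALTIME", "segmentsConfig": {"timeColumnName": "ts", "replication": "1"}}
```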
 **@wadodkar:** @wadodkar has joined the channel  
 **@kevin.kamel:** @kevin.kamel has joined the channel  
 **@carolyn:** @carolyn has joined the channel  

###  _#random_

  
 **@jacob.branch:** @jacob.branch has joined the channel  
 **@piercarlo.paltro:** @piercarlo.paltro has joined the channel  
 **@alex.gartner:** @alex.gartner has joined the channel  
 **@madison.s204:** @madison.s204 has joined the channel  
 **@wadodkar:** @wadodkar has joined the channel  
 **@kevin.kamel:** @kevin.kamel has joined the channel  
 **@carolyn:** @carolyn has joined the channel  

###  _#troubleshooting_

  
 **@jacob.branch:** @jacob.branch has joined the channel  
 **@sowmya.gowda:** Hi Team, I'm facing an issue with pinot datatypes. I have a
column jobTitle with the value "Staff RN (Med Surg, Ortho/Neuro, GI/GU floor" in my
file and defined the schema with the string datatype only. But I'm getting an error while
loading into the table - `Cannot read single-value from Object[]: [Staff RN (Med
Surg, Ortho/Neuro, GI/GU floor] for column: jobTitle`  
**@saurabhd336:** Can you share your table config, schema json and the data
format, data file / data json you're trying to ingest? Is this a realtime
table or an offline table?  
**@sowmya.gowda:** It's an offline table ingesting from a csv file. Sharing a tar
file containing the table config, schema and job_specification file and the
raw_data/xab.csv file  
**@saurabhd336:** @sowmya.gowda values like `Staff RN (Med Surg; Ortho/Neuro;
GI/GU floor` are the culprits here. The `;` character is the default multi-value
separator for the CsvReader configured in the job spec to ingest the data. I was
able to generate the segment correctly with
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/Users/saurabh.dubey/Downloads/test2_candidate/raw_data/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/Users/saurabh.dubey/Downloads/test2_candidate/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    multiValueDelimiter: '$'
tableSpec:
  tableName: 'test2_candidate'
  schemaURI: ''
  tableConfigURI: ''
pinotClusterSpecs:
  - controllerURI: ''
```
^ This spec. It basically overrides the `configs: multiValueDelimiter: '$'` part to
change the multiValueDelimiter to some other character. This may not always work
(if some strings contain the $ character), so you should figure out the correct
multiValueDelimiter for your data and use that in the ingestion spec. Otherwise,
change the ingestion from csv to something more robust like json  
**@saurabhd336:** ^@kharekartik for more  
**@sowmya.gowda:** Thank you @saurabhd336 for the quick solution. It helped me
a lot!!  
 **@piercarlo.paltro:** @piercarlo.paltro has joined the channel  
 **@luisfernandez:** hello my friends, my team has been trying to ingest data
using the job spec for some weeks now, and it has been quite challenging. we
are trying to ingest around 500gb of data, which is 2 years of data for our
system, and we are using apache pinot `0.10.0`. we ran into this issue:  so we had
to create a script to do the imports daily. however, for some reason the pinot
servers are exhausting memory (32gb), and before running the job they are
mostly at half capacity. what are some of the reasons that our pinot servers
would run out of memory from these ingestion jobs? also, we are using the
standalone job and we change the input directory in our script every time it
finishes a daily run. Would appreciate any help!  
**@ken:** Can’t you use the `pushFileNamePattern` support to build a segment
name that’s composed of the previous directory name and the file name? So you
could create something like `2009-movies` as the final name.  
**@luisfernandez:** oh i have to check that out  
**@luisfernandez:** another question that i had is how do you tell the script
to output the logs somewhere, just so that i can have it run as a background
task  
**@luisfernandez:** do you know?  
**@ken:** Are you talking about the script that runs the admin tool? If so,
then it’s the usual Linux command line thing of adding `>>logfile.txt 2>&1`,
see  
**@luisfernandez:** right but that only logs this:
```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-environment/pinot-azure/pinot-azure-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-yammer/pinot-yammer-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-dropwizard/pinot-dropwizard-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See  for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
```  
**@luisfernandez:** i’m currently running it like this:  
**@luisfernandez:**
```
JAVA_OPTS='-Xms1G -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent-0.12.0.jar=7007:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml' \
  /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /opt/pinot/migration/job.yaml
```  
**@luisfernandez:** (this is for one day's worth of data)  
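Tying ken's earlier `>>logfile.txt 2>&1` suggestion to the command above, a sketch of running the same job in the background with its stdout/stderr appended to a file (the log path and the use of `nohup` are illustrative, not from the thread):

```
# JAVA_OPTS shortened here; reuse the full options from the command above.
JAVA_OPTS='-Xms1G -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200' \
  nohup /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /opt/pinot/migration/job.yaml \
  >> /opt/pinot/migration/ingestion-job.log 2>&1 &
```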
**@ken:** Don’t you also wind up with logs in the `logs/` subdir inside of
your `/opt/pinot/` directory?  
**@ken:** e.g. `pinot-all.log`?  
**@luisfernandez:** i do have those logs but i guess how would i differentiate
what's logged by what  
**@ken:** The minimal stdout/stderr logging output is what I often see when
slf4j finds multiple bindings. I would just focus on what’s in the logs/
subdir.  
**@ken:** I made a run at fixing up Pinot logging so you wouldn’t get the
issue of multiple bindings, but it’s a giant hairball.  
**@luisfernandez:** so in the logs/ subdir i see the logs for the controller
itself and i guess i would see the job's logs too?  
**@ken:** In a normal configuration, each process (server, broker, controller)
has its own log file(s). So in that case, what gets logged when you run the
admin app should just be what it’s logging as part of your request. Note that
if you’re using Hadoop or Spark to run a segment generation job, then those
systems will have their own logging infrastructure as well.  
**@luisfernandez:** I'm using the standalone mode, thank you. now we got better
logging at least  
**@luisfernandez:** it has been a little harder to get this import process in
place  
**@luisfernandez:** we have year/month/day/severalfilesperday.parquet; because
of the bug we are doing imports daily instead  
**@luisfernandez:** and it takes us days to do these imports  
**@ken:** If you do a metadata push it should be pretty fast. We load about
1100 segments from HDFS via this approach in a few hours. This assumes
segments have been already built and stored in HDFS, which we do via a Hadoop
job that takes about an hour or so.  
**@luisfernandez:** `SegmentCreationAndMetadataPush` this one right?  
**@luisfernandez:** we interface with GCS  
**@luisfernandez:** and we are just doing standalone  
**@ken:** Just `SegmentMetadataPush` for us, since we create the segments
using a scalable Hadoop map-reduce job.  
**@luisfernandez:** we have a spark process that grabs the data from bigquery
and puts it in gcs  
**@luisfernandez:** and then we use the standalone job to look at the gcs
buckets and create segments and do metadata push  
**@ken:** So you can use a Spark job to also create the segments from the text
files you extract from BigQuery.  
**@luisfernandez:** which that would be one of these guides right?  
**@ken:** That is scalable and can be much, much faster than trying to do it
in a single process via a standalone job  
**@ken:** Yes, that’s the guide. And yes, you can use this to ingest text,
parquet, or avro files.  
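As a rough sketch of what that looks like in the job spec (moving segment generation to Spark while still doing a metadata push), the executionFrameworkSpec switches from the standalone runners to the Spark ones; the class names below come from the pinot-batch-ingestion-spark plugin and should be double-checked against the 0.10.0 docs, and the staging path is a hypothetical GCS location:

```
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    # staging dir used by the Spark runner when input/output are on remote storage
    stagingDir: 'gs://your-bucket/pinot/staging/'
jobType: SegmentCreationAndMetadataPush
```

The job itself is then submitted with spark-submit as described in the guide, with the rest of the spec (inputDirURI, recordReaderSpec for parquet, pinotFSSpecs for GCS, etc.) unchanged.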
**@luisfernandez:** wouldn’t i run into the issue with the version problem
that we have with pinot 0.10.0?  
**@ken:** Are you talking about `pinot servers are exhausting memory (32gb)
and before running the job they are mostly at half capacity what are some of
the reasons that our pinot servers would run out of memory from these
ingestion jobs`?  
**@luisfernandez:** oh nonono, i’m talking about running this with spark
instead of the standalone job, which is what we are doing, i also don’t know
why that happened ^  
**@luisfernandez:** we gave the machines more memory but i feel like
something else is the root cause  
**@ken:** In your `tableIndexConfig` make sure you set
`"createInvertedIndexDuringSegmentGeneration": true,`  
**@ken:** This is in the table spec (Json file)  
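For reference, a sketch of where that flag sits in the table config JSON (the column name here is hypothetical):

```
"tableIndexConfig": {
  "loadMode": "MMAP",
  "invertedIndexColumns": ["userId"],
  "createInvertedIndexDuringSegmentGeneration": true
}
```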
**@luisfernandez:** let me check what it’s set at  
**@luisfernandez:** oofff what happens if it’s `false`?  
**@ken:** As per , if it's false (which is the default) then indexes are
created on servers when segments are loaded, which can be both a CPU and
memory hog  
**@luisfernandez:** is it safe to change on an existing table?  
**@ken:** I believe so, yes - it should only impact the segment generation
job, not any segments that have been already deployed  
**@ken:** Generating the segment with the inverted index makes the segment
bigger, but if you’re deploying using metadata push that shouldn’t matter
much. Note though that currently metadata push requires each segment be
downloaded to the machine running the standalone job, so it can be untarred to
extract metadata. So you want a fast connection from that server and your deep
store.  
 **@luisfernandez:** another question kinda related to the above: we are
currently running on gke, and our deep storage is configured with gcs. we have
liveness and readiness probes configured on these machines. i think that when
the server starts it tries to pull the available data from gcs, and i think
this may take longer as more data gets ingested. how do you all manage this?
we had 10min configured for all the data to get onto the server, but now that
more data is on the machines it seems like we need even more wait time for the
data to be ready. any suggestions?  
**@mayanks:** Every restart should not require pull from deep store if you are
using ebs  
**@luisfernandez:** right  
**@luisfernandez:** i got confused  
**@luisfernandez:** so right now the issue is that our health/readiness check
is not getting ready in those 10 min  
**@luisfernandez:** and pod gets restarted  
**@luisfernandez:** `"message": "null:\n64370 segments [….] unavailable,
errorCode: 305` is the error we see in the brokers  
**@luisfernandez:** and those are all the segments available  
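One way to handle the longer startup is to give the server pod a much larger readiness window, e.g. via a Kubernetes startupProbe, so it isn't restarted while segments are still loading. A sketch, assuming the default Pinot server admin port and health endpoint; the thresholds are illustrative and should be tuned to how long segment loading actually takes in your cluster:

```
# Hypothetical probe settings for the Pinot server pod/StatefulSet
startupProbe:
  httpGet:
    path: /health
    port: 8097
  periodSeconds: 30
  failureThreshold: 120   # allow up to ~60 minutes before the pod is restarted
```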
 **@alex.gartner:** @alex.gartner has joined the channel  
 **@madison.s204:** @madison.s204 has joined the channel  
 **@wadodkar:** @wadodkar has joined the channel  
 **@kevin.kamel:** @kevin.kamel has joined the channel  
 **@carolyn:** @carolyn has joined the channel  

###  _#getting-started_

  
 **@jacob.branch:** @jacob.branch has joined the channel  
 **@piercarlo.paltro:** @piercarlo.paltro has joined the channel  
 **@alex.gartner:** @alex.gartner has joined the channel  
 **@madison.s204:** @madison.s204 has joined the channel  
 **@wadodkar:** @wadodkar has joined the channel  
 **@kevin.kamel:** @kevin.kamel has joined the channel  
 **@carolyn:** @carolyn has joined the channel  

###  _#releases_

  
 **@wadodkar:** @wadodkar has joined the channel  

###  _#introductions_

  
 **@jacob.branch:** @jacob.branch has joined the channel  
 **@piercarlo.paltro:** @piercarlo.paltro has joined the channel  
 **@alex.gartner:** @alex.gartner has joined the channel  
 **@madison.s204:** @madison.s204 has joined the channel  
 **@wadodkar:** @wadodkar has joined the channel  
 **@kevin.kamel:** @kevin.kamel has joined the channel  
 **@carolyn:** @carolyn has joined the channel  