Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/08/17 02:00:17 UTC

Apache Pinot Daily Email Digest (2021-08-16)

### _#general_

  
 **@vicky301186:** How can I see the query plan in Pinot? I want to verify that it
only hits a certain set of segments based on the specific time-range filter in my
query  
**@sosyalmedya.oguzhan:** I don't know whether you can see the query plan or not, but
you can check the number of scanned segments in your query response  
**@prashant.pandey:** I think this feature is in development. However, I
think the response stats contain the info you need.  
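
As a quick illustration of the response-stats approach mentioned above, here is a minimal sketch that checks how many segments survived pruning. It assumes a broker reachable at localhost:8099, the `svd` table from this thread, and an illustrative date value; adjust host, port, and query for your setup. The counter names follow the standard Pinot broker response metadata.

```
# Send a time-filtered query to the broker and inspect the segment counters in
# the response metadata: a large gap between numSegmentsQueried and
# numSegmentsProcessed/numSegmentsMatched means segments were pruned.
curl -s -X POST http://localhost:8099/query/sql \
  -H 'Content-Type: application/json' \
  -d "{\"sql\": \"SELECT COUNT(*) FROM svd WHERE dateString = '2021-08-16-10'\"}" \
  | jq '{numSegmentsQueried, numSegmentsProcessed, numSegmentsMatched, numDocsScanned}'
```
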
 **@vicky301186:** Hi Team, I am trying to create hour-based segments in Pinot,
but it's creating more than one folder under segments for the same hour. I
guess this is due to some default row/data size. Can I modify these default
configurations, and how? What is the preferable size of a data segment in
Pinot, and what is the philosophy here: many files of a small size, or fewer
files of a decent size? Any reference on the above?  
**@vicky301186:** schema: ```{
  "schemaName": "svd",
  "dimensionFieldSpecs": [
    { "name": "serviceId", "dataType": "STRING" },
    { "name": "currentCity", "dataType": "STRING" },
    { "name": "currentCluster", "dataType": "STRING" },
    { "name": "phone", "dataType": "STRING" },
    { "name": "epoch", "dataType": "LONG" }
  ],
  "metricFieldSpecs": [
    { "name": "surge", "dataType": "DOUBLE" },
    { "name": "subTotal", "dataType": "DOUBLE" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "dateString",
      "dataType": "STRING",
      "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd-HH",
      "granularity": "1:DAYS"
    }
  ]
}```  
**@vicky301186:** table config ```{
  "tableName": "svd",
  "ingestionConfig": {
    "transformConfigs": [
      { "columnName": "dateString", "transformFunction": "toDateTime(epoch, 'yyyy-MM-dd-HH')" }
    ]
  },
  "segmentsConfig": {
    "timeColumnName": "dateString",
    "timeType": "MILLISECONDS",
    "replication": "1",
    "schemaName": "svd"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": ["serviceId"],
    "loadMode": "MMAP",
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "currentCity": { "functionName": "Murmur", "numPartitions": 4 }
      }
    }
  },
  "routing": { "segmentPrunerTypes": ["partition"] },
  "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
  "tableType": "OFFLINE",
  "metadata": {}
}```  
**@mayanks:** Hi, you can refer to  
**@sosyalmedya.oguzhan:** For offline tables, you have to control the number of
rows in each output file (which gets converted to a segment later). Pinot just
converts each input file to a segment, so one input file equals one segment. For
your realtime tables, you can check the configurations  
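
For the realtime case mentioned above, segment size is controlled by the flush thresholds in the table's `streamConfigs`. A minimal sketch follows; the property names are the ones documented for Pinot realtime ingestion, while the topic name, consumer type, and values are illustrative assumptions:

```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "svd-events",
  "stream.kafka.consumer.type": "lowlevel",
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.time": "1h",
  "realtime.segment.flush.threshold.segment.size": "200M"
}
```

Per the docs, setting the rows threshold to 0 makes Pinot size completed segments toward the target segment size rather than a fixed row count.
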
**@tiger:** @tiger has joined the channel  
 **@jai.patel856:** I had a general question about Upsert. Are the resources
required expected to be “significantly” higher than for a normal Realtime table? I
ask because our Upsert table seems to take significantly more resources. Our
upsert table is a considerably wider table, but I’d like to understand if it’s
that width that’s contributing the bulk of that load, or if it could be Upsert
itself.  
**@g.kishore:** yes, upsert needs more resources because of the key-to-row-id
mapping. But the number of columns in the table should not increase the
overhead.  
**@yupeng:** also, consider keeping primary key values simple (e.g. a single
value rather than a composite). Or use the `hashFunction`  
**@jai.patel856:** Thanks. Our keys are UUID or UUID+UUID. The first problem
we found was that they were not uniformly distributed. So we hashed them with
XX3 (xxhash). That definitely helped with the balance and turned them into
longs. But we continue to use the tuple of UUIDs for the partitionKeyColumns.  
**@jai.patel856:** Oh, and to add a little more detail, we found that the lack
of uniformity started with the Kafka key when we used UUIDs. So we weren’t
getting an even spread across the servers and we ultimately had hot nodes.  
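
To illustrate the `hashFunction` suggestion above, here is a minimal sketch of the relevant table-config fragment. It assumes the primary key columns are already declared via `primaryKeyColumns` in the schema; MD5 and MURMUR3 are the documented hash options (NONE is the default), and `strictReplicaGroup` routing is what the upsert docs call for.

```
"upsertConfig": {
  "mode": "FULL",
  "hashFunction": "MURMUR3"
},
"routing": {
  "instanceSelectorType": "strictReplicaGroup"
}
```

With a hash function set, Pinot keeps a hash of the (possibly composite) primary key in the key-to-row-id map instead of the raw values, which helps with the memory overhead discussed above.
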
 **@roberto:** One question, in the official java client (Not the JDBC one) is
it possible to configure the basic auth?  
**@mayanks:** Seems it does not support that right now. Perhaps you can file
an issue?  

###  _#random_

  
 **@tiger:** @tiger has joined the channel  

###  _#troubleshooting_

  
 **@kangren.chia:** i encounter this issue when trying the spark ingestion:
```Caused by: java.lang.NullPointerException
at org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(SystemUtils.java:1626)
at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
... 27 more
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2611)
at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner.run(SparkSegmentGenerationJobRunner.java:198)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:93)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:311)
at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
Exception in thread "main" java.lang.ExceptionInInitializerError
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:359)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:272)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:125)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142)
at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132)
at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:67)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:370)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)```  
**@mayanks:** What version of Java are you using?  
**@kangren.chia:** ```spark 3.0.2
pinot 0.7.1
java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment 18.9 (build 11.0.10+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.10+9, mixed mode, sharing)```  
**@kangren.chia:** i get the jars for spark submit from here:
```${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/opt/pinot/plugins -Dlog4j2.configurationFile=/opt/pinot/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar:/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar" \
  --jars local:///opt/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar,local:///opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar,local:///opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar,local:///opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar \
  local:///opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar -jobSpecFile jobSpec.yaml | tee output```  
**@mayanks:** We are seeing some issues with newer Spark versions, could you
try Spark 2.3.x?  
**@mayanks:** Here's a similar thread:  
**@kangren.chia:** i can’t see that thread, i think it’s buried due to the 10k
message limit  
**@kangren.chia:** let me try some workarounds  
**@bruce.ritchie:** Just upgrade apache commons to latest in your deployment.  
**@mayanks:** Thanks @bruce.ritchie  
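
The NullPointerException in `SystemUtils.isJavaVersionAtLeast` is the usual symptom of an older commons-lang3 on the driver classpath that does not recognize Java 11, so following the upgrade suggestion above means getting a newer commons-lang3 ahead of the bundled one. A minimal sketch of one way to do that; the 3.12.0 version and paths are illustrative assumptions, not from this thread:

```
# Fetch a commons-lang3 release that knows about Java 11+ (3.12.0 is an
# illustrative choice) and drop it next to the Pinot jars.
wget https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.12.0/commons-lang3-3.12.0.jar \
  -P /opt/pinot/lib/

# Then list it FIRST in spark.driver.extraClassPath so it wins over the copy
# shaded into pinot-all, e.g.:
#   --conf "spark.driver.extraClassPath=/opt/pinot/lib/commons-lang3-3.12.0.jar:/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar:..."
```
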
**@tiger:** @tiger has joined the channel  
 **@roberto:** hi!! I’m trying to add authentication to my pinot instance and
it seems that after adding authentication I’m not able to perform queries from
the controller UI because of a 403. Is there any way to add authentication
using the UI?  
**@mayanks:** Does this help:  
**@roberto:** Exactly @mayanks!! I followed that guide. In fact the login page
is shown and I can log in without problems  
**@mayanks:** :+1:  
**@roberto:** the issue is in UI when I try to perform a query  
**@mayanks:** Oh, sorry, I thought you were confirming that you found a
solution.  
**@roberto:** Checking all requests, I see that from the UI all calls include
the `Authorization: Basic (my_token)` header, but it isn’t included when a
query is performed  
**@roberto:** I have verified by calling the `/sql` endpoint directly, adding
the header manually, and it worked. I think it is a UI problem  
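
For reference, a minimal sketch of the kind of manual check described above, assuming a controller at localhost:9000 and placeholder credentials and table name; the token is built the same way as the Basic token in the configs shared a few messages below:

```
# Token: base64 of "USERNAME:PASSWORD"
TOKEN=$(echo -n 'MY_USERNAME:MYPASSWORD' | base64)

# Query via the controller's /sql endpoint with the Authorization header set
# (the broker's /query/sql endpoint accepts the same header).
curl -s -X POST http://localhost:9000/sql \
  -H "Authorization: Basic ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{"sql": "SELECT COUNT(*) FROM myTable"}'
```
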
**@mayanks:** I take it you have set up the same username/password on the
controller as well as the broker?  
**@roberto:** yep  
**@roberto:** My controller config:
```controller.admin.access.control.factory.class=org.apache.pinot.controller.api.access.BasicAuthAccessControlFactory
controller.admin.access.control.principals=MY_USERNAME
controller.admin.access.control.principals.oscilar.password=MYPASSWORD
controller.segment.fetcher.auth.token=Basic MYTOKEN (calculated as base64(MY_USERNAME:MYPASSWORD))```
My broker config:
```pinot.broker.access.control.class=org.apache.pinot.broker.broker.BasicAuthAccessControlFactory
pinot.broker.access.control.principals=MY_USERNAME
pinot.broker.access.control.principals.oscilar.password=MYPASSWORD```  
**@g.kishore:** I don't think we have hooked up the UI for auth yet  
**@roberto:** ok! that makes sense compared with what I have seen  
**@roberto:** thanks!  

###  _#pinot-dev_

  
 **@yash.agarwal:** @yash.agarwal has joined the channel  

###  _#getting-started_

  
 **@tiger:** @tiger has joined the channel  
 **@tiger:** Hi, I'm trying to batch ingest a lot of data in some ORC files,
what is the recommended way of doing this? I'm currently using the
SegmentCreationAndMetadataPush job with the command line interface.  
**@g.kishore:** That's a good way to get started. In prod, you'd use Spark to
set up these jobs.  
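
As a concrete starting point for the standalone SegmentCreationAndMetadataPush flow mentioned above, here is a minimal job-spec sketch. The directory URIs, controller address, and table name are placeholder assumptions; the field layout and class names follow the Pinot batch-ingestion spec for ORC input over a local filesystem.

```
# Hypothetical job spec for a standalone metadata-push job over local ORC files.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentCreationAndMetadataPush
inputDirURI: 'file:///data/orc-input/'        # placeholder input directory
includeFileNamePattern: 'glob:**/*.orc'
outputDirURI: 'file:///data/pinot-segments/'  # placeholder output directory
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'orc'
  className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
tableSpec:
  tableName: 'myTable'                        # placeholder table name
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'    # placeholder controller address
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
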
**@tiger:** Thanks! Also, is there a way to configure the segment generation
with batch ingest? For example, is it possible to pass in 1 ORC file, and
specify it to create N number of segments or to create segments of specific
size?  
**@g.kishore:** Not as of now. Right now it's one input file -> one Pinot segment  
**@g.kishore:** there is a segment processing framework WIP that can allow you to
do some of these things  
**@tiger:** Ok got it. How important are segment sizes in pinot? I saw on the
FAQ that the recommended size is 100-500MB. Should I try to make it so that
all the segments are roughly the same size?  
**@mayanks:** As long as you are in the ballpark, it is fine.  