Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/06/03 02:00:26 UTC

Apache Pinot Daily Email Digest (2021-06-02)

### _#general_

  
 **@humengyuk18:** I’m getting slow regexp_like performance: for 0.3 billion rows it takes nearly 2 secs to match a prefix on a column, whereas in Druid the same data returns instantly with the `like` operator. Are there any configs I can apply to speed up this kind of query?  
**@steotia:** have you tried a text index?  
**@humengyuk18:** A text index will require a raw value column, but I was previously getting an array-out-of-bounds exception when using a raw value index.  
**@steotia:** Text index is supported on both raw and dictionary columns  
**@steotia:** What is the error you see when creating/using the text index?  
**@humengyuk18:** I only looked at the documentation, which says only raw values are supported. Is this feature from the last release?  
**@steotia:** Sorry, that's my bad. Text index on dictionary columns has been supported for quite some time (I think 1 or 2 releases old). I will update the documentation.  
**@steotia:** Can you share how you are setting up the text index in the table config?  
**@humengyuk18:** I will try the text index on a test table and see if there are any errors.  
**@steotia:** Sure. Also, it should work for raw columns as well. Please share the call stack for the out-of-bounds error. The only error we have seen in the past with raw columns is the integer overflow, which was fixed with the new segment format that supports larger string column values.  
**@humengyuk18:** Will text index have a memory overhead?  
**@steotia:** It should not. We are running it on raw data where each string value can be as large as 2 million characters. However, for such cases, disabling the dictionary is preferable, since dictionary creation will increase heap usage and GC pressure. The text index itself should not introduce any significant memory overhead.  
**@humengyuk18:** Looks like the text index is not using consuming segment data? Is the text index only built when generating a segment?  
**@steotia:** It uses consuming segments as well. Let me know and we can jump on a call to see what's going on.  
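For reference, a prefix match like the one above can be served by the text index through `TEXT_MATCH`. A minimal sketch of the table config and query, with a hypothetical table and column name (the exact `fieldConfigList` syntax may vary slightly by release):

```
"fieldConfigList": [
  {
    "name": "message",
    "encodingType": "RAW",
    "indexType": "TEXT"
  }
],
"tableIndexConfig": {
  "noDictionaryColumns": ["message"]
}
```

```
SELECT COUNT(*) FROM myTable
WHERE TEXT_MATCH(message, 'error*')
```

As noted above, the index also works on dictionary-encoded columns; `noDictionaryColumns` and `"encodingType": "RAW"` are only needed if you disable the dictionary.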
 **@saurabhd336:** @saurabhd336 has joined the channel  
 **@bcbazevedo:** @bcbazevedo has joined the channel  
 **@ming.liu:** @ming.liu has joined the channel  
 **@kanleecarro:** @kanleecarro has joined the channel  
 **@hari.prasanna:** @hari.prasanna has joined the channel  
 **@richard.hallier:** @richard.hallier has joined the channel  
 **@bowenwan:** @bowenwan has joined the channel  
 **@sharma.vinit:** @sharma.vinit has joined the channel  

###  _#random_

  
 **@saurabhd336:** @saurabhd336 has joined the channel  
 **@bcbazevedo:** @bcbazevedo has joined the channel  
 **@ming.liu:** @ming.liu has joined the channel  
 **@kanleecarro:** @kanleecarro has joined the channel  
 **@hari.prasanna:** @hari.prasanna has joined the channel  
 **@richard.hallier:** @richard.hallier has joined the channel  
 **@bowenwan:** @bowenwan has joined the channel  
 **@sharma.vinit:** @sharma.vinit has joined the channel  

###  _#feat-presto-connector_

  
 **@prabha.cloud:** @prabha.cloud has joined the channel  

###  _#troubleshooting_

  
 **@saurabhd336:** @saurabhd336 has joined the channel  
 **@patidar.rahul8392:** Hi all, I am trying to push HDFS data into a hybrid table. I have added the offline table in Pinot and am now trying to push the HDFS file. When I execute the final Hadoop jar command, it says pinot-plugins.tar.gz doesn't exist. Could someone kindly suggest? Error: File file:/home/rah/hybrid/staging/pinot-plugin.tar.gz doesn't exist. I am attaching my config file. Here /user/hdfs is my HDFS location and /home/rah is the local location. P.S. for the staging and output dirs, if I give an HDFS location then it gives the error "Wrong FS:" hdfs://location-of-inputdir/filename.txt, expected: file:/// @ken @elon.azoulay @slack1 @tingchen @npawar @fx19880617 @mayanks Kindly suggest.  
**@elon.azoulay:** Hi @patidar.rahul8392 are you using the gcs plugin? Or are
you on s3?  
**@fx19880617:** I will take a look at the plugin jar issue for hdfs  
**@fx19880617:** This job is using hdfs, not s3, I think  
**@patidar.rahul8392:** Hi @elon.azoulay I'm using hdfs  
**@ken:** A few issues with your job spec: 1. You need to use `` for your staging directory. 2. You need to use `` for your `outputDirURI`.  
**@ken:** And you need to have a `configs:` section inside the pinot FS specs section, which has `hadoop.conf.path`. E.g. something like:  
**@ken:** ```pinotFSSpecs:
  - # scheme: used to identify a PinotFS. E.g. local, hdfs, dbfs, etc
    scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/root/hadoop-ops/config/master/'```  
**@ken:** Also it would be good to include the stack trace with the error
message.  
**@ken:** I think if you don’t have the hadoop.conf.path set, then Pinot falls
back to the default file system, which is why you get the errors about “wrong
FS”  
**@fx19880617:** @patidar.rahul8392  
**@patidar.rahul8392:**  
**@patidar.rahul8392:** @fx19880617 @ken these are the complete log details when I use a local path as the staging and output dir.  
**@fx19880617:** so I guess you start the job from your local machine; this means the hadoop job tries to add this URI into the dist cache: `/home/rah/hybrid/staging/pinot-plugins.tar.gz`  
**@fx19880617:** how do you submit the hadoop job?  
**@patidar.rahul8392:** Ok @fx19880617 ```hadoop jar \
  ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/hadoopIngestionJobSpechybrid.yaml```  
**@fx19880617:** this staging dir should be on hdfs as well I think  
**@patidar.rahul8392:** Ok @fx19880617 let me try  
**@patidar.rahul8392:** @ken @fx19880617 I have set the output and staging dirs to HDFS dirs, same as I gave for the input directory; I just created new dirs in the same location and passed them in the config. And I added one extra property, hadoop.conf.path: '/etc/Hadoop/conf/', where all my Hadoop configuration files are available, i.e. hadoop-site.xml, core-site.xml, etc. But it's still giving the same "Wrong FS" error.  
**@patidar.rahul8392:** This is how my files look now. Kindly suggest @ken @fx19880617  
**@ken:** Your `hadoop.conf.path` is in the wrong section. You have it as part
of the `file` specification, but it needs to be part of the `hdfs`
specification.  
**@ken:** You should be able to remove the `file` scheme section from the
`pinotFSSpecs` configuration  
**@patidar.rahul8392:** Error logs  
**@patidar.rahul8392:** Ok let me remove file section and retry  
**@patidar.rahul8392:** Thanks a lot @fx19880617 @ken @elon.azoulay It worked. :clap:  
**@fx19880617:** @ken huge thanks! we should document this in the FAQ  
**@fx19880617:** btw, what does your final config file look like? I want to compare it with the initial one  
**@fx19880617:** so I can update the documentation to make it clearer  
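For anyone hitting the same issue, a rough sketch of the job spec shape the thread converged on: HDFS URIs for the input, output, and staging directories, and `hadoop.conf.path` under the `hdfs` scheme in `pinotFSSpecs`. All paths are illustrative, and the exact layout (e.g. where `stagingDir` lives) should be checked against the batch-ingestion docs for your release:

```
executionFrameworkSpec:
  name: 'hadoop'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  extraConfigs:
    stagingDir: 'hdfs:///user/hdfs/hybrid/staging'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///user/hdfs/hybrid/input'
outputDirURI: 'hdfs:///user/hdfs/hybrid/output'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
```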
 **@bcbazevedo:** @bcbazevedo has joined the channel  
 **@ming.liu:** @ming.liu has joined the channel  
 **@kanleecarro:** @kanleecarro has joined the channel  
 **@hari.prasanna:** @hari.prasanna has joined the channel  
 **@machhindra.nale:** Team, I added a new index and sortedColumn in the table config of a table that was already ingesting data from a Kafka stream. I used the “AddTable” command to update the index: “jsonIndexColumns”: [ “entityMap” ], “sortedColumn”: [ “metric” ]. I performed “Reload All Segments” in the UI. Is there any way to know if the indexing is complete?  
**@g.kishore:** check in the table page, reload status button  
**@machhindra.nale:**  
**@g.kishore:** ah, not sure why it's not supported for real-time tables @npawar ^^  
**@npawar:** this was from a contributor in open source. He’s only done it for
offline.  
**@npawar:** @omkar.halikar14 is working on adding the realtime support  
**@npawar:** meanwhile, you can look at the status of indexing by going to the segment directory on the server instance and looking at metadata.properties  
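For context, both of those settings live under `tableIndexConfig` in the table config; a minimal sketch using the column names from the message above, with everything else omitted:

```
"tableIndexConfig": {
  "jsonIndexColumns": ["entityMap"],
  "sortedColumn": ["metric"]
}
```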
 **@richard.hallier:** @richard.hallier has joined the channel  
 **@ken:** I’m running into an issue when building segments with 0.7.1 that
didn’t occur with 0.6.0, due to (I think) using a Unicode code point for my
`multiValueDelimiter`  
**@ken:** The relevant bit of my job file is: ```recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    multiValueDelimiter: '\ufff0'``` With 0.6.0 this works fine. With 0.7.1 I get:
```shaded.com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot
deserialize instance of `char` out of VALUE_STRING token at [Source: UNKNOWN;
line: -1, column: -1] (through reference chain:
org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig["multiValueDelimiter"])
at
shaded.com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1442)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1216)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1126)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.deser.std.NumberDeserializers$CharacterDeserializer.deserialize(NumberDeserializers.java:448)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.deser.std.NumberDeserializers$CharacterDeserializer.deserialize(NumberDeserializers.java:405)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1719)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
shaded.com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1350)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
org.apache.pinot.spi.utils.JsonUtils.jsonNodeToObject(JsonUtils.java:117)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:88)
~[pinot-all-0.7.1-jar-with-
dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:199)
~[pinot-batch-ingestion-
standalone-0.7.1-shaded.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6] at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_291] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[?:1.8.0_291] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_291] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_291] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]```  
**@mayanks:** I am guessing we moved to a newer version of jackson that is
having trouble reading the delimiter into a char?  
**@ken:** Well, it’s OK if I use `multiValueDelimiter: 'a'`, but it’s not OK
if I do something like `multiValueDelimiter: '\u0040'`. Where in the code is
the job yaml file converted to a RecordReaderSpec?  
**@mayanks:** Check `IngestionJobLauncher.java`  
**@mayanks:** Assuming that you are using it  
**@ken:** Yes, thanks - working on a unit test to see if I can find the issue
:)  
**@mayanks:** Cool, thanks  
**@mayanks:** Either there's a code change or a lib change that is not able to
handle your delim.  
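The thread doesn't pin down the root cause, but the stack trace ends in Jackson's `CharacterDeserializer`, which only accepts one-character strings when binding into a `char` field. A hypothetical minimal reproduction (the `Config` class below is a stand-in, not the real `CSVRecordReaderConfig`): a single-quoted YAML scalar like `'\ufff0'` is not unescaped by YAML, so it would reach Jackson as a six-character string and fail, while a literal one-character value like `'a'` binds fine.

```
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical repro of the MismatchedInputException above; Config stands in
// for CSVRecordReaderConfig, whose multiValueDelimiter is a char in 0.7.1.
public class CharDelimiterRepro {
  public static class Config {
    public char multiValueDelimiter;
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();

    // Works: a one-character string binds into char.
    Config ok = mapper.readValue("{\"multiValueDelimiter\":\"a\"}", Config.class);
    System.out.println(ok.multiValueDelimiter); // a

    // Fails with "Cannot deserialize instance of `char` out of VALUE_STRING token":
    // the JSON value here is the six-character string \ufff0, which is what a
    // single-quoted YAML scalar '\ufff0' turns into, since single quotes do not
    // interpret backslash escapes.
    mapper.readValue("{\"multiValueDelimiter\":\"\\\\ufff0\"}", Config.class);
  }
}
```

If that is indeed the cause, one quick check would be to put the actual character in the YAML (or try a double-quoted YAML scalar, which does process `\u` escapes) rather than the single-quoted escape sequence.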
 **@bowenwan:** @bowenwan has joined the channel  
 **@sharma.vinit:** @sharma.vinit has joined the channel  

###  _#pinot-dev_

  
 **@ken:** When running `mvn clean install -DskipTests -Pbin-dist` on master, I got a failure: `on project pinot-jdbc-client: Some files do not have the expected license header`.  
**@ken:** The specific files were: ```[INFO] Checking licenses...
[WARNING] Unknown file extension: /Users/kenkrugler/git/pinot-ken/pinot-clients/pinot-jdbc-client/.externalToolBuilders/Maven_Ant_Builder.launch
[WARNING] Missing header in: /Users/kenkrugler/git/pinot-ken/pinot-clients/pinot-jdbc-client/maven-eclipse.xml``` Is this due to cruft in my filesystem, or some missing exclusions that ought to be there, or something else?  
**@fx19880617:** `mvn license:format`?  
**@fx19880617:** I think there are some ignored files without header?  
**@ken:** Yes - e.g. the `maven-eclipse.xml` looks like a generated file (not
under source control). Same for the `.externalToolBuilders` directory  
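If missing exclusions turn out to be the cause, the usual fix (assuming the build uses the com.mycila license-maven-plugin, which the `Checking licenses...` output suggests) is to list the generated files in the plugin's excludes in the root pom. A hypothetical sketch, patterns illustrative only:

```
<plugin>
  <groupId>com.mycila</groupId>
  <artifactId>license-maven-plugin</artifactId>
  <configuration>
    <excludes>
      <!-- Eclipse-generated files that should not need a license header -->
      <exclude>**/maven-eclipse.xml</exclude>
      <exclude>**/.externalToolBuilders/**</exclude>
    </excludes>
  </configuration>
</plugin>
```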

###  _#getting-started_

  
 **@prabha.cloud:** @prabha.cloud has joined the channel  

###  _#fix_llc_segment_upload_

  
 **@ssubrama:** I think you have one major part unimplemented as yet. You
should not be fetching the segments of a table when the periodic task starts.
I am not sure if by that time, the controller leadership has been decided.
Ideally, you should fetch this when the leadership is decided. Please chat
with @jlli to understand this better and see if a callback can be registered
with the lead controller manager.  
 **@changliu:** OK. I think a callback function will be the right solution  
 **@ssubrama:** It may be a bit tricky to set up, etc. You may need to
introduce a registration and callback mechanism, perhaps scheduled in a thread
(like helix does)  
 **@changliu:** I think if that is the case, I may need to open a new PR for this. For this PR, do you think it's OK to just fix the segments cached from the committing phase?  
 **@changliu:** After we add this callback registration, we can add the ZK access part to LLCRealtimeManager  
 **@changliu:** What do you think?  
 **@ssubrama:** That may be fine, but then on a controller restart, we will
lose the cache, right?  
 **@changliu:** That’s right  
 **@changliu:** So we need a ZK scan  
 **@changliu:** But the ZK scan logic depends on the controller leadership change, i.e. the registration/callback  
 **@changliu:** So I want to separate these two first  
 **@ssubrama:** If you are ok with that in the short run, then you can check it in as is (after addressing some of the other comments) and put a TODO in front of the `setupTask` method noting that there is a race condition there, in that the controller leadership may not be decided by the time the method is called. In the next PR, you can fix it. Before that, you can also check with Jack how to get notified. Oh, another solution to this (without introducing callbacks) is to keep a boolean for whether the segment names need to be downloaded. If the boolean is true, then download them (when the table is being processed) and initialize the queue. Otherwise, use the queue. I think this solution may work a little better since we don't download all the tables at the same time. We process a table, and then download the next one.  
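A rough sketch of that second suggestion; all class and method names below are hypothetical placeholders rather than actual Pinot APIs, and the ZK read is stubbed out:

```
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration of the lazy refresh pattern: mark a table stale on
// leadership change (or restart), and only re-read its segment names from ZK
// the next time the periodic task processes that table.
public class UploadRetryCache {
  private final ConcurrentMap<String, Queue<String>> _segmentsToFix = new ConcurrentHashMap<>();
  private final ConcurrentMap<String, Boolean> _needsZkRefresh = new ConcurrentHashMap<>();

  // Called when this controller becomes (or stops being) responsible for a table.
  public void markStale(String tableNameWithType) {
    _needsZkRefresh.put(tableNameWithType, Boolean.TRUE);
  }

  // Called from the periodic task while processing one table at a time.
  public Queue<String> getSegmentsToFix(String tableNameWithType) {
    if (Boolean.TRUE.equals(_needsZkRefresh.remove(tableNameWithType))) {
      // Pay the ZK scan cost only now, for this one table.
      _segmentsToFix.put(tableNameWithType, new ArrayDeque<>(fetchBadSegmentsFromZk(tableNameWithType)));
    }
    return _segmentsToFix.computeIfAbsent(tableNameWithType, k -> new ArrayDeque<>());
  }

  // Placeholder for the real ZK metadata scan.
  private List<String> fetchBadSegmentsFromZk(String tableNameWithType) {
    return Collections.emptyList();
  }
}
```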
 **@changliu:** :ok_hand:  
 **@jlli:** @jlli has joined the channel  
 **@changliu:** Hi @ssubrama, I just talked with @jlli about the leadership change callback. We can use `addPartitionLeader` and `removePartitionLeader`. But since they are partition-based, the controller can receive multiple state transitions within a short period of time.  
 **@jlli:** one workaround is to add a sleep time and count the zk access
request only once  
 **@ssubrama:** @jlli the need here is to get notified on mastership changes,
and invalidate the cache (of bad segments).  
 **@ssubrama:** I am not aware of addPartitionLeader or removePartitionLeader. Are these callbacks already offered?  
 **@jlli:** for a single pinot controller, the mastership changes when a helix state transition is received; that's where `addPartitionLeader` and `removePartitionLeader` get called  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org