You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/06/12 02:00:20 UTC

Apache Pinot Daily Email Digest (2021-06-11)

### _#general_

**@lovenishgoyal:** @lovenishgoyal has joined the channel
**@prashant.pandey:** @prashant.pandey has joined the channel
**@patidar.rahul8392:** Hi everyone , How to convert a string into double in
pinot with sum.functiom .I have tried these 2 queries but getting this error.
In my pinot schema file I have data type as string for this field so I m not
able to take sum. While wring the transformation in config file I have assign
default value as null for this field so it has both non null values and null
where data is not available.i guess it is not able to cast the null values
into double /decimal. Any way to ignore nulls . I have tried where
gross_amount is not null but didn't work. Kindly suggest
**@fx19880617:** I think you have “null” value in the string which failed the
conversion
**@fx19880617:** can you try to filter out the null value
**@fx19880617:** select … from myTable where gross_amout <> ‘null’
**@patidar.rahul8392:** Yes I have nulls in table.ok let me try with <>. I was
using is not null.
**@patidar.rahul8392:** @fx19880617 Worked.
**@animeshmishra.ml:** @animeshmishra.ml has joined the channel
**@gagandeep.singh:** Hello guys:wave: For my studies, I am talking about
Pinot and its architecture. I created an activity Diagram for demonstrating
the query process within the Cluster, but I think some things are missing. I
read the original Paper and orient myself on the Query process section.
Unfortunately, I was not fully capable of illustrating it. I would appreciate
it if some experts could give me some feedback. Thank you very much!
**@mayanks:** has all what you need
**@aaron:** I'm trying to understand the difference between Segment URI Push
and Segment Metadata Push. I was using Segment URI Push and I filled up the
disk on the controller. That seems to make sense to me since the controller
had to download all the segments. A couple related questions: 1\. If I use
metadata push, my understanding is that the controller will direct one of the
servers to download the segment instead, is that right? 2\. Does that mean the
controller will use less disk in that case? 3\. Is the final state after URI
Push and Metadata Push different? I'd assume in both cases, you should end up
with segments distributed across servers, is that right? So I'm just curious
why the controller's disk filled up, is it supposed to clean up and isn't
doing that, or is this behavior expected?
**@mayanks:** ```1. Yes. It is true for all pushes though, controller always
directs servers to download segments from deepstore (or whatever is configured
PinotFs). 2\. Yes. Controller only needs to look at segment metadata. There's
no need for it to have the entire segment. So metadata push is an optimization
to achieve this. 3\. Yes, final state is the same. If controller disk filled,
you should check what is filling it. if it is the segments, then perhaps
controller is not using the deepstore (dataDir not configured)?```
**@aaron:** Thank you! I'll see if I didn't configure dataDir correctly
**@aaron:** Hmm so from (1), maybe I don't understand the difference between
URI and metadata push
**@aaron:** It's just that the controller doesn't download from deep store?
**@aaron:** I do have dataDir configured, here's from the controller conf:
```controller.data.dir= controller.local.temp.dir=/usr/scratch/pinot
controller.enable.split.commit=true
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
# This is required pinot.controller.storage.factory.s3.region=xxx
pinot.controller.storage.factory.s3.accessKey=xxx
pinot.controller.storage.factory.s3.secretKey=xxx
pinot.controller.storage.factory.s3.endpoint=xxx```
**@aaron:** OH interesting
**@aaron:** I screwed up and set controller.data.dir twice in the config.
**@aaron:** First to a path on local disk, second to s3
**@aaron:** I'm assuming it took the first data dir?
**@mayanks:** Seems so.
**@aaron:** Works now, thanks for the help
**@aaron:** FWIW it seems like it concatenated the two data dirs
:slightly_smiling_face:
**@aaron:** So it was writing to like, `/usr/.../s3:/pinot` on local disk
**@mayanks:** I see. Do you mind adding these into the FAQ, for the rest of
the community? <#C023BNDT0N8|pinot-docsrus>
**@pugarg:** Hello Team, i wanted to understand one thing does pinot provide
capability of overwrite any segment data ( like we do overwrite partitioned
data in hive table)
**@mayanks:** Yes, you can overwrite a segment simply by pushing it again.
**@mayanks:** The contents of the old segment are overwritten by the contents
of the new segment. Segments are matched based on their name
**@pugarg:** I dont understand, let me explain my use case.. i have date
partitioned parquet data in s3. Let say if i dump 10th june on 10th june and
again i pushed 10th june on 11th june, i want that pinot overwrite the data
and keep the latest data only
**@mayanks:** So when you generate segments, there's configurable segment
naming convention that defaults to (table_name_minTime_maxTime_num).
**@mayanks:** for example myTable_2021-06-10_2021-06-10_0
**@mayanks:** When you regenrate the June 10th data on June 11th, the segment
will still be named ^^. And when you push it to Pinot, it will overwrite the
previously pushed segment of the same name
**@mayanks:** It is segment name based, not not really time or partition
based.
**@pugarg:** Can you share some reference where i can see how to create
segment name??
**@mayanks:** this is the default
**@mayanks:** You can also refer to
**@lekant:** @lekant has joined the channel
**@akhilesh.chg:** @akhilesh.chg has joined the channel

### _#random_

**@lovenishgoyal:** @lovenishgoyal has joined the channel
**@prashant.pandey:** @prashant.pandey has joined the channel
**@animeshmishra.ml:** @animeshmishra.ml has joined the channel
**@lekant:** @lekant has joined the channel
**@akhilesh.chg:** @akhilesh.chg has joined the channel

### _#troubleshooting_

**@lovenishgoyal:** @lovenishgoyal has joined the channel
**@prashant.pandey:** @prashant.pandey has joined the channel
**@calvin.mwenda:** Hey guys, what’s the best way - if any - to deal with
JSON queries where 1) Some data points might be null 2) Not all data points
have the same schema … here’s some sample data: ```Pinot schema column is
called "properties" with a type of string # First Record { timestamp: '...',
browser: '...' } # Second Record null # Third Record { balance: '...' }```
This data is provided by the end user and can be whatever they’d like (this is
by design) … so there’s really no “schema” for the data in the JSON column.
**@pedro.cls93:** Depends on what you intend to do with the field. I had a
similar case where a json field could be "anything", all I knew is that the
max size of the field was 2147483647 chars (2GB per field). I defined the
field as a string dimension field in the schema with a reasonable maxLength
(by default pinot truncates to 512 chars). Then I added that column to
`noDictionaryColumns` & `jsonIndexColumns` in the table configuration. To
query the field I use making sure to always specify the default value
(usually to `null` )
**@calvin.mwenda:** Thanks @pedro.cls93… I’d actually set up most of what you
mentioned but was using the `JSONEXTRACTSCALAR` function wrong. One more issue
I had, for records with `null` I had to convert it to an empty object for it
work, otherwise I would get the exception below:
```QueryExecutionError:\njava.lang.IllegalArgumentException: json can not be
null ...``` Which makes sense :sweat_smile:
**@mayanks:** Thanks @pedro.cls93. Would you mind adding this to the FAQ
(<#C023BNDT0N8|pinot-docsrus> for details)?
**@calvin.mwenda:** Hey @mayanks sure thing. Just to clarify, is this about
the null entries?
**@mayanks:** The FAQ question would be about your original question and you
can add all details including null from here
**@calvin.mwenda:** Ah I see :thumbsup:
**@animeshmishra.ml:** @animeshmishra.ml has joined the channel
**@mayanks:** @mayanks set the channel topic: Have a question? Check first.
Give back to the community by updating the docs (<#C023BNDT0N8|pinot-docsrus>)
when you get your answer here.
**@lekant:** @lekant has joined the channel
**@luanmorenomaciel:** Hi experts, has someone faced this problem before
while submitting a task to pinot's controller? ```root@pinot-
controller-0:/opt/pinot# bin/pinot-admin.sh AddTable \ > -schemaFile
/opt/pinot/sch_realtime_enriched_music_data_users.json \ > -tableConfigFile
/opt/pinot/realtime_enriched_music_data_users_json.json \ > -exec Exception in
thread "main" java.lang.reflect.InvocationTargetException at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386)
at
sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401)
Caused by: java.net.BindException: Address already in use at
sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:461) at
sun.nio.ch.Net.bind(Net.java:453) at
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222) at
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85) at
sun.net.httpserver.ServerImpl.bind(ServerImpl.java:133) at
sun.net.httpserver.HttpServerImpl.bind(HttpServerImpl.java:54) at
io.prometheus.jmx.shaded.io.prometheus.client.exporter.HTTPServer.<init>(HTTPServer.java:145)
at
io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:31)
... 6 more FATAL ERROR in native method: processing of -javaagent failed
Aborted (core dumped) root@pinot-controller-0:/opt/pinot#```
**@fx19880617:** hmm, can you set `JAVA_OPTS=""`
**@luanmorenomaciel:** what a precision @fx19880617 impressed !!! seems that
worked, let me double check
**@akhilesh.chg:** @akhilesh.chg has joined the channel

### _#pinot-dev_

**@fx19880617:** I’m going to merge java 11 upgrade PR() today, for devs
still on java8, please use ```mvn clean install -DskipTests -Pbin-dist -T 4
-Djdk.version=8``` to build pinot source code.
**@ssubrama:** Does it make sense to have a flag for java 11 instead? I am
thinking code that links with the segment build logic -- may not get to choose
their own java RTE. This is certainly the case with Linkedin. Are there others
in a similar situation?
**@fx19880617:** that’s why we would like to call it out now and say we still
support java 8 until next release
**@fx19880617:** then we will drop java8 support and have people use java11
**@fx19880617:** otherwise we will never get to upgrade :stuck_out_tongue:
**@dlavoie:** :tada:

### _#pinot-docs_

**@ken:** I’m updating some doc, but trying to remember whether the internal
representation for BOOLEAN is a string (what I thought) or an int (from
existing docs)
**@jackie.jxt:** @ken We recently introduced the native BOOLEAN support which
stores boolean as int in the backend. In the old version (`0.7.1` and older
version), there is no native BOOLEAN support and BOOLEAN data type is treated
as STRING
**@ken:** Hi @jackie.jxt - OK, that explains it, thanks!
**@ken:** So should I doc as-is currently (it’s a string), and when the new
support gets released somebody will update the docs? Curious about doc
standards for calling out changes in versions.
**@jackie.jxt:** Good point. We should have version specific docs
**@jackie.jxt:** I'm also editing the `schema` doc now for the newly added
data types, let me add some note describing the behavior
**@ken:** Sounds good. If you’re in there editing the doc, I’ve got an old
note about adding the default max length (for only strings?).
**@ken:** And it seems weird to have basically the same data for fields at
and
**@ken:** Normally you’d put some stripped down info (like “The field types we
support are INT, LONG, ….” with a ref to the reference page for more details.
**@ken:** Unless, of course, everyone enjoys keeping multiple copies of
documentation in sync :slightly_smiling_face:
**@jackie.jxt:** Good call. I don't even know we have 2 pages for schema...
**@jackie.jxt:** @ken I just added the new data types and the behavior note to
**@jackie.jxt:** Let's figure out how to organize these 2 pages and minimize
the duplicate material

### _#pinot-docsrus_

**@calvin.mwenda:** @calvin.mwenda has joined the channel
\--------------------------------------------------------------------- To
unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org For additional commands,
e-mail: dev-help@pinot.apache.org