Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/01/13 02:00:18 UTC

Apache Pinot Daily Email Digest (2021-01-12)

### _#general_

  
 **@srini:** @srini has joined the channel  
 **@karinwolok1:** We have a lot of new :wine_glass: community members in
2021! :wave: Welcome!!! We're curious to know what brought you here! :smiley:
Please introduce yourself in this thread! Also, if you have any technical
questions, you can ask in <#C01H1S9J5BJ|getting-started> or
<#C011C9JHN7R|troubleshooting> @romualdo.gobbo @tsajay101 @lvs.pjx @sri
@valentin @aliouamardev @pandey.mayuresh367 @sankalp.jain02 @vinulam @rchandel
@gamparohit @john @egala @apandhi @sandun.wed @zxcware @ntyrewalla
@kizilirmak.mustafamer @srini  
 **@srini:** Welcome Pinot community! :pinot: I come from the Apache Superset
community, invited by @kenny I’m a huge viz nerd but also have a stats / ML
background. Happy to answer any Pinot <> Superset questions! I also spent the
last ~5 years building Dataquest (online learning platform for learning data
science) and am always open to discussing careers in data science!  
 **@ranemihir45:** @ranemihir45 has joined the channel  

###  _#random_

  
 **@srini:** @srini has joined the channel  
 **@ranemihir45:** @ranemihir45 has joined the channel  

###  _#feat-text-search_

  
 **@pabraham.usa:** Hello, my text index somehow stopped working. It is now
giving intermittent results. For example, the following works: `select * from
mytable where regexp_like(log, '0D82F520-62C8-9914-14B8-4C2331E54075')`  
 **@pabraham.usa:** But this one will not: `select * from mytable where
text_match(log, '0D82F520-62C8-9914-14B8-4C2331E54075')`  
 **@pabraham.usa:** any pointers on how to debug?  
 **@g.kishore:** Can you post this on <#C011C9JHN7R|troubleshooting>  
 **@pabraham.usa:** ok will do  

###  _#troubleshooting_

  
 **@pabraham.usa:** Hello, my text index somehow stopped working. It will give
results for some search data, but not for all. For example, the following works:
`select * from mytable where regexp_like(log,
'0D82F520-62C8-9914-14B8-4C2331E54075')` But this one will not:
`select * from mytable where
text_match(log, '0D82F520-62C8-9914-14B8-4C2331E54075')` Any pointers on how
to debug?  
 **@g.kishore:** There was some bug with stop words @steotia ^^  
 **@g.kishore:** Is it tokenizing `-`?  
 **@pabraham.usa:** How can i find out whether it is tokenizing? it seems like
some data are not going into text index  
 **@g.kishore:** Can you try text match without `-`?  
 **@pabraham.usa:** That is also not working; text match with `-` however will
work for some data  
 **@g.kishore:** `select * from mytable where text_match(log, '0D82F520')`  
**@g.kishore:** I see  
 **@g.kishore:** Does this work?  
 **@pabraham.usa:** no, this will not work for that particular data  
 **@pabraham.usa:** but it will work for some other  
 **@pabraham.usa:** in my testing all these were working before on an old
index, or it could be that I just started a bit more extensive testing  
 **@pabraham.usa:** I deleted the entire cluster and recreated again, but
still no luck  
 **@g.kishore:** I don’t think that will fix it; we will try a quick test
and get back... can you file an issue?  
**@pabraham.usa:** sure will do that now  
 **@g.kishore:** Looks like a bug to me... is this latest version?  
 **@pabraham.usa:** yes I upgraded to latest because of this  
 **@g.kishore:** Okay...  
**@steotia:** Hi @pabraham.usa, the other day this was the query cache issue  
 **@steotia:** Which you had accidentally enabled on your text index  
 **@steotia:** And that was leading to incorrect results. I had suggested to
disable it  
 **@steotia:** Also you need to enclose the search string as a phrase  
 **@steotia:** This was another issue with your queries as they were matching
incorrect documents.  
 **@steotia:** If you don't use a phrase, the string will get tokenized around
the hyphen  
 **@steotia:** And it will be an OR-based term query  
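As a sketch of the difference being described (assuming a text index on the `log` column from the queries above):

```
-- Unquoted: the UUID is tokenized around '-', producing an OR over the
-- individual terms, so it can match documents containing any one of them:
select * from mytable where text_match(log, '0D82F520-62C8-9914-14B8-4C2331E54075')

-- Quoted as a phrase: matches only documents containing the exact sequence:
select * from mytable where text_match(log, '"0D82F520-62C8-9914-14B8-4C2331E54075"')
```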
 **@pabraham.usa:** That's correct, this issue is different  
 **@pabraham.usa:** I have cache disabled and also searching with quotes  
 **@pabraham.usa:** like  
 **@pabraham.usa:** `select * from mytable where regexp_like(log,
'\"0D82F520-62C8-9914-14B8-4C2331E54075\"')`  
 **@pabraham.usa:** The issue is that for some ids nothing is returned; it
seems like they are not in the text index at all  
 **@pabraham.usa:** After a bit more analysis, it looks like the query is fine;
however, for the text index the results only start to appear after a while. It
seems the text index is skipping segments with status CONSUMING/IN-PROGRESS.  
 **@pabraham.usa:** wondering whether this is a bug or I am missing some
setting to enable near-real-time searches  
 **@g.kishore:** That’s a bug  
 **@contact:** Hey, quick question: we wrote our own plugin for realtime
ingestion with Google Pub/Sub, and in our tests we always get one realtime
segment per server, even though we configured 1 replica per partition (the
stream is high level). Does anyone have an idea? Our ideal setup would be to
only have one (so no replica)  
 **@contact:** If that helps, we have open-sourced the plugin here:  
**@mayanks:** Why not contribute this to the main Pinot project?  
**@contact:** We still haven't put anything in our prod env, so I believe it's
a little bit early  
**@mayanks:** If you open a PR against the main repo, you might get early
feedback as well.  
**@contact:** We are still not sure we want to commit to GCP's Pub/Sub either;
is it still worth upstreaming if we don't use it ourselves?  
**@mayanks:** Yeah, as long as the impl is good, I am sure someone else might
find it useful.  
 **@g.kishore:** That’s expected with high level stream consumer  
**@contact:** Not sure I understand why?  
**@g.kishore:** a high level stream combines all partitions of a stream into one
stream; splitting it into multiple segments will result in inconsistency and
data duplication  
**@g.kishore:**  
**@g.kishore:** this video explains the problems with high level stream
consumer and why we chose to implement partition level consumer  
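For reference, the consumer type is selected in `streamConfigs`. With the built-in Kafka plugin, a partition-level consumer would be configured roughly like this (a sketch following the Pinot docs; the topic name and broker list are placeholder values, and the property prefix depends on the stream type):

```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "my-topic",
  "stream.kafka.broker.list": "localhost:9092",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
}
```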
**@contact:** > splitting it into multiple segments will result in
inconsistency and data duplication Well, I agree on this one; that's why I
don't get why we have multiple segments  
**@g.kishore:** Multiple parallel segments is required for scaling  
**@g.kishore:** If the event rate is in the 100’s, splitting is not needed  
**@g.kishore:** But once you reach thousands, it helps  
**@g.kishore:** Also, it’s the unit of parallelism at query time  
**@g.kishore:** It just gives you more options as you scale on ingestion or on
the query side  
**@contact:** Most of our segments will not be getting more than 500 events/s
(and if so, that would last only a few minutes)  
**@contact:** I don't see where I can force having only one segment for the
realtime table  
**@ssubrama:** @contact we also had multiple operational issues with high
level streams. Consider the case when you have 4 replicas, and one of the
hosts goes down. You will need to bring up a new host and wait until it catches
up with the latest offset before you can send queries to it. We also had
operational issues when hosts were mistakenly tagged with the same tag, thus
splitting the stream between the two.  
**@ssubrama:** I don't know what you mean by "force to have only one segment".
For high level stream consumption, each consumer builds its own segments
and keeps them locally, since it can never be guaranteed that the rows consumed
by one replica are the same as the rows consumed by any other  
**@contact:** i meant to only have one consumer  
**@contact:** from my understanding I have 3 segments (one on each of my
servers), so I get 3 different consumers  
**@ssubrama:** If you are using high level consumers, as I understand you are,
then you should have one segment in progress and the others completed. The
older segments will be removed when the retention time is over  
**@contact:** I do use high level consumers; I got 3 realtime segments (for the
same realtime table), all of them in progress  
**@contact:** Is there any other place I can check to verify I have only one
consuming?  
**@g.kishore:** whatever you are seeing is the expected behavior... can you
paste your table config?  
**@contact:**
```
{
  tableName: XXXXXX,
  tableType: 'REALTIME',
  quota: {},
  routing: {},
  segmentsConfig: {
    schemaName: YYYYY,
    timeColumnName: ZZZZZ,
    timeType: ZZZZZ,
    replication: 1,
    replicasPerPartition: 1,
    segmentPushType: 'APPEND',
    segmentPushFrequency: 'HOURLY'
  },
  tableIndexConfig: {
    streamConfigs: {
      'streamType': 'pubsub',
      'stream.pubsub.consumer.type': 'highlevel',
      'stream.pubsub.decoder.class.name': 'com.reelevant.pinot.plugins.stream.pubsub.PubSubMessageDecoder',
      'stream.pubsub.consumer.factory.class.name': 'com.reelevant.pinot.plugins.stream.pubsub.PubSubConsumerFactory',
      'stream.pubsub.project.id': XXXXXX,
      // topic name is unused but required because the plugin extends the kafka one
      'stream.pubsub.topic.name': 'unused',
      'stream.pubsub.subscription.id': ZZZZZ,
      'realtime.segment.flush.threshold.time': '15d',
      // 390k rows ~ 200MB (513 bytes / row)
      'realtime.segment.flush.threshold.rows': '390000'
      // 'realtime.segment.flush.threshold.segment.size': '200M' needs
      // `realtime.segment.flush.threshold.rows` to be 0 and doesn't work in
      // 0.6.0 (`Illegal memory allocation 0 for segment ...`)
    },
    nullHandlingEnabled: true,
    invertedIndexColumns: [],
    sortedColumn: [],
    loadMode: 'mmap'
  },
  tenants: {},
  metadata: {}
}
```  
**@contact:** From the docs: ```Depending on the configured number of
replicas, multiple stream-level consumers are created, taking care that no two
replicas exist on the same server host. Therefore you need to provision
exactly as many hosts as the number of replicas configured.```  
**@contact:** However, in our setup we have one replica with one partition, so
I expect to have only one segment (so one consumer).  
 **@srini:** @srini has joined the channel  
 **@mohammedgalalen056:** @mohammedgalalen056 has joined the channel  
 **@ranemihir45:** @ranemihir45 has joined the channel  

###  _#discuss-validation_

  
 **@mohammedgalalen056:** I've updated the schema in the docs. If it's good, we
can move forward with opening the PR to make it configurable  
 **@chinmay.cerebro:** I'll review it today  
 **@chinmay.cerebro:** thanks Mohammed !  

###  _#getting-started_

  
 **@mohammedgalalen056:** @mohammedgalalen056 has joined the channel  
 **@ranemihir45:** @ranemihir45 has joined the channel  