You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/04/20 02:00:28 UTC

Apache Pinot Daily Email Digest (2022-04-19)

### _#general_

  
 **@octchristmas:** Hi. team~ What is the difference between 'append' and
'refresh' in segmentPushType for offline table?  
**@mayanks:** Append means data push is incremental (say hourly/daily) and is
appended to the table.  
**@mayanks:** REFRESH implies that all data to this table will be refreshed
with each push.  
**@mayanks:** Note, these are way to specify user intent to Pinot. So it can
take care of background stuff like time-boundary management, retention etc  
**@ken:** @mayanks - actually doesn’t REFRESH imply that _segments_ will be
refreshed (replaced completely) during the push? Asking because we have a
table where we push a subset of segments as updates every day, configured as
refresh, and it’s working fine. So it’s not “all data to this table will be
refreshed”.  
**@ken:** @mark.needham - I was looking at the page that @octchristmas
referenced in this question *(), and noticed that it said to use
`IngestionConfig -> BatchIngestionConfig -> segmentPushType` as of 0.7, but
the only link to `IngestionConfig` on that page takes you to Ingestion
Transformations (), which doesn’t talk about the configuration settings. I see
that `BatchIngestionConfig` is also referenced on , but that’s more like a
tutorial versus field documentation.  
**@mayanks:** @ken I don’t think Pinot does the REFRESH on its own. It relies
on user to overwrite all segments a something that needs to be clarified in
docs if it is not. cc: @mark.needham  
**@ken:** Hi @mayanks - My comment wasn’t about how much Pinot does on its
own. It’s about whether “all segments” need to be overwritten or not. Based on
my experience, refresh just means that the user is responsible for updating
segments (which can be all segments for a table, or some subset), but it’s not
a full table operation.  
**@mayanks:** You are correct user is responsible, and if they only update
part of data Pinot won’t flag it  
**@octchristmas:** @mayanks @ken Thank you. I found that offline table's
RetentionManager only deletes segments of type 'append'.  So I was curious
about the difference between the two. But I still can't tell the difference
between 'refresh' and 'append'. When I tested with sample data (transcript
data), both types of tables replaced segments when injecting data in the same
time range. I expected that the 'refresh' type replaces the segment, and the
'append' type creates one more segment and the data will be 2 copies, but it
was not. Can you tell me in more detail what the difference between these two
is the behavior in pinot ? And, is it correct to use 'injestionConfig >
batchIngestionConfig > segemntPushType' instead of 'segmentConfig >
segmentPushType'? When I set 'injestionConfig > batchIngestionConfig >
segemntPushType', it is not actually reflected. When I set 'segmentConfig >
segmentPushType', it is actually reflected, and I tested with this setting.
tableConfig: `{` `"tableName": "transcript_trans",` `"tableType": "OFFLINE",`
`"segmentsConfig": {` `"schemaName": "transcript_trans",` `"replication": 1,`
`"timeColumnName": "timestampInEpoch", "timeType": "MILLISECONDS",`
`"retentionTimeUnit": "HOURS", "retentionTimeValue": 1` `},` `"tenants": {
"broker":"DefaultTenant", "server":"DefaultTenant" },` `"tableIndexConfig": {`
`"loadMode": "MMAP"` `},` `"ingestionConfig": {` `"batchIngestionConfig": {`
`"segmentPushType": "APPEND"` `},` `"filterConfig": {` `"filterFunction":
"Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"` `},`
`"transformConfigs": [` `{` `"columnName": "fullName",` `"transformFunction":
"Groovy({firstName+' '+lastName}, firstName, lastName)"` `},` `{`
`"columnName": "scoreSum",` `"transformFunction": "Groovy({score1+score2},
score1, score2)"` `},` `{` `"columnName": "datetime",` `"transformFunction":
"toDateTime(timestampInEpoch, 'yyyy-MM-dd HH:mm:ss')"` `}` `]` `},`
`"metadata": {}` `}` `-- pinot controller ui, tableConfig` `"ingestionConfig":
{` `"batchIngestionConfig": {},` `"filterConfig": {` `"filterFunction":
"Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"` `},`  
**@mayanks:** From Pinot’s behavior difference you are right only difference
is that retention manager will delete segments for append table. Also time
boundary happens for append. Refresh is more for snapshot type use cases where
you want to rewrite entire data in Pinot with each push.  
**@mayanks:** In your case is it a real-time table, or a hybrid table with
time column and incremental push to Pinot? If yes that indicates you need
append table  
**@mayanks:** But if it is offline only and each time you want to overwrite
all data in Pinot, then it is refresh use case  
**@vasanth.reddy:** @vasanth.reddy has joined the channel  
 **@omkaunda:** @omkaunda has joined the channel  
 **@hughmil3s:** @hughmil3s has joined the channel  
 **@casber:** @casber has joined the channel  
 **@kiukchung:** @kiukchung has joined the channel  

###  _#random_

  
 **@vasanth.reddy:** @vasanth.reddy has joined the channel  
 **@omkaunda:** @omkaunda has joined the channel  
 **@hughmil3s:** @hughmil3s has joined the channel  
 **@casber:** @casber has joined the channel  
 **@kiukchung:** @kiukchung has joined the channel  
 **@tonya:** Hey folks! I wanted to let you know that StarTree will be hosting
our first real-time user-facing analytics conference in Aug in SFO. The CFP is
now open, and we’d love to see contributions from the Pinot community! :heart:
I’m the StarTree Community Manager, so please feel free to reach out with any
questions. I hope to see you there. :slightly_smiling_face:  
**@jt:** @jt has left the channel  

###  _#troubleshooting_

  
 **@vasanth.reddy:** @vasanth.reddy has joined the channel  
 **@prashant.pandey:** Hi team. I have a column with very high cardinality. It
is defined as a MV column in my schema. We also have an inverted index defined
on it. However, we want to disable dictionary on this column as it’s very high
cardinality and the dictionary size is almost 50% of the total segment size.
However, Pinot does not let me disable it and errors out with the following
message: “Cannot create an Inverted index on column tags__KEYS specified in
the noDictionaryColumns config” Can I disable dict. of this column and still
have inverted index defined on it? Also, when should we use a forward index
vs. an inverted index?  
**@kharekartik:** We have recently added feature to address this issue in out
main branch. `optimizeDictionaryForMetrics` config in  This only works for
Single-valued columns though currently. For your second question, forward
index are created by default. Inverted index should be used when you want to
do a constant time lookup for a column value.  
**@kharekartik:** Inverted index unfortunately can't be created on no
dictionary columns. So you will need to have a dictionary for it to work.
@mayanks Is there a better solution here?  
**@prashant.pandey:** > So you will need to have a dictionary for it to work.
In our case, we have around 800M segments out of which 400M is the size of the
dict due to the cardinality (700k unique values). Can we somehow optimise
this?  
**@prashant.pandey:** Like we want the inverted index without the dict. Dict.
anyway wouldn’t be that much helpful due to the high cardinality I presume?
The docs themselves say that if there is very less duplicate data, the
compression is not that useful.  
**@kharekartik:** Yeah, Having such large dicts doesn't make sense. I am not
sure though why is inverted index not support on Raw columns. Can I ask one
thing? If it is a high cardinality columns, do you do a lot of lookups on
that? Or mostly range filters?  
**@prashant.pandey:** Yes we do have a lot of filters. We have two such
columns: `tag__KEYS` : Stores keys of your tags. `tag__VALUES` : Stores values
of those the corresponding tags in `tag__KEYS`. So basically, they look like:
```tag__KEYS: "key1, key2, key3" tag__VALUES: "val1, val2, val3"``` Almost all
of our queries have predicates on these two columns.  
**@mayanks:** Are you using Map type here? Seems like you can use JSON type
which can have json index on no dict column  
**@mayanks:** @prashant.pandey ^^  
**@prashant.pandey:** @mayanks Sorry was AFK for a long time. Yes true in fact
that is exactly what is in the roadmap. But that’ll take some time. I was
looking at some quick solution that could reduce the size of my segments by
almost 50% :slightly_smiling_face:.  
**@mayanks:** Short of that, I don’t think think there’s a quick solution
there.  
 **@harish.bohara:** In my setup i have ```volumeMounts: \- name: data
mountPath: /bitnami/zookeeper I have no name!@zookeeper-0:/$ df -h Filesystem
Size Used Avail Use% Mounted on overlay 50G 3.4G 47G 7% / tmpfs 64M 0 64M 0%
/dev tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup /dev/nvme0n1p1 50G 3.4G 47G 7%
/etc/hosts /dev/nvme1n1 99G 441M 98G 1% /bitnami/zookeeper shm 64M 0 64M 0%
/dev/shm tmpfs 3.8G 12K 3.8G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 3.8G 0 3.8G 0% /proc/acpi tmpfs 3.8G 0 3.8G 0% /sys/firmware
zookeeper-0:/bitnami/zookeeper/data$ ls -la total 16 drwxrwsr-x 3 1001 1001
4096 Apr 19 07:58 . drwxrwsr-x 4 root 1001 4096 Apr 19 07:58 .. -rw-rw-r-- 1
1001 1001 2 Apr 19 07:58 myid drwxrwsr-x 2 1001 1001 4096 Apr 19 15:08
version-2```  
**@omkaunda:** @omkaunda has joined the channel  
 **@hughmil3s:** @hughmil3s has joined the channel  
 **@srini:** hey y’all! @hughmil3s and I were having some trouble getting
Pinot connected to Superset and would love any help here. We tried both the
URI’s in the superset docs () and the Pinot docs () and neither seemed to work
for us. I specifically tested with the pinot quickstart docker image and kept
getting generic SQLalchemy errors, which makes it hard to debug! Hugh’s trying
to test this PR () in the Superset repo, that improves the Pinot capabilities
in Superset :zap: cc @kennybastani @chinmay.cerebro @g.kishore  
**@hughmil3s:** I’m currently running pinot locally with docker on 9000  
**@hughmil3s:** and this the connection string that i’ve created: ``  
**@hughmil3s:** when i try connect just using the `pinotdb` i’ve been getting
this error: ```~/src/superset/venv/lib/python3.8/site-packages/pinotdb/db.py
in check_sufficient_responded(self, query, queried, responded) 261 needed =
fraction 262 if responded < 0 or responded < needed: \--> 263 raise
exceptions.DatabaseError( 264 f"Query\n\n{query} timed out: Out of {queried},
only" 265 f" {responded} responded, while needed was {needed}" DatabaseError:
Query select * from baseballStats timed out: Out of -1, only -1 responded,
while needed was -1```  
**@hughmil3s:** let me know if anyone can help me debug so i can verify my
changes work as intended  
**@kennybastani:** Try `pinot+` for the connection string  
**@hughmil3s:** same error  
**@hughmil3s:** ```2-04-19 10:21:16,806:INFO:pinotdb.sqlalchemy:Updated pinot
dialect args from {'dbapi': <module 'pinotdb' from
'/Users/hugh/src/superset/venv/lib/python3.8/site-
packages/pinotdb/__init__.py'>, 'debug': False, 'verify_ssl': True}: None and
False 2022-04-19 10:21:16,807:INFO:pinotdb.sqlalchemy:Updated pinot dialect
args from {'host': 'pinot', 'port': 8000, 'path': 'query', 'scheme': 'http',
'verify_ssl': True, 'username': None, 'password': None, 'debug': False}:  and
False 2022-04-19 10:21:16,807:DEBUG:superset.stats_logger:[stats_logger]
(incr) test_connection_attempt 2022-04-19
10:21:16,837:DEBUG:urllib3.connectionpool:Starting new HTTP connection (1):
pinot:8000 2022-04-19 10:21:16,844:DEBUG:superset.stats_logger:[stats_logger]
(incr) test_connection_error.DBAPIError 2022-04-19
10:21:16,846:DEBUG:superset.stats_logger:[stats_logger] (incr)
DatabaseRestApi.test_connection.error 2022-04-19
10:21:16,847:WARNING:superset.views.base:[SupersetError(message='(builtins.NoneType)
None\n(Background on this error at: ',
error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR:
'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>,
extra={'engine_name': 'Apache Pinot', 'issue_codes': [{'code': 1002,
'message': 'Issue 1002 - The database returned an unexpected error.'}]})]```  
**@kennybastani:** Can you get the log output from Pinot?  
**@kennybastani:** Oh hey, sorry my URL was wrong  
**@kennybastani:** `pinot+`  
**@kennybastani:** That's what I get from copy and pasting from my old code
bases :stuck_out_tongue_winking_eye:  
**@kennybastani:** Oh still wrong...  
**@kennybastani:** Okay, it should be good now  
**@kennybastani:** URL decoded it looks like this: `pinot+`  
**@kennybastani:** Where did you get the format for your connection string?  
**@hughmil3s:** from the docs  this is what i’m using to test now: ```from
sqlalchemy import * ...: from sqlalchemy.engine import create_engine ...: from
sqlalchemy.schema import * ...: ...: engine = create_engine('pinot+') ...:
...: places = Table('baseballStats', MetaData(bind=engine), autoload=True)
...: print(select([func.count('*')], from_obj=places).scalar())```  
**@chinmay.cerebro:** @hughmil3s just to confirm - database connection is fine
- but query is timing out correct ?  
**@kennybastani:** Did that connection string work out?  
**@hughmil3s:** for what i posted above this is the error i’m getting:
```ConnectionError: HTTPConnectionPool(host='localhost', port=8000): Max
retries exceeded with url: /query/sql (Caused by
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10730a790>:
Failed to establish a new connection: [Errno 61] Connection refused'))```  
**@hughmil3s:** With pinotdb: ```In [26]: from pinotdb import connect ...:
...: conn = connect(host='localhost', port=9000, path='/query/sql',
scheme='http') ...: curs = conn.cursor() ...: curs.execute(""" ...: select *
from baseballStats ...: """) ...: for row in curs: ...: print(row) ...:```
```~/src/superset/venv/lib/python3.8/site-packages/pinotdb/db.py in
check_sufficient_responded(self, query, queried, responded) 261 needed =
fraction 262 if responded < 0 or responded < needed: \--> 263 raise
exceptions.DatabaseError( 264 f"Query\n\n{query} timed out: Out of {queried},
only" 265 f" {responded} responded, while needed was {needed}" DatabaseError:
Query select * from baseballStats timed out: Out of -1, only -1 responded,
while needed was -1```  
**@kennybastani:** Can you `curl `  
**@hughmil3s:** curl: (7) Failed to connect to localhost port 8000: Connection
refused  
**@kennybastani:** Okay, so your broker is not available at `localhost:8000`  
**@hughmil3s:**  
**@hughmil3s:** for context  
**@hughmil3s:** how do i figure out what port my broker is living at  
**@kennybastani:** One sec  
**@hughmil3s:** is it 8000?  
**@kennybastani:** It should be  
**@kennybastani:** Try `netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'`  
**@hughmil3s:** ```✦ ➜ netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'
tcp6 0 0 ::1.9000 ::1.57711 FIN_WAIT_2 407508 146808 478 0 0x0112 0x0000000c
tcp6 0 0 ::1.57711 ::1.9000 CLOSE_WAIT 407577 146808 57242 0 0x0022 0x00000000
tcp46 0 0 *.9000 *.* LISTEN 131072 131072 478 0 0x0100 0x00000006 tcp6 0 0
::1.9000 ::1.63536 TIME_WAIT 403702 146808 478 0 0x2131 0x0000000c tcp6 0 0
::1.9000 ::1.63537 TIME_WAIT 400527 146808 478 0 0x2131 0x0000000c tcp6 0 0
::1.9000 ::1.63538 TIME_WAIT 404686 146808 478 0 0x2131 0x0000000c```  
**@kennybastani:** Well thats odd :stuck_out_tongue_winking_eye:  
**@kennybastani:** Which version of Pinot are you using?  
**@hughmil3s:** so i ran this to start up the cluster: ```docker run \ -p
9000:9000 \ apachepinot/pinot:0.9.3 QuickStart \ -type batch```  
**@hughmil3s:** so 0.9.3  
**@kennybastani:** Ah, so you're running your test script inside the container
of the controller  
**@kennybastani:** Which makes sense why you can't see the broker  
**@hughmil3s:** got it  
**@kennybastani:** Let's see  
**@hughmil3s:** what is the proper command  
**@kennybastani:** Oh, if it's QuickStart it should be all in one container  
**@kennybastani:** Try adding the broker and server ports to your docker
command  
**@hughmil3s:** ```docker run \ -p 9000:9000 \ -p 8000:8000 \ -p 7000:7000 \
apachepinot/pinot:0.9.3 QuickStart \ -type batch```  
**@hughmil3s:** right?  
**@kennybastani:** Yeah that looks good  
**@kennybastani:** That should solve the problem I think  
**@hughmil3s:** working on it now  
**@hughmil3s:** and i’ll report back thanks kenny  
**@kennybastani:** :tada:  
**@kennybastani:** Sounds good Hugh. Let me know if I can help with anything  
**@hughmil3s:** getting the same thing  
**@hughmil3s:** ```✦ ➜ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS
PORTS NAMES fe8b5fdf9dcb apachepinot/pinot:latest "./bin/pinot-admin.s…" 14
minutes ago Up 2 minutes 0.0.0.0:7000->7000/tcp, 0.0.0.0:8000->8000/tcp,
0.0.0.0:9000->9000/tcp, 8096-8099/tcp confident_curie```  
**@hughmil3s:** ```✦ ➜ netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'
tcp46 0 0 *.8000 *.* LISTEN 131072 131072 478 0 0x0100 0x00000006 tcp46 0 0
*.7000 *.* LISTEN 131072 131072 478 0 0x0100 0x00000006 tcp46 0 0 *.9000 *.*
LISTEN 131072 131072 478 0 0x0100 0x00000006 tcp6 0 0 ::1.9000 ::1.57711
FIN_WAIT_2 407508 146808 478 0 0x0112 0x0000000c tcp6 0 0 ::1.57711 ::1.9000
CLOSE_WAIT 407577 146808 57242 0 0x0022 0x00000000```  
**@kennybastani:** Okay, that looks good. We might need to update that
connection string. It is likely correct in the docs. I took mine from a much
older version of Pinot.  
**@hughmil3s:** waiitttt  
**@hughmil3s:** we got it!  
**@hughmil3s:** `pinot+`  
**@kennybastani:** Woot :slightly_smiling_face:  
**@kennybastani:** Sweet. If you folks need anything else, ping me. Looking
forward to seeing the enhancements.  
**@hughmil3s:** naw we are good i’m going to test this pr with the new grains
then report there thanks kenny  
**@kennybastani:** You're very welcome  
**@hughmil3s:** Running Pinot w/ Docker ```docker run \ -p 9000:9000 \ -p
8000:8000 \ -p 7000:7000 \ apachepinot/pinot:0.9.3 QuickStart \ -type batch```
Working Connection String: `pinot+` Python Test Example: ```In [2]: from
pinotdb import connect ...: ...: conn = connect(host='localhost', port=8000,
path='/query/sql', scheme='http') ...: curs = conn.cursor() ...:
curs.execute(""" ...: select * from baseballStats ...: """) ...: for row in
curs: ...: print(row) ...: ```  
**@kennybastani:** Okay I'll track back in the doc changes to see what
happened  
**@hughmil3s:** :thumbsup:  
**@ahsen.m:** i am able to connect superset with pinot like this ```pinot+```  
 **@casber:** @casber has joined the channel  
 **@kiukchung:** @kiukchung has joined the channel  

###  _#getting-started_

  
 **@vasanth.reddy:** @vasanth.reddy has joined the channel  
 **@omkaunda:** @omkaunda has joined the channel  
 **@hughmil3s:** @hughmil3s has joined the channel  
 **@casber:** @casber has joined the channel  
 **@kiukchung:** @kiukchung has joined the channel  
\--------------------------------------------------------------------- To
unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org For additional commands,
e-mail: dev-help@pinot.apache.org