Posted to issues@arrow.apache.org by "Sven Cattell (Jira)" <ji...@apache.org> on 2022/10/18 22:02:00 UTC
[jira] [Created] (ARROW-18090) Dictionary Style array for Keywords or Tags
Sven Cattell created ARROW-18090:
------------------------------------
Summary: Dictionary Style array for Keywords or Tags
Key: ARROW-18090
URL: https://issues.apache.org/jira/browse/ARROW-18090
Project: Apache Arrow
Issue Type: New Feature
Reporter: Sven Cattell
I want to efficiently encode a list of tags for each record in my database. In my case there are 30 distinct tags, and a few of them are assigned to each of my ~20 million records. Here is a simplified example with 5 records:
* pe, keylogger, cryptojack
* pe, packed
* pe, cryptojack, c2
* pe, keylogger, c2
* pe
Right now I have to store these in a List<Utf8>, which duplicates the same tag strings many times over. The dictionary array looks almost perfect for this task: I would just like a dictionary's index to be allowed to be a List<T>, rather than only a single value of the primitive index type T.
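To make the request concrete, here is a rough sketch of the layout in plain Python. It is not an Arrow API: the names encode_tag_lists and decode_record are hypothetical, and the three buffers (dictionary, offsets, indices) are only an assumption about how a list-valued dictionary index could be laid out. Each distinct tag is stored once, and each record holds a short run of small integers.

```python
from typing import Dict, List, Sequence, Tuple


def encode_tag_lists(
    records: Sequence[Sequence[str]],
) -> Tuple[List[str], List[int], List[int]]:
    """Hypothetical helper: return (dictionary, offsets, indices)."""
    dictionary: List[str] = []       # each distinct tag is stored once
    positions: Dict[str, int] = {}   # tag -> its index in `dictionary`
    offsets: List[int] = [0]         # record i spans indices[offsets[i]:offsets[i + 1]]
    indices: List[int] = []          # flattened per-record tag indices

    for tags in records:
        for tag in tags:
            if tag not in positions:
                positions[tag] = len(dictionary)
                dictionary.append(tag)
            indices.append(positions[tag])
        offsets.append(len(indices))
    return dictionary, offsets, indices


def decode_record(
    dictionary: Sequence[str],
    offsets: Sequence[int],
    indices: Sequence[int],
    i: int,
) -> List[str]:
    """Hypothetical helper: rebuild the tag list of record i."""
    return [dictionary[j] for j in indices[offsets[i]:offsets[i + 1]]]


# The five example records from this issue.
records = [
    ["pe", "keylogger", "cryptojack"],
    ["pe", "packed"],
    ["pe", "cryptojack", "c2"],
    ["pe", "keylogger", "c2"],
    ["pe"],
]

dictionary, offsets, indices = encode_tag_lists(records)
print(dictionary)  # the distinct tags, in first-seen order
print(offsets)     # one boundary per record, plus a leading 0
print(indices)     # one small integer per (record, tag) pair
```

With 30 distinct tags, every index fits in a single byte, so the per-record cost would be the list offsets plus one byte per tag instead of a full string per tag.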