Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/04/22 14:55:00 UTC

[jira] [Commented] (ARROW-4753) [C++] Extension types and layouts for text-optimized data structures

    [ https://issues.apache.org/jira/browse/ARROW-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823155#comment-16823155 ] 

Wes McKinney commented on ARROW-4753:
-------------------------------------

Users could experiment with such data types embedded in other Arrow types using the ExtensionType facility. That might help validate whether or not the functionality is useful. It's unclear to me whether we would want a full-fledged logical type.

> [C++] Extension types and layouts for text-optimized data structures
> --------------------------------------------------------------------
>
>                 Key: ARROW-4753
>                 URL: https://issues.apache.org/jira/browse/ARROW-4753
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++, Format
>         Environment: C/C++
>            Reporter: Edmon Begoli
>            Priority: Minor
>              Labels: features
>
> Narrative (text) is, by default, notoriously inefficient to store on disk or in memory. It is, in its most basic form, a long sequence of bytes with no indexing or other optimized layout structure.
>   
>  There are data structures such as [tries|https://en.wikipedia.org/wiki/Trie], [DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton], or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more efficient storage and lookup of phrases. 
>   
>  We would like to enable Arrow to serialize from/to these efficient structures, acting as the format/carrier between high-performance text processing steps that operate on binary data structures (lookups, spellers, or more advanced NLP routines).
>   
>  So, it could be something like:
>   
>  text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow  // writes arrow as format for the specified encoding. This could be implicit if we could store encoding in some kind of manifest
>   
>  arrow.to_text(infer=true|dafsa|trie|b-trie) : string  // restores text from the arrow format, and from a specified encoding, same as above.
>   
>  On the dev mailing list we are discussing the creation of a contrib folder where such features could be optionally included for Arrow.
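
To make the layout question in the description above more concrete, here is a rough sketch of how a trie might be flattened into parallel fixed-width columns, which is the kind of shape that could plausibly sit in Arrow buffers. Everything in it is an assumption for illustration: FlatTrie, its column names, and the kNone sentinel are hypothetical, it uses only the C++ standard library, and it does not touch any Arrow API or reflect a layout anyone on the issue has proposed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sentinel meaning "no node" in the index columns.
constexpr int32_t kNone = -1;

// Hypothetical flat trie: node i is row i across four parallel columns,
// so the whole structure is a handful of contiguous fixed-width arrays.
struct FlatTrie {
  std::vector<char> label;            // byte on the edge leading into the node
  std::vector<int32_t> first_child;   // index of the first child, or kNone
  std::vector<int32_t> next_sibling;  // index of the next sibling, or kNone
  std::vector<uint8_t> terminal;      // 1 if a stored word ends at this node

  FlatTrie() { NewNode('\0'); }  // row 0 is the root

  // Appends one row to every column and returns its index.
  int32_t NewNode(char c) {
    label.push_back(c);
    first_child.push_back(kNone);
    next_sibling.push_back(kNone);
    terminal.push_back(0);
    return static_cast<int32_t>(label.size()) - 1;
  }

  // Walks the sibling chain of `node` looking for a child labelled `c`.
  int32_t FindChild(int32_t node, char c) const {
    for (int32_t i = first_child[node]; i != kNone; i = next_sibling[i]) {
      if (label[i] == c) return i;
    }
    return kNone;
  }

  void Insert(const std::string& word) {
    int32_t node = 0;
    for (char c : word) {
      int32_t child = FindChild(node, c);
      if (child == kNone) {
        // Link the new node at the head of the parent's child list.
        child = NewNode(c);
        next_sibling[child] = first_child[node];
        first_child[node] = child;
      }
      node = child;
    }
    terminal[node] = 1;
  }

  bool Contains(const std::string& word) const {
    int32_t node = 0;
    for (char c : word) {
      node = FindChild(node, c);
      if (node == kNone) return false;
    }
    return terminal[node] != 0;
  }

  std::size_t NodeCount() const { return label.size(); }
};
```

Inserting "tea", "ten", and "to" shares the "t" and "te" prefixes, so only six rows are stored including the root. Whether a layout like this would actually pay off for real text workloads is exactly the kind of thing that would need to be validated by experiment.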



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)