Posted to issues@arrow.apache.org by "Antoine Pitrou (JIRA)" <ji...@apache.org> on 2019/04/17 15:26:00 UTC

[jira] [Commented] (ARROW-4753) Support optionally, and as an extension, an encoding layout for text-optimized data structures

    [ https://issues.apache.org/jira/browse/ARROW-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820195#comment-16820195 ] 

Antoine Pitrou commented on ARROW-4753:
---------------------------------------

Honestly I think this might be better as a separate project or library, where domain experts can freely devise and iterate on the best algorithms and data structures. I'm skeptical about standardizing layouts for Tries or other specialized structures at the Arrow project level, especially since it's likely there are many variants to choose from.

[~wesmckinn]

> Support optionally, and as an extension, an encoding layout for text-optimized data structures
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-4753
>                 URL: https://issues.apache.org/jira/browse/ARROW-4753
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>         Environment: C/C++
>            Reporter: Edmon Begoli
>            Priority: Minor
>              Labels: features
>
> Narrative text is, by default, notoriously inefficient to store on disk or in memory. In its most basic form, it is a long sequence of bytes with no indexing or other optimized layout structure.
>   
>  There are data structures such as [tries|https://en.wikipedia.org/wiki/Trie], [DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton], or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more efficient storage and lookup of phrases. 
>   
>  We would like to enable Arrow to serialize to and from these efficient structures, so that it can serve as the format/carrier between high-performance text processing steps that operate on binary data structures (lookups, spellers, or more advanced NLP routines).
>   
>  So, it could be something like:
>   
>  text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow  // writes arrow as format for the specified encoding. This could be implicit if we could store encoding in some kind of manifest
>   
>  arrow.to_text(infer=true|dafsa|trie|b-trie) : string  // restores text from the arrow format, and from a specified encoding, same as above.
>   
>  On the dev mailing list we are discussing the creation of a contrib folder where such features could be optionally included in Arrow.
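
For illustration only, here is one way the round trip proposed in the description could look for a plain trie. This is a minimal sketch under stated assumptions, not an Arrow API and not a proposed layout: the function names (words_to_trie_columns, trie_columns_to_words) and the three-column node layout (parent, label, terminal) are hypothetical, and plain Python lists stand in for what would be columnar arrays.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical column layout: one row per trie node, node 0 is the root.
TrieColumns = Tuple[List[int], List[str], List[bool]]


def words_to_trie_columns(words: Iterable[str]) -> TrieColumns:
    """Build a trie from `words` and flatten it into three parallel lists.

    parent[i]   -- index of node i's parent (-1 for the root)
    label[i]    -- character on the edge leading into node i ("" for the root)
    terminal[i] -- True if a word ends at node i
    """
    parent: List[int] = [-1]
    label: List[str] = [""]
    terminal: List[bool] = [False]
    # Child lookup tables are only needed while building; they are not stored.
    children: List[Dict[str, int]] = [{}]

    for word in words:
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                # Append a new node row to every column.
                nxt = len(parent)
                parent.append(node)
                label.append(ch)
                terminal.append(False)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        terminal[node] = True
    return parent, label, terminal


def trie_columns_to_words(parent: List[int], label: List[str],
                          terminal: List[bool]) -> List[str]:
    """Restore the sorted set of words by walking parent links to the root."""
    words = []
    for i, is_end in enumerate(terminal):
        if not is_end:
            continue
        chars = []
        node = i
        while node != 0:
            chars.append(label[node])
            node = parent[node]
        words.append("".join(reversed(chars)))
    return sorted(words)


columns = words_to_trie_columns(["tea", "ten", "to", "tea"])
restored = trie_columns_to_words(*columns)
```

Shared prefixes ("t", "te") are stored once, and duplicate words collapse into a single terminal node. A DAFSA or b-trie would need a different node layout, which is part of why the number of possible variants matters for any standardization.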


