Posted to jira@arrow.apache.org by "Dominik Moritz (Jira)" <ji...@apache.org> on 2021/04/25 18:39:00 UTC

[jira] [Commented] (ARROW-10220) [JS] Cache javascript utf-8 dictionary keys?

    [ https://issues.apache.org/jira/browse/ARROW-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331610#comment-17331610 ] 

Dominik Moritz commented on ARROW-10220:
----------------------------------------

I like the idea of caching the strings and I'm happy to help with TypeScript. You can join the #arrow-js channel on the-asf.slack.com. 

> [JS] Cache javascript utf-8 dictionary keys?
> --------------------------------------------
>
>                 Key: ARROW-10220
>                 URL: https://issues.apache.org/jira/browse/ARROW-10220
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>    Affects Versions: 1.0.1
>            Reporter: Ben Schmidt
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> String decoding from Arrow tables is a major bottleneck when using Arrow in JavaScript: it can take a second to decode a million rows. For plain UTF-8 types, I'm not sure what could be done, but some memoization would help UTF-8 dictionary types.
> Currently, the JavaScript implementation decodes a UTF-8 string every time you request an item from a dictionary with UTF-8 data. If Arrow cached the decoded strings in a native JS Map, routine operations like looping over all the entries in a text column might be on the order of 10x faster. Here's an Observable notebook [benchmarking that and a couple of other strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking].
> I would file a pull request, but 1) I would have to learn some TypeScript to do so, and 2) this idea may be undesirable because it creates new objects that increase the memory footprint of a table, rather than relying only on the typed arrays.
> Some discussion of how the real-world issues here affect the Arquero project is [here|https://github.com/uwdata/arquero/issues/1].
>  
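
For illustration, the caching idea from the issue description might be sketched roughly as below. This is not the Arrow JS implementation or its API: Utf8Dictionary, decodeEntry, CachedDictionary, and buildDictionary are hypothetical names, and the offsets-plus-bytes layout is only assumed to resemble how a UTF-8 dictionary is stored. Only the standard TextEncoder, TextDecoder, and Map are used.

```typescript
// Hypothetical stand-in for the dictionary of a dictionary-encoded UTF-8
// column. Not the Arrow JS API; it only mimics an offsets + bytes layout.
interface Utf8Dictionary {
  offsets: Int32Array; // offsets[i]..offsets[i + 1] delimit entry i in `bytes`
  bytes: Uint8Array; // concatenated UTF-8 bytes of all dictionary entries
}

const decoder = new TextDecoder("utf-8");

// Uncached path: decodes the bytes on every call, which is the per-access
// cost the issue describes.
function decodeEntry(dict: Utf8Dictionary, key: number): string {
  return decoder.decode(dict.bytes.subarray(dict.offsets[key], dict.offsets[key + 1]));
}

// Cached path: each dictionary key is decoded at most once and the resulting
// string is kept in a Map (this is the extra memory the issue mentions).
class CachedDictionary {
  private dict: Utf8Dictionary;
  private cache = new Map<number, string>();
  decodeCount = 0; // counts real decodes, for demonstration only

  constructor(dict: Utf8Dictionary) {
    this.dict = dict;
  }

  get(key: number): string {
    let value = this.cache.get(key);
    if (value === undefined) {
      value = decodeEntry(this.dict, key);
      this.decodeCount++;
      this.cache.set(key, value);
    }
    return value;
  }
}

// Helper for the example: packs strings into the hypothetical layout above.
function buildDictionary(values: string[]): Utf8Dictionary {
  const encoder = new TextEncoder();
  const encoded = values.map((v) => encoder.encode(v));
  const offsets = new Int32Array(values.length + 1);
  encoded.forEach((chunk, i) => {
    offsets[i + 1] = offsets[i] + chunk.length;
  });
  const bytes = new Uint8Array(offsets[values.length]);
  encoded.forEach((chunk, i) => bytes.set(chunk, offsets[i]));
  return { offsets, bytes };
}

// Usage: six row lookups, but only three distinct keys are ever decoded.
const exampleDict = new CachedDictionary(buildDictionary(["apple", "banana", "cherry"]));
const rowKeys = [0, 1, 0, 2, 1, 0];
const rowValues = rowKeys.map((k) => exampleDict.get(k));
```

The trade-off is the one raised in the issue: the Map holds one JavaScript string per distinct key that has been read, so memory grows with the dictionary size rather than with the number of rows.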


