Posted to jira@arrow.apache.org by "Paul Taylor (Jira)" <ji...@apache.org> on 2021/02/05 06:08:00 UTC

[jira] [Commented] (ARROW-10450) [Javascript] Table.fromStruct() silently truncates vectors to the first chunk

    [ https://issues.apache.org/jira/browse/ARROW-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279371#comment-17279371 ] 

Paul Taylor commented on ARROW-10450:
-------------------------------------

Yeah, this is unfortunately a tricky spot with the current Chunked vectors. The `.data` getter on Chunked returns only the data field of the first chunk. Table.fromStruct() doesn't expect a ChunkedVector as input; it expects a single-chunk StructVector.
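To make the failure mode concrete, here is a minimal sketch in plain JavaScript. It is a toy model only: `ToyChunked` and `toyFromStruct` are hypothetical stand-ins that I made up for illustration, not the apache-arrow classes, and the real implementation differs in detail. It only shows how a `data` getter that exposes the first chunk leads a single-chunk consumer to drop the rest.

```javascript
// Toy model only: ToyChunked and toyFromStruct are hypothetical stand-ins,
// not the apache-arrow classes.
class ToyChunked {
  constructor(chunks) {
    this.chunks = chunks; // each chunk: { length, values }
  }
  // Total length across all chunks.
  get length() {
    return this.chunks.reduce((n, c) => n + c.length, 0);
  }
  // Mirrors the behavior described above: only the first chunk is exposed.
  get data() {
    return this.chunks[0];
  }
}

// A consumer that assumes a single chunk and reads only `.data`.
function toyFromStruct(vector) {
  const { length, values } = vector.data;
  return { length, values };
}

const rows = Array.from({ length: 1500 }, (_, i) => ({ over: i, out: i % 2 === 0 }));
const chunked = new ToyChunked([
  { length: 1000, values: rows.slice(0, 1000) },
  { length: 500, values: rows.slice(1000) },
]);
const toyTable = toyFromStruct(chunked);

console.log(chunked.length);  // 1500
console.log(toyTable.length); // 1000
```

Nothing throws in this model, which is why the truncation is silent: the consumer never learns that a second chunk exists.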

Your `Vector.from({values: <JS objects>})` call runs those JS objects through the Arrow Struct Builder, which serializes them into binary Arrow vectors.

The `highWaterMark` defaults to 1000 to avoid the case where someone tries to serialize lots of data, and the builder has to grow allocations past the 2GB limit. Builder internal buffers grow geometrically, so this is relatively easy to do with strings.
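A rough sketch of the chunking effect, again in plain JavaScript. `toyChunk` is a hypothetical helper, not an apache-arrow API, and it assumes the builder flushes a chunk once it holds `highWaterMark` values, which matches the 1000/1500 lengths in the report but glosses over the builder's real buffer management.

```javascript
// Hypothetical helper, not an apache-arrow API: split values into chunks,
// flushing whenever the current chunk reaches highWaterMark entries.
function toyChunk(values, highWaterMark = 1000) {
  const chunks = [];
  let current = [];
  for (const v of values) {
    current.push(v);
    if (current.length >= highWaterMark) {
      chunks.push(current);
      current = [];
    }
  }
  // Flush whatever is left over as the final chunk.
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const values = Array.from({ length: 1500 }, (_, i) => i);
const split = toyChunk(values).map((c) => c.length);
const whole = toyChunk(values, Infinity).map((c) => c.length);

console.log(split); // chunk lengths 1000 and 500
console.log(whole); // a single chunk of length 1500
```

With the default limit, 1500 values land in two chunks; with `Infinity` they stay in one, which is why that setting works around the truncation at the cost of one large allocation.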

As you noted, you don't run into this issue when you do `Table.new()`, because that method expects that its input may be split across multiple chunks. The only downside is that you now have a Table of struct of fields, rather than a Table of fields.
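The difference between the two shapes can be illustrated with plain objects. This is only a sketch of the idea: `toyFlatten` and the column names are hypothetical, and none of this is apache-arrow API. It shows one struct-valued column versus one column per struct field, with the per-field columns gathered across every chunk rather than just the first.

```javascript
// Plain-object illustration; none of these names are apache-arrow APIs.
const chunks = [
  [{ over: 0, out: true }, { over: 1, out: false }],
  [{ over: 2, out: true }],
];

// "Table of struct of fields": one column whose values are whole structs.
const tableOfStruct = { structColumn: chunks.flat() };

// "Table of fields": one column per struct field, gathered across every chunk.
function toyFlatten(rowChunks, fieldNames) {
  const columns = {};
  for (const name of fieldNames) {
    columns[name] = rowChunks.flatMap((chunk) => chunk.map((row) => row[name]));
  }
  return columns;
}

const tableOfFields = toyFlatten(chunks, ["over", "out"]);

console.log(Object.keys(tableOfStruct)); // one key: structColumn
console.log(tableOfFields.over);         // values 0, 1, 2
console.log(tableOfFields.out);          // values true, false, true
```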

> [Javascript] Table.fromStruct() silently truncates vectors to the first chunk
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10450
>                 URL: https://issues.apache.org/jira/browse/ARROW-10450
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: JavaScript
>    Affects Versions: 2.0.0
>            Reporter: David Saslawsky
>            Priority: Minor
>
> Table.fromStruct() only uses the first chunk from the input vector.
> {code:javascript}
> import { Bool, Field, Int32, Struct, Table, Vector } from "apache-arrow";
> const myStruct = new Struct([
>   Field.new({ name: "over", type: new Int32() }),
>   Field.new({ name: "out", type: new Bool() })
> ]);
> const data = [];
> for (let i = 0; i < 1500; i++) {
>   data.push({ over: i, out: i % 2 === 0 });
> }
> // create a vector with two chunks
> const victor = Vector.from({
>   type: myStruct,
>   /*highWaterMark: Infinity,*/
>   values: data
> });
> console.log(victor.length);  // 1500 
> const table = Table.fromStruct(victor);
> console.log(table.length);   // 1000
> {code}
> The workaround is to set highWaterMark to Infinity.
>  
> Table.new() works as expected
> {code:javascript}
> const int32Array = new Int32Array(1500);
> for (let i = 0; i < 1500; i++) int32Array[i] = i;
> const intVector = Vector.from({ type: new Int32(), values: int32Array });
> console.log(intVector.length);  // 1500
> const intTable = Table.new({ intColumn: intVector });
> console.log(intTable.length);   // 1500
> {code}
>  
> The origin seems to be in Chunked.data() but I don't understand the code enough to propose a fix.


