Posted to github@arrow.apache.org by "domoritz (via GitHub)" <gi...@apache.org> on 2023/05/26 09:00:00 UTC

[GitHub] [arrow] domoritz opened a new pull request, #35780: GH-15060: [JS] Add LargeUtf8 type

domoritz opened a new pull request, #35780:
URL: https://github.com/apache/arrow/pull/35780

   Fixes #15060.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1847413986

   @kylebarron @bmschmidt @FrNecas This pull request is ready for review. I will look into actually supporting large value buffers in a follow-up pull request. For now, people can chunk vectors as we discussed, and at least we have support for the new type. @trxcllnt is currently out, so it would be great if you could review this pull request so I can merge it. I will work on LargeBinary in a follow-up pull request, but it will be easier to have this merged first.
   
   <details>
   <summary>
   Test scripts I played with to debug this code (just to have it saved somewhere).
   </summary>
   ```ts
   import { constants } from 'node:buffer';
   import { vectorFromArray, LargeUtf8, Utf8, makeBuilder, Table, tableToIPC, RecordBatchJSONWriter, tableFromIPC, Vector, Data, Int32, makeTable } from './src/Arrow.dom.ts';
   
   console.log(`buffer.constants.MAX_LENGTH: ${constants.MAX_LENGTH.toLocaleString()}`)
   console.log(`buffer.constants.MAX_STRING_LENGTH: ${constants.MAX_STRING_LENGTH.toLocaleString()}`)
   console.log(`Number.MAX_SAFE_INTEGER: ${Number.MAX_SAFE_INTEGER.toLocaleString()}`)
   console.log(`2**32: ${(2 ** 32).toLocaleString()}`)
   
   // const roundLengthUpToNearest64Bytes_1 = (len: number, BPE: number) => ((((len * BPE) + 63) & ~63) || 64) / BPE;
   // function roundLengthUpToNearest64Bytes_2(len: number, BPE: number) {
   //     const bytesMinus1 = len * BPE - 1;
   //     return ((bytesMinus1 - bytesMinus1 % 64 + 64) || 64) / BPE;
   // }
   
   // const roundLengthUpToNearest64Bytes = (len: number, BPE: number) => (round64(len * BPE) || 64) / BPE;
   // const roundLengthUpToNearest64Bytes = (len: number, BPE: number) => len + 64 - 1 - (len + 63) % 64;
   
   // const round64_1 = (num: number) => Math.ceil(num / 64) * 64
   // const round64_2 = (num: number) => ((num) + 63) & ~63
   // const round64_3 = (num: number) => num + 63 - (num + 63) % 64;
   // const round64_4 = (num: number) => num - 1 - (num - 1) % 64 + 64;
   // const round64_5 = (num: number) => ((num + 63) >> 6) << 6;
   
   // {
   //     for (const f of [round64_1, round64_2, round64_3, round64_4, round64_5]) {
   //         console.log(f)
   //         console.log(f(-1))
   //         console.log(f(0))
   //         console.log(f(1))
   //         console.log(f(2))
   //         console.log(f(63))
   //         console.log(f(64))
   //         console.log(f(65))
   //         console.log(f(2 ** 16))
   //         console.log(f(2 ** 32))
   //         console.log(f(2 ** 42))
   //         console.log(f(Number.MAX_SAFE_INTEGER / 100))
   //         console.log(f(Number.MAX_SAFE_INTEGER))
   //         console.log(f(Number.MAX_SAFE_INTEGER * 10))
   //     }
   // }
   
   // {
   //     const compare = (v, b) => roundLengthUpToNearest64Bytes_1(v, b) === roundLengthUpToNearest64Bytes_2(v, b);
   
   //     for (const b of [1, 2]) {
   //         console.log(`b: ${b}`);
   //         console.log(compare(-1, b));
   //         console.log(compare(0, b));
   //         console.log(compare(1, b));
   //         console.log(compare(2, b));
   //         console.log(compare(63, b));
   //         console.log(compare(64, b));
   //         console.log(compare(65, b));
   //         console.log(compare(128, b));
   //         console.log(compare(672396, b));
   //         console.log(compare(2 ** 20, b));
   //         // console.log(compare(Number.MAX_SAFE_INTEGER / 100, b));
   //     }
   // }
   
   // {
   //     for (const f of [roundLengthUpToNearest64Bytes_1, roundLengthUpToNearest64Bytes_2]) {
   //         console.log(f)
   
   //         console.log("1 byte");
   //         console.log(f(0, 1));
   //         console.log(f(1, 1));
   //         console.log(f(2, 1));
   //         console.log(f(63, 1));
   //         console.log(f(64, 1));
   //         console.log(f(65, 1));
   //         console.log(f(2 ** 20, 1));
   //         console.log(f(Number.MAX_SAFE_INTEGER / 100, 1));
   
   //         console.log("2 bytes");
   //         console.log(f(0, 2));
   //         console.log(f(1, 2));
   //         console.log(f(2, 2));
   //         console.log(f(63, 2));
   //         console.log(f(64, 2));
   //         console.log(f(65, 2));
   //         console.log(f(2 ** 20, 2));
   //         console.log(f(Number.MAX_SAFE_INTEGER / 100, 2));
   //     }
   // }
   
   // {
   //     const utf8Vector = vectorFromArray(["foo", "bar", "baz"], new Utf8);
   //     const largeUtf8Vector = vectorFromArray(["foo", "bar", "baz"], new LargeUtf8);
   
   //     console.log(largeUtf8Vector);
   
   //     console.log(largeUtf8Vector.toArray());
   //     console.log(utf8Vector.toArray());
   
   //     const table = new Table({ utf8Vector, largeUtf8Vector });
   //     const writer = RecordBatchJSONWriter.writeAll(table);
   //     const string = await writer.toString();
   
   //     // JSON serialization
   //     // console.log(string);
   
   //     const table2 = tableFromIPC(JSON.parse(string))
   //     console.log(table2.toString())
   
   //     const table3 = tableFromIPC(tableToIPC(table))
   //     console.log(table3.toString())
   // }
   
   // {
   //     // try putting a lot of strings in a map (works in node but not bun)
   //     const N = 1e4;
   
   //     const map = new Map();
   //     for (let i = 0; i < N; i++) {
   //         const longString = "a".repeat(constants.MAX_STRING_LENGTH);
   //         map.set(i, longString);
   //     }
   
   //     console.log(`Total characters in map: ${(map.size * constants.MAX_STRING_LENGTH).toLocaleString()}`);
   // }
   
   // {
   //     const builder = makeBuilder({ type: new LargeUtf8 });
   
   //     const string = "hello world";
   
   //     builder.append(string);
   
   //     console.log(builder.finish().toVector());
   // }
   
   
   {
       // table with two columns with different chunk lengths
   
       const b1 = makeBuilder({ type: new Int32 });
       const b2 = makeBuilder({ type: new Int32 });
   
       const d1 = [] as Data[];
       const d2 = [] as Data[];
   
       b1.append(1);
       b1.append(2);
       d1.push(b1.flush());
       b1.append(3);
       b2.append(4);
       d1.push(b1.flush());
   
       b2.append(5);
       d2.push(b2.flush());
       b2.append(6);
       b2.append(7);
       b2.append(8);
       d2.push(b2.flush());
   
       const v1 = new Vector(d1);
       const v2 = new Vector(d2);
   
       const table = new Table({ v1, v2 });
   
       console.log(table.batches.map((x) => x.toArray()));
   
       // console.log(table.toArray());
       // console.log(table.data);
   }
   
   // {
   //     const longString = "a".repeat(constants.MAX_STRING_LENGTH);
   
   //     const builder = makeBuilder({ type: new LargeUtf8 });
   
   //     const data = [] as Data[];
   //     // const vectors = [] as Vector<any>[];
   
   //     builder.append(longString);
   //     builder.append(longString);
   //     builder.append(longString);
   //     // vectors.push(builder.toVector());
   //     data.push(builder.flush());
   
   //     builder.append(longString);
   //     builder.append(longString);
   //     builder.append(longString);
   //     // vectors.push(builder.toVector());
   //     data.push(builder.flush());
   
   //     console.log(new Vector(data));
   // }
   
   {
       // make the longest possible string (longer will crash with `RangeError: Invalid string length`)
       const longString = "a".repeat(constants.MAX_STRING_LENGTH);
   
       const builder = makeBuilder({ type: new LargeUtf8 });
   
       // length of vector
       const N = 8;
   
       console.log(`Building vector with total string length (potential need for offsets): ${(constants.MAX_STRING_LENGTH * N).toLocaleString()} or <= 2^${Math.ceil(Math.log2(constants.MAX_STRING_LENGTH * N))}`);
   
       for (let i = 0; i < N; i++) {
           builder.append(longString);
       }
       // add string to force another offset
       builder.append("");
       const vector = builder.finish().toVector();
   
       console.log(vector.length);
       console.log(vector.byteLength.toLocaleString());
   
       // const table = new Table({ strings: vector });
       // console.log(table);
   
       // const ipc = tableToIPC(table);
       // console.log(ipc);
   
   
       // console.log(vector.toArray());
   }
   
   // {
   //     const builder = makeBuilder({ type: new LargeUtf8 });
   
   //     const string = "a".repeat(1e5);
   //     const N = 1e7;
   
   //     for (let i = 0; i < N; i++) {
   //         builder.append(string);
   //     }
   //     const vector = builder.toVector();
   
   //     console.log(vector);
   // }
   ```
   </details>




[GitHub] [arrow] kylebarron commented on a diff in pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1297704621


##########
js/src/enum.ts:
##########
@@ -201,6 +201,7 @@ export enum Type {
     SparseUnion = -24,
     IntervalDayTime = -25,
     IntervalYearMonth = -26,
+    LargeUtf8 = -27,

Review Comment:
   >  * **Note**: Only enum values 0-17 (NONE through Map) are written to an Arrow IPC payload.
   
   Seems like `LargeUtf8` would be given a positive number then





[GitHub] [arrow] domoritz commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1729410405

   Yes. We need a buffer that supports lengths beyond what JS supports natively.
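   A minimal sketch of the limit in question (illustrative only, not the actual library internals): 32-bit offsets silently wrap, 64-bit ones don't.

   ```ts
   // Utf8 offsets live in an Int32Array, which wraps past 2^31 - 1.
   const offsets32 = new Int32Array(1);
   offsets32[0] = 2 ** 31; // stored as -2147483648

   // LargeUtf8 offsets live in a BigInt64Array, where the same value fits.
   const offsets64 = new BigInt64Array(1);
   offsets64[0] = 2n ** 31n; // stored as 2147483648n

   console.log(offsets32[0], offsets64[0]);
   ```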




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1843314450

   > Maybe you're right that this is effectively the same as what pyarrow is doing. It's hard to tell if those batches are views on the existing buffers
   
   What do "those batches" denote here? I might be able to answer if you phrase the question more precisely :-)




[GitHub] [arrow] domoritz commented on a diff in pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1235281770


##########
js/src/builder.ts:
##########
@@ -289,7 +290,7 @@ export abstract class Builder<T extends DataType = any, TNull = any> {
             valueOffsets = _offsets?.flush(length);
         } else if (valueOffsets = _offsets?.flush(length)) { // Variable-width primitives (Binary, Utf8), and Lists
             // Binary, Utf8
-            data = _values?.flush(_offsets.last());
+            data = _values?.flush(Number(_offsets.last()));

Review Comment:
   Inevitable because JS supports indexing with numbers only. 
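   For example (a sketch, assuming standard typed-array semantics):

   ```ts
   const values = new Uint8Array(16);
   const lastOffset: bigint = 12n; // what a 64-bit offsets buffer returns

   // Passing the bigint directly, e.g. values.subarray(0, lastOffset),
   // would throw "TypeError: Cannot convert a BigInt to a number",
   // so the offset has to be downcast with Number() first.
   const flushed = values.subarray(0, Number(lastOffset));
   console.log(flushed.length); // 12
   ```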





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1412871892


##########
js/src/type.ts:
##########
@@ -39,9 +39,11 @@ export type IsSigned = { 'true': true; 'false': false };
 export interface DataType<TType extends Type = Type, TChildren extends TypeMap = any> {
     readonly TType: TType;
     readonly TArray: any;
+    readonly TOffset: any;
     readonly TValue: any;
     readonly TChildren: TChildren;
     readonly ArrayType: any;
+    readonly OffsetType: TypedArrayConstructor<Int32Array> | BigIntArrayConstructor<BigInt64Array>;

Review Comment:
   Changed





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839849622

   Hmm, then we can't really automatically flush in the vector builder. That's a bummer. Maybe then I need to implement a chunked buffer that can satisfy the `DataBuffer` type. I need to look at what implications that may have. 
   
   Maybe @trxcllnt has some suggestion here. 




[GitHub] [arrow] kylebarron commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1729790334

   I was talking with @trxcllnt about this a couple weeks ago. One use case for the `LargeUtf8` type is to support zero-copy operations with webassembly, which is separate from the use case of actually handling string buffers beyond 2^32. In https://github.com/kylebarron/arrow-js-ffi I currently support large types by asserting (well, right now assuming) that the values buffer is smaller than 2^32 and copying/downcasting the offsets into an `Int32Array`, but it would be _nice_ if that would work without having to downcast.
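   The downcast can be sketched like this (`downcastOffsets` is a hypothetical helper for illustration, not the actual arrow-js-ffi API):

   ```ts
   function downcastOffsets(offsets64: BigInt64Array): Int32Array {
       // Only valid while the last offset (= values buffer length) fits in
       // a signed 32-bit integer; otherwise the downcast would corrupt data.
       const last = offsets64.length > 0 ? offsets64[offsets64.length - 1] : 0n;
       if (last > 0x7fffffffn) {
           throw new RangeError("values buffer too large for 32-bit offsets");
       }
       const offsets32 = new Int32Array(offsets64.length);
       for (let i = 0; i < offsets64.length; i++) {
           offsets32[i] = Number(offsets64[i]);
       }
       return offsets32;
   }
   ```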




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1846236787

   > The only issue I can see with this approach is that in the case of variable width vectors (like utf8)---where we have a data buffer where we might reference the same position multiple times from the offset buffer
   
   This is impossible given the [layout](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) of the binary types.
   It is possible for the [binary view](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout) types, though.
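   Concretely (a sketch of the invariant, not library code): in the variable-size layout, value `i` owns the half-open byte range `[offsets[i], offsets[i+1])`, and offsets must be non-decreasing, so two values can never reference the same data position.

   ```ts
   // Check the offsets invariant of the variable-size binary layout.
   function offsetsAreMonotonic(offsets: BigInt64Array): boolean {
       for (let i = 1; i < offsets.length; i++) {
           if (offsets[i] < offsets[i - 1]) return false;
       }
       return true;
   }

   console.log(offsetsAreMonotonic(new BigInt64Array([0n, 3n, 3n, 9n]))); // true
   ```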
   




[GitHub] [arrow] pitrou commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1572608124

   Why not add `LargeBinary` at the same time?




[GitHub] [arrow] github-actions[bot] commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1564056308

   :warning: GitHub issue #15060 **has been automatically assigned in GitHub** to PR creator.




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "geclos (via GitHub)" <gi...@apache.org>.
geclos commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1825830391

   Hi,
   
   Very interested in this PR moving forward. What is the status? Any way I could be of help?




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1413211156


##########
js/src/enum.ts:
##########
@@ -137,8 +137,7 @@ export enum MessageHeader {
  * nested type consisting of other data types, or another data type (e.g. a
  * timestamp encoded as an int64).
  *
- * **Note**: Only enum values 0-18 (NONE through Duration) are written to an Arrow
- * IPC payload.
+ * **Note**: Only positive enum values are written to an Arrow IPC payload.

Review Comment:
   ```suggestion
    * **Note**: Only non-negative enum values are written to an Arrow IPC payload.
   ```
   Maybe this is pedantic; does `0` get written to IPC as the original comment suggested?





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1428641669


##########
js/src/enum.ts:
##########
@@ -175,6 +174,7 @@ export enum Type {
     FixedSizeList = 16, /** Fixed-size list. Each value occupies the same number of bytes */
     Map = 17, /** Map of named logical types */
     Duration = 18, /** Measure of elapsed time in either seconds, milliseconds, microseconds or nanoseconds. */
+    LargeUtf8 = 20, /** Large variable-length string as List<Char> */

Review Comment:
   Did you skip 19 on purpose?



##########
js/src/util/buffer.ts:
##########
@@ -208,10 +208,12 @@ export async function* toArrayBufferViewAsyncIterator<T extends TypedArray>(Arra
 /** @ignore */ export const toUint8ClampedArrayAsyncIterator = (input: ArrayBufferViewAsyncIteratorInput) => toArrayBufferViewAsyncIterator(Uint8ClampedArray, input);
 
 /** @ignore */
-export function rebaseValueOffsets(offset: number, length: number, valueOffsets: Int32Array) {
+export function rebaseValueOffsets(offset: number, length: number, valueOffsets: Int32Array): Int32Array;
+export function rebaseValueOffsets(offset: bigint, length: number, valueOffsets: BigInt64Array): BigInt64Array;
+export function rebaseValueOffsets(offset: number | bigint, length: number, valueOffsets: any) {
     // If we have a non-zero offset, create a new offsets array with the values
     // shifted by the start offset, such that the new start offset is 0
-    if (offset !== 0) {
+    if (offset != 0) {

Review Comment:
   Is it necessary to compare the bigint to a number here?





[GitHub] [arrow] github-actions[bot] commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1564056247

   * Closes: #15060




[GitHub] [arrow] domoritz commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1572611166

   > Why not add LargeBinary at the same time?
   
   I will look at it when I get this working. I just need some help wrapping it up. 




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1428807127


##########
js/src/util/buffer.ts:
##########
@@ -208,10 +208,12 @@ export async function* toArrayBufferViewAsyncIterator<T extends TypedArray>(Arra
 /** @ignore */ export const toUint8ClampedArrayAsyncIterator = (input: ArrayBufferViewAsyncIteratorInput) => toArrayBufferViewAsyncIterator(Uint8ClampedArray, input);
 
 /** @ignore */
-export function rebaseValueOffsets(offset: number, length: number, valueOffsets: Int32Array) {
+export function rebaseValueOffsets(offset: number, length: number, valueOffsets: Int32Array): Int32Array;
+export function rebaseValueOffsets(offset: bigint, length: number, valueOffsets: BigInt64Array): BigInt64Array;
+export function rebaseValueOffsets(offset: number | bigint, length: number, valueOffsets: any) {
     // If we have a non-zero offset, create a new offsets array with the values
     // shifted by the start offset, such that the new start offset is 0
-    if (offset !== 0) {
+    if (offset != 0) {

Review Comment:
   Yes. This works for both number and BigInt.
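   For instance (sketch):

   ```ts
   function isZeroOffset(offset: number | bigint): boolean {
       // Loose equality compares bigint and number by value;
       // strict `offset === 0` would be false for 0n.
       return offset == 0;
   }

   console.log(isZeroOffset(0));  // true
   console.log(isZeroOffset(0n)); // true
   ```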





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "geclos (via GitHub)" <gi...@apache.org>.
geclos commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1859797631

   Thanks for your work guys!




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1842106179

   Alright, small distraction along the way. I found that the rounding for buffer sizes was not working for large numbers, so we failed to even get close to the 2**32 limit. I fixed the method, so now we can make vectors that are about twice as large as before: https://github.com/apache/arrow/pull/35780/commits/6949b6c78bb2e728e9fd77507a5aa5f8a9b8714f. Next up: finding a way to not make such large buffers.
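   The failure mode, sketched with hypothetical helper names (not the actual commit):

   ```ts
   // JS bitwise operators coerce their operands to 32-bit integers, so the
   // old rounding trick wraps to nonsense once the byte length nears 2^31.
   const roundUp64Bitwise = (n: number) => (n + 63) & ~63;
   // Math.ceil stays exact up to Number.MAX_SAFE_INTEGER.
   const roundUp64 = (n: number) => Math.ceil(n / 64) * 64;

   console.log(roundUp64Bitwise(2 ** 32)); // 0 -- wrapped
   console.log(roundUp64(2 ** 32));        // 4294967296
   ```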




[GitHub] [arrow] domoritz commented on a diff in pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1225888237


##########
js/src/enum.ts:
##########
@@ -201,6 +201,7 @@ export enum Type {
     SparseUnion = -24,
     IntervalDayTime = -25,
     IntervalYearMonth = -26,
+    LargeUtf8 = -27,

Review Comment:
   Is this right?





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1412871839


##########
js/src/enum.ts:
##########
@@ -201,6 +201,7 @@ export enum Type {
     SparseUnion = -24,
     IntervalDayTime = -25,
     IntervalYearMonth = -26,
+    LargeUtf8 = -27,

Review Comment:
   Fixed it but we should probably do https://github.com/apache/arrow/issues/39048. 





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1837209620

   Looks like it's pretty easy to exceed the buffer sizes, though.
   
   ```ts
   import { constants } from 'node:buffer';
   import { LargeUtf8, makeBuilder } from './src/Arrow.dom.ts';
   
   // make the longest possible string (longer will crash with `RangeError: Invalid string length`)
   const MAX_STRING_LENGTH = constants.MAX_STRING_LENGTH;
   console.log(`MAX_STRING_LENGTH: ${MAX_STRING_LENGTH}`);
   const longString = "a".repeat(MAX_STRING_LENGTH);
   
   const builder = makeBuilder({ type: new LargeUtf8 });
   
   // length of vector
   const N = 5;
   
   console.log(`Building vector with total string length (potential need for offsets): ${MAX_STRING_LENGTH * N} or 2^${Math.ceil(Math.log2(MAX_STRING_LENGTH * N))}`);
   
   for (let i = 0; i < N; i++) {
       builder.append(longString);
   }
   // add string to force another offset
   builder.append("");
   builder.finish();
   const vector = builder.toVector();
   
   console.log(vector);
   ```




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1837757137

   Sounds good. Thanks for the input. 
   
   Vectors are chunked but buffers in builders are not. So I think I want to add a chunked buffer or figure out how to flush to vectors when the buffer is full.




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1841553824

   Is that a problem? Looking at the C++ docs at https://arrow.apache.org/docs/cpp/api/array.html#classarrow_1_1_chunked_array it seems like it should be okay if chunk lengths are not consistent. 
   
   [Screenshot of the linked Arrow C++ ChunkedArray documentation]
   
   Using a vector with chunked data would be a lot cleaner than e.g. implementing a chunked buffer (then we would have a chunked data buffer inside a vector that could also be chunked).




[GitHub] [arrow] domoritz commented on a diff in pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1240068754


##########
js/src/type.ts:
##########
@@ -39,9 +39,11 @@ export type IsSigned = { 'true': true; 'false': false };
 export interface DataType<TType extends Type = Type, TChildren extends TypeMap = any> {
     readonly TType: TType;
     readonly TArray: any;
+    readonly TOffset: any;
     readonly TValue: any;
     readonly TChildren: TChildren;
     readonly ArrayType: any;
+    readonly OffsetType: TypedArrayConstructor<Int32Array> | BigIntArrayConstructor<BigInt64Array>;

Review Comment:
   Maybe this should be `OffsetArrayType`?





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839842156

   But then you call
   
   ```ts
   table.batches[0].numRows
   // 1
   ```
   
   Implicitly, a batch has a constant number of rows across all `Data` instances.
   
   When I test it in Python, surprisingly the `Table` constructor _allows_ it, but then when you call `to_batches` it reaggregates chunks into consistent record batches:
   
   ```py
   import pyarrow as pa
   
   d1 = [
       pa.array([1, 2], pa.int32()),
       pa.array([3], pa.int32()),
   ]
   d2 = [
       pa.array([4], pa.int32()),
       pa.array([5, 6], pa.int32()),
   ]
   
   c1 = pa.chunked_array(d1)
   c2 = pa.chunked_array(d2)
   table = pa.table({"v1": c1, "v2": c2})
   table
   # pyarrow.Table
   # v1: int32
   # v2: int32
   # ----
   # v1: [[1,2],[3]]
   # v2: [[4],[5,6]]
   
   # to_batches regroups into batches of length 1
   table.to_batches()
   # [pyarrow.RecordBatch
   #  v1: int32
   #  v2: int32
   #  ----
   #  v1: [1]
   #  v2: [4],
   #  pyarrow.RecordBatch
   #  v1: int32
   #  v2: int32
   #  ----
   #  v1: [2]
   #  v2: [5],
   #  pyarrow.RecordBatch
   #  v1: int32
   #  v2: int32
   #  ----
   #  v1: [3]
   #  v2: [6]]
   ```
   




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1843288624

   > Is that a problem? Looking at the C++ docs at https://arrow.apache.org/docs/cpp/api/array.html#classarrow_1_1_chunked_array it seems like it should be okay if chunk lengths are not consistent.
   
   It's not the ChunkedArray you want to look at. That indeed allows each `Data` instance within a _single_ `Vector` to have different lengths. But if you look at the [doc for `RecordBatch`](https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatchE) it indeed says
   
   > Collection of **equal-length arrays** matching a particular [Schema](https://arrow.apache.org/docs/cpp/api/datatype.html#classarrow_1_1_schema)
   
   (emphasis mine). So two columns can't have different lengths within a single `Batch`. 
   
   Maybe you're right that this is effectively the same as what pyarrow is doing. It's hard to tell whether those batches are views on the existing buffers, like the slices that Arrow JS's `batches` take on the existing buffers.




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1858910452

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 49fde2313bc429547bc7e13886ed28c9c7fc6a84.
   
   There was 1 benchmark result indicating a performance regression:
   
   - Commit Run on `ursa-i9-9960x` at [2023-12-16 16:50:47Z](https://conbench.ursa.dev/compare/runs/fdbf90617722448fb926eb7da99b3194...2ab8f4ced8414325bc25c23e5e891634/)
     - [`tpch` (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-15, scale_factor=1](https://conbench.ursa.dev/compare/benchmarks/0657b6c060a6756880005cccaf7b3a08...0657dedafde376a58000be841ddfe814)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/19709480607) has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1837599573

   I do feel that supporting the full spec (i.e. _having_ a `LargeUtf8` type) and supporting massive strings larger than `Number.MAX_SAFE_INTEGER` are two separate objectives.
   
   I personally feel that it's fine to use `number` for indexing (with some safety checks to avoid overflows) and to instruct people to use Vectors if they need an offset value larger than `Number.MAX_SAFE_INTEGER`. I think "a new data structure that chunks" is a lot of added complexity when `Vector` is _already_ a data structure that chunks. 




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1837268285

   Hmm, Number.MAX_SAFE_INTEGER is actually pretty large. Maybe it's better to still use numbers but implement a chunked buffer for the data values. I'll keep implementing chunked value arrays to see what makes sense. 




[GitHub] [arrow] domoritz commented on a diff in pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1225872793


##########
js/src/builder.ts:
##########
@@ -289,7 +290,7 @@ export abstract class Builder<T extends DataType = any, TNull = any> {
             valueOffsets = _offsets?.flush(length);
         } else if (valueOffsets = _offsets?.flush(length)) { // Variable-width primitives (Binary, Utf8), and Lists
             // Binary, Utf8
-            data = _values?.flush(_offsets.last());
+            data = _values?.flush(Number(_offsets.last()));

Review Comment:
   I think this will fail for offsets larger than `number` allows, but I think that's inevitable.
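
   A minimal sketch of the kind of safety check this conversion could use (`toSafeLength` is a hypothetical helper, not Arrow JS API):

   ```typescript
   // Hypothetical guard for the bigint -> number conversion above: fail loudly
   // instead of silently losing precision past Number.MAX_SAFE_INTEGER.
   function toSafeLength(offset: bigint | number): number {
       const n = Number(offset);
       if (!Number.isSafeInteger(n)) {
           throw new RangeError(`offset ${offset} exceeds Number.MAX_SAFE_INTEGER`);
       }
       return n;
   }

   console.log(toSafeLength(42n)); // → 42
   ```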





[GitHub] [arrow] FrNecas commented on a diff in pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "FrNecas (via GitHub)" <gi...@apache.org>.
FrNecas commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1301947907


##########
js/src/enum.ts:
##########
@@ -201,6 +201,7 @@ export enum Type {
     SparseUnion = -24,
     IntervalDayTime = -25,
     IntervalYearMonth = -26,
+    LargeUtf8 = -27,

Review Comment:
   I think it should be 20, according to the flatbuffers file that you get when you build the package (also, the comment about 0-17 is probably outdated).





[GitHub] [arrow] ArcRiiad commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "ArcRiiad (via GitHub)" <gi...@apache.org>.
ArcRiiad commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1729403980

   Some work left on this PR? :p 




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1846069737

   It looks like what happens is that when you put differently chunked vectors into a table, the record batch boundaries are created at the chunk gaps of all vectors. So if you have two vectors with chunk lengths of 2,3 and 3,2, you will get record batches of length 2,1,2. I would hope that the data buffers are views on the existing buffers. The only issue I can see with this approach is that for variable-width vectors (like utf8)---where the offset buffer might reference the same position in the data buffer multiple times---we would need to copy the data buffer in serialized record batches.
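
   The boundary merging described above can be sketched as follows (a standalone illustration, not Arrow JS code):

   ```typescript
   // Given per-column chunk lengths, compute the record-batch lengths a table
   // would produce by cutting at every chunk boundary of every column.
   function batchLengths(...chunkLens: number[][]): number[] {
       // Collect the cumulative boundary positions of every column.
       const cuts = new Set<number>();
       for (const lens of chunkLens) {
           let pos = 0;
           for (const len of lens) {
               pos += len;
               cuts.add(pos);
           }
       }
       // Sorted boundaries become the lengths between consecutive cuts.
       const sorted = [...cuts].sort((a, b) => a - b);
       const out: number[] = [];
       let prev = 0;
       for (const c of sorted) {
           out.push(c - prev);
           prev = c;
       }
       return out;
   }

   // Chunks of 2,3 and 3,2 yield batches of 2,1,2.
   console.log(batchLengths([2, 3], [3, 2])); // → [2, 1, 2]
   ```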




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1837775663

   > Vectors are chunked but buffers in builders are not. So I think I want to add a chunked buffer or figure out how to flush to vectors when the buffer is full.
   
   That's true, but assuming that the user has more than one column, the builder can't make independent decisions about when to finish a chunk, because that would lead to different offsets than another column.




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839808240

   Is it actually an issue that chunks have different lengths? Seems to work fine here
   
   ```js
   const b1 = makeBuilder({ type: new Int32 });
   const b2 = makeBuilder({ type: new Int32 });
   
   const d1 = [] as Data[];
   const d2 = [] as Data[];
   
   b1.append(1);
   b1.append(2);
   d1.push(b1.flush());
   b1.append(3);
   d1.push(b1.flush());
   
   b2.append(4);
   d2.push(b2.flush());
   b2.append(5);
   b2.append(6);
   d2.push(b2.flush());
   
   const v1 = new Vector(d1);
   const v2 = new Vector(d2);
   
   const table = new Table({ v1, v2 });
   console.log(table.toArray());
   ```




[GitHub] [arrow] pitrou commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1691912131

   Also, here as well, need to remove the corresponding skips in the integration tests:
   https://github.com/apache/arrow/blob/b9453a2a33e935e26dc8358846ad671aefdcb4d0/dev/archery/archery/integration/datagen.py#L1668-L1670




[GitHub] [arrow] FrNecas commented on pull request #35780: GH-15060: [JS] Add LargeUtf8 type

Posted by "FrNecas (via GitHub)" <gi...@apache.org>.
FrNecas commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1690303717

   One more place where this needs to be added: https://github.com/apache/arrow/blob/main/js/src/ipc/metadata/json.ts#L159 




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1825962774

   I left it unfinished before I went on vacation and just got back. What's left is addressing the comments here and implementing a buffer that supports chunks so it can be larger than a normal JS array. PRs against this branch would be awesome.




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1837198871

   I updated things based on the comments so far. As mentioned, we would need to support larger arrays. I plan to look into the following:
   * We support BigInt64Array as offset buffers. The offsets are into the data buffer. The offset buffer itself can only have 2^32−2 entries (because it is indexed by numbers).
   * The data buffer now needs to be indexed by a bigint. Since normal JS arrays are indexed by (32 bit) numbers, we need to implement a new data structure that chunks. I really wonder, though, whether there is ever a use case for arrays this large in JS, since it would probably exceed the available browser memory, and a serialized buffer can also not be that large.
   
   Thoughts? Especially @kylebarron, since you mentioned above that you think the issues of supporting deserialization and actually supporting massive strings are separate.
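
   For the second point, a sketch of what indexing a chunked data buffer by a bigint could look like (the chunk layout and helper name are made up for illustration):

   ```typescript
   // Resolve a bigint byte offset into a (chunk index, local offset) pair
   // over a list of chunk sizes. Purely illustrative, not Arrow JS API.
   function locate(chunkSizes: number[], offset: bigint): [number, number] {
       let remaining = offset;
       for (let i = 0; i < chunkSizes.length; i++) {
           const size = BigInt(chunkSizes[i]);
           if (remaining < size) {
               // The local offset fits in a number once it is within one chunk.
               return [i, Number(remaining)];
           }
           remaining -= size;
       }
       throw new RangeError(`offset ${offset} out of bounds`);
   }

   console.log(locate([4, 4, 4], 9n)); // → [2, 1]
   ```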




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on code in PR #35780:
URL: https://github.com/apache/arrow/pull/35780#discussion_r1428807193


##########
js/src/enum.ts:
##########
@@ -175,6 +174,7 @@ export enum Type {
     FixedSizeList = 16, /** Fixed-size list. Each value occupies the same number of bytes */
     Map = 17, /** Map of named logical types */
     Duration = 18, /** Measure of elapsed time in either seconds, milliseconds, microseconds or nanoseconds. */
+    LargeUtf8 = 20, /** Large variable-length string as List<Char> */

Review Comment:
   We haven't implemented it, so I'm skipping it for now.





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz merged PR #35780:
URL: https://github.com/apache/arrow/pull/35780




Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1846233104

   > It looks like what happens is that when you put differently chunked vectors into a table, the record batch boundaries are created at the chunk gaps from all vectors.
   
   Yes, indeed.
   
   > I would hope that the data buffers are views on the existing buffers.
   
   In C++ they are.
   

