You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "shuhaowu (via GitHub)" <gi...@apache.org> on 2023/04/06 16:42:18 UTC

[GitHub] [arrow] shuhaowu opened a new pull request, #34939: GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector

shuhaowu opened a new pull request, #34939:
URL: https://github.com/apache/arrow/pull/34939

   ### Rationale for this change
   
   Certain codebases that previously uses row-oriented way to access data may wish to migrate to Arrow to save serialization and deserialization cost, and to be able to gain access to fast column-oriented operations. As it stands, Arrow is sort of a drop-in replacement to row-oriented data such as a JavaScript Array of objects. This is great to incrementally migrate legacy codebases to Arrow, as it is frequently infeasible to rewrite the application to use the column-oriented data access patterns. For most data, JavaScript-object-compatible and row-oriented access is already provided via the `StructRowProxy`. However, if the structs themselves include a `Vector`, existing code will break as it assumes the `Vector` object to behave like a JavaScript array, which it does not due to the lack of index access. An example of such a data structure is as follows:
   
   ```
   [
     {x: 1, y: [1, 2]},
     {x: 2, y: [2, 3]},
   ]
   ```
   
   In this case, with the Arrow JS library as it is, the API consumer is unable to get individual element of the `y` array via `table[i].y[j]`. Instead, the API consumer must use the API `table.get(i).y.get(j)`. In the situation where we are migrating a legacy code base to Arrow, this requires a large refactor of the entire codebase, which is infeasible in a short time. This negates the advantage of using Arrow as a drop-in replacement and prevents incremental migration of code to Arrow.
   
   ### What changes are included in this PR?
   
   To address this problem, this patch adds a Proxy at the root of the prototype chain for the `Table`, `RecordBatch`, and `Vector` objects and allow index access for these objects for backward compatibility purposes. Basically, objects like `Vector` now supports `vector[i]` in addition to `vector.get(i)`.
   
   However, code should not be using `vector[i]` as it is ~1.5 orders of magnitude slower than `vector.get(i)` as ES6 Proxy objects are quite slow. This should only be used to provide compatibility for legacy codebases. For code that desires high performance, `.get` remains a much better solution. This is also why the Proxy object is added to the root of the prototype chain, as opposed to the usual pattern where a Proxy object is returned from a constructor.
   
   Documentation has been added to compare the performance of the various access.
   
   ### Are there any user-facing changes?
   
   `Table`, `RecordBatch`, and `Vector` elements can now be accessed via index operators, albeit much slower than `.get`. All changes should be backward compatible.
   
   ### Are these changes tested?
   
   Yes. The performance of the base objects does not seem to be affected. To establish the performance change, we first see how much variability in the `ops/s` there is between two runs of `yarn perf`. The left plot below shows the percent difference between the two runs. This shows a baseline level of "variability" (although may not be statistically significant, but gives an indication) between the runs. We see most benchmarks are within 20% of each other, although a few tests at the bottom show higher variability. The right plot shows the difference in `ops/s` before and after the patch. No benchmark showed significantly higher or lower performance and looks superficially similar to the variability comparison. As such, we can say that the performance of the objects are not impacted by this patch.
   
   ![image](https://user-images.githubusercontent.com/338100/230441327-6162a318-362a-4c18-bb81-68bc9a32e987.png)
   
   The following shows the performance of `vector.get(i)` vs `vector[i]`. The raw results shows the index access to have about 50 ops/s while `.get` are around 750 ops/s. This is about 1.5 orders of magnitude difference in performance. Basically, this shouldn't be used unless it is used for backward compatibility with code that expects native JavaScript arrays.
   
   <img src=https://user-images.githubusercontent.com/338100/230442000-0d7bd599-47bc-4268-a8fe-a9f858c2d319.png width=600 />
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34939: GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34939:
URL: https://github.com/apache/arrow/pull/34939#issuecomment-1499342439

   :warning: GitHub issue #34936 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34939: GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34939:
URL: https://github.com/apache/arrow/pull/34939#issuecomment-1499342399

   * Closes: #34936


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector [arrow]

Posted by "domoritz (via GitHub)" <gi...@apache.org>.
domoritz commented on PR #34939:
URL: https://github.com/apache/arrow/pull/34939#issuecomment-1879001707

   I love the perf charts you made. Is there a script or something you could share to make them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org