You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/15 18:45:11 UTC

[GitHub] [arrow] paddyhoran commented on a change in pull request #8901: ARROW-10889: [Rust] [Proposal] Add guidelines about usage of `unsafe`

paddyhoran commented on a change in pull request #8901:
URL: https://github.com/apache/arrow/pull/8901#discussion_r543591924



##########
File path: rust/arrow/README.md
##########
@@ -96,6 +96,71 @@ Arrow uses the following features:
 Other than `simd` all the other features are enabled by default. Disabling `prettyprint` might be necessary in order to
 compile Arrow to the `wasm32-unknown-unknown` WASM target.
 
+## Guidelines in usage of `unsafe`
+
+[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust.
+For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829)
+This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent.
+
+### When can `unsafe` be used?
+
+Generally, `unsafe` should only be used when a `safe` counterpart is not available and there is no `safe` way to achieve additional performance in that area. The following is a summary of the current components of the crate that require `unsafe`:
+
+* alloc, dealloc and realloc of buffers along cache lines
+* Interpreting bytes as certain rust types, for access, representation and compute
+* Foreign interfaces (C data interface)
+* Inter-process communication (IPC)
+* SIMD
+* Performance (e.g. omit bounds checks, use of pointers to avoid bound checks)
+
+#### cache-line aligned memory management
+
+The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior.
+However, Rust's global allocator does not allocates memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`.

Review comment:
       ```suggestion
   However, Rust's global allocator does not allocate memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`.
   ```

##########
File path: rust/arrow/README.md
##########
@@ -96,6 +96,71 @@ Arrow uses the following features:
 Other than `simd` all the other features are enabled by default. Disabling `prettyprint` might be necessary in order to
 compile Arrow to the `wasm32-unknown-unknown` WASM target.
 
+## Guidelines in usage of `unsafe`
+
+[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust.
+For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829)
+This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent.
+
+### When can `unsafe` be used?
+
+Generally, `unsafe` should only be used when a `safe` counterpart is not available and there is no `safe` way to achieve additional performance in that area. The following is a summary of the current components of the crate that require `unsafe`:
+
+* alloc, dealloc and realloc of buffers along cache lines
+* Interpreting bytes as certain rust types, for access, representation and compute
+* Foreign interfaces (C data interface)
+* Inter-process communication (IPC)
+* SIMD
+* Performance (e.g. omit bounds checks, use of pointers to avoid bound checks)
+
+#### cache-line aligned memory management
+
+The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior.
+However, Rust's global allocator does not allocates memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`.
+
+#### Interpreting bytes
+
+The arrow format is specified in bytes (`u8`), which can be logically represented as certain types
+depending on the DataType.

Review comment:
       ```suggestion
   depending on the `DataType`.
   ```

##########
File path: rust/arrow/README.md
##########
@@ -96,6 +96,71 @@ Arrow uses the following features:
 Other than `simd` all the other features are enabled by default. Disabling `prettyprint` might be necessary in order to
 compile Arrow to the `wasm32-unknown-unknown` WASM target.
 
+## Guidelines in usage of `unsafe`
+
+[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust.
+For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829)
+This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent.
+
+### When can `unsafe` be used?
+
+Generally, `unsafe` should only be used when a `safe` counterpart is not available and there is no `safe` way to achieve additional performance in that area. The following is a summary of the current components of the crate that require `unsafe`:
+
+* alloc, dealloc and realloc of buffers along cache lines
+* Interpreting bytes as certain rust types, for access, representation and compute
+* Foreign interfaces (C data interface)
+* Inter-process communication (IPC)
+* SIMD
+* Performance (e.g. omit bounds checks, use of pointers to avoid bound checks)
+
+#### cache-line aligned memory management
+
+The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior.
+However, Rust's global allocator does not allocates memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`.
+
+#### Interpreting bytes
+
+The arrow format is specified in bytes (`u8`), which can be logically represented as certain types
+depending on the DataType.
+For many operations, such as access, representation, numerical computation and string manipulation,
+it is often necessary to interpret bytes as other physical types (e.g. `i32`).
+
+Usage of `unsafe` for the purpose of interpreting bytes in their corresponding type (according to the arrow specification) is allowed. Specifically, the pointer to the byte slice must be aligned to the type that it intends to represent and the length of the slice is a multiple of the size of the target type of the transmutation.
+
+#### FFI
+
+The arrow format declares an ABI for zero-copy from and to libraries that implement the specification
+(foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of `unsafe` due to 
+the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment), from the source code alone as they are part of the FFI contract.

Review comment:
       ```suggestion
   the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment) from the source code alone as they are part of the FFI contract.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org