Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/21 21:29:54 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #2123: Faster parquet DictEncoder

tustvold opened a new pull request, #2123:
URL: https://github.com/apache/arrow-rs/pull/2123

   # Which issue does this PR close?
   
   Part of #1764 
   
   # Rationale for this change
    
   The existing implementation is complex and slower.
   
   # What changes are included in this PR?
   
   Gives the encoder the same treatment as #1861, switching to using ahash and hashbrown.
   
   # Are there any user-facing changes?
   
   No
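   The core interning idea behind the new encoder can be sketched as follows. This is a simplified, hypothetical illustration only: it uses std's `HashMap` in place of hashbrown's raw-entry API and ahash, and the names (`dict_encode`) are illustrative, not the PR's actual code.
   
   ```rust
   use std::collections::HashMap;
   
   /// Sketch: dictionary-encode a slice into (unique values, indices).
   /// The real encoder avoids storing the value twice by keying the map
   /// on an index into `uniques` via the raw-entry API.
   fn dict_encode<T: Clone + Eq + std::hash::Hash>(values: &[T]) -> (Vec<T>, Vec<u64>) {
       let mut dedup: HashMap<T, u64> = HashMap::new();
       let mut uniques = Vec::new();
       let mut indices = Vec::with_capacity(values.len());
       for v in values {
           // First sighting of a value appends it to the dictionary.
           let idx = *dedup.entry(v.clone()).or_insert_with(|| {
               uniques.push(v.clone());
               (uniques.len() - 1) as u64
           });
           indices.push(idx);
       }
       (uniques, indices)
   }
   
   fn main() {
       let (uniques, indices) = dict_encode(&["a", "b", "a", "c", "b"]);
       assert_eq!(uniques, vec!["a", "b", "c"]);
       assert_eq!(indices, vec![0, 1, 0, 2, 1]);
   }
   ```
   
   The dictionary (`uniques`) becomes the dictionary page; the `indices` are what the data pages RLE/bit-pack.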
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #2123: Faster parquet DictEncoder (~20%)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#discussion_r927783932


##########
parquet/src/encodings/encoding/dict_encoder.rs:
##########
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// ----------------------------------------------------------------------
+// Dictionary encoding
+
+use crate::basic::{Encoding, Type};
+use crate::data_type::{AsBytes, DataType};
+use crate::encodings::encoding::{Encoder, PlainEncoder};
+use crate::encodings::rle::RleEncoder;
+use crate::errors::{ParquetError, Result};
+use crate::schema::types::ColumnDescPtr;
+use crate::util::bit_util::num_required_bits;
+use crate::util::memory::ByteBufferPtr;
+use hashbrown::hash_map::RawEntryMut;
+use hashbrown::HashMap;
+use std::hash::{BuildHasher, Hash, Hasher};
+use std::io::Write;
+use crate::data_type::private::ParquetValueType;
+
+/// Dictionary encoder.
+/// The dictionary encoding builds a dictionary of values encountered in a given column.
+/// The dictionary page is written first, before the data pages of the column chunk.
+///
+/// Dictionary page format: the entries in the dictionary - in dictionary order -
+/// using the plain encoding.
+///
+/// Data page format: the bit width used to encode the entry ids stored as 1 byte
+/// (max bit width = 32), followed by the values encoded using RLE/Bit packed described
+/// above (with the given bit width).
+pub struct DictEncoder<T: DataType> {
+    /// Descriptor for the column to be encoded.
+    desc: ColumnDescPtr,
+
+    state: ahash::RandomState,
+
+    /// Used to provide a lookup from value to unique value

Review Comment:
   https://www.youtube.com/watch?v=PAAkCSZUG1c&t=9m28s 🀷 





[GitHub] [arrow-rs] Dandandan commented on a diff in pull request #2123: Faster parquet DictEncoder (~20%)

Dandandan commented on code in PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#discussion_r927303739


##########
parquet/Cargo.toml:
##########
@@ -49,6 +50,7 @@ serde_json = { version = "1.0", default-features = false, features = ["std"], op
 rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
 futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
 tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "fs", "rt", "io-util"] }
+hashbrown = { version = "0.12", default-features = false }

Review Comment:
   There is a feature "inline-more", enabled by default in hashbrown, which sometimes gives a bit better performance.
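   For reference, a downstream crate could opt back in to that feature after parquet disables default features; a hypothetical downstream Cargo.toml fragment (crate layout assumed, not from this PR):
   
   ```toml
   # Downstream Cargo.toml: re-enable hashbrown's default-on
   # "inline-more" feature, since Cargo features are additive.
   [dependencies]
   hashbrown = { version = "0.12", features = ["inline-more"] }
   ```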





[GitHub] [arrow-rs] ursabot commented on pull request #2123: Faster parquet DictEncoder (~20%)

ursabot commented on PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#issuecomment-1199001679

   Benchmark runs are scheduled for baseline = 985760f5609125f1ca4797c1df9ba69ab01aab85 and contender = 6ce4c4ebf2d4c0a27feb7f341c6eee2525a1cce7. 6ce4c4ebf2d4c0a27feb7f341c6eee2525a1cce7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/2c019c99ec5e45e9befaeda6b2b62f94...8e86863c8a53450c84d02b61e5339cce/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/160e66b8ebd34a01936c0c5dac5d4c46...d7ee4cbebd7a4c59a3036bd8edb4a90d/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fb65205de0dd486484265dab1aac07cc...d70206f5aa794485a036e5ca77398063/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f510d661b1db42848c909348b2b5fd67...c946cb221897490f8cb86d6b082b1b68/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow-rs] alamb commented on a diff in pull request #2123: Faster parquet DictEncoder (~20%)

alamb commented on code in PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#discussion_r927522428


##########
parquet/Cargo.toml:
##########
@@ -30,6 +30,7 @@ edition = "2021"
 rust-version = "1.57"
 
 [dependencies]
+ahash = "0.7"

Review Comment:
   These seem to be new dependencies (if optional features are not enabled)



##########
parquet/src/encodings/encoding/dict_encoder.rs:
##########
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// ----------------------------------------------------------------------
+// Dictionary encoding
+
+use crate::basic::{Encoding, Type};
+use crate::data_type::{AsBytes, DataType};
+use crate::encodings::encoding::{Encoder, PlainEncoder};
+use crate::encodings::rle::RleEncoder;
+use crate::errors::{ParquetError, Result};
+use crate::schema::types::ColumnDescPtr;
+use crate::util::bit_util::num_required_bits;
+use crate::util::memory::ByteBufferPtr;
+use hashbrown::hash_map::RawEntryMut;
+use hashbrown::HashMap;
+use std::hash::{BuildHasher, Hash, Hasher};
+use std::io::Write;
+use crate::data_type::private::ParquetValueType;
+
+/// Dictionary encoder.
+/// The dictionary encoding builds a dictionary of values encountered in a given column.
+/// The dictionary page is written first, before the data pages of the column chunk.
+///
+/// Dictionary page format: the entries in the dictionary - in dictionary order -
+/// using the plain encoding.
+///
+/// Data page format: the bit width used to encode the entry ids stored as 1 byte
+/// (max bit width = 32), followed by the values encoded using RLE/Bit packed described
+/// above (with the given bit width).
+pub struct DictEncoder<T: DataType> {
+    /// Descriptor for the column to be encoded.
+    desc: ColumnDescPtr,
+
+    state: ahash::RandomState,
+
+    /// Used to provide a lookup from value to unique value

Review Comment:
   Given the replication of this pattern (maybe now in three places?) perhaps we can factor it into its own structure, mostly for readability as the use of `HashMap` to implement a `HashSet` takes some thought to totally grok
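   The suggested factoring could look roughly like the following. This is a hypothetical sketch only (std's `HashMap` stands in for the hashbrown/ahash machinery, and the `Storage`/`Interner` names mirror what a shared abstraction might be called): a dense key is handed out per distinct value, generic over where values live.
   
   ```rust
   use std::collections::HashMap;
   
   /// How interned values are stored; `push` returns the new value's key.
   pub trait Storage {
       type Value: Clone + Eq + std::hash::Hash;
       fn push(&mut self, value: &Self::Value) -> u64;
   }
   
   /// Map from value to dense key, wrapping the "HashMap as HashSet" trick.
   pub struct Interner<S: Storage> {
       dedup: HashMap<S::Value, u64>,
       storage: S,
   }
   
   impl<S: Storage> Interner<S> {
       pub fn new(storage: S) -> Self {
           Self { dedup: HashMap::new(), storage }
       }
   
       /// Returns the key for `value`, interning it on first sight.
       pub fn intern(&mut self, value: &S::Value) -> u64 {
           match self.dedup.get(value) {
               Some(&key) => key,
               None => {
                   let key = self.storage.push(value);
                   self.dedup.insert(value.clone(), key);
                   key
               }
           }
       }
   }
   
   // Simplest possible storage: a Vec of owned strings.
   struct VecStorage(Vec<String>);
   impl Storage for VecStorage {
       type Value = String;
       fn push(&mut self, value: &String) -> u64 {
           self.0.push(value.clone());
           (self.0.len() - 1) as u64
       }
   }
   
   fn main() {
       let mut interner = Interner::new(VecStorage(Vec::new()));
       assert_eq!(interner.intern(&"x".to_string()), 0);
       assert_eq!(interner.intern(&"y".to_string()), 1);
       assert_eq!(interner.intern(&"x".to_string()), 0);
   }
   ```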





[GitHub] [arrow-rs] tustvold commented on pull request #2123: Faster parquet DictEncoder

tustvold commented on PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#issuecomment-1191955893

   Running benchmarks with just the change to ahash shows no significant performance change. This is not entirely surprising, as the current implementation uses crc32, which is very cheap to compute (although not DoS resistant).
   
   The change to hashbrown nets a non-trivial return where value encoding is the major bottleneck.
   
   ```
   write_batch primitive/4096 values primitive                                                                             
                           time:   [1.5325 ms 1.5331 ms 1.5338 ms]
                           thrpt:  [115.02 MiB/s 115.07 MiB/s 115.12 MiB/s]
                    change:
                           time:   [-20.677% -20.632% -20.590%] (p = 0.00 < 0.05)
                           thrpt:  [+25.929% +25.995% +26.068%]
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
   write_batch primitive/4096 values primitive non-null                                                                             
                           time:   [1.4838 ms 1.4847 ms 1.4857 ms]
                           thrpt:  [116.44 MiB/s 116.52 MiB/s 116.59 MiB/s]
                    change:
                           time:   [-12.080% -12.017% -11.954%] (p = 0.00 < 0.05)
                           thrpt:  [+13.577% +13.659% +13.739%]
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) high mild
     2 (2.00%) high severe
   write_batch primitive/4096 values bool                                                                            
                           time:   [111.01 us 111.09 us 111.19 us]
                           thrpt:  [10.224 MiB/s 10.233 MiB/s 10.240 MiB/s]
                    change:
                           time:   [-0.8794% -0.6831% -0.4488%] (p = 0.00 < 0.05)
                           thrpt:  [+0.4508% +0.6878% +0.8872%]
                           Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   write_batch primitive/4096 values bool non-null                                                                            
                           time:   [52.931 us 53.012 us 53.094 us]
                           thrpt:  [21.411 MiB/s 21.444 MiB/s 21.477 MiB/s]
                    change:
                           time:   [-2.2177% -2.1085% -1.9913%] (p = 0.00 < 0.05)
                           thrpt:  [+2.0318% +2.1539% +2.2680%]
                           Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
     5 (5.00%) high mild
     10 (10.00%) high severe
   write_batch primitive/4096 values string                                                                            
                           time:   [891.20 us 891.52 us 891.88 us]
                           thrpt:  [89.239 MiB/s 89.275 MiB/s 89.306 MiB/s]
                    change:
                           time:   [-8.4838% -8.4391% -8.3955%] (p = 0.00 < 0.05)
                           thrpt:  [+9.1650% +9.2170% +9.2703%]
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   Benchmarking write_batch primitive/4096 values string non-null: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
   write_batch primitive/4096 values string non-null                                                                             
                           time:   [1.0208 ms 1.0213 ms 1.0218 ms]
                           thrpt:  [77.889 MiB/s 77.931 MiB/s 77.970 MiB/s]
                    change:
                           time:   [+0.0730% +0.1746% +0.2545%] (p = 0.00 < 0.05)
                           thrpt:  [-0.2538% -0.1743% -0.0730%]
                           Change within noise threshold.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.8s, enable flat sampling, or reduce sample count to 50.
   write_batch nested/4096 values primitive list                                                                             
                           time:   [1.9798 ms 2.0064 ms 2.0368 ms]
                           thrpt:  [80.409 MiB/s 81.627 MiB/s 82.725 MiB/s]
                    change:
                           time:   [+0.9435% +1.8832% +3.0013%] (p = 0.00 < 0.05)
                           thrpt:  [-2.9139% -1.8484% -0.9347%]
                           Change within noise threshold.
   Found 19 outliers among 100 measurements (19.00%)
     1 (1.00%) high mild
     18 (18.00%) high severe
   write_batch nested/4096 values primitive list non-null                                                                             
                           time:   [2.4385 ms 2.4696 ms 2.5038 ms]
                           thrpt:  [76.896 MiB/s 77.959 MiB/s 78.952 MiB/s]
                    change:
                           time:   [-0.1096% +1.1302% +2.5102%] (p = 0.10 > 0.05)
                           thrpt:  [-2.4488% -1.1176% +0.1097%]
                           No change in performance detected.
   ```
   
   
   
   
   




[GitHub] [arrow-rs] alamb commented on a diff in pull request #2123: Faster parquet DictEncoder (~20%)

alamb commented on code in PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#discussion_r933141260


##########
parquet/src/util/interner.rs:
##########
@@ -0,0 +1,94 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::data_type::AsBytes;
+use hashbrown::hash_map::RawEntryMut;
+use hashbrown::HashMap;
+use std::hash::Hash;
+
+/// Storage trait for [`Interner`]
+pub trait Storage {

Review Comment:
   πŸ‘ 





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2123: Faster parquet DictEncoder (~20%)

tustvold commented on code in PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#discussion_r927758561


##########
parquet/src/encodings/encoding/dict_encoder.rs:
##########
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// ----------------------------------------------------------------------
+// Dictionary encoding
+
+use crate::basic::{Encoding, Type};
+use crate::data_type::{AsBytes, DataType};
+use crate::encodings::encoding::{Encoder, PlainEncoder};
+use crate::encodings::rle::RleEncoder;
+use crate::errors::{ParquetError, Result};
+use crate::schema::types::ColumnDescPtr;
+use crate::util::bit_util::num_required_bits;
+use crate::util::memory::ByteBufferPtr;
+use hashbrown::hash_map::RawEntryMut;
+use hashbrown::HashMap;
+use std::hash::{BuildHasher, Hash, Hasher};
+use std::io::Write;
+use crate::data_type::private::ParquetValueType;
+
+/// Dictionary encoder.
+/// The dictionary encoding builds a dictionary of values encountered in a given column.
+/// The dictionary page is written first, before the data pages of the column chunk.
+///
+/// Dictionary page format: the entries in the dictionary - in dictionary order -
+/// using the plain encoding.
+///
+/// Data page format: the bit width used to encode the entry ids stored as 1 byte
+/// (max bit width = 32), followed by the values encoded using RLE/Bit packed described
+/// above (with the given bit width).
+pub struct DictEncoder<T: DataType> {
+    /// Descriptor for the column to be encoded.
+    desc: ColumnDescPtr,
+
+    state: ahash::RandomState,
+
+    /// Used to provide a lookup from value to unique value

Review Comment:
   I did consider this, but I was unsure where to put it. It can't live in arrow, as parquet needs to compile without arrow, but aside from creating a new crate I wasn't really sure where to put it...





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2123: Faster parquet DictEncoder (~20%)

tustvold commented on code in PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#discussion_r932952043


##########
parquet/Cargo.toml:
##########
@@ -49,6 +50,7 @@ serde_json = { version = "1.0", default-features = false, features = ["std"], op
 rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
 futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
 tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "fs", "rt", "io-util"] }
+hashbrown = { version = "0.12", default-features = false }

Review Comment:
   By disabling this here, we can delegate that decision downstream





[GitHub] [arrow-rs] tustvold commented on pull request #2123: Faster parquet DictEncoder (~20%)

tustvold commented on PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#issuecomment-1198968800

   I'm going to get this in as I need it for #1764; we have time until the next release to address any issues.




[GitHub] [arrow-rs] Dandandan commented on a diff in pull request #2123: Faster parquet DictEncoder (~20%)

Dandandan commented on code in PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#discussion_r927470234


##########
parquet/src/encodings/encoding/dict_encoder.rs:
##########
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// ----------------------------------------------------------------------
+// Dictionary encoding
+
+use crate::basic::{Encoding, Type};
+use crate::data_type::{AsBytes, DataType};
+use crate::encodings::encoding::{Encoder, PlainEncoder};
+use crate::encodings::rle::RleEncoder;
+use crate::errors::{ParquetError, Result};
+use crate::schema::types::ColumnDescPtr;
+use crate::util::bit_util::num_required_bits;
+use crate::util::memory::ByteBufferPtr;
+use hashbrown::hash_map::RawEntryMut;
+use hashbrown::HashMap;
+use std::hash::{BuildHasher, Hash, Hasher};
+use std::io::Write;
+use crate::data_type::private::ParquetValueType;
+
+/// Dictionary encoder.
+/// The dictionary encoding builds a dictionary of values encountered in a given column.
+/// The dictionary page is written first, before the data pages of the column chunk.
+///
+/// Dictionary page format: the entries in the dictionary - in dictionary order -
+/// using the plain encoding.
+///
+/// Data page format: the bit width used to encode the entry ids stored as 1 byte
+/// (max bit width = 32), followed by the values encoded using RLE/Bit packed described
+/// above (with the given bit width).
+pub struct DictEncoder<T: DataType> {
+    /// Descriptor for the column to be encoded.
+    desc: ColumnDescPtr,
+
+    state: ahash::RandomState,
+
+    /// Used to provide a lookup from value to unique value
+    ///
+    /// Note: `u64`'s hash implementation is not used, instead the raw entry
+    /// API is used to store keys w.r.t the hash of the strings themselves
+    ///
+    dedup: HashMap<u64, (), ()>,
+
+    /// The unique observed values.
+    uniques: Vec<T::T>,
+
+    /// The buffered indices
+    indices: Vec<u64>,
+
+    /// Size in bytes needed to encode this dictionary.
+    uniques_size_in_bytes: usize,
+}
+
+impl<T: DataType> DictEncoder<T> {
+    /// Creates new dictionary encoder.
+    pub fn new(desc: ColumnDescPtr) -> Self {
+        Self {
+            desc,
+            state: Default::default(),
+            dedup: HashMap::with_hasher(()),
+            uniques: vec![],
+            indices: vec![],
+            uniques_size_in_bytes: 0,
+        }
+    }
+
+    /// Returns true if dictionary entries are sorted, false otherwise.
+    pub fn is_sorted(&self) -> bool {
+        // Sorting is not supported currently.
+        false
+    }
+
+    /// Returns number of unique values (keys) in the dictionary.
+    pub fn num_entries(&self) -> usize {
+        self.uniques.len()
+    }
+
+    /// Returns size of unique values (keys) in the dictionary, in bytes.
+    pub fn dict_encoded_size(&self) -> usize {
+        self.uniques_size_in_bytes
+    }
+
+    /// Writes out the dictionary values with PLAIN encoding in a byte buffer, and return
+    /// the result.
+    pub fn write_dict(&self) -> Result<ByteBufferPtr> {
+        let mut plain_encoder = PlainEncoder::<T>::new(self.desc.clone(), vec![]);
+        plain_encoder.put(&self.uniques)?;
+        plain_encoder.flush_buffer()
+    }
+
+    /// Writes out the dictionary values with RLE encoding in a byte buffer, and return
+    /// the result.
+    pub fn write_indices(&mut self) -> Result<ByteBufferPtr> {
+        let buffer_len = self.estimated_data_encoded_size();
+        let mut buffer = vec![0; buffer_len];
+        buffer[0] = self.bit_width() as u8;
+
+        // Write bit width in the first byte
+        buffer.write_all((self.bit_width() as u8).as_bytes())?;
+        let mut encoder = RleEncoder::new_from_buf(self.bit_width(), buffer, 1);
+        for index in &self.indices {
+            if !encoder.put(*index as u64)? {
+                return Err(general_err!("Encoder doesn't have enough space"));
+            }
+        }
+        self.indices.clear();
+        Ok(ByteBufferPtr::new(encoder.consume()?))
+    }
+
+    fn compute_hash(state: &ahash::RandomState, value: &T::T) -> u64 {
+        let mut hasher = state.build_hasher();
+        value.as_bytes().hash(&mut hasher);
+        hasher.finish()
+    }
+
+    fn put_one(&mut self, value: &T::T) {
+        let hash = Self::compute_hash(&self.state, value);
+
+        let entry = self
+            .dedup
+            .raw_entry_mut()
+            .from_hash(hash, |index| value == &self.uniques[*index as usize]);
+
+        let index = match entry {
+            RawEntryMut::Occupied(entry) => *entry.into_key(),
+            RawEntryMut::Vacant(entry) => {
+                let index = self.uniques.len() as u64;
+                self.uniques.push(value.clone());
+
+                let (base_size, num_elements) = value.dict_encoding_size();
+
+                let unique_size = match T::get_physical_type() {
+                    Type::BYTE_ARRAY => base_size + num_elements,
+                    Type::FIXED_LEN_BYTE_ARRAY => self.desc.type_length() as usize,
+                    _ => base_size,
+                };
+                self.uniques_size_in_bytes += unique_size;
+
+                *entry
+                    .insert_with_hasher(hash, index, (), |index| {
+                        Self::compute_hash(&self.state, &self.uniques[*index as usize])
+                    })
+                    .0
+            }
+        };
+
+        self.indices.push(index);
+    }
+
+    #[inline]
+    fn bit_width(&self) -> u8 {
+        let num_entries = self.uniques.len();
+        if num_entries <= 1 {
+            num_entries as u8
+        } else {
+            num_required_bits(num_entries as u64 - 1)
+        }
+    }
+}
+
+impl<T: DataType> Encoder<T> for DictEncoder<T> {
+    fn put(&mut self, values: &[T::T]) -> Result<()> {
+        for i in values {

Review Comment:
   Not sure if it's a bottleneck, but it might be faster to compute the hashes for `values` in one go (i.e. vectorized)?
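   The batching idea could be sketched like this: hash the whole slice up front, then probe the table in a second pass. A hypothetical illustration using std's `RandomState` in place of ahash (`hash_all` is an illustrative name, not code from this PR):
   
   ```rust
   use std::collections::hash_map::RandomState;
   use std::hash::{BuildHasher, Hash, Hasher};
   
   /// Compute all hashes for a slice in one pass, so the hashing loop
   /// is tight and separate from the table-probing loop.
   fn hash_all<T: Hash>(state: &RandomState, values: &[T]) -> Vec<u64> {
       values
           .iter()
           .map(|v| {
               let mut hasher = state.build_hasher();
               v.hash(&mut hasher);
               hasher.finish()
           })
           .collect()
   }
   
   fn main() {
       let state = RandomState::new();
       let hashes = hash_all(&state, &[1u64, 2, 1]);
       // Equal values hash equally under the same hasher state.
       assert_eq!(hashes[0], hashes[2]);
       assert_eq!(hashes.len(), 3);
   }
   ```
   
   Whether this wins in practice would depend on whether hashing or probing dominates; it mainly helps the compiler keep the hashing loop free of branches.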





[GitHub] [arrow-rs] codecov-commenter commented on pull request #2123: Faster parquet DictEncoder (~20%)

codecov-commenter commented on PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#issuecomment-1191973140

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2123?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2123](https://codecov.io/gh/apache/arrow-rs/pull/2123?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f945d1f) into [master](https://codecov.io/gh/apache/arrow-rs/commit/efd3152f85f182757138dd5984c9832c3257448c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (efd3152) will **decrease** coverage by `0.03%`.
   > The diff coverage is `89.58%`.
   
   > :exclamation: Current head f945d1f differs from pull request most recent head 5d8d756. Consider uploading reports for the commit 5d8d756 to get more accurate results
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2123      +/-   ##
   ==========================================
   - Coverage   83.73%   83.69%   -0.04%     
   ==========================================
     Files         225      225              
     Lines       59412    59481      +69     
   ==========================================
   + Hits        49748    49784      +36     
   - Misses       9664     9697      +33     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/2123?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/equal/decimal.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL2RlY2ltYWwucnM=) | `96.29% <0.00%> (-3.71%)` | :arrow_down: |
   | [arrow/src/array/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L21vZC5ycw==) | `100.00% <ø> (ø)` | |
   | [arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9maXhlZF9iaW5hcnkucnM=) | `68.18% <0.00%> (-6.82%)` | :arrow_down: |
   | [arrow/src/buffer/mutable.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2J1ZmZlci9tdXRhYmxlLnJz) | `89.13% <ø> (ø)` | |
   | [arrow/src/compute/kernels/temporal.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy90ZW1wb3JhbC5ycw==) | `94.10% <ø> (ø)` | |
   | [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `89.98% <0.00%> (+0.09%)` | :arrow_up: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `54.45% <0.00%> (-0.14%)` | :arrow_down: |
   | [arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy9yZWFkZXIucnM=) | `90.82% <ø> (ø)` | |
   | [integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aW50ZWdyYXRpb24tdGVzdGluZy9zcmMvbGliLnJz) | `0.00% <0.00%> (ø)` | |
   | [parquet/src/arrow/async\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXN5bmNfcmVhZGVyLnJz) | `0.00% <0.00%> (ø)` | |
   | ... and [51 more](https://codecov.io/gh/apache/arrow-rs/pull/2123/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [arrow-rs] tustvold merged pull request #2123: Faster parquet DictEncoder (~20%)

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123

