You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/15 13:09:15 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #1882: Remove `indexmap` dependency

alamb opened a new issue, #1882:
URL: https://github.com/apache/arrow-rs/issues/1882

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   We are using the [indexmap](https://crates.io/crates/indexmap) for a relatively small usecase in the json reader to preserve iteration order
   
   
   This version isn't keeping up with dependencies (ses discussion on  https://github.com/apache/arrow-rs/pull/1861#discussion_r895649049)
   
   
   **Describe the solution you'd like**
   I would like to remove the use of `indexmap` entirely and streamline the arrow dependency chain
   
   Fascinatingly when I apply the following diff all the tests pass (though I don't think it is entirely correct as  `BTreeSet` and `BTreeMap` do not actually preserve the insert order) so there is clearly a gap in test coverage too
   
   
   ```diff
   diff --git a/arrow/Cargo.toml b/arrow/Cargo.toml
   index b59a697538..e0008abde4 100644
   --- a/arrow/Cargo.toml
   +++ b/arrow/Cargo.toml
   @@ -41,7 +41,6 @@ bench = false
    serde = { version = "1.0", default-features = false }
    serde_derive = { version = "1.0", default-features = false }
    serde_json = { version = "1.0", default-features = false, features = ["preserve_order"] }
   -indexmap = { version = "1.6", default-features = false, features = ["std"] }
    rand = { version = "0.8", default-features = false, features =  ["std", "std_rng"], optional = true }
    num = { version = "0.4", default-features = false, features = ["std"] }
    half = { version = "1.8", default-features = false }
   diff --git a/arrow/src/json/reader.rs b/arrow/src/json/reader.rs
   index e1fa54f8a6..f7384c7e5b 100644
   --- a/arrow/src/json/reader.rs
   +++ b/arrow/src/json/reader.rs
   @@ -50,8 +50,8 @@
    use std::io::{BufRead, BufReader, Read, Seek, SeekFrom};
    use std::sync::Arc;
    
   -use indexmap::map::IndexMap as HashMap;
   -use indexmap::set::IndexSet as HashSet;
   +use std::collections::BTreeMap as HashMap;
   +use std::collections::BTreeSet as HashSet;
    use serde_json::json;
    use serde_json::{map::Map as JsonMap, Value};
    
   ```
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features you've considered.
   
   **Additional context**
   This code and dependency appears to have come in originally via 8ef1fb8745635369be815304180de833c31de870 / https://github.com/apache/arrow/pull/3702 from(@nevi-me)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #1882: Remove `indexmap` dependency

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #1882: Remove `indexmap` dependency
URL: https://github.com/apache/arrow-rs/issues/1882


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on issue #1882: Remove `indexmap` dependency

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1882:
URL: https://github.com/apache/arrow-rs/issues/1882#issuecomment-1169650394

   FYI I managed to solve my cyclic dependency issue in another way, by removing `ahash` and by excluding a load testing tool from the workspace which transitively activated the `js` feature of `getrandom`.
   
   My attempt at removing the `indexmap` dependency started a bit more complicated, since I also tried to remove the set in `InferredType::Scalar` and instead eagerly coerce the data types. The simpler approach by @nevi-me looks fine and probably the field order does not actually need to be guaranteed when inferring a schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #1882: Remove `indexmap` dependency

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1882:
URL: https://github.com/apache/arrow-rs/issues/1882#issuecomment-1167950378

   @jhorstmann  notes he may try this issue -- more context on https://lists.apache.org/thread/qxt3qv5pv8tfy4cj6jgp8gl90k3rr562 / https://github.com/apache/arrow-rs/pull/1929


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1882: Remove `indexmap` dependency

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1882:
URL: https://github.com/apache/arrow-rs/issues/1882#issuecomment-1169787390

   Given the sheer number of things that depend on indexmap (#1961), and that it no longer brings in an old version of hash brown, I wonder if this is still worth the effort of pursuing? If it were some esoteric crate with questionable maintenance sure, but I'm not sure that is the case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] nevi-me commented on issue #1882: Remove `indexmap` dependency

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #1882:
URL: https://github.com/apache/arrow-rs/issues/1882#issuecomment-1169528840

   @alamb there's indeed a coverage gap. If I change the column "a" to "f", and run the tests, the schema order changes, with the fields ordered alphabetically with `BTreeMap`.
   
   From what I can see, this change should be safe, as users should be looking up columns in a record batch by schema index, and not the column order of the input JSON data.
   
   It would be hard to see someone relying on some column order for JSON though, because what if the first record doesn't have some field (`b, c, d`) and the next one has (`a, e, f`)? So I think preserving field order might not that big a requirement.
   
   The failures are because we're testing field orders, and changing the input JSON reveals the difference.
   
   The test `test_basic_json` has the following:
   
   ```rust
   let a = schema.column_with_name("a").unwrap();
   # check that the column index is 0
   assert_eq!(0, a.0);
   ```
   
   This fails after changing column "a" to "f"
   
   ```diff
   let f = schema.column_with_name("f").unwrap();
   # check that the column index is 0
   - assert_eq!(0, f.0);
   + assert_eq!(4, f.0);
   ```
   
   So it looks like a safe solution is to assert that the field does exists, and not its order in the schema.
   
   By aliasing `IndexMap as HashMap`, we mistakenly leaked this type in a few public functions, e.g. `ReaderBuilder::with_format_strings`.
   Changing them back to `std::collections::HashMap` suffices, but this introduces a breaking change in the API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on issue #1882: Remove `indexmap` dependency

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1882:
URL: https://github.com/apache/arrow-rs/issues/1882#issuecomment-1156682605

   I'm in favor of removing it. I think it still comes in as a transitive dependency via the `preserve_order` feature, so that would need to be removed too.
   
   We are also pinned to an old version because of <https://github.com/tkaitchuck/aHash/issues/95> so removing it might break that dependency cycle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #1882: Remove `indexmap` dependency

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1882:
URL: https://github.com/apache/arrow-rs/issues/1882#issuecomment-1170604270

   If the dependency is hard to remove and it isn't doing much harm then I agree the priority of the work might be lower? 
   
   In general I think the fewer dependencies the better (for precisely the reason that removing them in the future is really hard) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1882: Remove `indexmap` dependency

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1882:
URL: https://github.com/apache/arrow-rs/issues/1882#issuecomment-1159378665

   A new release has been cut which updates some dependencies, including hashbrown https://github.com/bluss/indexmap/pull/231


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org