Posted to github@arrow.apache.org by "oleggator (via GitHub)" <gi...@apache.org> on 2023/10/19 13:51:30 UTC

[PR] Use btree to search fields in DFSchema [arrow-datafusion]

oleggator opened a new pull request, #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870

## Which issue does this PR close?

Part of #7698.

## Rationale for this change

The current DFSchema implementation stores its fields in a vector, so looking up a column by name requires a linear scan over all fields.

## What changes are included in this PR?

Use BTreeMap to index field qualifiers.

## Are these changes tested?
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
-->

## Are there any user-facing changes?
No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "crepererum (via GitHub)" <gi...@apache.org>.

crepererum commented on code in PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#discussion_r1369928872


##########
datafusion/common/src/dfschema.rs:
##########
@@ -102,8 +217,12 @@ impl DFSchema {
                 ));
             }
         }
+
+        let fields_index = build_index(&fields);

Review Comment:
   This use case might indeed be a good fit for interior mutability, i.e. use an `RwLock` and initialize the lookup table on first use.
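
A minimal sketch of the lazy-initialization idea in this comment, assuming a simplified schema that indexes plain field names. `LockedSchema`, its fields, and `positions` are hypothetical names for illustration and are not part of the PR or of DataFusion.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a schema: field names plus a lazily built index.
struct LockedSchema {
    fields: Vec<String>,
    /// `None` until the first lookup asks for the index.
    index: RwLock<Option<BTreeMap<String, Vec<usize>>>>,
}

impl LockedSchema {
    fn new(fields: Vec<String>) -> Self {
        Self { fields, index: RwLock::new(None) }
    }

    /// Look up field positions by name, building the index on first use.
    fn positions(&self, name: &str) -> Vec<usize> {
        // Fast path: the index already exists, so a shared read lock suffices.
        if let Some(index) = self.index.read().unwrap().as_ref() {
            return index.get(name).cloned().unwrap_or_default();
        }
        // Slow path: take the write lock and build the index, unless another
        // thread built it between the two lock acquisitions.
        let mut guard = self.index.write().unwrap();
        let index = guard.get_or_insert_with(|| {
            let mut map: BTreeMap<String, Vec<usize>> = BTreeMap::new();
            for (i, field) in self.fields.iter().enumerate() {
                map.entry(field.clone()).or_default().push(i);
            }
            map
        });
        index.get(name).cloned().unwrap_or_default()
    }
}

fn main() {
    let schema = LockedSchema::new(vec!["a".into(), "b".into(), "a".into()]);
    println!("{:?}", schema.positions("a"));
}
```

Schemas that are never searched by name pay only for an empty `RwLock`, at the cost of a lock acquisition on every lookup.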



Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "oleggator (via GitHub)" <gi...@apache.org>.

oleggator commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1781894246

   > Is there a reason to use a b-tree ( O(log n) ) vs a hash map ( O(1) )?
   
   With a b-tree we can query all fields matching a "prefix" in one O(log n) hop (`column.*.*.*`, `column.table.*.*`, `column.table.schema.*`, `column.table.schema.catalog`).
   This is used in the `fields_with_unqualified_name` method to find all fields with a given name.
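
The prefix query described above could look roughly like the following sketch. It assumes a simplified key of `(field name, optional table)` instead of the PR's `FieldReference`; the `Key` alias and both functions are hypothetical illustrations, not the PR's actual code.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified key: (field name, optional table qualifier).
/// Tuples order lexicographically and `None < Some(_)`, so all entries
/// sharing a field name sit next to each other in the map.
type Key = (String, Option<String>);

/// Build an index from key to the positions of the matching fields.
fn build_index(fields: &[(&str, Option<&str>)]) -> BTreeMap<Key, Vec<usize>> {
    let mut index: BTreeMap<Key, Vec<usize>> = BTreeMap::new();
    for (i, (name, table)) in fields.iter().enumerate() {
        index
            .entry((name.to_string(), table.map(str::to_string)))
            .or_default()
            .push(i);
    }
    index
}

/// Return the positions of all fields named `name`, whatever their qualifier.
fn fields_with_unqualified_name(index: &BTreeMap<Key, Vec<usize>>, name: &str) -> Vec<usize> {
    // (name, None) is the smallest possible key with this field name, so the
    // range starts at the first matching entry after one tree descent.
    let start: Key = (name.to_string(), None);
    index
        .range(start..)
        .take_while(|((n, _), _)| n.as_str() == name)
        .flat_map(|(_, positions)| positions.iter().copied())
        .collect()
}

fn main() {
    let fields = [
        ("id", Some("t1")),
        ("name", Some("t1")),
        ("id", Some("t2")),
        ("id", None),
    ];
    let index = build_index(&fields);
    println!("{:?}", fields_with_unqualified_name(&index, "id"));
}
```

A hash map keyed on the full reference would need one probe per candidate qualifier, whereas the ordered map finds the first match once and then walks the adjacent entries.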


Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "karlovnv (via GitHub)" <gi...@apache.org>.

karlovnv commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1792481415

   > Thank you -- I plan to review this more carefully tomorrow
   
   @alamb I think it would be a good idea to introduce a user-defined cache provider for both DFSchema and the arrow Schema. It would let us benefit from the b-tree while avoiding building it when it is not necessary.
   My assumption is that the user knows when a schema becomes invalid and can manage its invalidation in the cache.


Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1781787734

   Related comment: https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1781787244


Re: [PR] Use btree to search fields in DFSchema [datafusion]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #7870:
URL: https://github.com/apache/datafusion/pull/7870#issuecomment-2076177519

   Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org
For additional commands, e-mail: github-help@datafusion.apache.org

Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "crepererum (via GitHub)" <gi...@apache.org>.

crepererum commented on code in PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#discussion_r1372786446


##########
datafusion/common/src/dfschema.rs:
##########
@@ -102,8 +217,12 @@ impl DFSchema {
                 ));
             }
         }
+
+        let fields_index = build_index(&fields);

Review Comment:
   Thinking about this more: instead of `RwLock`, this can be solved even more elegantly w/ [`OnceLock::get_or_init`](https://doc.rust-lang.org/std/sync/struct.OnceLock.html#method.get_or_init) (this is usually used for static variables, but you can totally use that for struct members as well).
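
A minimal sketch of the `OnceLock` variant, assuming a simplified schema that indexes plain field names. `OnceSchema` and its members are hypothetical names for illustration. Equality is implemented by hand here because the derived `PartialEq` would also compare the `OnceLock` contents, so two otherwise identical schemas would differ once only one of them had built its index.

```rust
use std::collections::BTreeMap;
use std::sync::OnceLock;

/// Hypothetical stand-in for a schema with a lazily initialized index.
#[derive(Debug, Clone)]
struct OnceSchema {
    fields: Vec<String>,
    /// Empty until the first lookup; callers never lock explicitly.
    index: OnceLock<BTreeMap<String, Vec<usize>>>,
}

impl PartialEq for OnceSchema {
    /// Compare only the fields; the index is derived data.
    fn eq(&self, other: &Self) -> bool {
        self.fields == other.fields
    }
}

impl OnceSchema {
    fn new(fields: Vec<String>) -> Self {
        Self { fields, index: OnceLock::new() }
    }

    /// Return the index, building it on first access.
    fn index(&self) -> &BTreeMap<String, Vec<usize>> {
        self.index.get_or_init(|| {
            let mut map: BTreeMap<String, Vec<usize>> = BTreeMap::new();
            for (i, name) in self.fields.iter().enumerate() {
                map.entry(name.clone()).or_default().push(i);
            }
            map
        })
    }
}

fn main() {
    let schema = OnceSchema::new(vec!["a".into(), "b".into(), "a".into()]);
    println!("{:?}", schema.index().get("a"));
}
```

Compared with an `RwLock<Option<_>>`, the accessor returns a plain reference and later lookups do not take a lock guard.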



Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.

Weijun-H commented on code in PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#discussion_r1374337099


##########
datafusion/common/src/dfschema.rs:
##########
@@ -35,11 +38,122 @@ use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// [`FieldReference`]s represent a multi part identifier (path) to a
+/// field that may require further resolution.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct FieldReference<'a> {
+    /// The field name
+    name: Cow<'a, str>,
+    /// Optional qualifier (usually a table or relation name)
+    qualifier: Option<TableReference<'a>>,
+}
+
+impl<'a> PartialOrd for FieldReference<'a> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<'a> Ord for FieldReference<'a> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        if self == other {
+            return Ordering::Equal;
+        }
+
+        match self.field().cmp(other.field()) {
+            Ordering::Less => return Ordering::Less,
+            Ordering::Greater => return Ordering::Greater,
+            Ordering::Equal => {}
+        }
+
+        match (self.table(), other.table()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.schema(), other.schema()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },

Review Comment:
   ```suggestion
               (Some(lhs), Some(rhs)) => {
                   let cmp = lhs.cmp(rhs);
                   if cmp != Ordering::Equal {
                       return cmp;
                   }
               }
   ```



##########
datafusion/common/src/dfschema.rs:
##########
@@ -35,11 +38,122 @@ use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// [`FieldReference`]s represent a multi part identifier (path) to a
+/// field that may require further resolution.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct FieldReference<'a> {
+    /// The field name
+    name: Cow<'a, str>,
+    /// Optional qualifier (usually a table or relation name)
+    qualifier: Option<TableReference<'a>>,
+}
+
+impl<'a> PartialOrd for FieldReference<'a> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<'a> Ord for FieldReference<'a> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        if self == other {
+            return Ordering::Equal;
+        }
+
+        match self.field().cmp(other.field()) {
+            Ordering::Less => return Ordering::Less,
+            Ordering::Greater => return Ordering::Greater,
+            Ordering::Equal => {}
+        }
+
+        match (self.table(), other.table()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.schema(), other.schema()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.catalog(), other.catalog()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },

Review Comment:
   ```suggestion
               (Some(lhs), Some(rhs)) => {
                   let cmp = lhs.cmp(rhs);
                   if cmp != Ordering::Equal {
                       return cmp;
                   }
               }
   ```



##########
datafusion/common/src/dfschema.rs:
##########
@@ -35,11 +38,122 @@ use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// [`FieldReference`]s represent a multi part identifier (path) to a
+/// field that may require further resolution.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct FieldReference<'a> {
+    /// The field name
+    name: Cow<'a, str>,
+    /// Optional qualifier (usually a table or relation name)
+    qualifier: Option<TableReference<'a>>,
+}
+
+impl<'a> PartialOrd for FieldReference<'a> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<'a> Ord for FieldReference<'a> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        if self == other {
+            return Ordering::Equal;
+        }
+
+        match self.field().cmp(other.field()) {
+            Ordering::Less => return Ordering::Less,
+            Ordering::Greater => return Ordering::Greater,
+            Ordering::Equal => {}
+        }
+
+        match (self.table(), other.table()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.schema(), other.schema()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.catalog(), other.catalog()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        Ordering::Equal
+    }
+}
+
+/// This is a [`FieldReference`] that has 'static lifetime (aka it
+/// owns the underlying strings)
+type OwnedFieldReference = FieldReference<'static>;
+
+impl<'a> FieldReference<'a> {
+    /// Convenience method for creating a [`FieldReference`].
+    pub fn new(
+        name: impl Into<Cow<'a, str>>,
+        qualifier: Option<TableReference<'a>>,
+    ) -> Self {
+        Self {
+            name: name.into(),
+            qualifier,
+        }
+    }
+
+    /// Compare with another [`FieldReference`] as if both are resolved.
+    /// This allows comparing across variants, where if a field is not present
+    /// in both variants being compared then it is ignored in the comparison.
+    pub fn resolved_eq(&self, other: &Self) -> bool {
+        self.name == other.name
+            && match (&self.qualifier, &other.qualifier) {
+                (Some(lhs), Some(rhs)) => lhs.resolved_eq(rhs),
+                _ => true,
+            }
+    }
+
+    fn field(&self) -> &str {
+        &self.name
+    }
+
+    fn table(&self) -> Option<&str> {
+        self.qualifier.as_ref().map(|q| q.table())
+    }
+
+    fn schema(&self) -> Option<&str> {
+        self.qualifier.as_ref().and_then(|q| q.schema())
+    }
+
+    fn catalog(&self) -> Option<&str> {
+        self.qualifier.as_ref().and_then(|q| q.catalog())
+    }
+}
+
 /// DFSchema wraps an Arrow schema and adds relation names
 #[derive(Debug, Clone, PartialEq, Eq)]
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Fields index
+    fields_index: BTreeMap<OwnedFieldReference, Vec<usize>>,

Review Comment:
   I think we use BTree here because it ensures the order of the index 🤔.



##########
datafusion/common/src/dfschema.rs:
##########
@@ -35,11 +38,122 @@ use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// [`FieldReference`]s represent a multi part identifier (path) to a
+/// field that may require further resolution.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct FieldReference<'a> {
+    /// The field name
+    name: Cow<'a, str>,
+    /// Optional qualifier (usually a table or relation name)
+    qualifier: Option<TableReference<'a>>,
+}
+
+impl<'a> PartialOrd for FieldReference<'a> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<'a> Ord for FieldReference<'a> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        if self == other {
+            return Ordering::Equal;
+        }
+
+        match self.field().cmp(other.field()) {
+            Ordering::Less => return Ordering::Less,
+            Ordering::Greater => return Ordering::Greater,
+            Ordering::Equal => {}
+        }

Review Comment:
   ```suggestion
           let field_cmp = self.field().cmp(other.field());
           if field_cmp != Ordering::Equal {
               return field_cmp;
           }
   ```



##########
datafusion/common/src/dfschema.rs:
##########
@@ -35,11 +38,122 @@ use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// [`FieldReference`]s represent a multi part identifier (path) to a
+/// field that may require further resolution.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct FieldReference<'a> {
+    /// The field name
+    name: Cow<'a, str>,
+    /// Optional qualifier (usually a table or relation name)
+    qualifier: Option<TableReference<'a>>,
+}
+
+impl<'a> PartialOrd for FieldReference<'a> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<'a> Ord for FieldReference<'a> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        if self == other {
+            return Ordering::Equal;
+        }
+
+        match self.field().cmp(other.field()) {
+            Ordering::Less => return Ordering::Less,
+            Ordering::Greater => return Ordering::Greater,
+            Ordering::Equal => {}
+        }
+
+        match (self.table(), other.table()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },

Review Comment:
   ```suggestion
               (Some(lhs), Some(rhs)) => {
                   let cmp = lhs.cmp(rhs);
                   if cmp != Ordering::Equal {
                       return cmp;
                   }
               }
   ```
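
As an alternative to the per-arm early returns suggested above, the whole comparison might be chained with `Ordering::then_with`. This is only a sketch on a hypothetical, flattened `SimpleRef` type rather than the PR's `FieldReference`; it relies on the standard ordering of `Option`, where `None` sorts before `Some(_)`, which matches the explicit `(None, Some(_)) => Ordering::Less` arms in the diff.

```rust
use std::cmp::Ordering;

/// Hypothetical, flattened reference: field name plus optional qualifier parts.
#[derive(Debug, PartialEq, Eq)]
struct SimpleRef<'a> {
    field: &'a str,
    table: Option<&'a str>,
    schema: Option<&'a str>,
    catalog: Option<&'a str>,
}

impl Ord for SimpleRef<'_> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare part by part; each later comparison runs only if all
        // earlier ones returned `Ordering::Equal`.
        self.field
            .cmp(other.field)
            .then_with(|| self.table.cmp(&other.table))
            .then_with(|| self.schema.cmp(&other.schema))
            .then_with(|| self.catalog.cmp(&other.catalog))
    }
}

impl PartialOrd for SimpleRef<'_> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let unqualified = SimpleRef { field: "id", table: None, schema: None, catalog: None };
    let qualified = SimpleRef { field: "id", table: Some("t"), schema: None, catalog: None };
    println!("{:?}", unqualified.cmp(&qualified));
}
```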



Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#discussion_r1369249398


##########
datafusion/common/src/dfschema.rs:
##########
@@ -102,8 +217,12 @@ impl DFSchema {
                 ));
             }
         }
+
+        let fields_index = build_index(&fields);

Review Comment:
   If the index is built for all DFSchema that are created, I wonder if that will be too much overhead. Maybe we could consider creating it on first use 🤔  or finding some way to canonicalize / cache the map



Re: [PR] Use btree to search fields in DFSchema [datafusion]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed pull request #7870: Use btree to search fields in DFSchema
URL: https://github.com/apache/datafusion/pull/7870


Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "crepererum (via GitHub)" <gi...@apache.org>.

crepererum commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1775469244

   Is there a reason to use a b-tree ( $\mathrm{O}(\log{n})$ ) vs a hash map ( $\mathrm{O}(1)$ )?


Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.

Weijun-H commented on code in PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#discussion_r1374412774


##########
datafusion/common/src/dfschema.rs:
##########
@@ -35,11 +38,122 @@ use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// [`FieldReference`]s represent a multi part identifier (path) to a
+/// field that may require further resolution.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct FieldReference<'a> {
+    /// The field name
+    name: Cow<'a, str>,
+    /// Optional qualifier (usually a table or relation name)
+    qualifier: Option<TableReference<'a>>,
+}
+
+impl<'a> PartialOrd for FieldReference<'a> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<'a> Ord for FieldReference<'a> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        if self == other {
+            return Ordering::Equal;
+        }
+
+        match self.field().cmp(other.field()) {
+            Ordering::Less => return Ordering::Less,
+            Ordering::Greater => return Ordering::Greater,
+            Ordering::Equal => {}
+        }
+
+        match (self.table(), other.table()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.schema(), other.schema()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.catalog(), other.catalog()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        Ordering::Equal
+    }
+}
+
+/// This is a [`FieldReference`] that has 'static lifetime (aka it
+/// owns the underlying strings)
+type OwnedFieldReference = FieldReference<'static>;
+
+impl<'a> FieldReference<'a> {
+    /// Convenience method for creating a [`FieldReference`].
+    pub fn new(
+        name: impl Into<Cow<'a, str>>,
+        qualifier: Option<TableReference<'a>>,
+    ) -> Self {
+        Self {
+            name: name.into(),
+            qualifier,
+        }
+    }
+
+    /// Compare with another [`FieldReference`] as if both are resolved.
+    /// This allows comparing across variants, where if a field is not present
+    /// in both variants being compared then it is ignored in the comparison.
+    pub fn resolved_eq(&self, other: &Self) -> bool {
+        self.name == other.name
+            && match (&self.qualifier, &other.qualifier) {
+                (Some(lhs), Some(rhs)) => lhs.resolved_eq(rhs),
+                _ => true,
+            }
+    }
+
+    fn field(&self) -> &str {
+        &self.name
+    }
+
+    fn table(&self) -> Option<&str> {
+        self.qualifier.as_ref().map(|q| q.table())
+    }
+
+    fn schema(&self) -> Option<&str> {
+        self.qualifier.as_ref().and_then(|q| q.schema())
+    }
+
+    fn catalog(&self) -> Option<&str> {
+        self.qualifier.as_ref().and_then(|q| q.catalog())
+    }
+}
+
 /// DFSchema wraps an Arrow schema and adds relation names
 #[derive(Debug, Clone, PartialEq, Eq)]
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Fields index
+    fields_index: BTreeMap<OwnedFieldReference, Vec<usize>>,

Review Comment:
   Fair enough



Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "crepererum (via GitHub)" <gi...@apache.org>.

crepererum commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1782686355

   > > Is there a reason to use a b-tree ( O(log n) ) vs a hash map ( O(1) )?
   > 
   > Using b-tree we can query all fields matching to a "prefix" in one O(log n) hop (`column.*.*.*`, `column.table.*.*`, `column.table.schema.*`, `column.table.schema.catalog`). It is used in `fields_with_unqualified_name` method to query all fields by specific name.
   
   Is that such a common operation that it is worth keeping an expensive index on every single schema in the query graph? I think the planner that resolves these names can easily order the fields and build this index locally.
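
One way to read "order the fields and build this index locally" is a sorted list of field positions that is searched with binary search. The sketch below assumes unqualified names only; `sorted_positions` and `lookup` are hypothetical helpers, not existing DataFusion functions.

```rust
/// Sort field positions by name once, so lookups can use binary search.
fn sorted_positions(names: &[&str]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..names.len()).collect();
    // The sort is stable, so equal names keep their original relative order.
    order.sort_by_key(|&i| names[i]);
    order
}

/// Return the positions of all fields called `name`.
fn lookup<'a>(names: &[&str], order: &'a [usize], name: &str) -> &'a [usize] {
    // Both bounds are found by binary search over the sorted positions.
    let lo = order.partition_point(|&i| names[i] < name);
    let hi = order.partition_point(|&i| names[i] <= name);
    &order[lo..hi]
}

fn main() {
    let names = ["id", "name", "id", "age"];
    let order = sorted_positions(&names);
    println!("{:?}", lookup(&names, &order, "id"));
}
```

The sorted vector lives only where names are resolved, so schemas elsewhere in the plan carry no extra state.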


Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1780126159

   I plan to review this and related PRs tomorrow morning



Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1788115134

   Thank you -- I plan to review this more carefully tomorrow



Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "crepererum (via GitHub)" <gi...@apache.org>.

crepererum commented on code in PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#discussion_r1374379563


##########
datafusion/common/src/dfschema.rs:
##########
@@ -35,11 +38,122 @@ use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// [`FieldReference`]s represent a multi part identifier (path) to a
+/// field that may require further resolution.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct FieldReference<'a> {
+    /// The field name
+    name: Cow<'a, str>,
+    /// Optional qualifier (usually a table or relation name)
+    qualifier: Option<TableReference<'a>>,
+}
+
+impl<'a> PartialOrd for FieldReference<'a> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<'a> Ord for FieldReference<'a> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        if self == other {
+            return Ordering::Equal;
+        }
+
+        match self.field().cmp(other.field()) {
+            Ordering::Less => return Ordering::Less,
+            Ordering::Greater => return Ordering::Greater,
+            Ordering::Equal => {}
+        }
+
+        match (self.table(), other.table()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.schema(), other.schema()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        match (self.catalog(), other.catalog()) {
+            (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
+                Ordering::Less => return Ordering::Less,
+                Ordering::Greater => return Ordering::Greater,
+                Ordering::Equal => {}
+            },
+            (Some(_), None) => return Ordering::Greater,
+            (None, Some(_)) => return Ordering::Less,
+            _ => {}
+        }
+
+        Ordering::Equal
+    }
+}
+
+/// This is a [`FieldReference`] that has 'static lifetime (aka it
+/// owns the underlying strings)
+type OwnedFieldReference = FieldReference<'static>;
+
+impl<'a> FieldReference<'a> {
+    /// Convenience method for creating a [`FieldReference`].
+    pub fn new(
+        name: impl Into<Cow<'a, str>>,
+        qualifier: Option<TableReference<'a>>,
+    ) -> Self {
+        Self {
+            name: name.into(),
+            qualifier,
+        }
+    }
+
+    /// Compare with another [`FieldReference`] as if both are resolved.
+    /// This allows comparing across variants, where if a field is not present
+    /// in both variants being compared then it is ignored in the comparison.
+    pub fn resolved_eq(&self, other: &Self) -> bool {
+        self.name == other.name
+            && match (&self.qualifier, &other.qualifier) {
+                (Some(lhs), Some(rhs)) => lhs.resolved_eq(rhs),
+                _ => true,
+            }
+    }
+
+    fn field(&self) -> &str {
+        &self.name
+    }
+
+    fn table(&self) -> Option<&str> {
+        self.qualifier.as_ref().map(|q| q.table())
+    }
+
+    fn schema(&self) -> Option<&str> {
+        self.qualifier.as_ref().and_then(|q| q.schema())
+    }
+
+    fn catalog(&self) -> Option<&str> {
+        self.qualifier.as_ref().and_then(|q| q.catalog())
+    }
+}
+
 /// DFSchema wraps an Arrow schema and adds relation names
 #[derive(Debug, Clone, PartialEq, Eq)]
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Fields index
+    fields_index: BTreeMap<OwnedFieldReference, Vec<usize>>,

Review Comment:
   Why do you care about the index order? You either iterate over the fields in order (use `self.fields.iter()`) or you look up a field by name (use `self.fields_index.get(...)`). The index is ordered by field name, so this argument would only be valid if we OFTEN iterated over the fields in name order, which I don't think we do.
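
The alternative implied by this comment can be sketched as follows: keep the fields in a `Vec` for ordered iteration and use an unordered hash map only for name lookups. `IndexedFields` and its methods are hypothetical names for this illustration and the key is reduced to the unqualified field name; this is not the PR's code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema: names stay in a `Vec` in declaration
/// order, and a hash map answers lookups by name.
struct IndexedFields {
    names: Vec<String>,
    by_name: HashMap<String, Vec<usize>>,
}

impl IndexedFields {
    fn new(names: Vec<String>) -> Self {
        let mut by_name: HashMap<String, Vec<usize>> = HashMap::new();
        for (pos, name) in names.iter().enumerate() {
            by_name.entry(name.clone()).or_default().push(pos);
        }
        Self { names, by_name }
    }

    /// Positions of all fields with this name (empty if there are none).
    fn positions(&self, name: &str) -> &[usize] {
        self.by_name.get(name).map(Vec::as_slice).unwrap_or(&[])
    }

    /// Ordered iteration goes through the vector and never touches the index.
    fn iter_in_order(&self) -> impl Iterator<Item = &str> {
        self.names.iter().map(String::as_str)
    }
}

fn main() {
    let fields = IndexedFields::new(vec!["a".into(), "b".into(), "a".into()]);
    println!("{:?}", fields.positions("a"));
    println!("{:?}", fields.iter_in_order().collect::<Vec<_>>());
}
```

Neither access path depends on the index being sorted, which is the point of the question. The sketch does not cover qualified lookups (table, schema, catalog); those would need a richer key or an extra filtering step.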




Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "oleggator (via GitHub)" <gi...@apache.org>.

oleggator commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1782873654

   Made a [benchmark](https://github.com/apache/arrow-datafusion/pull/7948)
   
   ## Baseline - DataFusion 32 (a0c5affca271d67980286cb2ae08ea8eec75a326)
   ```
   index_of_column_by_name 10
                           time:   [11.323 ns 11.325 ns 11.328 ns]
                           change: [-0.0714% +0.3045% +0.6180%] (p = 0.09 > 0.05)
                           No change in performance detected.
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) low mild
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   index_of_column_by_name 20
                           time:   [4.1947 ns 4.1963 ns 4.1981 ns]
                           change: [-2.1038% -1.5880% -1.2714%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   index_of_column_by_name 50
                           time:   [34.841 ns 34.851 ns 34.871 ns]
                           change: [-0.2590% -0.1783% -0.0774%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 13 outliers among 100 measurements (13.00%)
     1 (1.00%) low severe
     4 (4.00%) low mild
     5 (5.00%) high mild
     3 (3.00%) high severe
   
   index_of_column_by_name 100
                           time:   [88.736 ns 88.927 ns 89.119 ns]
                           change: [+4.6597% +5.0086% +5.3786%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low mild
     4 (4.00%) high mild
   
   index_of_column_by_name 500
                           time:   [403.20 ns 403.70 ns 404.29 ns]
                           change: [+1.5771% +1.6483% +1.7326%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low severe
     3 (3.00%) low mild
     4 (4.00%) high severe
   
   index_of_column_by_name 1000
                           time:   [909.73 ns 910.11 ns 910.48 ns]
                           change: [-2.0626% -1.6648% -1.3588%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   DFSchema::new 10        time:   [328.91 ns 329.14 ns 329.38 ns]
                           change: [-0.8652% -0.8013% -0.7418%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   
   DFSchema::new 20        time:   [725.37 ns 725.93 ns 726.56 ns]
                           change: [+0.4542% +0.5177% +0.5841%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   DFSchema::new 50        time:   [1.6864 µs 1.6892 µs 1.6924 µs]
                           change: [+1.3382% +1.4765% +1.6362%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   DFSchema::new 100       time:   [3.4953 µs 3.4965 µs 3.4978 µs]
                           change: [-3.4655% -3.2889% -3.1317%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     1 (1.00%) low severe
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   DFSchema::new 500       time:   [23.470 µs 23.477 µs 23.485 µs]
                           change: [-1.8427% -1.7821% -1.7253%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   DFSchema::new 1000      time:   [45.504 µs 45.515 µs 45.528 µs]
                           change: [-2.8088% -2.6555% -2.4933%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     4 (4.00%) high mild
     1 (1.00%) high severe
   
   cargo bench  172.06s user 0.50s system 153% cpu 1:52.07 total
   ```
   
   ## This PR
   ```
   index_of_column_by_name 10
                           time:   [33.607 ns 33.663 ns 33.717 ns]
                           change: [+196.44% +196.92% +197.41%] (p = 0.00 < 0.05)
                           Performance has regressed.
   
   index_of_column_by_name 20
                           time:   [21.509 ns 21.522 ns 21.535 ns]
                           change: [+412.46% +412.90% +413.42%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) low mild
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   index_of_column_by_name 50
                           time:   [43.590 ns 43.651 ns 43.713 ns]
                           change: [+24.956% +25.143% +25.325%] (p = 0.00 < 0.05)
                           Performance has regressed.
   
   index_of_column_by_name 100
                           time:   [68.349 ns 68.373 ns 68.401 ns]
                           change: [-23.444% -23.221% -22.998%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low severe
     2 (2.00%) low mild
     4 (4.00%) high mild
     1 (1.00%) high severe
   
   index_of_column_by_name 500
                           time:   [65.428 ns 65.444 ns 65.461 ns]
                           change: [-83.785% -83.768% -83.752%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     2 (2.00%) low severe
     1 (1.00%) low mild
     4 (4.00%) high mild
     3 (3.00%) high severe
   
   index_of_column_by_name 1000
                           time:   [74.167 ns 74.174 ns 74.183 ns]
                           change: [-91.855% -91.850% -91.844%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low severe
     1 (1.00%) low mild
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   DFSchema::new 10        time:   [956.63 ns 957.20 ns 957.81 ns]
                           change: [+190.77% +191.00% +191.28%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   DFSchema::new 20        time:   [2.4375 µs 2.4384 µs 2.4393 µs]
                           change: [+235.82% +236.06% +236.36%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 7 outliers among 100 measurements (7.00%)
     4 (4.00%) low mild
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   DFSchema::new 50        time:   [6.5247 µs 6.5275 µs 6.5303 µs]
                           change: [+287.52% +288.07% +288.63%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 4 outliers among 100 measurements (4.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   DFSchema::new 100       time:   [15.298 µs 15.330 µs 15.368 µs]
                           change: [+337.14% +340.86% +347.06%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 15 outliers among 100 measurements (15.00%)
     4 (4.00%) low mild
     6 (6.00%) high mild
     5 (5.00%) high severe
   
   DFSchema::new 500       time:   [92.211 µs 92.284 µs 92.361 µs]
                           change: [+292.82% +293.14% +293.47%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) low mild
   
   DFSchema::new 1000      time:   [204.70 µs 204.87 µs 205.05 µs]
                           change: [+349.22% +349.78% +350.32%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   cargo bench  252.05s user 1.60s system 150% cpu 2:48.82 total
   
   ```



Re: [PR] Use btree to search fields in DFSchema [arrow-datafusion]

Posted by "karlovnv (via GitHub)" <gi...@apache.org>.

karlovnv commented on PR #7870:
URL: https://github.com/apache/arrow-datafusion/pull/7870#issuecomment-1787925244

   > Made a [benchmark](https://github.com/apache/arrow-datafusion/pull/7948).
   
   Could you please add a summary?
   
   It seems that the b-tree provides an advantage for lookups with 100+ columns.
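
A rough summary can be derived from the median times in the benchmark output quoted in this thread. The sketch below copies those medians (converted to nanoseconds) and computes the PR-to-baseline ratio per column count; `Row`, `LOOKUP`, `CONSTRUCT` and `first_faster` are names made up for this illustration. Only the six measured sizes are available, so the break-even point is merely bracketed between 50 and 100 columns.

```rust
/// One benchmark row: column count, baseline median and PR median, in ns.
struct Row {
    columns: u32,
    baseline_ns: f64,
    pr_ns: f64,
}

impl Row {
    /// PR time divided by baseline time; below 1.0 means the PR is faster.
    fn ratio(&self) -> f64 {
        self.pr_ns / self.baseline_ns
    }
}

/// Medians for `index_of_column_by_name`, copied from the benchmark output.
const LOOKUP: [Row; 6] = [
    Row { columns: 10, baseline_ns: 11.325, pr_ns: 33.663 },
    Row { columns: 20, baseline_ns: 4.1963, pr_ns: 21.522 },
    Row { columns: 50, baseline_ns: 34.851, pr_ns: 43.651 },
    Row { columns: 100, baseline_ns: 88.927, pr_ns: 68.373 },
    Row { columns: 500, baseline_ns: 403.70, pr_ns: 65.444 },
    Row { columns: 1000, baseline_ns: 910.11, pr_ns: 74.174 },
];

/// Medians for `DFSchema::new`, copied from the benchmark output.
const CONSTRUCT: [Row; 6] = [
    Row { columns: 10, baseline_ns: 329.14, pr_ns: 957.20 },
    Row { columns: 20, baseline_ns: 725.93, pr_ns: 2438.4 },
    Row { columns: 50, baseline_ns: 1689.2, pr_ns: 6527.5 },
    Row { columns: 100, baseline_ns: 3496.5, pr_ns: 15330.0 },
    Row { columns: 500, baseline_ns: 23477.0, pr_ns: 92284.0 },
    Row { columns: 1000, baseline_ns: 45515.0, pr_ns: 204870.0 },
];

/// Smallest measured column count at which the PR beats the baseline.
fn first_faster(rows: &[Row]) -> Option<u32> {
    rows.iter().find(|r| r.ratio() < 1.0).map(|r| r.columns)
}

fn main() {
    for (label, rows) in [("index_of_column_by_name", &LOOKUP), ("DFSchema::new", &CONSTRUCT)] {
        println!("{}", label);
        for r in rows.iter() {
            println!("  {:>5} cols: PR takes {:.2}x the baseline time", r.columns, r.ratio());
        }
        println!("  first size where PR is faster: {:?}", first_faster(rows));
    }
}
```

On these numbers, lookups are slower with the b-tree up to 50 columns and faster from 100 columns on, reaching roughly 12x faster at 1000 columns, while `DFSchema::new` is about 2.9x to 4.5x slower at every measured size.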

