Posted to issues@arrow.apache.org by "lidavidm (via GitHub)" <gi...@apache.org> on 2023/05/17 13:01:18 UTC

[GitHub] [arrow-adbc] lidavidm opened a new issue, #685: format: add statistics for tables, columns, queries, etc.

lidavidm opened a new issue, #685:
URL: https://github.com/apache/arrow-adbc/issues/685

   More research is needed on what systems typically support.
   
   This would make ADBC more useful in situations where it supplies data to other systems, since then those systems could query statistics using a standard interface and integrate them into query planning. (Interestingly, Spark at least doesn't seem to have this in DataSourceV2 - I suppose the smarts are directly in their JDBC support.)
   
   Examples:
   - Estimated result set size for queries (basically, a semi-standardized EXPLAIN (ANALYZE))
   - Parquet-style min/max/null % statistics per column


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1587760266

   Possibly: add `ConnectionGetStatistics` whose result set schema is something like
   
   ```
   - catalog
   - db_schema
   - table_name
   - statistic_type: min, max, null percentage, row count, ndv
   - column_name
   - value (union of int64, double, uint64, string, binary?)
   ```
   
   so for a table, you would have one row per column per statistic type, plus one row per statistic type for table-wide statistics (row count, ndv)
   
   (do we want to/how would we account for partitions, in the sense of Flight SQL, etc.?)
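
   The schema sketch above can be rendered concretely; a minimal pyarrow sketch, assuming a dense union for the polymorphic `value` column (the field set and names here are illustrative, not part of any spec):

   ```python
   import pyarrow as pa

   # Dense union standing in for Arrow's missing 'any' type.
   value_type = pa.dense_union([
       pa.field("int64", pa.int64()),
       pa.field("float64", pa.float64()),
       pa.field("uint64", pa.uint64()),
       pa.field("string", pa.string()),
       pa.field("binary", pa.binary()),
   ])

   statistics_schema = pa.schema([
       pa.field("catalog", pa.string()),
       pa.field("db_schema", pa.string()),
       pa.field("table_name", pa.string(), nullable=False),
       pa.field("statistic_type", pa.string(), nullable=False),
       # Null when the statistic applies to the whole table.
       pa.field("column_name", pa.string()),
       pa.field("value", value_type),
   ])
   ```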




[GitHub] [arrow-adbc] adamkennedy commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "adamkennedy (via GitHub)" <gi...@apache.org>.
adamkennedy commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1599450929

   In addition to Calcite, Hive contains a fairly decent statistics set, especially since it has column statistics as well as table statistics.




[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1600860890

   The [proposal](https://github.com/apache/arrow-adbc/pull/765) was updated to include min/max value and max byte width as standardized statistics. Digging into the Hive code, histograms are implemented but top/bottom K never were. The proposal allows for backends to return custom statistics so Hive/Postgres could still encode histograms (the encoding with Arrow gets very messy, however, given the lack of an 'any' type; they would have to pack the histogram values into a binary column).
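
   As a sketch of that binary-column workaround (a hypothetical encoding that driver and consumer would both have to agree on; little-endian float64 bucket boundaries is just one possibility):

   ```python
   import struct

   def pack_histogram(bounds):
       """Pack histogram bucket boundaries into a single binary blob
       (little-endian float64s) for a driver-specific statistic value."""
       return struct.pack(f"<{len(bounds)}d", *bounds)

   def unpack_histogram(blob):
       """Inverse of pack_histogram: recover the bucket boundaries."""
       n = len(blob) // 8
       return list(struct.unpack(f"<{n}d", blob))
   ```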




[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1587760644

   TODO: need to cross-compare against JDBC, PostgreSQL, SQL Server, etc. 




[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1601527203

   Looking at JDBC drivers:
   
   - PostgreSQL basically just gives you the row count and page count for columns with indices on them (including the primary key). The debugger shows that it executes this query internally:
   
     ```sql
     SELECT 
       tmp.TABLE_CAT, 
       tmp.TABLE_SCHEM, 
       tmp.TABLE_NAME, 
       tmp.NON_UNIQUE, 
       tmp.INDEX_QUALIFIER, 
       tmp.INDEX_NAME, 
       tmp.TYPE, 
       tmp.ORDINAL_POSITION, 
       trim(
         both '"' 
         from 
           pg_catalog.pg_get_indexdef(
             tmp.CI_OID, tmp.ORDINAL_POSITION, 
             false
           )
       ) AS COLUMN_NAME, 
       CASE tmp.AM_NAME WHEN 'btree' THEN CASE tmp.I_INDOPTION[tmp.ORDINAL_POSITION - 1] & 1 :: smallint WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC, 
       tmp.CARDINALITY, 
       tmp.PAGES, 
       tmp.FILTER_CONDITION 
     FROM 
       (
         SELECT 
           NULL AS TABLE_CAT, 
           n.nspname AS TABLE_SCHEM, 
           ct.relname AS TABLE_NAME, 
           NOT i.indisunique AS NON_UNIQUE, 
           NULL AS INDEX_QUALIFIER, 
           ci.relname AS INDEX_NAME, 
           CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE, 
           (
             information_schema._pg_expandarray(i.indkey)
           ).n AS ORDINAL_POSITION, 
           ci.reltuples AS CARDINALITY, 
           ci.relpages AS PAGES, 
           pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION, 
           ci.oid AS CI_OID, 
           i.indoption AS I_INDOPTION, 
           am.amname AS AM_NAME 
         FROM 
           pg_catalog.pg_class ct 
           JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid) 
           JOIN pg_catalog.pg_index i ON (ct.oid = i.indrelid) 
           JOIN pg_catalog.pg_class ci ON (ci.oid = i.indexrelid) 
           JOIN pg_catalog.pg_am am ON (ci.relam = am.oid) 
         WHERE 
           true 
           AND ct.relname = 'mytable'
       ) AS tmp 
     ORDER BY 
       NON_UNIQUE, 
       TYPE, 
       INDEX_NAME, 
       ORDINAL_POSITION
     ```
   - SQL Server gives you the row count for the table, and tells you about columns with a UNIQUE constraint, but doesn't give you much else. Internally it runs `EXEC sp_statistics_100 ?,?,?,?,?, ?`
   
   It seems we shouldn't expect much here from JDBC (and to be fair, getIndexInfo was meant to describe indices, not provide detailed statistics), so if we want detailed statistics we'll have to implement them per database.




[GitHub] [arrow-adbc] lidavidm closed issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm closed issue #685: format: add statistics for tables, columns, queries, etc.
URL: https://github.com/apache/arrow-adbc/issues/685




[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1593409915

   Other potential statistics:
   
   - Calcite defines these: https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.html
   
     If we were to provide a direct equivalent of getIndexInfo, we could possibly also map ColumnUniqueness, Collation. It may work to allow `column_name` to be `list<str>`.
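
   If `column_name` were widened to `list<str>`, one statistic row could describe a multi-column index or uniqueness constraint while a plain column kept a single-element list; a quick pyarrow sketch (illustrative values only):

   ```python
   import pyarrow as pa

   # column_name as list<str>: a single column, a multi-column
   # constraint, or null for a table-wide statistic.
   column_names = pa.array(
       [["id"], ["first_name", "last_name"], None],
       type=pa.list_(pa.string()),
   )
   ```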




[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1593277547

   ## JDBC
   
   [`getIndexInfo`](https://docs.oracle.com/en/java/javase/11/docs/api/java.sql/java/sql/DatabaseMetaData.html#getIndexInfo(java.lang.String,java.lang.String,java.lang.String,boolean,boolean))
   - Oriented around database indices
   - Provides row count and ndv
   - Also provides page count and ordering
   - Differentiates between exact/approximate; this can [affect performance](https://groups.google.com/g/h2-database/c/Vago-0qkWL4) of the call
   
   => We may want a statistic for "abstract size"? (But the values wouldn't be comparable between drivers.)
   => Is ordering useful?
   => Exact/approximate may also be useful to indicate
   
   ## ODBC
   
   [`SQLStatistics`](https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqlstatistics-function?view=sql-server-ver16)
   
   - Effectively the same as JDBC
   
   ## PostgreSQL
   
   https://www.postgresql.org/docs/current/planner-stats.html and https://www.postgresql.org/docs/current/view-pg-stats.html
   
   - Row count (may be out of date), page count
   - null percentage
   - a very odd ndv estimate
   - histograms, most common elements, etc.
   - average column width (~= average size of a row in a column; i.e. an estimate of string length, etc?)
   
   => How should we define ndv?
   => Do we want to be able to map through "most common elements" etc. or should we leave that alone? (Probably leave it alone)
   => We may want to define a column width statistic




[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1599637612

   Ah, thanks for the pointer. It seems Hive also stores min/max N values, histograms, percentiles, and average/sum of numeric columns. 
   
   This overlaps somewhat with PostgreSQL, so maybe we should try to support some of these statistics. That said, encoding polymorphic types (if we want the min/max of, say, a string column) and list types is a bit of a pain in Arrow (for min/max N, histograms, etc.), but it's doable via a union.






[GitHub] [arrow-adbc] lidavidm commented on issue #685: format: add statistics for tables, columns, queries, etc.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1593400443

   So the proposal is:
   
   ```c
   AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog, const char* db_schema, const char* table_name, bool approximate, struct ArrowArrayStream* out);
   ```
   
   - Parameters allow filtering down to an individual table, or you can request data for multiple tables at once
   - "approximate" allows you to request exact statistics, or accept approximate/best-effort/out-of-date statistics
   
   The result set has schema:
   
   ```
   - catalog: str
   - db_schema: str
   - table_name: str
   - statistic_type: str (one of null percentage, row count, ndv, byte_width or a database-specific value)
   - column_name: str (null if table-wide statistics)
   - value: double
   - is_approximate: bool
   ```
   
   - column_name is null if the statistic applies to the whole table
   - null_percentage is a value in [0, 1] representing the % of rows in the column that are null
   - ndv is the number of distinct values in the column (I'm tempted to take the PostgreSQL definition: positive means a fixed number of distinct values, negative means a percentage of distinct values)
   - row count is a value in [0, inf)
   - byte_width is a value in [0, inf) representing the average size in bytes of a row in the column (e.g. for a string column, this would be the average string size)
   
   Unknown values should be null, or the whole row should simply be omitted.
   
   
   Questions:
   - Do we care about min, max, etc.? IMO no, this complicates the encoding of the 'value' in the result, and utility is questionable. (But maybe we do want `value` to be at least `union[double, string]` to perhaps allow for this?)
   - Do we encode the statistic names as plain strings, require dictionary encoding, or specify an enumeration? (I would prefer dictionary encoding, but it complicates implementation a bit. The benefit is that if we specify some fixed dictionary values, we can save space on the common values and avoid lots of string comparisons while still allowing self-describing extensibility by vendors.)
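
   Under the PostgreSQL-style sign convention floated for ndv above, a consumer could normalize the statistic with a small helper (`estimated_distinct_values` is a hypothetical name, not part of any ADBC API):

   ```python
   def estimated_distinct_values(ndv_stat, row_count):
       """Interpret an ndv statistic: a non-negative value is an
       absolute count of distinct values; a negative value is
       -(fraction of rows that are distinct), scaled by row count."""
       if ndv_stat >= 0:
           return ndv_stat
       return -ndv_stat * row_count
   ```

   For example, an ndv of -0.25 on a 1,000-row table estimates 250 distinct values.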

