Posted to dev@arrow.apache.org by Kohei KaiGai <ka...@heterodb.com> on 2021/07/20 02:30:12 UTC

Re: Any standard way for min/max values per record-batch?

Hello,

Let me share our trial to support the min/max statistics per record batch.
https://github.com/heterodb/pg-strom/wiki/806:-Apache-Arrow-Min-Max-Statistics-Hint

The latest pg2arrow supports a --stat option that specifies the
columns for which min/max statistics are collected
for each record batch.
The statistics, if any, are embedded in the custom_metadata[] of the
Field (in the Footer area).

The example below shows a database table dump with statistics on the
lo_orderdate column.

$ pg2arrow -d postgres -o /dev/shm/flineorder_sort.arrow -t
lineorder_sort --stat=lo_orderdate --progress
RecordBatch[0]: offset=1640 length=268437080 (meta=920,
body=268436160) nitems=1303085
RecordBatch[1]: offset=268438720 length=268437080 (meta=920,
body=268436160) nitems=1303085
              :
RecordBatch[9]: offset=2415935360 length=55668888 (meta=920,
body=55667968) nitems=270231

Then you can find the custom metadata on the lo_orderdate field:
min_values and max_values.
These are comma-separated integer lists with 10 elements, one per
record batch.
So the first items of min_values and max_values are the min/max datum
of record-batch[0].

$ python3
Python 3.6.8 (default, Aug 24 2020, 17:57:11)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader('/dev/shm/flineorder_sort.arrow')
>>> X.schema
lo_orderkey: decimal(30, 8)
lo_linenumber: int32
lo_custkey: decimal(30, 8)
lo_partkey: int32
lo_suppkey: decimal(30, 8)
lo_orderdate: int32
  -- field metadata --
  min_values: '19920101,19920919,19930608,19940223,19941111,19950730,1996' + 31
  max_values: '19920919,19930608,19940223,19941111,19950730,19960417,1997' + 31
lo_orderpriority: fixed_size_binary[15]
lo_shippriority: fixed_size_binary[1]
lo_quantity: decimal(30, 8)
lo_extendedprice: decimal(30, 8)
lo_ordertotalprice: decimal(30, 8)
              :

When we scan an arrow_fdw foreign table that maps an Apache Arrow
file with the above min/max
statistics, it automatically checks the WHERE clause and skips
record batches that cannot contain
any surviving rows.

postgres=# EXPLAIN ANALYZE
           SELECT count(*) FROM flineorder_sort
            WHERE lo_orderpriority = '2-HIGH'
              AND lo_orderdate BETWEEN 19940101 AND 19940630;

                                                                 QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=33143.09..33143.10 rows=1 width=8) (actual
time=115.591..115.593 rows=1loops=1)
   ->  Custom Scan (GpuPreAgg) on flineorder_sort
(cost=33139.52..33142.07 rows=204 width=8) (actual
time=115.580..115.585 rows=1 loops=1)
         Reduction: NoGroup
         Outer Scan: flineorder_sort  (cost=4000.00..33139.42 rows=300
width=0) (actual time=10.682..21.555 rows=2606170 loops=1)
         Outer Scan Filter: ((lo_orderdate >= 19940101) AND
(lo_orderdate <= 19940630) AND (lo_orderpriority = '2-HIGH'::bpchar))
         Rows Removed by Outer Scan Filter: 2425885
         referenced: lo_orderdate, lo_orderpriority
         Stats-Hint: (lo_orderdate >= 19940101), (lo_orderdate <=
19940630)  [loaded: 2, skipped: 8]
         files0: /dev/shm/flineorder_sort.arrow (read: 217.52MB, size:
2357.11MB)
 Planning Time: 0.210 ms
 Execution Time: 153.508 ms
(11 rows)

This EXPLAIN ANALYZE displays the Stats-Hint line.
It says Arrow_Fdw could use (lo_orderdate >= 19940101) and
(lo_orderdate <= 19940630) to check
the min/max statistics; it actually skipped 8 record batches and
loaded only 2.
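For illustration, the batch-skipping decision behind that Stats-Hint line can be sketched in a few lines of Python. The lists below are the first six complete per-batch values visible in the metadata dump above (the remaining entries are truncated there), so this sketch covers record batches 0 through 5 only:

```python
# Per-batch min/max of lo_orderdate, as shown in the field metadata above.
# Only the first six complete values are visible; the rest are truncated.
min_values = [19920101, 19920919, 19930608, 19940223, 19941111, 19950730]
max_values = [19920919, 19930608, 19940223, 19941111, 19950730, 19960417]

# WHERE lo_orderdate BETWEEN 19940101 AND 19940630
lo, hi = 19940101, 19940630

# A batch may contain matching rows only if its [min, max] range
# overlaps [lo, hi]; every other batch can be skipped without any I/O.
loaded = [i for i, (mn, mx) in enumerate(zip(min_values, max_values))
          if mn <= hi and mx >= lo]
print(loaded)  # indexes of the record batches that must actually be read
```

Only two of these batches overlap the requested date range, which is consistent with the "loaded: 2" figure in the EXPLAIN output.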


Our expectation is IoT/M2M-grade time-series log processing, because
such logs always contain a timestamp
value for each entry, and physically close rows tend to have similar values.
This is not limited to Apache Arrow files generated by pg2arrow: the
min/max statistics can be appended
to an existing file by rewriting the Footer portion, without
relocating the record batches.
So we plan to provide a standalone
command to attach min/max statistics to existing Apache
Arrow files generated by other tools.

Best regards,

2021年2月18日(木) 13:33 Kohei KaiGai <ka...@heterodb.com>:
>
> Thanks for the clarification.
>
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> For example, a JSON array of min/max values as key-value metadata
> in Footer->Schema->Fields[]->custom_metadata?
> Even though the metadata field must be less than INT_MAX, I think this
> is a sufficiently portable and non-invasive way.
>
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> What does the parallel list mean?
> If we had a standardized binary structure, like DictionaryBatch,
> to store the statistics including min/max values, it would of course
> make more sense than text-encoded key-value metadata.
>
> Best regards,
>
> 2021年2月18日(木) 12:37 Micah Kornfield <em...@gmail.com>:
> >
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> > On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai <ka...@heterodb.com> wrote:
> >
> > > Hello,
> > >
> > > Does Apache Arrow have any standard way to embed min/max values of
> > > the fields on a per-record-batch basis?
> > > It looks like FieldNode supports neither a dedicated min/max
> > > attribute nor custom metadata.
> > > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> > >
> > > If we embed an array of min/max values into the custom metadata of
> > > the Field node,
> > > we may be able to implement it.
> > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> > >
> > > What I'd like to implement is something like the BRIN index in PostgreSQL.
> > > http://heterodb.github.io/pg-strom/brin/
> > >
> > > This index contains only min/max values for particular block ranges,
> > > and the query
> > > executor can skip blocks that obviously don't contain the target data.
> > > If we can skip 9990 of 10000 record batches by checking metadata on a
> > > query that
> > > tries to fetch items in a very narrow timestamp range, it is a great
> > > acceleration compared to a
> > > full file scan.
> > >
> > > Best regards,
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <ka...@heterodb.com>
> > >
>
>
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <ka...@heterodb.com>



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <ka...@heterodb.com>