Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/15 22:16:13 UTC

[GitHub] [arrow] wjones127 opened a new pull request, #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

wjones127 opened a new pull request, #12902:
URL: https://github.com/apache/arrow/pull/12902

   Thanks to Joris for pointing out we had this function in C++.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1102844303

   Ah, I missed that you had already added `date` to the tests as well.
   Checking the parquet docs, I think the only other logical type that we don't yet explicitly handle is "interval" (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval), but that one is also not handled by the C++ `StatisticsAsScalars`, so its behaviour won't have changed (and in general I don't know the status of mapping our own interval types to parquet's interval type).
   
   @wjones127 would you want to open an issue with kartothek to warn them about the upcoming change?
   
   
   



[GitHub] [arrow] github-actions[bot] commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1100439767

   https://issues.apache.org/jira/browse/ARROW-7350



[GitHub] [arrow] jorisvandenbossche commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1102788129

   We already have `min`/`max` and `min_raw`/`max_raw`, so I would like to avoid adding a third pair to work around this backwards-compatibility issue...
   
   It's certainly also an option to see this as a "bug fix" (we shouldn't have returned integers for the date type before) and ask kartothek to update their code. The change also only affects the lesser-used date type (not timestamp), and quite likely the behaviour has also changed for other less-used types that were not explicitly handled before (e.g. I don't know how the "interval" type would be handled).



[GitHub] [arrow] lidavidm commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1102760600

   Hmm, it sure seems like it. Should we revert this? Or maybe we expose the 'old' value under `min` and the 'new' value under `min_value` or `min_scalar`? (We can use the same impl in both cases, with a special case for dates?)



[GitHub] [arrow] lidavidm closed pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars
URL: https://github.com/apache/arrow/pull/12902



[GitHub] [arrow] wjones127 commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
wjones127 commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1102819889

   FWIW I considered it a bug that date statistics returned an integer. That's why I added tests for dates and decimals.
   
   Would you like me to add tests for all other types in a new PR?



[GitHub] [arrow] jorisvandenbossche commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1102750381

   And thanks for the PR @wjones127 !
   
   Unfortunately it seems this one caused some failures in one of the nightly integration builds: https://github.com/ursacomputing/crossbow/runs/6071471966?check_suite_focus=true
   The kartothek build is failing with errors that _seem_ related to this (but I didn't actually verify that!)
   
   From a quick look at the error message, this might be because we now return a datetime.date object instead of an integer for the date type?
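
   The suspected behaviour change can be illustrated with a small standalone snippet (a sketch, not taken from the PR; the file name and layout are arbitrary, and the decoded-`datetime.date` behaviour assumes a pyarrow build that includes this change):

   ```python
   import datetime
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   # Write a single-row file with a date32 column.
   table = pa.table({"d": pa.array([datetime.date(2022, 4, 15)], type=pa.date32())})
   path = os.path.join(tempfile.mkdtemp(), "dates.parquet")
   pq.write_table(table, path)

   stats = pq.ParquetFile(path).metadata.row_group(0).column(0).statistics
   # With this change, min/max are decoded to datetime.date; min_raw/max_raw
   # still expose the underlying days-since-epoch integer.
   print(stats.min)      # datetime.date(2022, 4, 15)
   print(stats.min_raw)  # 19097 (days since 1970-01-01)
   ```

   Code that compared `stats.min` against a plain integer (as kartothek apparently did) would break on exactly this difference.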
   



[GitHub] [arrow] lidavidm commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1102791197

   Hmm, fair. In that case we should test all the data types, just to make the behavior explicit.



[GitHub] [arrow] ursabot commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1101927547

   Benchmark runs are scheduled for baseline = 5b2c0a0f2a9a8eeca86abaaaeb16b3e2b73e313d and contender = aa641d519cac4bd7eef6c7f08eb58254870f2482. aa641d519cac4bd7eef6c7f08eb58254870f2482 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f90df501d5e84a7fb6c52cb38baf6216...629b1836fcee4f2dbd7a261bcb310835/)
   [Finished :arrow_down:0.46% :arrow_up:0.13%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/095a8d9ec13f4d37ad63c9290f178054...46d255a6b404431b94a3184c6c6c9c3f/)
   [Failed :arrow_down:0.75% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/dd43a3bb124a4d33a5d4f9d0b1f56263...2d02941c10be482789b836e5875253db/)
   [Finished :arrow_down:0.3% :arrow_up:0.13%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f377f882db0c4a54ab3c892559bae0f7...49944273dee04dc2a988156ee3efbd9b/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/528| `aa641d51` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/515| `aa641d51` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/514| `aa641d51` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/525| `aa641d51` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/527| `5b2c0a0f` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/514| `5b2c0a0f` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/513| `5b2c0a0f` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/524| `5b2c0a0f` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] lidavidm commented on a diff in pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12902:
URL: https://github.com/apache/arrow/pull/12902#discussion_r852154584


##########
python/pyarrow/_parquet.pyx:
##########
@@ -226,61 +229,12 @@ cdef _cast_statistic_raw_max(CStatistics* statistics):
         return _box_flba((<CFLBAStatistics*> statistics).max(), type_length)
 
 
-cdef _cast_statistic_min(CStatistics* statistics):
-    min_raw = _cast_statistic_raw_min(statistics)
-    return _box_logical_type_value(min_raw, statistics.descr())
-
-
-cdef _cast_statistic_max(CStatistics* statistics):
-    max_raw = _cast_statistic_raw_max(statistics)
-    return _box_logical_type_value(max_raw, statistics.descr())
-
-
-cdef _box_logical_type_value(object value, const ColumnDescriptor* descr):
+cdef _cast_statistics(CStatistics* statistics):
     cdef:
-        const CParquetLogicalType* ltype = descr.logical_type().get()
-        ParquetTimeUnit time_unit
-        const CParquetIntType* itype
-        const CParquetTimestampType* ts_type
-
-    if ltype.type() == ParquetLogicalType_STRING:
-        return value.decode('utf8')
-    elif ltype.type() == ParquetLogicalType_TIME:
-        time_unit = (<const CParquetTimeType*> ltype).time_unit()
-        if time_unit == ParquetTimeUnit_MILLIS:
-            return _datetime_from_int(value, unit=TimeUnit_MILLI).time()
-        else:
-            return _datetime_from_int(value, unit=TimeUnit_MICRO).time()
-    elif ltype.type() == ParquetLogicalType_TIMESTAMP:
-        ts_type = <const CParquetTimestampType*> ltype
-        time_unit = ts_type.time_unit()
-        if ts_type.is_adjusted_to_utc():
-            import pytz
-            tzinfo = pytz.utc
-        else:
-            tzinfo = None
-        if time_unit == ParquetTimeUnit_MILLIS:
-            return _datetime_from_int(value, unit=TimeUnit_MILLI,
-                                      tzinfo=tzinfo)
-        elif time_unit == ParquetTimeUnit_MICROS:
-            return _datetime_from_int(value, unit=TimeUnit_MICRO,
-                                      tzinfo=tzinfo)
-        elif time_unit == ParquetTimeUnit_NANOS:
-            return _datetime_from_int(value, unit=TimeUnit_NANO,
-                                      tzinfo=tzinfo)
-        else:
-            raise ValueError("Unsupported time unit")
-    elif ltype.type() == ParquetLogicalType_INT:
-        itype = <const CParquetIntType*> ltype
-        if not itype.is_signed() and itype.bit_width() == 32:
-            return int(np.int32(value).view(np.uint32))
-        elif not itype.is_signed() and itype.bit_width() == 64:
-            return int(np.int64(value).view(np.uint64))
-        else:
-            return value
-    else:
-        # No logical boxing defined
-        return value
+        shared_ptr[CScalar] min
+        shared_ptr[CScalar] max
+    StatisticsAsScalars(statistics[0], &min, &max)

Review Comment:
   And nit, but maybe use `c_min` or something instead of shadowing a builtin
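
   As an aside on the hunk above: the removed `_box_logical_type_value` recovered unsigned statistics from their signed physical storage with a numpy bit-pattern reinterpretation, which can be illustrated in isolation:

   ```python
   import numpy as np

   # Parquet stores UINT32 values in a signed int32 physical slot; the old
   # boxing recovered the unsigned value by reinterpreting the bit pattern.
   stored = np.int32(-1)                    # bit pattern 0xFFFFFFFF
   recovered = int(stored.view(np.uint32))  # reinterpret, don't convert
   print(recovered)  # 4294967295
   ```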



##########
python/pyarrow/_parquet.pyx:
##########
@@ -226,61 +229,12 @@ cdef _cast_statistic_raw_max(CStatistics* statistics):
         return _box_flba((<CFLBAStatistics*> statistics).max(), type_length)
 
 
-cdef _cast_statistic_min(CStatistics* statistics):
-    min_raw = _cast_statistic_raw_min(statistics)
-    return _box_logical_type_value(min_raw, statistics.descr())
-
-
-cdef _cast_statistic_max(CStatistics* statistics):
-    max_raw = _cast_statistic_raw_max(statistics)
-    return _box_logical_type_value(max_raw, statistics.descr())
-
-
-cdef _box_logical_type_value(object value, const ColumnDescriptor* descr):
+cdef _cast_statistics(CStatistics* statistics):
     cdef:
-        const CParquetLogicalType* ltype = descr.logical_type().get()
-        ParquetTimeUnit time_unit
-        const CParquetIntType* itype
-        const CParquetTimestampType* ts_type
-
-    if ltype.type() == ParquetLogicalType_STRING:
-        return value.decode('utf8')
-    elif ltype.type() == ParquetLogicalType_TIME:
-        time_unit = (<const CParquetTimeType*> ltype).time_unit()
-        if time_unit == ParquetTimeUnit_MILLIS:
-            return _datetime_from_int(value, unit=TimeUnit_MILLI).time()
-        else:
-            return _datetime_from_int(value, unit=TimeUnit_MICRO).time()
-    elif ltype.type() == ParquetLogicalType_TIMESTAMP:
-        ts_type = <const CParquetTimestampType*> ltype
-        time_unit = ts_type.time_unit()
-        if ts_type.is_adjusted_to_utc():
-            import pytz
-            tzinfo = pytz.utc
-        else:
-            tzinfo = None
-        if time_unit == ParquetTimeUnit_MILLIS:
-            return _datetime_from_int(value, unit=TimeUnit_MILLI,
-                                      tzinfo=tzinfo)
-        elif time_unit == ParquetTimeUnit_MICROS:
-            return _datetime_from_int(value, unit=TimeUnit_MICRO,
-                                      tzinfo=tzinfo)
-        elif time_unit == ParquetTimeUnit_NANOS:
-            return _datetime_from_int(value, unit=TimeUnit_NANO,
-                                      tzinfo=tzinfo)
-        else:
-            raise ValueError("Unsupported time unit")
-    elif ltype.type() == ParquetLogicalType_INT:
-        itype = <const CParquetIntType*> ltype
-        if not itype.is_signed() and itype.bit_width() == 32:
-            return int(np.int32(value).view(np.uint32))
-        elif not itype.is_signed() and itype.bit_width() == 64:
-            return int(np.int64(value).view(np.uint64))
-        else:
-            return value
-    else:
-        # No logical boxing defined
-        return value
+        shared_ptr[CScalar] min
+        shared_ptr[CScalar] max
+    StatisticsAsScalars(statistics[0], &min, &max)

Review Comment:
   Check the status here?
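
   A sketch of how that suggestion could look (the helper names `check_status` and `pyarrow_wrap_scalar` are assumptions based on pyarrow's usual Cython conventions, not verified against this PR):

   ```cython
   cdef _cast_statistics(CStatistics* statistics):
       cdef:
           shared_ptr[CScalar] c_min
           shared_ptr[CScalar] c_max
       # Raise a Python exception if the C++ call returns a non-OK Status,
       # instead of silently discarding it.
       check_status(StatisticsAsScalars(statistics[0], &c_min, &c_max))
       return (pyarrow_wrap_scalar(c_min), pyarrow_wrap_scalar(c_max))
   ```

   Using `c_min`/`c_max` also avoids shadowing the Python builtins, per the earlier nit.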




[GitHub] [arrow] jorisvandenbossche commented on pull request #12902: ARROW-7350: [Python] Decode parquet statistics as scalars

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #12902:
URL: https://github.com/apache/arrow/pull/12902#issuecomment-1104945301

   Thanks for opening that issue!
   
   In the short term, we should also fix our nightly builds (either by temporarily disabling them altogether, or ideally by skipping the failing tests). I opened https://issues.apache.org/jira/browse/ARROW-16262 to track that.

