Posted to dev@sedona.apache.org by "Doug Dennis (Jira)" <ji...@apache.org> on 2023/01/01 09:19:00 UTC

[jira] [Created] (SEDONA-227) Python SerDe Performance Degradation

Doug Dennis created SEDONA-227:
----------------------------------

             Summary: Python SerDe Performance Degradation
                 Key: SEDONA-227
                 URL: https://issues.apache.org/jira/browse/SEDONA-227
             Project: Apache Sedona
          Issue Type: Bug
            Reporter: Doug Dennis


The new geometry serde in Sedona appears to introduce a fairly significant performance regression on the Python side. The PR's author acknowledged a regression in the PR, so some slowdown is expected; however, my trials show a regression that is sometimes far higher than the 2x noted in the PR.

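For serialization, I'm seeing a slowdown factor of about 2 for points and short linestrings (as expected). Unfortunately, the factor is about 7-8 for small polygons and 11-12 for long linestrings and large polygons. Note that the factors reported below are relative differences, (sedona - shapely) / shapely, so a factor of 2 means Sedona takes about 3 times as long as Shapely.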

The news isn't all bad, though. For me, short linestrings consistently deserialize faster (about 25-30% faster), and points deserialize at roughly the same rate as before. Long linestrings and large polygons show deserialization regressions roughly in line with their serialization results (factors of about 13 and 9, respectively), while small polygons take about 60% longer.

To test this, I'm strictly comparing Sedona's new serialize and deserialize functions against Shapely's WKB dumps and loads functions. Below are my most recent results (which have been fairly consistent), along with the Python code I used to generate them. I'm very open to critiques of my approach to measuring performance, and I hope that some of this performance loss is due to my own error.

Serialization results:
{code:java}
short line serialize trial:
        Total Time (seconds):
                Shapely: 1.7364926
                Sedona: 5.4626863
                Factor: 2.145816054730092
        Average Time (nanoseconds):
                Shapely: 8682.463
                Sedona: 27313.4315
                Factor: 2.145816054730092

long line serialize trial:
        Total Time (seconds):
                Shapely: 4.0879395
                Sedona: 50.1508444
                Factor: 11.268000639441949
        Average Time (nanoseconds):
                Shapely: 40879.395
                Sedona: 501508.444
                Factor: 11.268000639441949

point serialize trial:
        Total Time (seconds):
                Shapely: 4.7864782
                Sedona: 13.0319586
                Factor: 1.7226612251153677
        Average Time (nanoseconds):
                Shapely: 9572.9564
                Sedona: 26063.9172
                Factor: 1.7226612251153677

small polygon serialize trial:
        Total Time (seconds):
                Shapely: 1.8339082
                Sedona: 14.9376628
                Factor: 7.145262014750793
        Average Time (nanoseconds):
                Shapely: 9169.541
                Sedona: 74688.314
                Factor: 7.145262014750793

large polygon serialize trial:
        Total Time (seconds):
                Shapely: 2.3705298
                Sedona: 30.4154897
                Factor: 11.830671734225826
        Average Time (nanoseconds):
                Shapely: 23705.298
                Sedona: 304154.897
                Factor: 11.830671734225826 {code}
Deserialization results:
{code:java}
short line deserialize trial:
        Total Time (seconds):
                Shapely: 2.5166469
                Sedona: 1.7909991
                Factor: -0.28833913887562057
        Average Time (nanoseconds):
                Shapely: 12583.2345
                Sedona: 8954.9955
                Factor: -0.28833913887562057

long line deserialize trial:
        Total Time (seconds):
                Shapely: 3.1818201
                Sedona: 45.1792348
                Factor: 13.199179519923204
        Average Time (nanoseconds):
                Shapely: 31818.201
                Sedona: 451792.348
                Factor: 13.199179519923204

point deserialize trial:
        Total Time (seconds):
                Shapely: 5.7874722
                Sedona: 5.3168965
                Factor: -0.08130936680784402
        Average Time (nanoseconds):
                Shapely: 11574.9444
                Sedona: 10633.793
                Factor: -0.08130936680784402

small polygon deserialize trial:
        Total Time (seconds):
                Shapely: 2.5079775
                Sedona: 4.0216245
                Factor: 0.6035329264317563
        Average Time (nanoseconds):
                Shapely: 12539.8875
                Sedona: 20108.1225
                Factor: 0.6035329264317563

large polygon deserialize trial:
        Total Time (seconds):
                Shapely: 1.9952702
                Sedona: 19.909025
                Factor: 8.978109731704508
        Average Time (nanoseconds):
                Shapely: 19952.702
                Sedona: 199090.25
                Factor: 8.978109731704508 {code}
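As a reading aid for the Factor values above: the script that follows computes Factor as a relative difference, (sedona - shapely) / shapely, rather than as a plain ratio. A minimal sketch of converting between the two is shown here. The helper names `relative_difference` and `time_ratio` are hypothetical and exist only for illustration; the sample inputs are the short line totals reported above.

```python
def relative_difference(shapely_time, sedona_time):
    """Relative difference, matching the Factor formula used in the trial script."""
    return (sedona_time - shapely_time) / shapely_time


def time_ratio(shapely_time, sedona_time):
    """Plain ratio: how many times as long Sedona took compared with Shapely."""
    return sedona_time / shapely_time


# Totals in seconds from the short line serialize and deserialize trials.
serialize_factor = relative_difference(1.7364926, 5.4626863)
serialize_ratio = time_ratio(1.7364926, 5.4626863)
deserialize_factor = relative_difference(2.5166469, 1.7909991)

print(f"serialize factor: {serialize_factor:.2f}")      # relative difference
print(f"serialize ratio: {serialize_ratio:.2f}")        # always factor + 1
print(f"deserialize factor: {deserialize_factor:.2f}")  # negative: Sedona faster
```

Read this way, a Factor of about 2.15 means Sedona took a little over 3 times as long as Shapely, while a negative Factor means Sedona was faster.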
Python code used to generate results:
{code:python}
from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import LineString, Point, Polygon
from shapely.wkb import dumps, loads

import time

def run_serialize_trial(geom, number_iterations, name):
    """Time Shapely's WKB dumps against Sedona's serialize for one geometry."""
    print(f"{name} serialize trial:")

    # Baseline: Shapely WKB serialization.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        dumps(geom)
    shapely_time = time.perf_counter_ns() - start_time

    # Candidate: Sedona's new geometry serializer.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        serialize(geom)
    sedona_time = time.perf_counter_ns() - start_time

    # Factor is a relative difference: 1.0 means Sedona took twice as long,
    # and a negative value means Sedona was faster.
    factor = (sedona_time - shapely_time) / shapely_time

    print("\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {factor}\n")
    print("\tAverage Time (nanoseconds):")
    print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {factor}\n")

def run_deserialize_trial(geom, number_iterations, name):
    """Time Shapely's WKB loads against Sedona's deserialize for one geometry."""
    print(f"{name} deserialize trial:")

    # Serialize once up front so only deserialization is timed.
    shapely_serialized_geom = dumps(geom)
    sedona_serialized_geom = serialize(geom)

    # Baseline: Shapely WKB deserialization.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        loads(shapely_serialized_geom)
    shapely_time = time.perf_counter_ns() - start_time

    # Candidate: Sedona's new geometry deserializer.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        deserialize(sedona_serialized_geom)
    sedona_time = time.perf_counter_ns() - start_time

    # Factor is a relative difference: 1.0 means Sedona took twice as long,
    # and a negative value means Sedona was faster.
    factor = (sedona_time - shapely_time) / shapely_time

    print("\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {factor}\n")
    print("\tAverage Time (nanoseconds):")
    print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {factor}\n")

short_line_iterations = 200_000
short_line = LineString([(10.0, 10.0), (20.0, 20.0)])

long_line_iterations = 100_000
long_line = LineString([(float(n), float(n)) for n in range(1000)])

point_iterations = 500_000
point = Point(12.3, 45.6)

small_polygon_iterations = 200_000
small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])

large_polygon_iterations = 100_000
large_polygon = Polygon(
    [(0.0, float(n * 10)) for n in range(100)]
    + [(float(n * 10), 990.0) for n in range(100)]
    + [(990.0, float(n * 10)) for n in reversed(range(100))]
    + [(float(n * 10), 0.0) for n in reversed(range(100))]
)

run_serialize_trial(short_line, short_line_iterations, "short line")
run_serialize_trial(long_line, long_line_iterations, "long line")
run_serialize_trial(point, point_iterations, "point")
run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")

run_deserialize_trial(short_line, short_line_iterations, "short line")
run_deserialize_trial(long_line, long_line_iterations, "long line")
run_deserialize_trial(point, point_iterations, "point")
run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon"){code}
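One possible refinement to the measurement approach, offered as a sketch under assumptions rather than a replacement for the script above: repeat each measurement several times and keep the fastest run, which tends to reduce noise from other processes. The helper names `best_time_per_call_ns` and `compare` are hypothetical, and the demo uses the standard library's `json` and `pickle` as stand-ins so the block runs without Shapely or Sedona installed.

```python
import json
import pickle
import timeit
from functools import partial


def best_time_per_call_ns(func, number, repeat=5):
    """Return the fastest per-call time in nanoseconds across `repeat` runs."""
    # timeit.repeat returns one total (in seconds) per run; garbage
    # collection is disabled while each run is being timed.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number * 1e9


def compare(baseline, candidate, number, repeat=5):
    """Time two zero-argument callables and report the candidate/baseline ratio."""
    baseline_ns = best_time_per_call_ns(baseline, number, repeat)
    candidate_ns = best_time_per_call_ns(candidate, number, repeat)
    return {
        "baseline_ns": baseline_ns,
        "candidate_ns": candidate_ns,
        "ratio": candidate_ns / baseline_ns,
    }


# Stand-in workload (an assumption for the demo, not geometry data).
payload = [(float(n), float(n)) for n in range(1000)]
result = compare(
    partial(json.dumps, payload),
    partial(pickle.dumps, payload),
    number=200,
    repeat=3,
)
print(result)
```

With the geometries from the script above, the equivalent call would be along the lines of `compare(partial(dumps, long_line), partial(serialize, long_line), number=100_000)`. Note that `ratio` here is a plain ratio, not the relative-difference Factor printed by the original script.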



--
This message was sent by Atlassian Jira
(v8.20.10#820010)