You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sedona.apache.org by GitBox <gi...@apache.org> on 2023/01/05 09:35:47 UTC
[GitHub] [sedona] douglasdennis opened a new pull request, #745: [SEDONA-227] Python Serde Refactor
douglasdennis opened a new pull request, #745:
URL: https://github.com/apache/sedona/pull/745
## Did you read the Contributor Guide?
- Yes, I have read [Contributor Rules](https://sedona.apache.org/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/community/develop/)
## Is this PR related to a JIRA ticket?
- Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-227. The PR name follows the format `[SEDONA-XXX] my subject`.
## What changes were proposed in this PR?
A refactor of some of the python serialization/deserialization functions is proposed. This is due to the performance regression experienced from the change in serialization formats. Most of the functions were refactored to increase performance. Attempts were made to maintain readability of the code at the same time.
A significant pain point involves interacting with shapely geometry. Most of the shapely methods are too slow and involve repeated calls to native code. Once shapely 2.0 is adopted by Sedona and its users, it may be a good idea to revisit this code and attempt to utilize some of the libgeos connections that shapely exposes.
There are other issues in the current implementation of serde:
1. Shapely does not respect an M coordinate yet, and so that is not supported in python at the moment.
2. SRID is not supported in python for a similar reason. Part of Sedona's python library does add a "userData" field to shapely geometry classes at runtime. This could be utilized to store SRIDs and possibly M coordinates.
3. This PR mostly avoids using the GeometryBuffer class. I don't believe it will be the fastest solution in the future. However, I think it's a nice class for specifically handling the GeometryCollection type.
4. Should the serialization format allow for empty geometry within collections? They have no WKT representation, so I'm not sure that should be allowed.
Finally, here are performance results between current master and this branch:
```
############################################################################################
# MASTER # REFACTOR #
############################################################################################
short line serialize trial: # short line serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 3.1197412 # Shapely: 1.6295438
Sedona: 7.6139485 # Sedona: 1.624361
Factor: 1.440570551172642 # Factor: -0.00318052205776856
#
long line serialize trial: # long line serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 4.3532153 # Shapely: 4.106511
Sedona: 53.1161724 # Sedona: 20.2585903
Factor: 11.201595542494763 # Factor: 3.9332852876809534
#
point serialize trial: # point serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 4.5569092 # Shapely: 4.1321635
Sedona: 12.9271814 # Sedona: 3.6287594
Factor: 1.8368310257312128 # Factor: -0.12182579416327549
#
small polygon serialize trial: # small polygon serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 1.9922023 # Shapely: 1.6930909
Sedona: 15.1673936 # Sedona: 2.5668217
Factor: 6.613380227499988 # Factor: 0.5160566393688608
#
large polygon serialize trial: # large polygon serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 2.5921638 # Shapely: 2.3090157
Sedona: 27.4987078 # Sedona: 2.8944101
Factor: 9.608398975404254 # Factor: 0.25352551738821005
#
small multipoint serialize trial: # small multipoint serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.0964883 # Shapely: 0.0957348
Sedona: 0.8657994 # Sedona: 0.5249629
Factor: 7.97310243832672 # Factor: 4.483511742856307
#
large multipoint serialize trial: # large multipoint serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.2059924 # Shapely: 0.2002637
Sedona: 18.6648998 # Sedona: 13.9681053
Factor: 89.60965258912465 # Factor: 68.74856301965858
#
small multilinestring serialize trial: # small multilinestring serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.1003702 # Shapely: 0.0984753
Sedona: 1.2611056 # Sedona: 0.5051064
Factor: 11.564542065274354 # Factor: 4.129269979375539
#
large multilinestring serialize trial: # large multilinestring serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.1382676 # Shapely: 0.1290179
Sedona: 14.4627404 # Sedona: 6.4109233
Factor: 103.59963433226584 # Factor: 48.69018485031922
#
small multipolygon serialize trial: # small multipolygon serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.1074735 # Shapely: 0.107037
Sedona: 2.5492284 # Sedona: 1.1849261
Factor: 22.71959971527865 # Factor: 10.070247671365976
#
large multipolygon serialize trial: # large multipolygon serialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.218638 # Shapely: 0.2061548
Sedona: 33.7390888 # Sedona: 16.8197829
Factor: 153.31484371426743 # Factor: 80.58812164451179
#
short line deserialize trial: # short line deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 2.6179912 # Shapely: 2.3070263
Sedona: 1.8462964 # Sedona: 1.8466384
Factor: -0.2947660022692208 # Factor: -0.19955901673075854
#
long line deserialize trial: # long line deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 3.1905663 # Shapely: 3.0322167
Sedona: 46.6516656 # Sedona: 3.0514442
Factor: 13.62175087851959 # Factor: 0.006341070544199562
#
point deserialize trial: # point deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 5.8758644 # Shapely: 5.7208507
Sedona: 5.2898097 # Sedona: 5.2560053
Factor: -0.0997393166527124 # Factor: -0.08125459383164815
#
small polygon deserialize trial: # small polygon deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 2.4844221 # Shapely: 2.3755022
Sedona: 4.120936 # Sedona: 3.0382661
Factor: 0.6587100879516408 # Factor: 0.27899948903436084
#
large polygon deserialize trial: # large polygon deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 2.0467411 # Shapely: 1.9710574
Sedona: 20.173081 # Sedona: 3.0417363
Factor: 8.85619578362891 # Factor: 0.5432002639801358
#
small multipoint deserialize trial: # small multipoint deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.1267147 # Shapely: 0.1221836
Sedona: 0.4915774 # Sedona: 0.4675109
Factor: 2.8794031000349603 # Factor: 2.8262982920784787
#
large multipoint deserialize trial: # large multipoint deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.2152692 # Shapely: 0.2037456
Sedona: 10.9262919 # Sedona: 11.1691675
Factor: 49.75641057801116 # Factor: 53.81918382531942
#
small multilinestring deserialize trial: # small multilinestring deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.1260666 # Shapely: 0.1310238
Sedona: 0.4858702 # Sedona: 0.5045361
Factor: 2.8540755441964802 # Factor: 2.8507210140447765
#
large multilinestring deserialize trial: # large multilinestring deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.1432103 # Shapely: 0.1417217
Sedona: 5.63192 # Sedona: 5.8515723
Factor: 38.3262216474653 # Factor: 40.28917660457079
#
small multipolygon deserialize trial: # small multipolygon deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.1392047 # Shapely: 0.12827
Sedona: 1.3398053 # Sedona: 1.1914238
Factor: 8.624713102359332 # Factor: 8.288405706712403
#
large multipolygon deserialize trial: # large multipolygon deserialize trial:
Total Time (seconds): # Total Time (seconds):
Shapely: 0.2063552 # Shapely: 0.2101143
Sedona: 19.0151752 # Sedona: 15.8966164
Factor: 91.14778789194554 # Factor: 74.65699431214344
```
Here is the python script I used to generate these results:
```
from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import LineString, Point, Polygon, MultiPoint, MultiLineString, MultiPolygon
from shapely.wkb import dumps, loads
import time
def run_serialize_trial(geom, number_iterations, name):
print(f"{name} serialize trial:")
start_time = time.perf_counter_ns()
for _ in range(number_iterations):
dumps(geom)
shapely_time = time.perf_counter_ns() - start_time
start_time = time.perf_counter_ns()
for _ in range(number_iterations):
serialize(geom)
sedona_time = time.perf_counter_ns() - start_time
print(f"\tTotal Time (seconds):")
print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
def run_deserialize_trial(geom, number_iterations, name):
print(f"{name} deserialize trial:")
shapely_serialized_geom = dumps(geom)
sedona_serialized_geom = serialize(geom)
start_time = time.perf_counter_ns()
for _ in range(number_iterations):
loads(shapely_serialized_geom)
shapely_time = time.perf_counter_ns() - start_time
start_time = time.perf_counter_ns()
for _ in range(number_iterations):
deserialize(sedona_serialized_geom)
sedona_time = time.perf_counter_ns() - start_time
print(f"\tTotal Time (seconds):")
print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
short_line_iterations = 200_000
short_line = LineString([(10.0, 10.0), (20.0, 20.0)])
long_line_iterations = 100_000
long_line = LineString([(float(n), float(n)) for n in range(1000)])
point_iterations = 500_000
point = Point(12.3, 45.6)
small_polygon_iterations = 200_000
small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
large_polygon_iterations = 100_000
large_polygon = Polygon(
[(0.0, float(n * 10)) for n in range(100)]
+ [(float(n * 10), 990.0) for n in range(100)]
+ [(990.0, float(n * 10)) for n in reversed(range(100))]
+ [(float(n * 10), 0.0) for n in reversed(range(100))]
)
small_multipoint_iterations = 10_000
small_multipoint = MultiPoint([(n, n) for n in range(3)])
large_multipoint_iterations = 10_000
large_multipoint = MultiPoint([(n, n) for n in range(100)])
small_multilinestring_iterations = 10_000
small_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(3)])
large_multilinestring_iterations = 5_000
large_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(100)])
small_multipolygon_iterations = 10_000
small_multipolygon = MultiPolygon([small_polygon for _ in range(3)])
large_multipolygon_iterations = 5_000
large_multipolygon = MultiPolygon([small_polygon for _ in range(100)])
run_serialize_trial(short_line, short_line_iterations, "short line")
run_serialize_trial(long_line, long_line_iterations, "long line")
run_serialize_trial(point, point_iterations, "point")
run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
run_serialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
run_serialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
run_serialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
run_serialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
run_serialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
run_serialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")
run_deserialize_trial(short_line, short_line_iterations, "short line")
run_deserialize_trial(long_line, long_line_iterations, "long line")
run_deserialize_trial(point, point_iterations, "point")
run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon")
run_deserialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
run_deserialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
run_deserialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
run_deserialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
run_deserialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
run_deserialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")
```
## How was this patch tested?
A parameterized set of unit tests were added that test the serialization/deserialization round trip of a geometry object between python and spark.
## Did this PR include necessary documentation updates?
- No, this PR does not affect any public API so no need to change the docs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] douglasdennis commented on a diff in pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
douglasdennis commented on code in PR #745:
URL: https://github.com/apache/sedona/pull/745#discussion_r1064067248
##########
python/sedona/utils/geometry_serde.py:
##########
@@ -217,70 +233,97 @@ def create_buffer_for_geom(geom_type: int, coord_type: int, size: int, num_coord
struct.pack_into('i', buffer, 4, num_coords)
return buffer
+def generate_header_bytes(geom_type: int, coord_type: int, num_coords: int) -> bytes:
+ preamble_byte = (geom_type << 4) | (coord_type << 1)
+ return struct.pack(
+ 'BBBBi',
+ preamble_byte,
+ 0,
+ 0,
+ 0,
+ num_coords
+ )
+
-def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: CoordinateSequence):
+def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: ListCoordType) -> int:
for coord in coords:
- offset = put_coordinate(buffer, offset, coord_type, coord)
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: tuple):
- x = coord[0]
- y = coord[1]
- z = coord[2] if len(coord) > 2 else math.nan
- if coord_type == CoordinateType.XY:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- offset += 16
- elif coord_type == CoordinateType.XYZ:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- struct.pack_into('d', buffer, offset + 16, z)
- offset += 24
- else:
- # Shapely does not support M dimension for now
- raise ValueError("Invalid coordinate type: {}".format(coord_type))
+def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: CoordType) -> int:
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> List[tuple]:
- coords = []
- bytes_per_coord = CoordinateType.bytes_per_coord(coord_type)
- for i in range(num_coords):
- coord = get_coordinate(buffer, offset, coord_type)
- coords.append(coord)
- offset += bytes_per_coord
+def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> Union[np.ndarray, ListCoordType]:
+ if coord_type == CoordinateType.XYM or coord_type == CoordinateType.XYZM:
+ raise NotImplementedError("XYM or XYZM coordinates are not supported")
+
+ if num_coords < 50:
+ coords = [
+ struct.unpack_from(CoordinateType.unpack_format(coord_type), buffer, offset + (i * CoordinateType.bytes_per_coord(coord_type)))
+ for i in range(num_coords)
+ ]
+ else:
+ nums_per_coord = CoordinateType.components_per_coord(coord_type)
+ coords = np.frombuffer(buffer, np.float64, num_coords * nums_per_coord, offset).reshape((num_coords, nums_per_coord))
+
return coords
-def get_coordinate(buffer: bytearray, offset: int, coord_type: int) -> tuple:
- x = struct.unpack_from('d', buffer, offset)[0]
- y = struct.unpack_from('d', buffer, offset + 8)[0]
- # Shapely does not support M dimension for now, so we'll simply ignore them
- if coord_type == CoordinateType.XY or coord_type == CoordinateType.XYM:
- return x, y
- elif coord_type == CoordinateType.XYZ or coord_type == CoordinateType.XYZM:
- z = struct.unpack_from('d', buffer, offset + 16)[0]
- return x, y, z
- else:
- raise NotImplementedError("XYM or XYZM coordinates were not supported")
+def get_coordinate(buffer: bytearray, offset: int, coord_type: int) -> CoordType:
+ # Shapely does not support M dimension for now, so raise if it was passed
+ return struct.unpack_from(CoordinateType.unpack_format(coord_type), buffer, offset)
-def aligned_offset(offset):
+def aligned_offset(offset: int) -> int:
return (offset + 7) & ~7
-def serialize_point(geom: Point) -> bytearray:
- coords = geom.coords
- if not coords:
- return create_buffer_for_geom(GeometryTypeID.POINT, CoordinateType.XY, 8, 0)
- coord_type = CoordinateType.type_of(coords, geom.has_z)
- bytes_per_coord = CoordinateType.bytes_per_coord(coord_type)
- size = 8 + bytes_per_coord
- buffer = create_buffer_for_geom(GeometryTypeID.POINT, coord_type, size, 1)
- put_coordinates(buffer, 8, coord_type, coords)
- return buffer
-
+def serialize_point(geom: Point) -> bytes:
+ coords = [tuple(c) for c in geom.coords]
Review Comment:
You're right :) Changes made.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] douglasdennis commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
douglasdennis commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1381195068
@Imbruced Took me a little longer but here are some memory profiling results.
TL;DR
Serialization Average Memory Increment Results (MiB):
|Serializer|Point|LineString|Polygon|
|----------|------|----------|---------|
|WKB|7.22|13.82|13.76|
|Master|9.98|16.2|17.7|
|Refactor|7.1|13.6|14.98|
Deserialization Average Memory Increment Results (MiB):
|Deserializer|Point|LineString|Polygon|
|----------|------|----------|---------|
|WKB|28.5|39|46.8|
|Master|34.42|45.26|52.84|
|Refactor|34.8|45.26|52.8|
Method:
I used the `memory_profiler` package to run trials using wkb.dumps and wkb.loads, this PR's refactored serialize and deserialize functions, and the current master branch's serialize and deserialize function. For each system of serde I ran trials operating on 100,000 `Point`s, `LineString`s with five vertices, and square `Polygon`s. For each combination of serde and geometry I ran five trials and averaged the results. The script I used to run these trials is included below. Each script was ran using a command such as:
`mprof run python python/mem_profile_serialize.py wkb point`
This would give a printout like this:
```
mprof: Sampling memory every 0.1s
running new process
Filename: python/mem_profile_serialize.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
21 41.8 MiB 41.8 MiB 1 @profile
22 def run_serialize_profile(serializer, geom):
23 48.8 MiB 7.0 MiB 100003 x = [serializer(geom) for _ in range(100_000)]
24 48.8 MiB 0.0 MiB 1 return x
```
I then averaged the "Increment" amount for line 23 across all five trials. Additionally, I plotted a representative run of the serialize operations (if you click on these you should get a bigger image):
| |Point|LineString|Polygon|
|---|----|----|----|
|WKB|![wkb_point](https://user-images.githubusercontent.com/4318895/212215809-55e7cfcd-3449-4ef5-bd9c-4c950f3d6f78.png)|![wkb_linestring](https://user-images.githubusercontent.com/4318895/212215837-6291dd6c-a2fc-4496-bb85-f3f677a7d778.png)|![wkb_polygon](https://user-images.githubusercontent.com/4318895/212215868-86a46a16-1289-421d-9e0b-859029c82172.png)|
|Master|![master_point](https://user-images.githubusercontent.com/4318895/212215948-32067e4c-1ec2-47ac-ae1f-3ce2d6c4e7cf.png)|![master_linestring](https://user-images.githubusercontent.com/4318895/212215964-6d2a3bbd-558c-48a3-9807-d55a364a94e6.png)|![master_polygon](https://user-images.githubusercontent.com/4318895/212215974-3988e685-cb18-42f2-9143-1b2b913290f3.png)|
|Refactor|![refactor_point](https://user-images.githubusercontent.com/4318895/212216008-ad8cc711-32f5-4aef-9f87-576e18bdff69.png)|![refactor_linestring](https://user-images.githubusercontent.com/4318895/212216034-c7b786ff-0187-4659-b73d-cc11dfd4254f.png)|![refactor_polygon](https://user-images.githubusercontent.com/4318895/212216047-41ca2be3-595b-4f02-ae8b-77d707012d43.png)|
Lastly here is the serialization trial script:
```python
import sys
from typing import List
from sedona.utils.geometry_serde import serialize
from shapely.geometry import Point, LineString, Polygon
from shapely.wkb import dumps
from memory_profiler import profile
def make_point():
return Point(12.3, 45.6)
def make_linestring():
return LineString([(n, n) for n in range(5)])
def make_polygon():
return Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
@profile
def run_serialize_profile(serializer, geom):
x = [serializer(geom) for _ in range(100_000)]
return x
def main(args: List[str]) -> int:
if len(args) < 2:
print("Usage: mem_profile_run.py <sedona or wkb> <point>")
return 1
serializer_type = args[0]
geom_type = args[1]
if geom_type == "point":
geom = make_point()
elif geom_type == "linestring":
geom = make_linestring()
elif geom_type == "polygon":
geom = make_polygon()
else:
print(f"Geometry type is not supported: {geom_type}")
return 1
if serializer_type == "sedona":
serializer = serialize
elif serializer_type == "wkb":
serializer = dumps
else:
print(f"Serializer type is not supported: {serializer_type}")
return 1
run_serialize_profile(serializer, geom)
if __name__ == "__main__":
exit(main(sys.argv[1:]))
```
Deserialize trial script:
```python
import sys
from typing import List
from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import Point, LineString, Polygon
from shapely.wkb import dumps, loads
from memory_profiler import profile
def make_point():
return Point(12.3, 45.6)
def make_linestring():
return LineString([(n, n) for n in range(5)])
def make_polygon():
return Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
@profile
def run_deserialize_profile(deserializer, geom):
x = [deserializer(geom) for _ in range(100_000)]
return x
def main(args: List[str]) -> int:
if len(args) < 2:
print("Usage: mem_profile_run.py <sedona or wkb> <point>")
return 1
serializer_type = args[0]
geom_type = args[1]
if geom_type == "point":
geom = make_point()
elif geom_type == "linestring":
geom = make_linestring()
elif geom_type == "polygon":
geom = make_polygon()
else:
print(f"Geometry type is not supported: {geom_type}")
return 1
if serializer_type == "sedona":
geom = serialize(geom)
deserializer = deserialize
elif serializer_type == "wkb":
geom = dumps(geom)
deserializer = loads
else:
print(f"Serializer type is not supported: {serializer_type}")
return 1
run_deserialize_profile(deserializer, geom)
if __name__ == "__main__":
exit(main(sys.argv[1:]))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] Imbruced commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
Imbruced commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1374664244
Do you have a chance to compare the memory footprint between wkb.loads approach and the changes after 29.12 and the current version ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] douglasdennis commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
douglasdennis commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373758835
@Kontinuation What's your take on dropping support for empty geometry inside of multi-geometries in the new serialization format? They don't have representation in any other form so I'm not sure that supporting them provides any value.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] Imbruced commented on a diff in pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
Imbruced commented on code in PR #745:
URL: https://github.com/apache/sedona/pull/745#discussion_r1064062557
##########
python/sedona/utils/geometry_serde.py:
##########
@@ -217,70 +233,97 @@ def create_buffer_for_geom(geom_type: int, coord_type: int, size: int, num_coord
struct.pack_into('i', buffer, 4, num_coords)
return buffer
+def generate_header_bytes(geom_type: int, coord_type: int, num_coords: int) -> bytes:
+ preamble_byte = (geom_type << 4) | (coord_type << 1)
+ return struct.pack(
+ 'BBBBi',
+ preamble_byte,
+ 0,
+ 0,
+ 0,
+ num_coords
+ )
+
-def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: CoordinateSequence):
+def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: ListCoordType) -> int:
for coord in coords:
- offset = put_coordinate(buffer, offset, coord_type, coord)
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: tuple):
- x = coord[0]
- y = coord[1]
- z = coord[2] if len(coord) > 2 else math.nan
- if coord_type == CoordinateType.XY:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- offset += 16
- elif coord_type == CoordinateType.XYZ:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- struct.pack_into('d', buffer, offset + 16, z)
- offset += 24
- else:
- # Shapely does not support M dimension for now
- raise ValueError("Invalid coordinate type: {}".format(coord_type))
+def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: CoordType) -> int:
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> List[tuple]:
- coords = []
- bytes_per_coord = CoordinateType.bytes_per_coord(coord_type)
- for i in range(num_coords):
- coord = get_coordinate(buffer, offset, coord_type)
- coords.append(coord)
- offset += bytes_per_coord
+def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> Union[np.ndarray, ListCoordType]:
+ if coord_type == CoordinateType.XYM or coord_type == CoordinateType.XYZM:
+ raise NotImplementedError("XYM or XYZM coordinates are not supported")
+
+ if num_coords < 50:
Review Comment:
Is that some magical number ? :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] Kontinuation commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
Kontinuation commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373383865
LGTM. The refactored implementation looks much more pythonic while having better performance. It is also a great move to remove the expensive `has_z` calls, as shapely currently does not support M dimension.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] jiayuasu commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
jiayuasu commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1381141376
@douglasdennis Is this PR ready to be merged? :-)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] Imbruced commented on a diff in pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
Imbruced commented on code in PR #745:
URL: https://github.com/apache/sedona/pull/745#discussion_r1064062431
##########
python/sedona/utils/geometry_serde.py:
##########
@@ -217,70 +233,97 @@ def create_buffer_for_geom(geom_type: int, coord_type: int, size: int, num_coord
struct.pack_into('i', buffer, 4, num_coords)
return buffer
+def generate_header_bytes(geom_type: int, coord_type: int, num_coords: int) -> bytes:
+ preamble_byte = (geom_type << 4) | (coord_type << 1)
+ return struct.pack(
+ 'BBBBi',
+ preamble_byte,
+ 0,
+ 0,
+ 0,
+ num_coords
+ )
+
-def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: CoordinateSequence):
+def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: ListCoordType) -> int:
for coord in coords:
- offset = put_coordinate(buffer, offset, coord_type, coord)
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: tuple):
- x = coord[0]
- y = coord[1]
- z = coord[2] if len(coord) > 2 else math.nan
- if coord_type == CoordinateType.XY:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- offset += 16
- elif coord_type == CoordinateType.XYZ:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- struct.pack_into('d', buffer, offset + 16, z)
- offset += 24
- else:
- # Shapely does not support M dimension for now
- raise ValueError("Invalid coordinate type: {}".format(coord_type))
+def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: CoordType) -> int:
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> List[tuple]:
- coords = []
- bytes_per_coord = CoordinateType.bytes_per_coord(coord_type)
- for i in range(num_coords):
- coord = get_coordinate(buffer, offset, coord_type)
- coords.append(coord)
- offset += bytes_per_coord
+def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> Union[np.ndarray, ListCoordType]:
+ if coord_type == CoordinateType.XYM or coord_type == CoordinateType.XYZM:
+ raise NotImplementedError("XYM or XYZM coordinates are not supported")
+
+ if num_coords < 50:
+ coords = [
+ struct.unpack_from(CoordinateType.unpack_format(coord_type), buffer, offset + (i * CoordinateType.bytes_per_coord(coord_type)))
+ for i in range(num_coords)
+ ]
+ else:
+ nums_per_coord = CoordinateType.components_per_coord(coord_type)
+ coords = np.frombuffer(buffer, np.float64, num_coords * nums_per_coord, offset).reshape((num_coords, nums_per_coord))
+
return coords
-def get_coordinate(buffer: bytearray, offset: int, coord_type: int) -> tuple:
- x = struct.unpack_from('d', buffer, offset)[0]
- y = struct.unpack_from('d', buffer, offset + 8)[0]
- # Shapely does not support M dimension for now, so we'll simply ignore them
- if coord_type == CoordinateType.XY or coord_type == CoordinateType.XYM:
- return x, y
- elif coord_type == CoordinateType.XYZ or coord_type == CoordinateType.XYZM:
- z = struct.unpack_from('d', buffer, offset + 16)[0]
- return x, y, z
- else:
- raise NotImplementedError("XYM or XYZM coordinates were not supported")
+def get_coordinate(buffer: bytearray, offset: int, coord_type: int) -> CoordType:
+ # Shapely does not support M dimension for now, so raise if it was passed
+ return struct.unpack_from(CoordinateType.unpack_format(coord_type), buffer, offset)
-def aligned_offset(offset):
+def aligned_offset(offset: int) -> int:
return (offset + 7) & ~7
-def serialize_point(geom: Point) -> bytearray:
- coords = geom.coords
- if not coords:
- return create_buffer_for_geom(GeometryTypeID.POINT, CoordinateType.XY, 8, 0)
- coord_type = CoordinateType.type_of(coords, geom.has_z)
- bytes_per_coord = CoordinateType.bytes_per_coord(coord_type)
- size = 8 + bytes_per_coord
- buffer = create_buffer_for_geom(GeometryTypeID.POINT, coord_type, size, 1)
- put_coordinates(buffer, 8, coord_type, coords)
- return buffer
-
+def serialize_point(geom: Point) -> bytes:
+ coords = [tuple(c) for c in geom.coords]
Review Comment:
It can be simplified to
```python
def serialize_point_2(geom: Point) -> bytes:
coords = [tuple(c) for c in geom.coords]
if coords:
# FIXME this does not handle M yet, but geom.has_z is extremely slow
pack_format = "BBBBi" + "d" * geom._ndim
coord_type = CoordinateType.type_of(geom)
preamble_byte = ((GeometryTypeID.POINT << 4) | (coord_type << 1))
return struct.pack(
pack_format,
preamble_byte,
0,
0,
0,
1,
*coords[0])
return struct.pack(
'BBBBi',
18,
0,
0,
0,
0
)
```
Or am I wrong ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] Imbruced commented on a diff in pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
Imbruced commented on code in PR #745:
URL: https://github.com/apache/sedona/pull/745#discussion_r1064062506
##########
python/sedona/utils/geometry_serde.py:
##########
@@ -15,22 +15,32 @@
# specific language governing permissions and limitations
# under the License.
+import array
import struct
import math
-from typing import List, Optional
+from typing import List, Optional, Tuple, Union
+import numpy as np
from shapely.coords import CoordinateSequence
Review Comment:
We are not using this
```python
from shapely.coords import CoordinateSequence
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] douglasdennis commented on a diff in pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
douglasdennis commented on code in PR #745:
URL: https://github.com/apache/sedona/pull/745#discussion_r1064067552
##########
python/sedona/utils/geometry_serde.py:
##########
@@ -217,70 +233,97 @@ def create_buffer_for_geom(geom_type: int, coord_type: int, size: int, num_coord
struct.pack_into('i', buffer, 4, num_coords)
return buffer
+def generate_header_bytes(geom_type: int, coord_type: int, num_coords: int) -> bytes:
+ preamble_byte = (geom_type << 4) | (coord_type << 1)
+ return struct.pack(
+ 'BBBBi',
+ preamble_byte,
+ 0,
+ 0,
+ 0,
+ num_coords
+ )
+
-def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: CoordinateSequence):
+def put_coordinates(buffer: bytearray, offset: int, coord_type: int, coords: ListCoordType) -> int:
for coord in coords:
- offset = put_coordinate(buffer, offset, coord_type, coord)
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: tuple):
- x = coord[0]
- y = coord[1]
- z = coord[2] if len(coord) > 2 else math.nan
- if coord_type == CoordinateType.XY:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- offset += 16
- elif coord_type == CoordinateType.XYZ:
- struct.pack_into('d', buffer, offset, x)
- struct.pack_into('d', buffer, offset + 8, y)
- struct.pack_into('d', buffer, offset + 16, z)
- offset += 24
- else:
- # Shapely does not support M dimension for now
- raise ValueError("Invalid coordinate type: {}".format(coord_type))
+def put_coordinate(buffer: bytearray, offset: int, coord_type: int, coord: CoordType) -> int:
+ struct.pack_into(CoordinateType.unpack_format(coord_type, buffer, offset, *coord))
+ offset += CoordinateType.bytes_per_coord(coord_type)
return offset
-def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> List[tuple]:
- coords = []
- bytes_per_coord = CoordinateType.bytes_per_coord(coord_type)
- for i in range(num_coords):
- coord = get_coordinate(buffer, offset, coord_type)
- coords.append(coord)
- offset += bytes_per_coord
+def get_coordinates(buffer: bytearray, offset: int, coord_type: int, num_coords: int) -> Union[np.ndarray, ListCoordType]:
+ if coord_type == CoordinateType.XYM or coord_type == CoordinateType.XYZM:
+ raise NotImplementedError("XYM or XYZM coordinates are not supported")
+
+ if num_coords < 50:
Review Comment:
Yup, I wrestled a leprechaun for it :)
In reality, it's roughly where the performances of the two approaches crossed each other on my system so it's fairly arbitrary. In my experience geometry either has a low number of points or a lot. Unfortunately, neither approach here performs well at both scales. So, 50 seemed like a reasonable "gonna call it here" kind of threshold.
I created a constant with a better name and am using that now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] douglasdennis commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
douglasdennis commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373879622
I stand corrected :) I should have been a little more thorough. I think empty geometries in collections should be supported then, since the standards expressly provide for them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] douglasdennis commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
douglasdennis commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1381198538
@jiayuasu I feel good about this PR. If the others do as well then I think it can merge. This code will probably have to be returned to at a later date once shapely decides how it wants to handle M coordinates and once Shapely < 2.0 is no longer supported.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] jiayuasu commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
jiayuasu commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1372700871
@Kontinuation @Imbruced Please feel free to chime in with comments :-)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] Kontinuation commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
Kontinuation commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373869602
> @Kontinuation What's your take on dropping support for empty geometry inside of multi-geometries in the new serialization format? They don't have representation in any other form so I'm not sure that supporting them provides any value.
I think it is OK to ignore empty geometries inside geometry collections since it has no impact on the topological properties of the geometries, though it may change some structural properties of the geometry collection such as indices of geometry elements (`ST_GeometryN`) and number of geometries in the geometry collection (`ST_NumGeometries`).
If I understand it correctly, both WKT and WKB support empty geometries in geometry collections according to the OGC standard [Simple Feature Access - Part 1: Common Architecture](https://www.ogc.org/standards/sfa). The standard does not explicitly forbid empty geometries inside geometry collections, and these kinds of geometries could be represented according to the BNF of WKT:
```
<multilinestring tagged text> ::= multilinestring <multilinestring text>
<multilinestring text> ::= <empty set> | <left parent> <linestring text> {<comma> <linestring text>}* <right parent>
<linestring text> ::= <empty set> | <left paren> <point> {<comma> <point>}* <right paren>
<empty set> ::= EMPTY
```
A MultiLineString containing 2 empty LineStrings could be represented as `MULTILINESTRING (EMPTY, EMPTY)`. Both JTS and Shapely could successfully parse it, though their behavior varies in various aspects:
* JTS 1.19.0 parses it as a MultiLineString object containing 2 empty LineStrings.
* Shapely 2.0 parses it as a MultiLineString object containing 2 empty LineStrings, though Shapely does not allow creating MultiLineString with empty components (`MultiLineString([LineString(), LineString()])` raises an exception).
* Shapely 1.8 parses it as a MultiLineString containing no geometries.
WKB may also represent `MULTILINESTRING (EMPTY, EMPTY)` as `b'\x00\x00\x00\x00\x05\x00\x00\x00\x02\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00'`, though some implementation failed to parse it.
* JTS 1.19.0 raised a `ParseException` when parsing it
* Shapely 2.0 and Shapely 1.8 parsed it without a problem. Shapely 2.0 parsed it as a MultiLineString containing 2 empty LineStrings while Shapely 1.8 parsed it as an empty MultiLineString.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] Imbruced commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
Imbruced commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1374663015
Is the shapely section refering to wkb.loads and wkb.dumps methods ? What do you think about letting user to chose if the Z/M support should be turned on ? (I guess that majority of use cases are about 2D data). At the end we need sth low level written I am wondering if we can create some PR's in the shapely project ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] douglasdennis commented on pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
douglasdennis commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1374679892
> Do you have a chance to compare the memory footprint between wkb.loads approach and the changes after 29.12 and the current version ?
I'd love too. Unfortunately, it won't be till tomorrow or early this next week.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [sedona] jiayuasu merged pull request #745: [SEDONA-227] Python Serde Refactor
Posted by GitBox <gi...@apache.org>.
jiayuasu merged PR #745:
URL: https://github.com/apache/sedona/pull/745
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org