Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2022/10/12 02:30:45 UTC

[GitHub] [superset] john-bodley opened a new pull request, #21778: fix(migration): Ensure the paginated update is deterministic

john-bodley opened a new pull request, #21778:
URL: https://github.com/apache/superset/pull/21778

   <!---
   Please write the PR title following the conventions at https://www.conventionalcommits.org/en/v1.0.0/
   Example:
   fix(dashboard): load charts correctly
   -->
   
   ### SUMMARY
   
   This PR fixes an issue with the `paginated_update` method, which is used in a number of migrations. The pagination was not deterministic: on each iteration it [sliced](https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.slice) the query, issuing a SQL statement with an `OFFSET` and `LIMIT`. If the results are not ordered consistently, e.g., by primary key, the order of rows across slices is not guaranteed, meaning that a record may never be processed or may be processed multiple times.
   
   TL;DR: any existing migrations that used this logic are potentially wrong and need to be re-run.
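
To make the failure mode concrete, here is a minimal, database-free sketch. It is not Superset code: `unordered_query`, `ordered_query`, and `paginate_by_offset` are hypothetical names, and the "database" is simulated by a function that alternates between two row orders on successive executions, which is one of many orderings a real database is allowed to produce when no `ORDER BY` is given.

```python
from itertools import count

IDS = list(range(10))
_executions = count()


def unordered_query():
    # Hypothetical stand-in for a query without ORDER BY: successive
    # executions alternate between ascending and descending row order.
    n = next(_executions)
    return IDS[:] if n % 2 == 0 else IDS[::-1]


def ordered_query():
    # Ordering by a unique key returns the same sequence on every execution.
    return sorted(IDS)


def paginate_by_offset(run_query, total, batch_size):
    """Mimic OFFSET/LIMIT pagination: the query is re-run for every page."""
    seen = []
    for start in range(0, total, batch_size):
        rows = run_query()
        seen.extend(rows[start:start + batch_size])
    return seen


stable = paginate_by_offset(ordered_query, len(IDS), batch_size=3)
unstable = paginate_by_offset(unordered_query, len(IDS), batch_size=3)

missed = sorted(set(IDS) - set(unstable))
duplicated = sorted({i for i in unstable if unstable.count(i) > 1})
print("stable:", stable)
print("unstable:", unstable, "missed:", missed, "duplicated:", duplicated)
```

With a consistent order every record is visited exactly once; with the alternating order some records are visited twice and others never.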
   
   ### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
   <!--- Skip this if not applicable -->
   
   ### TESTING INSTRUCTIONS
   
   Tested locally.
   
   ### ADDITIONAL INFORMATION
   <!--- Check any relevant boxes with "x" -->
   <!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue -->
   - [ ] Has associated issue:
   - [ ] Required feature flags:
   - [ ] Changes UI
   - [ ] Includes DB Migration (follow approval process in [SIP-59](https://github.com/apache/superset/issues/13351))
     - [ ] Migration is atomic, supports rollback & is backwards-compatible
     - [ ] Confirm DB migration upgrade and downgrade tested
     - [ ] Runtime estimates and downtime expectations provided
   - [ ] Introduces new feature or API
   - [ ] Removes existing feature or API
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org


[GitHub] [superset] john-bodley commented on pull request #21778: fix(migration): Ensure the paginated update is deterministic

Posted by GitBox <gi...@apache.org>.
john-bodley commented on PR #21778:
URL: https://github.com/apache/superset/pull/21778#issuecomment-1275542412

   @ktmud 
   
   > Your comment reminded me, what if an object was updated in the loop in a way that makes it no longer matches the query's filtering condition
   
   Now that the query executes only once, this isn't a problem. The result set is paginated rather than the query being sliced.
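
The scenario raised in the quoted question can be reproduced with a small sketch. It assumes the standard-library `sqlite3` module instead of SQLAlchemy, and a hypothetical `entry` table with a `migrated` flag; it is an illustration of the idea, not the code in this PR. The single-execution variant uses `fetchall()` for brevity, so it does not bound memory the way batched fetching would.

```python
import sqlite3


def make_db():
    # Hypothetical table: six rows, none migrated yet.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id INTEGER PRIMARY KEY, migrated INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO entry (id, migrated) VALUES (?, 0)",
        [(i,) for i in range(1, 7)],
    )
    return conn


def migrate_by_offset(conn, batch_size):
    """Re-run the filtered query for every page using LIMIT/OFFSET."""
    total = conn.execute(
        "SELECT COUNT(*) FROM entry WHERE migrated = 0"
    ).fetchone()[0]
    processed = []
    for start in range(0, total, batch_size):
        page = conn.execute(
            "SELECT id FROM entry WHERE migrated = 0 ORDER BY id LIMIT ? OFFSET ?",
            (batch_size, start),
        ).fetchall()
        for (entry_id,) in page:
            # The update removes the row from the filter, so the rows that
            # remain shift towards offset 0 and later pages skip them.
            conn.execute("UPDATE entry SET migrated = 1 WHERE id = ?", (entry_id,))
            processed.append(entry_id)
    return processed


def migrate_single_execution(conn):
    """Run the filtered query once, then walk its result set."""
    rows = conn.execute(
        "SELECT id FROM entry WHERE migrated = 0 ORDER BY id"
    ).fetchall()
    processed = []
    for (entry_id,) in rows:
        conn.execute("UPDATE entry SET migrated = 1 WHERE id = ?", (entry_id,))
        processed.append(entry_id)
    return processed


sliced = migrate_by_offset(make_db(), batch_size=2)
single = migrate_single_execution(make_db())
print("sliced:", sliced)
print("single:", single)
```

Even with an explicit `ORDER BY`, the sliced variant skips rows 3 and 4 because each update shrinks the filtered set, while the single-execution variant visits every row.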




[GitHub] [superset] john-bodley commented on a diff in pull request #21778: fix(migration): Ensure the paginated update is deterministic

Posted by GitBox <gi...@apache.org>.
john-bodley commented on code in PR #21778:
URL: https://github.com/apache/superset/pull/21778#discussion_r992926312


##########
superset/migrations/shared/utils.py:
##########
@@ -100,22 +100,31 @@ def paginated_update(
     """
     Update models in small batches so we don't have to load everything in memory.
     """
-    start = 0
-    count = query.count()
+
+    total = query.count()
+    processed = 0
     session: Session = inspect(query).session
+    result = session.execute(query)
+
     if print_page_progress is None or print_page_progress is True:
-        print_page_progress = lambda current, total: print(
-            f"    {current}/{total}", end="\r"
+        print_page_progress = lambda processed, total: print(
+            f"    {processed}/{total}", end="\r"
         )
-    while start < count:
-        end = min(start + batch_size, count)
-        for obj in query[start:end]:
-            yield obj
-            session.merge(obj)
+
+    while True:

Review Comment:
   The options were to use either `query.slice(...)` or `session.execute(query).fetchmany(...)`. I opted for the latter because:
   
   1. Otherwise one would need to include an `order_by(...)` condition and there's no guarantee that it would be defined.
   2. Re-executing the query _n_ times using a different `OFFSET` per query is likely inefficient and does not guarantee correct pagination, given that the filter condition could change. Executing the query once and paginating through the result set seems more optimal.
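
The "execute once, then fetch in batches" pattern described above can be sketched as follows. This is an assumption-laden illustration rather than the merged implementation: it uses the standard-library `sqlite3` cursor, whose `fetchmany` plays the role of the `fetchmany(...)` call mentioned in the comment, and `iter_batches` and the `entry` table are hypothetical names.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of rows from an already executed cursor until exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO entry (id) VALUES (?)", [(i,) for i in range(1, 8)])

total = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
cursor = conn.execute("SELECT id FROM entry")  # the query runs exactly once

processed = 0
batch_sizes = []
for batch in iter_batches(cursor, batch_size=3):
    processed += len(batch)
    batch_sizes.append(len(batch))
    # Progress output in the same "processed/total" shape as the diff above.
    print(f"    {processed}/{total}", end="\r")
```

Because the statement is executed only once, no `ORDER BY` is needed to keep pages consistent, and only one batch of rows is held by the loop at a time.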





[GitHub] [superset] john-bodley merged pull request #21778: fix(migration): Ensure the paginated update is deterministic

Posted by GitBox <gi...@apache.org>.
john-bodley merged PR #21778:
URL: https://github.com/apache/superset/pull/21778




[GitHub] [superset] john-bodley commented on a diff in pull request #21778: fix(migration): Ensure the paginated update is deterministic

Posted by GitBox <gi...@apache.org>.
john-bodley commented on code in PR #21778:
URL: https://github.com/apache/superset/pull/21778#discussion_r992926312


##########
superset/migrations/shared/utils.py:
##########
@@ -100,22 +100,31 @@ def paginated_update(
     """
     Update models in small batches so we don't have to load everything in memory.
     """
-    start = 0
-    count = query.count()
+
+    total = query.count()
+    processed = 0
     session: Session = inspect(query).session
+    result = session.execute(query)
+
     if print_page_progress is None or print_page_progress is True:
-        print_page_progress = lambda current, total: print(
-            f"    {current}/{total}", end="\r"
+        print_page_progress = lambda processed, total: print(
+            f"    {processed}/{total}", end="\r"
         )
-    while start < count:
-        end = min(start + batch_size, count)
-        for obj in query[start:end]:
-            yield obj
-            session.merge(obj)
+
+    while True:

Review Comment:
   The options were to use either `query.slice(...)` or `session.execute(query).fetchmany(...)`. I opted for the latter because:
   
   1. Otherwise one would need to include an `order_by(...)` condition and there's no guarantee that it would be defined.
   2. Re-executing the query _n_ times using a different `OFFSET` per query is likely inefficient and does not guarantee correct pagination, given that the filter condition could change.





[GitHub] [superset] john-bodley commented on a diff in pull request #21778: fix(migration): Ensure the paginated update is deterministic

Posted by GitBox <gi...@apache.org>.
john-bodley commented on code in PR #21778:
URL: https://github.com/apache/superset/pull/21778#discussion_r992926475


##########
superset/migrations/shared/utils.py:
##########
@@ -100,22 +100,31 @@ def paginated_update(
     """
     Update models in small batches so we don't have to load everything in memory.
     """
-    start = 0
-    count = query.count()
+
+    total = query.count()
+    processed = 0
     session: Session = inspect(query).session
+    result = session.execute(query)
+
     if print_page_progress is None or print_page_progress is True:
-        print_page_progress = lambda current, total: print(
-            f"    {current}/{total}", end="\r"
+        print_page_progress = lambda processed, total: print(
+            f"    {processed}/{total}", end="\r"
         )
-    while start < count:
-        end = min(start + batch_size, count)
-        for obj in query[start:end]:
-            yield obj
-            session.merge(obj)

Review Comment:
   There's no need to merge the record. The caller should handle this if required.



##########
superset/migrations/versions/2022-06-27_14-59_7fb8bca906d2_permalink_rename_filterstate.py:
##########
@@ -66,7 +66,6 @@ def upgrade():
                 state["anchor"] = state["hash"]
                 del state["hash"]
             entry.value = pickle.dumps(value)
-    session.commit()

Review Comment:
   There's no need to commit as `paginated_update` handles it.



##########
superset/migrations/versions/2022-06-27_14-59_7fb8bca906d2_permalink_rename_filterstate.py:
##########
@@ -87,5 +86,3 @@ def downgrade():
                 state["hash"] = state["anchor"]
                 del state["anchor"]
             entry.value = pickle.dumps(value)
-        session.merge(entry)

Review Comment:
   There's no need to merge the existing entry. See [here](https://michaelcho.me/article/sqlalchemy-commit-flush-expire-refresh-merge-whats-the-difference) for details:
   
   > Used when you may have more than 1 in-memory objects which map to the same database record with some key.



##########
superset/migrations/shared/utils.py:
##########
@@ -100,22 +100,31 @@ def paginated_update(
     """
     Update models in small batches so we don't have to load everything in memory.
     """
-    start = 0
-    count = query.count()
+
+    total = query.count()
+    processed = 0
     session: Session = inspect(query).session
+    result = session.execute(query)
+
     if print_page_progress is None or print_page_progress is True:
-        print_page_progress = lambda current, total: print(
-            f"    {current}/{total}", end="\r"
+        print_page_progress = lambda processed, total: print(
+            f"    {processed}/{total}", end="\r"
         )
-    while start < count:
-        end = min(start + batch_size, count)
-        for obj in query[start:end]:
-            yield obj
-            session.merge(obj)
+
+    while True:

Review Comment:
   The options were to use either `query.slice(...)` or `session.execute(query).fetchmany(...)`. I opted for the latter because:
   
   1. Otherwise one would need to include an `order_by(...)` condition and there's no guarantee that it would be defined.
   2. I'm not sure re-executing the query _n_ times using a different OFFSET per query is efficient. Executing the query once and paginating through the result set seems more optimal.



##########
superset/migrations/versions/2022-06-27_14-59_7fb8bca906d2_permalink_rename_filterstate.py:
##########
@@ -87,5 +86,3 @@ def downgrade():
                 state["hash"] = state["anchor"]
                 del state["anchor"]
             entry.value = pickle.dumps(value)
-        session.merge(entry)
-    session.commit()

Review Comment:
   There's no need to commit as `paginated_update` handles it.





[GitHub] [superset] codecov[bot] commented on pull request #21778: fix(migration): Ensure the paginated update is deterministic

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on PR #21778:
URL: https://github.com/apache/superset/pull/21778#issuecomment-1275515892

   # [Codecov](https://codecov.io/gh/apache/superset/pull/21778?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#21778](https://codecov.io/gh/apache/superset/pull/21778?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f8f44a5) into [master](https://codecov.io/gh/apache/superset/commit/bd3166b6034f79e731abc662f427ef0dff23d3d4?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bd3166b) will **decrease** coverage by `11.37%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@             Coverage Diff             @@
   ##           master   #21778       +/-   ##
   ===========================================
   - Coverage   66.88%   55.50%   -11.38%     
   ===========================================
     Files        1802     1802               
     Lines       68987    68988        +1     
     Branches     7345     7345               
   ===========================================
   - Hits        46139    38291     -7848     
   - Misses      20951    28800     +7849     
     Partials     1897     1897               
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hive | `52.92% <0.00%> (-0.01%)` | :arrow_down: |
   | mysql | `?` | |
   | postgres | `?` | |
   | presto | `52.82% <0.00%> (-0.01%)` | :arrow_down: |
   | python | `57.93% <0.00%> (-23.53%)` | :arrow_down: |
   | sqlite | `?` | |
   | unit | `51.05% <0.00%> (-0.01%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/superset/pull/21778?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [superset/migrations/shared/utils.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQvbWlncmF0aW9ucy9zaGFyZWQvdXRpbHMucHk=) | `32.25% <0.00%> (-3.81%)` | :arrow_down: |
   | [superset/utils/dashboard\_import\_export.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQvdXRpbHMvZGFzaGJvYXJkX2ltcG9ydF9leHBvcnQucHk=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [superset/tags/core.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQvdGFncy9jb3JlLnB5) | `4.54% <0.00%> (-95.46%)` | :arrow_down: |
   | [superset/key\_value/commands/update.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQva2V5X3ZhbHVlL2NvbW1hbmRzL3VwZGF0ZS5weQ==) | `0.00% <0.00%> (-90.91%)` | :arrow_down: |
   | [superset/key\_value/commands/delete.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQva2V5X3ZhbHVlL2NvbW1hbmRzL2RlbGV0ZS5weQ==) | `0.00% <0.00%> (-87.88%)` | :arrow_down: |
   | [superset/key\_value/commands/delete\_expired.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQva2V5X3ZhbHVlL2NvbW1hbmRzL2RlbGV0ZV9leHBpcmVkLnB5) | `0.00% <0.00%> (-84.00%)` | :arrow_down: |
   | [superset/dashboards/commands/importers/v0.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQvZGFzaGJvYXJkcy9jb21tYW5kcy9pbXBvcnRlcnMvdjAucHk=) | `15.62% <0.00%> (-76.25%)` | :arrow_down: |
   | [superset/datasets/commands/update.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQvZGF0YXNldHMvY29tbWFuZHMvdXBkYXRlLnB5) | `25.00% <0.00%> (-69.05%)` | :arrow_down: |
   | [superset/datasets/commands/importers/v0.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQvZGF0YXNldHMvY29tbWFuZHMvaW1wb3J0ZXJzL3YwLnB5) | `24.03% <0.00%> (-69.00%)` | :arrow_down: |
   | [superset/datasets/commands/create.py](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3VwZXJzZXQvZGF0YXNldHMvY29tbWFuZHMvY3JlYXRlLnB5) | `31.25% <0.00%> (-68.75%)` | :arrow_down: |
   | ... and [284 more](https://codecov.io/gh/apache/superset/pull/21778/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   

