Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2022/03/29 09:34:55 UTC

[GitHub] [superset] ktmud opened a new pull request #19406: perf(alembic): paginize db migration for new dataset models

ktmud opened a new pull request #19406:
URL: https://github.com/apache/superset/pull/19406


   ### SUMMARY
   
   We ran into scalability issues with the db migration script for [SIP-68](https://github.com/apache/superset/issues/14909) (PR #17543). For context, we have more than 165k datasets, 1.9 million columns, and 345k metrics. Loading all of them into memory and converting them to the new tables in one giant commit, as the current implementation does, is impractical: it would kill the Python process, if not the db connection.
   
   This PR optimizes the migration script by:
   1. **Adding pagination:** instead of migrating all datasets & columns in one commit, I added manual pagination to fetch datasets 100 at a time. SQLA streaming with [yield_per](https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.yield_per) may also be an option, but it's more complicated when reads and writes are interlaced with each other---SQLA will complain that `iter_next` is impossible because the entities have changed.
   2. **Committing changes in small batches:** use [autocommit_block()](https://alembic.sqlalchemy.org/en/latest/api/runtime.html#alembic.runtime.migration.MigrationContext.autocommit_block) to move the write operations for each dataset out of the [per-migration transaction](https://github.com/apache/superset/blob/b3db6140c88106fedebe91db0ca817eca4234dc8/superset/migrations/env.py#L117), so new entities are written to the database sooner. This avoids keeping all loaded and created entities in memory, either on the Python side or in the db buffer. I'm placing the autocommit block around each dataset instead of each page---i.e., writing the dataset and all related columns for each SqlaTable in one transaction, instead of writing all datasets and all columns from one page of SqlaTables in one transaction---as this tested to be the fastest configuration. I also tested different page sizes, and 20 seemed to work best: 2,000 datasets can be converted in under 3 minutes, with average memory usage below 350MB while writing is in progress.
   3. **Using eager loading:** added `lazy="selectin"` to enable [`SELECT IN`](https://docs.sqlalchemy.org/en/14/tutorial/orm_related_objects.html?highlight=selectin#selectin-load) eager loading, which pulls related data in one SQL statement instead of three.
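   The pagination-plus-batched-commit pattern can be sketched as follows. This is a simplified, runnable sketch with a stand-in model and an in-memory SQLite database; the real migration reads `SqlaTable` rows, writes `sl_*` rows, and wraps each batch in Alembic's `autocommit_block()` rather than calling `session.commit()` directly:

```python
# Simplified sketch of pagination + per-batch commits (stand-in model).
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# Seed some rows to paginate over.
with Session(engine) as session:
    session.add_all(Dataset(name=f"ds_{i}") for i in range(45))
    session.commit()

PAGE_SIZE = 20
migrated = 0
with Session(engine) as session:
    total = session.query(Dataset).count()
    for offset in range(0, total, PAGE_SIZE):
        page = (
            session.query(Dataset)
            .order_by(Dataset.id)  # stable ordering is required for OFFSET paging
            .offset(offset)
            .limit(PAGE_SIZE)
            .all()
        )
        for dataset in page:
            dataset.name = dataset.name.upper()  # stand-in for the conversion work
            migrated += 1
        session.commit()  # commit each page so memory stays bounded
print(migrated)
```

   Committing per page (or, as in the PR, per dataset inside an autocommit block) keeps both the ORM identity map and the db transaction buffer small.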
   
   ### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
   
   <img width="818" alt="image" src="https://user-images.githubusercontent.com/335541/160579397-4bbabd92-382d-4759-b61b-7fec83b67d1b.png">
   
   
   ### TESTING INSTRUCTIONS
   
   - Tested locally with our large internal database.
   - Added the following filters [here](https://github.com/apache/superset/blob/b8e1999b943b121259f56f53e3906e59721159ea/superset/migrations/versions/b8d3a24d9131_new_dataset_models.py#L513) to test the db records generated for specific data tables:
     ```python
               .filter(
                   or_(
                       SqlaTable.table_name == "my_table_name",
                       SqlaTable.sql.like("%my_table_name%"),
                   )
               )
     ```
   - Verified the values of the created entities in a MySQL shell via spot-check SELECT queries:
      ```SQL
       SELECT `schema`, `name`, `is_physical` from sl_datasets limit 10;
       SELECT count(*) from sl_columns;
      ```
   
   ### ADDITIONAL INFORMATION
   <!--- Check any relevant boxes with "x" -->
   <!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue -->
   - [ ] Has associated issue:
   - [ ] Required feature flags:
   - [ ] Changes UI
   - [ ] Includes DB Migration (follow approval process in [SIP-59](https://github.com/apache/superset/issues/13351))
     - [ ] Migration is atomic, supports rollback & is backwards-compatible
     - [ ] Confirm DB migration upgrade and downgrade tested
     - [ ] Runtime estimates and downtime expectations provided
   - [ ] Introduces new feature or API
   - [ ] Removes existing feature or API
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org


[GitHub] [superset] ktmud commented on a change in pull request #19406: perf(alembic): paginize db migration for new dataset models

Posted by GitBox <gi...@apache.org>.
ktmud commented on a change in pull request #19406:
URL: https://github.com/apache/superset/pull/19406#discussion_r837757439



##########
File path: superset/migrations/versions/b8d3a24d9131_new_dataset_models.py
##########
@@ -373,65 +366,35 @@ def upgrade():
         # ExtraJSONMixin
         sa.Column("extra_json", sa.Text(), nullable=True),
         # ImportExportMixin
-        sa.Column("uuid", UUIDType(binary=True), primary_key=False, default=uuid4),
+        sa.Column(
+            "uuid", UUIDType(binary=True), primary_key=False, default=uuid4, unique=True

Review comment:
       I'm moving unique key and foreign key constraints inline.
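       For illustration, an inline `unique=True` on a column produces the same UNIQUE constraint in the generated DDL as a separate `create_unique_constraint` call. A minimal sketch with a hypothetical demo table (not Superset's actual schema):

```python
import sqlalchemy as sa

metadata = sa.MetaData()
demo = sa.Table(
    "sl_datasets_demo",  # hypothetical table name for illustration
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("uuid", sa.String(36), unique=True),  # inline, no batch_alter_table
)
engine = sa.create_engine("sqlite://")
ddl = str(sa.schema.CreateTable(demo).compile(engine))
print("UNIQUE" in ddl)
```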






[GitHub] [superset] ktmud commented on pull request #19406: perf(alembic): paginize db migration for new dataset models

Posted by GitBox <gi...@apache.org>.
ktmud commented on pull request #19406:
URL: https://github.com/apache/superset/pull/19406#issuecomment-1085048156


   Closing in favor of #19421.




[GitHub] [superset] github-actions[bot] commented on pull request #19406: perf(alembic): paginize db migration for new dataset models

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #19406:
URL: https://github.com/apache/superset/pull/19406#issuecomment-1083850335


   ⚠️ @ktmud Your base branch `master` has just also updated `superset/migrations`.
   
   ❗ **Please consider rebasing your branch to avoid db migration conflicts.**




[GitHub] [superset] ktmud commented on a change in pull request #19406: perf(alembic): paginize db migration for new dataset models

Posted by GitBox <gi...@apache.org>.
ktmud commented on a change in pull request #19406:
URL: https://github.com/apache/superset/pull/19406#discussion_r837266456



##########
File path: superset/migrations/versions/b8d3a24d9131_new_dataset_models.py
##########
@@ -262,18 +260,21 @@ def after_insert(target: SqlaTable) -> None:  # pylint: disable=too-many-locals
         columns.append(
             NewColumn(
                 name=column.column_name,
-                type=column.type or "Unknown",
-                expression=column.expression or conditional_quote(column.column_name),
                 description=column.description,
-                is_temporal=column.is_dttm,
+                external_url=target.external_url,
+                expression=column.expression or conditional_quote(column.column_name),
+                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_aggregation=False,
-                is_physical=column.expression is None or column.expression == "",
-                is_spatial=False,
-                is_partition=False,
                 is_increase_desired=True,
-                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_managed_externally=target.is_managed_externally,
-                external_url=target.external_url,
+                is_temporal=column.is_dttm,
+                is_spatial=False,
+                is_partition=False,
+                is_physical=(
+                    is_physical_table
+                    and (column.expression is None or column.expression == "")
+                ),
+                type=column.type or "Unknown",

Review comment:
       A-Z

##########
File path: superset/migrations/versions/b8d3a24d9131_new_dataset_models.py
##########
@@ -292,69 +293,62 @@ def after_insert(target: SqlaTable) -> None:  # pylint: disable=too-many-locals
         columns.append(
             NewColumn(
                 name=metric.metric_name,
-                type="Unknown",  # figuring this out would require a type inferrer
-                expression=metric.expression,
-                warning_text=metric.warning_text,
                 description=metric.description,
+                expression=metric.expression,
+                external_url=target.external_url,
+                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_aggregation=True,
                 is_additive=is_additive,
-                is_physical=False,
-                is_spatial=False,
-                is_partition=False,
                 is_increase_desired=True,
-                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_managed_externally=target.is_managed_externally,
-                external_url=target.external_url,
+                is_partition=False,
+                is_physical=False,
+                is_spatial=False,
+                is_temporal=False,
+                type="Unknown",  # figuring this out would require a type inferrer
+                warning_text=metric.warning_text,

Review comment:
       Ditto A-Z

##########
File path: superset/migrations/versions/b8d3a24d9131_new_dataset_models.py
##########
@@ -499,14 +484,46 @@ def upgrade():
         ),
     )
 
-    # migrate existing datasets to the new models
-    bind = op.get_bind()
-    session = db.Session(bind=bind)  # pylint: disable=no-member
+    with op.batch_alter_table("sl_columns") as batch_op:
+        batch_op.create_unique_constraint("uq_sl_columns_uuid", ["uuid"])
+
+    with op.batch_alter_table("sl_tables") as batch_op:
+        batch_op.create_unique_constraint("uq_sl_tables_uuid", ["uuid"])
+
+    with op.batch_alter_table("sl_datasets") as batch_op:
+        batch_op.create_unique_constraint("uq_sl_datasets_uuid", ["uuid"])
+        batch_op.create_unique_constraint(
+            "uq_sl_datasets_sqlatable_id", ["sqlatable_id"]
+        )

Review comment:
       Move constraints to the end to improve readability.

##########
File path: superset/migrations/versions/b8d3a24d9131_new_dataset_models.py
##########
@@ -232,13 +229,14 @@ def after_insert(target: SqlaTable) -> None:  # pylint: disable=too-many-locals
     """
     Copy old datasets to the new models.
     """
-    session = inspect(target).session
+    session: Session = inspect(target).session
+    database_id = target.database_id
+    is_physical_table = not target.sql

Review comment:
       `BaseDatasource` checks whether a table is physical/virtual by checking whether `table.sql` is falsy. I'm changing all `target.sql is None` in this script to keep it consistent with current behavior.
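       A tiny illustration of the difference, with a hypothetical helper name:

```python
def is_physical_table(sql):
    """Stand-in for the check: a dataset is physical when it has no SQL."""
    return not sql

# `not sql` treats both None and "" as physical,
# whereas `sql is None` would classify "" as virtual.
assert is_physical_table(None)
assert is_physical_table("")
assert not is_physical_table("SELECT 1")
```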

##########
File path: superset/migrations/versions/b8d3a24d9131_new_dataset_models.py
##########
@@ -292,69 +293,62 @@ def after_insert(target: SqlaTable) -> None:  # pylint: disable=too-many-locals
         columns.append(
             NewColumn(
                 name=metric.metric_name,
-                type="Unknown",  # figuring this out would require a type inferrer
-                expression=metric.expression,
-                warning_text=metric.warning_text,
                 description=metric.description,
+                expression=metric.expression,
+                external_url=target.external_url,
+                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_aggregation=True,
                 is_additive=is_additive,
-                is_physical=False,
-                is_spatial=False,
-                is_partition=False,
                 is_increase_desired=True,
-                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_managed_externally=target.is_managed_externally,
-                external_url=target.external_url,
+                is_partition=False,
+                is_physical=False,
+                is_spatial=False,
+                is_temporal=False,
+                type="Unknown",  # figuring this out would require a type inferrer
+                warning_text=metric.warning_text,
             ),
         )
 
-    # physical dataset
-    tables = []
-    if target.sql is None:
-        physical_columns = [column for column in columns if column.is_physical]
-
-        # create table
+    if is_physical_table:
+        # create physical sl_table
         table = NewTable(
             name=target.table_name,
             schema=target.schema,
             catalog=None,  # currently not supported
-            database_id=target.database_id,
-            columns=physical_columns,
+            database_id=database_id,
+            # only save physical columns
+            columns=[column for column in columns if column.is_physical],
             is_managed_externally=target.is_managed_externally,
             external_url=target.external_url,
         )
-        tables.append(table)
-
-    # virtual dataset
+        tables = [table]
+        expression = conditional_quote(target.table_name)
     else:
-        # mark all columns as virtual (not physical)
-        for column in columns:
-            column.is_physical = False
-
-        # find referenced tables
+        # find referenced tables and link to dataset
         parsed = ParsedQuery(target.sql)
         referenced_tables = parsed.tables
-
-        # predicate for finding the referenced tables
         predicate = or_(
             *[
                 and_(
+                    NewTable.database_id == database_id,

Review comment:
       Add `database_id` enforcement as all three together (db + schema + table name) forms a [unique key](https://github.com/apache/superset/blob/88029e21b6068f845d806cfc10d478a5d972ffa5/superset/connectors/sqla/models.py#L509).
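       A sketch of the resulting predicate, using a minimal stand-in model (column names assumed to mirror the migration's `NewTable`):

```python
from sqlalchemy import Column, Integer, String, and_, or_
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class NewTable(Base):  # minimal stand-in for the migration's sl_tables model
    __tablename__ = "sl_tables"
    id = Column(Integer, primary_key=True)
    database_id = Column(Integer)
    schema = Column(String(255))
    name = Column(String(255))


# One OR branch per referenced table; each branch enforces the full
# (database_id, schema, name) unique key, so tables with the same
# schema and name in a different database can never match.
database_id = 1
referenced_tables = [("main", "orders"), ("main", "users")]
predicate = or_(*[
    and_(
        NewTable.database_id == database_id,
        NewTable.schema == schema,
        NewTable.name == name,
    )
    for schema, name in referenced_tables
])
print(str(predicate))
```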






[GitHub] [superset] betodealmeida commented on a change in pull request #19406: perf(alembic): paginize db migration for new dataset models

Posted by GitBox <gi...@apache.org>.
betodealmeida commented on a change in pull request #19406:
URL: https://github.com/apache/superset/pull/19406#discussion_r837702596



##########
File path: superset/migrations/versions/b8d3a24d9131_new_dataset_models.py
##########
@@ -292,69 +293,62 @@ def after_insert(target: SqlaTable) -> None:  # pylint: disable=too-many-locals
         columns.append(
             NewColumn(
                 name=metric.metric_name,
-                type="Unknown",  # figuring this out would require a type inferrer
-                expression=metric.expression,
-                warning_text=metric.warning_text,
                 description=metric.description,
+                expression=metric.expression,
+                external_url=target.external_url,
+                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_aggregation=True,
                 is_additive=is_additive,
-                is_physical=False,
-                is_spatial=False,
-                is_partition=False,
                 is_increase_desired=True,
-                extra_json=json.dumps(extra_json) if extra_json else None,
                 is_managed_externally=target.is_managed_externally,
-                external_url=target.external_url,
+                is_partition=False,
+                is_physical=False,
+                is_spatial=False,
+                is_temporal=False,
+                type="Unknown",  # figuring this out would require a type inferrer
+                warning_text=metric.warning_text,
             ),
         )
 
-    # physical dataset
-    tables = []
-    if target.sql is None:
-        physical_columns = [column for column in columns if column.is_physical]
-
-        # create table
+    if is_physical_table:
+        # create physical sl_table
         table = NewTable(
             name=target.table_name,
             schema=target.schema,
             catalog=None,  # currently not supported
-            database_id=target.database_id,
-            columns=physical_columns,
+            database_id=database_id,
+            # only save physical columns
+            columns=[column for column in columns if column.is_physical],
             is_managed_externally=target.is_managed_externally,
             external_url=target.external_url,
         )
-        tables.append(table)
-
-    # virtual dataset
+        tables = [table]
+        expression = conditional_quote(target.table_name)
     else:
-        # mark all columns as virtual (not physical)
-        for column in columns:
-            column.is_physical = False
-
-        # find referenced tables
+        # find referenced tables and link to dataset
         parsed = ParsedQuery(target.sql)
         referenced_tables = parsed.tables
-
-        # predicate for finding the referenced tables
         predicate = or_(
             *[
                 and_(
+                    NewTable.database_id == database_id,

Review comment:
       We also need to update `superset/connectors/sqla/models.py`, where the original logic lives (it had to be copied here so that this migration would still work in the future).






[GitHub] [superset] github-actions[bot] commented on pull request #19406: perf(alembic): paginize db migration for new dataset models

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #19406:
URL: https://github.com/apache/superset/pull/19406#issuecomment-1082640198


   ⚠️ @ktmud Your base branch `master` has just also updated `superset/migrations`.
   
   ❗ **Please consider rebasing your branch to avoid db migration conflicts.**




[GitHub] [superset] github-actions[bot] commented on pull request #19406: perf(alembic): paginize db migration for new dataset models

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #19406:
URL: https://github.com/apache/superset/pull/19406#issuecomment-1082170333


   ⚠️ @ktmud Your base branch `master` has just also updated `superset/migrations`.
   
   ❗ **Please consider rebasing your branch to avoid db migration conflicts.**

