Posted to dev@cloudstack.apache.org by yvsubhash <gi...@git.apache.org> on 2016/11/11 11:37:20 UTC

[GitHub] cloudstack pull request #1762: CLOUDSTACK-9595 Transactions are not getting ...

GitHub user yvsubhash opened a pull request:

    https://github.com/apache/cloudstack/pull/1762

    CLOUDSTACK-9595 Transactions are not getting retried in case of datab…

    CLOUDSTACK-9595 Transactions are not getting retried in case of database deadlock errors
    Problem Statement
    --------------------------
    MySQLTransactionRollbackException is seen frequently in logs
    Root Cause
    ----------------
    Attempts to lock rows in the core data access layer of the database fail if there is a possibility of deadlock. However, operations are not retried in case of deadlock, so this change introduces retries.
    Solution
    -----------
    Operations are retried after a wait time when a deadlock exception occurs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yvsubhash/cloudstack CLOUDSTACK-9595

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/cloudstack/pull/1762.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1762
    
----
commit 600ae7cbee28d43e6007979b87e150abd2a70a7e
Author: subhash yedugundla <ve...@citrix.com>
Date:   2016-06-14T10:16:16Z

    CLOUDSTACK-9595 Transactions are not getting retried in case of database deadlock errors

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by blueorangutan <gi...@git.apache.org>.
Github user blueorangutan commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    **Trillian test result (tid-347)**
    Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
    Total time taken: 26094 seconds
    Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr1762-t347-kvm-centos7.zip
    Test completed. 47 look ok, 1 have error(s)
    
    
    Test | Result | Time (s) | Test File
    --- | --- | --- | ---
    test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | `Failure` | 369.46 | test_vpc_redundant.py
    test_01_vpc_site2site_vpn | Success | 154.87 | test_vpc_vpn.py
    test_01_vpc_remote_access_vpn | Success | 66.24 | test_vpc_vpn.py
    test_01_redundant_vpc_site2site_vpn | Success | 255.75 | test_vpc_vpn.py
    test_02_VPC_default_routes | Success | 273.12 | test_vpc_router_nics.py
    test_01_VPC_nics_after_destroy | Success | 534.12 | test_vpc_router_nics.py
    test_05_rvpc_multi_tiers | Success | 513.09 | test_vpc_redundant.py
    test_04_rvpc_network_garbage_collector_nics | Success | 1408.56 | test_vpc_redundant.py
    test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Success | 553.46 | test_vpc_redundant.py
    test_02_redundant_VPC_default_routes | Success | 753.12 | test_vpc_redundant.py
    test_09_delete_detached_volume | Success | 15.44 | test_volumes.py
    test_08_resize_volume | Success | 15.36 | test_volumes.py
    test_07_resize_fail | Success | 20.45 | test_volumes.py
    test_06_download_detached_volume | Success | 15.29 | test_volumes.py
    test_05_detach_volume | Success | 100.25 | test_volumes.py
    test_04_delete_attached_volume | Success | 10.18 | test_volumes.py
    test_03_download_attached_volume | Success | 15.30 | test_volumes.py
    test_02_attach_volume | Success | 73.79 | test_volumes.py
    test_01_create_volume | Success | 712.21 | test_volumes.py
    test_deploy_vm_multiple | Success | 278.61 | test_vm_life_cycle.py
    test_deploy_vm | Success | 0.03 | test_vm_life_cycle.py
    test_advZoneVirtualRouter | Success | 0.02 | test_vm_life_cycle.py
    test_10_attachAndDetach_iso | Success | 26.47 | test_vm_life_cycle.py
    test_09_expunge_vm | Success | 125.19 | test_vm_life_cycle.py
    test_08_migrate_vm | Success | 35.86 | test_vm_life_cycle.py
    test_07_restore_vm | Success | 0.10 | test_vm_life_cycle.py
    test_06_destroy_vm | Success | 125.83 | test_vm_life_cycle.py
    test_03_reboot_vm | Success | 125.82 | test_vm_life_cycle.py
    test_02_start_vm | Success | 10.16 | test_vm_life_cycle.py
    test_01_stop_vm | Success | 40.30 | test_vm_life_cycle.py
    test_CreateTemplateWithDuplicateName | Success | 75.58 | test_templates.py
    test_08_list_system_templates | Success | 0.04 | test_templates.py
    test_07_list_public_templates | Success | 0.04 | test_templates.py
    test_05_template_permissions | Success | 0.06 | test_templates.py
    test_04_extract_template | Success | 5.18 | test_templates.py
    test_03_delete_template | Success | 5.10 | test_templates.py
    test_02_edit_template | Success | 90.17 | test_templates.py
    test_01_create_template | Success | 70.57 | test_templates.py
    test_10_destroy_cpvm | Success | 131.62 | test_ssvm.py
    test_09_destroy_ssvm | Success | 163.18 | test_ssvm.py
    test_08_reboot_cpvm | Success | 101.42 | test_ssvm.py
    test_07_reboot_ssvm | Success | 133.53 | test_ssvm.py
    test_06_stop_cpvm | Success | 166.54 | test_ssvm.py
    test_05_stop_ssvm | Success | 133.56 | test_ssvm.py
    test_04_cpvm_internals | Success | 0.95 | test_ssvm.py
    test_03_ssvm_internals | Success | 3.27 | test_ssvm.py
    test_02_list_cpvm_vm | Success | 0.11 | test_ssvm.py
    test_01_list_sec_storage_vm | Success | 0.12 | test_ssvm.py
    test_01_snapshot_root_disk | Success | 16.20 | test_snapshots.py
    test_04_change_offering_small | Success | 209.44 | test_service_offerings.py
    test_03_delete_service_offering | Success | 0.03 | test_service_offerings.py
    test_02_edit_service_offering | Success | 0.05 | test_service_offerings.py
    test_01_create_service_offering | Success | 0.10 | test_service_offerings.py
    test_02_sys_template_ready | Success | 0.12 | test_secondary_storage.py
    test_01_sys_vm_start | Success | 0.17 | test_secondary_storage.py
    test_09_reboot_router | Success | 30.26 | test_routers.py
    test_08_start_router | Success | 25.25 | test_routers.py
    test_07_stop_router | Success | 10.15 | test_routers.py
    test_06_router_advanced | Success | 0.05 | test_routers.py
    test_05_router_basic | Success | 0.04 | test_routers.py
    test_04_restart_network_wo_cleanup | Success | 5.59 | test_routers.py
    test_03_restart_network_cleanup | Success | 50.47 | test_routers.py
    test_02_router_internal_adv | Success | 1.09 | test_routers.py
    test_01_router_internal_basic | Success | 0.58 | test_routers.py
    test_router_dns_guestipquery | Success | 76.67 | test_router_dns.py
    test_router_dns_externalipquery | Success | 0.08 | test_router_dns.py
    test_router_dhcphosts | Success | 271.73 | test_router_dhcphosts.py
    test_01_updatevolumedetail | Success | 0.07 | test_resource_detail.py
    test_01_reset_vm_on_reboot | Success | 156.00 | test_reset_vm_on_reboot.py
    test_createRegion | Success | 0.04 | test_regions.py
    test_create_pvlan_network | Success | 5.22 | test_pvlan.py
    test_dedicatePublicIpRange | Success | 0.42 | test_public_ip_range.py
    test_04_rvpc_privategw_static_routes | Success | 476.67 | test_privategw_acl.py
    test_03_vpc_privategw_restart_vpc_cleanup | Success | 516.24 | test_privategw_acl.py
    test_02_vpc_privategw_static_routes | Success | 452.44 | test_privategw_acl.py
    test_01_vpc_privategw_acl | Success | 113.17 | test_privategw_acl.py
    test_01_primary_storage_nfs | Success | 35.83 | test_primary_storage.py
    test_createPortablePublicIPRange | Success | 15.18 | test_portable_publicip.py
    test_createPortablePublicIPAcquire | Success | 15.41 | test_portable_publicip.py
    test_isolate_network_password_server | Success | 89.53 | test_password_server.py
    test_UpdateStorageOverProvisioningFactor | Success | 0.17 | test_over_provisioning.py
    test_oobm_zchange_password | Success | 21.02 | test_outofbandmanagement.py
    test_oobm_multiple_mgmt_server_ownership | Success | 16.51 | test_outofbandmanagement.py
    test_oobm_issue_power_status | Success | 10.52 | test_outofbandmanagement.py
    test_oobm_issue_power_soft | Success | 15.52 | test_outofbandmanagement.py
    test_oobm_issue_power_reset | Success | 15.51 | test_outofbandmanagement.py
    test_oobm_issue_power_on | Success | 15.52 | test_outofbandmanagement.py
    test_oobm_issue_power_off | Success | 15.52 | test_outofbandmanagement.py
    test_oobm_issue_power_cycle | Success | 15.52 | test_outofbandmanagement.py
    test_oobm_enabledisable_across_clusterzones | Success | 57.54 | test_outofbandmanagement.py
    test_oobm_enable_feature_valid | Success | 5.34 | test_outofbandmanagement.py
    test_oobm_enable_feature_invalid | Success | 0.14 | test_outofbandmanagement.py
    test_oobm_disable_feature_valid | Success | 5.19 | test_outofbandmanagement.py
    test_oobm_disable_feature_invalid | Success | 0.11 | test_outofbandmanagement.py
    test_oobm_configure_invalid_driver | Success | 0.09 | test_outofbandmanagement.py
    test_oobm_configure_default_driver | Success | 0.09 | test_outofbandmanagement.py
    test_oobm_background_powerstate_sync | Success | 29.57 | test_outofbandmanagement.py
    test_extendPhysicalNetworkVlan | Success | 15.28 | test_non_contigiousvlan.py
    test_01_nic | Success | 631.19 | test_nic.py
    test_releaseIP | Success | 278.86 | test_network.py
    test_reboot_router | Success | 409.34 | test_network.py
    test_public_ip_user_account | Success | 10.29 | test_network.py
    test_public_ip_admin_account | Success | 40.26 | test_network.py
    test_network_rules_acquired_public_ip_3_Load_Balancer_Rule | Success | 67.17 | test_network.py
    test_network_rules_acquired_public_ip_2_nat_rule | Success | 61.81 | test_network.py
    test_network_rules_acquired_public_ip_1_static_nat_rule | Success | 121.88 | test_network.py
    test_delete_account | Success | 308.97 | test_network.py
    test_02_port_fwd_on_non_src_nat | Success | 55.61 | test_network.py
    test_01_port_fwd_on_src_nat | Success | 109.78 | test_network.py
    test_nic_secondaryip_add_remove | Success | 258.78 | test_multipleips_per_nic.py
    login_test_saml_user | Success | 24.76 | test_login.py
    test_assign_and_removal_lb | Success | 134.58 | test_loadbalance.py
    test_02_create_lb_rule_non_nat | Success | 187.42 | test_loadbalance.py
    test_01_create_lb_rule_src_nat | Success | 218.79 | test_loadbalance.py
    test_03_list_snapshots | Success | 0.08 | test_list_ids_parameter.py
    test_02_list_templates | Success | 0.04 | test_list_ids_parameter.py
    test_01_list_volumes | Success | 0.03 | test_list_ids_parameter.py
    test_07_list_default_iso | Success | 0.06 | test_iso.py
    test_05_iso_permissions | Success | 0.06 | test_iso.py
    test_04_extract_Iso | Success | 5.19 | test_iso.py
    test_03_delete_iso | Success | 95.19 | test_iso.py
    test_02_edit_iso | Success | 0.05 | test_iso.py
    test_01_create_iso | Success | 21.87 | test_iso.py
    test_04_rvpc_internallb_haproxy_stats_on_all_interfaces | Success | 249.35 | test_internal_lb.py
    test_03_vpc_internallb_haproxy_stats_on_all_interfaces | Success | 188.14 | test_internal_lb.py
    test_02_internallb_roundrobin_1RVPC_3VM_HTTP_port80 | Success | 516.68 | test_internal_lb.py
    test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 | Success | 480.48 | test_internal_lb.py
    test_dedicateGuestVlanRange | Success | 10.25 | test_guest_vlan_range.py
    test_UpdateConfigParamWithScope | Success | 0.13 | test_global_settings.py
    test_rolepermission_lifecycle_update | Success | 7.14 | test_dynamicroles.py
    test_rolepermission_lifecycle_list | Success | 6.92 | test_dynamicroles.py
    test_rolepermission_lifecycle_delete | Success | 6.75 | test_dynamicroles.py
    test_rolepermission_lifecycle_create | Success | 6.77 | test_dynamicroles.py
    test_rolepermission_lifecycle_concurrent_updates | Success | 6.88 | test_dynamicroles.py
    test_role_lifecycle_update_role_inuse | Success | 6.80 | test_dynamicroles.py
    test_role_lifecycle_update | Success | 12.01 | test_dynamicroles.py
    test_role_lifecycle_list | Success | 6.87 | test_dynamicroles.py
    test_role_lifecycle_delete | Success | 11.81 | test_dynamicroles.py
    test_role_lifecycle_create | Success | 6.79 | test_dynamicroles.py
    test_role_inuse_deletion | Success | 6.77 | test_dynamicroles.py
    test_role_account_acls_multiple_mgmt_servers | Success | 9.05 | test_dynamicroles.py
    test_role_account_acls | Success | 9.12 | test_dynamicroles.py
    test_default_role_deletion | Success | 6.86 | test_dynamicroles.py
    test_04_create_fat_type_disk_offering | Success | 0.06 | test_disk_offerings.py
    test_03_delete_disk_offering | Success | 0.03 | test_disk_offerings.py
    test_02_edit_disk_offering | Success | 0.04 | test_disk_offerings.py
    test_02_create_sparse_type_disk_offering | Success | 0.07 | test_disk_offerings.py
    test_01_create_disk_offering | Success | 0.10 | test_disk_offerings.py
    test_deployvm_userdispersing | Success | 20.57 | test_deploy_vms_with_varied_deploymentplanners.py
    test_deployvm_userconcentrated | Success | 20.56 | test_deploy_vms_with_varied_deploymentplanners.py
    test_deployvm_firstfit | Success | 50.62 | test_deploy_vms_with_varied_deploymentplanners.py
    test_deployvm_userdata_post | Success | 45.62 | test_deploy_vm_with_userdata.py
    test_deployvm_userdata | Success | 111.07 | test_deploy_vm_with_userdata.py
    test_02_deploy_vm_root_resize | Success | 6.87 | test_deploy_vm_root_resize.py
    test_01_deploy_vm_root_resize | Success | 7.12 | test_deploy_vm_root_resize.py
    test_00_deploy_vm_root_resize | Success | 253.45 | test_deploy_vm_root_resize.py
    test_deploy_vm_from_iso | Success | 208.43 | test_deploy_vm_iso.py
    test_DeployVmAntiAffinityGroup | Success | 55.84 | test_affinity_groups.py
    test_03_delete_vm_snapshots | Skipped | 0.00 | test_vm_snapshots.py
    test_02_revert_vm_snapshots | Skipped | 0.00 | test_vm_snapshots.py
    test_01_test_vm_volume_snapshot | Skipped | 0.00 | test_vm_snapshots.py
    test_01_create_vm_snapshots | Skipped | 0.00 | test_vm_snapshots.py
    test_06_copy_template | Skipped | 0.00 | test_templates.py
    test_static_role_account_acls | Skipped | 0.02 | test_staticroles.py
    test_11_ss_nfs_version_on_ssvm | Skipped | 0.02 | test_ssvm.py
    test_01_scale_vm | Skipped | 0.00 | test_scale_vm.py
    test_01_primary_storage_iscsi | Skipped | 0.03 | test_primary_storage.py
    test_06_copy_iso | Skipped | 0.00 | test_iso.py
    test_deploy_vgpu_enabled_vm | Skipped | 0.03 | test_deploy_vgpu_enabled_vm.py
    test_3d_gpu_support | Skipped | 0.03 | test_deploy_vgpu_enabled_vm.py




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @jburwell @yvsubhash My understanding is that all rolled-back statements will receive MYSQL_DEADLOCK_ERROR_CODE and will be retried as part of this patch.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @jburwell I thought that most, if not all, ACS interactions through the DAO layer are atomic transactions. Do we have cases of multiple DML statements as part of the same transaction? We have been seeing quite a few deadlocks in high-transaction-volume environments where multiple management servers are employed. This causes quite a pain for users due to the randomness and the lack of any good recourse or explanation. I would argue that a proper retry is the better choice, provided we cover all the cases, including those with complex transactions. We have been successful leveraging this approach in systems built on top of ACS.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rafaelweingartner <gi...@git.apache.org>.
Github user rafaelweingartner commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38, it is great that you found one of the methods that cause the deadlock problem “com.cloud.host.dao.HostDaoImpl.findAndUpdateDirectAgentToLoad(long, Long, long)”.
    
    This method surely is problematic. I would first start asking, (i) does it need to manually open a transaction (at line 512)? Isn’t that the goal of “@DB” annotation? (ii) what is the objective of the method (“findAndUpdateDirectAgentToLoad”)? It is looking too complicated, with too many accesses to the DB.
    
    The method “resetHosts” at line 517 looks for hosts that are “managed” by the current MS and are “Disconnected” to mark them as unmanaged by any MS. That means, it updates the “managementServerId = null” of hosts marked as “Disconnect”.
    
    Would not it be better to have a specific method/transaction only for the aforementioned process?  If we extract that chunk of code to an isolated method, could not we have an atomic access to the DB without locking? “update set managementServerId = null from hosts where ……”; If the method is isolated I do not see reasons for locks here.
    
    A little further, there is another method which could be isolated, lines 527 – 546. This block of code looks for clusters being managed by the current MS. Then, it searches for hosts of clusters that are managed by the current MS, which are not being managed by the current MS (or not managed at all?)? I did not understand that because I have seen in some other piece of code that we have a balancing approach; meaning that, we try to balance the number of hosts managed by an MS.  This piece of code seems to remove the balancing process.
    
    Then, at line 551 and forward (if the number of hosts is less than the limit), it tries to look for hosts of clusters not being managed by any MS. This block could also be an isolated one. And again, we might be able to do this process without using locks.
    
    My final comment, even if we choose not to refactor and improve this piece of code, there is one thing that is very strange for me. The method “findAndUpdateDirectAgentToLoad”  is annotated with “@DB”, and also opens and tries to manage a transaction manually. Then, we have all of the pieces of code I mentioned, all of them call other methods that also are annotated with “@DB”. Can this cause a problem?
    
    For instance, when I use Spring, methods from a service layer (the place where I configure my pattern of transactions) call one another, they will all use/share the same transaction opened when the first method of the service layer was called, unless specified otherwise. How will it work here in ACS?
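
The idea of isolating the "resetHosts" step into one standalone statement, as suggested in the comment above, might look roughly like the following Java sketch. This is not code from ACS or from this pull request: the class and method names, the table name `host`, and the columns `mgmt_server_id` and `status` are assumptions made for illustration, and the real query has more conditions. Note also that InnoDB still takes row locks for an UPDATE; the point of the sketch is only that, run on its own under auto-commit, the statement holds those locks for the duration of a single statement.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch: table and column names are assumptions, not taken from the ACS schema.
class HostResetSketch {
    // One self-contained statement: release disconnected hosts owned by this management server.
    static final String RESET_SQL =
        "UPDATE host SET mgmt_server_id = NULL WHERE mgmt_server_id = ? AND status = ?";

    /** Runs the reset as a single statement and returns the number of rows changed. */
    static int resetDisconnectedHosts(Connection conn, long managementServerId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(RESET_SQL)) {
            ps.setLong(1, managementServerId);
            ps.setString(2, "Disconnected");
            return ps.executeUpdate();
        }
    }
}
```

Because the statement does not first read rows and then write them back, there is no read-then-update window that would need an explicit lock.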




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by yvsubhash <gi...@git.apache.org>.
Github user yvsubhash commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 Is the refactoring suggested by Rafael being taken care of by @nvazquez? If not, I will take it up.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @yvsubhash according to the [MySQL deadlock documentation](http://dev.mysql.com/doc/refman/5.7/en/innodb-deadlocks.html), a `MYSQL_DEADLOCK_ERROR_CODE` error indicates the enclosing transaction has been rolled back.  The proper handling for this error is to re-execute all statements executed in the aborted transaction.  From a best practices perspective, all base data should be re-retrieved and changed to ensure logical consistency with changes made by the transaction that won deadlock resolution.
    
    As I understand this patch, only the most recently executed DML is retried.  Therefore, any previously executed changes will be discarded and the DML will be re-executed either in a new transaction or in auto-commit (I didn't look up how the client handles the transaction context in this scenario).  If my understanding is correct, this patch could lead to issues ranging from unexpected foreign key integrity errors to data corruption.
    
    Rather than attempting to implement a generic retry, I think the best approach to addressing deadlocks is to treat them as bugs.  This patch could be modified to provide detailed logging information about the conditions under which a deadlock occurs, providing the information necessary to refactor the system to avoid lock contention.
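
To illustrate the handling described in the comment above (re-executing every statement of the aborted transaction rather than only the last DML), here is a minimal Java sketch. It is not the code in this pull request. `DeadlockRetry`, `runWithRetry`, and the attempt and backoff parameters are hypothetical names; 1213 is MySQL's vendor error code for a deadlock. The sketch assumes the caller passes a unit of work that begins the transaction, re-reads its base data, applies all changes, and commits, so that each attempt starts from fresh state.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of ACS.
class DeadlockRetry {
    // MySQL vendor error code for ER_LOCK_DEADLOCK.
    static final int MYSQL_DEADLOCK_ERROR_CODE = 1213;

    /** Returns true if the exception or any of its causes is a MySQL deadlock error. */
    static boolean isDeadlock(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SQLException
                    && ((SQLException) c).getErrorCode() == MYSQL_DEADLOCK_ERROR_CODE) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the whole unit of work again after a deadlock, up to maxAttempts times.
     * Any other failure, or a deadlock on the final attempt, is rethrown unchanged.
     */
    static <T> T runWithRetry(Callable<T> unitOfWork, int maxAttempts, long backoffMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return unitOfWork.call();
            } catch (Exception e) {
                if (!isDeadlock(e) || attempt >= maxAttempts) {
                    throw e;
                }
                // Linear backoff before re-running the entire unit of work.
                Thread.sleep(backoffMillis * attempt);
            }
        }
    }
}
```

A helper like this is only safe when the retried unit is the outermost transaction boundary; wrapping an individual DAO call that runs inside a larger transaction would reproduce the problem described above.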



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 that is not a safe assumption.  Transactions often span multiple statements and methods across DAOs.  `TransactionLegacy` has a transaction stacking/nested model that further occludes when a transaction actually completes.
    
    Deadlocks are a severe problem that needs to be fixed.  Unfortunately, this patch would do more harm than good as it would eventually corrupt the database.   In and of themselves, retries are also a very expensive solution to the problem both in terms of the engineering effort required to do it properly and the extra stress placed on the database to perform additional work that will likely fail.  Furthermore, a generic **and** correct retry mechanism is a very difficult thing to write.  Given the way transaction boundaries are managed in ACS, I think such an effort would be nearly impossible.
    
    In a properly written application, deadlocks should very rarely, if ever, occur.  Their presence is a symptom of improper transaction handling and/or poor lock management.   Therefore, my suggestion is that we change this patch to log details about the context in which deadlocks occur.  We can then use this information to identify the areas in ACS where these contention problems are located and fix the root cause.
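
The logging alternative proposed in the comment above could be sketched as follows. This is an illustrative outline only, not the patch under review: `DeadlockReporter`, its method names, and the choice of fields to record are assumptions, and it uses `java.util.logging` purely to stay self-contained. The helper records the context and leaves the exception for the caller to rethrow, so the transaction still fails and rolls back.

```java
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of ACS.
class DeadlockReporter {
    private static final Logger LOG = Logger.getLogger(DeadlockReporter.class.getName());

    // MySQL vendor error code for ER_LOCK_DEADLOCK.
    static final int MYSQL_DEADLOCK_ERROR_CODE = 1213;

    /** Builds a one-line description of where the deadlock was observed. */
    static String describe(SQLException e, String transactionName, String sql) {
        return String.format(
            "Deadlock in transaction '%s' on thread '%s' (SQLState=%s, code=%d) while executing: %s",
            transactionName, Thread.currentThread().getName(),
            e.getSQLState(), e.getErrorCode(), sql);
    }

    /**
     * Logs the context (with stack trace) when the error is a deadlock.
     * Returns true if something was logged; the caller still rethrows the exception.
     */
    static boolean reportIfDeadlock(SQLException e, String transactionName, String sql) {
        if (e.getErrorCode() != MYSQL_DEADLOCK_ERROR_CODE) {
            return false;
        }
        LOG.log(Level.WARNING, describe(e, transactionName, sql), e);
        return true;
    }
}
```

Collected over time, entries of this kind would show which transactions and statements collide most often, which is the information needed to decide where to refactor.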



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 my reading of the code is that only the most recently attempted DML will be re-executed.  Furthermore, retrying without refreshing the base data can also lead to data corruption.  The best thing to do in the case of a deadlock is to fail and roll back, due to the risk of data corruption.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    What if the author can figure out a way to identify all parts of the transaction being cancelled and retry all of them? Or retry the whole transaction? It would be nice to open a path for the author to implement this.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rafaelweingartner <gi...@git.apache.org>.
Github user rafaelweingartner commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    Thanks, @serg38.
    Looking at the SQL statements you posted, we could start to discuss whether or not some of them need locking transactions.
    
    Ignoring Deadlocks 3 and 4 for now, I think we could start with the ones that look the simplest (Deadlocks 1 and 2). 
    
    These SQL statements have probably been generated, so tracking them in ACS may not be that easy, but at first glance, I feel that we could execute them without needing a lock in the database. 
    
    I tried to find the first SQL statement, without success. Would you mind helping me pinpoint where in the code the SQL from transaction 1 at deadlock 1 is generated? Then, we can evaluate whether or not a lock is needed there. 
    
    Are the SQL statements you showed complete? I found a place that could generate SQL similar to the one at transaction 1 and deadlock 1, but this code adds one extra where clause.
    
    The method I am talking about is:
    com.cloud.cluster.agentlb.ClusterBasedAgentLoadBalancerPlanner.getHostsToRebalance(long, int)




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rafaelweingartner Looks like deadlocks 2 and 3 are the same. I scanned our production log, and since last December we have had 6400 deadlocks. Of them, close to 6000 were Deadlock 1, 20 were Deadlock 2, and 700 were a different Deadlock 5. The other deadlocks were in negligible numbers. I think if we figure out Deadlock 1 and Deadlock 5, this will be a good start. I will try to find the source of the transactions for them. In production we run a commercial distribution based for the most part on the 4.7 branch of ACS.
    
    Deadlock 5
    
    *** (1) TRANSACTION:
    TRANSACTION D518886F8, ACTIVE 2 sec fetching rows
    mysql tables in use 4, locked 4
    LOCK WAIT 24 lock struct(s), heap size 3112, 8 row lock(s), undo log entries 17
    MySQL thread id 29781, OS thread handle 0x7f9df36db700, query id 3625404021 ussclpdcsmgt012.autodesk.com 10.41.13.14 cloud Sorting result
    SELECT user_ip_address.id, user_ip_address.account_id, user_ip_address.domain_id, user_ip_address.public_ip_address, user_ip_address.data_center_id, user_ip_address.source_nat, user_ip_address.allocated, user_ip_address.vlan_db_id, user_ip_address.one_to_one_nat, user_ip_address.vm_id, user_ip_address.state, user_ip_address.mac_address, user_ip_address.source_network_id, user_ip_address.network_id, user_ip_address.uuid, user_ip_address.physical_network_id, user_ip_address.is_system, user_ip_address.vpc_id, user_ip_address.dnat_vmip, user_ip_address.is_portable, user_ip_address.display, user_ip_address.removed, user_ip_address.created FROM user_ip_address  INNER JOIN vlan ON user_ip_address.vlan_db_id=vlan.id WHERE user_ip_address.data_center_id = 6  AND user_ip_address.allocated IS NULL  AND user_ip_address.vlan_db_id IN (32,33,36,37,41,61,62,91,92,93,94,106,107,108,109,11
    *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
    *** (2) TRANSACTION:
    TRANSACTION D5188582B, ACTIVE 17 sec updating or deleting, thread declared inside InnoDB 499
    mysql tables in use 1, locked 1
    25 lock struct(s), heap size 3112, 13 row lock(s), undo log entries 18
    MySQL thread id 29820, OS thread handle 0x7fa35a868700, query id 3625417999 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Updating
    UPDATE user_ip_address SET user_ip_address.source_nat=0, user_ip_address.is_system=0, user_ip_address.account_id=3309, user_ip_address.allocated='2016-03-25 15:36:39', user_ip_address.state='Allocated', user_ip_address.domain_id=335 WHERE user_ip_address.id = 3284
    *** (2) HOLDS THE LOCK(S):




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by abhinandanprateek <gi...@git.apache.org>.
Github user abhinandanprateek commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    Even retrying the full transaction could be problematic, as there might be checks done before firing the transaction that are no longer valid.
    It may mostly work, but it is not foolproof.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @yvsubhash Please, take this up. So far this PR hasn't moved forward.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rafaelweingartner You might be right that pod_vlan_map should be in the join. Maybe I didn't find the correct methods after all. @jburwell @rhtyd What do you think?
    
    I was able to find the management server log for Deadlock 1. It looks like one of the transactions came from the findAndUpdateDirectAgentToLoad method in HostDaoImpl, which creates a rather complex transaction:
    
    2016-11-24 15:04:39,284 DEBUG [host.dao.HostDaoImpl] (ClusteredAgentManager Timer:ctx-a8e9449c) Resetting hosts suitable for reconnect
    2016-11-24 15:04:39,320 DEBUG [db.Transaction.Transaction] (ClusteredAgentManager Timer:ctx-a8e9449c) Rolling back the transaction: Time = 36 Name =  ClusteredAgentManager Timer; called by -TransactionLegacy.rollback:879-TransactionLegacy.removeUpTo:822-TransactionLegacy.close:646-TransactionContextInterceptor.invoke:36-ReflectiveMethodInvocation.proceed:161-ExposeInvocationInterceptor.invoke:91-ReflectiveMethodInvocation.proceed:172-JdkDynamicAopProxy.invoke:204-$Proxy48.findAndUpdateDirectAgentToLoad:-1-ClusteredAgentManagerImpl.scanDirectAgentToLoad:195-ClusteredAgentManagerImpl.runDirectAgentScanTimerTask:185-ClusteredAgentManagerImpl.access$100:99
    2016-11-24 15:04:39,322 ERROR [agent.manager.ClusteredAgentManagerImpl] (ClusteredAgentManager Timer:ctx-a8e9449c) Unexpected exception DB Exception on: com.mysql.jdbc.JDBC4PreparedStatement@1e58727c: SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host WHERE host.resource IS NOT NULL  AND host.mgmt_server_id = 345048964870  AND host.last_ping <= 1445339907  AND host.cluster_id IS NOT NULL  AND host.status IN ('Disconnected','Down','Alert')  AND host.removed IS NULL  FOR UPDATE
    Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
    
    The beginning of the second transaction was:
    SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host  LEFT OUTER JOIN op_host_transfer ON host.id=op_host_transfer.id  IN
    
    I will try to trace it to the ACS method.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @jburwell @yvsubhash I might be wrong, but this PR retries on deadlock for only two DAO methods: searchIncludingRemoved and customSearchIncludingRemoved. No update methods use this retry mechanism. If that's the case, there is no risk of corrupting the DB.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rhtyd <gi...@git.apache.org>.
Github user rhtyd commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @blueorangutan test



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rafaelweingartner <gi...@git.apache.org>.
Github user rafaelweingartner commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 I have just now started reading this PR (excuse me if I overlooked some information).
    
    > If we are to try to implement a general way of dealing with deadlocks in ACS how could it be done to ensure DB consistency and correct transaction retry?
    
    Answering your question: in my opinion, we should not “try” to implement a general way of managing transactions. We are only having this type of problem because, instead of using a framework to manage database access and transactions, a custom module was developed to do that and incorporated into ACS; this means we have to maintain and live with this code.
    
    Now, the problem is that it would be a Dantesque task to change the way ACS manages transactions today.
    
    I am with John on this one, retrying is not a good idea; it can hide problems, cause overheads and cause even more headaches.  I think that the best approach is to deal with this type of problem on the fly; this means, as John said, addressing them as bugs when they are reported.
    
    Having said that, I have not helped a bit to solve the problem… Let’s see if I can be of any help.
    
    I was reading the ticket #CLOUDSTACK-9595. It seems that the problem (reported there) happened when a VM was being removed from a table “instance_group_vm_map”. I just do not understand because the method called is “UserVmManagerImpl.addInstanceToGroup”. I am hoping that this makes sense. Anyways…
    
    The MySQL docs say the following about deadlocks:
    > A deadlock is a situation where different transactions are unable to proceed because each holds a lock that the other needs
    
    This means there was something else being executed when that VM was deleted/added, and this caused the deadlock and the exception. Probably something else is using the table “instance_group_vm_map”.
    
    I think we should track these two tasks/processes that can cause the problem and work them out, instead of looking for a generic way to deal with this situation. Maybe these processes that are causing deadlock are locking tables that are not needed or executing some processing that could be avoided or modified.
    
    Do we have a use case that can reproduce the problem?



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by blueorangutan <gi...@git.apache.org>.
Github user blueorangutan commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @jburwell We've been running this fix as part of a proprietary CS build for several weeks now. We are observing elimination of deadlocks and no DB corruption. Retry seems to be the only realistic way of dealing with deadlocks in a complex environment like ACS. Can we agree on a limited scope or set of conditions for this PR so it can move forward?



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    Due to the previous discussion, I am -1 on merging this PR.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rafaelweingartner Thanks a lot. I totally agree that resetting hosts doesn't really need to be part of the transaction and should be extracted to a new method. The same goes for lines 527-546, and then another one after 551.
    My understanding of agent LB is that it is handled separately from the reconnect part. I might be wrong, but it is done in ClusteredAgentManagerImpl by scheduling a rebalancing task every 60 sec,
    getAgentRebalanceScanTask, which takes care of transferring connected agents.
    @rhtyd @jburwell @koushik-das @karuturi Do you agree that we can split the transaction in findAndUpdateDirectAgentToLoad into 3 non-transactional methods and thus eliminate one side of a repeated deadlock? This is the very core of agent management, for which it is very hard, if at all possible, to write a smoke test. If so, @nvazquez might be able to work on refactoring this method later this month.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 corruption could happen at any point -- it's a ticking time bomb.  From an ACID perspective, this patch fails on consistency.  All data being updated must be re-queried and validated in order to ensure the consistency guarantee is not violated.  In a high volume system, it's not a matter of if, but when a sequence of events will occur and corrupt the database.  Bear in mind, these corruptions would be in the content of the data and would not yield a MySQL error.  They would be phenomena such as phantom rows or inconsistent data updates.
    
    As I said previously, the only real solution to deadlocks is to fix the way the system manages transactions and locks.  This patch is merely hiding an error while creating the potential for far larger problems.
    
    For these reasons, I remain -1 on merging this patch.
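
To make the re-query-and-validate point concrete, here is a minimal sketch of what retrying a whole unit of work (as opposed to a single statement) could look like. It is not code from this PR or from CloudStack: the class name, `retryWholeTransaction`, and the demo are hypothetical. The only external fact it relies on is that MySQL reports a deadlock (error 1213) with SQLSTATE 40001. The caller-supplied unit of work is assumed to begin the transaction, re-read and re-validate its inputs, write, and commit on every attempt.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class WholeTransactionRetrySketch {
    // MySQL reports error 1213 (deadlock) with SQLSTATE 40001.
    static final String DEADLOCK_SQL_STATE = "40001";

    /**
     * Hypothetical helper: re-runs the ENTIRE unit of work after a deadlock.
     * The unit of work must itself begin the transaction, re-query and
     * re-validate its inputs, perform its writes, and commit.
     */
    static <T> T retryWholeTransaction(Callable<T> unitOfWork, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return unitOfWork.call();
            } catch (SQLException e) {
                boolean deadlock = DEADLOCK_SQL_STATE.equals(e.getSQLState());
                if (!deadlock || attempt >= maxAttempts) {
                    throw e; // not a deadlock, or out of attempts: surface the error
                }
                Thread.sleep(backoffMillis * attempt); // simple linear backoff
            }
        }
    }

    /** Demo: the simulated unit of work deadlocks twice, then succeeds. Returns the attempt count. */
    static int demo() throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        retryWholeTransaction(() -> {
            int n = attempts.incrementAndGet();
            // A real unit of work would begin, re-read, re-validate, write, and commit here.
            if (n < 3) {
                throw new SQLException(
                    "Deadlock found when trying to get lock; try restarting transaction",
                    DEADLOCK_SQL_STATE, 1213);
            }
            return "committed";
        }, 5, 1L);
        return attempts.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("attempts = " + demo());
    }
}
```

Even in this form, the sketch is only safe if every check the caller performed before entering the transaction is repeated inside the unit of work, which is the concern raised in this thread.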



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rafaelweingartner I tried tracing where deadlock 5 originated. It seems both transactions are part of the same method, fetchNewPublicIp in IpAddressManagerImpl. The transactions are executed on different management servers.
    The update is triggered through the markPublicIpAsAllocated method.
    
    The select seems to come from there as well, from fetchNewPublicIp in IpAddressManagerImpl:
    
            AssignIpAddressFromPodVlanSearch = _ipAddressDao.createSearchBuilder();
            AssignIpAddressFromPodVlanSearch.and("dc", AssignIpAddressFromPodVlanSearch.entity().getDataCenterId(), Op.EQ);
            AssignIpAddressFromPodVlanSearch.and("allocated", AssignIpAddressFromPodVlanSearch.entity().getAllocatedTime(), Op.NULL);
            SearchBuilder<VlanVO> podVlanSearch = _vlanDao.createSearchBuilder();
            podVlanSearch.and("type", podVlanSearch.entity().getVlanType(), Op.EQ);
            podVlanSearch.and("networkId", podVlanSearch.entity().getNetworkId(), Op.EQ);
            SearchBuilder<PodVlanMapVO> podVlanMapSB = _podVlanMapDao.createSearchBuilder();
            podVlanMapSB.and("podId", podVlanMapSB.entity().getPodId(), Op.EQ);
            AssignIpAddressFromPodVlanSearch.join("podVlanMapSB", podVlanMapSB, podVlanMapSB.entity().getVlanDbId(), AssignIpAddressFromPodVlanSearch.entity().getVlanId(),
                JoinType.INNER);
            AssignIpAddressFromPodVlanSearch.join("vlan", podVlanSearch, podVlanSearch.entity().getId(), AssignIpAddressFromPodVlanSearch.entity().getVlanId(), JoinType.INNER);
            AssignIpAddressFromPodVlanSearch.done();
    
    public IPAddressVO doInTransaction(TransactionStatus status) throws InsufficientAddressCapacityException {
                    StringBuilder errorMessage = new StringBuilder("Unable to get ip adress in ");
                    boolean fetchFromDedicatedRange = false;
                    List<Long> dedicatedVlanDbIds = new ArrayList<Long>();
                    List<Long> nonDedicatedVlanDbIds = new ArrayList<Long>();
    
                    SearchCriteria<IPAddressVO> sc = null;
                    if (podId != null) {
                        sc = **AssignIpAddressFromPodVlanSearch**.create();
                        sc.setJoinParameters("podVlanMapSB", "podId", podId);
                        errorMessage.append(" pod id=" + podId);
                    } else {
                        sc = AssignIpAddressSearch.create();
                        errorMessage.append(" zone id=" + dcId);
                    }




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by cloudmonger <gi...@git.apache.org>.
Github user cloudmonger commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    ### ACS CI BVT Run
     **Summary:**
     Build Number 464
     Hypervisor xenserver
     NetworkType Advanced
     Passed=104
     Failed=1
     Skipped=7
    
    _Link to logs Folder (search by build_no):_ https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0
    
    
    **Failed tests:**
    * test_routers_network_ops.py
    
     * test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failed
    
    
    **Skipped tests:**
    test_01_test_vm_volume_snapshot
    test_vm_nic_adapter_vmxnet3
    test_static_role_account_acls
    test_11_ss_nfs_version_on_ssvm
    test_nested_virtualization_vmware
    test_3d_gpu_support
    test_deploy_vgpu_enabled_vm
    
    **Passed test suites:**
    test_deploy_vm_with_userdata.py
    test_affinity_groups_projects.py
    test_portable_publicip.py
    test_over_provisioning.py
    test_global_settings.py
    test_scale_vm.py
    test_service_offerings.py
    test_routers_iptables_default_policy.py
    test_loadbalance.py
    test_routers.py
    test_reset_vm_on_reboot.py
    test_deploy_vms_with_varied_deploymentplanners.py
    test_network.py
    test_router_dns.py
    test_non_contigiousvlan.py
    test_login.py
    test_deploy_vm_iso.py
    test_list_ids_parameter.py
    test_public_ip_range.py
    test_multipleips_per_nic.py
    test_regions.py
    test_affinity_groups.py
    test_network_acl.py
    test_pvlan.py
    test_volumes.py
    test_nic.py
    test_deploy_vm_root_resize.py
    test_resource_detail.py
    test_secondary_storage.py
    test_vm_life_cycle.py
    test_disk_offerings.py



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rafaelweingartner I might be wrong but 2d  came from findAndUpdateDirectAgentToLoad in HostDaoImpl  which also creates a large transaction.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 there remains a risk when those methods are executed in the context of an open transaction where DMLs have already been executed and subsequent DMLs will be executed.  In this scenario, the first set of the changes would be lost due to the rollback triggered by the query deadlock with the second set proceeding successfully.
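
The scenario above can be illustrated with a toy in-memory model. This is a sketch only, not CloudStack or MySQL driver code: `ToyTransaction` and every name in it are hypothetical. It assumes the behavior described in the comment, namely that a deadlock rolls back the victim's whole transaction, so a retry of just the failed query silently continues without the earlier writes.

```java
import java.util.ArrayList;
import java.util.List;

class LostWriteSketch {
    /** Toy stand-in for a transaction: buffered writes that a deadlock discards. */
    static final class ToyTransaction {
        private final List<String> pending = new ArrayList<>();
        private final List<String> committed = new ArrayList<>();
        private boolean deadlockOnNextQuery;

        void write(String row) {
            pending.add(row);
        }

        void query() {
            if (deadlockOnNextQuery) {
                deadlockOnNextQuery = false;
                pending.clear(); // the deadlock victim loses ALL of its uncommitted work
                throw new IllegalStateException("deadlock");
            }
        }

        void commit() {
            committed.addAll(pending);
            pending.clear();
        }
    }

    /** Retries only the failed query, as a statement-level retry would. */
    static List<String> runWithStatementLevelRetry() {
        ToyTransaction tx = new ToyTransaction();
        tx.write("first-update");      // DML executed before the query
        tx.deadlockOnNextQuery = true; // simulate a deadlock on the next query
        try {
            tx.query();
        } catch (IllegalStateException e) {
            tx.query();                // retry just the query; the rollback goes unnoticed
        }
        tx.write("second-update");     // DML executed after the query
        tx.commit();
        return tx.committed;
    }

    public static void main(String[] args) {
        System.out.println(runWithStatementLevelRetry());
    }
}
```

In this model the commit succeeds without any error, yet only the second update is persisted.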



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by cloudmonger <gi...@git.apache.org>.
Github user cloudmonger commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    ### ACS CI BVT Run
     **Summary:**
     Build Number 135
     Hypervisor xenserver
     NetworkType Advanced
     Passed=102
     Failed=3
     Skipped=6
    
    _Link to logs Folder (search by build_no):_ https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0
    
    
    **Failed tests:**
    * test_non_contigiousvlan.py
    
     * test_extendPhysicalNetworkVlan Failed
    
    * test_deploy_vm_iso.py
    
     * test_deploy_vm_from_iso Failing since 19 runs
    
    * test_vm_life_cycle.py
    
     * test_10_attachAndDetach_iso Failing since 20 runs
    
    
    **Skipped tests:**
    test_01_test_vm_volume_snapshot
    test_vm_nic_adapter_vmxnet3
    test_static_role_account_acls
    test_11_ss_nfs_version_on_ssvm
    test_3d_gpu_support
    test_deploy_vgpu_enabled_vm
    
    **Passed test suites:**
    test_deploy_vm_with_userdata.py
    test_affinity_groups_projects.py
    test_portable_publicip.py
    test_over_provisioning.py
    test_global_settings.py
    test_scale_vm.py
    test_service_offerings.py
    test_routers_iptables_default_policy.py
    test_loadbalance.py
    test_routers.py
    test_reset_vm_on_reboot.py
    test_snapshots.py
    test_deploy_vms_with_varied_deploymentplanners.py
    test_network.py
    test_router_dns.py
    test_login.py
    test_list_ids_parameter.py
    test_public_ip_range.py
    test_multipleips_per_nic.py
    test_regions.py
    test_affinity_groups.py
    test_network_acl.py
    test_pvlan.py
    test_volumes.py
    test_nic.py
    test_deploy_vm_root_resize.py
    test_resource_detail.py
    test_secondary_storage.py
    test_routers_network_ops.py
    test_disk_offerings.py



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    LGTM. Finally!!! We have been seeing occasional deadlocks in environments with a high transaction rate. @rhtyd @jburwell This could be a good addition to 4.8/4.9.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rhtyd <gi...@git.apache.org>.
Github user rhtyd commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @abhinandanprateek can you help reviewing this one, thanks.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @jburwell I concur, but if @yvsubhash verified that those methods don't participate in complex DML transactions, this might still be a good start. If so, this approach might be expanded later to multi-DML transactions so that each piece can be retried individually. I myself traced a few deadlocks in ACS using native MySQL deadlock logging, and it doesn't seem there would be a viable alternative to retries due to the well-known complexity of ACS DB operations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 with custom plugins, there is no way to reliably perform such tracing.  I can think of batch cleanup operations in the storage layer that follow the pattern I described.  Even if there were, we would have planted a landmine for future changes to the system.  Deadlocks are significant technical debt that is clearly causing significant operational issues.  Unfortunately, there is no way to address them generically.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by blueorangutan <gi...@git.apache.org>.
Github user blueorangutan commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rafaelweingartner @swill @wido @koushik-das @karuturi @rhtyd @jburwell  Let's ask a different question. If we are to try to implement a general way of dealing with deadlocks in ACS how could it be done to ensure DB consistency and correct transaction retry?



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rafaelweingartner <gi...@git.apache.org>.
Github user rafaelweingartner commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 I have the same understanding about the agent LB. And this is one of the problems I think we have found here. It seems that this method is removing the balance created with agent LB. And, of course, this method is also causing deadlocks.
    
    Let’s hear the feedback from others and discuss what we can do going forward.



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by blueorangutan <gi...@git.apache.org>.
Github user blueorangutan commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    Packaging result: ✔centos6 ✔centos7 ✔debian. JID-164



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rafaelweingartner <gi...@git.apache.org>.
Github user rafaelweingartner commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @serg38 if that "AssignIpAddressFromPodVlanSearch" object was being used to generate the SQL, shouldn't we see a join with "pod_vlan_map" too? For me, this SC is very confusing.
    
    Following the same idea of what I would do if using Spring to manage transactions, the method "fetchNewPublicIp" does not need the "@DB" annotation (assuming this is the annotation that opens a transaction and locks tables in ACS). The method “fetchNewPublicIp” is a simple "retrieve/get" method. Whenever we have to lock the table that is being used by this method, we could use "fetchNewPublicIp" in a method that has the "@DB" annotation (assuming it has transaction propagation). This is something that already seems to happen. Methods "allocateIp" and "assignDedicateIpAddress" use “fetchNewPublicIp” and they have their own “@DB” annotation.
    
    Methods “assignPublicIpAddressFromVlans” and “assignPublicIpAddress” seem not to do anything that requires a transaction; despite misleading names (at least for me) indicating that something will be assigned to someone, they just call “fetchNewPublicIp” and return its response. Therefore, I do not think they require a locking transaction.




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by serg38 <gi...@git.apache.org>.
Github user serg38 commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    Here are a few samples of the deadlocks we observe in a high transaction volume environment with multiple management servers. As you can see, most of them are concurrent operations from different management servers, and the statements are either select or select for update. The following 4 types account for the majority of the deadlocks we have seen so far (80-90% of all deadlocks). Deadlocks 1-3 happen much more often than deadlock 4. They are next to impossible to reproduce, since they occur once every few days with 4 management servers and an average VM deployment volume of 3000 a day.
    
    Deadlock 1:
    
    InnoDB: transactions deadlock detected, dumping detailed information.
    151217  3:08:20
    *** (1) TRANSACTION:
    TRANSACTION BB4D4C91D, ACTIVE 0 sec fetching rows
    mysql tables in use 1, locked 1
    LOCK WAIT 11 lock struct(s), heap size 3112, 5 row lock(s)
    MySQL thread id 47654, OS thread handle 0x7f0475bdd700, query id 3821358107 ussclpdcsmgt012.autodesk.com 10.41.13.14 cloud Sending data
    SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host WHERE host.resource IS NOT NULL  AND host.mgmt_server_id = 345048964870 
    *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
    *** (2) TRANSACTION:
    TRANSACTION BB4D4C915, ACTIVE 1 sec fetching rows, thread declared inside InnoDB 449
    mysql tables in use 3, locked 3
    29 lock struct(s), heap size 6960, 15 row lock(s), undo log entries 1
    MySQL thread id 47623, OS thread handle 0x7f0a47074700, query id 3821724056 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Copying to tmp table
    SELECT host.id, host.disconnected, host.name, host.status, host.type, host.private_ip_address, host.private_mac_address, host.private_netmask, host.public_netmask, host.public_ip_address, host.public_mac_address, host.storage_ip_address, host.cluster_id, host.storage_netmask, host.storage_mac_address, host.storage_ip_address_2, host.storage_netmask_2, host.storage_mac_address_2, host.hypervisor_type, host.proxy_port, host.resource, host.fs_type, host.available, host.setup, host.resource_state, host.hypervisor_version, host.update_count, host.uuid, host.data_center_id, host.pod_id, host.cpu_sockets, host.cpus, host.url, host.speed, host.ram, host.parent, host.guid, host.capabilities, host.total_size, host.last_ping, host.mgmt_server_id, host.dom0_memory, host.version, host.created, host.removed FROM host  LEFT OUTER JOIN op_host_transfer ON host.id=op_host_transfer.id
    *** (2) HOLDS THE LOCK(S):
    RECORD LOCKS space id 0 page no 147488 n bits 840 index `i_host__removed` of table `cloud`.`host` trx id BB4D4C915 lock_mode X locks rec but not gap
    
    
    Deadlock 2:
    
    InnoDB: transactions deadlock detected, dumping detailed information.
    151218 11:03:00
    *** (1) TRANSACTION:
    TRANSACTION BBB232C81, ACTIVE 51 sec starting index read
    mysql tables in use 1, locked 1
    LOCK WAIT 3 lock struct(s), heap size 1248, 2 row lock(s)
    MySQL thread id 57308, OS thread handle 0x7f0a45c24700, query id 5217973695 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Sending data
    SELECT resource_count.id, resource_count.type, resource_count.account_id, resource_count.domain_id, resource_count.count FROM resource_count WHERE resource_count.id IN (5083,4867,5079,33652,5077)  FOR UPDATE
    *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
    *** (2) TRANSACTION:
    TRANSACTION BBB2254AC, ACTIVE 116 sec starting index read, thread declared inside InnoDB 500
    mysql tables in use 1, locked 1
    207 lock struct(s), heap size 31160, 1650 row lock(s), undo log entries 2
    MySQL thread id 56926, OS thread handle 0x7f04756c9700, query id 5218549710 ussclpdcsmgt014.autodesk.com 10.41.13.16 cloud Sending data
    SELECT resource_count.id, resource_count.type, resource_count.account_id, resource_count.domain_id, resource_count.count FROM resource_count WHERE resource_count.id IN (5083,4867,5079,33652,5077)  FOR UPDATE
    
    Deadlock 3:
    
    *** (1) TRANSACTION:
    TRANSACTION BBB232C81, ACTIVE 51 sec starting index read
    mysql tables in use 1, locked 1
    LOCK WAIT 3 lock struct(s), heap size 1248, 2 row lock(s)
    MySQL thread id 57308, OS thread handle 0x7f0a45c24700, query id 5217973695 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud Sending data
    SELECT resource_count.id, resource_count.type, resource_count.account_id, resource_count.domain_id, resource_count.count FROM resource_count WHERE resource_count.id IN (5083,4867,5079,33652,5077)  FOR UPDATE
    *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
    *** (2) TRANSACTION:
    TRANSACTION BBB2254AC, ACTIVE 116 sec starting index read, thread declared inside InnoDB 500
    mysql tables in use 1, locked 1
    207 lock struct(s), heap size 31160, 1650 row lock(s), undo log entries 2
    MySQL thread id 56926, OS thread handle 0x7f04756c9700, query id 5218549710 ussclpdcsmgt014.autodesk.com 10.41.13.16 cloud Sending data
    SELECT resource_count.id, resource_count.type, resource_count.account_id, resource_count.domain_id, resource_count.count FROM resource_count WHERE resource_count.id IN (5083,4867,5079,33652,5077)  FOR UPDATE
    *** (2) HOLDS THE LOCK(S):
    
    Deadlock 4:
    *** (1) TRANSACTION:
    TRANSACTION C3BDD81EF, ACTIVE 0 sec fetching rows
    mysql tables in use 1, locked 1
    LOCK WAIT 55 lock struct(s), heap size 6960, 3 row lock(s), undo log entries 1
    MySQL thread id 250487, OS thread handle 0x7f0a460b6700, query id 32833273614 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud updating
    DELETE FROM vm_reservation WHERE vm_reservation.vm_id = 869089
    *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
    *** (2) TRANSACTION:
    TRANSACTION C3BDD81FC, ACTIVE 0 sec inserting, thread declared inside InnoDB 500
    mysql tables in use 1, locked 1
    4 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 2
    MySQL thread id 250553, OS thread handle 0x7f0a3720c700, query id 32833273762 ussclpdcsmgt013.autodesk.com 10.41.13.15 cloud update
    INSERT INTO volume_reservation (volume_reservation.vm_reservation_id, volume_reservation.vm_id, volume_reservation.volume_id, volume_reservation.pool_id) VALUES (997419, 1009484, 918449, 316)
    *** (2) HOLDS THE LOCK(S):
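
For reference, the retry behaviour this PR proposes (retry the operation after some wait time when a deadlock exception occurs) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in the PR: the class DeadlockRetry, the interface SqlWork, and the parameters maxAttempts and baseWaitMillis are hypothetical names. It assumes the deadlock surfaces as a java.sql.SQLException carrying MySQL error code 1213 (ER_LOCK_DEADLOCK) or SQLSTATE 40001, and that the work passed in re-runs the whole transaction, because InnoDB rolls back the transaction it picks as the deadlock victim.

```java
import java.sql.SQLException;

// Hypothetical helper, not part of CloudStack: retries a unit of database
// work when the database reports that it was chosen as a deadlock victim.
final class DeadlockRetry {
    // MySQL vendor error code for ER_LOCK_DEADLOCK.
    static final int MYSQL_ER_LOCK_DEADLOCK = 1213;
    // SQLSTATE that MySQL attaches to deadlock errors.
    static final String SQLSTATE_DEADLOCK = "40001";

    // Hypothetical callback type: one complete transaction, start to commit.
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    static boolean isDeadlock(SQLException e) {
        return e.getErrorCode() == MYSQL_ER_LOCK_DEADLOCK
                || SQLSTATE_DEADLOCK.equals(e.getSQLState());
    }

    // Runs the work up to maxAttempts times. Only deadlock errors are
    // retried; any other SQLException is rethrown immediately.
    static <T> T withRetry(SqlWork<T> work, int maxAttempts, long baseWaitMillis)
            throws SQLException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isDeadlock(e) || attempt >= maxAttempts) {
                    throw e;
                }
                // Wait a little longer after each failed attempt so the
                // competing transaction has a chance to finish.
                Thread.sleep(baseWaitMillis * attempt);
            }
        }
    }
}
```

A caller would wrap a complete transaction, for example DeadlockRetry.withRetry(() -> updateResourceCounts(conn), 3, 100L), where updateResourceCounts is again a hypothetical method. Retrying only part of a transaction would not be safe, since the earlier statements have already been rolled back.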




[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by rhtyd <gi...@git.apache.org>.
Github user rhtyd commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @blueorangutan package



[GitHub] cloudstack issue #1762: CLOUDSTACK-9595 Transactions are not getting retried...

Posted by jburwell <gi...@git.apache.org>.
Github user jburwell commented on the issue:

    https://github.com/apache/cloudstack/pull/1762
  
    @rhtyd I am -1 on this PR

