Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2018/09/04 13:27:00 UTC
[jira] [Commented] (OOZIE-3336) [persistence] Refactor entity classes to feature PK, FK, and UQ constraints

    [ https://issues.apache.org/jira/browse/OOZIE-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603046#comment-16603046 ] 

Andras Piros commented on OOZIE-3336:
-------------------------------------

[~pbacsko] [~asasvari] [~gezapeti] please feel free to comment / modify.

> [persistence] Refactor entity classes to feature PK, FK, and UQ constraints
> ---------------------------------------------------------------------------
>
>                 Key: OOZIE-3336
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3336
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 5.0.0
>            Reporter: Andras Piros
>            Priority: Major
>             Fix For: 5.2.0
>
>
> When an Oozie database grows substantially, say beyond a few hundred thousand {{WorkflowActionBean}} or {{CoordinatorActionBean}} instances, we face a couple of performance issues. Here is an analysis of why.
> The current Oozie JPA {{@Entity}} usage, and the resulting database DDL, suffer from a couple of drawbacks from a performance point of view:
> * {{@Id}} fields are {{String}}:
> ** leaving no space for database primary key indices to work effectively
> ** those values are calculated in case of {{WorkflowActionBean}}, {{CoordinatorActionBean}}, and {{BundleActionBean}} instances
> * no foreign key constraint is set from {{WorkflowActionBean}} to {{WorkflowJobBean}}, from {{CoordinatorActionBean}} to {{CoordinatorJobBean}}, or from {{BundleActionBean}} to {{BundleJobBean}} instances:
> ** we have to assess by hand the JPA queries that discover parent-child relationships
> ** no database indices are created, and hence, queries that contain any {{JOIN}} clauses are slower
> * no use of unique constraints whatsoever
> * JPA queries are created by hand instead of relying on OpenJPA
> * JPA entities are filled by hand instead of relying on OpenJPA
> The following enhancements are necessary:
> # while keeping the existing {{String compositeId}} fields, let's break their contents down into the following new fields:
> ## {{@Id long id}} - an auto-increment value that is unique across the Oozie database
> ## {{long currentSequence}} - the sequence number of the current run since the last Oozie server restart. The first part of the {{compositeId}}
> ## {{Timestamp serverStartupTimestamp}} - the timestamp when the Oozie server was last started. The second part of the {{compositeId}}
> ## {{String serverName}} - the third part of the {{compositeId}}
> ## {{String name}} - the fourth and last part of the {{compositeId}}
> ## {{compositeId}} might be calculated when an entity is loaded / persisted, and then stored
> # FK constraints:
> ## {{@OneToMany}} fields where we have a list of child references inside parent
> ## {{@ManyToOne}} fields where we have a parent reference inside child
> ## pay attention to {{FetchType}}; most of the time {{LAZY}} will be needed
> ## the containment fields should not be {{@Transient}} anymore
> # UQ constraints:
> ## on {{currentSequence}} and {{serverStartupTimestamp}}
> ## on {{currentSequence}} and {{name}}
> # new JPQL queries:
> ## to cover changed parent-child relationships
> ## to make use of each disassembled part of {{originalId}} when doing e.g. filtering
> # let JPA fill entities instead of performing this by hand
> The following enhancements can be considered nice-to-have:
> * upgrade to an OpenJPA version that features JPA 2.1's composite indexing capability
> * evaluate whether an optimistic locking field using {{@Version}}, instead of ZooKeeper-based pessimistic locking, would improve High Availability characteristics
> * also refactor SLA-related entity classes
> It's necessary to have performance benchmarks with some database types, like MySQL/MariaDB and PostgreSQL, before and after the changes for the following use cases:
> * {{CoordinatorJobBean}} and {{WorkflowJobBean}} instances up to millions
> * {{CoordinatorActionBean}} and {{WorkflowActionBean}} instances up to tens of millions
> * performance for JPQLs that get a list of entities
> * performance of persisting a new entity
> * performance of querying lists of entities based on popular / possible filters like the ones used by {{VxJobsServlet}}
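
The proposed breakdown of {{compositeId}} into separate fields, with the string form recomputed when an entity is loaded or persisted, could be sketched in plain Java roughly as below. This is only an illustration under stated assumptions: the class {{CompositeIdParts}}, the {{-}} delimiter, the seven-digit zero padding, and keeping the timestamp as raw text (the issue proposes a {{Timestamp}}) are all hypothetical choices, not taken from the issue or from existing Oozie code.

```java
import java.util.Locale;
import java.util.Objects;

/**
 * Hypothetical value class. Field names follow the issue text; the '-' delimiter
 * and the zero-padded layout are assumptions made for illustration only.
 */
final class CompositeIdParts {
    final long currentSequence;          // first part of the composite ID
    final String serverStartupTimestamp; // second part, kept as raw text (format not specified here)
    final String serverName;             // third part, may itself contain '-'
    final String name;                   // fourth and last part

    CompositeIdParts(long currentSequence, String serverStartupTimestamp,
                     String serverName, String name) {
        this.currentSequence = currentSequence;
        this.serverStartupTimestamp = Objects.requireNonNull(serverStartupTimestamp);
        this.serverName = Objects.requireNonNull(serverName);
        this.name = Objects.requireNonNull(name);
    }

    /** Splits an ID of the assumed form sequence-timestamp-serverName-name. */
    static CompositeIdParts parse(String compositeId) {
        int first = compositeId.indexOf('-');
        int second = compositeId.indexOf('-', first + 1);
        int last = compositeId.lastIndexOf('-');
        // At least three delimiters are needed to obtain four parts.
        if (first < 0 || second < 0 || last <= second) {
            throw new IllegalArgumentException("Unexpected composite ID: " + compositeId);
        }
        return new CompositeIdParts(
                Long.parseLong(compositeId.substring(0, first)),
                compositeId.substring(first + 1, second),
                compositeId.substring(second + 1, last),
                compositeId.substring(last + 1));
    }

    /** Recomputes the string form, e.g. when an entity is loaded or persisted. */
    String toCompositeId() {
        return String.format(Locale.ROOT, "%07d-%s-%s-%s",
                currentSequence, serverStartupTimestamp, serverName, name);
    }
}
```

With the parts stored as separate columns, filters such as "all entities of a given {{serverName}}" would no longer need string matching on the composite value, and the UQ constraints listed in the issue could be declared on the individual columns.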



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)