Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/09 10:39:45 UTC

[GitHub] [airflow] potiuk commented on a change in pull request #10227: Use Hash of Serialized DAG to determine DAG is changed or not

potiuk commented on a change in pull request #10227:
URL: https://github.com/apache/airflow/pull/10227#discussion_r467566683



##########
File path: airflow/models/serialized_dag.py
##########
@@ -76,6 +82,7 @@ def __init__(self, dag: DAG):
         self.fileloc_hash = DagCode.dag_fileloc_hash(self.fileloc)
         self.data = SerializedDAG.to_dict(dag)
         self.last_updated = timezone.utcnow()
+        self.dag_hash = hashlib.md5(json.dumps(self.data, sort_keys=True).encode("utf-8")).hexdigest()

Review comment:
       I believe it would be better to use a hash of the file itself rather than of the JSON representation. Not only would it be faster (you would not have to parse the file and create a SerializedDAG), it would also be more accurate: some file changes, such as comments or Jinja templates, can leave the Serialized DAG unchanged. Of course the current approach is still OK (we care about the serialized representation in this case), but I think it would be more accurate to refresh the Serialized DAG and its hash every time the file changes. It would also be much easier to reason about that hash and to verify whether it is correct or whether the file has changed since: we just need to check the hash of the file on disk.
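
       For illustration, a minimal sketch of the two approaches being compared here (hypothetical helper names, not code from this PR): hashing the serialized JSON versus hashing the DAG file on disk.

           import hashlib
           import json


           def hash_serialized_dag(serialized_data: dict) -> str:
               # Approach in the PR: hash the canonical JSON of the serialized DAG.
               return hashlib.md5(
                   json.dumps(serialized_data, sort_keys=True).encode("utf-8")
               ).hexdigest()


           def hash_dag_file(fileloc: str) -> str:
               # Suggested alternative: hash the raw bytes of the DAG file on disk,
               # so any change to the file (even a comment) triggers a refresh of
               # the stored Serialized DAG and its hash.
               with open(fileloc, "rb") as f:
                   return hashlib.md5(f.read()).hexdigest()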

##########
File path: airflow/models/serialized_dag.py
##########
@@ -65,6 +66,11 @@ class SerializedDagModel(Base):
     fileloc_hash = Column(BigInteger, nullable=False)
     data = Column(sqlalchemy_jsonfield.JSONField(json=json), nullable=False)
     last_updated = Column(UtcDateTime, nullable=False)
+    # TODO: Make dag_hash not nullable in Airflow 1.10.13 or Airflow 2.0??

Review comment:
       We can make it not nullable and add a default value in the migration ("Hash not calculated yet"). That way it will be refreshed automagically on the next run and we will not have to worry about nullability.
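
       A rough sketch of what such a migration could look like (revision identifiers and the column length are placeholders, not the actual migration from this PR):

           """Add a NOT NULL dag_hash column with a sentinel default (illustrative sketch)."""
           import sqlalchemy as sa
           from alembic import op

           # Placeholder revision identifiers, not the real Airflow migration.
           revision = "xxxxxxxxxxxx"
           down_revision = "yyyyyyyyyyyy"


           def upgrade():
               # NOT NULL plus a server default: existing rows get the sentinel value
               # and are re-hashed automatically the next time the DAG is serialized.
               op.add_column(
                   "serialized_dag",
                   sa.Column(
                       "dag_hash",
                       sa.String(32),
                       nullable=False,
                       server_default="Hash not calculated yet",
                   ),
               )


           def downgrade():
               op.drop_column("serialized_dag", "dag_hash")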

##########
File path: airflow/models/serialized_dag.py
##########
@@ -102,9 +109,11 @@ def write_dag(cls, dag: DAG, min_update_interval: Optional[int] = None, session:
                 return
 
         log.debug("Checking if DAG (%s) changed", dag.dag_id)
-        serialized_dag_from_db: SerializedDagModel = session.query(cls).get(dag.dag_id)
         new_serialized_dag = cls(dag)
-        if serialized_dag_from_db and (serialized_dag_from_db.data == new_serialized_dag.data):
+        serialized_dag_hash_from_db = session.query(
+            cls.dag_hash).filter(cls.dag_id == dag.dag_id).scalar()
+
+        if serialized_dag_hash_from_db and (serialized_dag_hash_from_db == new_serialized_dag.dag_hash):

Review comment:
       If we make the column not nullable with a default, this could be just ==
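
       A minimal illustration of why the extra truthiness guard becomes unnecessary (a hypothetical helper, not code from this PR):

           from typing import Optional


           def dag_is_unchanged(stored_hash: Optional[str], new_hash: str) -> bool:
               # With a NOT NULL dag_hash column, .scalar() only returns None when
               # the row itself is missing; None == new_hash evaluates to False, so
               # one equality check covers both "no row yet" and "hash changed".
               return stored_hash == new_hash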




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org