You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by GitBox <gi...@apache.org> on 2021/11/04 12:39:38 UTC

[GitHub] [daffodil] jadams-tresys opened a new pull request #674: Work around issue resetting predefined vars

jadams-tresys opened a new pull request #674:
URL: https://github.com/apache/daffodil/pull/674


   This is a quick fix for this bug where a predefined variable is being reset beyond its original definition.  This was occurring in the EDIFACT and NACHA schemas.
   
   Definitely feels a little hacky and I was trying to add a test case to reproduce the issue by having a choice branch backtrack after reading a predefined variable but was unsuccessful.  I'm not sure what is triggering this double reset in the EDIFACT/NACHA schemas, so this "fix" may be more of a bandaid instead of fixing a deeper issue.  Since this bug is holding up the release of 3.2.0, I figured it would be worth creating this pull request and open up the forum for discussion on how to fix this issue.
   
   DAFFODIL-2580


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence closed pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
stevedlawrence closed pull request #674:
URL: https://github.com/apache/daffodil/pull/674


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #674:
URL: https://github.com/apache/daffodil/pull/674#issuecomment-967322823


   I dug into this issue a bit and think I've track down the cause, still not sure of the best fix. Here's a pretty minimal schema that reproduces the issue:
   
   ```xml
   <xs:schema
     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
     xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
     <xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />
   
     <xs:annotation>
       <xs:appinfo source="http://www.ogf.org/dfdl/">
         <dfdl:defineVariable defaultValue="x" name="var1" type="xs:string"/> 
         <dfdl:format ref="GeneralFormat" />
       </xs:appinfo>
     </xs:annotation>
     
     <xs:element name="root">
       <xs:complexType>
         <xs:sequence>
           <xs:element name="opt" minOccurs="0">
             <xs:complexType>
               <xs:sequence>
                 <xs:sequence>
                   <xs:annotation>
                     <xs:appinfo source="http://www.ogf.org/dfdl/">
                       <dfdl:assert test="{ $var1 eq 'x' }" />
                     </xs:appinfo>
                   </xs:annotation>
                 </xs:sequence>
                 <xs:sequence>
                   <xs:annotation>
                     <xs:appinfo source="http://www.ogf.org/dfdl/">
                       <dfdl:assert test="{ $var1 eq 'y' }" />
                     </xs:appinfo>
                   </xs:annotation>
                 </xs:sequence>
               </xs:sequence>
             </xs:complexType>
           </xs:element>
         </xs:sequence>
       </xs:complexType>
     </xs:element>
     
   </xs:schema>
   ```
   So we define a variable with a default value, in this case `x`. We have an optional element which creates a point of uncertainty. Within this PoU we read the variable twice, once in the first assert (which passes) and again in the second assert (which fails). The second assert fails and so we need to back track to the PoU and undo the changes to the variable state.
   
   How we undo this state is broken. For our variables, the variables themselves keep track of the current state and the prior state. When we need to reset back to the PoU, sometimes we use this prior state, but sometimes we just make assumptions about what the prior state should have been. These assumptions aren't always correct. This test finds a case where this isn't correct.
   
   What happens here is we read a variable twice, so we mark it as needing to be reset to the prior states twice to get back to the correct state. But the second read didn't actually change the state. So when we reset once we reset to VariableDefined. And when we reset again we hit this case. Note that in some cases we do actually need to reset twice, so it's not just a matter of ignoring duplicate resets.
   
   I think the core issue here is that variables are trying to keep track of their own state, and only really keep track of their prior state, and then we make assumptions from there. I think the *right* fix is probably to handle variables like we do other state and PoU's and just take a snapshot. When we reset back to the PoU we just restore that snapshot. Unfortunately, variables are a bit more complicated and are mutable, so taking a snapshot may add some complication and overhead.
   
   Maybe another option is instead of PState snapshotting the entire variable map, we instead just snapshot the state of each individual value prior to the change. So it's a sort of copy-on-write behavior. Then when we reset back to the PoU, we just need to reset back to the state of things that were changed. That might be a better approach with less overhead.
   
   Note that both EDIFACT and NACHA hit this double-read issue but in different ways. NACHA reads an `encoding` variable multiple times in a PoU, and EDIFACT reads variables related to escape schemas multiple times in a PoU.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] jadams-tresys commented on a change in pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
jadams-tresys commented on a change in pull request #674:
URL: https://github.com/apache/daffodil/pull/674#discussion_r743018220



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/VariableMap1.scala
##########
@@ -155,6 +155,7 @@ class VariableInstance private (val rd: VariableRuntimeData)
         this.setState(this.priorState)
         this.setValue(this.priorValue)
       }
+      case (VariableDefined, _, _) => // This should only occur for pre-defined variables

Review comment:
       It's possible that the variable is being added twice, not sure.  As I mentioned, I was having trouble recreating the same failure with a test case.
   
   In both of the affected schema projects, they simply reference the dfdl:encoding variable when setting up the dfdl:format that they use for all the files.  It's not being read or changed inside a choice branch, so I'm not sure how this failure is ocurring.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on a change in pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on a change in pull request #674:
URL: https://github.com/apache/daffodil/pull/674#discussion_r743005566



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/VariableMap1.scala
##########
@@ -155,6 +155,7 @@ class VariableInstance private (val rd: VariableRuntimeData)
         this.setState(this.priorState)
         this.setValue(this.priorValue)
       }
+      case (VariableDefined, _, _) => // This should only occur for pre-defined variables

Review comment:
       Agreed, some more information about what's going on would be helpful.
   
   I believe this `reset` function is only called in `PState.reset`, and is is only called for variables that have "changed" and are added to the list at the top of the `changedVariableStack`. But if a variable has changed, then it seems like it should be impossible for the current state to be Defined? By being changed, wouldn't it have needed to change to either Read or Set?
   
   I wonder if maybe the same variable is being added to the top of the `changedVariableStack` list twice? That would cause us to try to reset it twice. When we reset it the first time that could change it to the Defined state, and then trying to reset it again hits this case? If that's the issue, I'm not sure if adding it twice is a bug, or if handling Defined is the right thing? More information is needed about what's going on...
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on a change in pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on a change in pull request #674:
URL: https://github.com/apache/daffodil/pull/674#discussion_r743005566



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/VariableMap1.scala
##########
@@ -155,6 +155,7 @@ class VariableInstance private (val rd: VariableRuntimeData)
         this.setState(this.priorState)
         this.setValue(this.priorValue)
       }
+      case (VariableDefined, _, _) => // This should only occur for pre-defined variables

Review comment:
       Agreed, some more information about what's going on would be helpful.
   
   I believe this `reset` function is only called in `PState.reset`, and is is only called for variables that have "changed" and are added to the list at the top of the `changedVariableStack`. But if a variable has changed, then it seems like it should be impossible for the current state to be Defined? By being changed, wouldn't it have needed to change to either Read or Set?
   
   I wonder if maybe the same variable is being added to the top of the `changedVariableStack` list twice? That would cause us to try to reset it twice. When we reset it the first time that could change it to the Defined state, and then trying to reset it again hits this case? If that's the issue, I'm not sure if adding it twice is a bug, or if handling Defined is the right thing? More information is needed about what's going on...
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] tuxji commented on pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
tuxji commented on pull request #674:
URL: https://github.com/apache/daffodil/pull/674#issuecomment-967395784


   Some questions for clarification:
   
   1.	What is the pitfall of just taking a snapshot of variables’ values like other state and restoring the snapshot when resetting back to the PoU?  A snapshot copies variables’ values at a certain point in time and restoring the snapshot resets variables back to their snapshotted values.  What does a variable’s mutability have to do with that, that is, why would snapshotting variables be more complicated than snapshotting other state?  I do understand the overhead of copying all variables’ values in every snapshot and resetting them (likely back to the same values) when restoring snapshots. 
   2.	Adding copy-on-write behavior to variables sounds like an attempt to optimize away some of the overhead on the assumption that variables change values infrequently.  Do you have any range of numbers (even ballpark guesses) for how many variables might exist during a parse (from zero to 20?) and how many times variables really change values during a parse (from zero to 1000 times)?
   
   My suggestion is to replace the broken code (which makes assumption) with simple code that is obviously correct in behavior while measuring performance before and after.  When you are satisfied the code works correctly, then you can look for ways to use copy-on-write, caches, etc., to speed up performance.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #674:
URL: https://github.com/apache/daffodil/pull/674#issuecomment-969063293


   I did some testing on three different approaches with the PCAP schema, which I think is good representation of fairly complex schema with a decent amount of PoUs and some variables stuff. Note that this version does not use any of the layering stuff, which does rely on variables quite a bit more.
   
   The three approaches were:
   1. Deep copy the variable map every time we enter a PoU
   2. Deep copy the variable map the first time a variable is modified inside a PoU
   3. Copy instance information the first time a variable instance is modified inside  PoU
   
   Each one is a bit more complex than the previous. The first one is simple to implement, but potentially adds a lot of overhead. The second one is a bit more complex, but only has overhead if a variable is modified. But if the variable map grows large, the deep copy might be noticable. The last one is yet again more complex, but minimizes copying if there are lots of variables or new variable instances.
   
   The numbers below are for parsing 20,000 PCAP files. The null infoset outputter was used to avoid the overhead of writing the infoset.
   
   The three stats gathered are total time (smaller is better) and maximum and average rate in files/second (larger is better). The numbers are also compared against the current 3.1.0 release, with percent change in parenthesis.
   
   |                        |3.1.0    |Deep Copy Every PoU|Deep Copy On Write|Copy Instance On Write|
   |:---                    | ---:    | ---:              | ---:             | ---:                 |
   |Total time (seconds)     |39.97    |41.42 (3.62%)      |36.59 (-8.45%)    |36.55 (-8.55%)        |
   |Max rate (files/second) |637.72   |570.15 -10.60%)    |628.87 (-1.39%)   |639.01 (0.20%)        |
   |Avg rage (files/second) |500.40   |483.79 (-3.32%)    |546.65 (9.24%)    |547.21 (9.35%)        |
   
   So deep copy on write and copy instance on write are virtually the same, with maybe a slight advantage to the copy instance on write, but the differences could easily be in the noise of JVM. Deep copying every PoU is definitely a bad idea.
   
   Also, note that I think copy on write's are faster than 3.1.0 because we currently allocate an empty `Map` every time we enter a PoU just in case we have to do variable tracking stuff. In the copy on write changes, it only creates that `Map` if variable change. Creating an Empty map can be expensive, if if it's never used.
   
   So I think this is a clear win for the copy-on-write stuff. Whether or not we add extra complexity to handle per-instance copy on write or entire map copy-on-write doesn't seem to matter, at least not with this format. I think if a schema had lots of variables or lots of nested new variable instances, then maybe the variable map could grow pretty big and make the deep copy expensive, but I'm not sure how likely that is.
   
   I think I'm leaning towards the deep copy on write. It's definitely simpler than the copy instance on write, and seems to have comparable performance. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] tuxji commented on a change in pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
tuxji commented on a change in pull request #674:
URL: https://github.com/apache/daffodil/pull/674#discussion_r742845282



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/VariableMap1.scala
##########
@@ -155,6 +155,7 @@ class VariableInstance private (val rd: VariableRuntimeData)
         this.setState(this.priorState)
         this.setValue(this.priorValue)
       }
+      case (VariableDefined, _, _) => // This should only occur for pre-defined variables

Review comment:
       This change definitely looks too simple to me.  At the very least it should verify some invariant or assumption so that we will be alerted if a deeper problem exists.  I'm not sure exactly what the original problem is or what invariant needs to be checked, but surely there must be something you can put into an `Assert.invariant(...)` call at this line.
   
   Perhaps the invariant should be whether the variable has ever been read?  Or whether it ever has been shadowed by a new variable instance?  Again, I'm not sure what problem is occurring or which part of the DFDL specification applies to this situation (the verb "reset" does not occur anywhere in the specification!), although I looked at [7.7 DFDL Variable Annotations](https://daffodil.apache.org/docs/dfdl/#_Toc62570090) and noted the following statements:
   
   > Variables are defined at the top-level of a schema and have a specific simple type. A distinction is made between the variable as defined, and an instance of the variable where a value can be stored. The dfdl:defineVariable also introduces a single unique global instance of the variable. Additional instances may be allocated in a stack-like fashion using dfdl:newVariableInstance which causes new instances to come into existence upon entry to the scope of a model group, and these instances go away on exit from the same. **[Is this where the function "reset" comes in?]**
   
   > DFDL variables only vary in the sense that different instances of the same variable can have different values. A single instance of a variable only ever takes on a single value. Each variable instance is a single-assignment location for a value[9]. Once a variable instance's value has been read, it can never be assigned again. If it has not yet been assigned, and its default value has not been read, then a variable instance can be assigned once using dfdl:setVariable. **[Can you find some invariant to check from this part of the specification?]**
   
   > The following variables are predefined, and their names are in the DFDL namespace (http://www.ogf.org/dfdl/dfdl-1.0/). These variables are expected to be commonly set externally so are predefined for convenience. **[Can a predefined variable ever take on a different value (I know it has a default value) if it was set externally but did not have its value read before a dfdl:setVariable?]**




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] jadams-tresys commented on a change in pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
jadams-tresys commented on a change in pull request #674:
URL: https://github.com/apache/daffodil/pull/674#discussion_r743018220



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/VariableMap1.scala
##########
@@ -155,6 +155,7 @@ class VariableInstance private (val rd: VariableRuntimeData)
         this.setState(this.priorState)
         this.setValue(this.priorValue)
       }
+      case (VariableDefined, _, _) => // This should only occur for pre-defined variables

Review comment:
       It's possible that the variable is being added twice, not sure.  As I mentioned, I was having trouble recreating the same failure with a test case.
   
   In both of the affected schema projects, they simply reference the dfdl:encoding variable when setting up the dfdl:format that they use for all the files.  It's not being read or changed inside a choice branch, so I'm not sure how this failure is ocurring.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #674:
URL: https://github.com/apache/daffodil/pull/674#issuecomment-967454410


   1. I think its really just about potential overhead. Variables really don't change all that often, and so copying the entire variable state every time we enter a point of uncertainty might have some noticeable overhead. I'm not sure if that's why we have this prior state stuff or not, but I would guess that the reason for being different than the rest of the PoU stuff.
   
   2. Most schemas don't define variables at all, but there are four predefined variables that will always exist: [7.7.1.2 Predefined Variabls](https://daffodil.apache.org/docs/dfdl/#_Toc349042676). So there is a minimum of four variables. Though I imagine max used is still fairly small, probably in the range of a dozen or so. So a fully copy every PoU might not be terrible. That said, we recently added `newVarialbeInstance`, which makes variables much more useful--I think I've already seen in increase in variables since this was added. That said, I'd gust most variables change pretty rarely. They're set once, read a few times, and then that's it. But that's just a guess.
   
   I've actually already implemented the copy-on-write approach, we were kindof already doing something similar, it was just a matter change storeing copies in separately from the variable instances. Though I am interested if there is a performance benefit. It would be much simpler to just copy the whole thing, and it shouldn't be too much effort to change it to the simpler copy-everything-on-PoU and see the difference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #674:
URL: https://github.com/apache/daffodil/pull/674#issuecomment-974080646


   Fixed by PR #682, closing this PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] tuxji commented on a change in pull request #674: Work around issue resetting predefined vars

Posted by GitBox <gi...@apache.org>.
tuxji commented on a change in pull request #674:
URL: https://github.com/apache/daffodil/pull/674#discussion_r742845282



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/VariableMap1.scala
##########
@@ -155,6 +155,7 @@ class VariableInstance private (val rd: VariableRuntimeData)
         this.setState(this.priorState)
         this.setValue(this.priorValue)
       }
+      case (VariableDefined, _, _) => // This should only occur for pre-defined variables

Review comment:
       This change definitely looks too simple to me.  At the very least it should verify some invariant or assumption so that we will be alerted if a deeper problem exists.  I'm not sure exactly what the original problem is or what invariant needs to be checked, but surely there must be something you can put into an `Assert.invariant(...)` call at this line.
   
   Perhaps the invariant should be whether the variable has ever been read?  Or whether it ever has been shadowed by a new variable instance?  Again, I'm not sure what problem is occurring or which part of the DFDL specification applies to this situation (the verb "reset" does not occur anywhere in the specification!), although I looked at [7.7 DFDL Variable Annotations](https://daffodil.apache.org/docs/dfdl/#_Toc62570090) and noted the following statements:
   
   > Variables are defined at the top-level of a schema and have a specific simple type. A distinction is made between the variable as defined, and an instance of the variable where a value can be stored. The dfdl:defineVariable also introduces a single unique global instance of the variable. Additional instances may be allocated in a stack-like fashion using dfdl:newVariableInstance which causes new instances to come into existence upon entry to the scope of a model group, and these instances go away on exit from the same. **[Is there where the function "reset" comes in?]**
   
   > DFDL variables only vary in the sense that different instances of the same variable can have different values. A single instance of a variable only ever takes on a single value. Each variable instance is a single-assignment location for a value[9]. Once a variable instance's value has been read, it can never be assigned again. If it has not yet been assigned, and its default value has not been read, then a variable instance can be assigned once using dfdl:setVariable. **[Can you find some invariant to check from this part of the specification?]**
   
   > The following variables are predefined, and their names are in the DFDL namespace (http://www.ogf.org/dfdl/dfdl-1.0/). These variables are expected to be commonly set externally so are predefined for convenience. **[Can a predefined variable ever take on a different value (I know it has a default value) if it was set externally but did not have its value read before a dfdl:setVariable?]**




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org