Posted to commits@daffodil.apache.org by GitBox <gi...@apache.org> on 2021/01/14 15:08:01 UTC

[GitHub] [incubator-daffodil] stevedlawrence opened a new pull request #472: Fix how we throw away buckets during parsing

stevedlawrence opened a new pull request #472:
URL: https://github.com/apache/incubator-daffodil/pull/472


   The BucketingInputSource caches small-ish buckets of data in a "buckets"
   array while parsing. As we determine that Daffodil will no longer need
   data from older buckets (e.g., when points of uncertainty (PoUs) are
   resolved), we set the array indices to null so that Java can garbage
   collect those buckets. This means that although the size of the
   buckets array may grow large for
   very large data, the majority of the elements in that array are null, so
   the actual cached data in memory is quite small.
   
   Sometimes we cannot discard buckets due to unresolved PoUs, so we
   impose a maximum number of buckets that can be cached. Once we reach
   this number of buckets in the buckets array, we simply discard the
   oldest bucket by setting it to null like above. If anything ever tries
   to use data from a discarded bucket, we throw a backtracking error. In
   practice, the maximum bucket limit is equivalent to about 256MB of
   cached data, so it is high enough that nothing reasonably needs to
   backtrack that far.
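   
   To make that design concrete, here is a rough sketch in Scala. This is
   not the actual BucketingInputSource code; the class name, method names,
   and constants are invented for illustration, and real Daffodil raises a
   proper backtracking error rather than a plain exception.
   
   ```scala
   import scala.collection.mutable.ArrayBuffer
   
   // Illustrative sketch of the bucket cache described above, not Daffodil's
   // real implementation. Buckets are cached in a growable array, released
   // entries are set to null so the GC can reclaim them, and reads into a
   // discarded bucket are an error (a backtracking error in Daffodil).
   class BucketCacheSketch(maxCachedBuckets: Int = 32 * 1024) {
   
     private val buckets = ArrayBuffer[Array[Byte]]()
   
     def addBucket(data: Array[Byte]): Unit = {
       buckets += data
       // Pre-fix trigger (see below): based on the total number of buckets
       // ever created, not the number still held in memory.
       if (buckets.length > maxCachedBuckets) discardOldest()
     }
   
     // Called when a point of uncertainty resolves and nothing can backtrack
     // before `index` anymore: drop the references so the JVM can collect them.
     def releaseBucketsBefore(index: Int): Unit = {
       var i = 0
       while (i < index && i < buckets.length) {
         buckets(i) = null
         i += 1
       }
     }
   
     private def discardOldest(): Unit = {
       val oldest = buckets.indexWhere(_ != null)
       if (oldest >= 0) buckets(oldest) = null
     }
   
     def read(bucketIndex: Int): Array[Byte] = {
       val b = buckets(bucketIndex)
       if (b == null)
         throw new IllegalStateException(s"bucket $bucketIndex was discarded")
       b
     }
   }
   ```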
   
   However, we incorrectly calculate when to throw away the oldest bucket.
   We currently do so when the buckets array grows beyond some maximum
   number of elements. But this just means we've parsed some large amount
   of data, not that we have actually cached a large amount of data--this
   doesn't take into account the fact that many of the buckets in the array
   may have been discarded. This can lead to a situation where we think we
   have reached the maximum cache size, but we really haven't, and so we
   start throwing away buckets that we actually need and could reasonably
   backtrack to. And if we do backtrack that small amount, we get an error.
   
   So instead of just throwing away the oldest bucket once the buckets
   array grows to some size, we really only want to do so when the number of
   *non-null buckets* grows to some size, which is what actually implies
   the cached data has grown too large. This patch fixes the calculation to
   do that.
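   
   In terms of the sketch above, the change amounts to roughly the
   following (again illustrative, not the actual patch). In practice the
   implementation would presumably maintain a counter of live buckets
   rather than rescanning the array on every check.
   
   ```scala
   // Before (sketch): keyed off how much data has been parsed in total,
   // because released (null) entries still count toward the array length.
   def shouldDiscardOldestBefore(buckets: IndexedSeq[Array[Byte]], max: Int): Boolean =
     buckets.length > max
   
   // After (sketch): keyed off how much data is actually still cached,
   // i.e. the number of buckets that have not been nulled out.
   def shouldDiscardOldestAfter(buckets: IndexedSeq[Array[Byte]], max: Int): Boolean =
     buckets.count(_ != null) > max
   ```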
   
   DAFFODIL-2455


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-daffodil] stevedlawrence commented on pull request #472: Fix how we throw away buckets during parsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #472:
URL: https://github.com/apache/incubator-daffodil/pull/472#issuecomment-760313432


   > Is this fix relevant only to Daffodil 3.0.0, or is this something affecting code from earlier revisions such as 2.4.0 or 2.6.0 ?
   
   This bug was introduced in 2.5.0, so it's been around for about a year. But without the new streaming capabilities introduced in 3.0.0, I'm not sure how likely it is to actually hit this issue. I suspect most of the time you'll just run out of memory before hitting this bug. I just now tested this same schema + data with 2.7.0 and it looks like it's just stuck in the garbage collector trying to find memory to free. I suspect if I let it run long enough I'd get an OutOfMemoryError.
   
   It might be worth considering a 3.0.1 patch release for this issue, since there isn't really a good workaround. The only workaround I can think of when dealing with files larger than 256MB is to avoid the InputStream constructor when creating an InputSourceDataInputStream, so it just doesn't use the bucketing stuff at all. Instead a user can use the ByteBuffer or Array[Byte] constructors, but that means that all the data must be read into memory, and there's a 2GB limit, so it's not that great of a workaround. And that only works for people using the API. There's no workaround for people using the CLI.
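   
   For anyone using the API, the workaround would look roughly like the
   sketch below. The package path and constructors are from the 3.0-era
   Scala API as I recall them, and `data.bin` is just a placeholder, so
   double-check the details against the version you're actually running.
   
   ```scala
   import java.nio.file.{Files, Paths}
   // Assumed package for the 3.0-era Scala API; verify for your version.
   import org.apache.daffodil.sapi.io.InputSourceDataInputStream
   
   val path = Paths.get("data.bin") // placeholder input file
   
   // Affected path: the InputStream constructor uses the bucketing cache.
   val streaming = new InputSourceDataInputStream(Files.newInputStream(path))
   
   // Workaround: read the whole file into memory and use the Array[Byte]
   // constructor, which skips the bucketing cache entirely. All of the data
   // must fit in memory, and Array[Byte]/ByteBuffer cap it at 2GB.
   val bytes: Array[Byte] = Files.readAllBytes(path)
   val inMemory = new InputSourceDataInputStream(bytes)
   ```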





[GitHub] [incubator-daffodil] mbeckerle commented on pull request #472: Fix how we throw away buckets during parsing

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on pull request #472:
URL: https://github.com/apache/incubator-daffodil/pull/472#issuecomment-760355831


   This also suggests we need a test that caps the JVM memory at way less than the data size, and then parses a file with streaming behavior successfully, to prove that the streaming doesn't get stuck on things like this.
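   
   One way such a test could be wired up, sketched as an sbt fragment
   (assuming a forked test JVM; this is not something in the Daffodil
   build today). The test itself would then stream a generated input many
   times larger than that heap through a schema that exercises the
   bucketing path.
   
   ```scala
   // build.sbt fragment (sketch): fork the test JVM and cap its heap well
   // below the size of the streaming test input, so a parse that buffers too
   // much data fails with OutOfMemoryError instead of quietly succeeding.
   Test / fork := true
   Test / javaOptions ++= Seq("-Xmx64m")
   ```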





[GitHub] [incubator-daffodil] stevedlawrence commented on pull request #472: Fix how we throw away buckets during parsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #472:
URL: https://github.com/apache/incubator-daffodil/pull/472#issuecomment-760374203


   I have done that, and done so with a multi-gig file. So that stuff does work. The problem was that the schema I used for that was just too simple and required zero backtracking. So we were throwing away buckets earlier than we should have, but it didn't matter since we never used those buckets. This particular CSV schema, while it doesn't require much backtracking, does require a little bit of lookahead when scanning for delimiters. I think what happened in this case is we scanned for a delimiter, were overzealous in getting rid of the previous bucket in doing so, and then when the scanner came back to read the data, the bucket containing that data was gone.
   
   What we probably really need is a test that consumes a bunch of data, but has a speculative parse that backtracks just a little less than 256MB. That huge backtrack should be allowed.





[GitHub] [incubator-daffodil] stevedlawrence commented on pull request #472: Fix how we throw away buckets during parsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #472:
URL: https://github.com/apache/incubator-daffodil/pull/472#issuecomment-761033252


   FYI, I was able to create a schema that parses 255MB of data, then backtracks and parses it again. This works as expected. And if I bump it up to 256MB then it fails due to the backtracking error, as expected. So the ability to backtrack up to 256MB does seem to work. Unfortunately, these files are way too large to store in our repo.





[GitHub] [incubator-daffodil] stevedlawrence commented on pull request #472: Fix how we throw away buckets during parsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on pull request #472:
URL: https://github.com/apache/incubator-daffodil/pull/472#issuecomment-760316642


   FYI, the user that ran into this issue says that they are able to split files so they are less than 256MB, which should work around this issue. So they don't consider it a blocker. Still might be worth considering a 3.0.1 release, though maybe less urgent.





[GitHub] [incubator-daffodil] stevedlawrence merged pull request #472: Fix how we throw away buckets during parsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence merged pull request #472:
URL: https://github.com/apache/incubator-daffodil/pull/472


   

