Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2018/06/12 11:08:31 UTC

Why is there an arbitrary limit that Daffodil imposes so that arrays can't be bigger than 1024 elements?

Hi Folks,

I am creating a DFDL schema for Shape files. I ran my DFDL schema on a Shape file and the parse crashed. I discovered that the Shape file has a polygon with 1,371 points (so I need the <Point> element repeated 1,371 times) but Daffodil imposes a limit of 1,024 repetitions. I learned how to increase that limit:

daffodil parse -TmaxOccursBounds=2048 ...

I did that and it took care of the error I was getting.

I ran my DFDL schema on another Shape file and the parse crashed. Upon investigation I found the Shape file has a polygon with 3,087 points. So I increased the limit again:

daffodil parse -TmaxOccursBounds=4096 ...

I did that and it took care of the error I was getting.

Now I begin to wonder - why? Why does Daffodil impose a limit? I think there should be no limit. Is there a reason that it can't be unlimited?

/Roger

RE: Why is there an arbitrary limit that Daffodil imposes so that arrays can't be bigger than 1024 elements?

Posted by "Costello, Roger L." <co...@mitre.org>.
Hi Mike,

Thus far I have only been able to parse about ¼ of the 744 MB Shape file. The portion that I successfully parsed produced an XML file that is 1.2 GB in size. Presumably, if I could parse the entire Shape file, the resulting XML would be somewhere in the 5 GB range. Interestingly, I wanted to count the number of <variable-length-record> elements (there is one such element per shape) in the resulting XML file. I wrote an XSLT program to do the counting, but Saxon generated an out-of-memory error. So I used a streaming XML approach to do the count - there are 149,108 variable-length records (shapes). Extrapolating, the entire Shape file contains roughly 600,000 shapes.
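
For reference, the streaming count was along these lines (a minimal StAX sketch only, not my exact program; the variable-length-record element name comes from my schema, everything else is illustrative):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Counts <variable-length-record> elements without ever building a tree,
// so memory use stays flat even for a multi-gigabyte XML file.
public class CountRecords {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            long count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "variable-length-record".equals(reader.getLocalName())) {
                    count++;
                }
            }
            System.out.println("variable-length-record count: " + count);
        }
    }
}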

I don't know if this Shape file is typical. It is a Shape file for a coastal region. I can imagine such files would often be huge.

/Roger


Re: Why is there an arbitrary limit that Daffodil imposes so that arrays can't be bigger than 1024 elements?

Posted by Mike Beckerle <mb...@tresys.com>.
So, a shape file that big may not be possible to parse right now.


If you just think about the enlargement from expanding the data out of the shapefile representation, which is fairly dense, into something more like an XML DOM tree: every field in the data becomes a Java object, or several, each of which carries many bytes of overhead. A 744 MB file might turn into 7 GB of storage, and that assumes only a 10-to-1 expansion, which, honestly, might not be a big enough factor. It could be 20 to 1, though I doubt it is 100 to 1.
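
To put rough numbers on that (illustrative arithmetic only, using the 744 MB figure and the expansion factors above):

// Back-of-envelope estimate of in-memory size at various expansion factors.
public class ExpansionEstimate {
    public static void main(String[] args) {
        long fileBytes = 744L * 1024 * 1024;               // the 744 MB shapefile
        for (int factor : new int[] { 10, 20, 100 }) {
            double gib = fileBytes * (double) factor / (1024.0 * 1024 * 1024);
            System.out.printf("%3d-to-1 expansion -> ~%.1f GiB of heap%n", factor, gib);
        }
    }
}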


Question: are the files typically like this? Or are these somewhat extreme examples?


This may be a case where true streaming (e.g., XML SAX-style) parsing is needed. This is on our roadmap (https://issues.apache.org/jira/browse/DAFFODIL-934), but it has not yet been made a high priority.
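
To sketch the consumption model (this is generic Java SAX, not a Daffodil API; the Point element name is just borrowed from the schema discussed in this thread):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Event-style consumption: each <Point> is handled as it arrives and then
// discarded, so no multi-gigabyte infoset or DOM is ever held in memory.
public class PointStreamer extends DefaultHandler {
    private long points = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Point".equals(qName)) {
            points++;   // process the point here, then let it go
        }
    }

    public static void main(String[] args) throws Exception {
        PointStreamer handler = new PointStreamer();
        SAXParserFactory.newInstance().newSAXParser().parse(new File(args[0]), handler);
        System.out.println("points seen: " + handler.points);
    }
}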


In the interim... you *could* just get a *lot* more RAM; e.g., my laptop has 64 GB.


Alternatively... do you know any Scala/Java developers who might want to add features to Daffodil? We can certainly help someone get up the learning curve, learn enough Scala, and provide and/or review a design.


...mike beckerle

Tresys






RE: Why is there an arbitrary limit that Daffodil imposes so that arrays can't be bigger than 1024 elements?

Posted by "Costello, Roger L." <co...@mitre.org>.
Thanks for the explanation, Mike!


  *   what if we just increased all these initial limits to 1M instead of 1K ?

1M is not sufficient. I just encountered a Shape file containing a polygon with 4,900,469 points.

I used the maxOccursBounds flag to increase the repetition limit to 5 million:

daffodil parse -TmaxOccursBounds=5000000 ...

But that resulted in a "GC overhead limit exceeded" error.

So I disabled that check with -XX:-UseGCOverheadLimit:

set JOPTS=-Xms4096M -Xmx4096M -XX:ReservedCodeCacheSize=512M -XX:-UseGCOverheadLimit

But now I'm getting a "java.lang.OutOfMemoryError: Java heap space" error. My next attempt will be to further increase -Xmx (the maximum heap size).

Any suggestions you might have would be appreciated. My Shape file is large ... 744 MB.

/Roger


Re: Why is there an arbitrary limit that Daffodil imposes so that arrays can't be bigger than 1024 elements?

Posted by Mike Beckerle <mb...@tresys.com>.
We certainly can enlarge these initial settings, as they do seem awfully small.


And we can probably add an "unlimited" setting, but the point of this limited behavior was, in general, to avoid the kinds of problems that come up with "unlimited" - as in, "did you really mean 8 trillion is ok?"


E.g., with a regex containing ".*", did you really mean "*" as in any number, as in trillions? Or did you mean "pretty big by human standards, like maybe a million"?
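
A tiny Java illustration of the difference (hypothetical pattern, purely to make the point):

import java.util.regex.Pattern;

public class BoundedVsUnbounded {
    public static void main(String[] args) {
        // Unbounded: "*" will happily accept arbitrarily long input.
        Pattern unbounded = Pattern.compile("name:.*;");

        // Bounded: caps the field at a "pretty big by human standards" size,
        // analogous in spirit to Daffodil's maxOccursBounds tunable.
        Pattern bounded = Pattern.compile("name:.{0,1048576};");

        String sample = "name:polygon;";
        System.out.println(unbounded.matcher(sample).matches()); // true
        System.out.println(bounded.matcher(sample).matches());   // true
    }
}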


In DFDL, due to backtracking, if there is an error in the data, it is possible for the parser to waste a lot of time thrashing around, hopelessly trying to parse the data. Reasonable limits that make it fail faster are helpful in these cases. Commercial data integration products have various tunable limits of this sort as well.


There is various guidance on using regexes in XSD, for example, that frowns upon the unbounded * and + quantifiers for these same reasons.


So, all that said... what if we just increased all these initial limits to 1M instead of 1K?


I'm open to all suggestions for how to improve here. Just wanted to explain current rationales.


...mike beckerle

Tresys
