Posted to dev@tika.apache.org by "Lakatos Gyula (Jira)" <ji...@apache.org> on 2022/08/18 10:16:00 UTC

[jira] [Updated] (TIKA-3839) Property com.ctc.wstx.maxEntityCount is not supported

     [ https://issues.apache.org/jira/browse/TIKA-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lakatos Gyula updated TIKA-3839:
--------------------------------
    Description: 
First of all, this might not even be a bug, just a slight annoyance.

Whenever I try to parse the attached PDF, I get the following error:

 
{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
    at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
    at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
    at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
    at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
    at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170) {code}
 

After a couple of hours of Googling, I realized that there is a StAX XML parser implementation called Woodstox. If I put that dependency on the classpath, the exception goes away, because Woodstox understands the _com.ctc.wstx.maxEntityCount_ property.

As far as I can see, version 1.28.4 of tika-parsers included this library as a compile-time dependency, but 2.0.0+ doesn't. I'm not sure why this is the case, but there must be a good reason for it.
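
To illustrate the behaviour, here is a minimal sketch (not Tika's actual code) that asks whichever StAX factory is on the classpath whether it knows the property before trying to set it. The class name, the helper method and the limit value are my own assumptions for illustration; only the property name comes from the stack trace above.

```java
import javax.xml.stream.XMLInputFactory;

public class StaxPropertyCheck {

    // Property name taken from the exception message above.
    static final String WSTX_MAX_ENTITY_COUNT = "com.ctc.wstx.maxEntityCount";

    /**
     * Hypothetical helper: returns true only if the factory accepted the
     * Woodstox-specific limit, false if the implementation does not know it.
     */
    static boolean trySetMaxEntityCount(XMLInputFactory factory, int limit) {
        // Ask first instead of relying on the IllegalArgumentException.
        if (!factory.isPropertySupported(WSTX_MAX_ENTITY_COUNT)) {
            return false;
        }
        try {
            factory.setProperty(WSTX_MAX_ENTITY_COUNT, limit);
            return true;
        } catch (IllegalArgumentException e) {
            // Known property name, but the value was rejected.
            return false;
        }
    }

    public static void main(String[] args) {
        // newInstance() returns Woodstox if it is on the classpath,
        // otherwise the JDK's built-in implementation.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        boolean applied = trySetMaxEntityCount(factory, 100_000); // limit is an arbitrary example value
        System.out.println(factory.getClass().getName() + " accepted the property: " + applied);
    }
}
```

With only the JDK on the classpath I would expect the helper to return false; with Woodstox present it should be able to apply the limit.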

However, I think it would be a good idea to change the warning message from:
{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes] {code}
to something more meaningful.

Something that mentions Woodstox would be good (especially if the only property that Tika tries to set is Woodstox-specific). Also, printing the message every 5 minutes is pointless in my opinion: if Woodstox is not on the classpath, the call will fail every time anyway. The suppression is also not synchronized, so if you are parsing a lot of documents in parallel, the message can still be printed more than once.
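
As a rough idea of what I mean by a synchronized suppression, here is a small sketch using a compare-and-set on the last log time, so that only one thread per interval gets to print. This is not how XMLReaderUtils is written today; the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ThrottledWarning {

    private final long intervalMillis;

    // Long.MIN_VALUE marks "nothing has been logged yet".
    private final AtomicLong lastLogged = new AtomicLong(Long.MIN_VALUE);

    public ThrottledWarning(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Returns true for exactly one caller per interval, even when many
     * threads call it at the same time.
     */
    public boolean shouldLog(long nowMillis) {
        while (true) {
            long last = lastLogged.get();
            if (last != Long.MIN_VALUE && nowMillis - last < intervalMillis) {
                return false; // still inside the suppression window
            }
            // Only the thread that wins the compare-and-set may log.
            if (lastLogged.compareAndSet(last, nowMillis)) {
                return true;
            }
            // Another thread updated the timestamp first; re-check.
        }
    }
}
```

A caller would wrap the warning in something like `if (warning.shouldLog(System.currentTimeMillis())) { ... }`. If the message should only ever appear once per JVM, an `AtomicBoolean` with `compareAndSet(false, true)` would be even simpler.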

  was:
First of all, this might not even be a bug, just a slight annoyance.

Whenever I try to parse the attached PDF, I get the following error:

 
{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
    at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
    at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
    at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
    at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
    at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170) {code}
 

After a couple of hours of Googling, I realized that there is an XML parser implementation called woodstox. If I include that dependency on the classpath, this exception is no longer present, because it understands the _com.ctc.wstx.maxEntityCount_ property.

As far as I see, the 1.28.4 version of tika-parsers included this library as a compile-time dependency, however, 2.0.0+ doesn't. I'm not sure why this is the case, but there must be a good reason for it.

However, I think it would be a good idea to change the exception's message from:
{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes] {code}
to something more meaningful.

Something that mentions woodstox would be good (especially if the only property that Tika tries to set is woodstox specific). Also, spamming/printing the message every 5 minutes is pointless in my opinion. If woodstox is not on the classpath, it will fail anyways.


> Property com.ctc.wstx.maxEntityCount is not supported
> -----------------------------------------------------
>
>                 Key: TIKA-3839
>                 URL: https://issues.apache.org/jira/browse/TIKA-3839
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Lakatos Gyula
>            Priority: Minor
>         Attachments: 8a4b2154-b6c1-4e0e-b8be-8ce4e68c454f.pdf


