Posted to dev@any23.apache.org by GitBox <gi...@apache.org> on 2021/10/04 20:53:59 UTC

[GitHub] [any23] lewismc opened a new pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

lewismc opened a new pull request #205:
URL: https://github.com/apache/any23/pull/205


   *Context*
   This PR is a WIP.
   The unit test attempts to perform a simple document extraction using the BBC Scotland HTML as input.
   
   *How to debug*
   You can inspect the `TriXExtractor` issues by setting a breakpoint at [org/apache/any23/extractor/SingleDocumentExtraction.java#L543](https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L543) and then evaluating the following expression:
   
   ```
   extractionResult.getIssues().toArray()[1]
   ```
   
   This returns the following issue:
   
   ```
   FATAL: 	'org.eclipse.rdf4j.rio.RDFParseException: The attribute name must be specified in the attribute-list declaration for element "charset". [line 181, column 45]
   	at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:333)
   	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportFatalError(AbstractRDFParser.java:724)
   	at org.eclipse.rdf4j.rio.trix.TriXParser.reportFatalError(TriXParser.java:253)
   	at org.eclipse.rdf4j.rio.trix.TriXParser.fatalError(TriXParser.java:419)
   	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
   	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
   	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
   	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
   	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
   	at org.apache.xerces.impl.XMLDTDScannerImpl.scanAttlistDecl(Unknown Source)
   	at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
   	at org.apa...' 	(-1,-1)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@any23.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [any23] lewismc commented on pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

Posted by GitBox <gi...@apache.org>.
lewismc commented on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-941314571


   We've concluded that the TriXParser itself is strictly limited to the TriX format, which is a structured XML format for RDF. It certainly won't be able to deal with HTML documents, and should not be used to process those directly. So I am going back to determine exactly how and why the TriXParser was activated when processing the HTML file.
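One plausible reading of the stack trace in the pull request description (the failure is raised from `XMLDTDScannerImpl.scanAttlistDecl`) is that an XML parser was handed DTD declarations it could not read as XML, for example the SGML-style declarations used by HTML 4 DTDs. That reading is an assumption, not something confirmed in this thread. The sketch below is not the Any23 or RDF4J code path: it uses only the JDK's bundled SAX parser, and the class name `SgmlDtdRejectedSketch`, its `rejects` helper, and the inline DTDs are hypothetical. It only illustrates that an XML DTD scanner raises a fatal error on an SGML-style element declaration while accepting the XML-style equivalent.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SgmlDtdRejectedSketch {

    // Returns true if the XML parser reports a parse error for the document.
    static boolean rejects(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler); // fatal errors are rethrown
            reader.parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXParseException e) {
            return true;
        } catch (Exception e) {
            // Anything else (I/O, configuration) is unexpected in this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // SGML-style declaration with tag-omission markers ("- -"), invalid in XML.
        String sgmlStyle = "<!DOCTYPE r [ <!ELEMENT r - - (#PCDATA)> ]><r/>";
        // The XML-style equivalent of the same declaration.
        String xmlStyle = "<!DOCTYPE r [ <!ELEMENT r (#PCDATA)> ]><r/>";
        System.out.println("SGML-style rejected: " + rejects(sgmlStyle));
        System.out.println("XML-style rejected:  " + rejects(xmlStyle));
    }
}
```

The point of the sketch is only that such a failure happens inside DTD scanning, before any document content is examined, which is consistent with the frames visible in the trace.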





[GitHub] [any23] sebastian-nagel commented on pull request #205: ANY23-504 XML-based parsers should not load external DTDs by default

Posted by GitBox <gi...@apache.org>.
sebastian-nagel commented on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-947865771


   Thanks, @lewismc and @jeenbroekstra! Nice to hear that it was worth the effort to dig into it, even starting from my relatively vague problem description.





[GitHub] [any23] lewismc commented on pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

Posted by GitBox <gi...@apache.org>.
lewismc commented on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-939082023


   I also registered https://github.com/eclipse/rdf4j/issues/3347





[GitHub] [any23] lewismc edited a comment on pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

Posted by GitBox <gi...@apache.org>.
lewismc edited a comment on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-946876687


   We currently provide unit tests for only a few of the overloaded [Any23 constructors](https://github.com/apache/any23/blob/master/core/src/test/java/org/apache/any23/Any23Test.java#L473-L519). The constructor used by the Nutch client is
   ```
   Any23 any23 = new Any23(String... extractorNames);
   ```
   I will therefore augment this PR with adequate test coverage for that constructor.





[GitHub] [any23] lewismc edited a comment on pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

Posted by GitBox <gi...@apache.org>.
lewismc edited a comment on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-946876687


   We currently provide unit tests for only a few of the overloaded [Any23 constructors](https://github.com/apache/any23/blob/master/core/src/test/java/org/apache/any23/Any23Test.java#L473-L519). The constructor used by the [Nutch client](https://github.com/apache/nutch/blob/master/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java#L99) is
   ```
   Any23 any23 = new Any23(String... extractorNames);
   ```
   I will therefore augment this PR with adequate test coverage for that constructor.





[GitHub] [any23] lewismc commented on pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

Posted by GitBox <gi...@apache.org>.
lewismc commented on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-946876687


   We do not provide unit tests for the majority of the overloaded constructors.





[GitHub] [any23] lewismc merged pull request #205: ANY23-504 XML-based parsers should not load external DTDs by default

Posted by GitBox <gi...@apache.org>.
lewismc merged pull request #205:
URL: https://github.com/apache/any23/pull/205


   








[GitHub] [any23] lewismc edited a comment on pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

Posted by GitBox <gi...@apache.org>.
lewismc edited a comment on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-946876687


   We currently provide unit tests for only a few of the overloaded [Any23 constructors](https://github.com/apache/any23/blob/master/core/src/test/java/org/apache/any23/Any23Test.java#L473-L519). The constructor used by the [Nutch client](https://github.com/apache/nutch/blob/master/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java#L99) is
   ```
   Any23 any23 = new Any23(String... extractorNames);
   ```
   I will therefore augment this PR with adequate test coverage for that constructor.
   
   @sebastian-nagel FYI





[GitHub] [any23] lewismc commented on pull request #205: ANY23-504 XML-based parsers should not load external DTDs by default

Posted by GitBox <gi...@apache.org>.
lewismc commented on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-947831492


   @sebastian-nagel so it looks like we [got to the bottom of it](https://github.com/eclipse/rdf4j/issues/3347#issuecomment-947414103). For clarity, 
   
   > The TriXParser's underlying SAX2 parser (usually Xerces) should be configured, by default, to not read remote DTDs. This behavior can be overridden from the RDF4J side by tweaking the XMLParserSettings.LOAD_EXTERNAL_DTD option, or by setting the system property http://apache.org/xml/features/nonvalidating/load-external-dtd to true.
   > However, I've just done a quick unit test at my end and it appears there is a regression in the default settings.
   > Long story short: you've discovered a bug in the TriXParser, thanks! And sorry it took so long for me to cotton on.
   > The short-term workaround in the Any23 code is to explicitly disable loading of external DTDs on the TriXParser:
   ```parser.getParserConfig().set(XMLParserSettings.LOAD_EXTERNAL_DTD, false);```
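For readers unfamiliar with the underlying switch, here is a minimal sketch of the same idea one level down, using only the JDK's bundled SAX parser rather than RDF4J. It assumes the JDK parser honours the Xerces feature URI quoted above when it is set as a SAX feature. The class name `NoExternalDtdSketch`, its helper methods, and the `example.invalid` DTD URL are hypothetical and are not part of Any23 or RDF4J.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NoExternalDtdSketch {

    // Xerces feature URI quoted in the comment above.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Builds a non-validating SAX reader that skips the external DTD subset.
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newSAXParser().getXMLReader();
    }

    // Returns true if the document parses without any error.
    static boolean parses(String xml) {
        try {
            XMLReader reader = newReader();
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler); // fatal errors are rethrown
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The DOCTYPE points at a host that can never resolve (.invalid TLD),
        // so parsing can only succeed if the external DTD is not fetched.
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE doc SYSTEM \"http://example.invalid/doc.dtd\">\n"
                + "<doc/>";
        System.out.println("parsed without fetching DTD: " + parses(xml));
    }
}
```

In Any23 itself the setting is applied through RDF4J's parser configuration, as in the one-line workaround quoted above; the sketch only illustrates what that option is expected to control in the underlying XML parser.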


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@any23.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [any23] lewismc commented on pull request #205: ANY23-504 Optionally disable remote HTTP connections when resolving XML entities

Posted by GitBox <gi...@apache.org>.
lewismc commented on pull request #205:
URL: https://github.com/apache/any23/pull/205#issuecomment-942567220


   OK, I updated this PR to test specifically whether the [SingleDocumentExtraction#filterExtractorsByMIMEType](https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L467-L486) method works as expected given the test input file.

