Posted to dev@any23.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2019/09/22 23:34:00 UTC

[jira] [Commented] (ANY23-443) Improve efficiency of RDFa Extractor

    [ https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935437#comment-16935437 ] 

Hudson commented on ANY23-443:
------------------------------

SUCCESS: Integrated in Jenkins build Any23-trunk #1667 (See [https://builds.apache.org/job/Any23-trunk/1667/])
ANY23-443 improve speed & stability of RDFa extractors (hans: rev 50cfb2fd7f3112e27c44ab5850117bacda22a679)
* (edit) core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/BaseRDFaExtractor.java
* (add) core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
* (add) core/src/main/java/org/apache/any23/extractor/rdfa/SemarglSink.java
ANY23-443 cleanup (hans: rev d9f1fa4036133158b1a91976d9d05d152c02feaa)
* (edit) core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/BaseRDFaExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/SemarglSink.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java


> Improve efficiency of RDFa Extractor
> ------------------------------------
>
>                 Key: ANY23-443
>                 URL: https://issues.apache.org/jira/browse/ANY23-443
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Hans Brende
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Our RDFa Extractor is terribly inefficient.
> 1st, we parse the HTML "tag soup" input stream into a DOM using Jsoup.
> 2nd, we serialize the DOM back into an input stream containing strictly valid XML, to avoid errors in the underlying semargl parser.
> 3rd, the underlying semargl parser re-parses this input stream as XML and hands off XML streaming events to its underlying XmlSink.
> 4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn hands them back to Any23.
> I propose cutting out all these intermediate steps by simply walking the original jsoup DOM and handing our own XML events directly to semargl's XmlSink, which we will configure to give RDF events directly back to Any23. 
> This will also allow us to get rid of most (or possibly all) of the various HTML-to-XML "fixups" we had to implement to prevent extraction failures.
> ----
> *TL;DR:*
>  
> {{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 
> *becomes*
> {{Jsoup → RdfaParser → Any23}} 
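
The proposal above (walk the already-parsed DOM and hand XML events directly to a sink, instead of serializing and re-parsing) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the commits listed above: `Elem` is a hypothetical stand-in for a jsoup element so the example runs on the JDK alone, `RecordingHandler` is a hypothetical stand-in for semargl's XmlSink, and plain SAX `ContentHandler` callbacks are assumed as the event interface. Namespace handling is omitted (empty namespace URIs are passed).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class DomToSaxSketch {

    /** Hypothetical stand-in for a parsed HTML element (a jsoup element in the real extractor). */
    static final class Elem {
        final String name;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Object> children = new ArrayList<>(); // each child is an Elem or a String (text)

        Elem(String name) { this.name = name; }
        Elem attr(String key, String value) { attrs.put(key, value); return this; }
        Elem add(Object child) { children.add(child); return this; }
    }

    /** Walks the tree depth-first and emits SAX events directly; no serialization, no re-parsing. */
    static void emit(Elem root, ContentHandler handler) throws SAXException {
        handler.startDocument();
        emitElement(root, handler);
        handler.endDocument();
    }

    private static void emitElement(Elem e, ContentHandler handler) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        for (Map.Entry<String, String> a : e.attrs.entrySet()) {
            // Assumption: no namespaces, so the local name and qName are the same.
            atts.addAttribute("", a.getKey(), a.getKey(), "CDATA", a.getValue());
        }
        handler.startElement("", e.name, e.name, atts);
        for (Object child : e.children) {
            if (child instanceof Elem) {
                emitElement((Elem) child, handler);
            } else {
                char[] text = child.toString().toCharArray();
                handler.characters(text, 0, text.length);
            }
        }
        handler.endElement("", e.name, e.name);
    }

    /** Hypothetical sink that records events as strings so their order can be inspected. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            StringBuilder sb = new StringBuilder("start:" + qName);
            for (int i = 0; i < atts.getLength(); i++) {
                sb.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
            }
            events.add(sb.toString());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add("text:" + new String(ch, start, length));
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Builds a tiny RDFa-annotated tree and returns the events the walker emits for it. */
    public static List<String> demo() throws SAXException {
        Elem root = new Elem("div").attr("vocab", "http://schema.org/").attr("typeof", "Person")
                .add(new Elem("span").attr("property", "name").add("Alice"));
        RecordingHandler sink = new RecordingHandler();
        emit(root, sink);
        return sink.events;
    }

    public static void main(String[] args) throws SAXException {
        for (String event : demo()) {
            System.out.println(event);
        }
    }
}
```

Because the walker emits matching start and end events by construction, the sink always sees well-formed input, which is presumably why the HTML-to-XML "fixups" mentioned in the issue would become unnecessary.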



--
This message was sent by Atlassian Jira
(v8.3.4#803005)