Posted to dev@any23.apache.org by "Lev Khomich (JIRA)" <ji...@apache.org> on 2014/03/03 13:31:21 UTC

[jira] [Comment Edited] (ANY23-137) RDFa parser implementation proposal

    [ https://issues.apache.org/jira/browse/ANY23-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918011#comment-13918011 ] 

Lev Khomich edited comment on ANY23-137 at 3/3/14 12:29 PM:
------------------------------------------------------------

Thanks, Stephane!

I completely missed that RDFa was used as part of the extraction process in other tests.
I've added the related fixes.

Brief description:

*ServletTest*
The old RDFa implementation produces
{{<issue level="Warning" row="14" col="5">Error while processing node /HTML(1)/HEAD(1)/META(9) : 'Cannot map prefix 'fb''</issue>}}
although {{<fb:app_id>}} is a completely valid predicate that shouldn't be resolved against the fb: prefix.
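To illustrate the point about {{<fb:app_id>}}: a minimal Java sketch of a simplified "CURIE or IRI" lookup, assuming the behaviour described above (expand the value when its prefix is mapped, otherwise keep it unchanged if it already parses as an absolute IRI). The class {{PredicateResolver}} and its {{resolve}} method are hypothetical names used only for this illustration; they are not part of Any23 or semargl, and a real RDFa processor handles more cases (terms, default vocabulary, safe CURIEs).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Map;

/** Hypothetical helper for illustration only; not Any23 or semargl code. */
final class PredicateResolver {

    /**
     * Expands "prefix:reference" when the prefix is mapped. Otherwise returns
     * the value unchanged if it is an absolute IRI, or null if it is neither.
     */
    static String resolve(String value, Map<String, String> prefixes) {
        int colon = value.indexOf(':');
        if (colon > 0) {
            String namespace = prefixes.get(value.substring(0, colon));
            if (namespace != null) {
                return namespace + value.substring(colon + 1);
            }
        }
        try {
            // "fb:app_id" has a scheme ("fb"), so it counts as an absolute IRI.
            return new URI(value).isAbsolute() ? value : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The "dc" mapping is an assumed example; "fb" is deliberately unmapped.
        Map<String, String> prefixes = Map.of("dc", "http://purl.org/dc/terms/");
        System.out.println(resolve("dc:title", prefixes));
        System.out.println(resolve("fb:app_id", prefixes));
    }
}
```

With no mapping for fb, the value is kept as the predicate {{<fb:app_id>}} rather than being reported as an unmappable prefix.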

*Any23Test*
*RoverTest*
Changed RDFXMLWriter to NTriplesWriter in some tests to improve precision (they basically check the line count).
Changed the expected triple counts. They were reduced in most cases because the old RDFa parser produced a lot of invalid triples such as:

{quote}
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/ambiente/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/salute/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/legalita/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://www.ansamed.info/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/web/notizie/regioni/lazio/provinciadiroma/> .
{quote}
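On the writer change above: N-Triples puts each statement on its own line, so a line count maps directly to a triple count, which is not generally true of RDF/XML output. A minimal Java sketch of such a count follows; the class {{NTriplesLineCount}} and the method {{countStatements}} are hypothetical names for illustration and are not taken from the Any23 test suite.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not Any23 test code. */
final class NTriplesLineCount {

    /** Counts statements by counting lines that are neither blank nor comments. */
    static long countStatements(String nTriples) {
        return Arrays.stream(nTriples.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .count();
    }

    public static void main(String[] args) {
        // Two of the invalid triples quoted above, plus a comment and a blank line.
        String output = String.join("\n",
                "# sample output",
                "<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/ambiente/> .",
                "",
                "<http://host.com/service> <http://host.com/serviceexternal> <http://www.ansamed.info/> .");
        System.out.println(countStatements(output));
    }
}
```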

Fixed the markup in {{test-resources/src/test/resources/html/rdfa/ansa_2010-02-26_12645863.html}} so that it conforms to the XHTML 1.0 Strict it declares.
Fixed the RDFa markup in {{test-resources/src/test/resources/html/encoding-test.html}}; otherwise it shouldn't produce any triples.
Disabled the second part of {{Any23Test.testExtractionParameters}}. Should it do anything after the RDFa parser replacement?

Also, the ExtractionException thrown from BaseRDFExtractor is escalated in the test suite, which leads to some failed tests in Any23Test. What's the correct behaviour for the ANY23 parser when it gets a SAXException?




> RDFa parser implementation proposal
> -----------------------------------
>
>                 Key: ANY23-137
>                 URL: https://issues.apache.org/jira/browse/ANY23-137
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.8.0
>            Reporter: Lev Khomich
>            Assignee: Peter Ansell
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: oQYfomKX.part, rdfa-extractor-proposal.patch
>
>
> As a follow-up to the discussion in [1],
> I've implemented another RDFa extractor for Any23 (0.7.1).
> The proposed code depends on the semargl project [2]. It isn't published in Maven
> Central, therefore I didn't change any poms.
> I'm still not quite sure about the class name (because the related ones are already taken),
> so feel free to rename it. See the attachments for a patch with the extractor and tests.
> [1] http://mail-archives.apache.org/mod_mbox/any23-dev/201212.mbox/browser
> [2] http://semarglproject.org


