Posted to dev@any23.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/08/25 22:48:51 UTC

[jira] [Resolved] (ANY23-115) Empty spans seem to break ANY23

     [ https://issues.apache.org/jira/browse/ANY23-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved ANY23-115.
----------------------------------------

    Resolution: Fixed

I made another commit for this that relaxes the checking of ItemProp values (objects). Instead of throwing an IllegalArgumentException, which effectively fails the entire parsing task, we now set the object value (content) to null. This lets the parsing task complete successfully and extract all relevant data.

commit 5195ebaa806d108791bb7ce449644ed93b62e882
Author: lewismc <le...@gmail.com>
Date:   Sun Aug 25 13:45:07 2013 -0700

    ANY23-115 Empty spans seem to break ANY23
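
For readers who want to see the idea in isolation, here is a minimal sketch of the relaxed check described above. It is not the actual commit: EmptySpanSketch, PropValue, strict(), lenient() and parseAll() are hypothetical names made up for illustration, and treating whitespace-only content as empty is an assumption on my part rather than something confirmed by the Any23 source.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptySpanSketch {

    /** Hypothetical stand-in for a property value holder; NOT the real ItemPropValue. */
    public static final class PropValue {
        private final String content;

        private PropValue(String content) {
            this.content = content;
        }

        /** Strict behaviour: blank content aborts the caller with an exception. */
        public static PropValue strict(String raw) {
            if (isBlank(raw)) {
                throw new IllegalArgumentException("Invalid content '" + raw + "'");
            }
            return new PropValue(raw);
        }

        /** Relaxed behaviour: blank content is stored as null and parsing continues. */
        public static PropValue lenient(String raw) {
            return new PropValue(isBlank(raw) ? null : raw);
        }

        // Assumption: whitespace-only content counts as empty.
        private static boolean isBlank(String raw) {
            return raw == null || raw.trim().isEmpty();
        }

        public String getContent() {
            return content;
        }
    }

    /** Builds one value per raw string using the relaxed check, so no input is lost. */
    public static List<PropValue> parseAll(String... rawValues) {
        List<PropValue> values = new ArrayList<>();
        for (String raw : rawValues) {
            values.add(PropValue.lenient(raw));
        }
        return values;
    }

    public static void main(String[] args) {
        // Mirrors the reported markup: an empty honorificPrefix, then two real values.
        for (PropValue value : parseAll("", "Laury", "Miller")) {
            System.out.println(value.getContent());
        }
    }
}
```

With the strict variant, the empty honorificPrefix span would abort the whole loop before givenName and familyName were reached. With the relaxed variant, callers instead have to be prepared for a null content value.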

                
> Empty spans seem to break ANY23
> -------------------------------
>
>                 Key: ANY23-115
>                 URL: https://issues.apache.org/jira/browse/ANY23-115
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: html-scraper, microdata
>    Affects Versions: 0.7.0
>         Environment: Any23.org public scraper
>            Reporter: Christophe Dupriez
>             Fix For: 0.9.0
>
>         Attachments: 0001-ANY23-115-Empty-spans-seem-to-break-ANY23.patch, json-pretty-printer.html
>
>
> One of the 2 thousand URLs with the problem:
> http://www.oceanexpert.net/viewMemberRecord.php?&memberID=20045
> The piece of HTML creating the problem seems to be:
> <h1>
> 				Details of<span itemprop="name"> <span itemprop="honorificPrefix"></span>&nbsp;<span itemprop="givenName">Laury</span>&nbsp; <span itemprop="familyName">Miller</span></span>
> 							</h1>
> (this may disappear as we may work around the problem)
> Error message:
> Internal error.
> ================================================================
> java.lang.IllegalArgumentException: Invalid content ''
> 	at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89)
> 	at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341)
> 	at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394)
> 	at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471)
> 	at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186)
> 	at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203)
> 	at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:100)
> 	at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:62)
> 	at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:477)
> 	at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:260)
> 	at org.apache.any23.Any23.extract(Any23.java:294)
> 	at org.apache.any23.Any23.extract(Any23.java:446)
> 	at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:113)
> 	at org.apache.any23.servlet.Servlet.doGet(Servlet.java:74)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
> 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
> 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> 	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> 	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> 	at com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
> 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
> 	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
> 	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
> 	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> 	at java.lang.Thread.run(Thread.java:662)
> ================================================================

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira