Posted to dev@any23.apache.org by "Kunal P (JIRA)" <ji...@apache.org> on 2013/03/15 16:10:13 UTC

[jira] [Created] (ANY23-154) Not able to extract microdata in few test cases

Kunal P created ANY23-154:
-----------------------------

             Summary: Not able to extract microdata in few test cases
                 Key: ANY23-154
                 URL: https://issues.apache.org/jira/browse/ANY23-154
             Project: Apache Any23
          Issue Type: Bug
          Components: core
    Affects Versions: 0.7.0
         Environment: Windows 7 32bit
JDK 1.6.0_38
Intel Core 2 duo and 4GB RAM
            Reporter: Kunal P


We are using the Apache Any23 API to extract microdata from web pages as part of an internal project.

We have some test cases where the API is not able to parse the microdata, for example:

www.neeraj.nowfloats.com (this web page does not follow the schema.org standards strictly)

Here is a snippet of the HTML code:
<div id="someid" itemprop="offer" itemscope itemtype="http://schema.org/Offer">
  <div ... ></div>
</div>

The element carries both itemscope and itemprop in the same tag, which marks it as a property (child) of some parent microdata item. However, this <div id="someid"> element has no parent microdata item.
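
To make the condition concrete, here is a minimal sketch that finds such elements using only the JDK DOM classes. It is not part of the Any23 API: the class OrphanItemFinder and its methods are hypothetical names made up for illustration. It assumes the page has already been turned into a well-formed DOM (the sample writes itemscope="" so that the JDK XML parser accepts it), and it ignores itemref, which can also attach a property to an item without nesting.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not an Any23 class.
public class OrphanItemFinder {

    // True if any ancestor element of e carries an itemscope attribute.
    static boolean hasItemScopeAncestor(Element e) {
        for (Node p = e.getParentNode(); p != null; p = p.getParentNode()) {
            if (p instanceof Element && ((Element) p).hasAttribute("itemscope")) {
                return true;
            }
        }
        return false;
    }

    // Collects elements that have both itemscope and itemprop
    // but no enclosing itemscope element (itemref is not considered).
    static List<Element> findOrphans(Document dom) {
        List<Element> orphans = new ArrayList<>();
        NodeList all = dom.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemscope") && e.hasAttribute("itemprop")
                    && !hasItemScopeAncestor(e)) {
                orphans.add(e);
            }
        }
        return orphans;
    }

    // Parses a well-formed XHTML string with the JDK XML parser.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div id=\"someid\" itemprop=\"offer\" itemscope=\"\""
                + " itemtype=\"http://schema.org/Offer\"><div></div></div>"
                + "</body></html>";
        for (Element e : findOrphans(parse(xhtml))) {
            System.out.println("orphan item: id=" + e.getAttribute("id"));
        }
    }
}
```

Running a check like this before calling the parser at least shows which elements on a page fall into the problematic case.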

The method used for extracting ItemScopes is as follows,


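import org.apache.any23.extractor.microdata.ItemScope;
import org.apache.any23.extractor.microdata.MicrodataParser;
import org.apache.any23.extractor.microdata.MicrodataParserReport;
import org.w3c.dom.Document;

// getDomDocument(html) stands for our helper that builds a DOM from the HTML string
Document dom = getDomDocument(html);
// Run the Any23 microdata parser over the document
MicrodataParserReport report = MicrodataParser.getMicrodata(dom);
// Item scopes the parser detected
ItemScope[] items = report.getDetectedItemScopes();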


Here, items does not contain any ItemScope for the test case above.

In this scenario, how can we extract the microdata from the page using the Any23 API?
Is there any way to relax the criterion that itemprop and itemscope must not appear in the same tag, so that we can get the data from the web page?
