Posted to dev@any23.apache.org by "Ondrej Klimpera (Commented) (JIRA)" <ji...@apache.org> on 2012/04/20 17:48:40 UTC

[jira] [Commented] (ANY23-77) Facing a infinite loop problem in version 0.6.1 - Verify

    [ https://issues.apache.org/jira/browse/ANY23-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258328#comment-13258328 ] 

Ondrej Klimpera commented on ANY23-77:
--------------------------------------

Hello, 

I dug a little deeper into this problem. The loop is not infinite; it does end, but it takes far too long. On my notebook it took about 2 hours to process this resource (about 2 MB). The problem lies in a very inefficient implementation of microdata parsing. From what I have found so far, the method getXPathForNode() in the class org.deri.any23.extractor.html.DomUtils uses recursion and is called very often, which causes the long processing time.
As a rough fix I created a local cache for this method that stores the most frequently used XPath statements, which cut the processing time to 13 minutes. It's not ideal, but it at least shows a way to make microdata parsing much more efficient: by adding something like local caching to the extractor (not only to DomUtils).
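
To illustrate the caching idea, here is a minimal sketch. It is not the actual DomUtils code or the patch described above: the class XPathCache, its method pathFor() and the path format (/name[index]) are hypothetical and only assume that the path of a node is built recursively from the path of its parent. Each node's path is stored on first use, so repeated lookups and lookups for siblings do not walk up the tree again.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper (not part of Any23): memoizes one XPath-like string per DOM node. */
final class XPathCache {

    // Identity-based map: DOM nodes are looked up by reference, not by equals().
    private final Map<Node, String> cache = new IdentityHashMap<>();
    private int misses = 0;

    /** Returns a path such as /html[1]/body[1]/div[2], reusing cached ancestor paths. */
    String pathFor(Node node) {
        if (node == null || node.getNodeType() == Node.DOCUMENT_NODE) {
            return ""; // recursion stops at the document root
        }
        String cached = cache.get(node);
        if (cached != null) {
            return cached;
        }
        misses++;
        String path = pathFor(node.getParentNode())
                + "/" + node.getNodeName()
                + "[" + indexAmongSameNameSiblings(node) + "]";
        cache.put(node, path);
        return path;
    }

    /** Number of paths that had to be computed instead of being served from the cache. */
    int misses() {
        return misses;
    }

    // 1-based position of the node among its preceding siblings with the same name.
    private static int indexAmongSameNameSiblings(Node node) {
        int index = 1;
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == node.getNodeType()
                    && s.getNodeName().equals(node.getNodeName())) {
                index++;
            }
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document: <html><body><div/><div/></body></html>
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element first = doc.createElement("div");
        Element second = doc.createElement("div");
        doc.appendChild(html);
        html.appendChild(body);
        body.appendChild(first);
        body.appendChild(second);

        XPathCache xpaths = new XPathCache();
        System.out.println(xpaths.pathFor(second));
        System.out.println(xpaths.pathFor(first)); // reuses the cached path of <body>
        System.out.println("computed paths: " + xpaths.misses());
    }
}
```

Such a cache is only valid while the DOM is not modified, so in this sketch it would have to live for one extraction run and be dropped afterwards.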

Please let me know if there is any desire to fix this.

Thanks for dealing with it :)
Regards
Ondrej Klimpera
                
> Facing a infinite loop problem in version 0.6.1 - Verify
> --------------------------------------------------------
>
>                 Key: ANY23-77
>                 URL: https://issues.apache.org/jira/browse/ANY23-77
>             Project: Apache Any23
>          Issue Type: Bug
>            Reporter: Michele Mostarda
>            Assignee: Michele Mostarda
>
> The code to reproduce the bug is here (Client is the Jersey HTTP client, but that's just a detail; the problem lies in the URL resource): http://lod.openlinksw.com/sparql?query=define%20sql%3Adescribe-mode%20%22LOD%22%20%20DESCRIBE%20%3Chttp%3A%2F%2Fyago-knowledge.org%2Fresource%2FBerlin%3E&output=text%2Fhtml
> Java Code:
>        Client c = Client.create();
>        System.out.println("Downloading file.");
>        InputStream in = c.resource("http://lod.openlinksw.com/sparql?query=define%20sql%3Adescribe-mode%20%22LOD%22%20%20DESCRIBE%20%3Chttp%3A%2F%2Fyago-knowledge.org%2Fresource%2FBerlin%3E&output=text%2Fhtml").get(InputStream.class); 
>        FileOutputStream out = null;
>        File f = new File("urlResource");
>        try {
>            out = new FileOutputStream(f);
>            IOUtils.copy(in,  out);
>        } catch (Exception e) {
>            e.printStackTrace();
>        } finally {
>            IOUtils.closeQuietly(in);
>            IOUtils.closeQuietly(out);
>        }
>        System.out.println("File downloaded.");
>        System.out.println("Starting extraction.");
>        FileDocumentSource doc = new FileDocumentSource(f);
>        TurtleWriter tw = new TurtleWriter(System.out);
>        Any23 ext = new Any23();
>        try {
>            ext.extract(doc, tw);
>        } catch (Exception e) {
>            e.printStackTrace();
>        }
>        System.out.println("Extraction done.");
