Posted to dev@any23.apache.org by "Timothy Potter (Created) (JIRA)" <ji...@apache.org> on 2012/04/16 12:38:16 UTC

[jira] [Created] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Improve runtime of the Microformat extractor on documents with many relations.
------------------------------------------------------------------------------

                 Key: ANY23-76
                 URL: https://issues.apache.org/jira/browse/ANY23-76
             Project: Apache Any23
          Issue Type: Improvement
            Reporter: Timothy Potter
            Priority: Trivial


For some large documents with many Microformat tuples, the extensive use of XPath in the DomUtils class causes Microformat extraction to be slow. I've marked this as trivial as it's a corner case.

To reproduce the problem the patch addresses, run the Microformat extractor on the following URL:
http://en.wikipedia.org/wiki/List_of_Nike_missile_locations

I include a patch that improves performance at the cost of code simplicity. I hope someone who is more involved in the project can decide whether it's a good idea to use the patch, or perhaps address this issue in another way. The patch replaces commonly used XPath queries with DOM tree traversals, e.g. getting all nodes with 'class' attributes. On my machine the time to parse the given document is reduced from around 105 seconds to 14 seconds.
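
To illustrate the kind of change described above, here is a minimal sketch (not the attached patch itself) that collects all elements carrying a 'class' attribute in two ways: with an XPath query and with a plain recursive DOM walk. The class name ClassAttributeScan and its method names are hypothetical and do not come from DomUtils; only standard JDK XML APIs are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Any23.
public class ClassAttributeScan {

    /** XPath variant: a new query is built and evaluated on every call. */
    static List<Node> findByXPath(Node root) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("descendant-or-self::*[@class]", root, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Traversal variant: one pre-order walk over the subtree, no XPath engine involved. */
    static List<Node> findByTraversal(Node root) {
        List<Node> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    private static void collect(Node node, List<Node> out) {
        // Only element nodes can carry attributes.
        if (node.getNodeType() == Node.ELEMENT_NODE && ((Element) node).hasAttribute("class")) {
            out.add(node);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    /** Parses a small well-formed XML/XHTML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div class='vcard'><span class='fn'>A</span><p>no class</p></div>");
        System.out.println("XPath hits:     " + findByXPath(doc).size());
        System.out.println("Traversal hits: " + findByTraversal(doc).size());
    }
}
```

Both methods should select the same elements; the traversal is more code to maintain, which is the simplicity trade-off mentioned above. How much time it saves will depend on the document and the XPath implementation in use.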


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Simone Tripodi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simone Tripodi updated ANY23-76:
--------------------------------

          Component/s: core
    Affects Version/s: 0.7.0
        Fix Version/s: 0.7.0
    

[jira] [Updated] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Timothy Potter (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated ANY23-76:
--------------------------------

    Attachment:     (was: MicroformatSpeed.patch)
    

[jira] [Resolved] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Michele Mostarda (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michele Mostarda resolved ANY23-76.
-----------------------------------

    Resolution: Fixed

Fixed @ r1328663.
                

[jira] [Assigned] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Michele Mostarda (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michele Mostarda reassigned ANY23-76:
-------------------------------------

    Assignee: Michele Mostarda
    

[jira] [Commented] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259032#comment-13259032 ] 

Hudson commented on ANY23-76:
-----------------------------

Integrated in Any23-trunk #178 (See [https://builds.apache.org/job/Any23-trunk/178/])
    Improved HCardExtractor performances. Related to issue #ANY23-76 . (Revision 1328663)

     Result = UNSTABLE
mostarda : 
Files : 
* /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/DomUtils.java
* /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/HCardExtractor.java
* /incubator/any23/trunk/core/src/test/java/org/apache/any23/extractor/html/HCardExtractorTest.java
* /incubator/any23/trunk/core/src/test/resources/microformats/hcard/performance.html

                

[jira] [Closed] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney closed ANY23-76.
-------------------------------------


Bulk close for 0.7.0-incubating release
                

[jira] [Updated] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Timothy Potter (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated ANY23-76:
--------------------------------

    Attachment: MicroformatSpeed.patch
    

[jira] [Commented] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Michele Mostarda (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258856#comment-13258856 ] 

Michele Mostarda commented on ANY23-76:
---------------------------------------

Hi Tim, 
  I applied your patch and verified its performance. On my Mac (2.8 GHz Intel Core 2 Duo, 8 GB 1067 MHz DDR3 memory) with the default JVM configuration (no heap size specified) I observed about a 2x performance improvement (from 21 sec to 9 sec) on the same input page you reported, which is enough in my opinion to integrate the patch.
Thanks a lot.
All the best.
                

[jira] [Updated] (ANY23-76) Improve runtime of the Microformat extractor on documents with many relations.

Posted by "Timothy Potter (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated ANY23-76:
--------------------------------

    Attachment: MicroformatSpeed.patch
    