Posted to issues@maven.apache.org by "Joakim Erdfelt (JIRA)" <ji...@codehaus.org> on 2008/06/01 06:13:53 UTC

[jira] Created: (WAGON-218) Link Parsing in http is flawed

Link Parsing in http is flawed
------------------------------

                 Key: WAGON-218
                 URL: http://jira.codehaus.org/browse/WAGON-218
             Project: Maven Wagon
          Issue Type: Improvement
          Components: wagon-http, wagon-http-lightweight
    Affects Versions: 1.0-beta-2
            Reporter: Joakim Erdfelt


The link parsing in wagon http has a few issues.

a) Not all links are detected.
b) The various ways of identifying page content via URL string manipulation fail in many example cases.
c) The use of jtidy introduces a large dependency and high memory usage.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.codehaus.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (WAGON-218) Link Parsing in http is flawed

Posted by "Jesse Glick (JIRA)" <ji...@codehaus.org>.
    [ http://jira.codehaus.org/browse/WAGON-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=232855#action_232855 ] 

Jesse Glick commented on WAGON-218:
-----------------------------------

Why not get rid of nekohtml (+ XNI), saving 152 Kb as well as complexity in the Maven third-party dependency list, and directly search for {{(?i)<a href="(.+?)">}} or similar? After all, the intended use case is to find links in index listings generated by a small number of distinct pieces of software. These generators are surely not going to use exotic formatting or attributes of the kind created by humans editing HTML by hand or with WYSIWYG designers.

I would be happy to supply a patch if there is interest.
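
For illustration only, here is a minimal sketch of what the regex approach suggested above might look like. The class name HrefRegexSketch and the sample listing are made up for this example and are not part of Wagon; only the pattern itself comes from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Wagon code: collects href values with the regex from the comment.
class HrefRegexSketch {
    // (?i) makes the match case-insensitive; (.+?) reluctantly captures the href value.
    private static final Pattern HREF = Pattern.compile("(?i)<a href=\"(.+?)\">");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up index listing in the style of a generated directory page.
        String listing = "<html><body>\n"
            + "<A HREF=\"../\">Parent Directory</A>\n"
            + "<a href=\"1.0-beta-2/\">1.0-beta-2/</a>\n"
            + "<a href=\"maven-metadata.xml\">maven-metadata.xml</a>\n"
            + "</body></html>";
        System.out.println(extractLinks(listing));
    }
}
```

Note that this exact pattern only matches anchors where href is the first and only attribute and is double-quoted. That fits the "small number of generators" assumption in the comment, but a listing that emits extra attributes or single quotes would need a looser pattern.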

> Link Parsing in http is flawed
> ------------------------------
>
>                 Key: WAGON-218
>                 URL: http://jira.codehaus.org/browse/WAGON-218
>             Project: Maven Wagon
>          Issue Type: Improvement
>          Components: wagon-http, wagon-http-lightweight
>    Affects Versions: 1.0-beta-2
>            Reporter: Joakim Erdfelt
>            Assignee: Joakim Erdfelt
>
> The link parsing in wagon http has a few issues.
> a) not all links detected.
> b) the various ways that page content is identified via url string manipulation isn't working in many example cases.
> c) the use of jtidy introduces a large dependency and high memory usage.


        

[jira] Closed: (WAGON-218) Link Parsing in http is flawed

Posted by "Joakim Erdfelt (JIRA)" <ji...@codehaus.org>.
     [ http://jira.codehaus.org/browse/WAGON-218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joakim Erdfelt closed WAGON-218.
--------------------------------

    Resolution: Fixed

Work completed in revision 662070

After a proof-of-concept LinkParser replacement in a wagon-http-with-webdav branch and a discussion on the dev@wagon mailing list, the following changes have been made.

1) Replaced jtidy with nekohtml
    This resulted in a smaller dependency list and improved memory utilization.
2) Replaced reliance on String URL manipulation with use of java.net.URI
    This change makes the detection of content that belongs to the page more accurate, and makes some complex relative URI resolution almost trivial.
3) Added more unit tests for real-world scenarios encountered since the original implementation was let loose on the world.
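
As a rough illustration of point 2, the sketch below shows how java.net.URI can resolve a link against the page it was found on and decide whether the result lies underneath that page. It is not the code from revision 662070; the class name UriResolveSketch, the method resolveWithin, and the example.org URLs are assumptions made for this example.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not Wagon code: resolves an href and keeps it only if it is below the base page.
class UriResolveSketch {
    static Optional<URI> resolveWithin(URI base, String href) {
        // resolve() handles relative paths, "../", absolute paths and absolute URIs.
        URI resolved = base.resolve(href).normalize();
        // relativize() returns its argument unchanged (still absolute) when it is not under base.
        URI relative = base.relativize(resolved);
        return relative.isAbsolute() ? Optional.empty() : Optional.of(resolved);
    }

    public static void main(String[] args) {
        // Made-up repository URL; the trailing slash marks it as a directory.
        URI base = URI.create("http://repo.example.org/maven2/org/apache/");
        String[] hrefs = {
            "wagon/", "./wagon/../maven/", "/maven2/org/apache/wagon/1.0/",
            "../", "http://other.example.org/x"
        };
        for (String href : hrefs) {
            System.out.println(href + " -> " + resolveWithin(base, href));
        }
    }
}
```

In this sketch the parent-directory link and the link to another host come back empty, while the relative and absolute-path links that stay below the base page are returned fully resolved.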


