Posted to dev@nutch.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2014/06/21 01:42:27 UTC

[jira] [Commented] (NUTCH-1767) remove special treatment of "params" in relative links

    [ https://issues.apache.org/jira/browse/NUTCH-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039547#comment-14039547 ] 

Hudson commented on NUTCH-1767:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #1052 (See [https://builds.apache.org/job/Nutch-nutchgora/1052/])
NUTCH-1767 remove special treatment of "params" in relative links (snagel: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1604298)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/util/URLUtil.java
* /nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* /nutch/branches/2.x/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
* /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* /nutch/branches/2.x/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/DOMContentUtilsTest.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java


> remove special treatment of "params" in relative links
> ------------------------------------------------------
>
>                 Key: NUTCH-1767
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1767
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.8, 2.2.1
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1767-1x.patch, NUTCH-1767-2x.patch, test_nutch_1767-1.html, test_nutch_1767-2.html
>
>
> [RFC 1808|http://www.ietf.org/rfc/rfc1808.txt] specified that path elements of URLs may contain so-called params, introduced by ";", e.g. ";type=a". If the base URL contains a path param while the link target does not, the params are transferred to the target:
> {quote}
> Step 5: 
>  a) if the embedded URL's <params> is non-empty, we skip to
>      step 7; otherwise, it inherits the <params> of the base URL (if any)
> {quote}
> This behaviour was implemented in NUTCH-436. Later (NUTCH-1115) it was made optional and configurable via the property {{parser.fix.embeddedparams}}. NUTCH-797 made the changes of both issues inactive for 1.x (not applied to 2.x), with reference to RFC 3986.
> [RFC 3986|http://tools.ietf.org/html/rfc3986], which obsoletes RFC 1808, does not mention params, and the examples given in sect. 5.4 "Reference Resolution Examples" contradict RFC 1808. Also, [Wikipedia|http://en.wikipedia.org/w/index.php?title=URI_scheme&oldid=604656593] states:
> {quote}
> Historically, each segment was specified to contain parameters separated from it using a semicolon (";"), though this was rarely used in practice and current specifications allow but no longer specify such semantics.
> {quote}
> Accordingly, any special treatment of "params" in relative links should be removed from Nutch. At first glance, this would include:
> * 2.x parse-html and parse-tika
> ** remove fixEmbeddedParams(...)
> ** change unit tests to follow examples from RFC 3986
> * 1.x
> ** remove unused fixEmbeddedParams(...) from parse-html
> ** remove property {{parser.fix.embeddedparams}} from nutch-default.xml
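
The resolution behaviour without special treatment of params can be illustrated with a minimal sketch. It uses java.net.URI.resolve as a stand-in and is not Nutch's own URLUtil code; the class name ParamResolutionSketch and the helper resolve(...) are hypothetical. java.net.URI follows RFC 2396 rather than RFC 3986, but for the three references below its results agree with the examples in RFC 3986 sect. 5.4.1: the param ";p" of the base URL is never carried over to the target.

```java
import java.net.URI;

final class ParamResolutionSketch {

    // Resolve a relative reference against a base URL.
    // Assumption: java.net.URI stands in for whatever resolver is actually used.
    static String resolve(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        // Base URL taken from the RFC 3986 sect. 5.4 examples.
        String base = "http://a/b/c/d;p?q";

        // "g"   -> http://a/b/c/g    (";p" is not inherited)
        // "g;x" -> http://a/b/c/g;x  (the target keeps its own param)
        // ";x"  -> http://a/b/c/;x   (";x" is treated as an ordinary path segment)
        for (String ref : new String[] {"g", "g;x", ";x"}) {
            System.out.println(ref + " -> " + resolve(base, ref));
        }
    }
}
```

Unit tests that follow RFC 3986 could assert on these base/reference pairs in the same way.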



--
This message was sent by Atlassian JIRA
(v6.2#6252)