Posted to dev@nutch.apache.org by "Andrew Groh (JIRA)" <ji...@apache.org> on 2007/01/26 15:09:49 UTC

[jira] Created: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Incorrect handling of relative paths when the embedded URL path is empty
------------------------------------------------------------------------

                 Key: NUTCH-436
                 URL: https://issues.apache.org/jira/browse/NUTCH-436
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
            Reporter: Andrew Groh
            Priority: Critical


If you have a base URL of the form:
http://a/b/c/d;p?q#f

Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x


See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for examples.

http://www.ietf.org/rfc/rfc1808.txt
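
For reference, the empty-path rules of RFC 1808 section 4 (step 2a and step 5) can be sketched in Java roughly as follows. This is only an illustration and not Nutch code: the class name EmptyPathResolver and its resolve method are made up for this example, it handles only embedded URLs whose path is empty, and it splits on the first '#', '?' and ';' instead of doing full RFC 1808 parsing.

```java
final class EmptyPathResolver {

    /** Returns true when a URL component is absent or has no characters. */
    private static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    /**
     * Resolves an embedded URL with an empty path (for example "?y", ";x"
     * or "#s") against an absolute base URL. Hypothetical helper.
     */
    static String resolve(String base, String embedded) {
        // Step 2a: an empty embedded URL inherits the entire base URL.
        if (embedded.isEmpty()) {
            return base;
        }

        // Split the base into path prefix, params and query; its fragment is dropped.
        String rest = base;
        int i = rest.indexOf('#');
        if (i >= 0) { rest = rest.substring(0, i); }
        String baseQuery = null;
        i = rest.indexOf('?');
        if (i >= 0) { baseQuery = rest.substring(i + 1); rest = rest.substring(0, i); }
        String baseParams = null;
        i = rest.indexOf(';');
        if (i >= 0) { baseParams = rest.substring(i + 1); rest = rest.substring(0, i); }

        // Split the embedded URL the same way.
        String path = embedded;
        String fragment = null;
        String query = null;
        String params = null;
        i = path.indexOf('#');
        if (i >= 0) { fragment = path.substring(i + 1); path = path.substring(0, i); }
        i = path.indexOf('?');
        if (i >= 0) { query = path.substring(i + 1); path = path.substring(0, i); }
        i = path.indexOf(';');
        if (i >= 0) { params = path.substring(i + 1); path = path.substring(0, i); }
        if (!path.isEmpty()) {
            throw new IllegalArgumentException("embedded URL has a non-empty path");
        }

        // Step 5a: keep the embedded params if present, otherwise inherit the base params.
        // Step 5b: only in that case, inherit the base query when the embedded one is empty.
        if (isEmpty(params)) {
            params = baseParams;
            if (isEmpty(query)) {
                query = baseQuery;
            }
        }

        // Step 7: recombine the inherited path with the chosen components.
        StringBuilder out = new StringBuilder(rest);
        if (!isEmpty(params)) { out.append(';').append(params); }
        if (!isEmpty(query)) { out.append('?').append(query); }
        if (fragment != null) { out.append('#').append(fragment); }
        return out.toString();
    }
}
```

With the base URL above, resolve(base, "?y") gives http://a/b/c/d;p?y and resolve(base, ";x") gives http://a/b/c/d;x, which are the "Correct Absolute URL" values listed in this report.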




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Posted by "Doug Cook (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272 ] 

Doug Cook commented on NUTCH-436:
---------------------------------

It looks like NUTCH-566 and its associated patch, which I recently filed, duplicate this issue.

The patch I proposed may or may not handle the ';' case correctly; I need to check that.

But the patch for this issue (NUTCH-436) is limited to DOMContentUtils, and the problem will exist wherever Sun's URL class is used in URL extraction, so it affects any parser, not just the HTML one. The same issue occurs in JavaScript link extraction, Flash link extraction, etc., so the fix should live in a centralized location (like util).
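
A centralized helper along these lines might look like the sketch below. It is not the NUTCH-436 or NUTCH-566 patch: UrlResolveSketch and resolve are hypothetical names, and the approach (prepending the last path segment of the base to a target that starts with '?' or ';' before handing it to java.net.URL) is just one possible way to work around the behavior.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlResolveSketch {

    /**
     * Resolves target against base, working around java.net.URL dropping
     * the last path segment when the target starts with '?' or ';'.
     * Hypothetical helper, not an existing Nutch API.
     */
    static URL resolve(URL base, String target) throws MalformedURLException {
        target = target.trim();
        if (target.startsWith("?") || target.startsWith(";")) {
            // java.net.URL keeps params as part of the path, e.g. "/b/c/d;p".
            String path = base.getPath();
            String last = path.substring(path.lastIndexOf('/') + 1); // "d;p"
            if (target.startsWith(";")) {
                // The target brings its own params, so drop the base params.
                int semi = last.indexOf(';');
                if (semi >= 0) {
                    last = last.substring(0, semi); // "d"
                }
            }
            // Turn the empty-path target into an ordinary relative path.
            target = last + target;
        }
        return new URL(base, target);
    }
}
```

Any parser (HTML, JavaScript, Flash) could then call the one helper instead of constructing URLs directly. For the base URL in this issue, the rewritten targets are "d;p?y" and "d;x", which should resolve to the URLs that RFC 1808 expects.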


> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-436
>                 URL: https://issues.apache.org/jira/browse/NUTCH-436
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>            Assignee: Dennis Kubes
>            Priority: Critical
>         Attachments: NUTCH-436-20070304.patch
>
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt



[jira] Updated: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-436:
-------------------------------

    Attachment: NUTCH-436-20070304.patch

NUTCH-436-20070304.patch handles correct encoding of the params information in the base URL. When creating a new URL from a base URL and a target String path, the java.net.URL class behaves correctly if the target contains params information but the base does not. If the base has params information, the URL class strips it from the resulting URL. This patch is a workaround that moves the base params information to the target so that the URL class handles it correctly.

> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-436
>                 URL: https://issues.apache.org/jira/browse/NUTCH-436
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>         Assigned To: Dennis Kubes
>            Priority: Critical
>         Attachments: NUTCH-436-20070304.patch
>
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt



[jira] Commented: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Posted by "Andrew Groh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467883 ] 

Andrew Groh commented on NUTCH-436:
-----------------------------------

This is a bug in java.net.URL, specifically in the URLStreamHandler class that it uses.

new URL(new URL("http://a/b/c/d;p?q#f"), "?y")

creates a URL object with a bad URL.
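
A small program to reproduce this might look like the following. It is a sketch rather than a test from the Nutch tree, and EmptyPathRepro is a made-up class name. It prints whatever java.net.URL produces for the two embedded URLs from the description, so the output depends on the JDK in use.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class EmptyPathRepro {

    /** Resolves embedded against base using only java.net.URL. */
    static URL resolve(String base, String embedded) throws MalformedURLException {
        return new URL(new URL(base), embedded);
    }

    public static void main(String[] args) throws MalformedURLException {
        String base = "http://a/b/c/d;p?q#f";
        // Compare the printed values with the "Correct Absolute URL" entries.
        for (String embedded : new String[] {"?y", ";x"}) {
            System.out.println(embedded + " -> " + resolve(base, embedded));
        }
    }
}
```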

> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-436
>                 URL: https://issues.apache.org/jira/browse/NUTCH-436
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>            Priority: Critical
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt



Re: java.io.FileNotFoundException: / (Is a directory)

Posted by Dennis Kubes <nu...@dragonflymc.com>.
That is caused by the hadoop.log.dir value not being set.  It is trying to
use the DRFA appender to write to a file and can't find the log directory.
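
To illustrate why the appender would end up at "/", here is a rough Java sketch. It rests on two assumptions that are not stated in this thread: that the DRFA File option is written as ${hadoop.log.dir}/${hadoop.log.file}, and that unset variables expand to the empty string. LogPathSketch and substitute are names made up for the illustration and are not log4j API; the directory /var/log/nutch mentioned below is also just an example value.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogPathSketch {

    // Matches ${name} style variables.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Expands ${name} variables from props; unset names become "". */
    static String substitute(String template, Map<String, String> props) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Under those assumptions, with neither property set the template collapses to "/", which is a directory and would explain the FileNotFoundException. With hadoop.log.dir=/var/log/nutch and hadoop.log.file=hadoop.log it expands to /var/log/nutch/hadoop.log.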

Dennis

Gal Nitzan wrote:
> 
> Just installed latest from trunk.
> 
> I run mergesegs and I get the following error in all tasks log files (I use
> default log4j.properties):
> 
> log4j:ERROR setFile(null,true) call failed.
> java.io.FileNotFoundException: / (Is a directory)
>         at java.io.FileOutputStream.openAppend(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
>         at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
>         at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
>         at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
>         at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
>         at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
>         at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
>         at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
>         at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
>         at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
>         at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
>         at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
>         at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
>         at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
>         at org.apache.log4j.Logger.getLogger(Logger.java:104)
>         at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
>         at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
>         at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
>         at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.<clinit>(TaskTracker.java:59)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1346)
> log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
> 
> 

java.io.FileNotFoundException: / (Is a directory)

Posted by Gal Nitzan <gn...@usa.net>.

Just installed latest from trunk.

I run mergesegs and I get the following error in all tasks log files (I use
default log4j.properties):

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: / (Is a directory)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
        at org.apache.log4j.Logger.getLogger(Logger.java:104)
        at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
        at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
        at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
        at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:370)
        at org.apache.hadoop.mapred.TaskTracker.<clinit>(TaskTracker.java:59)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1346)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].



Re: record version mismatch occured

Posted by Sami Siren <ss...@gmail.com>.
Gal Nitzan wrote:
> Thanks Sami,
> 
> By redo, do you mean re-parse, or re-fetch + re-parse?

generate -> fetch -> parse

--
 Sami Siren


RE: record version mismatch occured

Posted by Gal Nitzan <gn...@usa.net>.
Thanks Sami,

By redo, do you mean re-parse, or re-fetch + re-parse?

-----Original Message-----
From: Sami Siren [mailto:ssiren@gmail.com] 
Sent: Friday, January 26, 2007 10:49 PM
To: nutch-dev@lucene.apache.org
Subject: Re: record version mismatch occured

Gal Nitzan wrote:
> Got it. I used latest trunk for a few hours and it seems that it changed the
> version of Crawldatum to ver. 5 :(

My earlier reply went out too early. One (or more) of your segments has data
written with the newer version. If you haven't updated the crawldb, then you
just need to redo that segment (or those segments).

--
 Sami Siren




Re: record version mismatch occured

Posted by Sami Siren <ss...@gmail.com>.
Gal Nitzan wrote:
> Got it. I used latest trunk for a few hours and it seems that it changed the
> version of Crawldatum to ver. 5 :(

My earlier reply went out too early. One (or more) of your segments has data
written with the newer version. If you haven't updated the crawldb, then you
just need to redo that segment (or those segments).

--
 Sami Siren


Re: record version mismatch occured

Posted by Sami Siren <ss...@gmail.com>.
Gal Nitzan wrote:
> Got it. I used latest trunk for a few hours and it seems that it changed the
> version of Crawldatum to ver. 5 :(

yes, version is updated on write
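
A rough sketch of how such a version stamp typically behaves is shown below. This is not the CrawlDatum code: VersionedRecordSketch, its methods and the error text are made up for illustration, and the payload is a single int. The writer always stamps its own version, and a reader built for an older version rejects anything newer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class VersionedRecordSketch {

    /** Serializes payload, prefixed with the writer's current version. */
    static byte[] write(byte writerVersion, int payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(writerVersion); // the version is stamped on every write
        out.writeInt(payload);
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads a record, refusing data newer than the reader understands. */
    static int read(byte readerVersion, byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte found = in.readByte();
        if (found > readerVersion) {
            throw new IOException("Expecting v" + readerVersion + ", found v" + found);
        }
        return in.readInt();
    }
}
```

In this model, a record written by a version 5 writer can only be read by a version 5 (or later) reader, which is why data written while running the newer trunk trips up the older code.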

RE: record version mismatch occured

Posted by Gal Nitzan <gn...@usa.net>.
Got it. I used latest trunk for a few hours and it seems that it changed the
version of Crawldatum to ver. 5 :(



-----Original Message-----
From: Gal Nitzan [mailto:gnitzan@usa.net] 
Sent: Friday, January 26, 2007 4:57 PM
To: nutch-dev@lucene.apache.org
Subject: record version mismatch occured

Trying to mergesegs I get the following, any idea?


A record version mismatch occured. Expecting v4, found v5
	at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:147)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1175)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1258)
	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:69)
	at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:139)
	at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:201)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)





record version mismatch occured

Posted by Gal Nitzan <gn...@usa.net>.
Trying to mergesegs I get the following, any idea?


A record version mismatch occured. Expecting v4, found v5
	at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:147)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1175)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1258)
	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:69)
	at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:139)
	at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:201)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)



[jira] Updated: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Posted by "Andrew Groh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Groh updated NUTCH-436:
------------------------------

    Description: 
If you have a base URL of the form:
http://a/b/c/d;p?q#f

Embedded URL: ?y
Correct Absolute URL: http://a/b/c/d;p?y 
Nutch Generated URL: http://a/b/c/?y

Embedded URL: ;x
Correct Absolute URL: http://a/b/c/d;x 
Nutch Generated URL: http://a/b/c/;x


See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example

http://www.ietf.org/rfc/rfc1808.txt




  was:
If you have a base URL of the form:
http://a/b/c/d;p?q#f

Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x


See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example

http://www.ietf.org/rfc/rfc1808.txt





> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-436
>                 URL: https://issues.apache.org/jira/browse/NUTCH-436
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>            Priority: Critical
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt



[jira] Closed: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-436.
------------------------------


Issue closed.

> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-436
>                 URL: https://issues.apache.org/jira/browse/NUTCH-436
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>         Assigned To: Dennis Kubes
>            Priority: Critical
>         Attachments: NUTCH-436-20070304.patch
>
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt



[jira] Resolved: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-436.
--------------------------------

    Resolution: Fixed

Patch tested on 10,000 URL run with no apparent issues.  Reviewed and committed.

> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-436
>                 URL: https://issues.apache.org/jira/browse/NUTCH-436
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>         Assigned To: Dennis Kubes
>            Priority: Critical
>         Attachments: NUTCH-436-20070304.patch
>
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt



[jira] Assigned: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes reassigned NUTCH-436:
----------------------------------

    Assignee: Dennis Kubes

> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-436
>                 URL: https://issues.apache.org/jira/browse/NUTCH-436
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>         Assigned To: Dennis Kubes
>            Priority: Critical
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt
