Posted to dev@nutch.apache.org by "Andrew Groh (JIRA)" <ji...@apache.org> on 2007/01/26 15:09:49 UTC
[jira] Created: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Incorrect handling of relative paths when the embedded URL path is empty
------------------------------------------------------------------------
Key: NUTCH-436
URL: https://issues.apache.org/jira/browse/NUTCH-436
Project: Nutch
Issue Type: Bug
Components: fetcher
Reporter: Andrew Groh
Priority: Critical
If you have a base URL of the form:
http://a/b/c/d;p?q#f
Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x
See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
http://www.ietf.org/rfc/rfc1808.txt
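For reference, step 5 of RFC 1808 section 4 (the empty-path case reported here) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the base is assumed to be a well-formed absolute URL with no ';' in its host part, and the embedded URL is assumed to have an empty path. It is not Nutch code and it does not cover steps 6 and 7 for non-empty paths.

```java
// Hypothetical helper illustrating RFC 1808 section 4, step 5 only.
final class Rfc1808Step5 {

    // Resolves an embedded URL whose path is empty (for example "?y" or ";x").
    static String resolveEmptyPath(String base, String embedded) {
        if (embedded.isEmpty()) {
            return base; // step 2a: an empty embedded URL inherits the whole base
        }
        String[] b = split(base);
        String[] e = split(embedded);
        if (!e[0].isEmpty()) {
            throw new IllegalArgumentException("embedded path is not empty");
        }
        String params = e[1];
        String query = e[2];
        if (params.isEmpty()) {       // step 5a: inherit the base params
            params = b[1];
            if (query.isEmpty()) {    // step 5b: inherit the base query
                query = b[2];
            }
        }
        // The base path (b[0], including scheme and host) is always inherited;
        // the fragment is taken from the embedded URL only.
        return b[0] + params + query + e[3];
    }

    // Splits a URL into {everything before params, ";params", "?query", "#fragment"}.
    // Each part keeps its leading delimiter, so a bare ";" or "?" counts as present
    // (a simplification of the RFC's "non-empty" wording).
    private static String[] split(String url) {
        String fragment = "";
        String query = "";
        String params = "";
        int i = url.indexOf('#');
        if (i >= 0) { fragment = url.substring(i); url = url.substring(0, i); }
        i = url.indexOf('?');
        if (i >= 0) { query = url.substring(i); url = url.substring(0, i); }
        i = url.indexOf(';');
        if (i >= 0) { params = url.substring(i); url = url.substring(0, i); }
        return new String[] { url, params, query, fragment };
    }
}
```

Calling Rfc1808Step5.resolveEmptyPath("http://a/b/c/d;p?q#f", "?y") returns http://a/b/c/d;p?y, and the same call with ";x" returns http://a/b/c/d;x, matching the "Correct Absolute URL" column above.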
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Posted by "Doug Cook (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272 ]
Doug Cook commented on NUTCH-436:
---------------------------------
It looks like NUTCH-566 and its associated patch, which I recently filed, duplicate this issue.
The patch I proposed may or may not handle the ';' correctly; I need to check that.
But the patch for this issue (NUTCH-436) is limited to DOMContentUtils, and this problem will exist wherever Sun's URL class is used in URL extraction -- thus it affects any parser, not just the HTML one. The same issue occurs in JavaScript link extraction, Flash link extraction, etc. -- thus the patch should be in a centralized location (like util).
> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>
> Key: NUTCH-436
> URL: https://issues.apache.org/jira/browse/NUTCH-436
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Reporter: Andrew Groh
> Assignee: Dennis Kubes
> Priority: Critical
> Attachments: NUTCH-436-20070304.patch
>
>
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
> http://www.ietf.org/rfc/rfc1808.txt
[jira] Updated: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-436:
-------------------------------
Attachment: NUTCH-436-20070304.patch
NUTCH-436-20070304.patch handles correct encoding of the params information in the base URL. When creating a new URL with a base URL and a target String path, if the target contains params information but the base does not, then the java.net.URL class behaves correctly. If the base has params information, then the URL class strips this information from the URL. This patch is a workaround that moves the base params information to the target so that it can be handled correctly by the URL class.
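A minimal sketch of this kind of workaround, for illustration only. It is not the code from NUTCH-436-20070304.patch; the class name and the exact rewriting rule are assumptions. The idea shown is to copy the last path segment of the base (with its params) into a target that starts with '?' or ';' before handing it to java.net.URL, so the URL class no longer sees an empty path.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of the workaround idea described above.
final class EmptyPathWorkaround {

    static String resolve(String base, String target) {
        try {
            // Note: the java.net.URL constructors are deprecated in recent JDKs;
            // they are used here because they are the API under discussion.
            URL baseUrl = new URL(base);
            if (target.startsWith("?") || target.startsWith(";")) {
                // Last path segment of the base, e.g. "d;p" for /b/c/d;p
                String path = baseUrl.getPath();
                String last = path.substring(path.lastIndexOf('/') + 1);
                if (target.startsWith(";")) {
                    // The target brings its own params, so drop the base's.
                    int semi = last.indexOf(';');
                    if (semi >= 0) {
                        last = last.substring(0, semi);
                    }
                }
                target = last + target;
            }
            return new URL(baseUrl, target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With the base http://a/b/c/d;p?q#f, the target "?y" is rewritten to "d;p?y" and the target ";x" to "d;x" before resolution, which yields the URLs listed as correct in the issue description.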
[jira] Commented: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Posted by "Andrew Groh (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467883 ]
Andrew Groh commented on NUTCH-436:
-----------------------------------
This is a bug in java.net.URL, specifically in the URLStreamHandler class that it uses.
new URL(new URL("http://a/b/c/d;p?q#f"), "?y")
creates a URL object with an incorrect URL.
Re: java.io.FileNotFoundException: / (Is a directory)
Posted by Dennis Kubes <nu...@dragonflymc.com>.
That is a problem with the hadoop.log.dir value not being set. It is trying to
use the DRFA appender to write to a file and can't find the log directory.
Dennis
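To see why an unset hadoop.log.dir ends up as the path "/", here is a small Java sketch. It assumes (the thread does not quote the file) that the DRFA appender's File option is configured as ${hadoop.log.dir}/${hadoop.log.file}, and that unset variables are replaced by an empty string, which is how log4j 1.x substitution is understood to behave. The class below is a hypothetical stand-in and does not call log4j itself.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for ${key} substitution in a log4j.properties value:
// a known key is replaced by its value, an unknown key by "".
final class LogPathDemo {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    static String substitute(String value, Properties props) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = props.getProperty(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With an empty Properties object, substitute("${hadoop.log.dir}/${hadoop.log.file}", props) returns "/", which is the directory named in the FileNotFoundException. Once both properties are set, the same call returns a normal file path.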
Gal Nitzan wrote:
>
> Just installed latest from trunk.
>
> I run mergesegs and I get the following error in all tasks log files (I use
> default log4j.properties):
>
> log4j:ERROR setFile(null,true) call failed.
> java.io.FileNotFoundException: / (Is a directory)
>     at java.io.FileOutputStream.openAppend(Native Method)
>     at java.io.FileOutputStream.(FileOutputStream.java:177)
>     at java.io.FileOutputStream.(FileOutputStream.java:102)
>     at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
>     at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
>     at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
>     at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
>     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
>     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
>     at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
>     at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
>     at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
>     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
>     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
>     at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
>     at org.apache.log4j.LogManager.(LogManager.java:122)
>     at org.apache.log4j.Logger.getLogger(Logger.java:104)
>     at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
>     at org.apache.commons.logging.impl.Log4JLogger.(Log4JLogger.java:65)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
>     at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
>     at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:370)
>     at org.apache.hadoop.mapred.TaskTracker.(TaskTracker.java:59)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1346)
> log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
>
>
java.io.FileNotFoundException: / (Is a directory)
Posted by Gal Nitzan <gn...@usa.net>.
Just installed latest from trunk.
I run mergesegs and I get the following error in all tasks log files (I use
default log4j.properties):
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: / (Is a directory)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.(FileOutputStream.java:177)
    at java.io.FileOutputStream.(FileOutputStream.java:102)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:370)
    at org.apache.hadoop.mapred.TaskTracker.(TaskTracker.java:59)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1346)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
Re: record version mismatch occured
Posted by Sami Siren <ss...@gmail.com>.
Gal Nitzan wrote:
> Thanks Sami,
>
> By redo, do you mean re-parse or re-fetch + re-parse?
generate -> fetch -> parse
--
Sami Siren
RE: record version mismatch occured
Posted by Gal Nitzan <gn...@usa.net>.
Thanks Sami,
By redo, do you mean re-parse or re-fetch + re-parse?
-----Original Message-----
From: Sami Siren [mailto:ssiren@gmail.com]
Sent: Friday, January 26, 2007 10:49 PM
To: nutch-dev@lucene.apache.org
Subject: Re: record version mismatch occured
Gal Nitzan wrote:
> Got it. I used latest trunk for a few hours and it seems that it changed the
> version of Crawldatum to ver. 5 :(
My earlier reply was sent too early. One (or more) of your segments has data
written with the newer version. If you haven't updated the crawldb, then you just
need to redo that (those) segment(s).
--
Sami Siren
Re: record version mismatch occured
Posted by Sami Siren <ss...@gmail.com>.
Gal Nitzan wrote:
> Got it. I used latest trunk for a few hours and it seems that it changed the
> version of Crawldatum to ver. 5 :(
My earlier reply was sent too early. One (or more) of your segments has data
written with the newer version. If you haven't updated the crawldb, then you just
need to redo that (those) segment(s).
--
Sami Siren
Re: record version mismatch occured
Posted by Sami Siren <ss...@gmail.com>.
Gal Nitzan wrote:
> Got it. I used latest trunk for a few hours and it seems that it changed the
> version of Crawldatum to ver. 5 :(
Yes, the version is updated on write.
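The behaviour described in this thread (data written by newer code carries the newer version, and older code then refuses to read it) can be sketched as follows. This is a hypothetical record type that only mimics the version check. It is not CrawlDatum, and the rule that the reader rejects any version newer than its own is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical record type illustrating a version byte check on read.
final class VersionedRecord {
    final byte currentVersion; // version this build of the code understands
    int payload;

    VersionedRecord(byte currentVersion) {
        this.currentVersion = currentVersion;
    }

    // The writer always stamps its own (current) version on the record.
    void write(DataOutput out) throws IOException {
        out.writeByte(currentVersion);
        out.writeInt(payload);
    }

    // The reader rejects records stamped with a version newer than its own.
    void readFields(DataInput in) throws IOException {
        byte found = in.readByte();
        if (found > currentVersion) {
            throw new IOException("record version mismatch: expecting v"
                + currentVersion + ", found v" + found);
        }
        payload = in.readInt();
    }

    // Writes with one version, reads with another; returns "ok" or the error text.
    static String roundTrip(int writerVersion, int readerVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new VersionedRecord((byte) writerVersion).write(new DataOutputStream(bytes));
            new VersionedRecord((byte) readerVersion).readFields(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }
}
```

Here roundTrip(5, 4) reproduces the situation in the thread: a segment written by trunk (v5) and read by older code (v4) fails, while roundTrip(4, 5) succeeds because the newer reader accepts the older version.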
RE: record version mismatch occured
Posted by Gal Nitzan <gn...@usa.net>.
Got it. I used latest trunk for a few hours and it seems that it changed the
version of Crawldatum to ver. 5 :(
-----Original Message-----
From: Gal Nitzan [mailto:gnitzan@usa.net]
Sent: Friday, January 26, 2007 4:57 PM
To: nutch-dev@lucene.apache.org
Subject: record version mismatch occured
Trying to mergesegs I get the following, any idea?
A record version mismatch occured. Expecting v4, found v5
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:147)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1175)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1258)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:69)
    at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:139)
    at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:201)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)
record version mismatch occured
Posted by Gal Nitzan <gn...@usa.net>.
Trying to mergesegs I get the following, any idea?
A record version mismatch occured. Expecting v4, found v5
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:147)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1175)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1258)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:69)
    at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:139)
    at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:201)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)
[jira] Updated: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Posted by "Andrew Groh (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Groh updated NUTCH-436:
------------------------------
Description:
If you have a base URL of the form:
http://a/b/c/d;p?q#f
Embedded URL: ?y
Correct Absolute URL: http://a/b/c/d;p?y
Nutch Generated URL: http://a/b/c/?y
Embedded URL: ;x
Correct Absolute URL: http://a/b/c/d;x
Nutch Generated URL: http://a/b/c/;x
See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
http://www.ietf.org/rfc/rfc1808.txt
was:
If you have a base URL of the form:
http://a/b/c/d;p?q#f
Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x
See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
http://www.ietf.org/rfc/rfc1808.txt
[jira] Closed: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes closed NUTCH-436.
------------------------------
Issue closed.
[jira] Resolved: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes resolved NUTCH-436.
--------------------------------
Resolution: Fixed
Patch tested on 10,000 URL run with no apparent issues. Reviewed and committed.
[jira] Assigned: (NUTCH-436) Incorrect handling of relative paths
when the embedded URL path is empty
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes reassigned NUTCH-436:
----------------------------------
Assignee: Dennis Kubes