Posted to dev@manifoldcf.apache.org by "David Broadfoot (JIRA)" <ji...@apache.org> on 2011/08/15 18:51:27 UTC
[jira] [Commented] (CONNECTORS-157) Root-relative paths without
leading / do not resolve properly
[ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085161#comment-13085161 ]
David Broadfoot commented on CONNECTORS-157:
--------------------------------------------
Hi Karl - I've recently run into a very similar problem.
I have a job with this page as a seed:
http://mysare.sare.org/MySareF?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1
and I want to crawl/ingest pages with
do=viewRept
in the urls. The links in the html are like this:
<a href="?do=viewRept&pn=LNE03-182&y=2004&t=0">2004 Annual Report
When I check the simple history, I see lines like the following in the identifier column:
http://mysare.sare.org/MySare/?do=viewRept&pn=LNC04-240&t=0&y=2006
i.e. the /ProjectReport.aspx part is being omitted from the crawled URL.
Any idea what is going on here?
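For reference, the symptom above matches RFC 2396-style resolution of a query-only reference: the last segment of the base path is dropped before the query is appended, whereas RFC 3986 keeps the base path and replaces only the query. Below is a minimal sketch of a possible workaround in Java, assuming links are resolved with java.net.URI and the page URI is absolute and hierarchical. The class name QueryOnlyResolver, the method resolveLink, and the example.com base URL are hypothetical and are not taken from the ManifoldCF code.

```java
import java.net.URI;

class QueryOnlyResolver {
    /**
     * Resolves a link found in a page against the page's own URI.
     * A reference that starts with '?' keeps the base path and replaces
     * only the query (the RFC 3986 rule); every other reference is
     * delegated to URI.resolve.
     * Assumption: base is an absolute, hierarchical URI with an authority.
     */
    static URI resolveLink(URI base, String reference) {
        if (reference.startsWith("?")) {
            // Rebuild from raw components so existing percent-escapes survive.
            String prefix = base.getScheme() + "://"
                    + base.getRawAuthority() + base.getRawPath();
            return URI.create(prefix + reference);
        }
        return base.resolve(URI.create(reference));
    }

    public static void main(String[] args) {
        // Hypothetical page URL, for illustration only.
        URI page = URI.create("http://example.com/app/ProjectReport.aspx?do=searchProj");
        String href = "?do=viewRept&pn=LNE03-182&y=2004&t=0";

        // Plain java.net.URI resolution, for comparison.
        System.out.println(page.resolve(URI.create(href)));
        // Resolution that keeps the base path for query-only references.
        System.out.println(resolveLink(page, href));
    }
}
```

Special-casing only references that begin with '?' leaves all other relative links on the standard resolution path, so the change stays narrow.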
> Root-relative paths without leading / do not resolve properly
> -------------------------------------------------------------
>
> Key: CONNECTORS-157
> URL: https://issues.apache.org/jira/browse/CONNECTORS-157
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 0.1
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 0.2
>
>
> If a document has a URL which is just the domain, e.g. "http://foo.com", the java.net.URI class fails to resolve URLs in that document which have no starting "/", e.g. "document.pdf". The resolved URI has no path part, e.g. "http://foo.comdocument.pdf". This is apparently a bug, but we need to find a way to work around it properly.
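One possible workaround for the case described in the issue is to give a base URI with an empty path a path of "/" before resolving against it. The following is a minimal sketch in Java, not the fix that was actually committed; the class name EmptyPathBase and the methods withRootPath and resolve are hypothetical, and the sketch assumes the base is an absolute, hierarchical URI with an authority.

```java
import java.net.URI;

class EmptyPathBase {
    /**
     * Returns the base unchanged unless it is an absolute, hierarchical URI
     * whose path is empty, in which case the path "/" is substituted.
     */
    static URI withRootPath(URI base) {
        if (!base.isAbsolute() || base.isOpaque()) {
            return base;
        }
        String path = base.getRawPath();
        if (path != null && !path.isEmpty()) {
            return base;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(base.getScheme()).append("://")
          .append(base.getRawAuthority()).append('/');
        if (base.getRawQuery() != null) {
            sb.append('?').append(base.getRawQuery());
        }
        if (base.getRawFragment() != null) {
            sb.append('#').append(base.getRawFragment());
        }
        return URI.create(sb.toString());
    }

    /** Resolves a reference against the base after fixing an empty path. */
    static URI resolve(URI base, String reference) {
        return withRootPath(base).resolve(URI.create(reference));
    }

    public static void main(String[] args) {
        URI base = URI.create("http://foo.com");
        // Plain java.net.URI resolution, for comparison.
        System.out.println(base.resolve("document.pdf"));
        // Resolution against the base with "/" as its path.
        System.out.println(resolve(base, "document.pdf"));
    }
}
```

Bases that already have a path are passed through untouched, so only the domain-only case changes behaviour.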