Posted to dev@manifoldcf.apache.org by "David Broadfoot (JIRA)" <ji...@apache.org> on 2011/08/15 18:51:27 UTC

[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly

    [ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085161#comment-13085161 ] 

David Broadfoot commented on CONNECTORS-157:
--------------------------------------------

Hi Karl - I've recently run into a very similar problem.

I have a job with this page as a seed:

http://mysare.sare.org/MySareF?do=searchProj&q=*&amp;searchmethod=and&region=&state=&projType=0&sortby=1&page=1

and I want to crawl and ingest pages with

do=viewRept

in the URLs. The links in the HTML look like this:

<a href="?do=viewRept&amp;pn=LNE03-182&amp;y=2004&amp;t=0">2004 Annual Report

When I check the simple history, I see lines like the following in the identifier column:

http://mysare.sare.org/MySare/?do=viewRept&pn=LNC04-240&t=0&y=2006

i.e. the /ProjectReport.aspx part is being omitted from the crawled URL.

Any idea what is going on here?
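
One possible explanation, offered as an assumption rather than a confirmed diagnosis: java.net.URI is documented as following RFC 2396, and under that RFC's resolution rules a query-only reference such as "?do=viewRept..." replaces the last path segment of the base, whereas RFC 3986 keeps the base path and swaps only the query. The sketch below shows the RFC 3986 behaviour for query-only links. The class name QueryOnlyResolve, its resolve helper, and the example.org base URL are all hypothetical and are not part of ManifoldCF.

```java
import java.net.URI;

// Hypothetical helper, not ManifoldCF code: resolves a link found in a page
// against that page's URL, handling query-only links as RFC 3986 describes.
final class QueryOnlyResolve {

    static String resolve(String base, String ref) {
        if (ref.startsWith("?")) {
            // Query-only reference: keep scheme, authority and path of the
            // base, drop its fragment and query, then append the new query.
            String stripped = base;
            int hash = stripped.indexOf('#');
            if (hash >= 0) {
                stripped = stripped.substring(0, hash);
            }
            int question = stripped.indexOf('?');
            if (question >= 0) {
                stripped = stripped.substring(0, question);
            }
            return stripped + ref;
        }
        // Every other kind of reference is left to java.net.URI.
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        // Assumed base URL, chosen only to illustrate the shape of the problem.
        String base = "http://example.org/app/ProjectReport.aspx?do=searchProj&page=1";
        // The href from the page, after HTML entity decoding (&amp; becomes &).
        String ref = "?do=viewRept&pn=LNE03-182&y=2004&t=0";
        System.out.println(resolve(base, ref));
    }
}
```

With the assumed base above, the helper keeps /app/ProjectReport.aspx in the result and replaces only the query string.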



> Root-relative paths without leading / do not resolve properly
> -------------------------------------------------------------
>
>                 Key: CONNECTORS-157
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-157
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 0.1
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.2
>
>
> If a document has a URL which is just the domain, e.g. "http://foo.com", the java.net.URI class fails to resolve URLs in that document which have no starting "/", e.g. "document.pdf".  The resolved URI has no path part, e.g. "http://foo.comdocument.pdf".  This is apparently a bug, but we need to find a way to work around it properly.
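
A minimal sketch of one way to work around the behaviour described above, assuming it is acceptable to give a bare-domain base URL a "/" path before resolving against it. The class name EmptyPathResolve and its resolve helper are hypothetical; this is not necessarily the fix that went into ManifoldCF 0.2.

```java
import java.net.URI;

// Hypothetical workaround, not the actual ManifoldCF fix: make sure the base
// URI has a non-empty path before asking java.net.URI to resolve against it.
final class EmptyPathResolve {

    static String resolve(String base, String ref) {
        URI baseUri = URI.create(base);
        String path = baseUri.getRawPath();
        if (baseUri.getRawAuthority() != null && path != null && path.isEmpty()) {
            // Rebuild the base with "/" as its path, preserving the raw
            // (still percent-encoded) query and fragment if present.
            StringBuilder fixed = new StringBuilder();
            fixed.append(baseUri.getScheme()).append("://")
                 .append(baseUri.getRawAuthority()).append('/');
            if (baseUri.getRawQuery() != null) {
                fixed.append('?').append(baseUri.getRawQuery());
            }
            if (baseUri.getRawFragment() != null) {
                fixed.append('#').append(baseUri.getRawFragment());
            }
            baseUri = URI.create(fixed.toString());
        }
        return baseUri.resolve(ref).toString();
    }

    public static void main(String[] args) {
        // The example from the issue description.
        System.out.println(resolve("http://foo.com", "document.pdf"));
    }
}
```

For the example in the description, the helper resolves "document.pdf" against "http://foo.com/" instead of "http://foo.com", so the result contains the separating slash.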
