Posted to user@manifoldcf.apache.org by bu...@gmx.de on 2011/02/17 13:27:20 UTC

URISyntaxException

Hi all,

I just checked out the newest version of MCF, and now I am getting this
error while crawling certain pages. What can I do about it?

Error Message:

java.net.URISyntaxException: Illegal character in path at index 73: /link/to/the/page/alan smithee.xls
        at java.net.URI$Parser.fail(URI.java:2809)
        at java.net.URI$Parser.checkChars(URI.java:2982)
        at java.net.URI$Parser.parseHierarchical(URI.java:3066)
        at java.net.URI$Parser.parse(URI.java:3024)
        at java.net.URI.<init>(URI.java:578)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.makeDocumentIdentifier(WebcrawlerConnector.java:4774)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:5586)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:5701)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:48)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:223)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:6492)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:5553)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1132)
        at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)


How I set it up (hope that helps):

    - installed PostgreSQL 8.3.11-1
    - checked out the project into the MCF folder
    - added jcifs1.2.15.jar at /connectors/jcifs/jcifs and renamed it to jcifs.jar
    - built the project with ant at /mcf
    - copied the content of "dist" to c:/documents and settings/myUserAccount/lcf
    - added the properties.xml and the logging.ini there
    - created a synchronization folder
    - set MCF_HOME to the folder above
    - executed these commands in /processes/scripts:
        org.apache.manifoldcf.core.DBCreate postgres p0sTgres
        org.apache.manifoldcf.agents.Install
        org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
        org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "SOLR Connector"
        org.apache.manifoldcf.authorities.RegisterAuthority org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority "Active Directory Authority"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector "Database Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector "Windows Share Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.rss.RSSConnector "RSS Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Connector"
    - and copied the content of /lcf/web/war to my /tomcat/webapps

Thanks for your help and best regards,
Julian

Re: URISyntaxException

Posted by fred fredson <bu...@gmx.de>.
Hi,
thanks for your quick reply and your explanation.

~Julian

> -------- Original Message --------
> Date: Thu, 17 Feb 2011 08:03:04 -0500
> From: Karl Wright <da...@gmail.com>
> To: connectors-user@incubator.apache.org
> Subject: Re: URISyntaxException
>
> Hi,
> You've done nothing wrong; the stack trace is being dumped because of
> a debugging line that was inadvertently left in the code recently.  It
> should not change the way the crawl occurs.  Regardless, I've removed
> the offending line from trunk now.
>
> In case you are curious, what is happening is that the page link the
> crawler has located is not properly URI encoded.  Space characters are
> illegal in URIs.  Normally, the web connector would skip this link
> and note that to the log.
>
> Thanks,
> Karl

Re: URISyntaxException

Posted by Karl Wright <da...@gmail.com>.
Hi,
You've done nothing wrong; the stack trace is being dumped because of
a debugging line that was inadvertently left in the code recently.  It
should not change the way the crawl occurs.  Regardless, I've removed
the offending line from trunk now.

In case you are curious, what is happening is that the page link the
crawler has located is not properly URI encoded.  Space characters are
illegal in URIs.  Normally, the web connector would skip this link
and note that to the log.

Thanks,
Karl
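
The behaviour described above can be reproduced with plain java.net.URI.
The following is only a minimal sketch, not ManifoldCF's actual connector
code: the class name UriEncodingSketch and the helpers parseOrSkip and
quotePath are hypothetical names made up for illustration. It shows that
the single-argument URI constructor rejects a path containing a space,
how a caller might skip and log such a link, and how the multi-argument
URI constructor percent-encodes the space instead.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.logging.Logger;

// Hypothetical illustration only; this is NOT the web connector's real code.
final class UriEncodingSketch {
    private static final Logger LOG = Logger.getLogger("UriEncodingSketch");

    // Parse a raw link as-is. If it is not a legal URI (for example because
    // it contains an unencoded space), log the problem and return null so
    // the caller can skip the link.
    static URI parseOrSkip(String rawLink) {
        try {
            return new URI(rawLink);
        } catch (URISyntaxException e) {
            LOG.warning("Skipping illegal link '" + rawLink + "': " + e.getMessage());
            return null;
        }
    }

    // Build a URI from an unencoded path. The multi-argument constructor
    // quotes characters that are illegal in a path, so a space becomes %20.
    static URI quotePath(String unencodedPath) {
        try {
            return new URI(null, null, unencodedPath, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String link = "/link/to/the/page/alan smithee.xls";
        System.out.println(parseOrSkip(link)); // skipped, so this prints null
        System.out.println(quotePath(link));   // prints the path with %20
    }
}
```

Note that the multi-argument constructors always quote the percent
character, so quotePath is only suitable for paths that are not already
percent-encoded; an already-encoded link would be encoded twice.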

