Posted to user@manifoldcf.apache.org by bu...@gmx.de on 2011/02/17 13:27:20 UTC
URISyntaxException
Hi all,
I just checked out the newest version of MCF, and now I am getting this
error while crawling certain pages. What can I do about it?
Error Message:
java.net.URISyntaxException: Illegal character in path at index 73:
/link/to/the/page/alan smithee.xls
at java.net.URI$Parser.fail(URI.java:2809)
at java.net.URI$Parser.checkChars(URI.java:2982)
at java.net.URI$Parser.parseHierarchical(URI.java:3066)
at java.net.URI$Parser.parse(URI.java:3024)
at java.net.URI.<init>(URI.java:578)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.makeDocumentIdentifier(WebcrawlerConnector.java:4774)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:5586)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:5701)
at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:48)
at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:223)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:6492)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:5553)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1132)
at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
How I set it up (I hope this helps):
- installed PostgreSQL 8.3.11-1
- checked out the project into the MCF folder
- added jcifs1.2.15.jar at /connectors/jcifs/jcifs and renamed it to jcifs.jar
- built the project with ant at /mcf
- copied the content of "dist" to c:/documents and settings/myUserAccount/lcf
- added the properties.xml and the logging.ini there
- created a synchronization folder
- set MCF_HOME to the folder above
- executed these commands in /processes/scripts:
org.apache.manifoldcf.core.DBCreate postgres p0sTgres
org.apache.manifoldcf.agents.Install
org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "SOLR Connector"
org.apache.manifoldcf.authorities.RegisterAuthority org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority "Active Directory Authority"
org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector "Database Connector"
org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector "Windows Share Connector"
org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.rss.RSSConnector "RSS Connector"
org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Connector"
- and copied the content of /lcf/web/war to my /tomcat/webapps
Thanks for your help and best regards,
Julian
Re: URISyntaxException
Posted by fred fredson <bu...@gmx.de>.
Hi,
thanks for your quick reply and your explanation.
~Julian
>
> -------- Original Message --------
> Date: Thu, 17 Feb 2011 08:03:04 -0500
> From: Karl Wright <da...@gmail.com>
> To: connectors-user@incubator.apache.org
> Subject: Re: URISyntaxException
>
> Hi,
> You've done nothing wrong; the stack trace is being dumped because of
> a debugging line that was inadvertently left in the code recently. It
> should not change the way the crawl occurs. Regardless, I've removed
> the offending line from trunk now.
>
> In case you are curious, what is happening is that the page link the
> crawler has located is not properly URI encoded. Space characters are
> illegal in URIs. Normally, the web connector would skip this link
> and note that in the log.
>
> Thanks,
> Karl
Re: URISyntaxException
Posted by Karl Wright <da...@gmail.com>.
Hi,
You've done nothing wrong; the stack trace is being dumped because of
a debugging line that was inadvertently left in the code recently. It
should not change the way the crawl occurs. Regardless, I've removed
the offending line from trunk now.
In case you are curious, what is happening is that the page link the
crawler has located is not properly URI encoded. Space characters are
illegal in URIs. Normally, the web connector would skip this link
and note that in the log.
Thanks,
Karl
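For anyone who wants to reproduce the behaviour outside of MCF, here is a minimal sketch using only java.net.URI. It is not the web connector's actual code; the class name UriSpaceDemo and the methods isValidUri and encodePath are made up for illustration. It shows that the single-argument URI constructor rejects a path containing a space, and that the multi-argument constructor percent-encodes the space instead.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; not part of ManifoldCF.
public class UriSpaceDemo {

    /** Returns true if java.net.URI accepts the raw link, false if it rejects it. */
    public static boolean isValidUri(String rawLink) {
        try {
            // The single-argument constructor parses strictly and does no encoding.
            new URI(rawLink);
            return true;
        } catch (URISyntaxException e) {
            // A crawler could skip the link at this point and log the reason.
            System.err.println("Skipping link: " + e.getMessage());
            return false;
        }
    }

    /** Builds a URI from an unencoded path; this constructor quotes illegal characters. */
    public static String encodePath(String rawPath) throws URISyntaxException {
        // Arguments are scheme, host, path, fragment; only the path is set here.
        return new URI(null, null, rawPath, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String link = "/link/to/the/page/alan smithee.xls";
        System.out.println(isValidUri(link));  // false: the space is illegal
        System.out.println(encodePath(link));  // /link/to/the/page/alan%20smithee.xls
    }
}
```

Whether a crawler should silently repair such links or skip them is a design choice; the sketch only shows both options side by side.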