Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/05 13:38:49 UTC
Need help handling corrupted files
Hey ho,
I have a problem with a URL that seems to be a VCF document.
Let me explain:
When I try to build a Solr index, this URL causes the following
error message:
SEVERE: org.apache.solr.common.SolrException: ERROR: [http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10] multiple values encountered for non multiValued field title: [Universität Kassel, Fachbereich 6 ASL: Faculty Members, Lolita_Hörnlein.vcf]
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:242)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
The URL is:
http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10
When I download it separately, it delivers the following response:
Status=OK - 200
Date=Fri, 05 Aug 2011 11:09:12 GMT
Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
X-Powered-By=PHP/5.2.0-8+etch16
Content-Disposition=attachment; filename=Lolita_Hörnlein.vcf
Pragma=public
Content-Type=text/directory
Set-Cookie=fe_typo_user=316c4c91100f95fb57c5e8d39d32f99d; path=/asl/
Via=1.1 cms.uni-kassel.de
Vary=Accept-Encoding
Content-Encoding=gzip
Content-Length=5043
Keep-Alive=timeout=15, max=99
Connection=Keep-Alive
I have inspected this file and found that it is corrupted: besides the
proper VCF data, it also contains generated HTML code. This seems to be
a misbehaviour of some plugin in the CMS.
My question is how to handle such files. It looks like the parser puts
too many values into the title field, so Solr can't handle it.
As a quick solution, it would be best if I could configure Tika so that
it won't parse the VCF at all, but I don't know how to do that.
Any suggestions for this problem?
Thank you very much.
Re: Need help handling corrupted files
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 05.08.2011 13:50, Julien Nioche wrote:
> Simply change your Solr schema and make the title field multiValued.
Thank you Julien. Perfect first aid! :-)
Re: Need help handling corrupted files
Posted by Julien Nioche <li...@gmail.com>.
Simply change your Solr schema and make the title field multiValued.
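For illustration, that change could look like the following in Solr's schema.xml. This is only a sketch: it assumes the title field is declared roughly as in the schema.xml shipped with Nutch, and the type, stored and indexed values shown here are assumptions. Keep whatever your own schema uses and only add the multiValued attribute.

```xml
<!-- Sketch of the title field in Solr's schema.xml.
     type/stored/indexed are assumed values; keep the ones your schema already has.
     multiValued="true" lets Solr accept more than one title per document. -->
<field name="title" type="text" stored="true" indexed="true" multiValued="true"/>
```

Solr has to be restarted to pick up the schema change. Note that this only makes the indexing error go away: the document is then indexed with both title values, and the HTML that the CMS appends to the vCard is still there.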
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com