Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/05 13:38:49 UTC

Need help handling corrupted files

Hey ho,

I have a problem with a URL that appears to serve a vCard (.vcf) document.
Let me explain:

When I try to build a Solr index, this URL triggers the following error message:

SEVERE: org.apache.solr.common.SolrException: ERROR: [http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10] multiple values encountered for non multiValued field title: [Universität Kassel, Fachbereich 6 ASL: Faculty Members, Lolita_Hörnlein.vcf]
         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:242)
         at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
         at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
         at org.mortbay.jetty.Server.handle(Server.java:326)
         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
         at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


The URL is:

http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10 


When I download it separately, the server returns the following response:

Status=OK - 200
Date=Fri, 05 Aug 2011 11:09:12 GMT
Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
X-Powered-By=PHP/5.2.0-8+etch16
Content-Disposition=attachment; filename=Lolita_Hörnlein.vcf
Pragma=public
Content-Type=text/directory
Set-Cookie=fe_typo_user=316c4c91100f95fb57c5e8d39d32f99d; path=/asl/
Via=1.1 cms.uni-kassel.de
Vary=Accept-Encoding
Content-Encoding=gzip
Content-Length=5043
Keep-Alive=timeout=15, max=99
Connection=Keep-Alive

I inspected this file and found that it is corrupted: besides the proper 
vCard data, it also contains generated HTML code. This seems to be caused 
by a misbehaving plugin in the CMS.
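
A check like the one described above could be sketched roughly as follows. This is only an illustration: the function names and the sample payload are made up, and it assumes the file is supposed to contain exactly one vCard wrapped in BEGIN:VCARD / END:VCARD.

```python
import re

# Matches the first BEGIN:VCARD ... END:VCARD envelope (assumption: one card per file).
VCARD_RE = re.compile(r"BEGIN:VCARD.*?END:VCARD", re.IGNORECASE | re.DOTALL)


def split_vcard_payload(text):
    """Return (vcard_block, everything_else) for a downloaded payload."""
    match = VCARD_RE.search(text)
    if match is None:
        # No vCard envelope at all: treat the whole payload as "extra".
        return "", text.strip()
    extra = (text[:match.start()] + text[match.end():]).strip()
    return match.group(0), extra


def looks_corrupted(text):
    """True if the vCard is missing or anything surrounds the envelope."""
    vcard, extra = split_vcard_payload(text)
    return vcard == "" or extra != ""


# Hypothetical payloads, not the real file from the CMS.
clean = "BEGIN:VCARD\nVERSION:3.0\nFN:Example Person\nEND:VCARD\n"
mixed = clean + "<html><head><title>Faculty Members</title></head></html>"

print(looks_corrupted(clean))
print(looks_corrupted(mixed))
```

The second payload is flagged because HTML markup follows the END:VCARD line, which is the kind of mix the inspection turned up.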

My question is how to handle such files. It looks like the parser puts 
too many values into the title field, so Solr rejects the document.
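
To illustrate what the error message is complaining about, here is a small sketch that mimics the check in plain Python. It is not Solr code; the document layout, the function name, and the split of the title into two values (page title and file name) are assumptions made for the example.

```python
def offending_fields(doc, multivalued):
    """List fields that carry more than one value but are not declared multiValued."""
    return sorted(
        name
        for name, values in doc.items()
        if len(values) > 1 and name not in multivalued
    )


# Assumed shape of the document sent to Solr: the title field ends up with
# two values, one from the HTML part and one from the attachment file name.
doc = {
    "url": ["http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10"],
    "title": [
        "Universität Kassel, Fachbereich 6 ASL: Faculty Members",
        "Lolita_Hörnlein.vcf",
    ],
}

print(offending_fields(doc, multivalued=set()))
print(offending_fields(doc, multivalued={"title"}))
```

With no multiValued fields declared, title is reported. Once title is declared multiValued, nothing is reported.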

As a quick fix, it would be best if I could configure Tika so that it 
doesn't parse the .vcf file, but I don't know how to do that.
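
A different route from reconfiguring Tika would be to keep these URLs out of the crawl with a URL filter rule. The sketch below only simulates, in Python, how rules in the style of Nutch's regex-urlfilter.txt are evaluated ('-' excludes, '+' includes, first match decides), so that a candidate pattern can be tried out. The pattern itself is an assumption and would need checking against how the URL looks after normalization.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt.
# The first rule whose pattern is found in the URL decides the outcome.
RULES = [
    ("-", r"tx_wtdirectory_pi1(%5B|\[)vCard(%5D|\])="),  # skip vCard downloads
    ("+", r"."),                                          # accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the URL would pass the filter, False if it is dropped."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is ignored


vcard_url = "http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10"
page_url = "http://cms.uni-kassel.de/asl/en/fb/staff.html"

print(accepts(vcard_url))
print(accepts(page_url))
```

A rule like this only affects URLs filtered from then on. Entries already in the crawl data would still have to be dealt with separately.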

Any suggestions for this problem?

Thank you very much.



Re: Need help handling corrupted files

Posted by Marek Bachmann <m....@uni-kassel.de>.
On 05.08.2011 13:50, Julien Nioche wrote:
> Simply change your Solr schema and make the title field multiValued.

Thank you Julien. Perfect first aid! :-)


Re: Need help handling corrupted files

Posted by Julien Nioche <li...@gmail.com>.
Simply change your Solr schema and make the title field multiValued.
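
As a rough sketch of that change, the snippet below sets multiValued="true" on the title field of a schema. The schema shown is a made-up, heavily trimmed stand-in for a real schema.xml, and the helper name is hypothetical. In practice the same attribute can simply be added by hand in an editor.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down schema; a real schema.xml has many more entries.
SCHEMA = """<schema name="nutch" version="1.3">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
  </fields>
</schema>"""


def make_multivalued(schema_xml, field_name):
    """Return the schema with multiValued="true" set on the named field."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            field.set("multiValued", "true")
    return ET.tostring(root, encoding="unicode")


updated = make_multivalued(SCHEMA, "title")
print(updated)
```

Solr only picks up the changed schema after a restart or core reload, and documents that failed earlier have to be sent again.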



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com