Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/06/24 14:30:20 UTC

[jira] [Commented] (TIKA-1138) I got empty body and empty title with some documents

    [ https://issues.apache.org/jira/browse/TIKA-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691929#comment-13691929 ] 

Nick Burch commented on TIKA-1138:
----------------------------------

An empty body and title are often a sign that the parser can't handle the files. There's some discussion on the dev list at the moment about how best to report that, but it hasn't concluded.

As an example, solupro.xls is an Excel 95 file, a format that Apache POI (the library Tika uses for .xls) doesn't handle, which is why you're able to get metadata but not text.
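
Until reporting is sorted out, one way to spot such files on the caller's side is to check whether the XHTML output has an empty body, as in the extract quoted below. The following is only a rough sketch using the JDK's own XML parser; the class and method names (EmptyExtractionCheck, hasEmptyBody) are made up for illustration and are not part of Tika or POI. It assumes you already have the structured text as a String.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika API: flags XHTML output whose <body> has no text.
final class EmptyExtractionCheck {

    /** Returns true if the XHTML has no body element, or a body containing only whitespace. */
    static boolean hasEmptyBody(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The output is in the XHTML namespace, so match elements by local name.
            factory.setNamespaceAware(true);
            // The output carries no DOCTYPE; refuse one rather than resolve external DTDs.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            NodeList bodies = doc.getElementsByTagNameNS("*", "body");
            if (bodies.getLength() == 0) {
                return true;
            }
            return bodies.item(0).getTextContent().trim().isEmpty();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML output", e);
        }
    }

    public static void main(String[] args) {
        // Same shape as the extract in the report: empty title, empty body.
        String empty = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title/></head><body/></html>";
        String withText = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sheet</title></head><body><p>Some cell text</p></body></html>";

        System.out.println("empty extract flagged: " + hasEmptyBody(empty));
        System.out.println("normal extract flagged: " + hasEmptyBody(withText));
    }
}
```

An empty body does not prove the format is unsupported (a genuinely blank document looks the same), so treat a positive result as a hint to inspect the file rather than as an error.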
                
> I got empty body and empty title with some documents
> ----------------------------------------------------
>
>                 Key: TIKA-1138
>                 URL: https://issues.apache.org/jira/browse/TIKA-1138
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.3
>         Environment: Windows 7 (my desktop)
>            Reporter: Koutsoulis Philippe
>              Labels: test
>
> *+Tested version:+* Apache Tika 1.3 (with the Apache Tika GUI)
> Hi all,
> I get an empty body and an empty title with some documents.
> Do you have any idea why?
> *+Extract from my "Structured Text"+*
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> ...
> <title/>
> </head>
> <body/></html>
> {noformat}
> *+Files to reproduce+*
> [http://www.justice.gouv.fr/art_pix/declaration_sexe_20091016.xls]
> [http://ge.ch/ssco_gestats/excel/deinfo_par_ht2004.xls]
> [http://homepage.swissonline.ch/ccvaf1/stock_divers/palmares_ccvaf.xls]
> [http://top1000.anthologeek.net/participants.current.txt]
> [http://ge.ch/ssco_gestats/excel/refona_par_ht2006.xls]
> [http://www.rad.fr/solupro.xls]
> [http://www.pfynschiessen.ch/TClassementgroupeinvite.xls]
> [http://www.gregdonner.org/workbench/wb_31rev.txt]
> (i) No error in logs :(
