Posted to user@tika.apache.org by Tomas Fernandez Lobbe <to...@yahoo.com.ar> on 2009/12/16 20:26:26 UTC

parsing old Excel files

Hi, I'm trying to parse a big set of Microsoft Word and Microsoft Excel files. I'm having a problem with some old Excel files: they are not being parsed (both the metadata and the content are empty after parsing them).


For example, if I run a test similar to ExcelParserTest with my old Excel file, the parsing doesn't return any data.
Debugging the parser code (OfficeParser) a little, I found that this Excel file has no entry named "Workbook"; it has an entry named "Book" instead. Even so, the ExcelExtractor won't work with this file (I tried it).

Has anyone faced this problem before? Does anybody know the earliest Excel version that can be parsed with Tika?

Thanks


Tomás



Re: parsing old Excel files

Posted by Alex Ott <al...@gmail.com>.
Re

Tomas Fernandez Lobbe  at "Wed, 16 Dec 2009 11:26:26 -0800 (PST)" wrote:
 TFL> Hi, I'm trying to parse a big set of Microsoft Word and Microsoft Excel files. I'm having a problem with some old Excel files:
 TFL> they are not being parsed (both the metadata and the content are empty after parsing them).

 TFL> For example, if I run a test similar to ExcelParserTest with my old Excel file, the parsing doesn't return any data.
 TFL> Debugging the parser code (OfficeParser) a little, I found that this Excel file has no entry named "Workbook"; it has an
 TFL> entry named "Book" instead. Even so, the ExcelExtractor won't work with this file (I tried it).

 TFL> Has anyone faced this problem before? Does anybody know the earliest Excel version that can be parsed with Tika?

I think the first supported version is Office 97. Earlier formats aren't
officially documented, although some information about them is available
(in xls2csv from the catdoc package, for example). But these formats are
very different from the MS Office 97-2003 formats.
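
The "Workbook" versus "Book" entry name described in the question can serve as a quick pre-check before handing a file to the parser. What follows is only a minimal sketch under stated assumptions: the class ExcelStreamCheck, its Generation enum, and the classify method are hypothetical names, not part of Tika or POI, and the set of top-level entry names is assumed to have been read from the file by other means. The mapping of "Book" to Excel 5.0/95 reflects the usual understanding of those older files, not something confirmed in this thread.

```java
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class ExcelStreamCheck {

    /** Rough format generation, inferred only from the main stream name. */
    enum Generation { EXCEL_97_OR_LATER, EXCEL_5_OR_95, UNKNOWN }

    /**
     * Classifies a file from the names of its top-level container entries.
     * Assumption: names are compared exactly as given (case-sensitive).
     * "Workbook" is checked first, so it wins if both entries are present.
     */
    static Generation classify(Set<String> topLevelEntryNames) {
        if (topLevelEntryNames.contains("Workbook")) {
            return Generation.EXCEL_97_OR_LATER;
        }
        if (topLevelEntryNames.contains("Book")) {
            return Generation.EXCEL_5_OR_95;
        }
        return Generation.UNKNOWN;
    }

    public static void main(String[] args) {
        // A file like the one in the question: only a "Book" entry is present.
        System.out.println(classify(Set.of("Book"))); // prints EXCEL_5_OR_95
    }
}
```

With a check like this, files in the older generation could be logged or routed to a different tool (such as xls2csv) instead of silently producing empty metadata and content.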


-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://xtalk.msk.su/~ott/
http://alexott-ru.blogspot.com/