You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@tika.apache.org by Matt Bachmann <ba...@gmail.com> on 2015/02/05 00:15:23 UTC

Advice on parsing Spreadsheets and preserving cell positions

Howdy!

So I am working with TIKA to help me parse Office types to pull out text. I
would like to preserve the structure of the text as much as possible.

When I play with the TIKA jar file with a simple excel file I get something
like what I have below. Code I write to do the parsing pulls out something
similar.  The data is generally correct. But, in the parsing the position
of cells is completely lost. For example, the xls that I used here
contained the cells in the middle of the spreadsheet. But that positioning
is lost in the output here. I would like to say I found a bit of text at a
specific row/column/sheet.

Is this possible with TIKA? I have google around and have not found much.
Do I have to drop down to POI to do this?

Thanks! Sorry if this is a super obvious question.

-Matt

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="cp:revision" content="1" />
  <meta name="date" content="2013-08-10T00:26:18Z" />
  <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
  <meta name="X-Parsed-By"
content="org.apache.tika.parser.microsoft.OfficeParser" />
  <meta name="meta:creation-date" content="2013-08-10T00:25:37Z" />
  <meta name="Last-Printed" content="1601-01-01T00:00:00Z" />
  <meta name="Creation-Date" content="2013-08-10T00:25:37Z" />
  <meta name="meta:print-date" content="1601-01-01T00:00:00Z" />
  <meta name="resourceName" content="spreadsheet.xls" />
  <meta name="dcterms:created" content="2013-08-10T00:25:37Z" />
  <meta name="dcterms:modified" content="2013-08-10T00:26:18Z" />
  <meta name="Last-Modified" content="2013-08-10T00:26:18Z" />
  <meta name="Last-Save-Date" content="2013-08-10T00:26:18Z" />
  <meta name="Revision-Number" content="1" />
  <meta name="meta:save-date" content="2013-08-10T00:26:18Z" />
  <meta name="modified" content="2013-08-10T00:26:18Z" />
  <meta name="Content-Length" content="5632" />
  <meta name="Content-Type" content="application/vnd.ms-excel" />

  <title></title>
</head>

<body>
  <div class="page">
    <h1>Sheet1</h1>

    <table>
      <tbody>
        <tr>
          <td>A</td>

          <td>B</td>

          <td>D</td>
        </tr>

        <tr>
          <td>1</td>

          <td>2</td>

          <td>3</td>
        </tr>
      </tbody>
    </table>
  </div>
</body>
</html>

Re: Advice on parsing Spreadsheets and preserving cell positions

Posted by Nick Burch <ap...@gagravarr.org>.

On Wed, 4 Feb 2015, Matt Bachmann wrote:
> When I play with the TIKA jar file with a simple excel file I get 
> something like what I have below. Code I write to do the parsing pulls 
> out something similar.  The data is generally correct. But, in the 
> parsing the position of cells is completely lost.

That's to be expected - Tika will only return you text for cells which are 
really defined in the file. It won't generate dummy entries for "missing" 
cells which Excel optimised out of the file for being blank. This avoids 
bloating the Tika output, and keeps the Tika code much simpler

> Is this possible with TIKA? I have google around and have not found 
> much. Do I have to drop down to POI to do this?

You'll need to use POI if you want full control over missing rows or 
missing cells.

For working with .xls files, you'd probably want something like the 
example "missing records aware" streaming xls to csv converter:
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/XLS2CSVmra.java
For .xlsx you'll need some similar logic in a sax-based parser

Or, if you have the memory, it's all very easy, as detailed on the site:
http://poi.apache.org/spreadsheet/quick-guide.html#Iterator

Nick