Posted to dev@tika.apache.org by Julien Nioche <li...@gmail.com> on 2008/02/12 11:22:48 UTC

get markup information via ContentHandler for OfficeParser

Hi,

Congratulations first: I have been following Tika for a little bit now and
am very happy to see a first release of it. Well done everybody!

I am particularly interested in the project as we work on text analysis with
GATE and UIMA. Obviously being able to extract text from different formats
is crucial for what we do and so is the extraction of the markup
information. That leads me to the following question: how difficult would it
be to get the OfficeParser to generate information about the markup (pages,
headers, tables, etc.)? I am not a POI expert at all; is this supported
by it?

Thanks,

Julien

PS: I will probably go to the Apache EU conference. Anyone from the Tika
community going there?
<http://www.digitalpebble.com>

Re: get markup information via ContentHandler for OfficeParser

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Feb 17, 2008 10:33 AM, Jukka Zitting <ju...@gmail.com> wrote:

> >... PS: I will probably go to the Apache EU conference. Anyone from the Tika
> > community going there?
>
> I'll be there and I think a few other people as well....

I'll be there, at the Hackathon and conference.
-Bertrand

Re: get markup information via ContentHandler for OfficeParser

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Feb 12, 2008 12:22 PM, Julien Nioche <li...@gmail.com> wrote:
> Congratulations first: I have been following Tika for a little bit now and
> am very happy to see a first release of it. Well done everybody!

Great to hear that, thanks!

> I am particularly interested in the project as we work on text analysis with
> GATE and UIMA. Obviously being able to extract text from different formats
> is crucial for what we do and so is the extraction of the markup
> information. That leads me to the following question: how difficult would it
> be to get the OfficeParser to generate information about the markup (pages,
> headers, tables, etc.)? I am not a POI expert at all; is this supported
> by it?

I think we should be able to do that, and since one of Tika's goals is
to support extraction of "structured text", doing that is right there
on our charter. However, since Tika is supposed to be a generic tool,
we probably don't want to replicate the structure of any specific
format in too much detail. You can always use the specific parser
libraries for details.

My proposal would be to try to support at least the following basic
structural constructs in all parsers that have the required
information:

    <div class="page"/>
    <h1/>
    <p/>
    <table/>
    <a/>

We could add more constructs based on existing demand.
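
To make the idea a bit more concrete, here is a rough sketch of how a
client could pick up such structural events through a plain SAX
ContentHandler. It uses only the JDK's SAX classes; the class name
StructureSketch, the "div.page" key and the sample XHTML string are
made up for illustration and are not actual parser output. A real
parser would emit the same kind of SAX events to the handler directly
rather than going through an XML string.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts the structural elements it is told about.
class StructureSketch extends DefaultHandler {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // The default JDK parser is not namespace-aware, so the element
        // name arrives in qName.
        String key = qName;
        // Treat <div class="page"> as its own construct (assumed convention).
        if ("div".equals(qName) && "page".equals(atts.getValue("class"))) {
            key = "div.page";
        }
        counts.merge(key, 1, Integer::sum);
    }

    // Parses an XHTML string and returns element counts in first-seen order.
    static Map<String, Integer> count(String xhtml) throws Exception {
        StructureSketch handler = new StructureSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for structured parser output.
        String sample = "<html><body><div class=\"page\">"
                + "<h1>Title</h1><p>First paragraph</p>"
                + "<table><tr><td>cell</td></tr></table>"
                + "<p>See <a href=\"http://example.org\">a link</a></p>"
                + "</div></body></html>";
        System.out.println(count(sample));
    }
}
```

A text analysis tool could use the same hook to turn page, heading,
paragraph, table and link boundaries into annotations instead of
simply counting them.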

> PS: I will probably go to the Apache EU conference. Anyone from the Tika
> community going there?

I'll be there and I think a few other people as well.

BR,

Jukka Zitting