Posted to user@tika.apache.org by "lgilardoni61@gmail.com" <lg...@gmail.com> on 2015/01/29 23:05:47 UTC

Question .. how NOT to skip empty paragraphs

We use Tika to process Word documents (among others).

In one application we rely on empty paragraphs to recognize specific
parts of the text: in the source documents, empty paragraphs separate
the blocks (OK, I know, not the best way to use Word, but this is what
we have, as part of an old legacy system).

When we extract plain text from Word, these empty paragraphs are completely
removed (although they are still present in the XHTML representation).

Any suggestions for preserving these empty paragraphs, so that they appear
in the extracted string as a double newline (\n\n), without fetching and
parsing the XHTML?

Any help is welcome.

LG

Re: Question .. how NOT to skip empty paragraphs

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 29 Jan 2015, lgilardoni61@gmail.com wrote:
> When we extract plain text from Word, these empty paragraphs are completely
> removed (although they are still present in the XHTML representation).
>
> Any suggestions for preserving these empty paragraphs, so that they appear
> in the extracted string as a double newline (\n\n), without fetching and
> parsing the XHTML?

What's wrong with parsing at the xhtml level?

I'd suggest writing a custom handler which normally just looks at the
characters and whitespace (much as the to-text handlers do), but which
also adds a tiny bit of logic to detect empty paragraphs. That detection
then triggers the "this is a new block" behaviour in your code.

Custom handlers are surprisingly easy to write; take a look at the Tika
Examples package for a few.

Nick
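
For readers following the thread, here is a minimal sketch of the kind of
custom handler suggested above. It is not code from the Tika Examples
package: the class name EmptyParagraphTextHandler is hypothetical, and the
sketch assumes that paragraphs arrive as XHTML <p> elements and that only
text inside paragraphs matters (headings, tables and other elements are
ignored for brevity). Each paragraph is written out followed by a single
newline, so an empty paragraph shows up as a double \n\n in the result.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects paragraph text and
// keeps empty paragraphs as blank lines.
class EmptyParagraphTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();      // whole output
    private final StringBuilder paragraph = new StringBuilder(); // current <p>
    private boolean inParagraph = false;

    // Assumption: paragraphs are reported as "p", either as local name
    // (namespace-aware parsing) or as qualified name.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (isParagraph(localName, qName)) {
            inParagraph = true;
            paragraph.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside paragraphs is dropped in this simplified sketch.
        if (inParagraph) {
            paragraph.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isParagraph(localName, qName)) {
            // A whitespace-only paragraph trims to "", so it contributes
            // just the newline, which yields the blank line we want.
            text.append(paragraph.toString().trim()).append('\n');
            inParagraph = false;
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

With Tika on the classpath, an instance of such a handler can be passed as
the ContentHandler argument of Parser.parse(stream, handler, metadata,
context), and the extracted text read back with toString() afterwards.
Splitting the result on "\n\n" then gives the blocks.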