Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/07/19 09:19:15 UTC

How does Tika put whitespace between tags?

Hi,

We're having an issue with Boilerpipe and the lack of whitespace between tags and terms. The ordinary Tika HTML parser does the job right. Take the following HTML for example:

abc<br>def<br>xyz

Without Boilerpipe (BP) this becomes: abc def xyz
With BP this becomes: abcdefxyz
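
To make the difference concrete, here is a toy sketch of the two behaviours. This is not Tika or Boilerpipe code; the class and method names are made up for illustration, and it assumes a tag is simply either replaced by a space or dropped:

```java
// Toy illustration only: NOT Tika's or Boilerpipe's actual code.
// TagWhitespaceDemo and stripTags are hypothetical names.
final class TagWhitespaceDemo {
    private TagWhitespaceDemo() {}

    /**
     * Strips tags from a small HTML fragment.
     * If spaceForTags is true, each tag is replaced by a single space;
     * otherwise the tag is dropped and neighbouring terms run together.
     */
    static String stripTags(String html, boolean spaceForTags) {
        String text = html.replaceAll("<[^>]+>", spaceForTags ? " " : "");
        // Collapse runs of whitespace and trim the ends.
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Calling stripTags("abc<br>def<br>xyz", true) gives the output I see without BP, and passing false gives the output I see with BP.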

How does the Tika parser determine when to put whitespace between tags? What about languages that do not separate words with whitespace? When testing with ordinary Chinese pages, I see whitespace being added there too.
Also, any hints as to where to look for the problem in the Boilerpipe code would be appreciated.

Thanks,
Markus