Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/07/19 09:19:15 UTC
How does Tika put whitespace between tags
Hi,
We're having an issue with Boilerpipe: it emits no whitespace between terms that are separated only by tags. The ordinary Tika HTML parser handles this correctly. Take the following HTML, for example:
abc<br>def<br>xyz
Without Boilerpipe (BP) this becomes: abc def xyz
With BP it becomes: abcdefxyz
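To make the two outcomes concrete, here is a minimal stand-alone sketch that reproduces both results for the sample above. It uses only the Java standard library and a naive regular expression; the class and method names (TagWhitespaceDemo, stripWithSpace, stripWithoutSpace) are made up for illustration, and this is not a claim about how Tika or Boilerpipe actually process tags internally.

```java
import java.util.regex.Pattern;

class TagWhitespaceDemo {
    // Naive tag matcher, for illustration only (hypothetical helper,
    // not taken from Tika or Boilerpipe).
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replace every tag with a space, then collapse runs of whitespace.
    static String stripWithSpace(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Drop every tag without emitting any separator.
    static String stripWithoutSpace(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "abc<br>def<br>xyz";
        System.out.println(stripWithSpace(html));    // abc def xyz
        System.out.println(stripWithoutSpace(html)); // abcdefxyz
    }
}
```

The first method corresponds to the output we want, the second to what we currently get with BP.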
How does the Tika parser determine when to put whitespace between tags? What about languages that are written without whitespace? When testing with ordinary Chinese pages, I see whitespace being added there too.
Also, any hints as to where to look for the problem in the Boilerpipe code would be appreciated.
Thanks,
Markus