Posted to dev@tika.apache.org by qubit <la...@yahoo.com> on 2010/11/10 21:34:57 UTC

tika and plain text -- bug or feature?

Greetings.
I have been sifting through the code in TextParser.java and the various 
content handlers it invokes, and I have some questions.

First, it appears that the code in TextParser.java thinks it is dealing with 
a file in plain text (isn't that the same as text/plain?).
However, it is output as xhtml with very little processing.  I think I 
mentioned before that things like '<' should be translated to '&lt;' and '&' 
should become '&amp;'.
I noticed the header and footer elements you output for the file.  But this 
translation, and probably other insertions, need to be made to the text 
within the file itself, not the header/footer.  Otherwise the rendered xhtml 
will be wrong.

I have been trying to make this patch myself by looking at your code, which 
has taken me into the SAX content handlers, and I have one question:
Is this code considered complete? I find XHTMLContentHandler code, which 
calls SafeContentHandler code.
But I gather these methods have a different purpose than what I am looking 
at.
I thought to create a subclass SafeTextContentHandler of SafeContentHandler 
to override the write function and provide the necessary replacement 
strings.
Or I could just handcode an extra check inside existing methods, but I think 
this would endanger other code that depends on these classes.

Anyway, please comment if you think I'm not approaching it the right way.
I wanted to do this myself rather than just report a bug, as I want some 
experience with this source code, partly because I am new to Java and partly 
because I may be using Tika in a project I'm working on.

So any comments are welcome.
Thank you.
--le

RE: tika and plain text -- bug or feature?

Posted by Jukka Zitting <jz...@adobe.com>.

Hi,

From: qubit [mailto:lauraeaves@yahoo.com]
> Then perhaps I am in the wrong place in the code... or I am still not
> understanding all that sax is doing.  The translation needs to be done
> however because you are essentially outputting a plain text file as if
> it were xhtml with only header and footer elements slapped on the ends.

Actually we aren't. The relevant part of the code in TXTParser is:

    XHTMLContentHandler xhtml = ...;
    xhtml.startDocument();

    xhtml.startElement("p");
    xhtml.characters(...);
    xhtml.endElement("p");

    xhtml.endDocument();

(The XHTMLContentHandler class hides some of the complexities, but the basic idea is the same as with a plain ContentHandler instance.)

The startElement() and endElement() calls above are handled differently from the characters() call. If we were simply outputting text as you assume, we could rewrite part of the above as:

    xhtml.characters("<p>");
    xhtml.characters(...);
    xhtml.characters("</p>");

That wouldn't work: the ContentHandler instance that serializes these SAX events is responsible for writing out the start and end tags triggered by startElement()/endElement() calls, and for correctly escaping the character data given in characters() events.
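
To make the difference concrete, here is a minimal sketch using only plain JAXP classes, not Tika code. The class and method names (SaxEventsDemo, withElementEvents, withTagsAsCharacters) and the wrapping "body" element are made up for illustration. It sends the same text through a serializing handler twice: once with real element events, once with the tags passed as character data.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SaxEventsDemo {

    // Illustrative helper: a JAXP handler that serializes SAX events into 'out'.
    private static TransformerHandler serializerTo(StringWriter out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    private static void text(TransformerHandler handler, String s) throws Exception {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    /** The paragraph is announced with startElement()/endElement() events. */
    static String withElementEvents(String content) throws Exception {
        StringWriter out = new StringWriter();
        TransformerHandler handler = serializerTo(out);
        handler.startDocument();
        handler.startElement("", "body", "body", new AttributesImpl());
        handler.startElement("", "p", "p", new AttributesImpl());
        text(handler, content);
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endDocument();
        return out.toString();
    }

    /** The "tags" are passed as character data, so the serializer escapes them. */
    static String withTagsAsCharacters(String content) throws Exception {
        StringWriter out = new StringWriter();
        TransformerHandler handler = serializerTo(out);
        handler.startDocument();
        handler.startElement("", "body", "body", new AttributesImpl());
        text(handler, "<p>");
        text(handler, content);
        text(handler, "</p>");
        handler.endElement("", "body", "body");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(withElementEvents("hello"));     // a real <p> element
        System.out.println(withTagsAsCharacters("hello"));  // '<' is written as &lt;
    }
}
```

In the second variant the serializer cannot tell that the text was meant to be markup, so it treats it like any other character data.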

See the JAXP documentation for more background on how SAX parsing and serialization is designed to work.

> Ok, this bespeaks my newness looking at this code. In your view, is
> this the right place to make a change? or am I misunderstanding the
> purpose of the content handler code?

I believe you've slightly misunderstood the code. I'm sorry about not making the intended design more apparent; we probably should document that part a bit better.

To better understand how and where escaping actually happens, take a look at the getTransformerHandler() method in the TikaCLI class (part of the tika-app component). There we use the Transformer functionality in JAXP to automatically handle the conversion from abstract SAX events (start/end elements, character data, etc.) to the corresponding character and byte sequences.
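
As a rough illustration of that idea (a sketch written from scratch with standard JAXP calls, not a copy of the TikaCLI source), the helper below builds a TransformerHandler for a given output method. The names OutputMethodDemo, newHandler and render are hypothetical. With the "xml" method the serializer escapes character data; with the "text" method it writes the characters as they are and emits no tags.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class OutputMethodDemo {

    // Hypothetical helper: a serializing handler for the given JAXP output method
    // ("xml", "html" or "text"), writing into 'out'.
    static TransformerHandler newHandler(String method, StringWriter out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        Transformer transformer = handler.getTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, method);
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    /** Sends one <p> element with the given text through the chosen output method. */
    static String render(String method, String text) throws Exception {
        StringWriter out = new StringWriter();
        TransformerHandler handler = newHandler(method, out);
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = "a < b & c";
        System.out.println(render("xml", text));   // markup with escaped character data
        System.out.println(render("text", text));  // character data only, unescaped
    }
}
```

The parser side stays the same in both cases; only the handler that receives the events decides how they end up as characters.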

BR,

Jukka Zitting

Re: tika and plain text -- bug or feature?

Posted by qubit <la...@yahoo.com>.

I don't know if my mail is getting filtered and translated by my mailer. It 
is not coming out right, and I am afraid you may not understand what I am 
saying.  Please review the source of my message rather than the rendering, 
which is wrong.  My mailer went and translated all the html, so what I'm 
seeing is not what I typed.

By translation I mean turning the less than sign into ampersand l t ; 
without the spaces.  Similarly the ampersand & should convert to & a m p ; 
without the spaces.

Please tell me if I made sense.
Thank you.
--le

Re: tika and plain text -- bug or feature?

Posted by qubit <la...@yahoo.com>.

Greetings and thanks for your reply.
I'll reply to excerpted fragments.

<<- Please avoid cross-posting between dev@ and user@. Responding only on 
dev@, as this is mostly related to Tika internals. ->>

Sorry about that.  I will send the rest of the mail on this thread only to 
dev.

<<-
> However it is output as xhtml with very little processing. I think I
> mentioned before that things like '<' should be translated to '&lt;'
> and '&' should become '&amp;'.

Escaping happens only when a SAX event stream is serialized to a character 
or a byte stream. The character SAX events produced by a parser aren't 
supposed to be escaped.
->>

Then perhaps I am in the wrong place in the code... or I am still not 
understanding all that sax is doing.  The translation needs to be done 
however because you are essentially outputting a plain text file as if it 
were xhtml with only header and footer elements slapped on the ends.  This 
doesn't work.  Suppose your plain text file is a tutorial on html and 
contains sample fragments of html code.  If you output the text without 
translating certain characters, the fragments will be interpreted as markup 
and will not appear as text in the document that the end user sees.  A short 
example:
---- example ----
This is how you write a link in html: <a href="#here">hi there</a>
---- end example ----
If you slap xhtml header and footer onto this plain text and output it as 
xhtml, then the end user will see only the link "hi there" and not the 
expansion of the link source.
To prevent this, all < symbols should be translated to &lt;.  Furthermore, 
since the ampersand & also prefixes special character codes, tika should 
also translate & to &amp;.
I do not know if it is necessary to convert the > symbols or the #, " or =. 
I believe only the less than and ampersand are essential to translate.

Does this answer your question?

<<- > I noticed the header and footer elements you output for the file.
> But this translation, and probably other insertions, need to be
> made to the text within the file itself, not the header/footer.
> Otherwise the rendered xhtml will be wrong.

I'm not sure what you're referring to here. Can you elaborate?
->>

See above.

<<- > I find XHTMLContentHandler code, which calls SafeContentHandler code.
> But I gather these methods have a different purpose than what I am
> looking at. I thought to create a subclass SafeTextContentHandler of
> SafeContentHandler to override the write function and provide the
> necessary replacement strings.

You're talking about entity escaping? There's no need to do this, as the 
functionality is already there in the Transformer part of JAXP.

More generally it's usually better to use a decorator than a subclass when 
you want to customize the behavior of a SAX ContentHandler.
->>

Ok, this bespeaks my newness looking at this code.  In your view, is this 
the right place to make a change? or am I misunderstanding the purpose of 
the content handler code?

I appreciate any comments.
--le

RE: tika and plain text -- bug or feature?

Posted by Jukka Zitting <jz...@adobe.com>.

Hi,

Please avoid cross-posting between dev@ and user@. Responding only on dev@, as this is mostly related to Tika internals.

From: qubit [mailto:lauraeaves@yahoo.com]
> First, it appears that the code in TextParser.java thinks it is
> dealing with a file in plain text (isn't that the same as text/plain?)

Correct. (Also, text/plain = plain text).

> However it is output as xhtml with very little processing. I think I
> mentioned before that things like '<' should be translated to '&lt;'
> and '&' should become '&amp;'.

Escaping happens only when a SAX event stream is serialized to a character or a byte stream. The character SAX events produced by a parser aren't supposed to be escaped.
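
A small sketch of the same rule seen from the other direction, using the JDK's own SAX parser rather than a Tika parser (the class and method names are made up for illustration): when escaped XML is parsed, the characters() events already carry the plain, unescaped characters. Entities exist only in the serialized form.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RawCharactersDemo {

    /** Parses the given XML and returns the character data exactly as the handler saw it. */
    static String textSeenByHandler(String xml) throws Exception {
        StringBuilder seen = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver the text in several chunks; append them all.
                seen.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        // The input is escaped on disk; the events are not.
        System.out.println(textSeenByHandler("<p>a &lt; b &amp; c</p>"));
    }
}
```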

> I noticed the header and footer elements you output for the file.
> But this translation, and probably other insertions, need to be
> made to the text within the file itself, not the header/footer.
> Otherwise the rendered xhtml will be wrong.

I'm not sure what you're referring to here. Can you elaborate?

> I have been trying to make this patch myself by looking at your
> code, which has taken me into the SAX content handlers, and I
> have one question: Is this code considered complete?

Pretty much so. I think the basic SAX event handling machinery in Tika is already quite stable and there aren't any major open design issues in that part of our codebase.

> I find XHTMLContentHandler code, which calls SafeContentHandler code.
> But I gather these methods have a different purpose than what I am
> looking at. I thought to create a subclass SafeTextContentHandler of
> SafeContentHandler to override the write function and provide the
> necessary replacement strings.

You're talking about entity escaping? There's no need to do this, as the functionality is already there in the Transformer part of JAXP.

More generally it's usually better to use a decorator than a subclass when you want to customize the behavior of a SAX ContentHandler.
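
For illustration, one possible shape of such a decorator, sketched with the standard org.xml.sax.helpers.XMLFilterImpl class, which forwards every event to the handler it wraps. The UpperCaseDecorator class and the upper-casing behavior are invented for the example and are not part of Tika.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class DecoratorDemo {

    /** Wraps another handler; only character data is changed, all other events pass through. */
    static class UpperCaseDecorator extends XMLFilterImpl {
        UpperCaseDecorator(ContentHandler delegate) {
            setContentHandler(delegate);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length); // forwards to the wrapped handler
        }
    }

    /** Sends one <p> element through the decorator and returns the text the delegate received. */
    static String run(String text) throws SAXException {
        StringBuilder seen = new StringBuilder();
        ContentHandler delegate = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        ContentHandler handler = new UpperCaseDecorator(delegate);
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endDocument();
        return seen.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(run("hi <there>"));
    }
}
```

Because the decorator only knows the ContentHandler interface, it can wrap any handler, including a serializing one, without touching the classes other code depends on.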

BR,

Jukka Zitting