Posted to dev@tika.apache.org by Stefano Fornari <st...@gmail.com> on 2014/04/03 08:23:20 UTC
Re: [PDFParser] - read limited number of characters
Hi Jukka,
any feedback?
Ste
On Sat, Mar 29, 2014 at 3:31 PM, Stefano Fornari
<st...@gmail.com> wrote:
> Hi Jukka, given we agree the pattern is not very nice, would you be OK with
> hiding it from client classes? I dug a bit more into the code and found that
> all we need is already there. This is what I would propose:
>
> 1 promote WriteLimitReachedException to a public class
> 2 move the awkward trick into PDFParser as follows:
>
> metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
> extractMetadata(pdfDocument, metadata);
> try {
>     PDF2XText.process(pdfDocument, handler, context, metadata,
>             localConfig);
> } catch (WriteLimitReachedException x) {
>     // This is a valid condition; just ignoring the exception
> }
>
> In this way the only thing client classes need to do is use a limiting
> BodyContentHandler:
>
> @Test
> public void testLimitTextToParse() throws Exception {
>     ContentHandler handler = new BodyContentHandler();
>
>     new PDFParser().parse(
>         getResourceAsStream("/test-documents/testPDF.pdf"),
>         handler,
>         new Metadata(),
>         new ParseContext()
>     );
>
>     assertEquals(1067, handler.toString().length());
>
>     handler = new BodyContentHandler(500);
>
>     new PDFParser().parse(
>         getResourceAsStream("/test-documents/testPDF.pdf"),
>         handler,
>         new Metadata(),
>         new ParseContext()
>     );
>
>     assertEquals(500, handler.toString().length());
> }
>
>
> One additional thing I would do is change WriteOutContentHandler as
> follows:
>
> /**
>  * Writes the given characters to the given character stream.
>  */
> @Override
> public void characters(char[] ch, int start, int length)
>         throws SAXException {
>     if (writeLimit == -1 || writeCount + length <= writeLimit) {
>         super.characters(ch, start, length);
>         writeCount += length;
>     } else {
>         super.characters(ch, start, writeLimit - writeCount);
>         writeCount = writeLimit;
>         writeLimitReached = true;
>         throw new WriteLimitReachedException(
>             "Your document contained more than " + writeLimit
>             + " characters, and so your requested limit has been"
>             + " reached. To receive the full text of the document,"
>             + " increase your limit. (Text up to the limit is"
>             + " however available).", tag);
>     }
> }
>
> /**
>  * Checks whether the given exception (or any of its root causes) was
>  * thrown by this handler as a signal of reaching the write limit.
>  *
>  * @since Apache Tika 0.7
>  * @param t throwable
>  * @return <code>true</code> if the write limit was reached,
>  *         <code>false</code> otherwise
>  * @deprecated as of Tika 1.6, use isWriteLimitReached(); the current
>  *             implementation ignores the given Throwable and is equivalent
>  *             to isWriteLimitReached()
>  */
> @Deprecated
> public boolean isWriteLimitReached(Throwable t) {
>     return isWriteLimitReached();
> }
>
> /**
>  * Returns true if the limit has been reached, false otherwise.
>  *
>  * @since Apache Tika 1.6
>  * @return <code>true</code> if the write limit was reached,
>  *         <code>false</code> otherwise
>  */
> public boolean isWriteLimitReached() {
>     return writeLimitReached;
> }
>
>
> If you are ok with the changes for #1 and #2 I will be happy to provide a
> patch.
>
> Ste
>
>
>
>> > On #2, I expected the code you presented would not work. And in fact the
>> > pattern is quite odd, isn't it? What is the reason for throwing the
>> > exception if limiting the text read is a legal use case? (I am asking just
>> > to understand the background).
>>
>> Yes, the pattern is a bit awkward and generally shouldn't be
>> recommended as it uses an exception to control the flow of the
>> program. However, in this case we considered it worth doing as the
>> alternative would have been far more complicated.
>>
>> Basically we wanted to avoid having to modify each parser
>> implementation (even those implemented outside Tika...) to keep track
>> of how much content has already been extracted and instead do that
>> just once in the WriteOutContentHandler class. However, the only way
>> for the WriteOutContentHandler to signal that parsing should be
>> stopped is by throwing a SAXException, which is what we're doing here.
>> By catching the exception and inspecting it with isWriteLimitReached()
>> the client can determine whether this is what happened.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
>
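
For reference, the client-side pattern Jukka describes in the quoted text
(catch the SAXException, then ask the handler whether it was the write-limit
signal) can be sketched roughly as below. This is a minimal, Tika-free
illustration and not the actual Tika code: WriteLimitSketch, LimitingHandler,
LimitReachedException, fakeParse and extract are all hypothetical names, the
"parser" is just a loop over string chunks, and the exception check is a plain
instanceof test rather than whatever WriteOutContentHandler really does.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical stand-in for WriteLimitReachedException. */
    static class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Hypothetical handler that keeps at most writeLimit characters (-1 = no limit). */
    static class LimitingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit;

        LimitingHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
            } else {
                // Keep the part that still fits, then signal the parser to stop.
                text.append(ch, start, writeLimit - text.length());
                throw new LimitReachedException("write limit of " + writeLimit + " reached");
            }
        }

        /** Simplified check: was this exception the write-limit signal? */
        boolean isWriteLimitReached(Throwable t) {
            return t instanceof LimitReachedException;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Hypothetical "parser": feeds each chunk to the handler as character data. */
    static void fakeParse(String[] chunks, DefaultHandler handler) throws SAXException {
        for (String chunk : chunks) {
            char[] ch = chunk.toCharArray();
            handler.characters(ch, 0, ch.length);
        }
    }

    /** Client code: treat the limit signal as normal, rethrow anything else. */
    static String extract(String[] chunks, int limit) throws SAXException {
        LimitingHandler handler = new LimitingHandler(limit);
        try {
            fakeParse(chunks, handler);
        } catch (SAXException e) {
            if (!handler.isWriteLimitReached(e)) {
                throw e; // a real parsing problem, not the limit signal
            }
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(new String[] {"Hello, ", "world"}, 9)); // prints "Hello, wo"
    }
}
```

The proposal in the quoted message amounts to moving the try/catch in extract()
into the parser itself, so that client code only has to pick the limit.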