Posted to dev@tika.apache.org by Stefano Fornari <st...@gmail.com> on 2014/04/03 08:23:20 UTC
Re: [PDFParser] - read limited number of characters

Hi Jukka,
any feedback?

Ste


On Sat, Mar 29, 2014 at 3:31 PM, Stefano Fornari
<st...@gmail.com> wrote:

> Hi Jukka, given we agree the pattern is not very nice, would you be ok to
> hide it from client classes? I dug a bit more into the code and found that
> all we need is already there. This is what I would propose:
>
> 1 promote WriteLimitReachedException to a public class
> 2 move the awkward trick into PDFParser as follows:
>
>     metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
>     extractMetadata(pdfDocument, metadata);
>     try {
>          PDF2XText.process(pdfDocument, handler, context, metadata,
>                  localConfig);
>     } catch (WriteLimitReachedException x) {
>           //
>           // This is a valid condition; just ignoring the exception
>           //
>     }
>
> In this way the only thing client classes need to do is use a limiting
> BodyContentHandler:
>
>     @Test
>     public void testLimitTextToParse() throws Exception {
>         ContentHandler handler = new BodyContentHandler();
>
>         new PDFParser().parse(
>             getResourceAsStream("/test-documents/testPDF.pdf"),
>             handler,
>             new Metadata(),
>             new ParseContext()
>         );
>
>         assertEquals(1067, handler.toString().length());
>
>         handler = new BodyContentHandler(500);
>
>         new PDFParser().parse(
>             getResourceAsStream("/test-documents/testPDF.pdf"),
>             handler,
>             new Metadata(),
>             new ParseContext()
>         );
>
>         assertEquals(500, handler.toString().length());
>     }
>
>
> One additional thing I would do is change WriteOutContentHandler as
> follows:
>
>     /**
>      * Writes the given characters to the given character stream.
>      */
>     @Override
>     public void characters(char[] ch, int start, int length)
>             throws SAXException {
>         if (writeLimit == -1 || writeCount + length <= writeLimit) {
>             super.characters(ch, start, length);
>             writeCount += length;
>         } else {
>             super.characters(ch, start, writeLimit - writeCount);
>             writeCount = writeLimit;
>             writeLimitReached = true;
>             throw new WriteLimitReachedException(
>                     "Your document contained more than " + writeLimit
>                     + " characters, and so your requested limit has been"
>                     + " reached. To receive the full text of the document,"
>                     + " increase your limit. (Text up to the limit is"
>                     + " however available).", tag);
>         }
>     }
>
>     /**
>      * Checks whether the given exception (or any of its root causes) was
>      * thrown by this handler as a signal of reaching the write limit.
>      *
>      * @since Apache Tika 0.7
>      * @param t throwable
>      * @return <code>true</code> if the write limit was reached,
>      *         <code>false</code> otherwise
>      *
>      * @deprecated since Apache Tika 1.6, use isWriteLimitReached(); the
>      *             current implementation ignores the given Throwable and
>      *             is equivalent to isWriteLimitReached()
>      */
>     @Deprecated
>     public boolean isWriteLimitReached(Throwable t) {
>         return isWriteLimitReached();
>     }
>
>     /**
>      * Returns true if the limit has been reached, false otherwise.
>      *
>      * @since Apache Tika 1.6
>      * @return <code>true</code> if the write limit was reached,
>      *         <code>false</code> otherwise
>      */
>     public boolean isWriteLimitReached() {
>         return writeLimitReached;
>     }
>
>
> If you are ok with the changes for #1 and #2 I will be happy to provide a
> patch.
>
> Ste
>
>
>
>> > On #2, I expected the code you presented would not work. And in fact the
>> > pattern is quite odd, isn't it? What is the reason of throwing the
>> > exception if limiting the text read is a legal use case? (I am asking
>> > just to understand the background).
>>
>> Yes, the pattern is a bit awkward and generally shouldn't be
>> recommended as it uses an exception to control the flow of the
>> program. However, in this case we considered it worth doing as the
>> alternative would have been far more complicated.
>>
>> Basically we wanted to avoid having to modify each parser
>> implementation (even those implemented outside Tika...) to keep track
>> of how much content has already been extracted and instead do that
>> just once in the WriteOutContentHandler class. However, the only way
>> for the WriteOutContentHandler to signal that parsing should be
>> stopped is by throwing a SAXException, which is what we're doing here.
>> By catching the exception and inspecting it with isWriteLimitReached()
>> the client can determine whether this is what happened.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
>
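
For reference, here is a minimal, self-contained sketch of the pattern
discussed above, using only the JDK's SAX classes rather than Tika itself.
LimitedTextHandler, LimitReachedException and the parse() helper are
hypothetical stand-ins (for WriteOutContentHandler,
WriteLimitReachedException and a parser such as PDFParser); this is an
illustration under those assumptions, not the actual Tika code.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical marker exception, standing in for WriteLimitReachedException. */
    static class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Hypothetical handler that keeps at most `limit` characters (-1 means unlimited). */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;
        private boolean limitReached = false;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (limit == -1 || text.length() + length <= limit) {
                text.append(ch, start, length);
            } else {
                // Keep the part that still fits, then signal the parser to stop.
                text.append(ch, start, limit - text.length());
                limitReached = true;
                throw new LimitReachedException(
                        "limit of " + limit + " characters reached");
            }
        }

        boolean isLimitReached() {
            return limitReached;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Stand-in for a parser: feeds text chunks and swallows only the limit signal. */
    static void parse(String[] chunks, DefaultHandler handler) throws SAXException {
        try {
            for (String chunk : chunks) {
                handler.characters(chunk.toCharArray(), 0, chunk.length());
            }
        } catch (LimitReachedException e) {
            // Valid condition: the caller asked for a limited amount of text.
        }
    }

    public static void main(String[] args) throws SAXException {
        LimitedTextHandler handler = new LimitedTextHandler(8);
        parse(new String[] {"Hello", ", ", "world"}, handler);
        // Prints: Hello, w / limit reached: true
        System.out.println(handler + " / limit reached: " + handler.isLimitReached());
    }
}
```

With the catch inside parse(), the client only chooses a limit and reads the
handler afterwards. Any other SAXException still propagates to the caller,
because only the marker subclass is swallowed.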