Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2021/04/26 18:06:13 UTC

Re: Tika Server writeLimit header

Yikes and sorry... Your understanding is what I intended to happen, and I
added a unit test to confirm.  Tika should return the partial contents up
to the write limit.  Can you give me more details about how you're calling
Tika?  /rmeta endpoint with text or xhtml or body?  What is the
writeLimit you're choosing? Can you reproduce on a unit test file in our
repo?  Thank you for alerting us to this!

Best,

       Tim


On Mon, Apr 26, 2021 at 12:39 PM <ju...@francelabs.com> wrote:

> Hi Tim,
>
>
>
> I tried the fix you made in Tika 1.26 for TIKA-3325, the issue I reported.
> As a reminder, it concerned the limit on extracted content, and the goal
> was to apply the limit across all content in a container file. It works, but
> not as I expected 😊: if the limit is reached, Tika returns no
> content at all.
>
>
>
> On my side, I was expecting it to return content up to the limit, with
> anything exceeding the limit truncated to fit.
> Would it be possible to have another header option implementing this
> behavior, or is that too complicated?
>
>
>
> Regards,
>
> Julien
>
>
>
>
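
For readers following the thread, the kind of call Tim is asking about can be sketched as below. This is only an illustration, not the reporter's actual client code: the /rmeta/text path and the writeLimit header name come from this thread, while the class name, the localhost URL, and port 9998 are assumptions. The sketch builds the request without sending it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are not part of Tika.
final class WriteLimitRequestSketch {
    // Assumption: a tika-server instance listening on localhost:9998.
    static final String RMETA_TEXT = "http://localhost:9998/rmeta/text";

    /** Builds (but does not send) a PUT to /rmeta/text carrying a writeLimit header. */
    static HttpRequest build(byte[] document, int writeLimit) {
        return HttpRequest.newBuilder(URI.create(RMETA_TEXT))
                .header("Accept", "application/json")
                .header("writeLimit", Integer.toString(writeLimit))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build("hello".getBytes(StandardCharsets.UTF_8), 1000);
        // Show the method, target, and headers that would be sent.
        System.out.println(req.method() + " " + req.uri() + " " + req.headers().map());
    }
}
```

Sending the request with java.net.http.HttpClient and reading the JSON body would be the next step; it is left out here so the sketch runs without a server.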

Re: Tika Server writeLimit header

Posted by Tim Allison <ta...@apache.org>.
In our 1.27 branch this new pseudo test prints out the output below
it.  I'm looking into why this content is slightly longer than the
write limit...

@Test
public void testJsonWriteLimitEmbedded() throws Exception {
    // Try write limits from 500 to 9500 characters in steps of 500.
    for (int i = 500; i < 10000; i += 500) {
        Response response = WebClient.create(endPoint + META_PATH + "/text")
                .accept("application/json")
                .header("writeLimit", Integer.toString(i))
                .put(ClassLoader.getSystemResourceAsStream(TEST_RECURSIVE_DOC));
        List<Metadata> metadata = JsonMetadataList.fromJson(
                new InputStreamReader((InputStream) response.getEntity(), StandardCharsets.UTF_8));
        // Sum the extracted content length over the container and its embedded documents.
        int len = 0;
        for (Metadata m : metadata) {
            len += m.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT).length();
        }
        System.out.println("write limit: " + i + " : actual len: " + len);
    }
}

write limit: 500 : actual len: 998
write limit: 1000 : actual len: 1498
write limit: 1500 : actual len: 1998
write limit: 2000 : actual len: 2498
write limit: 2500 : actual len: 2998
write limit: 3000 : actual len: 3498
write limit: 3500 : actual len: 3998
write limit: 4000 : actual len: 4498
write limit: 4500 : actual len: 4998
write limit: 5000 : actual len: 5498
write limit: 5500 : actual len: 5998
write limit: 6000 : actual len: 6498
write limit: 6500 : actual len: 6998
write limit: 7000 : actual len: 7498
write limit: 7500 : actual len: 7998
write limit: 8000 : actual len: 8498
write limit: 8500 : actual len: 8672
write limit: 9000 : actual len: 8672
write limit: 9500 : actual len: 8672
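
In the output above, every run that hits the limit overshoots it by the same 498 characters (998 vs. 500, 1498 vs. 1000, and so on), until the full 8672 characters fit. The behavior discussed in this thread, one budget shared across the container and all embedded documents, can be sketched as follows. This is a minimal illustration with hypothetical names operating on plain strings; it is not Tika's handler code and makes no claim about where the extra 498 characters come from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a write limit shared across several documents.
final class SharedWriteLimitSketch {
    /**
     * Truncates per-document contents so that their combined length never
     * exceeds totalLimit. Documents reached after the budget is spent
     * come back as empty strings.
     */
    static List<String> truncate(List<String> contents, int totalLimit) {
        List<String> out = new ArrayList<>();
        int remaining = Math.max(0, totalLimit);
        for (String c : contents) {
            String s = (c == null) ? "" : c; // a document may have no extracted text
            int take = Math.min(s.length(), remaining);
            out.add(s.substring(0, take));
            remaining -= take;
        }
        return out;
    }

    /** Sums the lengths, as the test loop in this message does. */
    static int totalLength(List<String> contents) {
        int len = 0;
        for (String c : contents) {
            len += c.length();
        }
        return len;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a".repeat(400), "b".repeat(300), "c".repeat(200));
        List<String> cut = truncate(docs, 500);
        System.out.println("write limit: 500 : actual len: " + totalLength(cut));
    }
}
```

With a budget like this, the summed length equals the limit whenever the documents hold more text than the limit allows, and equals the full length otherwise.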
