Posted to dev@tika.apache.org by "Josh Burchard (Jira)" <ji...@apache.org> on 2022/01/12 22:20:00 UTC

[jira] [Updated] (TIKA-3643) writeLimit for bytes in addition to characters

     [ https://issues.apache.org/jira/browse/TIKA-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Burchard updated TIKA-3643:
--------------------------------
    Description: 
[~jmssiera] wrote up the enhancement request TIKA-3325, where he originally asked for the write limit to be specified as a number of bytes.  I see that issue was marked as Resolved, but writeLimit still counts chars rather than bytes.

I have a use case where the consumer side (an indexer) has a control for the maximum number of bytes to index.  When I use the writeLimit header with Tika to extract text from a document with mixed ASCII and multi-byte characters, I can't get back exactly, for example, 6MB worth of text, because I don't know a priori which chars will be in the file.

My ask here is for a new control, maybe "writeLimitBytes", where the returned text is cut at the last complete character that fits within the limit.  The returned text would therefore be <= writeLimitBytes in size, but as close to that value as character boundaries allow.

  was:
[~jmssiera] wrote up the enhancement request TIKA-3325 where he originally requested that the number of bytes be passed as the write limit.  I see that issue was marked as Resolved, but writeLimit is number of chars instead of number of bytes.

I have a use-case where the consumer side (an indexer) has a control for the maximum number of bytes to index.  When I'm using the writeLimit header with Tika and I'm extracting text from a document with mixed ASCII and multi-byte characters I can't get back exactly 6MB worth of text because I don't know a-priori what chars will be in the file.   

My ask here is for a new control, maybe "writeLimitBytes" where the number of characters returned breaks on the last coherent character.  Therefore the returned text would be <= writeLimitBytes but would more or less be close to that value.


> writeLimit for bytes in addition to characters
> ----------------------------------------------
>
>                 Key: TIKA-3643
>                 URL: https://issues.apache.org/jira/browse/TIKA-3643
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.2.1
>            Reporter: Josh Burchard
>            Priority: Major
>
> [~jmssiera] wrote up the enhancement request TIKA-3325, where he originally asked for the write limit to be specified as a number of bytes.  I see that issue was marked as Resolved, but writeLimit still counts chars rather than bytes.
> I have a use case where the consumer side (an indexer) has a control for the maximum number of bytes to index.  When I use the writeLimit header with Tika to extract text from a document with mixed ASCII and multi-byte characters, I can't get back exactly, for example, 6MB worth of text, because I don't know a priori which chars will be in the file.
> My ask here is for a new control, maybe "writeLimitBytes", where the returned text is cut at the last complete character that fits within the limit.  The returned text would therefore be <= writeLimitBytes in size, but as close to that value as character boundaries allow.
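
To make the requested behaviour concrete, here is a minimal sketch of cutting text at the last complete character that fits in a byte budget. It assumes the consumer counts bytes in UTF-8, which the description does not state. The class and method names (WriteLimitBytesSketch, truncateToUtf8Bytes, utf8Length) are hypothetical and are not part of Tika; this only illustrates the idea and is not a proposed patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Tika.
final class WriteLimitBytesSketch {

    // Returns the longest prefix of text whose UTF-8 encoding is at most
    // maxBytes long, cutting only on code point boundaries so that no
    // multi-byte character (or surrogate pair) is split.
    public static String truncateToUtf8Bytes(String text, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int size = utf8Length(cp);
            if (bytes + size > maxBytes) {
                break; // the next character would exceed the byte budget
            }
            bytes += size;
            i += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        return text.substring(0, i);
    }

    // Number of bytes UTF-8 uses to encode one code point.
    public static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    public static void main(String[] args) {
        // "abc" (3 bytes) + e-acute (2) + a CJK char (3) + an emoji (4) + "xyz" (3)
        String text = "abc\u00e9\u65e5\ud83d\ude00xyz";
        String cut = truncateToUtf8Bytes(text, 9);
        // The 4-byte emoji does not fit in the remaining 1 byte, so the
        // result stops after the CJK character.
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // prints 8
    }
}
```

With a limit of 9 bytes the result is 8 bytes long: less than or equal to the limit, and as close to it as the character boundaries allow, which is the behaviour asked for above.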



--
This message was sent by Atlassian Jira
(v8.20.1#820001)