Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/03/22 14:30:00 UTC

[jira] [Reopened] (TIKA-3695) LimitingMetadataFilter

     [ https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reopened TIKA-3695:
-------------------------------

Let's add per-field limits.

Also, we should base the size estimates on UTF-16, since that's what Java uses under the hood. Yes, I realize that estimating the memory consumed by Strings is, um, not an exact science. The goal is to prevent a DoS, not to be super precise about these limits.
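
To make the idea concrete, here is a rough sketch of how per-field and total limits could be applied with a UTF-16 size estimate. This is not the Tika implementation: the class MetadataLimitSketch, the methods estimateBytes and limit, and the three limit parameters are hypothetical names, and a plain Map stands in for Tika's Metadata object. It also assumes that values which do not fit are dropped rather than truncated, which is only one possible policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Tika's API: a plain Map stands in for the metadata object.
final class MetadataLimitSketch {

    // Rough UTF-16 estimate: String.length() counts UTF-16 code units, 2 bytes each.
    // This ignores object headers and any JVM-internal compaction; it is only an estimate.
    static long estimateBytes(String s) {
        return 2L * s.length();
    }

    // Returns a copy of the metadata that respects three estimated-byte limits:
    // per name (maxKeyBytes), per field's values (maxFieldBytes), and overall (maxTotalBytes).
    static Map<String, List<String>> limit(Map<String, List<String>> metadata,
                                           long maxKeyBytes,
                                           long maxFieldBytes,
                                           long maxTotalBytes) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        long total = 0;
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            String key = e.getKey();
            long keyBytes = estimateBytes(key);
            // Skip names that are too long or that no longer fit in the total budget.
            if (keyBytes > maxKeyBytes || total + keyBytes > maxTotalBytes) {
                continue;
            }
            long fieldBytes = 0;
            List<String> kept = new ArrayList<>();
            for (String value : e.getValue()) {
                long valueBytes = estimateBytes(value);
                // Stop adding values once the per-field or the total budget would be exceeded.
                if (fieldBytes + valueBytes > maxFieldBytes
                        || total + keyBytes + fieldBytes + valueBytes > maxTotalBytes) {
                    break;
                }
                kept.add(value);
                fieldBytes += valueBytes;
            }
            // Only keep the field, and only charge the budget, if at least one value fit.
            if (!kept.isEmpty()) {
                out.put(key, kept);
                total += keyBytes + fieldBytes;
            }
        }
        return out;
    }
}
```

Counting two bytes per char deliberately errs on the high side for ASCII-heavy metadata, which seems acceptable when the goal is DoS protection rather than exact accounting.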



> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: huge-title.docx, tika-config.xml
>
>
> Some files contain abnormally large metadata (several MB, whether in the metadata values, the metadata names, or the total amount of metadata), which can be problematic for memory consumption.
> It would be great to have a new LimitingMetadataFilter so that we can filter metadata according to different byte limits (on metadata names, on metadata values, and on the total amount of metadata).
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)