Posted to dev@tika.apache.org by GitBox <gi...@apache.org> on 2022/03/03 18:22:19 UTC

[GitHub] [tika] tballison commented on a change in pull request #520: Fix email detection (TIKA-3687)

tballison commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818943762



##########
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##########
@@ -6422,7 +6422,7 @@
       <!-- match X- DKIM- ARC- at start of file and then require at least one
            of the usual: from, received, date...but look farther into the file
            because of the X|DKIM|ARC headers-->
-      <match value="(X|DKIM|ARC)-" type="regex" offset="0">
+      <match value="(X|DKIM|ARC)-" type="regex" offset="0:1024">

Review comment:
       I worry about looking for X- anywhere in the first 1024 bytes without requiring a \n before it.
   
   What would you think of adding something like this to the previous minShouldMatch=2 clause?
   `
   <match value="\nX-" type="string" offset="0:1024"/>
   <match value="\nDKIM-" type="string" offset="0:1024"/>
   <match value="\nARC-" type="string" offset="0:1024"/>
   `
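   
   To illustrate the concern, here is a minimal sketch in plain Java. It does not use Tika's detector or its magic-matching code; the class name `HeaderPrefixSketch`, its methods, and the sample inputs are all hypothetical, and reading `offset="0:1024"` as "search the first 1024 bytes" is an approximation. It only contrasts an unanchored search for the prefix with one that requires a preceding `\n`:
   
   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.regex.Pattern;

   // Hypothetical illustration only; not Tika's implementation.
   public class HeaderPrefixSketch {
       // Assumption: approximate offset="0:1024" by scanning the first 1024 bytes.
       private static final int WINDOW = 1024;

       // Unanchored: the prefix may appear anywhere in the window, even mid-word.
       private static final Pattern ANYWHERE = Pattern.compile("(X|DKIM|ARC)-");

       // Anchored: the prefix must directly follow a newline.
       private static final Pattern AFTER_NEWLINE = Pattern.compile("\n(X|DKIM|ARC)-");

       // Decode at most WINDOW bytes; ISO-8859-1 maps every byte to one char.
       static String head(byte[] data) {
           int len = Math.min(data.length, WINDOW);
           return new String(data, 0, len, StandardCharsets.ISO_8859_1);
       }

       static boolean matchesAnywhere(byte[] data) {
           return ANYWHERE.matcher(head(data)).find();
       }

       static boolean matchesAfterNewline(byte[] data) {
           return AFTER_NEWLINE.matcher(head(data)).find();
       }

       public static void main(String[] args) {
           byte[] notEmail = "Notes on the MAX-FLOW problem and the X-ray dataset\n"
                   .getBytes(StandardCharsets.ISO_8859_1);
           byte[] email = "Received: from a.example\nX-Mailer: demo\nFrom: a@example.org\n"
                   .getBytes(StandardCharsets.ISO_8859_1);

           System.out.println("notEmail anywhere:      " + matchesAnywhere(notEmail));
           System.out.println("notEmail after newline: " + matchesAfterNewline(notEmail));
           System.out.println("email anywhere:         " + matchesAnywhere(email));
           System.out.println("email after newline:    " + matchesAfterNewline(email));
       }
   }
   ```
   
   In this sketch the unanchored pattern fires on ordinary prose containing "MAX-" or "X-ray", while the newline-anchored pattern fires only on the sample with a real header line. A header on the very first line has no preceding `\n`, so the anchored form alone would miss it; that case would still need the other clauses.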
   



