Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/11/19 12:06:35 UTC
[jira] [Commented] (TIKA-1483) Create a general raw string parser
[ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217747#comment-14217747 ]
Tim Allison commented on TIKA-1483:
-----------------------------------
+1. It would be great to have something like this, especially if we could eventually add language models a la la-strings (http://la-strings.sourceforge.net/). We could also use this as a fallback parser when the regular parser throws an exception.
> Create a general raw string parser
> ----------------------------------
>
> Key: TIKA-1483
> URL: https://issues.apache.org/jira/browse/TIKA-1483
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.6
> Reporter: Luis Filipe Nassif
>
> I think it would be very useful to add a general parser able to extract raw strings from files (like the strings command). It could be used as the fallback parser for all mimetypes that have no specific parser implementation, such as application/octet-stream. It could also be used as a fallback for corrupt files that throw a TikaException.
> It must be configured with the script/language to be extracted from the files (so far I have implemented one specific to Latin1).
> It can use heuristics to extract strings encoded with different charsets within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
> What does the community think about that?
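
As a rough illustration of the idea in the quoted description, the Java sketch below collects runs of printable ISO-8859-1 bytes from a byte array, similar in spirit to the strings command. The class name RawStrings, the method extract, and the minLength threshold are assumptions made for this example only; they are not Tika APIs and not the reporter's implementation. The charset heuristics for UTF-8 and UTF-16 mentioned in the description are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: collects runs of printable
// ISO-8859-1 bytes, similar in spirit to the Unix `strings` command.
final class RawStrings {

    // Treat tab, ASCII 0x20-0x7E and Latin-1 0xA0-0xFF as printable.
    static boolean isPrintableLatin1(int b) {
        return b == '\t' || (b >= 0x20 && b <= 0x7E) || (b >= 0xA0 && b <= 0xFF);
    }

    // Returns every run of at least minLength printable bytes,
    // decoded as ISO-8859-1, in the order the runs appear in data.
    static List<String> extract(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        int start = -1; // start index of the current run, or -1 if none is open
        // Iterate one step past the end so a trailing run is flushed too.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintableLatin1(data[i] & 0xFF);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    result.add(new String(data, start, i - start, StandardCharsets.ISO_8859_1));
                }
                start = -1;
            }
        }
        return result;
    }
}
```

A fallback parser along these lines would read the stream into a buffer (or process it in chunks), call something like RawStrings.extract(bytes, 4), and emit the resulting strings as text. Choosing a minimum length of around 4 is a common way to cut down on noise from binary data, but the right value is a tuning decision.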
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)