Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/11/14 14:52:33 UTC
[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212246#comment-14212246 ]
Tim Allison edited comment on TIKA-1445 at 11/14/14 1:52 PM:
-------------------------------------------------------------
The AutoDetectParser was doing its regular lookup for which parser supported x file type. No luck in that.
Now, there is unfortunately something approaching luck in how we're handling the case where multiple parsers support a given file type. Our current algorithm, if I understand it correctly, is effectively to sort parsers in reverse alphabetical order by their package+class name (with a special case to prefer non-o.a.t parsers) and then pick the first parser that claims that it will parse the given file type. In the code itself, the sort is ascending with o.a.t parsers first, and a later parser overwrites an earlier one in the type map, which amounts to the same thing.
From the DefaultParser:
{noformat}
List<Parser> parsers =
        loader.loadStaticServiceProviders(Parser.class);
Collections.sort(parsers, new Comparator<Parser>() {
    public int compare(Parser p1, Parser p2) {
        String n1 = p1.getClass().getName();
        String n2 = p2.getClass().getName();
        boolean t1 = n1.startsWith("org.apache.tika.");
        boolean t2 = n2.startsWith("org.apache.tika.");
        if (t1 == t2) {
            return n1.compareTo(n2);
        } else if (t1) {
            return -1;
        } else {
            return 1;
        }
    }
});
{noformat}
and from CompositeParser:
{noformat}
public Map<MediaType, Parser> getParsers(ParseContext context) {
    Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
    for (Parser parser : parsers) {
        for (MediaType type : parser.getSupportedTypes(context)) {
            map.put(registry.normalize(type), parser);
        }
    }
    return map;
}
{noformat}
The "luck" so far is that, for example, the org.apache.tika.parser.gdal.GDALParser (which supports jpeg and gif) happens to sort after org.apache.tika.parser.jpeg.JpegParser, org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers in that reverse alphabetical order, so those parsers win for the image types they share with it. If you run the GDALParser on "/test-documents/testJPEG_EXIF.jpg", you get no metadata. :(
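To make the ordering effect concrete, here is a minimal sketch that applies the same comparison to plain class-name strings and then fills a map the way getParsers does. It is a simplification, not Tika code: it assumes every listed parser claims the same type, and com.example.CustomJpegParser is a hypothetical third-party parser invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParserOrderDemo {

    // Same ordering as the DefaultParser comparator, applied to class names:
    // o.a.t classes first, then ascending alphabetical order within each group.
    static final Comparator<String> ORDER = new Comparator<String>() {
        public int compare(String n1, String n2) {
            boolean t1 = n1.startsWith("org.apache.tika.");
            boolean t2 = n2.startsWith("org.apache.tika.");
            if (t1 == t2) {
                return n1.compareTo(n2);
            } else if (t1) {
                return -1;
            } else {
                return 1;
            }
        }
    };

    // Simplifying assumption: every parser in the list claims the given type.
    static String winnerFor(String type, List<String> parserNames) {
        List<String> sorted = new ArrayList<>(parserNames);
        Collections.sort(sorted, ORDER);
        Map<String, String> map = new HashMap<>();
        for (String name : sorted) {
            map.put(type, name); // a later put replaces the earlier mapping
        }
        return map.get(type);
    }

    public static void main(String[] args) {
        List<String> builtIn = Arrays.asList(
                "org.apache.tika.parser.jpeg.JpegParser",
                "org.apache.tika.parser.gdal.GDALParser");
        // "gdal" sorts before "jpeg", so the jpeg parser is put last and wins.
        System.out.println(winnerFor("image/jpeg", builtIn));

        List<String> withExternal = new ArrayList<>(builtIn);
        withExternal.add("com.example.CustomJpegParser"); // hypothetical
        // Non-o.a.t classes sort after all o.a.t classes, so this one wins.
        System.out.println(winnerFor("image/jpeg", withExternal));
    }
}
```

If the GDAL parser lived in a package that sorted later than "jpeg", the same logic would pick it instead, which is the sense in which the current result depends on naming.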
Depending on what the community thinks, we may want to open a separate issue and change DefaultParser's method of selecting a parser so that it:
1) selects non-o.a.t. parsers first
2) respects the order of parsers in the services files
This wouldn't change the behavior, but it would allow users to select parser preference by a means other than relying on reverse alphabetical order.
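As a rough illustration of the two proposed rules, the sketch below reorders a list of class names so that non-o.a.t parsers come first while the order within each group is left exactly as given. It is only a sketch under assumptions: the input list is taken to already be in services-file order, an earlier position is assumed to mean higher preference, and the class ServiceOrderSketch and the name com.example.CustomParser are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ServiceOrderSketch {

    // Returns the names in preference order: non-o.a.t parsers first,
    // then o.a.t parsers, each group keeping its services-file order.
    static List<String> preferenceOrder(List<String> serviceFileOrder) {
        List<String> external = new ArrayList<>();
        List<String> builtIn = new ArrayList<>();
        for (String name : serviceFileOrder) {
            if (name.startsWith("org.apache.tika.")) {
                builtIn.add(name);
            } else {
                external.add(name);
            }
        }
        List<String> result = new ArrayList<>(external);
        result.addAll(builtIn);
        return result;
    }

    public static void main(String[] args) {
        List<String> listed = Arrays.asList(
                "org.apache.tika.parser.jpeg.JpegParser",
                "com.example.CustomParser", // hypothetical third-party parser
                "org.apache.tika.parser.gdal.GDALParser");
        System.out.println(preferenceOrder(listed));
    }
}
```

With this kind of ordering, moving a line in the services file would be enough to change which parser is preferred, without renaming any class.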
was (Author: tallison@mitre.org):
The AutoDetectParser was doing its regular lookup for which parser supported x file type. No luck in that.
Now, there is unfortunately something approaching luck in how we're handling the case where multiple parsers support a given file type. Our current algorithm, if I understand it correctly, is to sort parsers in reverse alphabetical order by their package+class name (with a special case to prefer non-o.a.t parsers) and then pick the first parser that claims that it will parse the given file type.
From the DefaultParser:
{noformat}
List<Parser> parsers =
        loader.loadStaticServiceProviders(Parser.class);
Collections.sort(parsers, new Comparator<Parser>() {
    public int compare(Parser p1, Parser p2) {
        String n1 = p1.getClass().getName();
        String n2 = p2.getClass().getName();
        boolean t1 = n1.startsWith("org.apache.tika.");
        boolean t2 = n2.startsWith("org.apache.tika.");
        if (t1 == t2) {
            return n1.compareTo(n2);
        } else if (t1) {
            return -1;
        } else {
            return 1;
        }
    }
});
{noformat}
and
{noformat}
if (loader != null) {
    // Add dynamic parser service (they always override static ones)
    MediaTypeRegistry registry = getMediaTypeRegistry();
    List<Parser> parsers =
            loader.loadDynamicServiceProviders(Parser.class);
    Collections.reverse(parsers); // best parser last
    for (Parser parser : parsers) {
        for (MediaType type : parser.getSupportedTypes(context)) {
            map.put(registry.normalize(type), parser);
        }
    }
}
{noformat}
The "luck" so far is that, for example, the org.apache.tika.parser.gdal.GDALParser (which supports jpeg and gif) happens to sort after the org.apache.tika.parser.jpeg.JpegParser, the org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers. If you run the GDALParser on "/test-documents/testJPEG_EXIF.jpg", you get no metadata. :(
Depending on what the community thinks, we may want to open a separate issue and change DefaultParser's method of selecting a parser so that it:
1) selects non-o.a.t. parsers first
2) respects the order of parsers in the services files
This wouldn't change the behavior, but it would allow users to select parser preference by a means other than relying on reverse alphabetical order.
> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, consider how to add back the metadata extraction capabilities of the other image parsers.