Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/11/14 14:52:33 UTC

[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

    [ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212246#comment-14212246 ] 

Tim Allison edited comment on TIKA-1445 at 11/14/14 1:52 PM:
-------------------------------------------------------------

The AutoDetectParser was doing its regular lookup for which parser supported x file type.  No luck in that.

Now, there is unfortunately something approaching luck in how we're handling the case where multiple parsers support a given file type.  Our current algorithm, if I understand it correctly, is to sort parsers in reverse alphabetical order by their package+class name (with a special case to prefer non-o.a.t parsers) and then pick the first parser that claims that it will parse the given file type.

From the DefaultParser:
{noformat}
        List<Parser> parsers =
                loader.loadStaticServiceProviders(Parser.class);
        Collections.sort(parsers, new Comparator<Parser>() {
            public int compare(Parser p1, Parser p2) {
                String n1 = p1.getClass().getName();
                String n2 = p2.getClass().getName();
                boolean t1 = n1.startsWith("org.apache.tika.");
                boolean t2 = n2.startsWith("org.apache.tika.");
                if (t1 == t2) {
                    return n1.compareTo(n2);
                } else if (t1) {
                    return -1;
                } else {
                    return 1;
                }
            }
        });
{noformat}

and from CompositeParser:

{noformat}
public Map<MediaType, Parser> getParsers(ParseContext context) {
        Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
        for (Parser parser : parsers) {
            for (MediaType type : parser.getSupportedTypes(context)) {
                map.put(registry.normalize(type), parser);
            }
        }
        return map;
    }
{noformat}

The "luck" so far is that, for example, the org.apache.tika.parser.gdal.GDALParser (which supports jpeg and gif) happens to sort after the org.apache.tika.parser.jpeg.JpegParser, the org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers.  If you run the GDALParser on "/test-documents/testJPEG_EXIF.jpg", you get no metadata. :(
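
To make the effect of the two snippets above concrete, here is a minimal, Tika-free sketch that applies the same comparator to plain class names and then fills a map the way CompositeParser does, so later entries overwrite earlier ones. The class ParserOrderDemo, the method winnerFor, the com.example.CustomJpegParser name and the exact supported-type lists are assumptions made up for illustration; only the comparator logic is taken from the code quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParserOrderDemo {

    // Same ordering as the DefaultParser comparator quoted above,
    // applied to class names instead of Parser instances.
    static final Comparator<String> TIKA_FIRST = new Comparator<String>() {
        public int compare(String n1, String n2) {
            boolean t1 = n1.startsWith("org.apache.tika.");
            boolean t2 = n2.startsWith("org.apache.tika.");
            if (t1 == t2) {
                return n1.compareTo(n2);
            } else if (t1) {
                return -1;
            } else {
                return 1;
            }
        }
    };

    // Sorts the parser names, then fills a type -> parser map in that order.
    // Because map.put overwrites, the LAST parser in sorted order wins.
    static String winnerFor(String type, Map<String, List<String>> supported) {
        List<String> sorted = new ArrayList<String>(supported.keySet());
        Collections.sort(sorted, TIKA_FIRST);
        Map<String, String> map = new HashMap<String, String>();
        for (String name : sorted) {
            for (String t : supported.get(name)) {
                map.put(t, name);
            }
        }
        return map.get(type);
    }

    public static void main(String[] args) {
        // Illustrative type lists; not the parsers' real getSupportedTypes() output.
        Map<String, List<String>> supported = new HashMap<String, List<String>>();
        supported.put("org.apache.tika.parser.gdal.GDALParser",
                Arrays.asList("image/jpeg", "image/gif"));
        supported.put("org.apache.tika.parser.jpeg.JpegParser",
                Arrays.asList("image/jpeg"));

        // "gdal" sorts before "jpeg", so JpegParser overwrites GDALParser.
        System.out.println(winnerFor("image/jpeg", supported));

        // A hypothetical non-o.a.t parser sorts last and therefore wins.
        supported.put("com.example.CustomJpegParser", Arrays.asList("image/jpeg"));
        System.out.println(winnerFor("image/jpeg", supported));
    }
}
```

In this sketch the first line printed is the JpegParser name and the second is the hypothetical com.example parser, which is the "reverse alphabetical, non-o.a.t preferred" behavior described above.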

Depending on what the community thinks, we may want to open a separate issue and change DefaultParser's method of selecting a parser so that it:

1) selects non-o.a.t. parsers first
2) respects the order of parsers in the services files

This wouldn't change the behavior, but it would allow users to select parser preference by a means other than relying on reverse alphabetical order.
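
One possible shape for that change, as a rough sketch only: keep the order in which parsers are declared, move non-o.a.t parsers to the front with a stable sort, and return the first parser that claims the type. ServiceOrderSketch, preferenceOrder and select are hypothetical names, parsers are represented by class-name strings, and "first listed wins" is my assumption about what respecting the services-file order would mean; the proposal above does not settle that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ServiceOrderSketch {

    static boolean isTika(String className) {
        return className.startsWith("org.apache.tika.");
    }

    // Non-o.a.t parsers first; within each group the declared order is kept,
    // because Collections.sort is stable and the comparator returns 0 for ties.
    static List<String> preferenceOrder(List<String> declared) {
        List<String> ordered = new ArrayList<String>(declared);
        Collections.sort(ordered, new Comparator<String>() {
            public int compare(String n1, String n2) {
                boolean t1 = isTika(n1);
                boolean t2 = isTika(n2);
                if (t1 == t2) {
                    return 0;
                }
                return t1 ? 1 : -1;
            }
        });
        return ordered;
    }

    // Returns the first parser, in preference order, that supports the type,
    // or null if none does.
    static String select(String type, List<String> declared,
                         Map<String, List<String>> supported) {
        for (String name : preferenceOrder(declared)) {
            List<String> types = supported.get(name);
            if (types != null && types.contains(type)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative type lists and declaration order, not real Tika data.
        Map<String, List<String>> supported = new HashMap<String, List<String>>();
        supported.put("org.apache.tika.parser.jpeg.JpegParser",
                Arrays.asList("image/jpeg"));
        supported.put("org.apache.tika.parser.gdal.GDALParser",
                Arrays.asList("image/jpeg", "image/gif"));

        List<String> declared = Arrays.asList(
                "org.apache.tika.parser.jpeg.JpegParser",
                "org.apache.tika.parser.gdal.GDALParser");
        System.out.println(select("image/jpeg", declared, supported));
    }
}
```

With this ordering, swapping two lines in the declared list is enough to change which o.a.t parser handles a shared type, and any non-o.a.t parser still takes precedence.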



was (Author: tallison@mitre.org):
The AutoDetectParser was doing its regular lookup for which parser supported x file type.  No luck in that.

Now, there is unfortunately something approaching luck in how we're handling the case where multiple parsers support a given file type.  Our current algorithm, if I understand it correctly, is to sort parsers in reverse alphabetical order by their package+class name (with a special case to prefer non-o.a.t parsers) and then pick the first parser that claims that it will parse the given file type.

From the DefaultParser:
{noformat}
        List<Parser> parsers =
                loader.loadStaticServiceProviders(Parser.class);
        Collections.sort(parsers, new Comparator<Parser>() {
            public int compare(Parser p1, Parser p2) {
                String n1 = p1.getClass().getName();
                String n2 = p2.getClass().getName();
                boolean t1 = n1.startsWith("org.apache.tika.");
                boolean t2 = n2.startsWith("org.apache.tika.");
                if (t1 == t2) {
                    return n1.compareTo(n2);
                } else if (t1) {
                    return -1;
                } else {
                    return 1;
                }
            }
        });
{noformat}

and 

{noformat}
        if (loader != null) {
            // Add dynamic parser service (they always override static ones)
            MediaTypeRegistry registry = getMediaTypeRegistry();
            List<Parser> parsers =
                    loader.loadDynamicServiceProviders(Parser.class);
            Collections.reverse(parsers); // best parser last
            for (Parser parser : parsers) {
                for (MediaType type : parser.getSupportedTypes(context)) {
                    map.put(registry.normalize(type), parser);
                }
            }
        }
{noformat}

The "luck" so far is that, for example, the org.apache.tika.parser.gdal.GDALParser (which supports jpeg and gif) happens to sort after the org.apache.tika.parser.jpeg.JpegParser, the org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers.  If you run the GDALParser on "/test-documents/testJPEG_EXIF.jpg", you get no metadata. :(

Depending on what the community thinks, we may want to open a separate issue and change DefaultParser's method of selecting a parser so that it:

1) selects non-o.a.t. parsers first
2) respects the order of parsers in the services files

This wouldn't change the behavior, but it would allow users to select parser preference by a means other than relying on reverse alphabetical order.


> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.


