Posted to dev@tika.apache.org by "Karl Heinz Marbaise (JIRA)" <ji...@apache.org> on 2009/05/21 22:57:45 UTC

[jira] Created: (TIKA-232) Scanning of archive files

Scanning of archive files
-------------------------

                 Key: TIKA-232
                 URL: https://issues.apache.org/jira/browse/TIKA-232
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.3
         Environment: All
            Reporter: Karl Heinz Marbaise
            Priority: Minor


If I parse an archive, all the files inside the archive are extracted along with their text. It would be nice to have the option to extract only the list of files (the directory) of an archive instead of the whole contents. This seems applicable only to zip, tar, tar.gz, tar.bz2, and .jar. Maybe this could be realized through a different call or a runtime configuration.
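
To make the request concrete, here is a minimal sketch of "list only" behaviour for zip and jar files. It uses only java.util.zip from the JDK and is not Tika code; the class name ArchiveListing and the methods listEntries and sampleZip are hypothetical names chosen for illustration. The tar variants are not covered, since the JDK has no tar reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of Tika.
public class ArchiveListing {

    // Collects entry names in archive order. Entry data is skipped,
    // never parsed for text.
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() closes the previous entry, skipping its data.
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Builds a small two-entry zip in memory so the sketch needs no files.
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("docs/readme.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("data.csv"));
            zip.write("a,b\n1,2\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (String name : listEntries(new ByteArrayInputStream(sampleZip()))) {
            System.out.println(name);
        }
    }
}
```

A Tika-level option would presumably report the same names through the parser output rather than returning a list, but that design is left open here.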

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-232) Scanning of archive files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712283#action_12712283 ] 

Jukka Zitting commented on TIKA-232:
------------------------------------

If you're instantiating the package parsers directly, then you can achieve this simply by overriding the parser that is used for the files inside a package:

    PackageParser parser = ...;
    parser.setParser(new EmptyParser());

You could also use the following hack to do this for a pre-configured composite parser like the AutoDetectParser:

    CompositeParser composite = new AutoDetectParser();
    for (Parser parser : composite.getParsers().values()) {
        if (parser instanceof PackageParser) {
            ((PackageParser) parser).setParser(new EmptyParser());
        }
    }

Perhaps someone has a good idea how to make this easier?




[jira] Resolved: (TIKA-232) Scanning of archive files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-232.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

With TIKA-238 resolved, the former case above is now the default:

    Parser parser = new ZipParser();

And the latter case is much simpler:

    TikaConfig config = TikaConfig.getDefaultConfig(); // without a delegate parser
    Parser parser = new AutoDetectParser(config);

Resolving this as a Duplicate of TIKA-238.

