Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2008/07/30 10:27:32 UTC

[jira] Created: (TIKA-149) Parser for zip files

Parser for zip files
--------------------

                 Key: TIKA-149
                 URL: https://issues.apache.org/jira/browse/TIKA-149
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Jukka Zitting


Tika should be able to parse zip files. The resulting XHTML document should be something like this:

<xhtml>
  <head>...</head>
  <body>
    <div class="file">
        <h1>path/to/file/inside/the/zip</h1>
        ... (parsed contents of the file)
    </div>
    ...
  </body>
</xhtml>
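
A minimal sketch of how the body of such a document could be produced, using only java.util.zip. It is an illustration, not the Tika implementation: the class and method names (ZipOutlineSketch, outline, sampleZip) are hypothetical, the "parsed contents" are approximated by reading each entry as UTF-8 text, and the <p> wrapper around that text is an assumption that the outline above does not specify.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipOutlineSketch {

    public static void main(String[] args) {
        System.out.print(outline(new ByteArrayInputStream(sampleZip())));
    }

    // Walks the zip entries in order and emits one <div class="file"> per file.
    // The caller's stream is read but never closed here.
    static String outline(InputStream in) {
        StringBuilder out = new StringBuilder("<body>\n");
        try {
            ZipInputStream zis = new ZipInputStream(in);
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Placeholder for real parsing: the entry bytes read as UTF-8 text.
                String text = readText(zis);
                out.append("  <div class=\"file\">\n")
                   .append("    <h1>").append(escape(entry.getName())).append("</h1>\n")
                   .append("    <p>").append(escape(text)).append("</p>\n")
                   .append("  </div>\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.append("</body>\n").toString();
    }

    // Reads up to the end of the current zip entry; does not close the stream.
    static String readText(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Escapes the three characters that would break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds a small two-entry zip in memory so the sketch needs no files.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("docs/a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("1 < 2".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A real parser would emit SAX events to a content handler rather than build a string; the string is used here only to keep the sketch self-contained.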

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-149) Parser for zip files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619460#action_12619460 ] 

Jukka Zitting commented on TIKA-149:
------------------------------------

Looks good, though I don't think we need to copy the zip entries to temporary files before parsing. Also, instead of using ParseUtils.getParser, how about using an instance variable and a setter method for the delegate parser?



[jira] Commented: (TIKA-149) Parser for zip files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620217#action_12620217 ] 

Jukka Zitting commented on TIKA-149:
------------------------------------

You can use the CloseShieldInputStream wrapper from commons-io [1] to prevent the zis stream from being closed. Also, note that the general contract of the parse() method is that the parser should _not_ close the stream.

> In Tika what is the preferred approach for setting instance variables like this, via constructors or getters/setters?

I'd use getter/setter methods, with a reasonable default value, in this case the AutoDetectParser.

[1] http://commons.apache.org/io/api-release/org/apache/commons/io/input/CloseShieldInputStream.html
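
A minimal sketch of the close-shield idea, assuming only the JDK. The nested CloseShield class is a simplified, hand-rolled stand-in for the commons-io wrapper (it only swallows close()), and readAndClose is a hypothetical, badly behaved delegate that closes whatever stream it is given.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class CloseShieldSketch {

    public static void main(String[] args) {
        System.out.println(readEntries(sampleZip()));
    }

    // Simplified stand-in for commons-io's CloseShieldInputStream:
    // reads pass through, close() is swallowed.
    static final class CloseShield extends FilterInputStream {
        CloseShield(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Deliberately empty: the wrapped stream must stay open.
        }
    }

    // Hypothetical delegate that ignores the parse() contract and closes its input.
    static String readAndClose(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        in.close();
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Hands each entry to the delegate through the shield, so the
    // ZipInputStream survives and the next entry is still reachable.
    static List<String> readEntries(byte[] zip) {
        List<String> out = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                out.add(entry.getName() + "=" + readAndClose(new CloseShield(zis)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // Two-entry zip built in memory for the demonstration.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With the shield in place, no per-entry copy to a buffer or temporary file is needed just to protect the outer stream.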



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment: TIKA-149.patch

Patch to implement basic Zip Parser.



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment: TIKA-149.patch

Patch for basic Zip Parser... with the actual new file in it this time - it has been a long day ;-)



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment: TIKA-149-II.diff

Update to remove temporary file creation and tidy up imports.



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment: TIKA-149-III.diff

Updated patch to include a getter/setter for the parser.



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment:     (was: TIKA-149.patch)



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment: TIKA-149-II.diff

Update removing temp file creation.



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Comment: was deleted



[jira] Commented: (TIKA-149) Parser for zip files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620181#action_12620181 ] 

Jukka Zitting commented on TIKA-149:
------------------------------------

Thanks for the update!

I was thinking that you could simply pass the "zis" stream to the underlying parser... This way you wouldn't need to spool the content to a temporary buffer or file.

Also, my idea of the delegate parser was that you don't instantiate it inside the parse() method, but instead use an instance variable (that perhaps defaults to AutoDetectParser) that can be set by the client. This way the client has full control over how the zip entries get parsed. The problem with the original ParseUtils.getParser() call with TikaConfig.getDefaultConfig() was that the client couldn't override the parsing mechanism other than by changing the default configuration.
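
A minimal sketch of that arrangement, assuming only the JDK. EntryParser, Utf8TextParser and ZipWalker are hypothetical names: EntryParser is not Tika's Parser interface, and Utf8TextParser merely plays the role that AutoDetectParser would have as the default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DelegatingZipSketch {

    public static void main(String[] args) {
        System.out.println(demo(null));              // default delegate
        System.out.println(demo(in -> "(skipped)")); // client-supplied delegate
    }

    // Hypothetical stand-in for a parser; NOT Tika's Parser interface.
    interface EntryParser {
        String parse(InputStream in) throws IOException;
    }

    // Hypothetical default delegate: reads the entry as UTF-8, never closes the stream.
    static final class Utf8TextParser implements EntryParser {
        @Override
        public String parse(InputStream in) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            for (int n; (n = in.read(chunk)) != -1; ) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    static final class ZipWalker {
        // Instance variable with a reasonable default; clients may replace it.
        private EntryParser delegate = new Utf8TextParser();

        EntryParser getDelegate() {
            return delegate;
        }

        void setDelegate(EntryParser delegate) {
            this.delegate = delegate;
        }

        // Passes the zip stream itself to the delegate for every entry:
        // no temporary buffer, and the caller's stream is left open.
        List<String> parse(InputStream in) throws IOException {
            List<String> results = new ArrayList<>();
            ZipInputStream zis = new ZipInputStream(in);
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                results.add(entry.getName() + ": " + delegate.parse(zis));
            }
            return results;
        }
    }

    // Runs the walker over a sample zip; a null argument keeps the default delegate.
    static List<String> demo(EntryParser override) {
        ZipWalker walker = new ZipWalker();
        if (override != null) {
            walker.setDelegate(override);
        }
        try {
            return walker.parse(new ByteArrayInputStream(sampleZip()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Two-entry zip built in memory for the demonstration.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Because the delegate is chosen per instance rather than looked up from a global default, two walkers in the same process can parse entries differently.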



[jira] Commented: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620011#action_12620011 ] 

Dave Meikle commented on TIKA-149:
----------------------------------

Sorry, I should have just manipulated the stream. I'm not sure about the delegate parser though, as each file may require a different parser. I have just updated the code to use the AutoDetectParser, but if you can see another use case, it can be changed to use a separate setter method for the delegate parser.



[jira] Commented: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620200#action_12620200 ] 

Dave Meikle commented on TIKA-149:
----------------------------------

The problem with using the zis directly is that if the stream is closed during a parse, you cannot access the next file in the ZIP. With the stream manipulation you can.

I see what you mean about the delegate parser. In Tika, what is the preferred approach for setting instance variables like this: via constructors or getters/setters?
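
The stream-closing problem can be reproduced with the JDK alone. A minimal sketch, in which the class and method names are hypothetical and the explicit zis.close() call stands in for a parser that closes its input:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ClosedZipStreamSketch {

    public static void main(String[] args) {
        System.out.println(secondEntryAfterClose(sampleZip()));
    }

    // Opens the first entry, closes the stream, then asks for the second entry.
    static String secondEntryAfterClose(byte[] zip) {
        try {
            ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip));
            zis.getNextEntry();               // position on the first entry
            zis.close();                      // what a closing parser would do
            ZipEntry next = zis.getNextEntry();
            return next == null ? "no more entries" : next.getName();
        } catch (IOException e) {
            // The closed stream refuses to advance to the next entry.
            return "IOException: " + e.getMessage();
        }
    }

    // Two-entry zip built in memory for the demonstration.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

The second entry is never reached: the attempt to advance ends in an IOException instead of returning b.txt.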



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment:     (was: TIKA-149.patch)



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment:     (was: TIKA-149-II.diff)



[jira] Updated: (TIKA-149) Parser for zip files

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-149:
-----------------------------

    Attachment: TIKA-149.patch

Patch to implement a basic Zip Parser



[jira] Resolved: (TIKA-149) Parser for zip files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-149.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2-incubating
         Assignee: Jukka Zitting

Applied the patch (plus the Apache header on ZipParser.java) in revision 692148. Good work, thanks!

> Parser for zip files
> --------------------
>
>                 Key: TIKA-149
>                 URL: https://issues.apache.org/jira/browse/TIKA-149
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.2-incubating
>
>         Attachments: TIKA-149-II.diff, TIKA-149-III.diff, TIKA-149.patch
>
>
> Tika should be able to parse zip files. The resulting XHTML document should be something like this:
> <xhtml>
>   <head>...</head>
>   <body>
>     <div class="file">
>         <h1>path/to/file/inside/the/zip</h1>
>         ... (parsed contents of the file)
>     </div>
>     ...
>   </body>
> </xhtml>
