Posted to dev@tika.apache.org by "Karl Heinz Marbaise (JIRA)" <ji...@apache.org> on 2009/05/21 22:53:45 UTC

[jira] Created: (TIKA-231) Difference between Web-Site and real working code

Difference between Web-Site and real working code
-------------------------------------------------

                 Key: TIKA-231
                 URL: https://issues.apache.org/jira/browse/TIKA-231
             Project: Tika
          Issue Type: Bug
          Components: documentation
    Affects Versions: 0.3
         Environment: All
            Reporter: Karl Heinz Marbaise
            Priority: Minor


The official web site states that OpenOffice files are not scanned (more precisely, their support is marked "TODO"), but if I scan a tar.gz / zip archive containing OpenOffice files, their contents are extracted. So I think the web site should be updated to document the actual state of the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-231) Difference between Web-Site and real working code

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715273#action_12715273 ] 

Uwe Schindler commented on TIKA-231:
------------------------------------

This statement in your commit is incorrect:
"The older sxc, sxw formats (OpenOffice 1.0) are not supported."
They are supported!



[jira] Commented: (TIKA-231) Difference between Web-Site and real working code

Posted by "Karl Heinz Marbaise (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711802#action_12711802 ] 

Karl Heinz Marbaise commented on TIKA-231:
------------------------------------------

I have observed that I can parse OpenOffice .odp files as well and get a result, so this should be documented too.



[jira] Commented: (TIKA-231) Difference between Web-Site and real working code

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711819#action_12711819 ] 

Uwe Schindler commented on TIKA-231:
------------------------------------

Yes, ODP and other StarOffice/OpenDocument files have worked since TIKA-172; even basic formatting and tables are extracted and written to the XHTML SAX stream.



[jira] Resolved: (TIKA-231) Difference between Web-Site and real working code

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-231.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Patch applied in revision 780831. Thanks!



[jira] Commented: (TIKA-231) Difference between Web-Site and real working code

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712306#action_12712306 ] 

Uwe Schindler commented on TIKA-231:
------------------------------------

sxw & co. files from OpenOffice 1.0 are supported (that is, the pre-release form of OpenDocument that uses the other, Sun-specific namespaces). The mapping is done with a SAX filter that rewrites the outdated namespaces to the new ones.
The only problem at the moment is mime-types.conf, which detects only sxw (the other signatures should be added soon). My idea would be to use an internal catch-all MIME type (like for office) for all OpenDocument types. When I am back home, I will prepare a patch.
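
To illustrate the namespace-rewriting idea mentioned above, here is a minimal sketch using only the JDK's SAX classes. It is not Tika's actual code: the class name NamespaceRewriteSketch, the helper elementNames, and the two-entry mapping table are assumptions made for this example, and attribute namespaces are left untouched for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch, not Tika's implementation.
class NamespaceRewriteSketch {

    // Illustrative subset of old (OpenOffice 1.0) -> new (OpenDocument) namespaces.
    static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put("http://openoffice.org/2000/office",
                "urn:oasis:names:tc:opendocument:xmlns:office:1.0");
        MAPPING.put("http://openoffice.org/2000/text",
                "urn:oasis:names:tc:opendocument:xmlns:text:1.0");
    }

    // Returns the replacement namespace, or the input if it is not outdated.
    static String map(String uri) {
        return MAPPING.getOrDefault(uri, uri);
    }

    // SAX filter that swaps outdated namespace URIs before events reach the handler.
    static class NamespaceRewriteFilter extends XMLFilterImpl {
        NamespaceRewriteFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startPrefixMapping(String prefix, String uri) throws SAXException {
            super.startPrefixMapping(prefix, map(uri));
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            super.startElement(map(uri), localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            super.endElement(map(uri), localName, qName);
        }
    }

    // Parses the XML through the filter and returns "{namespace}localName" per element.
    static List<String> elementNames(String xml) {
        final List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            NamespaceRewriteFilter filter = new NamespaceRewriteFilter(reader);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                        Attributes atts) {
                    seen.add("{" + uri + "}" + localName);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String xml = "<office:document"
                + " xmlns:office=\"http://openoffice.org/2000/office\""
                + " xmlns:text=\"http://openoffice.org/2000/text\">"
                + "<text:p>Hello</text:p></office:document>";
        System.out.println(elementNames(xml));
    }
}
```

The downstream handler only ever sees the new namespaces, so a parser written for OpenDocument content can, under these assumptions, be reused for the older files without changes.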



[jira] Commented: (TIKA-231) Difference between Web-Site and real working code

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715278#action_12715278 ] 

Jukka Zitting commented on TIKA-231:
------------------------------------

Thanks, I updated the documentation accordingly.



[jira] Commented: (TIKA-231) Difference between Web-Site and real working code

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712280#action_12712280 ] 

Jukka Zitting commented on TIKA-231:
------------------------------------

Good point. Do you have a patch for this? The site sources are in src/site/apt within Tika trunk.



[jira] Updated: (TIKA-231) Difference between Web-Site and real working code

Posted by "Karl Heinz Marbaise (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Heinz Marbaise updated TIKA-231:
-------------------------------------

    Attachment: TIKA-231.patch

Please take a look at the text and review it.
