Posted to dev@tika.apache.org by "Albert L. (Created) (JIRA)" <ji...@apache.org> on 2012/03/09 20:14:59 UTC

[jira] [Created] (TIKA-873) Tika --extract fails for DOC

Tika --extract fails for DOC
----------------------------

                 Key: TIKA-873
                 URL: https://issues.apache.org/jira/browse/TIKA-873
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 1.0
         Environment: Windows 7 + Java v1.6
            Reporter: Albert L.
             Fix For: 1.2


A file that is embedded in a DOC file doesn't get extracted to disk.

To "embed" a file into a DOC, simply drag and drop it into a DOC document in MS Word 2010.  Word will then create an EMF preview of the embedded file.

See this link for an example: http://dl.dropbox.com/u/2490783/embedded.doc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-873) Tika --extract fails for DOC

Posted by "Albert L. (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albert L. updated TIKA-873:
---------------------------

    Attachment: embedded.doc
    

        

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234350#comment-13234350 ] 

Albert L. commented on TIKA-873:
--------------------------------

Thanks, Maxim.
                

        

[jira] [Updated] (TIKA-873) Tika --extract fails for DOC

Posted by "Albert L. (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albert L. updated TIKA-873:
---------------------------

    Description: 
A file that is embedded in a DOC file doesn't get extracted to disk.

To "embed" a file into a DOC, simply drag and drop it into a DOC document in MS Word 2010.  Word will then create an EMF preview of the embedded file.

See attached file "embedded.doc" for an example input file that fails with Tika v1.0.

  was:
A file that is embedded in a DOC file doesn't get extracted to disk.

To "embed" a file into a DOC, simply drag and drop it into a DOC document in MS Word 2010.  Word will then create an EMF preview of the embedded file.

See attached file "embedded.doc.zip" for an example input file that fails with Tika v1.0.

    

        

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226475#comment-13226475 ] 

Albert L. commented on TIKA-873:
--------------------------------

Hi Nick,

With the file I attached to this bug, I get 2 junk files, neither of which is the file I originally embedded.


Albert
                

        

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226476#comment-13226476 ] 

Albert L. commented on TIKA-873:
--------------------------------

Hi Nick,

P.S. I am getting this result with all DOC files I create.


Albert
                

        

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226489#comment-13226489 ] 

Nick Burch commented on TIKA-873:
---------------------------------

What about the test files that ship with Tika (e.g. testWORD_embeded.doc)? Do you get the right data or corrupt data with them?


                

        

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Maxim Valyanskiy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234294#comment-13234294 ] 

Maxim Valyanskiy commented on TIKA-873:
---------------------------------------

The current trunk version extracts the following files:

-rw-r--r-- 1 maxcom consult 1596416 Mar 21 15:46 _1392807349.wps
-rw-r--r-- 1 maxcom consult 1505804 Mar 21 15:46 image1.emf

That EMF file can be opened with LibreOffice without any errors or warnings.
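
For anyone who wants a quick sanity check on extracted files without opening them in an office suite, here is a minimal sketch that inspects the leading bytes. It is not part of Tika: the class name ExtractedFileSniffer and its methods are hypothetical. It assumes only two well-known signatures, the 8-byte OLE2 compound document header and the EMF header record (record type 1 at offset 0, " EMF" at offset 40). A match means the header looks right, not that the whole file is intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not a Tika class): guesses a file type from its first bytes.
class ExtractedFileSniffer {
    // OLE2 compound document signature, used by .doc/.xls/.ppt and many OLE embeddings.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean isOle2(byte[] head) {
        if (head.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (head[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // An EMF file starts with a header record: type 1 (little-endian) at offset 0
    // and the signature bytes " EMF" at offset 40.
    static boolean isEmf(byte[] head) {
        return head.length >= 44
            && head[0] == 1 && head[1] == 0 && head[2] == 0 && head[3] == 0
            && head[40] == 0x20 && head[41] == 'E' && head[42] == 'M' && head[43] == 'F';
    }

    static String sniff(byte[] head) {
        if (isOle2(head)) {
            return "ole2";
        }
        if (isEmf(head)) {
            return "emf";
        }
        return "unknown";
    }

    // Reads up to n bytes from the start of the file.
    static byte[] readHead(Path file, int n) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[n];
            int total = 0;
            int read;
            while (total < n && (read = in.read(buf, total, n - total)) != -1) {
                total += read;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + sniff(readHead(Paths.get(name), 64)));
        }
    }
}
```

Run it with the extracted file names as arguments. A result of "unknown" for every extracted file would be consistent with the "junk files" reported against 1.0, though it does not prove corruption on its own.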
                

       

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226499#comment-13226499 ] 

Albert L. commented on TIKA-873:
--------------------------------

Hi Nick,

"testWORD_embeded.doc" is working.  I get the following:

C:\code\temp>java -jar c:\code\tika-app-1.0.jar -z testWORD_embeded.doc
Extracting 'image1' (image/unknown)
Extracting 'image4.png' (image/png)
Extracting 'image5.jpg' (image/jpeg)
Extracting 'image6.png' (image/png)
Extracting 'image2' (image/unknown)
Extracting 'image3' (image/unknown)
Extracting 'file0.docx' (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
Extracting '_1345471035.ppt' (application/vnd.ms-powerpoint)
Extracting '_1345470949.xls' (application/vnd.ms-excel)
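
As a rough aid for comparing runs like the one above, the following sketch parses the "Extracting '<name>' (<type>)" lines and lists the entries whose type came back as image/unknown. It is a standalone illustration, not Tika code: the class ExtractLogParser is hypothetical, and the line format is assumed from the output shown here rather than from any documented contract.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Tika class): summarizes "Extracting ..." console lines.
class ExtractLogParser {
    // Assumed line shape, taken from the console output above.
    private static final Pattern LINE = Pattern.compile("Extracting '(.*)' \\((.*)\\)");

    // Returns one {name, mediaType} pair per recognized line; other lines are skipped.
    static List<String[]> parse(String log) {
        List<String[]> entries = new ArrayList<>();
        for (String line : log.split("\\r?\\n")) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                entries.add(new String[] { m.group(1), m.group(2) });
            }
        }
        return entries;
    }

    // Names of entries whose media type was reported as image/unknown.
    static List<String> undetected(String log) {
        List<String> names = new ArrayList<>();
        for (String[] entry : parse(log)) {
            if ("image/unknown".equals(entry[1])) {
                names.add(entry[0]);
            }
        }
        return names;
    }
}
```

Feeding it the console output from a run gives a quick list of embedded items whose type was not detected, which can be compared between Tika versions.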



Albert
                

        

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Maxim Valyanskiy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234310#comment-13234310 ] 

Maxim Valyanskiy commented on TIKA-873:
---------------------------------------

Hm, 1.0 extracts something that is not valid.
                

        

[jira] [Commented] (TIKA-873) Tika --extract fails for DOC

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226461#comment-13226461 ] 

Nick Burch commented on TIKA-873:
---------------------------------

Tika has a number of unit tests for the extraction of embedded resources from Word documents, in POIContainerExtractionTest.

Are you having this problem with only some files, or with all of them? Do you get some, all, or none of the embedded resources out?
                

        

[jira] [Resolved] (TIKA-873) Tika --extract fails for DOC

Posted by "Maxim Valyanskiy (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Valyanskiy resolved TIKA-873.
-----------------------------------

    Resolution: Fixed
    

        

[jira] [Updated] (TIKA-873) Tika --extract fails for DOC

Posted by "Albert L. (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albert L. updated TIKA-873:
---------------------------

    Description: 
A file that is embedded in a DOC file doesn't get extracted to disk.

To "embed" a file into a DOC, simply drag and drop it into a DOC document in MS Word 2010.  Word will then create an EMF preview of the embedded file.

See attached file "embedded.doc.zip" for an example input file that fails with Tika v1.0.

  was:
A file that is embedded in a DOC file doesn't get extracted to disk.

To "embed" a file into a DOC, simply drag and drop it into a DOC document in MS Word 2010.  Word will then create an EMF preview of the embedded file.

See this link for an example: http://dl.dropbox.com/u/2490783/embedded.doc

    