Posted to dev@tika.apache.org by "Gabriel Valencia (JIRA)" <ji...@apache.org> on 2012/04/27 18:58:53 UTC

[jira] [Created] (TIKA-905) Embedded text boxes and shapes with text not supported

Gabriel Valencia created TIKA-905:
-------------------------------------

             Summary: Embedded text boxes and shapes with text not supported
                 Key: TIKA-905
                 URL: https://issues.apache.org/jira/browse/TIKA-905
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
         Environment: Windows 7
            Reporter: Gabriel Valencia


This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-905) Embedded text boxes and shapes with text not supported

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Valencia updated TIKA-905:
----------------------------------

    Attachment: testPagesEmbeddedJIRA.pages

Contains various embedded objects including text boxes and shapes with text
                
> Embedded text boxes and shapes with text not supported
> ------------------------------------------------------
>
>                 Key: TIKA-905
>                 URL: https://issues.apache.org/jira/browse/TIKA-905
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>         Attachments: testPagesEmbeddedJIRA.pages
>
>
> This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.


[jira] [Commented] (TIKA-905) Embedded text boxes and shapes with text not supported

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265991#comment-13265991 ] 

Gabriel Valencia commented on TIKA-905:
---------------------------------------

Check out my comment in TIKA-904. The text boxes are all contained in sl:document -> sl:drawables -> sl:page-group (1 or more) -> sf:drawable-shape (1 or more) -> sf:text -> sf:text-storage -> sf:text-body -> sf:p.

You get one sf:drawable-shape for each text box.
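
The element path above could be walked roughly as follows. This is only a minimal sketch, not Tika's actual parser code: the class name TextBoxSketch, the method extractTextBoxes, and the sample XML with its placeholder namespace URIs are all made up for illustration, and the document XML is assumed to have been read into a string already. For brevity it matches sf:p anywhere beneath each sf:drawable-shape instead of checking every intermediate element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika.
final class TextBoxSketch {

    // Made-up sample; the namespace URIs are placeholders, not the real iWork ones.
    static final String SAMPLE =
        "<sl:document xmlns:sl='urn:example:sl' xmlns:sf='urn:example:sf'>"
        + "<sl:drawables><sl:page-group>"
        + "<sf:drawable-shape><sf:text><sf:text-storage><sf:text-body>"
        + "<sf:p>First box</sf:p>"
        + "</sf:text-body></sf:text-storage></sf:text></sf:drawable-shape>"
        + "<sf:drawable-shape><sf:text><sf:text-storage><sf:text-body>"
        + "<sf:p>Second box</sf:p><sf:p>second line</sf:p>"
        + "</sf:text-body></sf:text-storage></sf:text></sf:drawable-shape>"
        + "</sl:page-group></sl:drawables></sl:document>";

    // Returns one string per sf:drawable-shape, with its sf:p paragraphs
    // joined by newlines.
    static List<String> extractTextBoxes(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Not namespace-aware, so elements can be matched by their
            // prefixed names ("sf:p") exactly as written in the file.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> boxes = new ArrayList<>();
            NodeList shapes = doc.getElementsByTagName("sf:drawable-shape");
            for (int i = 0; i < shapes.getLength(); i++) {
                Element shape = (Element) shapes.item(i);
                NodeList paragraphs = shape.getElementsByTagName("sf:p");
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < paragraphs.getLength(); j++) {
                    if (j > 0) {
                        text.append('\n');
                    }
                    text.append(paragraphs.item(j).getTextContent().trim());
                }
                boxes.add(text.toString());
            }
            return boxes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        for (String box : extractTextBoxes(SAMPLE)) {
            System.out.println("Text box: " + box);
        }
    }
}
```

A streaming (SAX) handler would probably suit large documents better; DOM is used here only to keep the sketch short.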
                
> Embedded text boxes and shapes with text not supported
> ------------------------------------------------------
>
>                 Key: TIKA-905
>                 URL: https://issues.apache.org/jira/browse/TIKA-905
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>         Attachments: testPagesEmbeddedJIRA.pages
>
>
> This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.


[jira] [Commented] (TIKA-905) Embedded text boxes and shapes with text not supported

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264129#comment-13264129 ] 

Nick Burch commented on TIKA-905:
---------------------------------

Are you able to identify where in the file these text boxes occur, and what sort of tags hold the text? If the text boxes don't occur in the main text area, can you identify how to link back from the main text to the text box? (You might find it helpful to review how annotations are handled, which we now support as of r1331640, for an idea of how this might work.)
                
> Embedded text boxes and shapes with text not supported
> ------------------------------------------------------
>
>                 Key: TIKA-905
>                 URL: https://issues.apache.org/jira/browse/TIKA-905
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>         Attachments: testPagesEmbeddedJIRA.pages
>
>
> This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.


[jira] [Updated] (TIKA-905) Embedded text boxes and shapes with text not supported

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Valencia updated TIKA-905:
----------------------------------

    Issue Type: Improvement  (was: Bug)

I'm new to JIRA, so please change if I'm wrong. I figure this should be an improvement, not a bug.
                
> Embedded text boxes and shapes with text not supported
> ------------------------------------------------------
>
>                 Key: TIKA-905
>                 URL: https://issues.apache.org/jira/browse/TIKA-905
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iwork
>         Attachments: testPagesEmbeddedJIRA.pages
>
>
> This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.


[jira] [Updated] (TIKA-905) Embedded text boxes and shapes with text not supported

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Valencia updated TIKA-905:
----------------------------------

    Labels: iwork  (was: )
    
> Embedded text boxes and shapes with text not supported
> ------------------------------------------------------
>
>                 Key: TIKA-905
>                 URL: https://issues.apache.org/jira/browse/TIKA-905
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iwork
>         Attachments: testPagesEmbeddedJIRA.pages
>
>
> This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.


[jira] [Resolved] (TIKA-905) Embedded text boxes and shapes with text not supported

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-905.
-------------------------------------

       Resolution: Duplicate
    Fix Version/s: 1.2

Looks like this was fixed with TIKA-904.
                
> Embedded text boxes and shapes with text not supported
> ------------------------------------------------------
>
>                 Key: TIKA-905
>                 URL: https://issues.apache.org/jira/browse/TIKA-905
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesEmbeddedJIRA.pages
>
>
> This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.
