Posted to dev@tika.apache.org by "Albert L. (Created) (JIRA)" <ji...@apache.org> on 2011/12/19 19:49:32 UTC

[jira] [Created] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Make Option to Exclude Embedded Files' Text for Text Content
------------------------------------------------------------

                 Key: TIKA-819
                 URL: https://issues.apache.org/jira/browse/TIKA-819
             Project: Tika
          Issue Type: New Feature
          Components: general
    Affects Versions: 1.0
         Environment: Windows-7 + JDK 1.6 u26
            Reporter: Albert L.
             Fix For: 1.1


It would be nice to be able to disable text content from embedded files.

For example, if I have a DOCX with an embedded PPTX, then I would like the option to keep the PPTX text from showing up when asking for the text content of the DOCX.  In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-819:
-----------------------------------

    Fix Version/s:     (was: 1.1)
                   1.2

- push out to 1.2
                


        

[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-819:
-----------------------------------

    Fix Version/s:     (was: 1.2)
                   1.3

- push to 1.3
                


        

[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172825#comment-13172825 ] 

Nick Burch commented on TIKA-819:
---------------------------------

You have to explicitly ask for embedded files to be parsed, by supplying a Parser in the ParseContext object.

If you don't want recursion, don't supply the parser!
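
To illustrate the idea, here is a minimal, self-contained sketch of opting in to recursion through a context object. It is a simplified model only: `Context`, `DocParser`, `Doc` and `ContainerParser` are hypothetical stand-ins written for this example, not Tika's real classes or signatures. The point it shows is that the container parser only descends into embedded documents when the caller has registered a parser in the context.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EmbeddedTextDemo {
    // Extracts text from doc; embedded documents are included only when recurse is true.
    static String extract(Doc doc, boolean recurse) {
        DocParser parser = new ContainerParser();
        Context context = new Context();
        if (recurse) {
            // Opt in: register the parser to be used for embedded documents.
            context.set(DocParser.class, parser);
        }
        StringBuilder out = new StringBuilder();
        parser.parse(doc, out, context);
        return out.toString();
    }

    public static void main(String[] args) {
        Doc pptx = new Doc("slide text", List.of());
        Doc docx = new Doc("document text", List.of(pptx));
        System.out.print(extract(docx, true));   // DOCX text and embedded PPTX text
        System.out.print(extract(docx, false));  // DOCX text only
    }
}

// Hypothetical stand-in for a parse context: a registry keyed by type.
final class Context {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key)); // null when nothing was registered
    }
}

// Hypothetical parser interface (not Tika's Parser signature).
interface DocParser {
    void parse(Doc doc, StringBuilder out, Context context);
}

// A document with its own text plus any embedded documents.
final class Doc {
    final String text;
    final List<Doc> embedded;

    Doc(String text, List<Doc> embedded) {
        this.text = text;
        this.embedded = embedded;
    }
}

final class ContainerParser implements DocParser {
    @Override
    public void parse(Doc doc, StringBuilder out, Context context) {
        out.append(doc.text).append('\n');
        // Recurse only if the caller supplied a parser for embedded documents.
        DocParser embeddedParser = context.get(DocParser.class);
        if (embeddedParser == null) {
            return;
        }
        for (Doc child : doc.embedded) {
            embeddedParser.parse(child, out, context);
        }
    }
}
```

With this shape, "don't supply the parser" needs no extra flag in the parser itself: the absence of a registered parser is the switch.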
                


        

[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173225#comment-13173225 ] 

Albert L. commented on TIKA-819:
--------------------------------

Oh, I see.  Could this be a command-line option when using the Tika JAR?
                


        

[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174121#comment-13174121 ] 

Albert L. commented on TIKA-819:
--------------------------------

I think that by default retrieving the text content should be recursive and deep.  An optional command-line argument would set Tika to a "cursory" text content retrieval.  Hence, I suggest the following.

-c  or --cursory        Output cursory content (does not recursively retrieve content from embedded/attached files)


Thanks!
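
For what it's worth, the flag handling proposed above could look roughly like the sketch below. This is only an illustration of the suggested `-c` / `--cursory` switch: the class `CursoryOptionDemo` and its `Options` holder are made up for this example and are not taken from the Tika command-line code.

```java
import java.util.ArrayList;
import java.util.List;

class CursoryOptionDemo {
    // Holds the parsed command line: the cursory flag plus the remaining inputs.
    static final class Options {
        boolean cursory = false;                       // default: recursive, deep extraction
        final List<String> files = new ArrayList<>();
    }

    // Treats "-c" and "--cursory" as the switch; every other argument is an input file.
    static Options parse(String[] args) {
        Options options = new Options();
        for (String arg : args) {
            if (arg.equals("-c") || arg.equals("--cursory")) {
                options.cursory = true;
            } else {
                options.files.add(arg);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Options options = parse(args);
        System.out.println("recurse into embedded files: " + !options.cursory);
        System.out.println("inputs: " + options.files);
    }
}
```

The resulting `cursory` value would then decide whether a parser for embedded documents gets supplied at all.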
                


        

[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173715#comment-13173715 ] 

Nick Burch commented on TIKA-819:
---------------------------------

Probably. Can you think up a suitable short and long form name for the option, and appropriate help for --help?
                


        

[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-819:
-----------------------------------


- push to 1.3
                
