Posted to dev@tika.apache.org by "Albert L. (Created) (JIRA)" <ji...@apache.org> on 2011/12/19 19:49:32 UTC
[jira] [Created] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Make Option to Exclude Embedded Files' Text for Text Content
------------------------------------------------------------
Key: TIKA-819
URL: https://issues.apache.org/jira/browse/TIKA-819
Project: Tika
Issue Type: New Feature
Components: general
Affects Versions: 1.0
Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
Fix For: 1.1
It would be nice to be able to disable text content from embedded files.
For example, if I have a DOCX with an embedded PPTX, then I would like the option to disable text from the PPTX from showing up when asking for the text content from DOCX. In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-819:
-----------------------------------
Fix Version/s: 1.2 (was: 1.1)
- push out to 1.2
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-819:
-----------------------------------
Fix Version/s: 1.3 (was: 1.2)
- push to 1.3
[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172825#comment-13172825 ]
Nick Burch commented on TIKA-819:
---------------------------------
You have to explicitly ask for embedded files to be parsed, by supplying a Parser in the ParseContext object.
If you don't want recursion, don't supply the parser!
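
To make the mechanism concrete, here is a minimal, self-contained sketch of the pattern described above. It does not use Tika itself: Doc, ToyParser, ToyContext and ContainerParser are hypothetical stand-ins invented for illustration, and only the idea (the container parser recurses into embedded documents only when a parser has been registered in the context) is taken from the comment. With the real library the corresponding step would be a call such as context.set(Parser.class, parser) on a ParseContext before parsing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedRecursionSketch {

    /** Hypothetical stand-in for a document: its own text plus embedded documents. */
    static final class Doc {
        final String text;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }

        Doc embed(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    /** Hypothetical stand-in for a parser (not Tika's Parser interface). */
    interface ToyParser {
        void parse(Doc doc, StringBuilder out, ToyContext context);
    }

    /** Hypothetical stand-in for a parse context: a map from type to instance. */
    static final class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when nothing was registered
        }
    }

    /** Emits the container's own text, then recurses only if a parser was supplied. */
    static final class ContainerParser implements ToyParser {
        @Override
        public void parse(Doc doc, StringBuilder out, ToyContext context) {
            out.append(doc.text).append('\n');
            ToyParser embeddedParser = context.get(ToyParser.class);
            if (embeddedParser == null) {
                return; // no parser in the context: embedded documents are skipped
            }
            for (Doc child : doc.embedded) {
                embeddedParser.parse(child, out, context);
            }
        }
    }

    /** Extracts text; recursion is enabled purely by registering the parser. */
    static String extract(Doc doc, boolean recurse) {
        ToyParser parser = new ContainerParser();
        ToyContext context = new ToyContext();
        if (recurse) {
            context.set(ToyParser.class, parser);
        }
        StringBuilder out = new StringBuilder();
        parser.parse(doc, out, context);
        return out.toString();
    }

    public static void main(String[] args) {
        Doc docx = new Doc("docx body").embed(new Doc("pptx slide text"));
        System.out.print(extract(docx, false)); // container text only
        System.out.print(extract(docx, true));  // container text, then embedded text
    }
}
```

The only difference between the two calls is whether the parser was placed in the context, which is the switch the comment refers to.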
[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173225#comment-13173225 ]
Albert L. commented on TIKA-819:
--------------------------------
Oh, I see. Could this be a command-line option when using the Tika JAR?
[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Posted by "Albert L. (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174121#comment-13174121 ]
Albert L. commented on TIKA-819:
--------------------------------
I think that, by default, retrieving the text content should be recursive and deep. An optional command-line argument would switch Tika to a "cursory" text content retrieval. Hence, I suggest the following.
-c or --cursory Output cursory content (does not recursively retrieve content from embedded/attached files)
Thanks!
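
As a rough illustration of the proposal above, the following sketch parses the suggested -c / --cursory flag with plain Java. The class and field names (CursoryOptionSketch, Options, recurse, inputs) are hypothetical, and nothing here is taken from the actual Tika command-line code; the flag names are only the ones suggested in this comment. The default stays recursive, and the flag merely records that recursion should be turned off.

```java
import java.util.ArrayList;
import java.util.List;

public class CursoryOptionSketch {

    /** Hypothetical holder for the parsed command line. */
    static final class Options {
        boolean recurse = true; // default: recursive and deep, as proposed
        final List<String> inputs = new ArrayList<>();
    }

    /** Treats -c or --cursory as the switch; every other argument is an input. */
    static Options parseArgs(String[] args) {
        Options options = new Options();
        for (String arg : args) {
            if (arg.equals("-c") || arg.equals("--cursory")) {
                options.recurse = false;
            } else {
                options.inputs.add(arg);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Options options = parseArgs(args);
        // Assumption: in a Tika-based tool, options.recurse would decide whether
        // a Parser is placed in the ParseContext before parsing each input.
        System.out.println("recurse=" + options.recurse + " inputs=" + options.inputs);
    }
}
```

Keeping the flag as a single boolean means the rest of the tool only has to decide, in one place, whether to supply the embedded-document parser.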
[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173715#comment-13173715 ]
Nick Burch commented on TIKA-819:
---------------------------------
Probably. Can you think up a suitable short and long form name for the option, and appropriate help for --help?
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-819:
-----------------------------------
- push to 1.3