Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2020/09/30 17:38:00 UTC

[jira] [Commented] (TIKA-3044) add -C/--content cli option using WriteOutContentHandler

    [ https://issues.apache.org/jira/browse/TIKA-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204912#comment-17204912 ] 

Tim Allison commented on TIKA-3044:
-----------------------------------

It has been a while since I looked at this part of the codebase.  To confirm...

-t --text uses the BodyContentHandler, which should only include the <body/> content.
-T --text-main uses the BoilerpipeContentHandler, which relies on heuristics to guess the main content of a page and to strip boilerplate such as navigational sections and ads.  So --text-main returns only the title for those specific html pages on which the BoilerpipeContentHandler fails.  It _should_ return the main content of an html page when it works correctly.

The proposal is to add an option that writes out all of the text, not just the body, using the simple WriteOutContentHandler.
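
To illustrate the difference in scope between a body-only handler and a handler that writes out everything, here is a minimal sketch that uses only the JDK's SAX classes. It does not call Tika, and it is not how Tika implements its handlers: the class HandlerScopeSketch, the two collector classes, and the sample page are hypothetical names invented for this illustration. The assumption is simply that the parser emits XHTML-style SAX events with a title in the head and text in the body.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; these are NOT Tika classes.
class HandlerScopeSketch {

    // Collects every character event (head and body alike), which is the
    // "all text" behaviour the issue asks for.
    static class AllTextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Collects character events only while inside a <body> element,
    // which is the "body only" behaviour described for --text.
    static class BodyTextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }
    }

    // Feeds the given well-formed XHTML string to the handler via the JDK SAX parser.
    private static void parse(String xhtml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
    }

    static String allText(String xhtml) throws Exception {
        AllTextCollector collector = new AllTextCollector();
        parse(xhtml, collector);
        return collector.text.toString();
    }

    static String bodyText(String xhtml) throws Exception {
        BodyTextCollector collector = new BodyTextCollector();
        parse(xhtml, collector);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page: a title in the head, one paragraph in the body.
        String sample = "<html><head><title>Page title</title></head>"
                + "<body><p>Body text</p></body></html>";
        System.out.println("all text:  " + allText(sample));
        System.out.println("body only: " + bodyText(sample));
    }
}
```

With the sample page, the body-only collector drops the title while the all-text collector keeps it, which is the gap the new option is meant to close.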

This makes sense to me.

The current proposal is {{-C}} and {{--content}}.  What would people think of {{-A}} and {{--text-all}}?

> add -C/--content cli option using WriteOutContentHandler
> --------------------------------------------------------
>
>                 Key: TIKA-3044
>                 URL: https://issues.apache.org/jira/browse/TIKA-3044
>             Project: Tika
>          Issue Type: New Feature
>          Components: cli
>            Reporter: Alexander Klimetschek
>            Priority: Major
>
> For text extraction, the cli currently provides both --text and --text-main options. For html files, --text will return the body, while --text-main will only return the title. There is currently no cli option that gives all text content. However, the Tika API has the WriteOutContentHandler which does the trick.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)