
Posted to user@tika.apache.org by imyuka <oc...@163.com> on 2014/10/09 14:22:45 UTC

Formatted Content Extraction and Title Detection

Hi all,


    Here is my problem: I have extracted plain text from a series of doc(x) documents, along with their titles via the "dc:title" field of the metadata, but I'm not sure this is the right way to obtain a document's title. In many cases the title inside a document is set in the largest font size and in bold, which I would like to use to identify it. However, I have no idea how to get formatted content or how to detect font size and bold style. Please let me know if I am missing something.
    Thank you very much!

Re:Re: Formatted Content Extraction and Title Detection

Posted by Nick Burch <ap...@gagravarr.org>.

On Thu, 9 Oct 2014, imyuka wrote:
> I skimmed the book <Tika in Action> and found instructions for 
> transforming a document into an XHTML file from the command line, but I 
> have no idea how to do the same in Java code. Are there any 
> instructions or tutorials I can refer to?

We have quite a few examples, they're available in svn:
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example

You probably want ContentHandlerExample.parseToHTML()

Longer term, we're planning to get those automatically included in the 
website, along with supporting text. See 
http://tika.apache.org/1.7/examples.html for the WIP on that

Nick

Re:Re: Formatted Content Extraction and Title Detection

Posted by imyuka <oc...@163.com>.

Thanks Nick, I really appreciate it. In this case, does that mean formatted content extraction can only be done by producing a corresponding XHTML file as output? I skimmed the book <Tika in Action> and found instructions for transforming a document into an XHTML file from the command line, but I have no idea how to do the same in Java code. Are there any instructions or tutorials I can refer to?

Thanks!

At 2014-10-09 20:46:01, "Nick Burch" <ap...@gagravarr.org> wrote:
>On Thu, 9 Oct 2014, imyuka wrote:
>> Here is my problem: I have extracted plain text from a series of 
>> doc(x) documents and their titles via the "dc:title" field of metadata, 
>> but I'm not sure this is the right way to obtain the title of a document. 
>> In many cases, a title inside a document could be of the largest 
>> font-size and bold-style, which I want to utilize to extract the very 
>> title, however, I have no idea how to get formatted content and 
>> font-size/bold-style detection
>
>If it's been styled as a heading, then you'll be able to get that from the 
>html contents. If in Word it's styled as normal body text, but manually 
>set to a larger font size, then there's nothing in Tika to help with that.
>
>Nick

Re: Formatted Content Extraction and Title Detection

Posted by Nick Burch <ap...@gagravarr.org>.

On Thu, 9 Oct 2014, imyuka wrote:
> Here is my problem: I have extracted plain text from a series of 
> doc(x) documents and their titles via the "dc:title" field of metadata, 
> but I'm not sure this is the right way to obtain the title of a document. 
> In many cases, a title inside a document could be of the largest 
> font-size and bold-style, which I want to utilize to extract the very 
> title, however, I have no idea how to get formatted content and 
> font-size/bold-style detection

If it's been styled as a heading, then you'll be able to get that from the 
html contents. If in Word it's styled as normal body text, but manually 
set to a larger font size, then there's nothing in Tika to help with that.
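
As a rough illustration of the heading route, here is a minimal sketch of a SAX handler that collects the text of h1 to h6 elements from XHTML output. HeadingCollector and extractHeadings are hypothetical names, not part of Tika, and the code uses only the JDK so that it runs on its own. It assumes the document was styled with real heading styles and that those arrive as h1..h6 elements in the XHTML; manually enlarged body text will not show up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not a Tika class): gathers the text of h1..h6 elements. */
class HeadingCollector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a heading element

    private static boolean isHeading(String name) {
        return name.length() == 2 && name.charAt(0) == 'h'
                && name.charAt(1) >= '1' && name.charAt(1) <= '6';
    }

    // localName is empty when the parser is not namespace-aware; fall back to qName.
    private static String elementName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isHeading(elementName(localName, qName))) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && isHeading(elementName(localName, qName))) {
            headings.add(current.toString().trim());
            current = null;
        }
    }

    List<String> getHeadings() {
        return headings;
    }

    /** Runs the collector over an already-serialized, well-formed XHTML string. */
    static List<String> extractHeadings(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            HeadingCollector collector = new HeadingCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getHeadings();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    // Assumption: with Tika 1.x on the classpath, the same handler could be passed
    // straight to the parser instead of going through a string, roughly:
    //   new AutoDetectParser().parse(stream, collector, new Metadata());
}
```

The first collected heading is only a candidate title; falling back to the "dc:title" metadata value when the list is empty seems a reasonable design, though which one is more reliable depends on how the documents were authored.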

Nick