Posted to user@tika.apache.org by imyuka <oc...@163.com> on 2014/10/09 14:22:45 UTC
Formatted Content Extraction and Title Detection
Hi all,
Here is my problem: I have extracted plain text from a series of doc(x) documents, along with their titles via the "dc:title" metadata field, but I'm not sure this is the right way to obtain a document's title. In many cases the title inside a document is the text with the largest font size and bold styling, which I would like to use to extract the actual title. However, I have no idea how to get formatted content or detect font size and bold styling. Please let me know if I am missing something.
Thank you very much!
Re:Re: Formatted Content Extraction and Title Detection
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 9 Oct 2014, imyuka wrote:
> I skimmed the book "Tika in Action" and found instructions for
> converting a document to an XHTML file from the command line, but I
> have no idea how to do this in Java code. Are there any instructions
> or tutorials I can refer to?
We have quite a few examples, available in svn:
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example
You probably want ContentHandlerExample.parseToHTML()
Longer term, we're planning to get those automatically included in the
website, along with supporting text. See
http://tika.apache.org/1.7/examples.html for the WIP on that
Nick
Re:Re: Formatted Content Extraction and Title Detection
Posted by imyuka <oc...@163.com>.
Thanks Nick, I really appreciate it. In this case, does that mean formatted content extraction can only be done by producing the corresponding XHTML as output? I skimmed the book "Tika in Action" and found instructions for converting a document to an XHTML file from the command line, but I have no idea how to do this in Java code. Are there any instructions or tutorials I can refer to?
Thanks!
At 2014-10-09 20:46:01, "Nick Burch" <ap...@gagravarr.org> wrote:
>On Thu, 9 Oct 2014, imyuka wrote:
>> Here is my problem: I have extracted plain text from a series of
>> doc(x) documents, along with their titles via the "dc:title" metadata
>> field, but I'm not sure this is the right way to obtain a document's
>> title. In many cases the title inside a document is the text with the
>> largest font size and bold styling, which I would like to use to
>> extract the actual title. However, I have no idea how to get formatted
>> content or detect font size and bold styling.
>
>If it's been styled as a heading, then you'll be able to get that from the
>html contents. If in Word it's styled as normal body text, but manually
>set to a larger font size, then there's nothing in Tika to help with that.
>
>Nick
Re: Formatted Content Extraction and Title Detection
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 9 Oct 2014, imyuka wrote:
> Here is my problem: I have extracted plain text from a series of
> doc(x) documents, along with their titles via the "dc:title" metadata
> field, but I'm not sure this is the right way to obtain a document's
> title. In many cases the title inside a document is the text with the
> largest font size and bold styling, which I would like to use to
> extract the actual title. However, I have no idea how to get formatted
> content or detect font size and bold styling.
If it's been styled as a heading, then you'll be able to get that from the
html contents. If in Word it's styled as normal body text, but manually
set to a larger font size, then there's nothing in Tika to help with that.
Nick
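As a rough illustration of the point above about headings being recoverable from the HTML contents, here is a minimal sketch that collects the text of h1..h6 elements from an XHTML string using only the JDK's SAX parser. The class name HeadingExtractor, the method headings(), and the sample XHTML in main() are hypothetical and hand-written for this example; they are not Tika code or real Tika output. It assumes the XHTML has already been produced (for instance by something like the parseToHTML() example mentioned earlier in the thread) and is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: pulls heading text out of XHTML.
public class HeadingExtractor {

    /** Returns the text of every h1..h6 element, in document order. */
    public static List<String> headings(String xhtml) throws Exception {
        List<String> found = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Namespace awareness makes localName a plain "h1" even when the
        // document declares the XHTML namespace.
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)),
                new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a heading

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (localName.matches("h[1-6]")) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && localName.matches("h[1-6]")) {
                    found.add(current.toString().trim());
                    current = null;
                }
            }
        });
        return found;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample input (an assumption, not real Tika output).
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Annual Report</h1><p><b>Big bold body text</b></p>"
                + "<h2>Summary</h2></body></html>";
        System.out.println(headings(xhtml)); // prints [Annual Report, Summary]
    }
}
```

Note that the bold paragraph in the sample is not picked up. This matches the limitation described above: text that is only made to look like a title by manual formatting carries no heading element to find.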