Posted to user@tika.apache.org by "Zhang, Lisheng" <Li...@BroadVision.com> on 2017/02/15 00:35:15 UTC

How to keep all HTML link when doing file content extraction?

Hi, we have been using Tika for some time, and it has been very helpful. Thanks a lot!

So far, when Tika extracts text, it throws away HTML links and keeps only the words. This is good for search indexing, but in a new application we need to keep the whole HTML link
when extracting text from a binary file such as an MS DOC. I could not find a simple way to do that; could you provide a pointer to a suitable API or doc?

Thanks again for the help, Lisheng

RE: How to keep all HTML link when doing file content extraction?

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.

Thanks very much for such timely help. I will study it and test.

-----Original Message-----
From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
Sent: Tue 2/14/2017 5:09 PM
To: user@tika.apache.org
Subject: Re: How to keep all HTML link when doing file content extraction?

Re: How to keep all HTML link when doing file content extraction?

Posted by Ken Krugler <kk...@transpac.com>.

> On Feb 14, 2017, at 4:35pm, Zhang, Lisheng <Li...@broadvision.com> wrote:
> 
> 
> Hi, We have been using TIKA for sometime, which is very helpful, thanks a lot!
> 
> So far when TIKA extracted text, it throws away HTML link and only keep word, this is good for search indexing, but in new application we need to keep whole HTML link
> when extracting text from a binary file like MS DOC, i could not find a simple way to do that, could you provide a pointer to suitable API or doc?

One example is in the Bixo web mining toolkit.

See https://github.com/bixo/bixo/tree/master/src/main/java/bixo/parser for all the related files.

Specifically there’s https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java, which runs the parse in a thread (so that if it hangs it doesn’t kill the Hadoop job).

It calls the Tika parse() method with an org.apache.tika.sax.TeeContentHandler that sends SAX events to the regular content extraction handler, and (typically) the SimpleLinkExtractor class (in the same package).
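
A minimal sketch of the tee idea described above, using only the JDK's SAX classes so it runs without Tika on the classpath. The Tee, TextCollector, and LinkCollector classes are hypothetical stand-ins (for org.apache.tika.sax.TeeContentHandler, the regular content extraction handler, and Bixo's SimpleLinkExtractor respectively); they are not the actual Tika or Bixo code, and the sample XHTML string is made up. With Tika the same shape should apply by passing a TeeContentHandler that wraps the two handlers to parse(), since Tika reports document structure as XHTML SAX events.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeLinkSketch {

    /** Hypothetical stand-in for TeeContentHandler: forwards only the three events used here. */
    static class Tee extends DefaultHandler {
        private final ContentHandler[] delegates;

        Tee(ContentHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (ContentHandler h : delegates) h.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (ContentHandler h : delegates) h.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : delegates) h.characters(ch, start, length);
        }
    }

    /** Collects plain text only, which is what ordinary text extraction keeps. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Collects "href [anchor text]" for each <a> element (hypothetical output format). */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();
        private String href;
        private StringBuilder anchor;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            // The JDK parser is not namespace-aware by default, so the name arrives in qName.
            if ("a".equals(qName)) {
                href = atts.getValue("href");
                anchor = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (anchor != null) anchor.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("a".equals(qName) && anchor != null) {
                if (href != null) links.add(href + " [" + anchor.toString().trim() + "]");
                href = null;
                anchor = null;
            }
        }
    }

    /** Parses well-formed XHTML once and sends every event to all handlers. */
    static void parse(String xhtml, ContentHandler... handlers) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new Tee(handlers));
    }

    public static void main(String[] args) throws Exception {
        TextCollector text = new TextCollector();
        LinkCollector links = new LinkCollector();
        parse("<html><body><p>See the <a href=\"http://tika.apache.org/\">Tika site</a>"
                + " for docs.</p></body></html>", text, links);
        System.out.println(text.text);
        System.out.println(links.links);
    }
}
```

The point of the tee is that the document is parsed only once: the text handler and the link handler each see the same event stream and keep what they care about.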

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr