Posted to dev@tika.apache.org by Linh Tang <tt...@gmail.com> on 2014/11/03 23:30:46 UTC

Parse Html with Tika

Dear All,

I am Phuong Linh,
I am using Tika to extract content from HTML files for search. But HtmlParser
cannot parse all tag of Html. (I fetch the HTML pages with Nutch, then use Tika
to extract the important information, and then use Solr to search.)
Can you tell me what I can do to parse all the tags in the HTML?

Thanks in advance!

Regards,
Tang Thi Phuong Linh.
-- 
P.Linh

Re: Parse Html with Tika

Posted by Julien Nioche <li...@gmail.com>.
Hi Linh

You can specify a mapper to control what the html parser will filter or not.

see
https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639
for an example
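
For illustration only (this is not taken from the linked commit): a minimal sketch of passing a mapper to the HTML parser through the ParseContext. It assumes the Tika parser jars are on the classpath and uses IdentityHtmlMapper, which passes elements through instead of restricting them to the default mapper's "safe" set. The class name KeepAllTagsExample and the sample markup are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical example class; requires Apache Tika on the classpath.
public class KeepAllTagsExample {

    // Parses an HTML string and returns the XHTML that Tika emits.
    static String toXhtml(String html) throws Exception {
        ParseContext context = new ParseContext();
        // Replace the default mapper with one that does not filter elements.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // Serializes the SAX events produced by the parser back to markup.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch.
        String html = "<html><body><div class=\"note\">"
                + "<span>hello</span></div></body></html>";
        System.out.println(toXhtml(html));
    }
}
```

If only some extra tags are needed, a custom HtmlMapper implementation set in the same way would give finer control than the identity mapper.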

Julien



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

RE: Parse Html with Tika

Posted by Ken Krugler <kk...@transpac.com>.
> From: Linh Tang
> Sent: November 3, 2014 2:30:46pm PST
> To: dev@tika.apache.org
> Subject: Parse Html with Tika
> 
> Dear All,
> 
> I am Phuong Linh,
> I am using Tika to extract content form Html file to search. But HtmlParser
> cannot parse all tag of Html.  

I'm not sure what you mean by "cannot parse all tag of Html".

Do you have an example of an HTML page, and text that isn't being extracted?

-- Ken

> ( I get Html page by Nutch, then use Tika to
> extract the important information, after then use Solr to search.)
> Can you tell me what i can do to parse all tag of html.
> 
> Thanks advance!
> 
> Regards,
> Tang Thi Phuong Linh.
> -- 
> P.Linh

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




