Posted to dev@tika.apache.org by Linh Tang <tt...@gmail.com> on 2014/11/03 23:30:46 UTC
Parse Html with Tika
Dear All,
I am Phuong Linh.
I am using Tika to extract content from HTML files for search, but HtmlParser
does not parse all of the HTML tags. (I fetch the HTML pages with Nutch, then use Tika to
extract the important information, and then use Solr for search.)
Can you tell me what I can do to parse all of the HTML tags?
Thanks in advance!
Regards,
Tang Thi Phuong Linh.
--
P.Linh
Re: Parse Html with Tika
Posted by Julien Nioche <li...@gmail.com>.
Hi Linh
You can specify a mapper to control which elements the HTML parser keeps and which it filters out.
See
https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639
for an example.
Julien
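A minimal sketch of the mapper approach, assuming Tika's HtmlParser with IdentityHtmlMapper registered in the ParseContext so that elements are passed through rather than filtered. The class name KeepAllTags, the helper toXhtml, and the sample HTML are made up for illustration, and the linked commit may do this differently. It needs the Tika HTML parser on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical example class, not part of Tika.
public class KeepAllTags {

    // Parses an HTML string and returns the XHTML that Tika emits.
    static String toXhtml(String html) throws Exception {
        ParseContext context = new ParseContext();
        // Replace the default mapper, which keeps only a fixed set of
        // "safe" elements, with one that maps every element to itself.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // Serializes the SAX events back to markup so the tags stay visible.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input chosen for illustration only.
        String html =
            "<html><body><div class=\"x\"><span>hi</span></div></body></html>";
        System.out.println(toXhtml(html));
    }
}
```

If only some extra tags are needed, a custom HtmlMapper implementation can be registered in the same way instead of the identity mapper.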
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
RE: Parse Html with Tika
Posted by Ken Krugler <kk...@transpac.com>.
> From: Linh Tang
> Sent: November 3, 2014 2:30:46pm PST
> To: dev@tika.apache.org
> Subject: Parse Html with Tika
>
> Dear All,
>
> I am Phuong Linh,
> I am using Tika to extract content form Html file to search. But HtmlParser
> cannot parse all tag of Html.
I'm not sure what you mean by "cannot parse all tag of Html".
Do you have an example of an HTML page, and text that isn't being extracted?
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
--------------------------