Posted to user@tika.apache.org by Chris Mattmann <ch...@gmail.com> on 2015/11/12 14:38:37 UTC

Re: Extraction table from HTML document in Tika

Also take a look at Scrapy and the work that Hyperion
Grey is doing with Splash and Avatar/HH.

Cheers,
Chris

—
Chris Mattmann
chris.mattmann@gmail.com

-----Original Message-----
From: Ken Krugler <kk...@transpac.com>
Reply-To: <us...@tika.apache.org>
Date: Thursday, November 12, 2015 at 10:58 AM
To: <us...@tika.apache.org>
Subject: RE: Extraction table from HTML document in Tika

>There's no (semi)automated method.
>For simple tables you could create a custom ContentHandler that triggers
>off the appropriate HTML tags.
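
A minimal sketch of such a handler, for illustration only. TableHandler and its extract() helper are hypothetical names, not part of Tika. It uses only the JDK's SAX classes and assumes the input arrives as well-formed XHTML events with table, tr, td and th elements, and that tables are simple and not nested:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not a Tika class): collects each <table> as rows of cell text.
// Assumption: tables are not nested; a nested table would discard the outer one.
class TableHandler extends DefaultHandler {
    private final List<List<List<String>>> tables = new ArrayList<>();
    private List<List<String>> table; // table currently open, or null
    private List<String> row;         // row currently open, or null
    private StringBuilder cell;       // cell currently open, or null

    // Namespace-aware sources fill localName; others only fill qName.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String tag = name(localName, qName);
        if (tag.equals("table")) {
            table = new ArrayList<>();
        } else if (tag.equals("tr") && table != null) {
            row = new ArrayList<>();
        } else if ((tag.equals("td") || tag.equals("th")) && row != null) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text inside an open cell is kept.
        if (cell != null) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String tag = name(localName, qName);
        if ((tag.equals("td") || tag.equals("th")) && cell != null) {
            row.add(cell.toString().trim());
            cell = null;
        } else if (tag.equals("tr") && row != null) {
            table.add(row);
            row = null;
        } else if (tag.equals("table") && table != null) {
            tables.add(table);
            table = null;
        }
    }

    // Stand-in driver: feeds well-formed XHTML through the JDK SAX parser.
    static List<List<List<String>>> extract(String xhtml) {
        TableHandler handler = new TableHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.tables;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<html><body><table><tr><th>a</th><th>b</th></tr>"
                + "<tr><td>1</td><td>2</td></tr></table></body></html>"));
    }
}
```

With Tika on the classpath, the same handler could presumably be passed as the ContentHandler argument of a parser's parse(stream, handler, metadata, context) call instead of going through extract(). Whether the table tags reach the handler depends on the HTML mapping in use, so treat that as something to verify.
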
>
>But a general purpose extractor is a serious technical challenge.
>
>Companies like Factual have invested heavily in being able to find &
>extract this type of structured content from web pages.
>
>There are some open source projects out there that could help; I just
>haven't looked recently.
>
>http://blog.import.io/post/get-data-from-html-tables-automatically is an
>example of a commercial solution.
>
>-- Ken
>
>
>________________________________________
>From: Sznajder ForMailingList
> Sent: November 12, 2015 6:49:23am PST
> To: user@tika.apache.org
> Subject: Extraction table from HTML document in Tika
> 
>
>Hi
>
>
>Is there a way to extract tables from an HTML document using Tika?
>
>thanks!
>
>
>Benjamin
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>