Posted to dev@tika.apache.org by Ian Holsman <li...@holsman.net> on 2007/04/16 09:10:27 UTC
Using Tika/Nutch to analyze a website
Hi.
I was planning on using Nutch and UIMA to perform entity
extraction, and noticed that you mention that Tika would be designed
to do this.
I was wondering how things were going with Tika, as it doesn't seem
like any code or design plans have been checked in (except for the
proposal).
So I would like to spark the discussion.
I would like to:
- use Nutch to fetch the pages (HTML) from the site
- use UIMA to analyze them and extract interesting information
- use MySQL, or possibly HBase, to store versioned/historical output of
this analysis for possible further reporting (stats and page
timelines)
Is Tika going to be able to do this for me?
regards
Ian
--
Ian Holsman
Ian@Holsman.net
Re: Using Tika/Nutch to analyze a website
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On 4/16/07, Ian Holsman <li...@holsman.net> wrote:
> I was planning on using Nutch and UIMA to perform entity
> extraction, and noticed that you mention that Tika would be designed
> to do this.
>
> I was wondering how things were going with Tika, as it doesn't seem
> like any code or design plans have been checked in (except for the
> proposal).
Thanks for the interest! As you noticed, we're just getting started
and haven't yet achieved much.
> So I would like to spark the discussion.
>
> I would like to:
> - use Nutch to fetch the pages (HTML) from the site
> - use UIMA to analyze them and extract interesting information
> - use MySQL, or possibly HBase, to store versioned/historical output of
> this analysis for possible further reporting (stats and page
> timelines)
>
> Is Tika going to be able to do this for me?
Certainly not all of it. In this scheme Tika would most naturally fit
as a component used by UIMA to parse the HTML pages. The main benefit
of using Tika instead of a native HTML parser in this case would be
that you could easily extend the application to also analyze other
types of documents, such as PDFs.
BR,
Jukka Zitting
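
The benefit described in the reply above (one parsing component that the
analysis step calls regardless of document type) can be sketched roughly as
follows. This is only an illustration in plain Python using the standard
library; it is not Tika's API, which did not exist yet at the time of this
thread. The names PARSERS, parse_html, parse_plain and extract_text are
hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, Dict


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def parse_html(raw: bytes) -> str:
    # Hypothetical HTML parser: decode and strip markup.
    extractor = _TextExtractor()
    extractor.feed(raw.decode("utf-8", errors="replace"))
    extractor.close()
    return extractor.text()


def parse_plain(raw: bytes) -> str:
    # Hypothetical plain-text parser: just decode.
    return raw.decode("utf-8", errors="replace").strip()


# Registry keyed by media type. Supporting another format (for example
# "application/pdf") would mean registering one more function here; the
# analysis code that calls extract_text() would not change.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    "text/html": parse_html,
    "text/plain": parse_plain,
}


def extract_text(raw: bytes, content_type: str) -> str:
    """Dispatch to the parser registered for the given content type."""
    media_type = content_type.split(";")[0].strip().lower()
    try:
        parser = PARSERS[media_type]
    except KeyError:
        raise ValueError(f"no parser registered for {media_type!r}")
    return parser(raw)
```

In this sketch the fetcher would hand each page's bytes and Content-Type
header to extract_text(), and the analysis step would only ever see plain
text, whatever the original format was.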