Posted to dev@tika.apache.org by Ian Holsman <li...@holsman.net> on 2007/04/16 09:10:27 UTC

Using Tika/Nutch to analyze a website

Hi.

I was planning on using Nutch and UIMA to perform entity
extraction, and noticed that you mention that Tika would be designed
to do this.

I was wondering how things are going with Tika, as it doesn't seem
like any code or design plans have been checked in (except for the
proposal).

So I would like to spark the discussion.

I would like to:
- use Nutch to fetch the pages (HTML) from the site
- use UIMA to analyze them and extract interesting information
- use MySQL, or possibly HBase, to store versioned/historical output of
  this analysis, for possible further reporting (stats and page
  timelines)

Is Tika going to be able to do this for me?

regards
Ian
--
Ian Holsman
Ian@Holsman.net




Re: Using Tika/Nutch to analyze a website

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 4/16/07, Ian Holsman <li...@holsman.net> wrote:
> I was planning on using Nutch and UIMA to perform entity
> extraction, and noticed that you mention that Tika would be designed
> to do this.
>
> I was wondering how things are going with Tika, as it doesn't seem
> like any code or design plans have been checked in (except for the
> proposal).

Thanks for the interest! As you noticed, we're just getting started
and haven't yet achieved much.

> So I would like to spark the discussion.
>
> I would like to:
> - use Nutch to fetch the pages (HTML) from the site
> - use UIMA to analyze them and extract interesting information
> - use MySQL, or possibly HBase, to store versioned/historical output of
>   this analysis, for possible further reporting (stats and page
>   timelines)
>
> Is Tika going to be able to do this for me?

Certainly not all of it. In this scheme Tika would most naturally fit
as a component used by UIMA to parse the HTML pages. The main benefit
of using Tika instead of a native HTML parser in this case would be
that you could easily extend the application to also analyze other
types of documents, such as PDFs.
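
As a rough illustration of that design (not Tika code, since nothing
has been checked in yet), here is a minimal sketch in Python of a
format-neutral text extraction layer that an analysis step could call.
All names in it (PARSERS, extract_text, parse_html, parse_plain_text)
are hypothetical, and it uses only the Python standard library.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text from HTML, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def parse_html(raw: bytes) -> str:
    """Hypothetical HTML parser: returns the page text without markup."""
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8", errors="replace"))
    collector.close()
    return " ".join(collector.chunks)


def parse_plain_text(raw: bytes) -> str:
    """Hypothetical plain-text parser: decodes and trims the input."""
    return raw.decode("utf-8", errors="replace").strip()


# Hypothetical registry mapping a media type to a parser function.
# Supporting another format (for example PDF) means adding one entry here;
# the code that calls extract_text() does not change.
PARSERS = {
    "text/html": parse_html,
    "text/plain": parse_plain_text,
}


def extract_text(raw: bytes, content_type: str) -> str:
    """Pick a parser by media type and return the extracted text."""
    media_type = content_type.split(";")[0].strip().lower()
    try:
        parser = PARSERS[media_type]
    except KeyError:
        raise ValueError(f"no parser registered for {media_type!r}")
    return parser(raw)
```

The analysis step only ever sees the string returned by
extract_text(), so it does not need to know which document format a
fetched page came in.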

BR,

Jukka Zitting