Posted to dev@nutch.apache.org by Nancy Sharma <na...@gmail.com> on 2015/02/26 04:52:48 UTC

tika to parse url data content

Hello,

I have crawled a webpage as part of my assignment (CS572). I have the
segment folder with the URL metadata and data (parsed and otherwise).

I have also merged all the segments and dumped them into an output file.

This dump file, when opened in a text editor, contains some parsed content
and some encoded content, such as special characters, which is actually
data from that URL.

The problem is that I am not clear on how to use Tika here. Please help.

Thanks
Nancy

Re: tika to parse url data content

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Nancy,

Tika is what put the metadata into the parsed content
in the file you are looking at. See the parse-tika
plugin. You don’t need to use Tika beyond the
information that is already in your crawled data.
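
To illustrate that the parsed text is already present in the dump, here is a
minimal Python sketch that pulls the parsed-text section out of each record.
It assumes the dump is the plain-text output of a segment dump (for example
from `bin/nutch readseg -dump`) and that records and sections are introduced
by `Recno::`, `URL::`, `Content::`, `ParseData::` and `ParseText::` marker
lines. That layout, the function name `extract_parse_text` and the sample
data are assumptions for illustration; check them against your own dump file
before relying on this.

```python
import re

# Marker lines ASSUMED to appear in the text dump; adjust to match your file.
RECORD_RE = re.compile(r"^Recno::\s*\d+\s*$")
URL_RE = re.compile(r"^URL::\s*(.+)$")
SECTION_RE = re.compile(r"^(CrawlDatum|Content|ParseData|ParseText)::\s*$")


def extract_parse_text(dump_text):
    """Return {url: parsed text} for each record that has a ParseText section.

    Limitation of this sketch: a line inside the parsed text that happens to
    look like a marker (e.g. "Recno:: 3") will be treated as a marker.
    """
    collected = {}
    url = None      # URL of the record currently being read
    section = None  # name of the section currently being read

    for line in dump_text.splitlines():
        if RECORD_RE.match(line):
            # A new record starts: forget the previous URL and section.
            url, section = None, None
            continue

        url_match = URL_RE.match(line)
        if url_match and section is None:
            # The record's URL line comes before any section marker.
            url = url_match.group(1).strip()
            continue

        section_match = SECTION_RE.match(line)
        if section_match:
            section = section_match.group(1)
            continue

        # Keep only lines that belong to the parsed-text section.
        if section == "ParseText" and url is not None:
            collected.setdefault(url, []).append(line)

    return {u: "\n".join(body).strip() for u, body in collected.items()}


# Hypothetical sample data in the assumed layout.
SAMPLE = """\
Recno:: 0
URL:: http://example.com/

Content::
Version: -1
url: http://example.com/
contentType: text/html
Content:
<html><body><h1>Hello</h1></body></html>

ParseData::
Status: success(1,0)
Title: Hello

ParseText::
Hello

Recno:: 1
URL:: http://example.com/about

ParseText::
About this site
"""

if __name__ == "__main__":
    for page_url, text in extract_parse_text(SAMPLE).items():
        print(page_url, "->", text)
```

The raw bytes under `Content::` are skipped on purpose: they are the
"encoded content" seen in the text editor, while the `ParseText::` section
holds the text the parser already extracted.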

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







