Posted to dev@nutch.apache.org by Nancy Sharma <na...@gmail.com> on 2015/02/26 04:52:48 UTC
tika to parse url data content
Hello,
I have crawled a webpage as part of my assignment (CS572). I have the
segment folder with the URL metadata and data (parsed and otherwise).
I have also merged all the segments and dumped them into an output file.
This dump file, when opened in a text editor, contains some parsed content
and some encoded content, such as special characters, that is actually data
from that URL.
The problem is that I am not very clear on how to use Tika here. Please help.
Thanks
Nancy
Re: tika to parse url data content
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Nancy,
Tika is what put the metadata into the parsed content
in the file you are looking at. See the parse-tika
plugin. You don’t need to use Tika beyond the
information that is already in your crawled data.
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
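As a rough illustration of the point above (the parsed output is already in the crawled data), here is a minimal Python sketch that pulls the parsed text out of a segment dump file. It assumes the dump is the plain-text output of the segment reader, with records that start at a "Recno::" line, a "URL::" line, and section headers "CrawlDatum::", "Content::", "ParseData::" and "ParseText::"; that layout may differ between Nutch versions, so check it against your own file. Under that assumption, the "encoded" characters are most likely the raw fetched bytes under "Content::", while "ParseText::" holds what the parser extracted. The function names split_records and parsed_text are made up for this example and are not part of Nutch or Tika.

```python
# Section headers assumed to appear in a text dump of a Nutch segment.
# This layout is an assumption; verify it against your own dump file.
SECTION_HEADERS = ("CrawlDatum::", "Content::", "ParseData::", "ParseText::")


def split_records(dump_text):
    """Split dump text into a list of records.

    Each record is a dict with the record URL and, per section name,
    the list of lines found under that section header.
    """
    records = []
    current = None   # record being filled
    section = None   # name of the section being read, if any
    for line in dump_text.splitlines():
        if line.startswith("Recno::"):
            # A new record begins here.
            current = {"url": None, "sections": {}}
            records.append(current)
            section = None
        elif current is None:
            # Skip anything before the first record.
            continue
        elif line.startswith("URL::") and current["url"] is None:
            current["url"] = line[len("URL::"):].strip()
        elif line.strip() in SECTION_HEADERS:
            # "ParseText::" becomes "ParseText", and so on.
            section = line.strip().rstrip(":")
            current["sections"].setdefault(section, [])
        elif section is not None:
            current["sections"][section].append(line)
    return records


def parsed_text(record):
    """Return the text the parser extracted for one record ('' if none)."""
    return "\n".join(record["sections"].get("ParseText", [])).strip()
```

To use it, read the dump file into a string, call split_records on it, and print record["url"] and parsed_text(record) for each record. The raw bytes stay available under record["sections"]["Content"] if you need them.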