Posted to hdfs-user@hadoop.apache.org by yeshwanth kumar <ye...@gmail.com> on 2014/06/11 12:48:41 UTC
tika parser is not parsing the BytesWritable in mapreduce
I am writing a MapReduce job that takes a zip file as input. The zip file
contains different types of documents, such as docx, odt, pdf and txt.
I am using the Tika parser to parse the documents.
Here is the code snippet of my mapper method:
public void map(Text key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
------------------------------
------------------------------
    logger.info("Length:\t" + value.getLength());
    byte[] bytesbefore = value.getBytes();
    logger.info("CONTENT BEFORE" + new String(bytesbefore));
    InputStream in = new ByteArrayInputStream(bytesbefore);
    Metadata metadata = new Metadata();
    String mimeType = new Tika().detect(in);
    metadata.set(Metadata.CONTENT_TYPE, mimeType);
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler(value.getLength());
    try {
        parser.parse(in, handler, metadata, new ParseContext());
    } catch (SAXException e1) {
        logger.info(e1.getMessage());
        e1.printStackTrace();
    } catch (TikaException e1) {
        logger.info(e1.getMessage());
        e1.printStackTrace();
    }
    in.close();
    logger.info("Content AFTER" + handler.toString());
------------------------------
}
The output is written to HBase, but the content of the document is empty
after parsing. Am I missing anything here?
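One thing that may be worth ruling out in the mapper above: BytesWritable.getBytes() returns the backing array, which can be longer than getLength(), so a stream built from the whole array may carry trailing padding into detection and parsing. The sketch below uses a plain byte[] and an int as stand-ins for the values returned by getBytes() and getLength(); the class and method names are hypothetical and nothing here depends on Hadoop or Tika. It is only an illustration of limiting the stream to the valid bytes, not a confirmed diagnosis of the empty output.

```java
import java.io.ByteArrayInputStream;

class ValidBytesStream {
    // backing: array as a getBytes()-style call would return it (possibly padded)
    // validLength: number of meaningful bytes, as getLength() would report
    static ByteArrayInputStream open(byte[] backing, int validLength) {
        // Expose only the first validLength bytes of the backing array.
        return new ByteArrayInputStream(backing, 0, validLength);
    }

    public static void main(String[] args) {
        byte[] backing = new byte[8];   // 3 valid bytes followed by 5 bytes of padding
        backing[0] = 'P';
        backing[1] = 'K';
        backing[2] = 3;

        // Whole backing array: 8 bytes are readable, padding included.
        System.out.println(new ByteArrayInputStream(backing).available());
        // Restricted view: only the 3 valid bytes are readable.
        System.out.println(open(backing, 3).available());
    }
}
```

In the mapper this would correspond to passing value.getBytes() and value.getLength() to the three-argument ByteArrayInputStream constructor.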
Re: tika parser is not parsing the BytesWritable in mapreduce
Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 11 Jun 2014, Mattmann, Chris A (3980) wrote:
>> output is written to hbase, content of the document is empty after
>> parsing , am i missing anything here??
Sounds a lot like a query we had earlier this week on something similar:
http://mail-archives.apache.org/mod_mbox/tika-user/201406.mbox/%3CCACQuOSXq3cnrG9HbAkPwom2u2VQDcJQ5H-Qrus_TMJcysZRSWQ%40mail.gmail.com%3E
Problem there was that the user had the Tika Core jar, but had forgotten
to include the Tika Parsers jar + dependencies, so didn't have any parsers
available at runtime.
Could it be the same here?
Nick
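A quick way to test the missing-jar theory is to check, from inside the task JVM, whether a class from each jar can be loaded. This is a minimal sketch: the helper class and method names are hypothetical, and it assumes org.apache.tika.Tika (shipped in Tika Core) and org.apache.tika.parser.pdf.PDFParser (shipped with the parsers) are representative classes to probe.

```java
class TikaClasspathCheck {
    // Returns true if the named class can be located by this class's loader.
    // The class is not initialized, so no static initializers run.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.tika.Tika",                  // from the Tika Core jar
            "org.apache.tika.parser.pdf.PDFParser"   // from the Tika Parsers jar
        };
        for (String name : names) {
            System.out.println(name + " -> " + isOnClasspath(name));
        }
    }
}
```

If the first line reports true and the second false when run with the job's classpath, the parsers jar is probably not reaching the tasks, which would match the symptom described above.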
Re: tika parser is not parsing the BytesWritable in mapreduce
Posted by Julien Nioche <li...@gmail.com>.
I don't know what the issue is here, but the Tika module in Behemoth is a
good example of how to use Tika over MapReduce:
https://github.com/DigitalPebble/behemoth/tree/master/tika
J.
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: tika parser is not parsing the BytesWritable in mapreduce
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Cross-posting to the Tika list for help there too.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++