Posted to hdfs-user@hadoop.apache.org by yeshwanth kumar <ye...@gmail.com> on 2014/06/11 12:48:41 UTC

tika parser is not parsing the BytesWritable in mapreduce

I am writing a MapReduce job that takes a zip file as input; the zip file
contains different types of documents, such as docx, odt, pdf, and txt.

I am using the Tika parser to parse the documents.

Here's the relevant snippet of my mapper method:

public void map(Text key, BytesWritable value, Context context)
        throws IOException, InterruptedException {

    ------------------------------

    ------------------------------

        logger.info("Length:\t" + value.getLength());
        byte[] bytesbefore = value.getBytes();
        logger.info("CONTENT BEFORE" + new String(bytesbefore));

        InputStream in = new ByteArrayInputStream(bytesbefore);
        Metadata metadata = new Metadata();
        String mimeType = new Tika().detect(in);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);

        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler(value.getLength());
        try {
            parser.parse(in, handler, metadata, new ParseContext());
        } catch (SAXException e1) {
            logger.info(e1.getMessage());
            e1.printStackTrace();
        } catch (TikaException e1) {
            logger.info(e1.getMessage());
            e1.printStackTrace();
        }
        in.close();
        logger.info("Content AFTER" + handler.toString());

    ------------------------------
}

The output is written to HBase, but the content of the document is empty
after parsing. Am I missing anything here?
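[Editorial note: one thing worth checking in the code above, independent of the parser setup, is that BytesWritable.getBytes() returns the whole backing array, which is usually longer than the valid data, so trailing padding bytes get fed to Tika unless the array is trimmed to getLength(). A minimal sketch of the pitfall, using a plain byte array in place of a real BytesWritable; the class name and the sample payload below are illustrative, not from this thread:]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: mimics the BytesWritable backing-array pitfall
// without depending on Hadoop.
public class BytesWritableTrim {

    // Equivalent of Arrays.copyOf(value.getBytes(), value.getLength()):
    // keep only the valid prefix of the backing array.
    static byte[] validBytes(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    public static void main(String[] args) {
        // A backing array with spare capacity, as getBytes() can return.
        byte[] backing = new byte[16];
        byte[] payload = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(payload, 0, backing, 0, payload.length);

        // Feeding the whole array would pass 11 trailing zero bytes
        // on to the parser along with the real content.
        System.out.println(backing.length);  // 16

        // Trimming to the valid length recovers the real payload.
        byte[] trimmed = validBytes(backing, payload.length);
        System.out.println(new String(trimmed, StandardCharsets.US_ASCII));  // %PDF-
    }
}
```

In the mapper above, the equivalent would be `Arrays.copyOf(value.getBytes(), value.getLength())` in place of the bare `value.getBytes()`.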

Re: tika parser is not parsing the BytesWritable in mapreduce

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 11 Jun 2014, Mattmann, Chris A (3980) wrote:
>> output is written to hbase, content of the document is empty after 
>> parsing , am i missing anything here??

Sounds a lot like a query we had earlier this week on something similar:
http://mail-archives.apache.org/mod_mbox/tika-user/201406.mbox/%3CCACQuOSXq3cnrG9HbAkPwom2u2VQDcJQ5H-Qrus_TMJcysZRSWQ%40mail.gmail.com%3E

Problem there was that the user had the Tika Core jar, but had forgotten 
to include the Tika Parsers jar + dependencies, so didn't have any parsers 
available at runtime.

Could it be the same here?

Nick
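[Editorial note: the split Nick describes means tika-core alone can detect types but has no parser implementations, so AutoDetectParser silently extracts nothing. A typical Maven setup pulls in both artifacts; the coordinates are the real Tika artifact IDs, but the version below is only an example from that era and should match whatever the job actually uses:]

```xml
<!-- tika-core alone gives type detection but no real parsers;
     tika-parsers brings in the parser implementations and their
     transitive dependencies. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.5</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5</version>
</dependency>
```

For a MapReduce job, these also need to reach the task classpath, e.g. by building a fat jar or shipping them via the distributed cache.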

Re: tika parser is not parsing the BytesWritable in mapreduce

Posted by Julien Nioche <li...@gmail.com>.
I don't know what the issue is here, but the Tika module in Behemoth is a good
example of how to use Tika over MapReduce:
https://github.com/DigitalPebble/behemoth/tree/master/tika

J.


On 11 June 2014 11:59, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> cross posting to Tika list for help there too.
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: tika parser is not parsing the BytesWritable in mapreduce

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
cross posting to Tika list for help there too.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





