Posted to user@nutch.apache.org by Adriana Farina <ad...@gmail.com> on 2013/05/23 17:14:38 UTC

Nutch 2.1 pdf parsing

Hi,

I'm using Nutch 2.1 in distributed mode on top of Hadoop 1.0.4, with HBase
0.90.4 as the database.

I wrote a Java class from which I run the crawling cycle. The code that
implements the cycle is the following:

    for (int i = 0; i < depth; i++) {
        batchid = generator.generate((Long) args.get(Nutch.ARG_TOPN),
                System.currentTimeMillis(), false, false);
        fetcher.fetch(batchid, 1, false, -1);
        parser.parse(batchid, false, true);
        updater.run(new String[0]);
    }

The problem is that the PDF files are not being parsed: I get no PDF content
in HBase. The strange thing is that one row has the following content:
column=p:parsestat, timestamp=1369316742871,
value=\x04\x90\x03\x02\x96\x01org.apache.nutch.parse.ParseException: Unable
to successfully parse content\x00.

It seems to me that I have configured all the Nutch property files correctly.
Can anybody help me?

Thank you very much.


-- 
Adriana Farina

Re: Nutch 2.1 pdf parsing

Posted by Adriana Farina <ad...@gmail.com>.
Hi Lewis,

thank you very much. I will try your solution.


2013/5/23 Lewis John Mcgibbney <le...@gmail.com>

> Hi Adriana,
> If I were you I would switch your logging to DEBUG for the ParserJob:
>
> - log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout
> + log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout
>
>
> Recompile the code, then look closely at the parse section of the log to
> see which parser is being used and whether any particular issues are
> flagged at runtime.
>
> --
> *Lewis*
>



-- 
Adriana Farina

Re: Nutch 2.1 pdf parsing

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Adriana,
If I were you I would switch your logging to DEBUG for the ParserJob:

- log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout
+ log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout


Recompile the code, then look closely at the parse section of the log to
see which parser is being used and whether any particular issues are
flagged at runtime.
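
As a side note on the p:parsestat value from the original message: the bytes
in front of the exception text look like an Avro binary record, so they can
be decoded by hand. The sketch below assumes (this is not confirmed anywhere
in the thread) that the column holds three fields in this order: an int major
code, an int minor code, and an array of strings. ParseStatDecoder and its
methods are hypothetical helpers written for illustration and are not part of
Nutch. The input string is the shell value with the mail line wrap joined by
a space and the sentence-ending period dropped.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): decodes an HBase shell dump of p:parsestat. */
class ParseStatDecoder {
    private final byte[] buf;
    private int pos;

    ParseStatDecoder(byte[] buf) {
        this.buf = buf;
    }

    /** Turns HBase shell output such as "\x04ab" into raw bytes (ASCII input assumed). */
    static byte[] unescape(String shellValue) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < shellValue.length()) {
            char c = shellValue.charAt(i);
            if (c == '\\' && i + 3 < shellValue.length() && shellValue.charAt(i + 1) == 'x') {
                // "\xNN" stands for one byte written as two hex digits
                out.write(Integer.parseInt(shellValue.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                out.write(c); // printable characters are shown as-is by the shell
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Reads one Avro int/long: a base-128 varint followed by zigzag decoding. */
    long readLong() {
        long raw = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos++] & 0xff;
            raw |= (long) (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Reads an Avro string: a length followed by that many UTF-8 bytes. */
    String readString() {
        int len = (int) readLong();
        String s = new String(buf, pos, len, StandardCharsets.UTF_8);
        pos += len;
        return s;
    }

    /** Reads an Avro array of strings: blocks of (count, items...) ended by a count of 0. */
    List<String> readStringArray() {
        List<String> items = new ArrayList<>();
        for (long count = readLong(); count != 0; count = readLong()) {
            if (count < 0) {
                count = -count;
                readLong(); // a negative count is followed by the block size in bytes
            }
            for (long k = 0; k < count; k++) {
                items.add(readString());
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String dump = "\\x04\\x90\\x03\\x02\\x96\\x01org.apache.nutch.parse.ParseException: "
                + "Unable to successfully parse content\\x00";
        ParseStatDecoder d = new ParseStatDecoder(unescape(dump));
        System.out.println("majorCode = " + d.readLong());   // prints "majorCode = 2"
        System.out.println("minorCode = " + d.readLong());   // prints "minorCode = 200"
        System.out.println("args      = " + d.readStringArray());
    }
}
```

Under that assumed layout the row decodes to major code 2, minor code 200 and
a single argument carrying the exception message. If I am reading Nutch's
ParseStatusCodes correctly, those correspond to FAILED and FAILED_EXCEPTION,
which would mean the parser threw while handling the content. The DEBUG log
is still the place to find out which parser it was and why it failed.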


-- 
*Lewis*