Posted to user@nutch.apache.org by Yicheng Ye <54...@qq.com> on 2014/02/21 02:10:03 UTC
Re: Inconsistencies in use of ParseStatus in 2.x

Hi Lewis,

I started working with Nutch 2.2.1 and Solr 4.6.1 last week. My main
purpose is to use Nutch to crawl PDFs, docs, and other files and index them
with Solr.

When I tested it with some PDF files, I noticed that only the first PDF is
parsed successfully (that is, its text field is not null in the database).
After looking into the code, I found that the value of
ParseStatus.STATUS_SUCCESS is not consistent, which is pretty weird (I
printed out the value in org.apache.nutch.parse.tika.TikaParser around line
183). I believe that ParseStatus.STATUS_SUCCESS is initialized only once, and
I could not find anywhere else that this value is re-assigned. As a temporary
fix, I added a call to setMajorCode(ParseStatusCodes.SUCCESS).
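
To make the workaround concrete, here is a minimal, self-contained sketch. It
does not use the real Nutch classes: Codes and Status are hypothetical
stand-ins for ParseStatusCodes and ParseStatus, the numeric code values are
assumed, and the "drift" of the shared instance is simulated by hand. I have
not confirmed that this is what actually happens inside Nutch. It only shows
why setting the major code explicitly on the status being returned sidesteps
any dependence on a shared constant.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseStatusSketch {

    // Hypothetical stand-in for ParseStatusCodes; the values are assumptions.
    static final class Codes {
        static final int NOTPARSED = 0;
        static final int SUCCESS = 1;
        static final int FAILED = 2;
        static final String[] majorCodes = {"notparsed", "success", "failed"};
    }

    // Hypothetical stand-in for the mutable ParseStatus bean.
    static final class Status {
        private int majorCode = Codes.NOTPARSED;
        int getMajorCode() { return majorCode; }
        void setMajorCode(int value) { majorCode = value; }
    }

    // Shared instance, playing the role of a STATUS_SUCCESS-style constant.
    static final Status SHARED_SUCCESS = newSuccess();

    // The workaround: build a status and set its major code explicitly.
    static Status newSuccess() {
        Status status = new Status();
        status.setMajorCode(Codes.SUCCESS);
        return status;
    }

    // Tallies a status under its major-code name, skipping null statuses,
    // in the same spirit as the counter increment in ParserJob.
    static void count(Map<String, Integer> counters, Status status) {
        if (status != null) {
            counters.merge(Codes.majorCodes[status.getMajorCode()], 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counters = new HashMap<>();

        // Simulated drift: something changes the shared, mutable instance.
        SHARED_SUCCESS.setMajorCode(Codes.NOTPARSED);
        count(counters, SHARED_SUCCESS); // tallied under "notparsed"

        // With the workaround, the status carries the code we set ourselves.
        count(counters, newSuccess());   // tallied under "success"

        System.out.println(counters);
    }
}
```

The design point is simply that a mutable object shared as a constant can be
changed by any code that holds a reference to it, whereas a status whose major
code is set right before it is returned cannot be affected that way.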

I saw you say it is related to NUTCH-1591. Does that mean it will be
fixed in the next release of Nutch (maybe 2.2.2)?

Thank you.
Yicheng


lewis john mcgibbney wrote
> Hi,
> We define the structure of ParseStatus [0] in our WebPage JSON schema [1].
> All good so far.
> What is not good (or not clear to me at least), is how we currently use
> methods within this class to define Hadoop counters for the parsing tasks.
> I parse large numbers of URLs, but the counters on one of my jobs show
> only the following values:
> 
> failed 11
> success 498
> notparsed 252
> 
> I now digress slightly for some more technical stuff/observations. These
> are merely observations of me stepping through the Nutch code in an
> attempt to find out why the numbers are so (embarrassingly/surprisingly)
> low.
> 
> I began at where we actually initiate the counter. This can of course be
> located at line #134 of ParserJob [2], where we do
> 
> 133 if (pstatus != null) {
> 134   context.getCounter("ParserStatus",
> 135       ParseStatusCodes.majorCodes[pstatus.getMajorCode()]).increment(1);
> 136 }
> 
> So I then wondered when ParseStatus.setMajorCode(int value) is actually
> called to assign one of "failed", "success" or "notparsed" respectively.
> It turns out that .setMajorCode(int value) is called in no fewer than two
> places: line #217 of HtmlParser [3]
> 
> 216 ParseStatus status = new ParseStatus();
> 217 status.setMajorCode(ParseStatusCodes.SUCCESS);
> 218 if (metaTags.getRefresh()) {
> 
> and numerous lines within ParseStatusUtils [4].
> 
> It therefore seems that there is a clear inconsistency in how our
> implementation assigns ParseStatusCodes to ParseStatus objects. My hope is
> that this is why the counters are all messed up.
> 
> My suggestion: I believe that implementations should follow the approach
> in HtmlParser, where we access the ParseStatus bean directly. We could
> pass this stuff through ParseStatusUtils, but for me this is unnecessary
> and just adds more confusion.
> 
> I know this is a long post, and I apologize for that, but I would be
> really pleased if others were able to comment.
> I can then work towards a patch for this... if one is required.
> 
> Thanks
> 
> [0]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java?view=markup
> [1]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/gora/webpage.avsc?view=markup
> [2]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?view=markup
> [3]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup
> [4]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseStatusUtils.java?view=markup
> [5]
> 
> -- 
> *Lewis*





--
View this message in context: http://lucene.472066.n3.nabble.com/Inconsistencies-in-use-of-ParseStatus-in-2-x-tp4071797p4118696.html
Sent from the Nutch - User mailing list archive at Nabble.com.