Posted to user@nutch.apache.org by yichengye <54...@qq.com> on 2014/02/20 16:06:08 UTC

parseStatus not updated after parsing some files

Hi, I am new to Nutch and I am using Nutch 2.2.1 + MySQL + Solr 4.6.1 to
crawl files.

I noticed that far fewer files are actually indexed by Solr than are
crawled. I understand that some files are not indexed because they were not
parsed, so the text field in the database is simply null. But many other
entries that were obviously parsed successfully are not indexed either. In
the database, some files that seem to have been parsed correctly show a null
parseStatus, which causes them to be skipped during indexing. You can see
this in the picture below:
<http://lucene.472066.n3.nabble.com/file/n4118570/QQ%E6%88%AA%E5%9B%BE20140220221615.jpg> 
If I run bin/nutch parse -all a few more times, the parse status does get
updated.

After looking into the source code and making the log print more debug
information, I found that the parse status is actually success for the
files whose text field is not null (I printed the parse status in the
public void map(String key, WebPage page, Context context) method of
ParserJob.java, just before context.write(key, page)).

Has anyone else encountered this problem? What would be a possible
solution?

Thank you very much



--
View this message in context: http://lucene.472066.n3.nabble.com/parseStatus-not-updated-after-parsing-some-files-tp4118570.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: parseStatus not updated after parsing some files

Posted by coldboy128 <nh...@gmail.com>.
I had the same problem as you. In ParseUtil.java and ParserJob.java I clone
the ParseStatus object with this code:

    ParseStatus pstatus = null;
    if (page.getParseStatus() != null) {
        pstatus = (ParseStatus) page.getParseStatus().clone();
    }
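
For anyone who wants to try the null-guarded copy idea outside of Nutch,
here is a minimal standalone sketch. Everything in it is an assumption made
for illustration: Status is a hypothetical stand-in, not Nutch's real
ParseStatus class, and copyOf is a made-up helper name. It only shows that
a cloned object keeps its value when the original is changed later, and
that a missing status stays null.

```java
// Illustration only: Status is a hypothetical stand-in, NOT Nutch's ParseStatus.
final class StatusCopySketch {

    /** Minimal mutable status holder used for the demonstration. */
    static final class Status implements Cloneable {
        int majorCode;

        Status(int majorCode) {
            this.majorCode = majorCode;
        }

        @Override
        public Status clone() {
            try {
                return (Status) super.clone();
            } catch (CloneNotSupportedException e) {
                // Cannot happen because Status implements Cloneable.
                throw new AssertionError(e);
            }
        }
    }

    /** Returns an independent copy of the status, or null if none is set. */
    static Status copyOf(Status original) {
        return original == null ? null : original.clone();
    }

    public static void main(String[] args) {
        Status original = new Status(1);
        Status copy = copyOf(original);

        // A later change to the original does not affect the copy.
        original.majorCode = 2;
        System.out.println("original=" + original.majorCode + " copy=" + copy.majorCode);

        // A missing status is passed through as null instead of throwing.
        System.out.println("copy of null=" + copyOf(null));
    }
}
```

Whether this explains the missing parseStatus in Nutch is not established
in this thread; the sketch only demonstrates the copy pattern itself.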




--
View this message in context: http://lucene.472066.n3.nabble.com/parseStatus-not-updated-after-parsing-some-files-tp4118570p4169014.html