Posted to user@nutch.apache.org by al...@aim.com on 2014/06/16 23:18:35 UTC

updatedb deletes all metadata except _csh_

Hello,


I am using nutch-2.x with GORA_97. I noticed that the second updatedb deletes all metadata except _csh_ for pages from the first fetch. Steps to reproduce:
1. inject
2. generate batchId 1
3. fetch batchId 1, which adds some metadata to the mtdt field
4. updatedb batchId 1
5. generate batchId 2
6. fetch batchId 2
7. updatedb batchId 2


Then check whether the metadata for URLs with batchId 1 is still present.


Thanks.
Alex.

Re: updatedb deletes all metadata except _csh_

Posted by Julien Nioche <li...@gmail.com>.
Any Nutch-2 users or committers to help Alex on this one?

Re: updatedb deletes all metadata except _csh_

Posted by alxsss <al...@aim.com>.
Further investigation shows that DbUpdateReducer
calls 
 inlinkedScoreData.clear();

and it calls this function 

  public void readFields(DataInput in) throws IOException {
    // Debug trace.
    System.out.println("readFields in score datum is called");
    score = in.readFloat();
    url = Text.readString(in);
    anchor = Text.readString(in);
    distance = WritableUtils.readVInt(in);
    // Drop any entries already held by this object before reading new ones.
    metaData.clear();

    // Read the entry count, then each key/value pair.
    int size = WritableUtils.readVInt(in);
    for (int i = 0; i < size; i++) {
      String key = Text.readString(in);
      byte[] value = Bytes.readByteArray(in);
      metaData.put(key, value);
    }
  }

of the ScoreDatum class.
The metaData.clear(); line clears all metadata.

Why is the metaData.clear(); line needed in this function?
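
One possible reason, assuming ScoreDatum follows the usual Hadoop Writable convention: Hadoop may reuse a single Writable instance across records, so readFields is expected to overwrite all of the object's state. The sketch below illustrates that with a simplified, hypothetical stand-in class (SimpleDatum and ReadFieldsDemo are not Nutch code, and plain java.io calls replace Text and WritableUtils). It reads two records into one reused object, with and without the clear() call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a Writable such as ScoreDatum.
class SimpleDatum {
  float score;
  final Map<String, String> metaData = new HashMap<>();

  void write(DataOutput out) throws IOException {
    out.writeFloat(score);
    out.writeInt(metaData.size());
    for (Map.Entry<String, String> e : metaData.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF(e.getValue());
    }
  }

  // clearFirst = true mirrors the metaData.clear() call in the quoted readFields.
  void readFields(DataInput in, boolean clearFirst) throws IOException {
    score = in.readFloat();
    if (clearFirst) {
      metaData.clear();
    }
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      String key = in.readUTF();
      String value = in.readUTF();
      metaData.put(key, value);
    }
  }
}

class ReadFieldsDemo {
  // Serializes a record carrying a single metadata entry.
  static byte[] serialize(float score, String key, String value) throws IOException {
    SimpleDatum d = new SimpleDatum();
    d.score = score;
    d.metaData.put(key, value);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    d.write(new DataOutputStream(bytes));
    return bytes.toByteArray();
  }

  // Reads two different records into ONE reused instance and returns
  // how many metadata keys the instance holds after the second read.
  static int keysAfterReuse(boolean clearFirst) {
    try {
      SimpleDatum reused = new SimpleDatum();
      reused.readFields(new DataInputStream(
          new ByteArrayInputStream(serialize(1.0f, "a", "1"))), clearFirst);
      reused.readFields(new DataInputStream(
          new ByteArrayInputStream(serialize(2.0f, "b", "2"))), clearFirst);
      return reused.metaData.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("keys with clear():    " + keysAfterReuse(true));
    System.out.println("keys without clear(): " + keysAfterReuse(false));
  }
}
```

In this sketch, leaving out clear() lets the key from the first record leak into the second one. Note that clear() here only empties the map owned by the datum object itself.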

Thanks.
Alex.



--
View this message in context: http://lucene.472066.n3.nabble.com/updatedb-deletes-all-metadata-except-csh-tp4142158p4142184.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: updatedb deletes all metadata except _csh_

Posted by alxsss <al...@aim.com>.
Hi,

So far, this looks like a bug in updatedb when filtering with batchId.

I could only find one solution: check whether new pages are already in the datastore
and, if they are, skip them.
Alternatively, running updatedb with the -all option also works.
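
The "skip pages that already exist" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the datastore is replaced by an in-memory map from URL to metadata, and SkipExistingDemo and addNewPages are hypothetical names, not part of Nutch or Gora.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SkipExistingDemo {
  // Adds a fresh row for each discovered URL that is not yet stored.
  // Rows that already exist are left untouched, so their metadata survives.
  // The map is a hypothetical stand-in for the real datastore.
  static int addNewPages(Map<String, Map<String, String>> store,
                         List<String> discoveredUrls) {
    int added = 0;
    for (String url : discoveredUrls) {
      if (store.containsKey(url)) {
        continue; // already in the datastore: skip it
      }
      store.put(url, new HashMap<String, String>());
      added++;
    }
    return added;
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> store = new HashMap<>();
    Map<String, String> existing = new HashMap<>();
    existing.put("custom", "x"); // metadata written by an earlier fetch
    store.put("http://example.com/a", existing);

    int added = addNewPages(store,
        Arrays.asList("http://example.com/a", "http://example.com/b"));

    System.out.println("added: " + added);
    System.out.println("metadata of a: " + store.get("http://example.com/a"));
  }
}
```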

Thanks.
Alex.



--
View this message in context: http://lucene.472066.n3.nabble.com/updatedb-deletes-all-metadata-except-csh-tp4142158p4143574.html