Posted to user@nutch.apache.org by al...@aim.com on 2014/06/16 23:18:35 UTC
updatedb deletes all metadata except _csh_
Hello,
I am using nutch-2.x with GORA_97. I noticed that the second updatedb deletes all metadata except _csh_ for pages from the first fetch. The steps to reproduce are as follows:
1. inject
2. generate batchId 1
3. fetch batchId 1, which adds some metadata to the mtdt field
4. updatedb batchId 1
5. generate batchId 2
6. fetch batchId 2
7. updatedb batchId 2
Then check whether the metadata for URLs with batchId 1 is still present.
Thanks.
Alex.
Re: updatedb deletes all metadata except _csh_
Posted by Julien Nioche <li...@gmail.com>.
Any Nutch-2 users or committers to help Alex on this one?
Re: updatedb deletes all metadata except _csh_
Posted by alxsss <al...@aim.com>.
Further investigation shows that DbUpdateReducer calls
    inlinkedScoreData.clear();
and that the following function of the ScoreDatum class is called:
    public void readFields(DataInput in) throws IOException {
      System.out.println("readFields in score datum is called");
      score = in.readFloat();
      url = Text.readString(in);
      anchor = Text.readString(in);
      distance = WritableUtils.readVInt(in);
      metaData.clear();
      int size = WritableUtils.readVInt(in);
      for (int i = 0; i < size; i++) {
        String key = Text.readString(in);
        byte[] value = Bytes.readByteArray(in);
        metaData.put(key, value);
      }
    }
The metaData.clear(); line clears all metadata.
Why is the metaData.clear(); line needed in this function?
Thanks.
Alex.
--
View this message in context: http://lucene.472066.n3.nabble.com/updatedb-deletes-all-metadata-except-csh-tp4142158p4142184.html
Sent from the Nutch - User mailing list archive at Nabble.com.
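On the question of why readFields calls metaData.clear(): Hadoop commonly reuses a single Writable instance across many records, so a readFields implementation is expected to reset any collection fields before filling them. The following is a minimal sketch of that effect, not Nutch code. The Datum class, its String-valued map, the clearOnRead switch, and the helper names are all hypothetical stand-ins, and plain java.io streams replace the Hadoop Text and WritableUtils helpers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class ReuseDemo {
    /** Hypothetical stand-in for ScoreDatum; only the metadata map is modeled. */
    static class Datum {
        final Map<String, String> metaData = new HashMap<>();
        final boolean clearOnRead; // assumption: a switch added only for this demo

        Datum(boolean clearOnRead) {
            this.clearOnRead = clearOnRead;
        }

        void write(DataOutput out) throws IOException {
            out.writeInt(metaData.size());
            for (Map.Entry<String, String> e : metaData.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }

        void readFields(DataInput in) throws IOException {
            if (clearOnRead) {
                metaData.clear(); // reset state left over from the previous record
            }
            int size = in.readInt();
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                metaData.put(key, value);
            }
        }
    }

    /** Serializes a datum holding exactly one metadata entry. */
    static byte[] serialize(String key, String value) {
        Datum d = new Datum(true);
        d.metaData.put(key, value);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            d.write(new DataOutputStream(buf));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads two different records into the SAME object, as a framework reusing instances would. */
    static Map<String, String> readTwoRecords(boolean clearOnRead) {
        Datum reused = new Datum(clearOnRead);
        try {
            reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("a", "1"))));
            reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("b", "2"))));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reused.metaData;
    }

    public static void main(String[] args) {
        System.out.println("with clear:    " + readTwoRecords(true).keySet());
        System.out.println("without clear: " + readTwoRecords(false).keySet());
    }
}
```

With the clear() call, the reused object ends up holding only the second record's key. Without it, the first record's key leaks into the second. This suggests the call is a reset of an in-memory object rather than a deletion of stored page metadata, although the sketch does not by itself show where the stored metadata is lost.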
Re: updatedb deletes all metadata except _csh_
Posted by alxsss <al...@aim.com>.
Hi,
So far, this looks like a bug in updatedb when filtering by batchId.
I could find only one solution: check whether new pages are already in the
datastore and, if they are, skip them.
Alternatively, running updatedb with the -all option also works.
Thanks.
Alex.
--
View this message in context: http://lucene.472066.n3.nabble.com/updatedb-deletes-all-metadata-except-csh-tp4142158p4143574.html
Sent from the Nutch - User mailing list archive at Nabble.com.
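The workaround described in the last message (skip newly discovered pages that are already in the datastore) can be illustrated with a small sketch. This is not the actual Nutch or Gora code: the datastore is modeled as a plain in-memory map from URL to metadata, and the class name, the addNewPages helper, and the example URLs are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SkipExistingDemo {
    /**
     * Returns a copy of the store with a fresh, empty row added for each
     * discovered URL that is not stored yet. URLs that already have a row
     * are skipped, so their existing metadata is left untouched.
     */
    static Map<String, Map<String, String>> addNewPages(
            Map<String, Map<String, String>> store,
            Iterable<String> discoveredUrls) {
        Map<String, Map<String, String>> result = new HashMap<>(store);
        for (String url : discoveredUrls) {
            if (result.containsKey(url)) {
                continue; // already in the datastore: skip instead of overwriting
            }
            result.put(url, new HashMap<>());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> store = new HashMap<>();
        Map<String, String> meta = new HashMap<>();
        meta.put("custom", "value"); // metadata written during the first fetch
        store.put("http://example.com/a", meta);

        Map<String, Map<String, String>> updated = addNewPages(
                store, Arrays.asList("http://example.com/a", "http://example.com/b"));
        System.out.println(updated);
    }
}
```

In this model the row for the already-stored URL keeps its metadata, while the genuinely new URL gets an empty row. Whether the same check can be done cheaply inside updatedb depends on the storage backend, which the sketch does not address.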