Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/01/06 20:27:19 UTC
Re: Class Cast exception
Matt Zytaruk wrote:
> The newest src (as of this morning) of trunk is occasionally giving
> ClassCastExceptions when doing a crawl, with parsing (and by
> occasionally I mean this was the only page out of the small list I
> crawled that it happened on). This is with nothing changed from
> the defaults and on a server running SUSE Linux. Here is a sample of
> the logging:
>
> 060106 111516 Parsing [http://easily.co.uk/] with
> [org.apache.nutch.parse.html.HtmlParser@e1899b]
> 060106 111516 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060106 111516 fetch of http://easily.co.uk/ failed with:
> java.lang.ClassCastException: java.util.ArrayList
>
> -Matt Zytaruk
Could you please add a call to printStackTrace() in that catch{}
statement, so that we know where the exception is thrown?
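
A minimal sketch of why the stack trace matters, using a hypothetical
StackTraceDemo class rather than the actual Nutch fetcher code. The
firstValue helper and the "Set-Cookie" key are assumptions made for
illustration only. The one-line message names the exception type, while
printStackTrace() also shows the class, method and line that threw it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

class StackTraceDemo {
    // Hypothetical stand-in for a step that assumes every value is a String.
    static String firstValue(Map<String, Object> metadata, String key) {
        return (String) metadata.get(key); // throws if the value is a list
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        ArrayList<String> cookies = new ArrayList<>();
        cookies.add("a=1");
        cookies.add("b=2");
        metadata.put("Set-Cookie", cookies);
        try {
            firstValue(metadata, "Set-Cookie");
        } catch (ClassCastException e) {
            // Logging only the exception hides where it came from...
            System.out.println("fetch failed with: " + e);
            // ...whereas the stack trace names the failing method and line.
            e.printStackTrace();
        }
    }
}
```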
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Class Cast exception
Posted by Andrzej Bialecki <ab...@getopt.org>.
Matt Zytaruk wrote:
> So will this throw an exception on older segments? Or will it just not
> get the correct metadata? I have a lot of older segments I still need
> to use.
> Thanks for your help.
The patch that I sent in my previous email handles both versions, so you
will be able to use your old segments.
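
As a rough illustration of the general idea (branching on a stored
version byte so that both the old and the new layout stay readable),
here is a self-contained sketch. The class VersionedReadDemo, its byte
layouts and its helper names are assumptions made for this example;
they are not the Nutch segment format or the ContentProperties format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VersionedReadDemo {
    static final byte CURRENT_VERSION = 3;

    // Old layout (assumed): version byte, entry count, key/value string pairs.
    static byte[] writeV2(Map<String, String> metadata) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(2);
        out.writeInt(metadata.size());
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.flush();
        return bytes.toByteArray();
    }

    // New layout (assumed): each key is followed by a value count and the values.
    static byte[] writeV3(Map<String, List<String>> metadata) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(CURRENT_VERSION);
        out.writeInt(metadata.size());
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().size());
            for (String v : e.getValue()) {
                out.writeUTF(v);
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reads either layout by checking the version byte first.
    static Map<String, List<String>> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte version = in.readByte();
        if (version > CURRENT_VERSION) {
            throw new IOException("unsupported version " + version);
        }
        int entries = in.readInt();
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (int i = 0; i < entries; i++) {
            String key = in.readUTF();
            // Old records hold exactly one value per key; new ones store a count.
            int valueCount = (version < 3) ? 1 : in.readInt();
            List<String> values = new ArrayList<>();
            for (int j = 0; j < valueCount; j++) {
                values.add(in.readUTF());
            }
            metadata.put(key, values);
        }
        return metadata;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> old = new LinkedHashMap<>();
        old.put("Content-Type", "text/html");
        System.out.println(read(writeV2(old)));

        Map<String, List<String>> current = new LinkedHashMap<>();
        current.put("Set-Cookie", Arrays.asList("a=1", "b=2"));
        System.out.println(read(writeV3(current)));
    }
}
```

The reader normalizes both layouts to a list of values per key, so code
downstream of read() does not need to know which version was on disk.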
Re: Class Cast exception
Posted by Matt Zytaruk <ma...@wavefire.com>.
So will this throw an exception on older segments? Or will it just not
get the correct metadata? I have a lot of older segments I still need to
use.
Thanks for your help.
-Matt Zytaruk
Re: Class Cast exception
Posted by Matt Zytaruk <ma...@wavefire.com>.
Worked perfectly. Thanks
-Matt Zytaruk
Andrzej Bialecki wrote:
> Hi,
>
> I attached the patch. Please test.
>
>------------------------------------------------------------------------
>
>Index: ParseData.java
>===================================================================
>--- ParseData.java (revision 365563)
>+++ ParseData.java (working copy)
>@@ -31,7 +31,7 @@
> public final class ParseData extends VersionedWritable {
> public static final String DIR_NAME = "parse_data";
>
>- private final static byte VERSION = 2;
>+ private final static byte VERSION = 3;
>
> private String title;
> private Outlink[] outlinks;
>@@ -96,10 +96,15 @@
> Outlink.skip(in);
> }
>
>- int propertyCount = in.readInt(); // read metadata
>- metadata = new ContentProperties();
>- for (int i = 0; i < propertyCount; i++) {
>- metadata.put(UTF8.readString(in), UTF8.readString(in));
>+ if (version < 3) {
>+ int propertyCount = in.readInt(); // read metadata
>+ metadata = new ContentProperties();
>+ for (int i = 0; i < propertyCount; i++) {
>+ metadata.put(UTF8.readString(in), UTF8.readString(in));
>+ }
>+ } else {
>+ metadata = new ContentProperties();
>+ metadata.readFields(in);
> }
>
> }
>@@ -113,14 +118,7 @@
> for (int i = 0; i < outlinks.length; i++) {
> outlinks[i].write(out);
> }
>-
>- out.writeInt(metadata.size()); // write metadata
>- Iterator i = metadata.entrySet().iterator();
>- while (i.hasNext()) {
>- Map.Entry e = (Map.Entry)i.next();
>- UTF8.writeString(out, (String)e.getKey());
>- UTF8.writeString(out, (String)e.getValue());
>- }
>+ metadata.write(out);
> }
>
> public static ParseData read(DataInput in) throws IOException {
>
>
Re: Class Cast exception
Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi,
I attached the patch. Please test.
Re: Class Cast exception
Posted by Andrzej Bialecki <ab...@getopt.org>.
Matt Zytaruk wrote:
> Here you go.
>
> java.lang.ClassCastException: java.util.ArrayList
> at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
> at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
> at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
> at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
> at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)

Congratulations! You are the first person to actually use (and suffer
from) the multiple values in ContentProperties... ;-)

It turns out that ParseData.write() uses its own method for writing out
metadata, instead of using ContentProperties.write(). It works well if
you only have single values (then they are stored as Strings), but if
there are multiple values they are stored in ArrayLists, which ParseData
accesses directly by virtue of using metadata.entrySet().iterator().

The fix is easy: please replace the following lines in ParseData.write():

    out.writeInt(metadata.size()); // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
      Map.Entry e = (Map.Entry)i.next();
      UTF8.writeString(out, (String)e.getKey());
      UTF8.writeString(out, (String)e.getValue());
    }

with this:

    metadata.write(out);

and the same for reading the metadata field; replace in
ParseData.readFields() this:

    int propertyCount = in.readInt(); // read metadata
    metadata = new ContentProperties();
    for (int i = 0; i < propertyCount; i++) {
      metadata.put(UTF8.readString(in), UTF8.readString(in));
    }

with this:

    metadata = new ContentProperties();
    metadata.readFields(in);

Compile, deploy, test, report ... :-) Please note that this changes the
on-disk segment format, so you won't be able to read the old segments
with the new code. You may want to bump the ParseData.VERSION, and leave
this code to handle older versions...
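
To make the failure mode concrete, here is a small self-contained
sketch. It does not use the Nutch classes: a plain Map<String, Object>
stands in for ContentProperties (an assumption, based only on the
description above that single values are Strings and multiple values
are ArrayLists), and the byte layout written by writeAll is likewise
invented for the example. The naive writer casts every value to String
and fails on the list, while the value-aware writer handles both cases.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MultiValueWriteDemo {
    // Mirrors the old approach: casts every value to String.
    static void writeNaive(Map<String, Object> metadata, DataOutputStream out)
            throws IOException {
        out.writeInt(metadata.size());
        for (Map.Entry<String, Object> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF((String) e.getValue()); // fails when the value is a List
        }
    }

    // Value-aware alternative (assumed layout): key, value count, each value.
    static void writeAll(Map<String, Object> metadata, DataOutputStream out)
            throws IOException {
        out.writeInt(metadata.size());
        for (Map.Entry<String, Object> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            Object value = e.getValue();
            List<?> values = (value instanceof List)
                    ? (List<?>) value
                    : Collections.singletonList(value);
            out.writeInt(values.size());
            for (Object v : values) {
                out.writeUTF(String.valueOf(v));
            }
        }
    }

    // Returns true if the naive writer hits a ClassCastException on this map.
    static boolean naiveWriteFails(Map<String, Object> metadata) throws IOException {
        try {
            writeNaive(metadata, new DataOutputStream(new ByteArrayOutputStream()));
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");
        metadata.put("Set-Cookie", new ArrayList<>(Arrays.asList("a=1", "b=2")));

        System.out.println("naive writer fails: " + naiveWriteFails(metadata));

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeAll(metadata, new DataOutputStream(bytes));
        System.out.println("value-aware writer wrote " + bytes.size() + " bytes");
    }
}
```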
Re: Class Cast exception
Posted by Matt Zytaruk <ma...@wavefire.com>.
Here you go.

java.lang.ClassCastException: java.util.ArrayList
at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)