Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/01/06 20:27:19 UTC

Re: Class Cast exception

Matt Zytaruk wrote:

> The newest src (as of this morning) of trunk is occasionally giving 
> ClassCastExceptions when doing a crawl, with parsing (and by 
> occasionally I mean this was the only page out of the small list I 
> crawled that it happened on). This is with nothing changed from 
> the defaults, on a server running SUSE Linux. Here is a sample of 
> the logging:
>
> 060106 111516 Parsing [http://easily.co.uk/] with 
> [org.apache.nutch.parse.html.HtmlParser@e1899b]
> 060106 111516 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060106 111516 fetch of http://easily.co.uk/ failed with: 
> java.lang.ClassCastException: java.util.ArrayList
>
> -Matt Zytaruk


Could you please add a call to printStackTrace() in that catch{} 
statement, so that we know where the exception is thrown?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Class Cast exception

Posted by Andrzej Bialecki <ab...@getopt.org>.
Matt Zytaruk wrote:

> So will this throw an exception on older segments? or will it just not 
> get the correct metadata? I have a lot of older segments I still need 
> to use.
> Thanks for your help.


The patch that I sent in my previous email handles both versions, so you 
will be able to use your old segments.

-- 
Best regards,
Andrzej Bialecki     <><



Re: Class Cast exception

Posted by Matt Zytaruk <ma...@wavefire.com>.
So will this throw an exception on older segments? or will it just not 
get the correct metadata? I have a lot of older segments I still need to 
use.
Thanks for your help.

-Matt Zytaruk

Andrzej Bialecki wrote:

> Matt Zytaruk wrote:
>
>> Here you go.
>>
>> java.lang.ClassCastException: java.util.ArrayList
>>        at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
>>        at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
>>        at 
>> org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
>>        at 
>> org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
>>        at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
>>        at 
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
>>        at 
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)
>
>
>
> Congratulations! You are the first person to actually use (and suffer 
> from) the multiple values in ContentProperties... ;-)
>
> It turns out that ParseData.write() uses its own method for writing 
> out metadata, instead of using ContentProperties.write(). It works 
> well if you only have single values (then they are stored as Strings), 
> but if there are multiple values they are stored in ArrayLists, which 
> ParseData accesses directly by virtue of using 
> metadata.entrySet().iterator().
>
> The fix is easy: please replace the following lines in ParseData.write():
>
>    out.writeInt(metadata.size());                // write metadata
>    Iterator i = metadata.entrySet().iterator();
>    while (i.hasNext()) {
>      Map.Entry e = (Map.Entry)i.next();
>      UTF8.writeString(out, (String)e.getKey());
>      UTF8.writeString(out, (String)e.getValue());
>    }
>
> with this:
>
>    metadata.write(out);
>
> and the same for reading the metadata field; in 
> ParseData.readFields(), replace this:
>
>    int propertyCount = in.readInt();             // read metadata
>    metadata = new ContentProperties();
>    for (int i = 0; i < propertyCount; i++) {
>      metadata.put(UTF8.readString(in), UTF8.readString(in));
>    }
>
> with this:
>
>    metadata = new ContentProperties();
>    metadata.readFields(in);
>
> Compile, deploy, test, report ... :-) Please note that this changes 
> the on-disk segment format, so you won't be able to read the old 
> segments with the new code. You may want to bump the 
> ParseData.VERSION, and leave this code to handle older versions...
>


Re: Class Cast exception

Posted by Matt Zytaruk <ma...@wavefire.com>.
Worked perfectly. Thanks

-Matt Zytaruk

Andrzej Bialecki wrote:

> Hi,
>
> I attached the patch. Please test.
>
>------------------------------------------------------------------------
>
>Index: ParseData.java
>===================================================================
>--- ParseData.java	(revision 365563)
>+++ ParseData.java	(working copy)
>@@ -31,7 +31,7 @@
> public final class ParseData extends VersionedWritable {
>   public static final String DIR_NAME = "parse_data";
> 
>-  private final static byte VERSION = 2;
>+  private final static byte VERSION = 3;
> 
>   private String title;
>   private Outlink[] outlinks;
>@@ -96,10 +96,15 @@
>       Outlink.skip(in);
>     }
>     
>-    int propertyCount = in.readInt();             // read metadata
>-    metadata = new ContentProperties();
>-    for (int i = 0; i < propertyCount; i++) {
>-      metadata.put(UTF8.readString(in), UTF8.readString(in));
>+    if (version < 3) {
>+      int propertyCount = in.readInt();             // read metadata
>+      metadata = new ContentProperties();
>+      for (int i = 0; i < propertyCount; i++) {
>+        metadata.put(UTF8.readString(in), UTF8.readString(in));
>+      }
>+    } else {
>+      metadata = new ContentProperties();
>+      metadata.readFields(in);
>     }
>     
>   }
>@@ -113,14 +118,7 @@
>     for (int i = 0; i < outlinks.length; i++) {
>       outlinks[i].write(out);
>     }
>-
>-    out.writeInt(metadata.size());                // write metadata
>-    Iterator i = metadata.entrySet().iterator();
>-    while (i.hasNext()) {
>-      Map.Entry e = (Map.Entry)i.next();
>-      UTF8.writeString(out, (String)e.getKey());
>-      UTF8.writeString(out, (String)e.getValue());
>-    }
>+    metadata.write(out);
>   }
> 
>   public static ParseData read(DataInput in) throws IOException {
>  
>


Re: Class Cast exception

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi,

I attached the patch. Please test.

-- 
Best regards,
Andrzej Bialecki     <><



Re: Class Cast exception

Posted by Andrzej Bialecki <ab...@getopt.org>.
Matt Zytaruk wrote:

> Here you go.
>
> java.lang.ClassCastException: java.util.ArrayList
>        at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
>        at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
>        at 
> org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
>        at 
> org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
>        at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)


Congratulations! You are the first person to actually use (and suffer 
from) the multiple values in ContentProperties... ;-)

It turns out that ParseData.write() uses its own method for writing out 
metadata, instead of using ContentProperties.write(). It works well if 
you only have single values (then they are stored as Strings), but if 
there are multiple values they are stored in ArrayLists, which ParseData 
accesses directly by virtue of using metadata.entrySet().iterator(), so 
the (String) cast on e.getValue() fails.
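
To illustrate the failure mode described above, here is a minimal, 
self-contained sketch. It does not use the real ContentProperties class; 
the MultiValueCastDemo class and its add() helper are hypothetical 
stand-ins that only assume the storage behaviour described in this 
message (a single value kept as a String, multiple values kept in an 
ArrayList).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiValueCastDemo {

    // Hypothetical stand-in for the storage rule: the first value for a key is
    // kept as a plain String; a second value turns the entry into an ArrayList.
    static void add(Map<String, Object> props, String key, String value) {
        Object existing = props.get(key);
        if (existing == null) {
            props.put(key, value);
        } else if (existing instanceof String) {
            List<String> values = new ArrayList<>();
            values.add((String) existing);
            values.add(value);
            props.put(key, values);
        } else {
            @SuppressWarnings("unchecked")
            List<String> values = (List<String>) existing;
            values.add(value);
        }
    }

    // Mirrors the failing pattern: walk the raw entries and cast every value
    // to String. Returns false if any cast throws ClassCastException.
    static boolean castAllValuesToString(Map<String, Object> props) {
        try {
            for (Map.Entry<String, Object> e : props.entrySet()) {
                String value = (String) e.getValue();
            }
            return true;
        } catch (ClassCastException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        add(props, "Content-Type", "text/html");
        System.out.println("single values only: " + castAllValuesToString(props));

        // A repeated key is what triggers the ArrayList storage.
        add(props, "Set-Cookie", "a=1");
        add(props, "Set-Cookie", "b=2");
        System.out.println("with a repeated key: " + castAllValuesToString(props));
    }
}
```

This would also explain why only an occasional page fails: under these 
assumptions, the cast only breaks for pages whose metadata contains a 
key with more than one value.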

The fix is easy: please replace the following lines in ParseData.write():

    out.writeInt(metadata.size());                // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
      Map.Entry e = (Map.Entry)i.next();
      UTF8.writeString(out, (String)e.getKey());
      UTF8.writeString(out, (String)e.getValue());
    }

with this:

    metadata.write(out);

and the same for reading the metadata field; in 
ParseData.readFields(), replace this:

    int propertyCount = in.readInt();             // read metadata
    metadata = new ContentProperties();
    for (int i = 0; i < propertyCount; i++) {
      metadata.put(UTF8.readString(in), UTF8.readString(in));
    }

with this:

    metadata = new ContentProperties();
    metadata.readFields(in);
   
Compile, deploy, test, report ... :-) Please note that this changes the 
on-disk segment format, so you won't be able to read the old segments 
with the new code. You may want to bump the ParseData.VERSION, and leave 
this code to handle older versions...
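
The idea of bumping the version and keeping the old reading code can be 
sketched roughly as follows. This is only an illustration using plain 
java.io streams: the VersionedMetadataDemo class, its constants, and the 
multi-value byte layout are assumptions made for the example, not the 
actual ParseData or ContentProperties on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VersionedMetadataDemo {
    static final byte SINGLE_VALUE_VERSION = 2; // old layout: one String per key
    static final byte MULTI_VALUE_VERSION = 3;  // assumed new layout: a list per key

    // Writes the old layout: version byte, key count, then key/value pairs.
    static byte[] writeOld(Map<String, String> metadata) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(SINGLE_VALUE_VERSION);
        out.writeInt(metadata.size());
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Writes the assumed new layout: each key is followed by a value count
    // and then that many values.
    static byte[] writeNew(Map<String, List<String>> metadata) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(MULTI_VALUE_VERSION);
        out.writeInt(metadata.size());
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().size());
            for (String value : e.getValue()) {
                out.writeUTF(value);
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reads either layout by branching on the stored version byte, so data
    // written before the format change stays readable.
    static Map<String, List<String>> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte version = in.readByte();
        int keyCount = in.readInt();
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (int i = 0; i < keyCount; i++) {
            String key = in.readUTF();
            // Old records carry exactly one value and no explicit count.
            int valueCount = version < MULTI_VALUE_VERSION ? 1 : in.readInt();
            List<String> values = new ArrayList<>();
            for (int j = 0; j < valueCount; j++) {
                values.add(in.readUTF());
            }
            metadata.put(key, values);
        }
        return metadata;
    }
}
```

The writer always emits the newest version, while the reader keeps one 
branch per version it still needs to understand.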

-- 
Best regards,
Andrzej Bialecki     <><



Re: Class Cast exception

Posted by Matt Zytaruk <ma...@wavefire.com>.
Here you go.

java.lang.ClassCastException: java.util.ArrayList
        at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
        at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
        at 
org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
        at 
org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
        at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
        at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
        at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)

Andrzej Bialecki wrote:

> Matt Zytaruk wrote:
>
>> The newest src (as of this morning) of trunk is occasionally giving 
>> ClassCastExceptions when doing a crawl, with parsing (and by 
>> occasionally I mean this was the only page out of the small list I 
>> crawled that it happened on). This is with nothing changed from 
>> the defaults, on a server running SUSE Linux. Here is a sample of 
>> the logging:
>>
>> 060106 111516 Parsing [http://easily.co.uk/] with 
>> [org.apache.nutch.parse.html.HtmlParser@e1899b]
>> 060106 111516 Using Signature impl: org.apache.nutch.crawl.MD5Signature
>> 060106 111516 fetch of http://easily.co.uk/ failed with: 
>> java.lang.ClassCastException: java.util.ArrayList
>>
>> -Matt Zytaruk
>
>
>
> Could you please add a call to printStackTrace() in that catch{} 
> statement, so that we know where the exception is thrown?
>