Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/29 01:56:05 UTC

Mega-cleanup in trunk/

Hi,

I just committed a large patch to clean up trunk/, removing obsolete and 
broken classes remaining from the 0.7.x development line. Please test 
that things still work as they should ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Mega-cleanup in trunk/

Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:

> Andrzej Bialecki wrote:
>
>> Hi,
>>
>> I just committed a large patch to clean up trunk/, removing obsolete and 
>> broken classes remaining from the 0.7.x development line. Please test 
>> that things still work as they should ...
>>
> Hi,
> I am not sure what is wrong, but a lot of the JUnit tests simply do not 
> compile - I did an svn checkout to a new directory to be sure I do not 
> have anything left from my experiments.


Yes, you are right - I would welcome any help, I'm a bit tight on time...

>
> I am looking at it right now, but I would suggest temporarily doing a 
> quick cleanup to make trunk testable:
>

Agreed.


> 3) Remove unused import in:
>         src/test/org/apache/nutch/parse/TestParseText.java


Ok.

> 4) Fix (as it looks simple to fix - I will look at it in the meantime):
>
> src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java
> src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
> src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java
> src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java
> src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
> src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java
> src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java


Yes, they are just one-line fixes. I removed the 
getProtocolContent(urlString) methods, you need to replace them with 
getProtocolContent(new UTF8(urlString), new CrawlDatum()).

>
> After removing all these non-compiling classes, the trunk tests 
> complete successfully on my machine (JDK 1.4.2).
>
> If there are no objections, especially from Andrzej, I can do 
> the cleanup tomorrow.


Your help would be most welcome, no objections here.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Mega-cleanup in trunk/

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Andrzej Bialecki wrote:
> Hi,
> 
> I just committed a large patch to clean up trunk/, removing obsolete and 
> broken classes remaining from the 0.7.x development line. Please test 
> that things still work as they should ...
> 
Hi,
I am not sure what is wrong, but a lot of the JUnit tests simply do not 
compile - I did an svn checkout to a new directory to be sure I do not 
have anything left from my experiments.

I am looking at it right now, but I would suggest temporarily doing a 
quick cleanup to make trunk testable:

1) Remove permanently, as the classes under test were removed from trunk:
	src/test/org/apache/nutch/pagedb/TestFetchListEntry.java
	src/test/org/apache/nutch/pagedb/TestPage.java
	src/test/org/apache/nutch/db/TestWebDB.java
	src/test/org/apache/nutch/db/DBTester.java
	src/test/org/apache/nutch/tools/TestSegmentMergeTool.java
2) Remove temporarily and create a JIRA issue to fix them:
	src/test/org/apache/nutch/fetcher/TestFetcher.java
	src/test/org/apache/nutch/fetcher/TestFetcherOutput.java
3) Remove unused import in:
	src/test/org/apache/nutch/parse/TestParseText.java
4) Fix (as it looks simple to fix - I will look at it in the meantime):

src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java
src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java
src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java
src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java

After removing all these non-compiling classes, the trunk tests complete 
successfully on my machine (JDK 1.4.2).

If there are no objections, especially from Andrzej, I can do the 
cleanup tomorrow.
P.



Re: java.io.IOException: Job failed

Posted by Doug Cutting <cu...@nutch.org>.
Gal Nitzan wrote:
> I am using trunk. While trying to crawl I get the following:

[ ...]

> 050825 100222 task_m_ns3ehv  Error running child
> 050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
> 050825 100222 task_m_ns3ehv     at org.apache.nutch.indexer.DeleteDuplicates$1.getPos(DeleteDuplicates.java:193)

I just fixed this.

Doug

java.io.IOException: Job failed

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

I am using trunk. While trying to crawl I get the following:



in crawl log:

051229 235114 Dedup: adding indexes in: crawl/indexes
051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/crawl-tool.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-site.xml
051229 235115 Running job: job_r1bmnj
051229 235116  map 0%
051229 235138  reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:309)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:123)



in tasktracker log:

050825 100222 task_m_ns3ehv  Error running child
050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
050825 100222 task_m_ns3ehv     at org.apache.nutch.indexer.DeleteDuplicates$1.getPos(DeleteDuplicates.java:193)
050825 100222 task_m_ns3ehv     at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:102)
050825 100222 task_m_ns3ehv     at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
050825 100222 task_m_ns3ehv     at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
050825 100222 task_m_ns3ehv     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:604)

Regards,

Gal



Re: Trunk is broken

Posted by Andrzej Bialecki <ab...@getopt.org>.
Thomas Jaeger wrote:

>Hi Andrzej,
>
>Gal Nitzan wrote:
>
>>>It seems that Trunk is now broken...
>
>DmozParser seems to be broken, too. Its package declaration is still
>org.apache.nutch.crawl instead of org.apache.nutch.tools.

Fixed. Thanks!

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Trunk is broken

Posted by Thomas Jaeger <nu...@thjaeger.org>.
Hi Andrzej,

Gal Nitzan wrote:

>> It seems that Trunk is now broken...
>>


DmozParser seems to be broken, too. Its package declaration is still
org.apache.nutch.crawl instead of org.apache.nutch.tools.


TJ


Re: Trunk is broken

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:

>It seems that Trunk is now broken...
>
>In Crawl.java line 111 the parameter for parsing is missing.
>
>For myself, I have added the line:
>	
>	boolean parsing = conf.getBoolean("fetcher.parse", true);
>
>and added the param parsing to 
>	new Fetcher(conf).fetch(segment, threads, parsing);  // fetch it
>
>Also, the Javadoc build has a million errors.
>  
>

Fixed. Thanks for spotting this!

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Trunk is broken

Posted by Gal Nitzan <gn...@usa.net>.
It seems that Trunk is now broken...

In Crawl.java line 111 the parameter for parsing is missing.

For myself, I have added the line:
	
	boolean parsing = conf.getBoolean("fetcher.parse", true);

and added the param parsing to 
	new Fetcher(conf).fetch(segment, threads, parsing);  // fetch it

Also, the Javadoc build has a million errors.

Gal



Re: Bug in DeleteDuplicates.java ?

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
> 
>> This function throws IOException. Why?
>>
>>         public long getPos() throws IOException {
>>           return (doc*INDEX_LENGTH)/maxDoc;
>>         }
>>
>> It should be throwing ArithmeticException.
>>  
>>
> 
> The IOException is required by the API of RecordReader.
> 
>> What happens when maxDoc is zero?
>>  
>>
> 
> Ka-boom! ;-) You're right, this should be wrapped in an IOException and 
> rethrown.

No, it should really just be fixed to not cause an ArithmeticException. 
This is called to report progress.  In this case the input "file" for 
the map is a Lucene index whose documents we iterate through.  To 
simplify the construction of input splits (without opening each index) a 
constant "length" is used for each "file".  So we have to scale the 
document numbers to give progress in this range.

The problem is that progress may be reported even when there are no 
documents in the index.  So the call is valid and no exception should be 
thrown.

Doug
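
To illustrate the point above, here is a minimal sketch of a guarded position calculation. It is not the change that was committed: the class name IndexProgress, the value chosen for INDEX_LENGTH, and the decision to report position 0 for an empty index are all assumptions made for this example.

```java
// Illustrative sketch only. IndexProgress is a hypothetical class, and the
// value of INDEX_LENGTH is an assumption; the thread only says that a
// constant "length" is used for each index "file".
final class IndexProgress {
    static final long INDEX_LENGTH = Integer.MAX_VALUE;

    /** Scales a document number into the range [0, INDEX_LENGTH]. */
    static long getPos(int doc, int maxDoc) {
        if (maxDoc <= 0) {
            // Empty index: there is nothing to iterate over, so report
            // position 0 instead of dividing by zero (assumed behaviour).
            return 0L;
        }
        // int * long is evaluated as long, so the product cannot overflow
        // for any int doc and this INDEX_LENGTH.
        return (doc * INDEX_LENGTH) / maxDoc;
    }

    public static void main(String[] args) {
        System.out.println(getPos(0, 0));   // empty index
        System.out.println(getPos(5, 10));  // halfway through
        System.out.println(getPos(10, 10)); // finished
    }
}
```

With a guard like this, a progress report on an empty index is a valid call that returns a value, so no exception has to be declared, wrapped, or rethrown.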

Re: Bug in DeleteDuplicates.java ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:

>This function throws IOException. Why?
>
>         public long getPos() throws IOException {
>           return (doc*INDEX_LENGTH)/maxDoc;
>         }
>
>It should be throwing ArithmeticException.
>
>  
>

The IOException is required by the API of RecordReader.

>What happens when maxDoc is zero?
>  
>

Ka-boom! ;-) You're right, this should be wrapped in an IOException and 
rethrown.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Bug in DeleteDuplicates.java ?

Posted by Gal Nitzan <gn...@usa.net>.
This function throws IOException. Why?

         public long getPos() throws IOException {
           return (doc*INDEX_LENGTH)/maxDoc;
         }

It should be throwing ArithmeticException.

What happens when maxDoc is zero?


Gal
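
For reference on the two questions above: in Java, ArithmeticException is an unchecked exception (a subclass of RuntimeException), so a method can throw it without listing it in a throws clause, and integer division by zero raises it at run time. The sketch below shows that behaviour in isolation. The class DivideByZeroDemo and its parameterised scaled() method are stand-ins invented for this example; they only mirror the shape of the quoted getPos().

```java
// Illustrative sketch only. DivideByZeroDemo is a hypothetical class; the
// fields used by the quoted getPos() are turned into parameters here.
final class DivideByZeroDemo {
    // Same arithmetic shape as the quoted method. No throws clause is needed
    // because ArithmeticException is unchecked.
    static long scaled(long doc, long indexLength, long maxDoc) {
        return (doc * indexLength) / maxDoc;
    }

    // Returns the exception message if the division fails, or null otherwise.
    static String failureMessage(long doc, long indexLength, long maxDoc) {
        try {
            scaled(doc, indexLength, maxDoc);
            return null;
        } catch (ArithmeticException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        System.out.println(failureMessage(1, 100, 4)); // normal case
        System.out.println(failureMessage(0, 100, 0)); // maxDoc is zero
    }
}
```

So when maxDoc is zero the call fails with an ArithmeticException even though the signature only declares IOException.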





Re: Mega-cleanup in trunk/

Posted by Byron Miller <by...@yahoo.com>.
I'll pull a build down tonight and let you know how it
goes!

-byron

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Hi,
> 
> I just committed a large patch to clean up trunk/,
> removing obsolete and broken classes remaining from
> the 0.7.x development line. Please test that things
> still work as they should ...
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com