Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/29 01:56:05 UTC
Mega-cleanup in trunk/
Hi,
I just committed a large patch to clean up the trunk/ of obsolete and
broken classes remaining from the 0.7.x development line. Please test
that things still work as they should ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Mega-cleanup in trunk/
Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:
> Andrzej Bialecki wrote:
>
>> Hi,
>>
>> I just committed a large patch to clean up the trunk/ of obsolete and
>> broken classes remaining from the 0.7.x development line. Please test
>> that things still work as they should ...
>>
> Hi,
> I am not sure what is wrong, but a lot of JUnit tests simply do not
> compile. I did an svn checkout into a new directory to be sure I do not
> have anything left over from my experiments.
Yes, you are right - I would welcome any help, I'm a bit tight on time...
>
> I am looking at it right now, but I would suggest temporarily doing a
> quick cleanup to make trunk testable:
>
Agreed.
> 3) Remove unused import in:
> src/test/org/apache/nutch/parse/TestParseText.java
Ok.
> 4) Fix (it looks simple to fix; I will look at it in the meantime):
>
> src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java
>
> src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
>
> src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java
>
> src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java
>
> src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
>
> src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java
>
> src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java
>
Yes, they are just one-line fixes. I removed the
getProtocolContent(urlString) methods, you need to replace them with
getProtocolContent(new UTF8(urlString), new CrawlDatum()).
>
> After removal of all these non-compiling classes, the trunk tests
> complete successfully on my machine (JDK 1.4.2).
>
> If there are no objections, especially from Andrzej, I can do
> the cleanup tomorrow.
Your help would be most welcome, no objections here.
--
Best regards,
Andrzej Bialecki <><
Re: Mega-cleanup in trunk/
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Andrzej Bialecki wrote:
> Hi,
>
> I just committed a large patch to clean up the trunk/ of obsolete and
> broken classes remaining from the 0.7.x development line. Please test
> that things still work as they should ...
>
Hi,
I am not sure what is wrong, but a lot of JUnit tests simply do not
compile. I did an svn checkout into a new directory to be sure I do not
have anything left over from my experiments.
I am looking at it right now, but I would suggest temporarily doing a
quick cleanup to make trunk testable:
1) Remove permanently, as the classes under test have been removed from trunk:
src/test/org/apache/nutch/pagedb/TestFetchListEntry.java
src/test/org/apache/nutch/pagedb/TestPage.java
src/test/org/apache/nutch/db/TestWebDB.java
src/test/org/apache/nutch/db/DBTester.java
src/test/org/apache/nutch/tools/TestSegmentMergeTool.java
2) Remove temporarily and create a JIRA issue to fix them:
src/test/org/apache/nutch/fetcher/TestFetcher.java
src/test/org/apache/nutch/fetcher/TestFetcherOutput.java
3) Remove unused import in:
src/test/org/apache/nutch/parse/TestParseText.java
4) Fix (it looks simple to fix; I will look at it in the meantime):
src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java
src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java
src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java
src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java
After removal of all these non-compiling classes, the trunk tests complete
successfully on my machine (JDK 1.4.2).
If there are no objections, especially from Andrzej, I can do the
cleanup tomorrow.
P.
Re: java.io.IOException: Job failed
Posted by Doug Cutting <cu...@nutch.org>.
Gal Nitzan wrote:
> I am using trunk. While trying to crawl I get the following:
[ ...]
> 050825 100222 task_m_ns3ehv Error running child
> 050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
> 050825 100222 task_m_ns3ehv at org.apache.nutch.indexer.DeleteDuplicates$1.getPos(DeleteDuplicates.java:193)
I just fixed this.
Doug
java.io.IOException: Job failed
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I am using trunk. While trying to crawl I get the following:
in crawl log:
051229 235114 Dedup: adding indexes in: crawl/indexes
051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/crawl-tool.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-site.xml
051229 235115 Running job: job_r1bmnj
051229 235116 map 0%
051229 235138 reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:309)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:123)
in tasktracker log:
050825 100222 task_m_ns3ehv Error running child
050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
050825 100222 task_m_ns3ehv at org.apache.nutch.indexer.DeleteDuplicates$1.getPos(DeleteDuplicates.java:193)
050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:102)
050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:604)
Regards,
Gal
Re: Trunk is broken
Posted by Andrzej Bialecki <ab...@getopt.org>.
Thomas Jaeger wrote:
>Hi Andrzej,
>
>Gal Nitzan wrote:
>
>>>It seems that Trunk is now broken...
>
>DmozParser seems to be broken, too. Its package declaration is still
>org.apache.nutch.crawl instead of org.apache.nutch.tools.
>
Fixed. Thanks!
--
Best regards,
Andrzej Bialecki <><
Re: Trunk is broken
Posted by Thomas Jaeger <nu...@thjaeger.org>.
Hi Andrzej,
Gal Nitzan wrote:
>> It seems that Trunk is now broken...
>>
DmozParser seems to be broken, too. Its package declaration is still
org.apache.nutch.crawl instead of org.apache.nutch.tools.
TJ
Re: Trunk is broken
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
>It seems that Trunk is now broken...
>
>In Crawl.java line 111 the parameter for parsing is missing.
>
>For myself I have added the line:
>
> boolean parsing = conf.getBoolean("fetcher.parse", true);
>
>and added the param parsing to
> new Fetcher(conf).fetch(segment, threads, parsing); // fetch it
>
>Also, the Javadoc build has a million errors.
>
Fixed. Thanks for spotting this!
--
Best regards,
Andrzej Bialecki <><
Trunk is broken
Posted by Gal Nitzan <gn...@usa.net>.
It seems that Trunk is now broken...
In Crawl.java line 111 the parameter for parsing is missing.
For myself I have added the line:
    boolean parsing = conf.getBoolean("fetcher.parse", true);
and added the param parsing to
    new Fetcher(conf).fetch(segment, threads, parsing); // fetch it
Also, the Javadoc build has a million errors.
Gal
Re: Bug in DeleteDuplicates.java ?
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>
>> This function throws IOException. Why?
>>
>>     public long getPos() throws IOException {
>>       return (doc*INDEX_LENGTH)/maxDoc;
>>     }
>>
>> It should be throwing ArithmeticException.
>>
>>
>
> The IOException is required by the API of RecordReader.
>
>> What happens when maxDoc is zero?
>>
>>
>
> Ka-boom! ;-) You're right, this should be wrapped in an IOException and
> rethrown.
No, it should really just be fixed to not cause an ArithmeticException.
This is called to report progress. In this case the input "file" for
the map is a Lucene index whose documents we iterate through. To
simplify the construction of input splits (without opening each index) a
constant "length" is used for each "file". So we have to scale the
document numbers to give progress in this range.
The problem is that progress may be reported even when there are no
documents in the index. So the call is valid and no exception should be
thrown.
Doug
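A minimal sketch of the kind of guard described above, not the change that was actually committed. The class name ProgressSketch, the value of INDEX_LENGTH, and the choice to report position 0 for an empty index are all assumptions made for illustration; only the expression (doc*INDEX_LENGTH)/maxDoc comes from the thread.

```java
public class ProgressSketch {
    // Assumed value: the constant "length" assigned to every index "file".
    // The real constant in DeleteDuplicates may differ.
    static final long INDEX_LENGTH = 1000000L;

    // Scale the current document number into the range [0, INDEX_LENGTH].
    // An empty index (maxDoc == 0) is a valid input, so report position 0
    // instead of dividing by zero.
    static long getPos(int doc, int maxDoc) {
        if (maxDoc <= 0) {
            return 0L; // assumption: "no progress" is the sensible answer here
        }
        return (doc * INDEX_LENGTH) / maxDoc; // int * long is computed as long
    }

    public static void main(String[] args) {
        System.out.println(getPos(0, 0));   // empty index, no exception
        System.out.println(getPos(5, 10));  // halfway through a 10-document index
    }
}
```

With this shape the progress call never throws for an empty index, so nothing needs to be wrapped or rethrown.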
Re: Bug in DeleteDuplicates.java ?
Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
>This function throws IOException. Why?
>
>    public long getPos() throws IOException {
>      return (doc*INDEX_LENGTH)/maxDoc;
>    }
>
>It should be throwing ArithmeticException.
>
>
>
The IOException is required by the API of RecordReader.
>What happens when maxDoc is zero?
>
>
Ka-boom! ;-) You're right, this should be wrapped in an IOException and
rethrown.
--
Best regards,
Andrzej Bialecki <><
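For illustration only, a sketch of the wrap-and-rethrow approach suggested in this reply. The PositionReader interface is a hypothetical stand-in for RecordReader (only the throws clause matters here), and WrapSketch and the INDEX_LENGTH value are likewise assumptions. Note that Doug's reply in this thread argues for avoiding the exception altogether rather than wrapping it.

```java
import java.io.IOException;

public class WrapSketch {
    // Hypothetical stand-in for the reader API: the interface declares
    // IOException, so implementations may declare it too.
    interface PositionReader {
        long getPos() throws IOException;
    }

    // Assumed value for illustration; the real constant may differ.
    static final long INDEX_LENGTH = 1000000L;

    static PositionReader reader(final int doc, final int maxDoc) {
        return new PositionReader() {
            public long getPos() throws IOException {
                try {
                    // Integer division by zero throws the unchecked
                    // ArithmeticException, which needs no throws clause.
                    return (doc * INDEX_LENGTH) / maxDoc;
                } catch (ArithmeticException e) {
                    // Convert it to the checked exception the interface declares.
                    throw new IOException("cannot compute position: " + e.getMessage());
                }
            }
        };
    }
}
```

ArithmeticException is unchecked, which is why the original method compiles without declaring it; the IOException in the signature comes only from the interface.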
Bug in DeleteDuplicates.java ?
Posted by Gal Nitzan <gn...@usa.net>.
This function throws IOException. Why?
    public long getPos() throws IOException {
      return (doc*INDEX_LENGTH)/maxDoc;
    }
It should be throwing ArithmeticException.
What happens when maxDoc is zero?
Gal
Re: Mega-cleanup in trunk/
Posted by Byron Miller <by...@yahoo.com>.
I'll pull a build down tonight and let you know how it
goes!
-byron
--- Andrzej Bialecki <ab...@getopt.org> wrote:
> Hi,
>
> I just committed a large patch to clean up the trunk/ of obsolete and
> broken classes remaining from the 0.7.x development line. Please test
> that things still work as they should ...