Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/12/14 14:18:51 UTC

vote for issues to fix in 0.7.2

The full list of open issues, with complete descriptions, can be found here:
http://issues.apache.org/jira/secure/IssueNavigator.jspa?view=full&tempMax=30

Please add a "+1" under each issue you want to vote for.
Please keep in mind that this will be more of a maintenance release.

NUTCH-141	jobdetails.jsp doesnt work on webbrowser "safari"
NUTCH-140	Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping
NUTCH-139	Standard metadata property names in the ParseData metadata
NUTCH-138	non-Latin-1 characters cannot be submitted for search
NUTCH-137	footer is not displayed in search result page	
NUTCH-136	mapreduce segment generator generates 50 % less than excepted urls
NUTCH-34	Parsing different content formats	
NUTCH-3	multi values of header discarded	
NUTCH-134	Summarizer doesn't select the best snippets	
NUTCH-132	Add ability to sort on more than one column	
NUTCH-131	Non-documented variable: mapred.child.heap.size
NUTCH-98	RobotRulesParser interprets robots.txt incorrectly
NUTCH-129	rtf-parser does not work when opened with wordpad files and saved
NUTCH-120	one "bad" link on a page kills parsing	
NUTCH-128	second configuration nodes overwrites first node
NUTCH-127	uncorrect values using -du, or ls does not return items
NUTCH-126	Fetching via https does not work with a proxy (patch)
NUTCH-125	OpenOffice Parser plugin	
NUTCH-110	OpenSearchServlet outputs illegal xml characters
NUTCH-36	Chinese in Nutch	
NUTCH-123	Cache.jsp some times generate NullPointerException
NUTCH-39	pagination in search result	
NUTCH-49	Flag for generate to fetch only new pages to complement the -refetchonly flag
NUTCH-94	MapFile.Writer throwing 'File exists error'.	
NUTCH-117	Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
NUTCH-122	block numbers need a better random number generator
NUTCH-82	Nutch Commands should run on Windows without external tools
NUTCH-121	SegmentReader for mapred	
NUTCH-119	Regexp to extract outlinks incorrect	
NUTCH-118	FAQ link points to invalid URL	
NUTCH-115	jobtracker.jsp shows too much information	
NUTCH-103	Vivisimo like treeview and url redirect	
NUTCH-108	tasktracker crashs when reconnecting to a new jobtracker.
NUTCH-113	Disable permanent DNS-to-IP caching for JVM 1.4
NUTCH-111	ndfs.replication is not documented within the nutch-default.xml configuration file.
NUTCH-100	New plugin urlfilter-db	
NUTCH-101	RobotRulesParser	
NUTCH-96	MapFile.Writer throws directory exists exception if run multiple times in the same JVM or server JVM.
NUTCH-106	Datanode corruption	
NUTCH-105	Network error during robots.txt fetch causes file to be ignored
NUTCH-104	Nutch query parser does not support CJK bi-gram segmentation.
NUTCH-102	jobtracker does not start when webapps is in src
NUTCH-95	DeleteDuplicates depends on the order of input segments
NUTCH-92	DistributedSearch incorrectly scores results	
NUTCH-87	Efficient site-specific crawling for a large number of sites
NUTCH-91	empty encoding causes exception	
NUTCH-90	reduce logging output of IndexSegment	
NUTCH-52	Parser plugin for MS Excel files	
NUTCH-86	LanguageIdentifier API enhancements	
NUTCH-84	Fetcher for constrained crawls	
NUTCH-74	French Analyzer Plugin	
NUTCH-83	Release deliverable as zip	
NUTCH-81	Webapp only works when deployed in root	
NUTCH-79	Fault tolerant searching.	
NUTCH-64	no results after a restart of a search--server (without tomcat restart)
NUTCH-76	NDFS DataNode advertises localhost as it's address
NUTCH-75	Patch for WebDBReader to get more detailed information about WebDBs
NUTCH-73	A page for CSV results	
NUTCH-72	Query basic filter with correction feature	
NUTCH-70	duplicate pages - virtual hosts in db.	
NUTCH-68	A tool to generate arbitrary fetchlists	
NUTCH-62	Add html META tag information into metaData in index-more plugin
NUTCH-61	Adaptive re-fetch interval. Detecting umodified content
NUTCH-55	Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available &
NUTCH-59	meta data support in webdb	
NUTCH-25	needs 'character encoding' detector	
NUTCH-44	too many search results	
NUTCH-42	enhance search.jsp such that it can also returns XML
NUTCH-50	Benchmarks & Performance goals	
NUTCH-13	If dns points to 127.0.0.1, the url is also crawled
NUTCH-48	"Did you mean" query enhancement/refignment feature request
NUTCH-47	Configure host filter to do wildcard prefixes - *.redhat.com
NUTCH-45	Log corrupt segments in SegmentMergeTool	
NUTCH-26	New Http Authentication mechanism	
NUTCH-24	Cannot handle incorrectly cased Content-Type	
NUTCH-23	content text/xml parser	
NUTCH-18	Windows servers include illegal characters in URLs
NUTCH-16	boost documents matching a url pattern	
NUTCH-14	NullPointerException NutchBean.getSummary	
NUTCH-12	WebDBReader options to print incoming links

Re: vote for issues to fix in 0.7.2

Posted by Andrew McNabb <am...@mcnabbs.org>.
> NUTCH-127	uncorrect values using -du, or ls does not return items
NUTCH-127 +1

> NUTCH-121	SegmentReader for mapred	
NUTCH-121 +1

> NUTCH-115	jobtracker.jsp shows too much information	
NUTCH-115 +1

> NUTCH-108	tasktracker crashs when reconnecting to a new jobtracker.
NUTCH-108 +1

> NUTCH-111	ndfs.replication is not documented within the nutch-default.xml configuration file.
NUTCH-111 +1


-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

Re: vote for issues to fix in 0.7.2

Posted by Matthias Jaekle <ja...@eventax.de>.
> NUTCH-134    Summarizer doesn't select the best snippets   
+1

> NUTCH-98    RobotRulesParser interprets robots.txt incorrectly
+1

> NUTCH-120    one "bad" link on a page kills parsing   
+1

> NUTCH-95    DeleteDuplicates depends on the order of input segments
+1

> NUTCH-13    If dns points to 127.0.0.1, the url is also crawled
+1

> NUTCH-45    Log corrupt segments in SegmentMergeTool   
+1


Matthias

Re: vote for issues to fix in 0.7.2

Posted by Florent Gluck <fl...@busytonight.com>.
I hope it's not too late to accept my votes. Here they are:

> NUTCH-136    mapreduce segment generator generates 50 % less than 
> excepted urls

+1

> NUTCH-121    SegmentReader for mapred   

+1

> NUTCH-108    tasktracker crashs when reconnecting to a new jobtracker.

+1

Thanks,
--Flo

Re: vote for issues to fix in 0.7.2

Posted by YourSoft <yo...@freemail.hu>.
Dear Stefan,

This is a great list.
Here are my votes:

> NUTCH-141    jobdetails.jsp doesnt work on webbrowser "safari"
> NUTCH-140    Add alias capability in parse-plugins.xml file that 
> allows  mimeType->extensionId mapping  +1
> NUTCH-139    Standard metadata property names in the ParseData 
> metadata  +1
> NUTCH-138    non-Latin-1 characters cannot be submitted for search
> NUTCH-137    footer is not displayed in search result page   
> NUTCH-136    mapreduce segment generator generates 50 % less than  
> excepted urls
> NUTCH-34    Parsing different content formats
> NUTCH-3    multi values of header discarded   
> NUTCH-134    Summarizer doesn't select the best snippets
> NUTCH-132    Add ability to sort on more than one column   
> NUTCH-131    Non-documented variable: mapred.child.heap.size
> NUTCH-98    RobotRulesParser interprets robots.txt incorrectly
> NUTCH-129    rtf-parser does not work when opened with wordpad files 
> and  saved
> NUTCH-120    one "bad" link on a page kills parsing    +1
> NUTCH-128    second configuration nodes overwrites first node
> NUTCH-127    uncorrect values using -du, or ls does not return items
> NUTCH-126    Fetching via https does not work with a proxy (patch)
> NUTCH-125    OpenOffice Parser plugin    +1
> NUTCH-110    OpenSearchServlet outputs illegal xml characters
> NUTCH-36    Chinese in Nutch   
> NUTCH-123    Cache.jsp some times generate NullPointerException +1
> NUTCH-39    pagination in search result   
> NUTCH-49    Flag for generate to fetch only new pages to complement the -refetchonly flag
> NUTCH-94    MapFile.Writer throwing 'File exists error'.   
> NUTCH-117    Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
> NUTCH-122    block numbers need a better random number generator
> NUTCH-82    Nutch Commands should run on Windows without external tools
> NUTCH-121    SegmentReader for mapred   
> NUTCH-119    Regexp to extract outlinks incorrect    +1
> NUTCH-118    FAQ link points to invalid URL   
> NUTCH-115    jobtracker.jsp shows too much information   
> NUTCH-103    Vivisimo like treeview and url redirect   
> NUTCH-108    tasktracker crashs when reconnecting to a new jobtracker.
> NUTCH-113    Disable permanent DNS-to-IP caching for JVM 1.4 +1
> NUTCH-111    ndfs.replication is not documented within the nutch-default.xml configuration file.
> NUTCH-100    New plugin urlfilter-db   
> NUTCH-101    RobotRulesParser   
> NUTCH-96    MapFile.Writer throws directory exists exception if run  
> multiple times in the same JVM or server JVM.
> NUTCH-106    Datanode corruption   
> NUTCH-105    Network error during robots.txt fetch causes file to be  
> ignored +1
> NUTCH-104    Nutch query parser does not support CJK bi-gram 
> segmentation.
> NUTCH-102    jobtracker does not start when webapps is in src
> NUTCH-95    DeleteDuplicates depends on the order of input segments
> NUTCH-92    DistributedSearch incorrectly scores results    +1
> NUTCH-87    Efficient site-specific crawling for a large number of sites
> NUTCH-91    empty encoding causes exception   
> NUTCH-90    reduce logging output of IndexSegment   
> NUTCH-52    Parser plugin for MS Excel files    +1
> NUTCH-86    LanguageIdentifier API enhancements   
> NUTCH-84    Fetcher for constrained crawls   
> NUTCH-74    French Analyzer Plugin   
> NUTCH-83    Release deliverable as zip   
> NUTCH-81    Webapp only works when deployed in root   
> NUTCH-79    Fault tolerant searching.   
> NUTCH-64    no results after a restart of a search--server (without  
> tomcat restart) +1
> NUTCH-76    NDFS DataNode advertises localhost as it's address
> NUTCH-75    Patch for WebDBReader to get more detailed information 
> about  WebDBs
> NUTCH-73    A page for CSV results   
> NUTCH-72    Query basic filter with correction feature   
> NUTCH-70    duplicate pages - virtual hosts in db.   
> NUTCH-68    A tool to generate arbitrary fetchlists   
> NUTCH-62    Add html META tag information into metaData in index-more  
> plugin
> NUTCH-61    Adaptive re-fetch interval. Detecting umodified content
> NUTCH-55    Create dmoz.org search plugin - incorporate the dmoz.org  
> title/category/description if available &
> NUTCH-59    meta data support in webdb   
> NUTCH-25    needs 'character encoding' detector   
> NUTCH-44    too many search results
> NUTCH-42    enhance search.jsp such that it can also returns XML
> NUTCH-50    Benchmarks & Performance goals   
> NUTCH-13    If dns points to 127.0.0.1, the url is also crawled
> NUTCH-48    "Did you mean" query enhancement/refignment feature request
> NUTCH-47    Configure host filter to do wildcard prefixes - *.redhat.com
> NUTCH-45    Log corrupt segments in SegmentMergeTool   
> NUTCH-26    New Http Authentication mechanism   
> NUTCH-24    Cannot handle incorrectly cased Content-Type    +1
> NUTCH-23    content text/xml parser   
> NUTCH-18    Windows servers include illegal characters in URLs
> NUTCH-16    boost documents matching a url pattern    +1
> NUTCH-14    NullPointerException NutchBean.getSummary   
> NUTCH-12    WebDBReader options to print incoming links


Re: vote for issues to fix in 0.7.2

Posted by Stefan Groschupf <sg...@media-style.com>.
My personal favorites list:
In a day or so I will count all votes and post them.

> NUTCH-141	jobdetails.jsp doesnt work on webbrowser "safari"
+1
> NUTCH-140	Add alias capability in parse-plugins.xml file that  
> allows mimeType->extensionId mapping
> NUTCH-139	Standard metadata property names in the ParseData metadata
+1
> NUTCH-138	non-Latin-1 characters cannot be submitted for search
+1
> NUTCH-137	footer is not displayed in search result page	
> NUTCH-136	mapreduce segment generator generates 50 % less than  
> excepted urls
> NUTCH-34	Parsing different content formats	
> NUTCH-3	multi values of header discarded	
+1
> NUTCH-134	Summarizer doesn't select the best snippets	
> NUTCH-132	Add ability to sort on more than one column	
> NUTCH-131	Non-documented variable: mapred.child.heap.size
> NUTCH-98	RobotRulesParser interprets robots.txt incorrectly
> NUTCH-129	rtf-parser does not work when opened with wordpad files  
> and saved
> NUTCH-120	one "bad" link on a page kills parsing	
+1
> NUTCH-128	second configuration nodes overwrites first node
> NUTCH-127	uncorrect values using -du, or ls does not return items
> NUTCH-126	Fetching via https does not work with a proxy (patch)
+1
> NUTCH-125	OpenOffice Parser plugin	
+1
> NUTCH-110	OpenSearchServlet outputs illegal xml characters
+1
> NUTCH-36	Chinese in Nutch	
> NUTCH-123	Cache.jsp some times generate NullPointerException
+1 (may already be fixed)
> NUTCH-39	pagination in search result	
> NUTCH-49	Flag for generate to fetch only new pages to complement  
> the -refetchonly flag
> NUTCH-94	MapFile.Writer throwing 'File exists error'.	
> NUTCH-117	Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
> NUTCH-122	block numbers need a better random number generator
> NUTCH-82	Nutch Commands should run on Windows without external tools
> NUTCH-121	SegmentReader for mapred	
> NUTCH-119	Regexp to extract outlinks incorrect	
+1
> NUTCH-118	FAQ link points to invalid URL	
> NUTCH-115	jobtracker.jsp shows too much information	
> NUTCH-103	Vivisimo like treeview and url redirect	
> NUTCH-108	tasktracker crashs when reconnecting to a new jobtracker.
> NUTCH-113	Disable permanent DNS-to-IP caching for JVM 1.4
> NUTCH-111	ndfs.replication is not documented within the nutch-default.xml configuration file.
> NUTCH-100	New plugin urlfilter-db
+1
> NUTCH-101	RobotRulesParser	
> NUTCH-96	MapFile.Writer throws directory exists exception if run  
> multiple times in the same JVM or server JVM.
> NUTCH-106	Datanode corruption	
> NUTCH-105	Network error during robots.txt fetch causes file to be  
> ignored
> NUTCH-104	Nutch query parser does not support CJK bi-gram  
> segmentation.
> NUTCH-102	jobtracker does not start when webapps is in src
> NUTCH-95	DeleteDuplicates depends on the order of input segments
> NUTCH-92	DistributedSearch incorrectly scores results	
> NUTCH-87	Efficient site-specific crawling for a large number of sites
> NUTCH-91	empty encoding causes exception
+1
> NUTCH-90	reduce logging output of IndexSegment	
> NUTCH-52	Parser plugin for MS Excel files	
> NUTCH-86	LanguageIdentifier API enhancements	
> NUTCH-84	Fetcher for constrained crawls	
> NUTCH-74	French Analyzer Plugin
+1
> NUTCH-83	Release deliverable as zip	
> NUTCH-81	Webapp only works when deployed in root	
> NUTCH-79	Fault tolerant searching.	
> NUTCH-64	no results after a restart of a search--server (without  
> tomcat restart)
> NUTCH-76	NDFS DataNode advertises localhost as it's address
> NUTCH-75	Patch for WebDBReader to get more detailed information  
> about WebDBs
> NUTCH-73	A page for CSV results	
> NUTCH-72	Query basic filter with correction feature	
> NUTCH-70	duplicate pages - virtual hosts in db.	
> NUTCH-68	A tool to generate arbitrary fetchlists	
+1
> NUTCH-62	Add html META tag information into metaData in index-more  
> plugin
++1!
> NUTCH-61	Adaptive re-fetch interval. Detecting umodified content
++1! But is it ready to use?
> NUTCH-55	Create dmoz.org search plugin - incorporate the dmoz.org  
> title/category/description if available &
> NUTCH-59	meta data support in webdb	
> NUTCH-25	needs 'character encoding' detector	
> NUTCH-44	too many search results	
> NUTCH-42	enhance search.jsp such that it can also returns XML
> NUTCH-50	Benchmarks & Performance goals	
> NUTCH-13	If dns points to 127.0.0.1, the url is also crawled
> NUTCH-48	"Did you mean" query enhancement/refignment feature request
+1
> NUTCH-47	Configure host filter to do wildcard prefixes - *.redhat.com
> NUTCH-45	Log corrupt segments in SegmentMergeTool	
> NUTCH-26	New Http Authentication mechanism	
> NUTCH-24	Cannot handle incorrectly cased Content-Type	
> NUTCH-23	content text/xml parser	
> NUTCH-18	Windows servers include illegal characters in URLs
> NUTCH-16	boost documents matching a url pattern	
> NUTCH-14	NullPointerException NutchBean.getSummary	
> NUTCH-12	WebDBReader options to print incoming links
>


Re: vote for issues to fix in 0.7.2

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Marko Bauhardt wrote:
>> NUTCH-141    jobdetails.jsp doesnt work on webbrowser "safari"
> 
> +1
> :-)
> 
> Marko.
> 
I have just fixed NUTCH-141 in all branches, so we do not need to concentrate on
obvious things.
One additional point: the majority of issues people vote for in this
thread are mapred related. I think those voters use the mapred branch, so fixing
them in 0.7.2 would not help them. Please use the JIRA voting feature for such
issues. Here I would like to see a list of things that we think are
important for 0.7 branch users.
Regards
Piotr

Re: vote for issues to fix in 0.7.2

Posted by Marko Bauhardt <mb...@media-style.com>.
> NUTCH-141	jobdetails.jsp doesnt work on webbrowser "safari"
+1
:-)

Marko.