Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/12/14 14:18:51 UTC
vote for issues to fix in 0.7.2
The full list of open issues, with complete descriptions, can be found here:
http://issues.apache.org/jira/secure/IssueNavigator.jspa?view=full&tempMax=30
Please add a "+1" under each issue you vote for.
Please keep in mind that this will be more of a maintenance release.
NUTCH-141 jobdetails.jsp doesnt work on webbrowser "safari"
NUTCH-140 Add alias capability in parse-plugins.xml file that allows
mimeType->extensionId mapping
NUTCH-139 Standard metadata property names in the ParseData metadata
NUTCH-138 non-Latin-1 characters cannot be submitted for search
NUTCH-137 footer is not displayed in search result page
NUTCH-136 mapreduce segment generator generates 50 % less than
excepted urls
NUTCH-34 Parsing different content formats
NUTCH-3 multi values of header discarded
NUTCH-134 Summarizer doesn't select the best snippets
NUTCH-132 Add ability to sort on more than one column
NUTCH-131 Non-documented variable: mapred.child.heap.size
NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
NUTCH-129 rtf-parser does not work when opened with wordpad files and
saved
NUTCH-120 one "bad" link on a page kills parsing
NUTCH-128 second configuration nodes overwrites first node
NUTCH-127 uncorrect values using -du, or ls does not return items
NUTCH-126 Fetching via https does not work with a proxy (patch)
NUTCH-125 OpenOffice Parser plugin
NUTCH-110 OpenSearchServlet outputs illegal xml characters
NUTCH-36 Chinese in Nutch
NUTCH-123 Cache.jsp some times generate NullPointerException
NUTCH-39 pagination in search result
NUTCH-49 Flag for generate to fetch only new pages to complement the
-refetchonly flag
NUTCH-94 MapFile.Writer throwing 'File exists error'.
NUTCH-117 Crawl crashes with java.io.IOException: already exists:
C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
NUTCH-122 block numbers need a better random number generator
NUTCH-82 Nutch Commands should run on Windows without external tools
NUTCH-121 SegmentReader for mapred
NUTCH-119 Regexp to extract outlinks incorrect
NUTCH-118 FAQ link points to invalid URL
NUTCH-115 jobtracker.jsp shows too much information
NUTCH-103 Vivisimo like treeview and url redirect
NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker.
NUTCH-113 Disable permanent DNS-to-IP caching for JVM 1.4
NUTCH-111 ndfs.replication is not documented within the
nutch-default.xml configuration file.
NUTCH-100 New plugin urlfilter-db
NUTCH-101 RobotRulesParser
NUTCH-96 MapFile.Writer throws directory exists exception if run
multiple times in the same JVM or server JVM.
NUTCH-106 Datanode corruption
NUTCH-105 Network error during robots.txt fetch causes file to be
ignored
NUTCH-104 Nutch query parser does not support CJK bi-gram segmentation.
NUTCH-102 jobtracker does not start when webapps is in src
NUTCH-95 DeleteDuplicates depends on the order of input segments
NUTCH-92 DistributedSearch incorrectly scores results
NUTCH-87 Efficient site-specific crawling for a large number of sites
NUTCH-91 empty encoding causes exception
NUTCH-90 reduce logging output of IndexSegment
NUTCH-52 Parser plugin for MS Excel files
NUTCH-86 LanguageIdentifier API enhancements
NUTCH-84 Fetcher for constrained crawls
NUTCH-74 French Analyzer Plugin
NUTCH-83 Release deliverable as zip
NUTCH-81 Webapp only works when deployed in root
NUTCH-79 Fault tolerant searching.
NUTCH-64 no results after a restart of a search--server (without
tomcat restart)
NUTCH-76 NDFS DataNode advertises localhost as it's address
NUTCH-75 Patch for WebDBReader to get more detailed information about
WebDBs
NUTCH-73 A page for CSV results
NUTCH-72 Query basic filter with correction feature
NUTCH-70 duplicate pages - virtual hosts in db.
NUTCH-68 A tool to generate arbitrary fetchlists
NUTCH-62 Add html META tag information into metaData in index-more
plugin
NUTCH-61 Adaptive re-fetch interval. Detecting umodified content
NUTCH-55 Create dmoz.org search plugin - incorporate the dmoz.org
title/category/description if available &
NUTCH-59 meta data support in webdb
NUTCH-25 needs 'character encoding' detector
NUTCH-44 too many search results
NUTCH-42 enhance search.jsp such that it can also returns XML
NUTCH-50 Benchmarks & Performance goals
NUTCH-13 If dns points to 127.0.0.1, the url is also crawled
NUTCH-48 "Did you mean" query enhancement/refignment feature request
NUTCH-47 Configure host filter to do wildcard prefixes - *.redhat.com
NUTCH-45 Log corrupt segments in SegmentMergeTool
NUTCH-26 New Http Authentication mechanism
NUTCH-24 Cannot handle incorrectly cased Content-Type
NUTCH-23 content text/xml parser
NUTCH-18 Windows servers include illegal characters in URLs
NUTCH-16 boost documents matching a url pattern
NUTCH-14 NullPointerException NutchBean.getSummary
NUTCH-12 WebDBReader options to print incoming links
Re: vote for issues to fix in 0.7.2
Posted by Andrew McNabb <am...@mcnabbs.org>.
> NUTCH-127 uncorrect values using -du, or ls does not return items
NUTCH-127 +1
> NUTCH-121 SegmentReader for mapred
NUTCH-121 +1
> NUTCH-115 jobtracker.jsp shows too much information
NUTCH-115 +1
> NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker.
NUTCH-108 +1
> NUTCH-111 ndfs.replication is not documented within the
> nutch-default.xml configuration file.
NUTCH-111 +1
--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
Re: vote for issues to fix in 0.7.2
Posted by Matthias Jaekle <ja...@eventax.de>.
> NUTCH-134 Summarizer doesn't select the best snippets
+1
> NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
+1
> NUTCH-120 one "bad" link on a page kills parsing
+1
> NUTCH-95 DeleteDuplicates depends on the order of input segments
+1
> NUTCH-13 If dns points to 127.0.0.1, the url is also crawled
+1
> NUTCH-45 Log corrupt segments in SegmentMergeTool
+1
Matthias
Re: vote for issues to fix in 0.7.2
Posted by Florent Gluck <fl...@busytonight.com>.
I hope it's not too late to accept my votes. Here they are:
> NUTCH-136 mapreduce segment generator generates 50 % less than
> excepted urls
+1
> NUTCH-121 SegmentReader for mapred
+1
> NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker.
+1
Thanks,
--Flo
Re: vote for issues to fix in 0.7.2
Posted by YourSoft <yo...@freemail.hu>.
Dear Stefan,
This is a great list.
Here are my votes:
> NUTCH-141 jobdetails.jsp doesnt work on webbrowser "safari"
> NUTCH-140 Add alias capability in parse-plugins.xml file that
> allows mimeType->extensionId mapping +1
> NUTCH-139 Standard metadata property names in the ParseData
> metadata +1
> NUTCH-138 non-Latin-1 characters cannot be submitted for search
> NUTCH-137 footer is not displayed in search result page
> NUTCH-136 mapreduce segment generator generates 50 % less than
> excepted urls
> NUTCH-34 Parsing different content formats
> NUTCH-3 multi values of header discarded
> NUTCH-134 Summarizer doesn't select the best snippets
> NUTCH-132 Add ability to sort on more than one column
> NUTCH-131 Non-documented variable: mapred.child.heap.size
> NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
> NUTCH-129 rtf-parser does not work when opened with wordpad files
> and saved
> NUTCH-120 one "bad" link on a page kills parsing +1
> NUTCH-128 second configuration nodes overwrites first node
> NUTCH-127 uncorrect values using -du, or ls does not return items
> NUTCH-126 Fetching via https does not work with a proxy (patch)
> NUTCH-125 OpenOffice Parser plugin +1
> NUTCH-110 OpenSearchServlet outputs illegal xml characters
> NUTCH-36 Chinese in Nutch
> NUTCH-123 Cache.jsp some times generate NullPointerException +1
> NUTCH-39 pagination in search result
> NUTCH-49 Flag for generate to fetch only new pages to complement
> the - refetchonly flag
> NUTCH-94 MapFile.Writer throwing 'File exists error'.
> NUTCH-117 Crawl crashes with java.io.IOException: already exists:
> C: \nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
> NUTCH-122 block numbers need a better random number generator
> NUTCH-82 Nutch Commands should run on Windows without external tools
> NUTCH-121 SegmentReader for mapred
> NUTCH-119 Regexp to extract outlinks incorrect +1
> NUTCH-118 FAQ link points to invalid URL
> NUTCH-115 jobtracker.jsp shows too much information
> NUTCH-103 Vivisimo like treeview and url redirect
> NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker.
> NUTCH-113 Disable permanent DNS-to-IP caching for JVM 1.4 +1
> NUTCH-111 ndfs.replication is not documented within the nutch-
> default.xml configuration file.
> NUTCH-100 New plugin urlfilter-db
> NUTCH-101 RobotRulesParser
> NUTCH-96 MapFile.Writer throws directory exists exception if run
> multiple times in the same JVM or server JVM.
> NUTCH-106 Datanode corruption
> NUTCH-105 Network error during robots.txt fetch causes file to be
> ignored +1
> NUTCH-104 Nutch query parser does not support CJK bi-gram
> segmentation.
> NUTCH-102 jobtracker does not start when webapps is in src
> NUTCH-95 DeleteDuplicates depends on the order of input segments
> NUTCH-92 DistributedSearch incorrectly scores results +1
> NUTCH-87 Efficient site-specific crawling for a large number of sites
> NUTCH-91 empty encoding causes exception
> NUTCH-90 reduce logging output of IndexSegment
> NUTCH-52 Parser plugin for MS Excel files +1
> NUTCH-86 LanguageIdentifier API enhancements
> NUTCH-84 Fetcher for constrained crawls
> NUTCH-74 French Analyzer Plugin
> NUTCH-83 Release deliverable as zip
> NUTCH-81 Webapp only works when deployed in root
> NUTCH-79 Fault tolerant searching.
> NUTCH-64 no results after a restart of a search--server (without
> tomcat restart) +1
> NUTCH-76 NDFS DataNode advertises localhost as it's address
> NUTCH-75 Patch for WebDBReader to get more detailed information
> about WebDBs
> NUTCH-73 A page for CSV results
> NUTCH-72 Query basic filter with correction feature
> NUTCH-70 duplicate pages - virtual hosts in db.
> NUTCH-68 A tool to generate arbitrary fetchlists
> NUTCH-62 Add html META tag information into metaData in index-more
> plugin
> NUTCH-61 Adaptive re-fetch interval. Detecting umodified content
> NUTCH-55 Create dmoz.org search plugin - incorporate the dmoz.org
> title/category/description if available &
> NUTCH-59 meta data support in webdb
> NUTCH-25 needs 'character encoding' detector
> NUTCH-44 too many search results
> NUTCH-42 enhance search.jsp such that it can also returns XML
> NUTCH-50 Benchmarks & Performance goals
> NUTCH-13 If dns points to 127.0.0.1, the url is also crawled
> NUTCH-48 "Did you mean" query enhancement/refignment feature request
> NUTCH-47 Configure host filter to do wildcard prefixes - *.redhat.com
> NUTCH-45 Log corrupt segments in SegmentMergeTool
> NUTCH-26 New Http Authentication mechanism
> NUTCH-24 Cannot handle incorrectly cased Content-Type +1
> NUTCH-23 content text/xml parser
> NUTCH-18 Windows servers include illegal characters in URLs
> NUTCH-16 boost documents matching a url pattern +1
> NUTCH-14 NullPointerException NutchBean.getSummary
> NUTCH-12 WebDBReader options to print incoming links
Re: vote for issues to fix in 0.7.2
Posted by Stefan Groschupf <sg...@media-style.com>.
My personal favorites list.
In a day or so I will count all votes and post them.
> NUTCH-141 jobdetails.jsp doesnt work on webbrowser "safari"
+1
> NUTCH-140 Add alias capability in parse-plugins.xml file that
> allows mimeType->extensionId mapping
> NUTCH-139 Standard metadata property names in the ParseData metadata
+1
> NUTCH-138 non-Latin-1 characters cannot be submitted for search
+1
> NUTCH-137 footer is not displayed in search result page
> NUTCH-136 mapreduce segment generator generates 50 % less than
> excepted urls
> NUTCH-34 Parsing different content formats
> NUTCH-3 multi values of header discarded
+1
> NUTCH-134 Summarizer doesn't select the best snippets
> NUTCH-132 Add ability to sort on more than one column
> NUTCH-131 Non-documented variable: mapred.child.heap.size
> NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
> NUTCH-129 rtf-parser does not work when opened with wordpad files
> and saved
> NUTCH-120 one "bad" link on a page kills parsing
+1
> NUTCH-128 second configuration nodes overwrites first node
> NUTCH-127 uncorrect values using -du, or ls does not return items
> NUTCH-126 Fetching via https does not work with a proxy (patch)
+1
> NUTCH-125 OpenOffice Parser plugin
+1
> NUTCH-110 OpenSearchServlet outputs illegal xml characters
+1
> NUTCH-36 Chinese in Nutch
> NUTCH-123 Cache.jsp some times generate NullPointerException
+1 (may already be fixed)
> NUTCH-39 pagination in search result
> NUTCH-49 Flag for generate to fetch only new pages to complement
> the -refetchonly flag
> NUTCH-94 MapFile.Writer throwing 'File exists error'.
> NUTCH-117 Crawl crashes with java.io.IOException: already exists: C:
> \nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
> NUTCH-122 block numbers need a better random number generator
> NUTCH-82 Nutch Commands should run on Windows without external tools
> NUTCH-121 SegmentReader for mapred
> NUTCH-119 Regexp to extract outlinks incorrect
+1
> NUTCH-118 FAQ link points to invalid URL
> NUTCH-115 jobtracker.jsp shows too much information
> NUTCH-103 Vivisimo like treeview and url redirect
> NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker.
> NUTCH-113 Disable permanent DNS-to-IP caching for JVM 1.4
> NUTCH-111 ndfs.replication is not documented within the nutch-
> default.xml configuration file.
> NUTCH-100 New plugin urlfilter-db
+1
>
> NUTCH-101 RobotRulesParser
> NUTCH-96 MapFile.Writer throws directory exists exception if run
> multiple times in the same JVM or server JVM.
> NUTCH-106 Datanode corruption
> NUTCH-105 Network error during robots.txt fetch causes file to be
> ignored
> NUTCH-104 Nutch query parser does not support CJK bi-gram
> segmentation.
> NUTCH-102 jobtracker does not start when webapps is in src
> NUTCH-95 DeleteDuplicates depends on the order of input segments
> NUTCH-92 DistributedSearch incorrectly scores results
> NUTCH-87 Efficient site-specific crawling for a large number of sites
> NUTCH-91 empty encoding causes exception
+1
>
> NUTCH-90 reduce logging output of IndexSegment
> NUTCH-52 Parser plugin for MS Excel files
> NUTCH-86 LanguageIdentifier API enhancements
> NUTCH-84 Fetcher for constrained crawls
> NUTCH-74 French Analyzer Plugin
+1
>
> NUTCH-83 Release deliverable as zip
> NUTCH-81 Webapp only works when deployed in root
> NUTCH-79 Fault tolerant searching.
> NUTCH-64 no results after a restart of a search--server (without
> tomcat restart)
> NUTCH-76 NDFS DataNode advertises localhost as it's address
> NUTCH-75 Patch for WebDBReader to get more detailed information
> about WebDBs
> NUTCH-73 A page for CSV results
> NUTCH-72 Query basic filter with correction feature
> NUTCH-70 duplicate pages - virtual hosts in db.
> NUTCH-68 A tool to generate arbitrary fetchlists
+1
> NUTCH-62 Add html META tag information into metaData in index-more
> plugin
++1!
> NUTCH-61 Adaptive re-fetch interval. Detecting umodified content
++1! But is it ready to use?
> NUTCH-55 Create dmoz.org search plugin - incorporate the dmoz.org
> title/category/description if available &
> NUTCH-59 meta data support in webdb
> NUTCH-25 needs 'character encoding' detector
> NUTCH-44 too many search results
> NUTCH-42 enhance search.jsp such that it can also returns XML
> NUTCH-50 Benchmarks & Performance goals
> NUTCH-13 If dns points to 127.0.0.1, the url is also crawled
> NUTCH-48 "Did you mean" query enhancement/refignment feature request
+1
> NUTCH-47 Configure host filter to do wildcard prefixes - *.redhat.com
> NUTCH-45 Log corrupt segments in SegmentMergeTool
> NUTCH-26 New Http Authentication mechanism
> NUTCH-24 Cannot handle incorrectly cased Content-Type
> NUTCH-23 content text/xml parser
> NUTCH-18 Windows servers include illegal characters in URLs
> NUTCH-16 boost documents matching a url pattern
> NUTCH-14 NullPointerException NutchBean.getSummary
> NUTCH-12 WebDBReader options to print incoming links
>
Re: vote for issues to fix in 0.7.2
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Marko Bauhardt wrote:
>> NUTCH-141 jobdetails.jsp doesnt work on webbrowser "safari"
>
> +1
> :-)
>
> Marko.
>
I have just fixed NUTCH-141 in all branches, so we do not need to concentrate on
obvious things.
I have one additional point: the majority of the issues people vote for in this
thread are mapred related. I think those voters use the mapred branch, so fixing
them in 0.7.2 would not help them. Please use the JIRA voting feature for such
issues. Here I would like to see a list of things that we think are
important for 0.7 branch users.
Regards
Piotr
Re: vote for issues to fix in 0.7.2
Posted by Marko Bauhardt <mb...@media-style.com>.
> NUTCH-141 jobdetails.jsp doesnt work on webbrowser "safari"
+1
:-)
Marko.