Posted to dev@lucene.apache.org by cm...@apache.org on 2002/06/18 13:39:52 UTC

cvs commit: jakarta-lucene-sandbox/contributions/webcrawler-LARM TODO.txt

cmarschner    2002/06/18 04:39:51

  Modified:    contributions/webcrawler-LARM TODO.txt
  Log:
  see file
  
  Revision  Changes    Path
  1.2       +40 -13    jakarta-lucene-sandbox/contributions/webcrawler-LARM/TODO.txt
  
  Index: TODO.txt
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene-sandbox/contributions/webcrawler-LARM/TODO.txt,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- TODO.txt	1 Jun 2002 18:55:15 -0000	1.1
  +++ TODO.txt	18 Jun 2002 11:39:51 -0000	1.2
  @@ -1,11 +1,39 @@
   
   Todos for 1.0 (not yet ordered in decreasing priority)
   
  -$id: $
  +$Id$
  +
  +-----------------------------------------------------------------------------------------------
  +solved:
  +-----------------------------------------------------------------------------------------------
  +
  +Bugs:
  +	- some relative URLs are not appended appropriately, leading to wrong and growing URLs
  +	  - 301/302 URLs were not updated: the docs were saved under the old URL, which led to
  +	    wrong relative URLs (cmarschner, 2002-06-17)
  +
  +URLs: 
  +	- include a URLNormalizer
  +	  * lowercase host names
  +	  * avoid ambiguities like '%20' / '+'
  +	  * make sure http://host URLs end with "/"
  +	  * avoid host name aliases
  +	    - two host names / one IP address can point to the same web site: www.lmu.de / www.uni-muenchen.de
  +	    - two host names / one IP address can point to different web sites (then other URLs / pages must differ)
  +	      suche.lmu.de / interesse.lmu.de
  +	  * cater for 301/302 result codes
  +	STATUS: seems to be solved except that URL parameters can occur in different orders, which is NOT resolved
  +		host names are resolved by hand, via a synonym in HostManager. (cmarschner, 2002-06-17)
  +		problem: URLMessage size doubles
  +
  +-----------------------------------------------------------------------------------------------
  +remaining:
  +-----------------------------------------------------------------------------------------------
   
   * Bugs
   	- on very fast LAN connections (100MBit), sockets are not freed as fast as allocated
  -	- some relative URLs are not appended appropriately, leading to wrong and growing URLs
  +	  probably this will be solved by changing from HTTPClient.* to Jakarta HTTP client and reusing sockets
  +
   
   * Build
   	- added build.xml, but build.bat and build.sh are still working without ANT. Change that.
  @@ -16,16 +44,6 @@
   * Configuration
   	- move all configuration stuff into a meaningful properties file
   
  -* URLs: 
  -	- include a URLNormalizer
  -	  * lowercase host names
  -	  * avoid ambiguities like '%20' / '+'
  -	  * make sure http://host URLs end with "/"
  -	  * avoid host name aliases
  -	    - two host names / one IP address can point to the same web site: www.lmu.de / www.uni-muenchen.de
  -	    - two host names / one IP address can point to different web sites (then other URLs / pages must differ)
  -	      suche.lmu.de / interesse.lmu.de
  -	  * cater for 301/302 result codes
   
   * Repository
   	- optionally use a database as repository (caches, queues, logs)
  @@ -50,13 +68,22 @@
   * Politeness
   	- add the option to restrict the number of host accesses per hour/minute
   
  +* URL Extraction
  +	- URLs can be encoded in different encoding styles - see http://www.unicode.org/unicode/faq/unicode_web.html
  +
  +* I18N, HTML encoding
  +	- determine document encoding style in content-type, meta tag (http-equiv), or Doctype-tag; adapt URLs to
  +	  encoding style
  +
   * Anchor text extraction
   	  * read until a meaningful end tag, not just the first encountered
   	  * remove entities
   	  * optionally remove Tags, leave ALT attribute
   	  * remove redundant spaces
   
  -
  +* URLNormalizer
  +	* add possibility to add synonyms to top level domains, e.g. "d1.com = d2.com" --> "sub1.d1.com = sub1.d2.com"
  +	* add possibility to detect synonyms automatically, e.g. by comparing IP addresses or file checksums
   
   Nice-to-have:
   
  
  
  

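The URLNormalizer items in the TODO above (lowercase host names, '%20' / '+'
ambiguity, trailing "/" on http://host URLs, hand-maintained host synonyms, and
the still-open parameter ordering problem) could be sketched roughly as below.
This is not the LARM code: the class name UrlNormalizerSketch, the method
addHostSynonym and the alias map are hypothetical, and rewriting '+' as '%20' is
only applied to the query string, on the assumption that '+' means a space there.
Fragments and user info are dropped, which is also an assumption.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; names and behavior are assumptions, not LARM's API. */
final class UrlNormalizerSketch {

    // Hypothetical alias table, maintained by hand: alias host -> canonical host.
    private final Map<String, String> hostSynonyms = new HashMap<>();

    void addHostSynonym(String alias, String canonical) {
        hostSynonyms.put(alias.toLowerCase(Locale.ROOT),
                         canonical.toLowerCase(Locale.ROOT));
    }

    String normalize(String url) {
        // URI.normalize() removes "." and ".." path segments.
        URI uri = URI.create(url).normalize();
        String host = uri.getHost();
        if (uri.getScheme() == null || host == null) {
            return uri.toString(); // relative or opaque URI: leave it alone
        }

        // Lowercase the host name, then map known aliases to one canonical host.
        host = host.toLowerCase(Locale.ROOT);
        host = hostSynonyms.getOrDefault(host, host);

        // Make sure "http://host" ends with "/".
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Treat '+' as '%20' in the query and sort the parameters so that
        // "?b=2&a=1" and "?a=1&b=2" yield the same string.
        String query = uri.getRawQuery();
        if (query != null) {
            String[] params = query.replace("+", "%20").split("&");
            Arrays.sort(params);
            query = String.join("&", params);
        }

        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://").append(host);
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString();
    }
}
```

With a synonym registered from www.uni-muenchen.de to www.lmu.de, both host
names would map to the same normalized URL. The suche.lmu.de / interesse.lmu.de
case (one IP address, different sites) is the reason the sketch relies on an
explicit table rather than on IP comparison.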