Posted to commits@nutch.apache.org by cu...@apache.org on 2005/10/04 23:58:56 UTC

svn commit: r294928 - in /lucene/nutch/branches/mapred: site/tutorial.html site/tutorial.pdf src/site/src/documentation/content/xdocs/tutorial.xml

Author: cutting
Date: Tue Oct  4 14:58:53 2005
New Revision: 294928

URL: http://svn.apache.org/viewcvs?rev=294928&view=rev
Log:
Update tutorial for mapred changes.  Still does not describe mapred or NDFS configuration.

Modified:
    lucene/nutch/branches/mapred/site/tutorial.html
    lucene/nutch/branches/mapred/site/tutorial.pdf
    lucene/nutch/branches/mapred/src/site/src/documentation/content/xdocs/tutorial.xml

Modified: lucene/nutch/branches/mapred/site/tutorial.html
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/mapred/site/tutorial.html?rev=294928&r1=294927&r2=294928&view=diff
==============================================================================
--- lucene/nutch/branches/mapred/site/tutorial.html (original)
+++ lucene/nutch/branches/mapred/site/tutorial.html Tue Oct  4 14:58:53 2005
@@ -276,11 +276,11 @@
 <ol>
 
 
-<li>Create a flat file of root urls.  For example, to crawl the
-<span class="codefrag">nutch</span> site you might start with a file named
-<span class="codefrag">urls</span> containing just the Nutch home page.  All other
-Nutch pages should be reachable from this page.  The <span class="codefrag">urls</span>
-file would thus look like:
+<li>Create a directory with a flat file of root urls.  For example, to
+crawl the <span class="codefrag">nutch</span> site you might start with a file named
+<span class="codefrag">urls/nutch</span> containing the url of just the Nutch home
+page.  All other Nutch pages should be reachable from this page.  The
+<span class="codefrag">urls/nutch</span> file would thus contain:
 <pre class="code">
 http://lucene.apache.org/nutch/
 </pre>
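
For example, assuming a Unix shell, the seed directory described above can be created with something like:

    mkdir urls
    echo 'http://lucene.apache.org/nutch/' > urls/nutch
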
@@ -310,138 +310,152 @@
 <span class="codefrag">-dir</span> <em>dir</em> names the directory to put the crawl in.</li>
 
 <li>
-<span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root
-page that should be crawled.</li>
+<span class="codefrag">-threads</span> <em>threads</em> determines the number of
+threads that will fetch in parallel.</li>
 
 <li>
-<span class="codefrag">-delay</span> <em>delay</em> determines the number of seconds
-between accesses to each host.</li>
+<span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root
+page that should be crawled.</li>
 
 <li>
-<span class="codefrag">-threads</span> <em>threads</em> determines the number of
-threads that will fetch in parallel.</li>
+<span class="codefrag">-topN</span> <em>N</em> determines the maximum number of pages that
+will be retrieved at each level up to the depth.</li>
 
 </ul>
 <p>For example, a typical call might be:</p>
 <pre class="code">
-bin/nutch crawl urls -dir crawl.test -depth 3 &gt;&amp; crawl.log
+bin/nutch crawl urls -dir crawl -depth 3 -topN 50
 </pre>
-<p>Typically one starts testing one's configuration by crawling at low
-depths, and watching the output to check that desired pages are found.
-Once one is more confident of the configuration, then an appropriate
-depth for a full crawl is around 10.</p>
+<p>Typically one starts testing one's configuration by crawling at
+shallow depths, sharply limiting the number of pages fetched at each
+level (<span class="codefrag">-topN</span>), and watching the output to check that
+desired pages are fetched and undesirable pages are not.  Once one is
+confident of the configuration, then an appropriate depth for a full
+crawl is around 10.  The number of pages per level
+(<span class="codefrag">-topN</span>) for a full crawl can be from tens of thousands to
+millions, depending on your resources.</p>
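
As an illustration only (the -topN value here is invented, within the range described above), a full crawl might then be launched with:

    bin/nutch crawl urls -dir crawl -depth 10 -topN 50000
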
 <p>Once crawling has completed, one can skip to the Searching section
 below.</p>
 </div>
 
 
-<a name="N100E4"></a><a name="Whole-web+Crawling"></a>
+<a name="N100EA"></a><a name="Whole-web+Crawling"></a>
 <h2 class="h3">Whole-web Crawling</h2>
 <div class="section">
 <p>Whole-web crawling is designed to handle very large crawls which may
 take weeks to complete, running on multiple machines.</p>
-<a name="N100ED"></a><a name="Whole-web%3A+Concepts"></a>
+<a name="N100F3"></a><a name="Whole-web%3A+Concepts"></a>
 <h3 class="h4">Whole-web: Concepts</h3>
-<p>Nutch data is of two types:</p>
+<p>Nutch data is composed of:</p>
 <ol>
+
   
-<li>The web database.  This contains information about every
-page known to Nutch, and about links between those pages.</li>
+<li>The crawl database, or <em>crawldb</em>.  This contains
+information about every url known to Nutch, including whether it was
+fetched, and, if so, when.</li>
+
   
-<li>A set of segments.  Each segment is a set of pages that are
-fetched and indexed as a unit.  Segment data consists of the
-following types:</li>
+<li>The link database, or <em>linkdb</em>.  This contains the list
+of known links to each url, including both the source url and anchor
+text of the link.</li>
+
+  
+<li>A set of <em>segments</em>.  Each segment is a set of urls that are
+fetched as a unit.  Segments are directories with the following
+subdirectories:</li>
+
   
 <li>
 <ul>
     
-<li>a <em>fetchlist</em> is a file
-that names a set of pages to be fetched</li>
+<li>a <em>crawl_generate</em> names a set of urls to be fetched</li>
+    
+<li>a <em>crawl_fetch</em> contains the status of fetching each url</li>
+    
+<li>a <em>content</em> contains the content of each url</li>
     
-<li>the<em> fetcher output</em> is a
-set of files containing the fetched pages</li>
+<li>a <em>parse_text</em> contains the parsed text of each url</li>
     
-<li>the <em>index </em>is a
-Lucene-format index of the fetcher output.</li>
+<li>a <em>parse_data</em> contains outlinks and metadata parsed
+    from each url</li>
+    
+<li>a <em>crawl_parse</em> contains the outlink urls, used to
+    update the crawldb</li>
   
 </ul>
 </li>
 
+
+<li>The <em>indexes</em> are Lucene-format indexes.</li>
+
+
 </ol>
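
Putting these pieces together, and using the crawl/ and indexes/ directory names adopted later in this tutorial (the segment timestamp below is invented), the on-disk layout ends up looking roughly like:

    crawl/crawldb/
    crawl/linkdb/
    crawl/segments/20051004143000/crawl_generate/
    crawl/segments/20051004143000/crawl_fetch/
    crawl/segments/20051004143000/content/
    crawl/segments/20051004143000/parse_text/
    crawl/segments/20051004143000/parse_data/
    crawl/segments/20051004143000/crawl_parse/
    indexes/
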
-<p>In the following examples we will keep our web database in a directory
-named <span class="codefrag">db</span> and our segments
-in a directory named <span class="codefrag">segments</span>:</p>
-<pre class="code">mkdir db
-mkdir segments</pre>
-<a name="N10123"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a>
+<a name="N10140"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a>
 <h3 class="h4">Whole-web: Boostrapping the Web Database</h3>
-<p>The admin tool is used to create a new, empty database:</p>
-<pre class="code">bin/nutch admin db -create</pre>
-<p>The <em>injector</em> adds urls into the database.  Let's inject
-URLs from the <a href="http://dmoz.org/">DMOZ</a> Open
-Directory. First we must download and uncompress the file listing all
-of the DMOZ pages.  (This is a 200+Mb file, so this will take a few
-minutes.)</p>
+<p>The <em>injector</em> adds urls to the crawldb.  Let's inject URLs
+from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we
+must download and uncompress the file listing all of the DMOZ pages.
+(This is a 200+Mb file, so this will take a few minutes.)</p>
 <pre class="code">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
 gunzip content.rdf.u8.gz</pre>
-<p>Next we inject a random subset of these pages into the web database.
+<p>Next we select a random subset of these pages.
  (We use a random subset so that everyone who runs this tutorial
 doesn't hammer the same sites.)  DMOZ contains around three million
-URLs.  We inject one out of every 3000, so that we end up with
+URLs.  We select one out of every 5000, so that we end up with
 around 1000 URLs:</p>
-<pre class="code">bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</pre>
-<p>This also takes a few minutes, as it must parse the full file.</p>
+<pre class="code">mkdir dmoz
+bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 &gt; dmoz/urls</pre>
+<p>The parser also takes a few minutes, as it must parse the full
+file.  Finally, we initialize the crawl db with the selected urls.</p>
+<pre class="code">bin/nutch inject crawl/crawldb dmoz</pre>
 <p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
-<a name="N1014C"></a><a name="Whole-web%3A+Fetching"></a>
+<a name="N10166"></a><a name="Whole-web%3A+Fetching"></a>
 <h3 class="h4">Whole-web: Fetching</h3>
 <p>To fetch, we first generate a fetchlist from the database:</p>
-<pre class="code">bin/nutch generate db segments
+<pre class="code">bin/nutch generate crawl/crawldb crawl/segments
 </pre>
 <p>This generates a fetchlist for all of the pages due to be fetched.
  The fetchlist is placed in a newly created segment directory.
  The segment directory is named by the time it's created.  We
 save the name of this segment in the shell variable <span class="codefrag">s1</span>:</p>
-<pre class="code">s1=`ls -d segments/2* | tail -1`
+<pre class="code">s1=`ls -d crawl/segments/2* | tail -1`
 echo $s1
 </pre>
 <p>Now we run the fetcher on this segment with:</p>
 <pre class="code">bin/nutch fetch $s1</pre>
 <p>When this is complete, we update the database with the results of the
 fetch:</p>
-<pre class="code">bin/nutch updatedb db $s1</pre>
+<pre class="code">bin/nutch updatedb crawl/crawldb $s1</pre>
 <p>Now the database has entries for all of the pages referenced by the
 initial set.</p>
 <p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
-<pre class="code">bin/nutch generate db segments -topN 1000
-s2=`ls -d segments/2* | tail -1`
+<pre class="code">bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s2=`ls -d crawl/segments/2* | tail -1`
 echo $s2
 
 bin/nutch fetch $s2
-bin/nutch updatedb db $s2
+bin/nutch updatedb crawl/crawldb $s2
 </pre>
 <p>Let's fetch one more round:</p>
 <pre class="code">
-bin/nutch generate db segments -topN 1000
-s3=`ls -d segments/2* | tail -1`
+bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s3=`ls -d crawl/segments/2* | tail -1`
 echo $s3
 
 bin/nutch fetch $s3
-bin/nutch updatedb db $s3
+bin/nutch updatedb crawl/crawldb $s3
 </pre>
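
These rounds follow the same generate/fetch/update pattern, so they could equally be scripted; a minimal sketch assuming only the commands above (note that the very first round above ran generate without -topN):

    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      s=`ls -d crawl/segments/2* | tail -1`
      bin/nutch fetch $s
      bin/nutch updatedb crawl/crawldb $s
    done
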
 <p>By this point we've fetched a few thousand pages.  Let's index
 them!</p>
-<a name="N10186"></a><a name="Whole-web%3A+Indexing"></a>
+<a name="N101A0"></a><a name="Whole-web%3A+Indexing"></a>
 <h3 class="h4">Whole-web: Indexing</h3>
-<p>To index each segment we use the <span class="codefrag">index</span>
-command, as follows:</p>
-<pre class="code">bin/nutch index $s1
-bin/nutch index $s2
-bin/nutch index $s3</pre>
-<p>Then, before we can search a set of segments, we need to delete
-duplicate pages.  This is done with:</p>
-<pre class="code">bin/nutch dedup segments dedup.tmp</pre>
+<p>Before indexing we first invert all of the links, so that we may
+index incoming anchor text with the pages.</p>
+<pre class="code">bin/nutch invertlinks crawl/linkdb crawl/segments</pre>
+<p>To index the segments we use the <span class="codefrag">index</span> command, as follows:</p>
+<pre class="code">bin/nutch index indexes crawl/linkdb crawl/segments/*</pre>
 <p>Now we're ready to search!</p>
-<a name="N101A1"></a><a name="Searching"></a>
+<a name="N101C1"></a><a name="Searching"></a>
 <h3 class="h4">Searching</h3>
 <p>To search you need to put the nutch war file into your servlet
 container.  (If instead of downloading a Nutch release you checked the
@@ -452,10 +466,8 @@
 <pre class="code">rm -rf ~/local/tomcat/webapps/ROOT*
 cp nutch*.war ~/local/tomcat/webapps/ROOT.war
 </pre>
-<p>The webapp finds its indexes in <span class="codefrag">./segments</span>, relative
-to where you start Tomcat, so, if you've done intranet crawling,
-connect to your crawl directory, or, if you've done whole-web
-crawling, don't change directories, and give the command:</p>
+<p>The webapp finds its indexes in <span class="codefrag">./crawl</span>, relative
+to where you start Tomcat, so use a command like:</p>
 <pre class="code">~/local/tomcat/bin/catalina.sh start
 </pre>
 <p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>

Modified: lucene/nutch/branches/mapred/site/tutorial.pdf
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/mapred/site/tutorial.pdf?rev=294928&r1=294927&r2=294928&view=diff
==============================================================================
--- lucene/nutch/branches/mapred/site/tutorial.pdf (original)
+++ lucene/nutch/branches/mapred/site/tutorial.pdf Tue Oct  4 14:58:53 2005
@@ -146,10 +146,10 @@
 >>
 endobj
 30 0 obj
-<< /Length 2578 /Filter [ /ASCII85Decode /FlateDecode ]
+<< /Length 2473 /Filter [ /ASCII85Decode /FlateDecode ]
  >>
 stream
-Gat%%=`<%a...@FG>=c_O=e69SOql<?FF)>.fK8)3s(O%0RPdR/Bf2pVO/!i+D>;tQPb>We^Y63ui+W7!(X7+Xk&kmi!_Z4'V68\d+7%R2'F$@F=gbPE(JHps#39HdV!HdM+ihbt]Xm!shH!77^IS.'/-ciuo13q@>2o0Q\VPY.hJN+7=0@Ph+iQ9UE8?`E#];J[R/[A1gJJ$9Ad*6=.;N8DgQCVk\_Hr'=12@uHT/-lB_-I3iI?ln3qK6U:J6(BQn'#9ZN9/DI15HEUG8c)OuJ9\^KWC%9nP1:R:27A9h.s36jFZbYg..An\l_Nt8!<BKe3dqa1`j-U]jK5@"^6ei;;&+ZBfN_HPJJmq]=pWmo-/VV.1<rLjA\lFh/%oOCVj[eLh[0-<k@CsKJNUnZH]g5o-pW+aP'FL$!ad,AK!XNkr&3dWM6QC>43=@YP9%p^Jp,cV-:s:b>TYe*,\Wo[\,h%u*XRujO>SuM-j-KIpYb+PMMIV8*J")9UY_KBM(RuOcP,uWB1]ri"ndBkU_(sU8k_c`)hnU;G!E[,/T8.,/2UNTHQ<[TeMu);D)`I%XcGap$mXYq(qb^3[\K1>ed>*DL7M*Ph>sVn',DER5XVDrhQ^p#0;K&rSmXhTkh6.F2;+;9\JS%n"T[V,-Ju-:=\N1A"a%W;o/N[bGuu?u):l%9@WYLM:PB%!So[r!9JN:ZErTm2A6;+XFU,&C$S92Y`4oRmTkhOScaW?!X'l[5P,>MOc*RHlFC2_Xd5e\EH/=?RS&rXG_6WAjGL,-P/DC/F-dVf5Z%QFbFRp.H85!3,#:L(55@@#-7,hdOoWU@i=G;J!8/_l>D>ObL.o+%9@l*a(*OfH@GXJI2R;6?/9_c@aIXj0!6UBnplT-^t'WKbYlLn/;n*)s_QiL4jOab*)QQol^ChehuRI@qRh.H)LHD1o2A478l/Cr6XQ%XHI0!\c/:YY8rXL9F"28+Nb*@O^-mJ'2=PUVu!,>ZVF:n5-b$(=RbA(fPU8tXg@3%M6@QD(?X(_8dm:q_$XicJMiYk/Z4a7o]<);:Fu-HsYu7P?0nfo)"fCFo+DHedU3@Z7,S(!7oIp@4j;EdfSceGpr_#MmE:"[a1[G^)W<?KAgU&`YYS6HNqF(1M9h(kG1^2`Q/i(]jQl\au]7`R,CeP7Um.O->D8<rGEfnt\OU^W$3$+(:TcGbpr[deNoUH,.[@>BOP7!SQu%7&YpHs(_=;bJoH*]us01RrJ!5K`QP&hqb)_0FLf?LP^b$OMLTj'm4W]&fb#l)p6Zo/rVT_ar\`8CAf#([>:[CQZ[AEm+u-iaOIs9%.js9Tnlg-[u&B)-_=rFFH/VcDPM07S8;Q2q2.aJfCO[:JU.BDU0P2>D:<]b2j@1'>cD%eW'gh:N1kq^M4a"k_B>W5HB;H'Z`VrUCBJ1/\Zdh[6+G(DS?kBg/Ab.V+mn3(6<BfR3gb6&Ml5jD%H'oae7`Kql=qGJ2>D/#3<K^Uon`At8%'u$5OG-cAL6:L(81&b/Tsi2?pRb6h#0UsI_*/eE)5n?B)`4tkV-%ldV+hNh3t6#_X^r1E"DLI9eLXaY1#MJM.S1Rd"Xrf-)dOrCUsa..sF_$reqA1Zg[2A1t<)G(dN3Ed0_FMhs^-BDL7q2c';96<8[=a#;>5"6NS+N^4-!'qM++X]L1=Pf7W<)8@,H\rR3eoQaVBhl(:2H]r:!-f=netWD*%&TVGVI=7&@rGCc?"6T'n)$j?!J_DC/K~>
+Gat%%D/\/u%/ui*+nWrU+?AcdV;K_(Y?9OrFZ0`P:rr@`;BkA4e#@*]r@$pkj4Pg76\-@XO$)tVq2eQ^0!?%/,0TM5D(W4u[^E_f/0"RK<ar...@Te>dA#lnT_H>%(?Z:M&C)(?SugZ.]9^Y>FXF$s9Ytd)I=0F%AC,pt7g@p@mV<gt^E=],R$MS6c?9sNM+AI1(O$WBJY/,fY6J;iL1ieKcS6t3].HqAeM]'mem85q?H\62`T[iOMZ-/i]"mY#M1e(Km/4X>fL9lZgpGRpnW?Y3A#tD])88a'Yft+N8-tbH?7,Msp_#hL#ck>.-i)4eM!+X2brrXNLI.(`QL&s.RTK+J&6,.Zqp76BCXp6(R'G*;<G1MSP9iA[TFABK'.VK2,&s:UJ;0MO[<R%$l#YlG939*[<elRDd5C6bjN1!/L.T22]t,UI!O0k3jHp]0e6q2*Vc0/SX2nIT^'sG_h.4t"_sN$:HbF^6%^\1hb]Y06]ncetE7%aC4UUU3I]tEt1M6j"5`Wu5?j_#j&,ruP;L)GZ*e>sh45(:'!kDRhrHE%]IQL\^9W8%$"UXtVQbOp:qc/*c*3XFg\^G\hD;E8Dn0T%V,_:CYg\[%cY?:bA-TW)eh7J5j1u!2HSsC8*FSo'iHN(diU6BB9puY<.];Il=m)?tr)'=UF)tZgZhR]?8!P&.0pZn^i~>
 endstream
 endobj
 31 0 obj
@@ -250,10 +250,10 @@
 >>
 endobj
 40 0 obj
-<< /Length 2213 /Filter [ /ASCII85Decode /FlateDecode ]
+<< /Length 2366 /Filter [ /ASCII85Decode /FlateDecode ]
  >>
 stream
-Gau0E9on$e&A@sBkgX4nQn)%kHROsYfci`0Y8HIHRf]rc>Ja",.T18p^V5p\=[SU?]d\iPLmEeV]U=':aF1Bth;8$pK9:"fp++lYX;oELIu`XCOo-s'!u0CZ[4/1Q018ENX>T]gq;Oh7$8>rdnQG`?@0OcqEW!bDEH'-H$LHF?]W9./8)sB9e`21/,t4*J=5tseIUNDKR+KEc+qlcemaf.X5$OKe;a`_aGpc#4`4e%;]oPsNSrDJ]@oAFp8eJ*7-Th@P%Qj+.:-n_kB'4G?Je:d]P6:i#8s6f8\TaOnh^]YbT&YJZnAE^NX56]47Y0[3m=s1eN'_@2eCq'S?bP,@([[';ap_e@q.cflr07Udh;-pO]_FnGs4HZ4o?2mooPjDBo_isLq0kfDO;IT8DpjT^nE)aPJo[2:$1OON:ZJ+I[8/I:lYI\`*25D&oNb`;8iMRP/^_Z_[(9'9K=Ka"+*kqgZ230C(_jsd;UhM<ib...@2geV9gVkAR>.H+(<eIN)oV:om7h2FEr'VnDFAqEh$=ukEQ*Lp`,aB6R1''nrgSMLA=//DAUC6f*`r?XeoEkA`8L3;*><hCoE",4ZAdOh,CTmS,28!5_dPEr]R>k2Mbh$I,7WdtR[2&GIs`!]k(8hpa&*`UQ3(nm[h8=.-_4bRqsh8B:pjJ"'lFJGp2Z5jO`O@d$3"9>GJL>fjY`Je*I(]^ld#;)\WGW2Jgm1ac5>XmHXOSq;KV1f0$cH4OH[%WIs@Rf+/c\r"/$e`AYi%CL.97[e-\q<pQ5m`<;M'JjIl,dg53S,S_ktNIZO/%qZE=l8*QBh+o>eLuYlWU*$6fq%_##X-/U".Z6+<'TcC@n-[4frDC:q[8*WIQtF:0WjLj>kJIM1WDp(b,63hBQDY:R*J-aS>u,VT97HJ15J8rM[t3fJ*jfY;<d_r[9kc6$=(p-am0'>;chJ^t+=I+%f8Ea-#A2m8G9u8&-d(K'C>Dq"Wln;q%6gW=cODf14,J^Dhl;%bRCf>#LGc^a,"GdsAY(qKq^\&-SfrVB?"LrbeLb*&+R:G+;AZLH/04;EPQuo=<4%A/TPe4F#'gU%^Bg!_MYEnRa-.'?Xme9OMHSkaX?\ZUU-$\u*muL'jkeBBgt?-.p/hREONjjIC2Y<f;`^I"31CT<"u(<Lq822rEY'V1Q;62m<],PTHf.I)`Fm/!DK$'I)Ehb0ol48r8jh/UIatQYNJfT[3NjX3;HTdH*kjl:K";r\MZ<.Pk0?pn"/`h,^@tY&__gH1GUjEk4H_m6na6M_<VhgXc$_6\]m3&VC[0ZhmSE*Lfu1GWA%K$\`m]-hUmd)L,X_1Y<ZS!i75h"I??75>E-)+Xu,[*rI#;]QF(%i@14#A/03_+DjXu^N_0EZL?P15K(c#71Gd0hD1M&PuqeSC>.%#IAT4HrmGJ.1a.I<_kln=&LQjC7JXni$hgD.)t"I]Zm=<c"k.qhj3r!Q$e\S<n!7o"T[D*RP)b2XC2:j@8LDEP?'$Xj'M\7BdtRC'ac<2XCHFANh8=3FK0:0aK6NNe>HPXc087o7<8SY0-m:;T:cQ7OYCI^Ma;Ho2)R@[_CZ+h]aM,JV+Ppg)\,TijSZc.(der`<WOK`F)a\#]R^bIqdBD8.o1#0('@S#MK-RES*l.S2oh-q]R=@?Z@]kejXb5f"a`VlK?#<.rjpo4MY6_9?#XFq.erFeQ@,Yk9P?Ajhp)Ve2f':<OAO@QM0%LJEikNh1fgG&c<oI8g"r-CSPVRu4409isJI>f))?BA6'ZdVp2M50?Sq239\9L(S"Ss[qXe$<ZF'>*K1cqb3*_6r.Tf8.f\)@F%kj!4DR4o!G?t*Z$j6(#E9=e/W5G<UC.af:dUN1I(^mOTR/q+6ehOcm'I"C*K47td8>/B%Ed)aQLU!#oU%!Q].OX1Ao$sr-?lsnl1A@iC-08a;Y-5ILIGFfX*o`3AL*TmCk$+KqgnE/;k0?cCDG[`S:2)_d`.$#EE1mMEVeE':IpRRH*p0?Dp=9itQi7K1[rVj+<Q8*'V$MMA;_+\d'V0$kW`I@,aN*=DY)mts1N?T?g55YYVn"Vl3X[mog/)[n<HE)`.%lMr@p761KC#0q%]m*J.fA41q`plR-(WX78X'b;LF'>Ke&27$OTEa6+*/l>Eop>2t;PNk<d^,VG~>
+GatU5D/\/e&H88.Tl3T8Qk?=lHqUefBS!&8C1%nD%\PE-&gE1$7+KD.rM]IcX;;gCJ\lMR8+PV[cQ1@#f2m/<Y_...@fXuk>@Us;',3N;V4G:qe!C:p!l9.3q>t'*>rJVVdHWrK+],kRVA!*>%%q%7*Ct+^&%]mrOP1e4rSCOLPSt!o]Nsk+k$8"]?U"-])#K8H/ILr2J@1'Y!;g>@L-20K1HsPlQFF1ogTO$6*cCU+.SoLDuDqU.%Ilaf?&prE]9S&X1LFY.Lk,o[HnXPdAj5O#*n+L%Ru1ClIc@c+=KE#,,5GXp\e0C'M/oaLP`o3!`kBLY[4Q(CZ'NT0?K5?B@8_hT4TcfU`+uC,%Eg0Ai@8n-H`nm!Nm<bhnW)R'fcQM,%p(ah!m@il3]6Zq&ld)6Q7q(gI(R3KKR@PLiqT@LF8aY@7"qnAOD82-O;T=V1ma.]CNk@Z`lqF`'^\B!=_N87nUEE<C8JiFZZ3>J!dL3i#.=hW,QIU+I>5tBFu`M7f$[m8IMkW8d(uTj`>Co0!\KXU!\\OlH[btB/6U0;H9Vk98[u;5^H=L@C'GC"U-$KqR'sF(I/Z8oJE=?`1f\*9ogFk,$[i<W_-&QQ?K+?.&rSiLh`]RN6op/O=P)f`@+FXE(@kmW%s@G!+OMPF?jD"T!6Bt(D+"WS&0WA)Ri't1Og9Z7n2'@BY-N8=+[?rF'tK_d,]I1hW5)"hA(Z3qSr.bN/RHn5*peka_$?A`^CTJ+0ZB1L5\IU7rmU6m:G!:V2KG#V,Tl6M1`e^h_]m4/sM\:G^i@4DdeNsnBc0"#R&(RZ.NeLK98eHc)f.7WjZY(@NsQHl9fC*j=Vlr$*eVAQ^a["Jb"7hVjiOBXUqjL"8o-1D*,,sV2E-SQfk_QZE=4e5`Xeq`@>OJ+N7G8PR9b&rX&M9/N[X6ZN9X4Is!TeLT1Gs2uGYn9Ra%daO5+HV^sSc$X)lEWX$Lb[<"W=<LMd>BPE6klPU!aA<l2CY.ACKKLX*bLR]9"DI@_)pZmsaX\XSb$_`juHk]VJ+JW48,-cKGmN<?1b@T)\:Eld&W2Hl:SJOlZN,Bk`T"BX`:g'fpZGPjFaB+u[67'JS#4f<TEEB]0<8&)KeTYC&cNZS2Bu\UQPR2OpWBq$sMRP73^s/,11uebX;^tg4=>*+[#QU0E]G?%64f$I&Oq9%B6;NK>id<0XjO="UBQNc0Z.m)6K%Y0@hD"^=;2\(RjH@AV/O:S>ANDI$0t3q9R\\PGaTNsV0i_f^cYpGra3.R;a#fTJVH&9,?7"*_30iHQ.Y1H/?(jA&PpfQ"b$W#,o)Dc\Z[j+Xnqj-=$7=kAH&%k68W:aj?.5__AccTE)=WHaNbmI"p"5gW4u&3U`7soIio_"1(iQ6S*KmDi<Zm5"L$T%scanpELZ-e!^.JsR*"u9G,,?m-o#TJ,J:FJm+D'?8kA9U4Y"Jbu-S?STVNFc]Q<2cRE3a[SLjWaE<dSJ(m++GnjH3*LVIJtA=M'i=i/jgeo3<])2P1=;IU2'GGn.$9mh\%mfQ@aC)h3+Ch-J=88CoA]I\N2+gsR2!!hX6lOKsWa*@hLs55jl=ElobdB<,'*<Z%W%5sr'M,*b#,P]h/l@0OJHAh.`4Q\6HE)0Rp23@6MGqh2*5WqW9mo,'/11-cI\)u3$#lSu:QRcmu=r)Qs0\VGcR*BEul^P-&7,<SSY_/&pWDI@"EVWtp<3DU**`9IDA%,(m\TV.aH5e4"RI&Y]IP"4O`(aG,iVgIeC[N&=JatE"u3Pe^l(8],*J2N$gY7p(\l'a0P`fP:\Bk*Z!)["1;"&,M#2=,Q06`PZ,<)5M9`3KfC;d8V1Vh]&fG(:c4b)J7g-]6B@I=pjQltn;d,K<e7Oq>>=^31l)]JdeXj1"-k1:mA!ZV*3:>A^VV>Sd!pdUUW6F\`&(%W\YJV4a0ia^e*"[ge#tDKm"D7^L`c:%PGmbCCP^CK%o_>*p*#N>@kKhR9es"4cK08"RG+D-Nh0[36=R7mtt2IamRZ4a<oS=6[b4q-G&1X/Nf/:qeri"G))+?$4bOC@*""=jF;pI</2Yp)g0-'FF~>
 endstream
 endobj
 41 0 obj
@@ -265,10 +265,10 @@
 >>
 endobj
 42 0 obj
-<< /Length 2071 /Filter [ /ASCII85Decode /FlateDecode ]
+<< /Length 2365 /Filter [ /ASCII85Decode /FlateDecode ]
  >>
 stream
-Gau0DD/\/e&H;*)Taq6oC^htRUg=Dp_m1luL,Rjk%gBBM-uIA?<`r...@E5X>S0i:D7(/ct&'-heOL"2bCkYH`;PWS(tLUh>$!,G;/cPkQh5BBEp:i/Ed@g34taS#9G+J;MNMDk^\T@#=[ZKi]p`Ke.5$lc$$C-)#Gc0npYV'S@jDq::M#[CeFsM33l0JS%P)*Ds,RHGFXOA4m(<a,RC]4lKF'WQAZ!!t.9O8's"RYXbRGS+cEU5<+W+k#9-!fOH)F&I&9^c2Lu5/^I%#WjK24SXd,k3S?@jZT3OB7KhaIbtZ3+ic/k\9k9;.4-.BERqK@Y+Z*c`E.KUMN]i]'!chVBG>ORb#%nj3o3<<kXC;;aD*Bd)L92q'\\N1)!10SDAJD(cn.:P6.B45sGmd40h2X$Ao\u=Np(M':1T8:V_YJi'ZZUk8G+oF>^"PrY@P/na@."9%$+RIl@&ET%HEa[B8%*/b]]PKe_YC[.d([`QEDhdVd*,mbnmLt8oWRHbr%gP_ro&_*;VI%V^]].-T+24FFN=^K^AZ^/_#&_0SA%qjlga(Dj!/YD.6:$bRR.=+9d6dU9eEqoYs7d/8^H`uq]:S^pdiea[aS7r]/2s8OCr4dU+L(l7R@AI`KSEoirSMRPA@)oe(hnj$)8Y^f#h_YIVrf_<7T,h_3uS!$C7q&fbsm.1CWVE>,.RhN8.QbVXHZCh!"":V`kDm]C0rEj'mb8cNsdA%%.F4O#V*<4$ipXRGd)sn&nq7QE!p;i[b.,ZD?&`2X$fnN(nq\pdT>6LN,1PndK`a``0=1R(Tr/(,/nN+*dF,ZZEp-BVGMQd+B\MLd:QQ#+R*c;rU%*PP/MoB&/J?3fY]d[8:f>4dCBo1R1lK6_fX`]mW4l7g[$1LbLc0!q73Ne@k64e.r<+s#bg](gZfCE<@$iqkuc`.,(%>LR?G`$\_5F"P#2Xs&ad.:On&e3f_bX,74bC"^I]B7k#!rj9m>g*W:NbI9o%#a_G+BZL0h@`t]l6(fu-T[9J#[a9>nbc^4<*QKZRf4#b8BaYSO/mF$BkkjGV#>#hT8+8[I*gu$16#s\~>
+GatU5gQ(#H&:O:Skb18)2G8rBWCqaWBUWgh,.An'"DG88BUcH\aHA&NEAm%$GB=q0.G:1ZJ;l!naI0q%1PbVOX4dDK30Y$,\Q[<aB/?<C$id.mh%1Y(_C*(@SZA=Qf>CUYX/4,4=;?me$RT,'iSbs2gL6neC)F%Y,fpO9\euP(jT<]tjH9&CF7gLI5^OTaoqUf#T/.+6mIHe>YD]1->BdVJGj"KP@ff].ZrpT+f)Dq^a,.i'dsd_.6h4Z\;O'nW-Iueu^_b5%:?;u',S:5Q%te[]HnP!n=m0\s]YLHq^"fVfld)^.4dWDM`/j;Tq!7%=J<)f*7I6=SQ>.;8`<\1*/Tk4K6)`Bk32>PTCOfq+SHd$G:[$HFnmA5>.dWh^F8sm/,4'bZ>MNj\l$=Vf:R'L\0CqMYd-AebTO@hdVW!@klZlRqbBek\%pD4MnF$sr@G_4$P$1VDRN>ZS8"Go+M:56[bu5>N:2/B41ks48cZXM%,7*^;isbf3a(r.2i<V!3dj/V30rs&=1c-Y5MI!c"eb.Lb6^>Qi`Mpj#I#eD9it/+-L[b+9]rneY$:bjDh`!.HFJZ]NjaMRp?P&ibVO>Y*A1u(O[$lnqQL?(kQZhX;KE-2D(9nYgO0Yog1,;PnI[_YKih.&_L1a;gXO+N5mu:4\5lG"@L;>\(nNp!VUiD%`,SK`#+=XFD`"m@h:G!'RO;,'>3/3u/XU:8,G&H!CKTLMRoRd3T?[4Gds*4F8(e\2ka85,Pm,(t+X7Q%q1ZNAl7G)o98,s.g*:?7K;SR^E3%bM+i#J82ZeoCkY<(usj7HajZ_4Op]mTck].2BL*+Vu(5C3)#iE#:CPPKU6bJ',jhor8Gh't,'%TWs[7^)P973lQCgC8c$1<p71d/i4</+GR!m#WQ^6qM;2b;*@8]7$i>Z%B9A'/W>@66"F>8Rh-C`ZtgD,A7R:72PT&DX)o@N37`Q0rBY1f.knUiT7pu(_s^#TODY-B^E?sT9OKt[5oFHj0[=oA\X!0lU\rC%=efAVrtB5^6"J-#7dr>05*`Tk2aZf1jkd"7Sj@Bl-'5q3&%\Yjfk_LX7+U\Zu!ldCXP!G#4kmWOQq`Z,b(u<]V-=&W1A-2VI/B^0_<p*UIBrZ.*5f-H`mjcJb`EH`(4NoD%9ghfaR&.;4.oI.L`_+Rk=*V[q0"*bYnn<Cr$QHI9-/<Q7=DdSkjS9`*nSda@ij<A,b_u3fZeM_,[@ioDr-;.sF(?O3)Z,lC_#0n^:)lXFa1&+S2\QV*Js?L!'at8"?F,GaV'((N3<EES$]4q&2Gm<(>AOL%I&5Y3V,1ZQ.qo12>Sc-TsXPhhHZu>%ZZ;EaG>)8)^PqF-%jq4=e`D"];JJa5&+&UCZTkA[`+r0+&!n>+7sNcClH4!k,LhJjAB<1p%tp2l,lNj;CYcV@oD4#rCUsk8/PMZk#QW90<!ll;e`7:0Y1Whrft=QnBO'SEIDU>7Ac:W%SfokFl)6gOZJ3,Pjsd:_7orVV*G?HUB?(gSh]Cd?[_pLc:C^XGkV16c-lNDpN9O;ac:]^skH%b/@bJcXqU(ZOb7XC.H?JK_@[unY'qk,kTFAZc>)"@[rq$ZKhcu"ZGf[(Dc:s1P\9]6s]E+%X,e"o!?"A*3dRlod]iCb(M;uDK$+8e,HMkY$g9(F7A6&K2i4"7GSQn4)r3Klu-pA!Fu50<%i&d&*6jZr<S&-VX,:?l)^9Yo2L<];S1iCKC]F&o0LNGr3#s.4s_B8U\mOM9Ja&$>c>mWlK?8=a1*7#cp;N)J8W?2@Wi/gAe`4*e9<L0&CMjW5L\5TLhfJ&[Q'#:g*^otq2ZCHD<6e'N;+D[WB):q!%`1;NAB)%NZtbKOKQf*3+cUn^Ucj'$SH<dOOZ(O@'O^/D;?PiE8`(X]5S4uACN,>XjiGZk%8,\O(-)dRgii-0Rl8Z:nrqZ\3#M,f^&o>(n?<irsLm(J'SX^J(ru,*3UtE!e0=M'6Bj<G*c#/rUFBG/($qXeQ<I1g%WBcrs^&92(-J:I^Er?NAqNO>uN?@,=$&7h'>1DFV)DE3'>U3i%-T08a9hTFWjE(Ak1D9cJ&jJ\0MW0Y.<'V#kuATDYm)I?/X]CiDGq)>;[<Mr`i`M,YG,"`\r,>GtK.q0B]>`'0YnBbON$3M7j-amHEuj6WrepN\/mB;'.a*UOIuBKcTOfLMlIhTbDqi[*-*`m8>J@lcAtP0#"IB[,JOr:ofT$"lWYh\*ei0^2+Q;,nX+/#laGLp"hs#)j]O:h%goqKIl$_s4Sq*M1"98n4'H^J'%CCH+_hq@BOAWikpS_Xr'WFRqW'kf>k%&r%;4>6[RSC4LqPHUss@d-HN9f_Te`[]0Kj0a.UaG*s(=!#OBYF"9~>
 endstream
 endobj
 43 0 obj
@@ -288,7 +288,7 @@
 45 0 obj
 << /Type /Annot
 /Subtype /Link
-/Rect [ 415.788 579.967 451.116 567.967 ]
+/Rect [ 403.788 525.147 439.116 513.147 ]
 /C [ 0 0 0 ]
 /Border [ 0 0 0 ]
 /A << /URI (http://dmoz.org/)
@@ -297,10 +297,10 @@
 >>
 endobj
 46 0 obj
-<< /Length 1777 /Filter [ /ASCII85Decode /FlateDecode ]
+<< /Length 1818 /Filter [ /ASCII85Decode /FlateDecode ]
  >>
 stream
-Gat%$gQ(#H&:O:SW'Go.JmFh@86m"p6W,a,Jn!2@NV9%p&DEjqTsElS<87...@GF>.9)b]qoq%pN=5FZ85Z@8G&JBeBiUUnb&Yqin>AWRK!_S2XT<%>Drq!Do^^#.8f9.@fRIA]J;#k=HnXN26[`Rc4Wf1W;oc;a&"jo[(?XN(2"m')*!!6ia6)E/\"dDtLT0WQ*\`o-UkbP;uP"<u*cT+ZJCb(G8HkRaS>O:F`L'^YjSF0e6.r/Zb_qo6&(t8Rd1J44I'@VIaM5W#@DZdlVKf"G7.;!-(:k&Doc&M#)0^Ij=P5M;\?ki'"NRFi),;4$#&D5sB+eh;?Q5&@\ODpCWDK]Q41)BAt+O%3f4.B3XK17T]Di!qL=b63*YN+,#Xf,`J1^=#E?EjoVi&e&l>p6J5q8&h`LGY5>$#EGP(#)#Kl/?W9\>es=77G'j9LGpjp>J4YPZOBt>9he@NrqkkXoD!dn]1RDo;ED#>3lM>=DGRYl8T-)>36K:OZhUZ&FTWacR+L-)cC(H<k&tO0Y@HrLSW;R5O^;[V~>
+Gau`T>Ar7S'Roe[d!u.2K8(^%;3q90BSNqI0Vho_C_,i^dP>YIPsnKCis$@ogNR`I%rKF3?8`KgF_L;M5CVclGI,E`Sa[IDe=/$;e6>Y78)YQ]&diEJi$>+L2qrS,]Xl_lfA6rQoae7M>$58WNL\0!XfQ)F"]^pF;eTU=9.S(G[t7qHR+TOrekiBC.TMV2KjC04NU4oIb[_>hWK\R0]g'3`:C#kTN!+dlmZG)W1?f_QD]]@pa$s)KL`G[#b^.2pUUN0^WlqJGfgS@Z4\3A*kOf@la^5]RP(9eCC=YV_T<r$GT7gOtm(3i9@A-aPDa*u>E`^jSnJH-/m0:m'ij1Q:GudUs>oRBH)uD-cW)=OF=41"7o4df?0gAq2U9Eq[%0Urp`AY;Y)UOQKU336jhgNeTMT(sN.DN;UrG<X9b]:(_I3(-mP)tKq26#$3[Mur_pu$lT,]PZ0WZE9qP_9XBj8%PKiDcf:`#e$pWlIW_2'j*56&4cuQ+<nadB!]YbDe-:<9dbo\.!p4n-KVO#_iYZr"98B*YOq>mgp1P'"Hgbd/S#,#cRKm@0l]s)hL$pTsu(o_b'pLd<NB4!($W)"G;f!R!0+SF-o!UB\_msThKV%Gf[Obng4:[ktZP+aTB0/cmQ*h_(!H_L&1(roB)P*O<Tb]6Qu8UHCqYi"XH@eIp_MHh<`\ieunsRRqN*,9VuJi7+;gdSQ\'e0RHhUfU9%N1AKs$<?h7E:7Gn8Qg,%F@iE*1c*6GlHlC7A<ht'%3%YNR319l,qJ(`bl6l80k]1[;J+hF+J=?Bq-:#7-2^K``W&si1,<Qn-r&3/$mLG:N]m,F*:;amY"gmAD"Z(Q6Z:D5k2-d1&T%'3WSI<Q%$X<f[M*fgPcLRpk4RnIk!<b<"+q_&T1D/c[hHr8"p`^o6^:]-g3bP?_fHq;jl@mcY?D"?t*tU'Wi/DsCZs+gmQmXaDY:S/q=b?3Y5l?)8$&ChLp!@B^J;nft8Wq$4n8!HVnrqqh5o?pbG($tY,FIl!d:YfjITU3m6\8Db(bW'tj^YHTNDp_@JRNL_d=6K8itADn%-6-=`a^X\0ZHM/-5Lo7jB=0@<jRk]DD7p0Q#f01Z3<BCrSf*H3Qht@MXN_IV\%>P_f,9f*$p;Kn-[YE6)ustfK]L+4XP@ZR.%>.bi>JH+g4Y&GfLAnfI3I>$t@g%faJ;FBl$J(,225Q*.4u_I!RpmXUjLC^[nn6G&@Pco8,u$?*TZ4&SHERibjV;m[n3r;jgEUo?M$sSFkTDHp4',4BI0_EDBUX?$u]\2SdV10@PV[bo(IsnpD0[E/$!V$?-mqemHD>%2tn_V%qcJ:1<(DgMJhHpu0-WH73Tq6uXgD[&.'.W$.dKdO+Y"EcN(2XfZ2EkaBm[+,qUP4\-s[,BGi3.VI_i!=A%O)nNIL40Y>nfRcMHgCPic^8/jlSIZJaS@'ieW""92-d^NihXPRLN$WMKHso]^n)#p,fYR6=\dXf3/gEIc?91>*!PL2TcuX_tHCfZe-&`(l<+r;q12mFJIf1k6@Kgbe@od0\M5o7kI=1KX=MTa]6f!9n@bABq!!Mt#]3iWB\93YLgNsHJCu+5ojZP\5Poq1"`b]RkD2;t5RG4+KFP=9V583T1C(6o^:0#ln\\[Y?41=Y;A42N,eu/>3oqJt4h9P5iq3HdLZKfR>S.UU\W_d,$(NnCIPmpqj(gQ&]A&#f:_V,XTd1-][,rQ)58cTNV,]FEXiJh``r*F<6b&"ID<nA]9q5hL`'isV0TbZEN)o;)t?5aq*[.:C!$j4OI(BeSjH[@8"[OWW(H7Df#F@#a>7aWbU~>
 endstream
 endobj
 47 0 obj
@@ -320,7 +320,7 @@
 49 0 obj
 << /Type /Annot
 /Subtype /Link
-/Rect [ 141.336 194.134 244.02 182.134 ]
+/Rect [ 141.336 145.214 244.02 133.214 ]
 /C [ 0 0 0 ]
 /Border [ 0 0 0 ]
 /A << /URI (http://localhost:8080/)
@@ -509,43 +509,43 @@
 17 0 obj
 <<
 /S /GoTo
-/D [41 0 R /XYZ 85.0 599.68 null]
+/D [41 0 R /XYZ 85.0 586.48 null]
 >>
 endobj
 19 0 obj
 <<
 /S /GoTo
-/D [41 0 R /XYZ 85.0 372.707 null]
+/D [41 0 R /XYZ 85.0 306.707 null]
 >>
 endobj
 21 0 obj
 <<
 /S /GoTo
-/D [41 0 R /XYZ 85.0 307.173 null]
+/D [41 0 R /XYZ 85.0 241.173 null]
 >>
 endobj
 23 0 obj
 <<
 /S /GoTo
-/D [43 0 R /XYZ 85.0 639.28 null]
+/D [43 0 R /XYZ 85.0 553.4 null]
 >>
 endobj
 25 0 obj
 <<
 /S /GoTo
-/D [43 0 R /XYZ 85.0 397.787 null]
+/D [43 0 R /XYZ 85.0 313.387 null]
 >>
 endobj
 27 0 obj
 <<
 /S /GoTo
-/D [47 0 R /XYZ 85.0 527.86 null]
+/D [47 0 R /XYZ 85.0 446.02 null]
 >>
 endobj
 29 0 obj
 <<
 /S /GoTo
-/D [47 0 R /XYZ 85.0 381.567 null]
+/D [47 0 R /XYZ 85.0 319.447 null]
 >>
 endobj
 50 0 obj
@@ -556,74 +556,74 @@
 xref
 0 69
 0000000000 65535 f 
-0000017176 00000 n 
-0000017262 00000 n 
-0000017354 00000 n 
+0000017559 00000 n 
+0000017645 00000 n 
+0000017737 00000 n 
 0000000015 00000 n 
 0000000071 00000 n 
 0000000922 00000 n 
 0000001042 00000 n 
 0000001137 00000 n 
-0000017499 00000 n 
+0000017882 00000 n 
 0000001271 00000 n 
-0000017562 00000 n 
+0000017945 00000 n 
 0000001408 00000 n 
-0000017628 00000 n 
+0000018011 00000 n 
 0000001545 00000 n 
-0000017694 00000 n 
+0000018077 00000 n 
 0000001682 00000 n 
-0000017760 00000 n 
+0000018143 00000 n 
 0000001819 00000 n 
-0000017825 00000 n 
+0000018208 00000 n 
 0000001956 00000 n 
-0000017891 00000 n 
+0000018274 00000 n 
 0000002092 00000 n 
-0000017957 00000 n 
+0000018340 00000 n 
 0000002229 00000 n 
-0000018022 00000 n 
+0000018404 00000 n 
 0000002366 00000 n 
-0000018088 00000 n 
+0000018470 00000 n 
 0000002503 00000 n 
-0000018153 00000 n 
+0000018535 00000 n 
 0000002640 00000 n 
-0000005311 00000 n 
-0000005434 00000 n 
-0000005503 00000 n 
-0000005696 00000 n 
-0000005897 00000 n 
-0000006084 00000 n 
-0000006260 00000 n 
-0000006450 00000 n 
-0000006622 00000 n 
-0000006796 00000 n 
-0000009102 00000 n 
-0000009210 00000 n 
-0000011374 00000 n 
-0000011497 00000 n 
-0000011524 00000 n 
-0000011694 00000 n 
-0000013564 00000 n 
-0000013687 00000 n 
-0000013714 00000 n 
-0000018219 00000 n 
-0000013889 00000 n 
-0000014052 00000 n 
-0000014247 00000 n 
-0000014494 00000 n 
-0000014732 00000 n 
-0000014992 00000 n 
-0000015230 00000 n 
-0000015443 00000 n 
-0000015793 00000 n 
-0000016020 00000 n 
-0000016247 00000 n 
+0000005206 00000 n 
+0000005329 00000 n 
+0000005398 00000 n 
+0000005591 00000 n 
+0000005792 00000 n 
+0000005979 00000 n 
+0000006155 00000 n 
+0000006345 00000 n 
+0000006517 00000 n 
+0000006691 00000 n 
+0000009150 00000 n 
+0000009258 00000 n 
+0000011716 00000 n 
+0000011839 00000 n 
+0000011866 00000 n 
+0000012036 00000 n 
+0000013947 00000 n 
+0000014070 00000 n 
+0000014097 00000 n 
+0000018601 00000 n 
+0000014272 00000 n 
+0000014435 00000 n 
+0000014630 00000 n 
+0000014877 00000 n 
+0000015115 00000 n 
+0000015375 00000 n 
+0000015613 00000 n 
+0000015826 00000 n 
+0000016176 00000 n 
 0000016403 00000 n 
-0000016516 00000 n 
-0000016626 00000 n 
-0000016737 00000 n 
-0000016845 00000 n 
-0000016951 00000 n 
-0000017067 00000 n 
+0000016630 00000 n 
+0000016786 00000 n 
+0000016899 00000 n 
+0000017009 00000 n 
+0000017120 00000 n 
+0000017228 00000 n 
+0000017334 00000 n 
+0000017450 00000 n 
 trailer
 <<
 /Size 69
@@ -631,5 +631,5 @@
 /Info 4 0 R
 >>
 startxref
-18270
+18652
 %%EOF

Modified: lucene/nutch/branches/mapred/src/site/src/documentation/content/xdocs/tutorial.xml
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/mapred/src/site/src/documentation/content/xdocs/tutorial.xml?rev=294928&r1=294927&r2=294928&view=diff
==============================================================================
--- lucene/nutch/branches/mapred/src/site/src/documentation/content/xdocs/tutorial.xml (original)
+++ lucene/nutch/branches/mapred/src/site/src/documentation/content/xdocs/tutorial.xml Tue Oct  4 14:58:53 2005
@@ -66,11 +66,11 @@
 
 <ol>
 
-<li>Create a flat file of root urls.  For example, to crawl the
-<code>nutch</code> site you might start with a file named
-<code>urls</code> containing just the Nutch home page.  All other
-Nutch pages should be reachable from this page.  The <code>urls</code>
-file would thus look like:
+<li>Create a directory with a flat file of root urls.  For example, to
+crawl the <code>nutch</code> site you might start with a file named
+<code>urls/nutch</code> containing just the url of the Nutch home
+page.  All other Nutch pages should be reachable from this page.  The
+<code>urls/nutch</code> file would thus contain:
 <source>
 http://lucene.apache.org/nutch/
 </source>
@@ -97,24 +97,28 @@
 
 <ul>
 <li><code>-dir</code> <em>dir</em> names the directory to put the crawl in.</li>
-<li><code>-depth</code> <em>depth</em> indicates the link depth from the root
-page that should be crawled.</li>
-<li><code>-delay</code> <em>delay</em> determines the number of seconds
-between accesses to each host.</li>
 <li><code>-threads</code> <em>threads</em> determines the number of
 threads that will fetch in parallel.</li>
+<li><code>-depth</code> <em>depth</em> indicates the link depth from the root
+page that should be crawled.</li>
+<li><code>-topN</code> <em>N</em> determines the maximum number of pages that
+will be retrieved at each level up to the depth.</li>
 </ul>
 
 <p>For example, a typical call might be:</p>
 
 <source>
-bin/nutch crawl urls -dir crawl.test -depth 3 >&amp; crawl.log
+bin/nutch crawl urls -dir crawl -depth 3 -topN 50
 </source>
 
-<p>Typically one starts testing one's configuration by crawling at low
-depths, and watching the output to check that desired pages are found.
-Once one is more confident of the configuration, then an appropriate
-depth for a full crawl is around 10.</p>
+<p>Typically one starts testing one's configuration by crawling at
+shallow depths, sharply limiting the number of pages fetched at each
+level (<code>-topN</code>), and watching the output to check that
+desired pages are fetched and undesirable pages are not.  Once one is
+confident of the configuration, then an appropriate depth for a full
+crawl is around 10.  The number of pages per level
+(<code>-topN</code>) for a full crawl can be from tens of thousands to
+millions, depending on your resources.</p>
 
 <p>Once crawling has completed, one can skip to the Searching section
 below.</p>
@@ -131,54 +135,62 @@
 <section>
 <title>Whole-web: Concepts</title>
 
-<p>Nutch data is of two types:</p>
+<p>Nutch data is composed of:</p>
 
 <ol>
-  <li>The web database.  This contains information about every
-page known to Nutch, and about links between those pages.</li>
-  <li>A set of segments.  Each segment is a set of pages that are
-fetched and indexed as a unit.  Segment data consists of the
-following types:</li>
+
+  <li>The crawl database, or <em>crawldb</em>.  This contains
+information about every url known to Nutch, including whether it was
+fetched, and, if so, when.</li>
+
+  <li>The link database, or <em>linkdb</em>.  This contains the list
+of known links to each url, including both the source url and anchor
+text of the link.</li>
+
+  <li>A set of <em>segments</em>.  Each segment is a set of urls that are
+fetched as a unit.  Segments are directories with the following
+subdirectories:</li>
+
   <li><ul>
-    <li>a <em>fetchlist</em> is a file
-that names a set of pages to be fetched</li>
-    <li>the<em> fetcher output</em> is a
-set of files containing the fetched pages</li>
-    <li>the <em>index </em>is a
-Lucene-format index of the fetcher output.</li>
+    <li>a <em>crawl_generate</em> names a set of urls to be fetched</li>
+    <li>a <em>crawl_fetch</em> contains the status of fetching each url</li>
+    <li>a <em>content</em> contains the content of each url</li>
+    <li>a <em>parse_text</em> contains the parsed text of each url</li>
+    <li>a <em>parse_data</em> contains outlinks and metadata parsed
+    from each url</li>
+    <li>a <em>crawl_parse</em> contains the outlink urls, used to
+    update the crawldb</li>
   </ul></li>
+
+<li>The <em>indexes</em> are Lucene-format indexes.</li>
+
 </ol>
-<p>In the following examples we will keep our web database in a directory
-named <code>db</code> and our segments
-in a directory named <code>segments</code>:</p>
-<source>mkdir db
-mkdir segments</source>
 
 </section>
 <section>
 <title>Whole-web: Boostrapping the Web Database</title>
-<p>The admin tool is used to create a new, empty database:</p>
-
-<source>bin/nutch admin db -create</source>
 
-<p>The <em>injector</em> adds urls into the database.  Let's inject
-URLs from the <a href="http://dmoz.org/">DMOZ</a> Open
-Directory. First we must download and uncompress the file listing all
-of the DMOZ pages.  (This is a 200+Mb file, so this will take a few
-minutes.)</p>
+<p>The <em>injector</em> adds urls to the crawldb.  Let's inject URLs
+from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we
+must download and uncompress the file listing all of the DMOZ pages.
+(This is a 200+Mb file, so this will take a few minutes.)</p>
 
 <source>wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
 gunzip content.rdf.u8.gz</source>
 
-<p>Next we inject a random subset of these pages into the web database.
+<p>Next we select a random subset of these pages.
  (We use a random subset so that everyone who runs this tutorial
 doesn't hammer the same sites.)  DMOZ contains around three million
-URLs.  We inject one out of every 3000, so that we end up with
+URLs.  We select one out of every 5000, so that we end up with
 around 1000 URLs:</p>
 
-<source>bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</source>
+<source>mkdir dmoz
+bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 &gt; dmoz/urls</source>
 
-<p>This also takes a few minutes, as it must parse the full file.</p>
+<p>The parser also takes a few minutes, as it must parse the full
+file.  Finally, we initialize the crawl db with the selected urls.</p>
+
+<source>bin/nutch inject crawl/crawldb dmoz</source>
 
 <p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
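
The dmoz/urls seed list produced above can also be sanity-checked with standard Unix tools; the count should be in the neighborhood of the 1000 URLs just mentioned:

    wc -l dmoz/urls
    head dmoz/urls
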
 
@@ -186,39 +198,39 @@
 <section>
 <title>Whole-web: Fetching</title>
 <p>To fetch, we first generate a fetchlist from the database:</p>
-<source>bin/nutch generate db segments
+<source>bin/nutch generate crawl/crawldb crawl/segments
 </source>
 <p>This generates a fetchlist for all of the pages due to be fetched.
  The fetchlist is placed in a newly created segment directory.
  The segment directory is named by the time it's created.  We
 save the name of this segment in the shell variable <code>s1</code>:</p>
-<source>s1=`ls -d segments/2* | tail -1`
+<source>s1=`ls -d crawl/segments/2* | tail -1`
 echo $s1
 </source>
 <p>Now we run the fetcher on this segment with:</p>
 <source>bin/nutch fetch $s1</source>
 <p>When this is complete, we update the database with the results of the
 fetch:</p>
-<source>bin/nutch updatedb db $s1</source>
+<source>bin/nutch updatedb crawl/crawldb $s1</source>
 <p>Now the database has entries for all of the pages referenced by the
 initial set.</p>
 
 <p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
-<source>bin/nutch generate db segments -topN 1000
-s2=`ls -d segments/2* | tail -1`
+<source>bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s2=`ls -d crawl/segments/2* | tail -1`
 echo $s2
 
 bin/nutch fetch $s2
-bin/nutch updatedb db $s2
+bin/nutch updatedb crawl/crawldb $s2
 </source>
 <p>Let's fetch one more round:</p>
 <source>
-bin/nutch generate db segments -topN 1000
-s3=`ls -d segments/2* | tail -1`
+bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s3=`ls -d crawl/segments/2* | tail -1`
 echo $s3
 
 bin/nutch fetch $s3
-bin/nutch updatedb db $s3
+bin/nutch updatedb crawl/crawldb $s3
 </source>
 
 <p>By this point we've fetched a few thousand pages.  Let's index
@@ -227,16 +239,20 @@
 </section>
 <section>
 <title>Whole-web: Indexing</title>
-<p>To index each segment we use the <code>index</code>
-command, as follows:</p>
-<source>bin/nutch index $s1
-bin/nutch index $s2
-bin/nutch index $s3</source>
 
-<p>Then, before we can search a set of segments, we need to delete
-duplicate pages.  This is done with:</p>
+<p>Before indexing we first invert all of the links, so that we may
+index incoming anchor text with the pages.</p>
+
+<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source>
+
+<p>To index the segments we use the <code>index</code> command, as follows:</p>
+
+<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source>
+
+<!-- <p>Then, before we can search a set of segments, we need to delete -->
+<!-- duplicate pages.  This is done with:</p> -->
 
-<source>bin/nutch dedup segments dedup.tmp</source>
+<!-- <source>bin/nutch dedup indexes</source> -->
 
 <p>Now we're ready to search!</p>
 
@@ -256,10 +272,8 @@
 cp nutch*.war ~/local/tomcat/webapps/ROOT.war
 </source>
 
-<p>The webapp finds its indexes in <code>./segments</code>, relative
-to where you start Tomcat, so, if you've done intranet crawling,
-connect to your crawl directory, or, if you've done whole-web
-crawling, don't change directories, and give the command:</p>
+<p>The webapp finds its indexes in <code>./crawl</code>, relative
+to where you start Tomcat, so use a command like:</p>
 
 <source>~/local/tomcat/bin/catalina.sh start
 </source>