Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2015/09/25 03:40:14 UTC

[Nutch Wiki] Update of "NutchFileFormats" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchFileFormats" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchFileFormats?action=diff&rev1=3&rev2=4

- -- LarsAronsson - 30 Jun 2004
+ <<TableOfContents(4)>>
  
- These notes were written on 30 June 2004, and doesn't handle the config files or the web db.
+ = Introduction =
  
- == Nutch file formats from the bottom up ==
+ This page describes the Nutch file formats (for the Nutch 1.X series) from the bottom up.
+ 
+ = Nutch Files in Detail =
+ 
+ Nutch uses its own custom serialization to store Java data types and structures in files. Every such data type must implement the [[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/Writable.html|org.apache.hadoop.io.Writable]] interface.
+ 
+ The list below shows the Nutch source files that define custom Writables, that is, classes implementing the Hadoop [[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/Writable.html|org.apache.hadoop.io.Writable]] interface. The remaining sections of this page explain how and where each of these Writables fits into the core Nutch data structures such as '''CrawlDB''', '''LinkDB''' and '''Segments'''.
+ {{{
+ ./src/java/org/apache/nutch/crawl/CrawlDatum.java
+ ./src/java/org/apache/nutch/crawl/Generator.java
+ ./src/java/org/apache/nutch/crawl/Inlink.java
+ ./src/java/org/apache/nutch/crawl/Inlinks.java
+ ./src/java/org/apache/nutch/crawl/MapWritable.java
+ ./src/java/org/apache/nutch/indexer/NutchDocument.java
+ ./src/java/org/apache/nutch/indexer/NutchField.java
+ ./src/java/org/apache/nutch/indexer/NutchIndexAction.java
+ ./src/java/org/apache/nutch/metadata/Metadata.java
+ ./src/java/org/apache/nutch/parse/Outlink.java
+ ./src/java/org/apache/nutch/parse/ParseStatus.java
+ ./src/java/org/apache/nutch/parse/ParseText.java
+ ./src/java/org/apache/nutch/protocol/Content.java
+ ./src/java/org/apache/nutch/protocol/ProtocolStatus.java
+ ./src/java/org/apache/nutch/scoring/webgraph/LinkDatum.java
+ ./src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
+ ./src/java/org/apache/nutch/scoring/webgraph/Loops.java
+ ./src/java/org/apache/nutch/scoring/webgraph/Node.java
+ }}}
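+ 
+ As a rough illustration of what implementing the interface involves, the sketch below round-trips a small record through a byte array. It is deliberately self-contained: `SimpleWritable` is a local stand-in that only mirrors the two methods of org.apache.hadoop.io.Writable (`write(DataOutput)` and `readFields(DataInput)`), and `PageScore` is a hypothetical type invented for this example, not one of the Nutch classes listed above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable,
// so that the example compiles without Hadoop on the classpath.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record type (not a Nutch class): a URL and a score.
class PageScore implements SimpleWritable {
    String url = "";
    float score;

    PageScore() { }

    PageScore(String url, float score) {
        this.url = url;
        this.score = score;
    }

    // Fields are written in a fixed order...
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    // ...and must be read back in exactly the same order.
    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        score = in.readFloat();
    }

    // Serializes to memory and deserializes into a fresh instance.
    static PageScore roundTrip(PageScore original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            PageScore copy = new PageScore();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageScore copy = roundTrip(new PageScore("http://example.org/", 1.5f));
        System.out.println(copy.url + " " + copy.score);
    }
}
```

+ The no-argument constructor matters in this pattern: the reader first creates an empty instance and then fills it through `readFields`.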
+ 
+ = CrawlDB =
+ 
+ TODO
+ 
+ = LinkDB = 
+ 
+ TODO
+ 
+ = Segments = 
+ 
+ TODO
+ 
+ Nutch stores strings as UTF-8 and uses the class net.nutch.io.UTF8 for writing short strings to files. The UTF8 class limits the length of strings to 0xffff/3 or 21845 characters, since each character can take up to three bytes when encoded and the length prefix is only 16 bits wide. The function UTF8.write() uses java.io.DataOutput.writeShort() to prepend the length of the string. This is why the two bytes \000\003 are seen before a three-letter word in a file. The zero byte is thus not a null termination of the previous string (strings are not null terminated), but the most significant byte of the 16-bit short integer indicating the length of the following string.
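+ 
+ The length prefix described above can be reproduced with the standard library alone. The following is a minimal sketch, not the net.nutch.io.UTF8 code itself: `LengthPrefixedString` is a hypothetical helper, and it assumes standard UTF-8 encoding, which for plain ASCII words yields the bytes discussed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Nutch) that mimics the layout:
// a 16-bit big-endian byte count followed by the encoded string.
class LengthPrefixedString {
    static byte[] encode(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        if (payload.length > 0xffff) {
            throw new IllegalArgumentException("too long for a 16-bit length prefix");
        }
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeShort(payload.length); // most significant byte first
            out.write(payload);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode("foo")) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim()); // prints "00 03 66 6f 6f"
    }
}
```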
+ 
+ Nutch relies heavily on mappings (associative arrays) from keys to values. The class net.nutch.io.SequenceFile is a flat file of keys and values. The first four bytes of each such file are ASCII "SEQ" and \001 (C-a), followed by the Java class names of keys and values, written as UTF8 strings, e.g. "SEQ\001\000\004long\000\004long" for a mapping from long integers to long integers. The key-value pairs follow. Each pair is introduced by four bytes giving the length in bytes of the pair (excluding the eight length bytes) and four bytes giving the length of the key. A typical long (64-bit) integer is 8 bytes, so a long-to-long mapping has pairs of length 16 bytes, e.g.
+ 
+ {{{
+   00 00 00 10                   int length of pair = 0x10 = 16 bytes
+   00 00 00 08                   int length of key  = 0x08 =  8 bytes
+   00 00 00 00 00 00 02 80       long key   = 0x280   = 640
+   00 00 00 00 00 0a 42 9b       long value = 0xa429b = 672411
+ }}}
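+ 
+ To check the arithmetic in the dump above, the 24 bytes can be decoded with java.nio.ByteBuffer, which reads big-endian by default. This is only a sketch of the record layout as described on this page; `SeqRecordDemo` is a hypothetical name and the code does not use the SequenceFile reader.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class: decodes one long-to-long record as laid out above.
class SeqRecordDemo {
    // The 24 bytes from the dump: pair length, key length, key, value.
    static final byte[] RECORD = {
        0, 0, 0, 0x10,
        0, 0, 0, 0x08,
        0, 0, 0, 0, 0, 0, 0x02, (byte) 0x80,
        0, 0, 0, 0, 0, 0x0a, 0x42, (byte) 0x9b
    };

    // Returns {key, value}; rejects records that are not 8-byte key + 8-byte value.
    static long[] decode(byte[] record) {
        ByteBuffer in = ByteBuffer.wrap(record); // big-endian by default
        int pairLength = in.getInt();            // excludes the eight length bytes
        int keyLength = in.getInt();
        if (keyLength != Long.BYTES || pairLength - keyLength != Long.BYTES) {
            throw new IllegalArgumentException("not a long-to-long record");
        }
        long key = in.getLong();
        long value = in.getLong();
        return new long[] {key, value};
    }

    public static void main(String[] args) {
        long[] pair = decode(RECORD);
        System.out.println("key=" + pair[0] + " value=" + pair[1]);
    }
}
```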
+ 
+ To economize the handling of large data volumes, net.nutch.io.MapFile manages a mapping as two separate files in a subdirectory of its own. The large "data" file stores all keys and values, sorted by the key. The much smaller "index" file points to byte offsets in the data file for a small sample of keys. Only the index file is read into memory.
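+ 
+ The idea of a small in-memory index over a large sorted data file can be sketched as follows. This is an illustrative model only, not the MapFile implementation: `SparseIndexDemo` is a hypothetical class, arrays stand in for the "data" file, array positions stand in for byte offsets, and the sampling `interval` is an assumed parameter.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a sparse index over sorted data (not Nutch code).
class SparseIndexDemo {
    private final long[] keys;      // sorted keys, standing in for the "data" file
    private final String[] values;  // values stored alongside the keys
    // Sampled key -> position in the data; standing in for the "index" file.
    private final TreeMap<Long, Integer> index = new TreeMap<>();

    SparseIndexDemo(long[] sortedKeys, String[] values, int interval) {
        this.keys = sortedKeys;
        this.values = values;
        for (int i = 0; i < sortedKeys.length; i += interval) {
            index.put(sortedKeys[i], i); // keep only every interval-th key
        }
    }

    // Jumps to the nearest sampled key at or below the target, then scans forward.
    String get(long key) {
        Map.Entry<Long, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // target is smaller than every stored key
        }
        for (int i = start.getValue(); i < keys.length && keys[i] <= key; i++) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SparseIndexDemo map = new SparseIndexDemo(
                new long[] {10, 20, 30, 40, 50},
                new String[] {"a", "b", "c", "d", "e"},
                2);
        System.out.println(map.get(40));
    }
}
```

+ A larger interval shrinks the index at the cost of a longer forward scan per lookup.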
+ 
+ net.nutch.io.ArrayFile is a specialization of MapFile where the keys are long integers.
+ 
+ The Java files in net.nutch.io.* comprise 2556 lines of source code. The biggest one is SequenceFile.java, which contains a Writer (112 lines), a Reader (138 lines), a BufferedRandomAccessFile (140 lines) and a Sorter (389 lines).
+ 
+ When Nutch crawls the web, each resulting segment has four subdirectories, each containing an ArrayFile (a MapFile having keys that are long integers):
+ 
+ {{{#!CSV ,
+ Subdirectory,Value datatype,Variable
+ fetchlist,net.nutch.pagedb.FetchListEntry,fetchList
+ fetcher,net.nutch.fetcher.FetcherOutput,fetcherDb
+ fetcher_content,net.nutch.fetcher.FetcherContent,rawDb
+ fetcher_text,net.nutch.fetcher.FetcherText,strippedDb
+ }}}
+ 
+ Crawling is performed by net.nutch.fetcher.Fetcher, which starts a number of parallel FetcherThread instances. Each thread gets a URL from the fetchList, checks robots.txt, retrieves the contents and appends the results to fetcherDb, rawDb, and strippedDb.
+ 
+ = Old File Format Documentation =
  
  == Nutch version 0.5 ==