Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2006/01/09 17:16:42 UTC

XmlInputFormat?

Hi,

I am going to feed the nutch-0.8-dev crawler with seeds in XML format. I
have read Nutch's TextInputFormat/InputFormatBase, and it seems that Nutch
currently breaks plain text files into chars and parses those. My
question is how to support an XmlInputFormat: as I see it, XML is not
character-based but block-based.

Thanks

/Jack

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: XmlInputFormat?

Posted by Doug Cutting <cu...@nutch.org>.

Jack Tang wrote:
> I am going to feed nutch-0.8-dev crawler with seeds in xml format. And
> I have read nutch TextInputFormat/InputFormatBase. It seems now nutch
> breaks the plain text files into chars and parses on them. My question
> is how to support XmlInputFormat, in my eye, xml format is not
> character-based but block-based.

To make splitting efficient (since it is performed on the single master
node), TextInputFormat splits files by byte position.  Then, when a
split is processed on a node, the input is synchronized to line
boundaries.  In particular, the first line in the split is the first one
that starts after the start of the split (unless the split's start
position is zero, in which case the first line in the split is the first
line in the file), and the last line in the split is the line containing
the split's end position.
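
The synchronization rule above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the actual TextInputFormat code: the class LineSplitReader and its readSplit method are hypothetical names, the file is held in a byte array, lines end in '\n', and the text is ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: reads the lines that belong to
// the byte range [start, end) of an in-memory "file".
final class LineSplitReader {

    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // The line containing `start` belongs to the previous split,
            // so skip forward to just past the next newline.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> lines = new ArrayList<>();
        // Keep reading while a line starts at or before `end`, so that the
        // line containing the end position is read in full.
        while (pos < data.length && pos <= end) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.US_ASCII));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.US_ASCII);
        // Three splits chosen purely by byte position, ignoring line breaks.
        System.out.println(readSplit(data, 0, 8));
        System.out.println(readSplit(data, 8, 16));
        System.out.println(readSplit(data, 16, data.length));
    }
}
```

With this 23-byte sample the three calls return [alpha, beta], [gamma] and [delta]: every line is read exactly once even though none of the split boundaries falls on a line break.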

SequenceFiles are handled similarly.

XML would be hard to handle in this way, since one cannot seek to an 
arbitrary position in an XML file and begin processing it.  So the 
simplest way to implement an XML input format would be to disable file 
splitting: each split should contain an entire file.
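
As a rough sketch of what "disable file splitting" means for the split computation, the following self-contained Java compares byte-range splitting with one-split-per-file. Everything here is hypothetical (the WholeFileSplits class, its Split type and the plan method are made up for illustration and are not Nutch APIs); a real XmlInputFormat would have to express the same idea through whatever hooks InputFormatBase offers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical split planner, not a Nutch class.
final class WholeFileSplits {

    // A byte range of one file: [start, start + length).
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return path + ":" + start + "+" + length;
        }
    }

    // fileSizes maps a file path to its size in bytes.
    static List<Split> plan(Map<String, Long> fileSizes, long targetSplitSize, boolean splittable) {
        if (targetSplitSize <= 0) {
            throw new IllegalArgumentException("targetSplitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> file : fileSizes.entrySet()) {
            long size = file.getValue();
            if (!splittable) {
                // XML case: one split covering the entire file.
                splits.add(new Split(file.getKey(), 0, size));
                continue;
            }
            // Text case: cut the file at fixed byte offsets.
            for (long off = 0; off < size; off += targetSplitSize) {
                splits.add(new Split(file.getKey(), off, Math.min(targetSplitSize, size - off)));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("seeds-a.xml", 2500L);
        sizes.put("seeds-b.xml", 900L);
        System.out.println(plan(sizes, 1000, true));   // byte-range splits
        System.out.println(plan(sizes, 1000, false));  // one split per file
    }
}
```

The trade-off is that a large XML file is then processed by a single task, so parallelism comes only from having many seed files rather than from cutting one file into pieces.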

Doug