Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2006/01/09 17:16:42 UTC
XmlInputFormat?
Hi
I am going to feed the nutch-0.8-dev crawler with seeds in XML format, and
I have read Nutch's TextInputFormat/InputFormatBase. It seems that Nutch
currently breaks plain text files up by character position and then parses
the pieces. My question is how to support an XmlInputFormat: as I see it,
XML is not character-based but block-based.
Thanks
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: XmlInputFormat?
Posted by Doug Cutting <cu...@nutch.org>.
Jack Tang wrote:
> I am going to feed the nutch-0.8-dev crawler with seeds in XML format, and
> I have read Nutch's TextInputFormat/InputFormatBase. It seems that Nutch
> currently breaks plain text files up by character position and then parses
> the pieces. My question is how to support an XmlInputFormat: as I see it,
> XML is not character-based but block-based.
To make splitting efficient (since it is performed on the single master
node) TextInputFormat splits files by byte positions. Then, when a
split is processed on a node, input is synchronized to line boundaries. In
particular, the first line in the split is the first starting after the
start of the split (unless the split start position is zero, in which
case the first line in the split is the first line in the file) and the
last line in the split is the line containing the split's end position.
SequenceFiles are handled similarly.
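The line-synchronization rule described above can be sketched as follows. This is only an illustration under stated assumptions, not Nutch's actual code: the class LineSplitSketch and the method linesInSplit are hypothetical names, the file is held in memory as a byte array, lines end with '\n', and a split is the byte range from start to end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Nutch.
final class LineSplitSketch {

    /** Returns the lines that belong to the split covering bytes start..end of data. */
    static List<String> linesInSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not at the start of the file: skip forward past the next newline,
            // so the first line read is the first one starting after 'start'.
            // The skipped line contains the previous split's end position,
            // so the previous split is the one that reads it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }

        List<String> lines = new ArrayList<>();
        // Keep reading while the line starts at or before 'end'; the last
        // line read is therefore the one containing the split's end position.
        while (pos < data.length && pos <= end) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            lines.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return lines;
    }
}
```

With the 23-byte input "alpha\nbeta\ngamma\ndelta\n" cut at bytes 8 and 16, the three splits yield [alpha, beta], [gamma] and [delta], so every line is read exactly once even though the cut points fall mid-line.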
XML would be hard to handle in this way, since one cannot seek to an
arbitrary position in an XML file and begin processing it. So the
simplest way to implement an XML input format would be to disable file
splitting: each split should contain an entire file.
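A minimal sketch of what disabling splitting means for split computation, again under stated assumptions: WholeFileSplitSketch, Split and computeSplits are hypothetical names rather than the Nutch API, and files are represented only by a path and a length in bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Nutch.
final class WholeFileSplitSketch {

    /** Hypothetical split descriptor: a byte range within one file. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Computes splits for files given as path -> length in bytes.
     * With splittable == false, every file becomes exactly one split.
     */
    static List<Split> computeSplits(Map<String, Long> fileLengths,
                                     long splitSize, boolean splittable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> file : fileLengths.entrySet()) {
            long length = file.getValue();
            if (!splittable) {
                // XML case: one split spans the entire file, so a parser can
                // always start reading at the document root.
                splits.add(new Split(file.getKey(), 0, length));
                continue;
            }
            // Text case: cut at fixed byte positions, ignoring content.
            for (long start = 0; start < length; start += splitSize) {
                long len = Math.min(splitSize, length - start);
                splits.add(new Split(file.getKey(), start, len));
            }
        }
        return splits;
    }
}
```

The trade-off is parallelism: a single large XML seed file is then processed by one task only, so many moderately sized files spread the work better than one big one.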
Doug