Posted to user@nutch.apache.org by sa...@students.iiit.ac.in on 2007/08/23 15:05:54 UTC

Indexing Local File System

Hi all,

I am using Nutch 0.9 on Linux (FC5). I have been trying to index a
particular directory of my filesystem. However, I am facing certain problems:

1) While fetching, the fetcher also fetches the parent directories. I
tried modifying conf/regex-urlfilter.txt as follows:

+^file:///home/user/parent-dir/
-.

However, it still fetches file:///home/user/
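
To show what I expected from these two rules, here is a minimal sketch in
Python. It is not Nutch code: accepts() is a hypothetical helper, and it
assumes that the rules are tried in file order and that the first pattern
found in the URL decides whether the URL is kept.

```python
import re

# The two rules from conf/regex-urlfilter.txt, in file order: (sign, pattern).
RULES = [
    ("+", r"^file:///home/user/parent-dir/"),
    ("-", r"."),
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

if __name__ == "__main__":
    for url in (
        "file:///home/user/parent-dir/a.txt",  # hypothetical file name
        "file:///home/user/",
        "file:/home/user/parent-dir/a.txt",    # single-slash form, as in the error log
    ):
        print(url, "accepted" if accepts(url) else "rejected")
```

Under that assumption only the first URL is accepted, so file:///home/user/
should have been rejected. The single-slash form is rejected as well, which
may matter because the URLs in my error output below are written as file:/...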

2) I also get the following error while parsing the directory:

Error parsing:
file:/home/sachin/IR/enterprise/csiro-split/CSIRO208/01607532:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
contentType=
url=file:/home/sachin/IR/enterprise/csiro-split/CSIRO208/01607532

Details of the file:
File name:
/home/user/IR/enterprise/csiro-split/CSIRO208/01607532    #(NO extension)

and file content:

HTTP/1.1 200 OK
Connection: close
Date: Sat, 24 Mar 2007 10:52:12 GMT
Server: Microsoft-IIS/6.0
Set-Cookie: CFID=42733090;expires=Mon, 16-Mar-2037 10:52:12 GMT;path=/
Set-Cookie: CFID=42733090;expires=Mon, 16-Mar-2037 10:52:12 GMT;path=/
Content-Language: en-AU
Content-Type: text/html; charset=UTF-8
Stored-Length: 20604


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
        <head>


                <meta http-equiv="Content-Type" content="text/html;
charset=utf-8">

        <title>CSIRO PUBLISHING - Publishing Partners</title>
        <meta name="description" content="CSIRO PUBLISHING, Australia's
leading publisher of quality scientific and technical books,
journals and CDs">
        <meta name="keywords" content="csiro publishing, CSIRO,
publications, science, scientific, educational, journal, journals,
Australia, Australian, books, cd, landlinks, lucid, ecos, sage,
cyberscience, building, construction, agricultural research,
astronomy, botany, chemistry, experimental agriculture, historical
records, invertebrate systematics, marine research, molluscan
research, freshwater research, physics, plant biology,
reproduction, fertility, development, sexual health, soil
research, systematic botany, wildlife research, zoology, emu,
ornithology, plant pathology, wildland fire,cd-rom, botanical,
systematics, multimedia, video, images, magazine, agribusiness,
environment, environmental, sustainable, zoological">

.....
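
My guess is that no content type is detected because the file has no
extension and begins with the stored HTTP headers instead of the HTML. A
minimal sketch in Python (again not Nutch code; split_stored_response() is a
hypothetical helper) of how the header block could be separated from the body,
assuming the headers end at the first blank line:

```python
def split_stored_response(raw):
    """Split a stored HTTP response into (status_line, headers, body).

    Assumption: the header block ends at the first blank line.
    Header names are lower-cased; a repeated header such as Set-Cookie
    keeps only its last value.
    """
    text = raw.replace("\r\n", "\n")
    head, _, body = text.partition("\n\n")
    lines = head.split("\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:  # skip lines that are not "Name: value"
            headers[name.strip().lower()] = value.strip()
    # Drop any extra blank lines between the headers and the document.
    return status_line, headers, body.lstrip("\n")

if __name__ == "__main__":
    # Shortened copy of the file content shown above.
    sample = (
        "HTTP/1.1 200 OK\n"
        "Connection: close\n"
        "Content-Type: text/html; charset=UTF-8\n"
        "Stored-Length: 20604\n"
        "\n"
        "\n"
        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n"
        "<html>\n"
    )
    status, headers, body = split_stored_response(sample)
    print(status)
    print(headers.get("content-type"))
    print(body.splitlines()[0])
```

If the files were pre-processed like this, the Content-Type header would give
the type (text/html here) and the remaining body would be plain HTML.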

Somebody suggested that I have to make some changes in the Fetcher code.

Please help.

Thanks in advance.

Regards,
Sachin Srivastava