Posted to user@nutch.apache.org by Raymond Giorgi <ra...@gmail.com> on 2010/06/16 20:23:01 UTC

Indexing local file system without directory information

Hi, all,

I'm using Nutch to index a specific directory of a local file system, and I
want to keep directory information out of the index. So far, I've tried two
approaches, but neither of them seems to be working.

1) Pointing the injected URLs at a [temp.html] page that contains links to
every page that I'd like to index (e.g. temp.html points to
file://foo1.html, file://foo2.html, etc.) and running
"inject-generate-fetch-parse-updatedb-generate-fetch-parse-updatedb-invertlinks-index".
This approach doesn't index any of my files. Checking parse_data tells me
that the content is getting parsed, and looking at the link_db tells me that
the links are getting inverted. Nothing seems to end up in the index,
though.
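
For concreteness, a link page like the temp.html in approach 1 could be
generated with something along these lines. This is only a sketch under my
own assumptions: build_link_page and the paths in the usage note are
hypothetical names, the script uses only the Python standard library (nothing
from Nutch), and it writes absolute file:/// URLs rather than the short
file://foo1.html form above.

```python
import html
from pathlib import Path


def build_link_page(directory, output_file):
    """Write an HTML page linking to every regular file under `directory`.

    Hypothetical helper for illustration only; it is not part of Nutch.
    Returns the number of links written.
    """
    root = Path(directory).resolve()
    out = Path(output_file).resolve()
    links = []
    for path in sorted(root.rglob("*")):
        # Skip directories, and skip the link page itself if it lives
        # inside the directory being listed.
        if not path.is_file() or path == out:
            continue
        url = path.as_uri()  # absolute URL of the form file:///...
        links.append(
            '<a href="{0}">{1}</a><br>'.format(
                html.escape(url, quote=True), html.escape(path.name)
            )
        )
    page = "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"
    out.write_text(page, encoding="utf-8")
    return len(links)
```

Usage would look like build_link_page("/user/directory/to/crawl",
"/tmp/temp.html"), with the resulting page injected as the only seed URL.
Only regular files are linked, so the page itself never points at a
directory listing.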

2) Doing a crawl starting with a specific directory (e.g.
/user/directory/to/crawl/). This approach indexes all of my files, but it
also indexes directory information (i.e. the file meta-data). In other
words, if I do a search for 'foo', Solr will tell me that it found 50
documents containing the word 'foo' and two directories that contain a file
named 'foo', which is unfit for production purposes.

Any suggestions on how I could index a specific directory without directory
information would be a great help.

Thanks,
Ray