
Posted to user@nutch.apache.org by Claude Garceau <cl...@gmail.com> on 2017/05/12 18:56:29 UTC

Collecting files from File System

Here is the scope of my project.

We want to collect content from a document management system (Nuxeo), an
intranet (Drupal) and files from a file system (shared drives) in order to
make it retrievable by means of a search engine. All of these sources hold
internal information for an internal audience; this is about unstructured
content (documents and web pages). We want to use Nutch as the crawler on
these sources. Tika would then extract and format the data and commit it to
Elasticsearch (or Solr).

1) Is Nutch an appropriate solution to collect documents and their
metadata from a file system (shared drives)?

2) Does Nutch have the ability to collect the permissions that are set on the
NTFS Security tab of the directory tree or of the file?

Re: Collecting files from File System

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Claude,

> 1) Is Nutch an appropriate solution to collect documents and their
> metadata from a file system (shared drives)?

That works if the shared drives have a mount point. The plugin protocol-file
supports crawling file:/ URLs, in a way similar to a web server with directory
listings enabled.
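
To illustrate the directory-listing analogy, here is a minimal Java sketch that
treats a directory as a "page" whose links are the file: URLs of its direct
children. It is not the code of protocol-file; the class name FileUrlListing
and the method outlinks are hypothetical and use only the JDK.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Nutch).
final class FileUrlListing {

    /** Returns the file: URIs of the direct children of dir, sorted for stable output. */
    static List<URI> outlinks(Path dir) throws IOException {
        List<URI> links = new ArrayList<>();
        // A directory plays the role of a listing page: each entry becomes one link.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                // Path.toUri() yields an absolute file: URI.
                links.add(child.toUri());
            }
        }
        Collections.sort(links);
        return links;
    }
}
```

A crawler following these links level by level would walk the whole tree below
the mount point, much as it would follow links on a web server's index pages.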

> 2) Does Nutch have the ability to collect the permissions that are set on the
> NTFS Security tab of the directory tree or of the file?
>

Not out of the box. Permissions are quite specific to platforms / operating systems.
But it would be possible to extend the plugin so that permissions are attached as metadata.
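
As a rough sketch of what such an extension could read, the JDK exposes ACLs
through java.nio.file.attribute.AclFileAttributeView. The class name
PermissionMetadata and the keys fs.owner, fs.acl, fs.group and fs.mode below
are hypothetical, and the wiring into the plugin is not shown. The ACL view is
typically available when the JVM runs on Windows; where it is not offered
(for example on many Linux mounts) the sketch falls back to POSIX attributes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.nio.file.attribute.PosixFileAttributeView;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only (not part of Nutch).
final class PermissionMetadata {

    /** Collects owner and permission information as plain string key/value pairs. */
    static Map<String, String> collect(Path path) throws IOException {
        Map<String, String> meta = new LinkedHashMap<>();

        // Returns null if the file system does not offer an ACL view.
        AclFileAttributeView acl =
                Files.getFileAttributeView(path, AclFileAttributeView.class);
        if (acl != null) {
            meta.put("fs.owner", acl.getOwner().getName());
            StringJoiner entries = new StringJoiner(";");
            for (AclEntry e : acl.getAcl()) {
                // One entry per ACE: ALLOW/DENY, principal, set of permissions.
                entries.add(e.type() + ":" + e.principal().getName() + ":" + e.permissions());
            }
            meta.put("fs.acl", entries.toString());
            return meta;
        }

        // Fallback: POSIX owner, group and mode string such as rw-r--r--.
        PosixFileAttributeView posix =
                Files.getFileAttributeView(path, PosixFileAttributeView.class);
        if (posix != null) {
            PosixFileAttributes attrs = posix.readAttributes();
            meta.put("fs.owner", attrs.owner().getName());
            meta.put("fs.group", attrs.group().getName());
            meta.put("fs.mode", PosixFilePermissions.toString(attrs.permissions()));
        }
        return meta;
    }
}
```

The resulting map could then be attached as metadata to each fetched document
and passed on to the index, where the search front end would still have to
enforce it.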

Best,
Sebastian
