Posted to user@nutch.apache.org by "Touretsky, Gregory" <gr...@intel.com> on 2014/04/28 12:54:18 UTC

Nutch for NFS crawling and data indexing

Hi,

      I see multiple references to web search implementations with Nutch.
Has anyone implemented large-scale (many TBs of data, millions of files) NFS search?
Are there any alternative solutions for scale-out crawling and indexing of file systems?

Thank you,
   Gregory
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

RE: Nutch for NFS crawling and data indexing

Posted by "Touretsky, Gregory" <gr...@intel.com>.
Harald,

  I assume you're referring to http://www.raytion.com/products/enterprise-search-connectors.html
What about the scalability of this solution: would it support crawling and indexing hundreds of millions of files within a reasonable amount of time?
Where would the index be stored?
Would it allow us to define parsers for various file formats, similar to what is possible with Tika?

Thanks,
   Gregory



Re: Nutch for NFS crawling and data indexing

Posted by Harald Kirsch <Ha...@raytion.com>.
The strength of web crawlers is their ability to follow links in 
websites to explore more and more of the landscape to index.

When you want to index file shares such as NFS, potentially with access
rights included, you may want to use a different beast (we call it a
connector). The benefit is that the connector can simply follow the
directory structure. There is no need to discover new documents by
parsing the documents found so far.
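
To illustrate the directory-following idea, here is a minimal sketch in Python. It is not Raytion's connector or anything from Nutch; the function name walk_share and the record fields are assumptions made up for this example. It enumerates files purely from the directory tree and records the ownership and permission bits an indexer might need for access rights.

```python
import os
import stat
from typing import Dict, Iterator


def walk_share(root: str) -> Iterator[Dict[str, object]]:
    """Yield one metadata record per regular file under root.

    Hypothetical helper: discovery relies only on the directory
    structure, so no document content is parsed to find more files.
    """
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        dirnames.sort()  # in-place sort gives a deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, devices
            yield {
                "path": path,
                "size": st.st_size,
                "mtime": st.st_mtime,
                "mode": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
                "uid": st.st_uid,
                "gid": st.st_gid,
            }
```

Each record could then be handed to a parser and an indexer. For hundreds of millions of files, a single-process walk like this would probably need to be split by subtree across workers, which this sketch does not attempt.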

Harald.



-- 
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com