Posted to user@nutch.apache.org by Jim Wilson <wi...@gmail.com> on 2006/09/08 21:42:16 UTC

Windows File Shares

Dear Nutch Users,

Does anyone have experience indexing the contents of Windows file shares?
There's information on the Wiki about indexing the local disk, but nothing
about remote shares.

Also, does Nutch traverse directories on its own, or does it require
document links?  Thanks in advance.

-- Jim R. Wilson

Re: Windows File Shares

Posted by Jim Wilson <wi...@gmail.com>.
Thanks for clearing that up.  So in short: it can't be done until somebody
does some coding.

Unfortunately, mounting the shares is not a feasible option for the
following reasons:

1) This would render useless the links served up through the Nutch search
(end-users won't have the same shares mounted).
2) This method has an upper limit of about 24 shares.
3) As the Fetcher discovers new documents, they might reference documents in
new shares that may not be mounted (this assumes that the *.doc interpreter
follows Word hyperlinks).

I admit that the above is not a problem in a single-user scenario, or
could be overcome through code, but the energy required to code a solution
would be better spent on the aforementioned SMB implementation.

In my particular use case, people are fond of making links of the following
form:

<a href="\\share\path\to\somefile.doc">Somefile.doc</a>

It would be nice if there were a parser hook that could interpret a pair of
leading backslash characters as SMB file links and follow them accordingly.
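
A minimal sketch of the rewriting step such a hook might perform, assuming
the target is an smb:// URL of the kind the JCIFS library works with. The
class and method names are hypothetical, not part of Nutch, and the sketch
does no percent-encoding of spaces or other special characters:

```java
import java.util.Optional;

public class UncLinkRewriter {

    /**
     * Hypothetical helper (not a Nutch API): rewrites a UNC-style href such as
     * \\share\path\to\file.doc into smb://share/path/to/file.doc.
     * Returns empty when the href does not start with two backslashes.
     */
    public static Optional<String> toSmbUrl(String href) {
        if (href == null || !href.startsWith("\\\\") || href.length() <= 2) {
            return Optional.empty();
        }
        // Drop the two leading backslashes and flip the remaining separators.
        String rest = href.substring(2).replace('\\', '/');
        return Optional.of("smb://" + rest);
    }

    public static void main(String[] args) {
        // prints smb://share/path/to/somefile.doc
        System.out.println(
            toSmbUrl("\\\\share\\path\\to\\somefile.doc").orElse("(not a UNC link)"));
        // prints (not a UNC link)
        System.out.println(
            toSmbUrl("http://example.com/").orElse("(not a UNC link)"));
    }
}
```

The rewritten URL would still need a protocol plugin that understands smb://
before the Fetcher could do anything with it.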

Anyway, that's probably enough ranting for now. I really do LOVE Nutch as it
mostly solves my Intranet indexing problem ... mostly.

-- Jim




Re: Windows File Shares

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jim Wilson wrote:
> Thanks for responding Renaud,
>
> I'm using Nutch 0.8, and I have a single file (urls.txt) in my urls
> directory.
>
> In it, I tried putting a line just like this:
>
> file://///server/path/to/filename.doc
>


Folks,

Windows shares (CIFS / SMB shares) are accessible using the CIFS/SMB
protocol, not the file protocol. Under Windows you either "mount" them
under a local drive letter (and then you can access them using the file
protocol) or you use the double backslash notation and access them
remotely through the SMB protocol - Windows Explorer tries to hide this
difference, but it does exist ...

Unfortunately, there is no SMB protocol plugin for Nutch yet - which 
means that unless you mount the remote shares you are not able to access 
them using the double-backslash notation, which requires using SMB.

It wouldn't be too hard to write an implementation of protocol-cifs 
using the JCIFS library ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Windows File Shares

Posted by Jim Wilson <wi...@gmail.com>.
Thanks for responding Renaud,

I'm using Nutch 0.8, and I have a single file (urls.txt) in my urls
directory.

In it, I tried putting a line just like this:

file://///server/path/to/filename.doc

I also tried this single line:

file://///server/path/to/

Additionally, I adjusted my url-filters.txt to the following (allow
everything):

+.
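
For reference, a rough sketch of how I understand a rule like this to be
evaluated. It assumes the usual Nutch regex filter convention (each line is
'+' or '-' followed by a regular expression, the first matching rule decides,
and a URL matching no rule is rejected). The class below is a hypothetical
stand-in, not the actual Nutch filter:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    /** One parsed rule: accept or reject URLs in which the pattern is found. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses filter lines; blank lines and lines starting with '#' are skipped. */
    public UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Assumed semantics: first rule whose pattern is found wins; no match rejects. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch allowAll = new UrlFilterSketch(Arrays.asList("+."));
        // prints true
        System.out.println(allowAll.accepts("file://///server/path/to/filename.doc"));
    }
}
```

If that reading is right, the filter is not what is rejecting my file URLs.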

After indexing finished, I started up Tomcat, but I always get 0 matches
regardless of search term.  I have yet to learn how one produces a "segment
dump" which appears to be very useful.  I do not know if Nutch was able to
resolve the file URL.

Any further help is much appreciated.

-- Jim


Re: Windows File Shares

Posted by Renaud Richardet <re...@wyona.com>.
Jim Wilson wrote:
> Dear Nutch Users,
>
> Does anyone have experience indexing the contents of Windows file shares?
> There's information on the Wiki about indexing the local disk, but nothing
> about remote shares.
Have you already tried indexing the remote drives? Was Nutch able to
fetch them?
>
> Also, does Nutch traverse directories on its own, or does it require
> document links?  Thanks in advance.
On its own. Basically, if you crawl a directory, Nutch adds to its link
database all files in this directory, plus the parent directory (like
the listing you would get from an Apache server index.html file). See the
segment dump below. The depth determines how many levels Nutch will crawl.

ParseData::
Status: success(1,0)
Title: Index of /home/ren/testdata/stable/standardSentence
Outlinks: 9
  outlink: toUrl: file:/home/ren/testdata/stable/ anchor: ../
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.zip anchor: test.zip
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.odp anchor: test.odp
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.ppt anchor: test.ppt
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.ods anchor: test.ods
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.xls anchor: test.xls
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.odt anchor: test.odt
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.pdf anchor: test.pdf
  outlink: toUrl: file:/home/ren/testdata/stable/standardSentence/test.rtf anchor: test.rtf
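
To make the idea concrete, here is a small sketch of the behaviour the dump
above illustrates: a directory yields its parent plus every entry as
outlinks, and the depth caps how many rounds of link-following happen. This
is only an illustration using java.io.File; it is not Nutch's own code, the
class and method names are made up, and it assumes Unix-style paths like
those in the dump:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DirectoryOutlinks {

    /** Outlinks of a directory: its parent, then its entries. Plain files have none. */
    public static List<File> outlinks(File location) {
        List<File> links = new ArrayList<>();
        File absolute = location.getAbsoluteFile();
        if (!absolute.isDirectory()) {
            return links;
        }
        File parent = absolute.getParentFile();
        if (parent != null) {
            links.add(parent); // corresponds to the "../" anchor in the dump
        }
        File[] children = absolute.listFiles();
        if (children != null) {
            Arrays.sort(children); // stable order for readability
            links.addAll(Arrays.asList(children));
        }
        return links;
    }

    /** Renders a file as a file: URL in the style of the dump (Unix paths assumed). */
    public static String toUrl(File file) {
        String path = file.getAbsolutePath();
        boolean needsSlash = file.isDirectory() && !path.endsWith("/");
        return "file:" + path + (needsSlash ? "/" : "");
    }

    /** Follows outlinks breadth-first for at most `depth` rounds, starting at the seed. */
    public static Set<String> crawl(File seed, int depth) {
        Set<String> seen = new LinkedHashSet<>();
        List<File> frontier = new ArrayList<>();
        frontier.add(seed.getAbsoluteFile());
        seen.add(toUrl(seed));
        for (int round = 0; round < depth; round++) {
            List<File> next = new ArrayList<>();
            for (File current : frontier) {
                for (File link : outlinks(current)) {
                    if (seen.add(toUrl(link))) {
                        next.add(link); // newly discovered, fetch in the next round
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        File seed = new File(args.length > 0 ? args[0] : ".");
        for (String url : crawl(seed, 1)) {
            System.out.println(url);
        }
    }
}
```

Note that because the parent directory is itself an outlink, a large enough
depth lets the crawl climb out of the seed directory unless the URL filter
stops it.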

HTH,
Renaud

-- 
Renaud Richardet
COO America
Wyona    -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                  mobile +1 617 230 9112
renaud.richardet <at> wyona.com           http://www.wyona.com