Posted to dev@nutch.apache.org by Jason Tang <ja...@commcentral.com> on 2005/04/20 04:16:41 UTC

Re: [Nutch-dev] filesystem indexing

Hi

Is anyone working on this issue? If not, I will take it on.
I suppose it is not hard to support "indexing locally and searching remotely".

Best regards, 
/Jack  
======= At 2005-04-14, 05:16:47 you wrote: =======

>Hi all,
>
>Sorry to repeat a question I already asked on the user mailing list, but I 
>didn't get any answer there.
>
>I have a filesystem with files to index.
>-> No problem indexing the files.
>I want to search them remotely via the WAR using Tomcat.
>-> No problem after moving the segments to the correct location.
>When I search now, the links to the files all start with "file://...".
>-> This does not work remotely. :(
>
>My question is now:
>Is there a better way to solve this problem than modifying the JSP pages 
>to replace "file://..." with "http://<server name>/..."?
>
>The preferred solution would be correct links in the index/segments, so I can 
>narrow the search by supplying part of the URL.
>I'm even thinking about indexing on one machine and then moving the 
>segments to another machine that serves them.
>But for that I would have to modify the links to the documents as well ...
>
>Any ideas?  Thank you ...
>
>_______________________________________________
>Nutch-developers mailing list
>Nutch-developers@lists.sourceforge.net
>https://lists.sourceforge.net/lists/listinfo/nutch-developers




Re: [Nutch-dev] filesystem indexing

Posted by Kragen Sitaker <ks...@commerce.net>.
On Wed, 2005-04-20 at 09:55 -0700, Doug Cutting wrote:
> Jason Tang wrote:
> > Is anyone working on this issue [hiding file URLs when doing a remote search]?
> > If not, I will take it on.
> > I suppose it is not hard to support "indexing locally and searching remotely".
> 
> A simple way to implement this would be to change the protocol-file 
> plugin to handle http urls (add protocol-name="http" in plugin.xml), 
> then modify FileResponse.java to optionally accept http urls and convert 
> them to pathnames relative to some root directory.  Does that make sense?

Modifying the JSP sounds simpler for any particular installation. For
more general use, there's probably a general need for
Nutch-visible-URL-to-externally-visible-URL translation at display time
too.  For example, at one time we ran Nutch against an internal web
server with a mirror of a bunch of content that lived at some
externally-accessible URL; we wanted the search results to display the
externally-accessible URLs.
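
To make the display-time translation idea concrete, here is a minimal sketch in Java. It is not part of Nutch: the class name DisplayUrlTranslator, the prefix rules, and the example host names are all assumptions made for illustration. The idea is simply to keep an ordered list of "indexed prefix -> visible prefix" rules and apply the first one that matches when a hit is rendered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Nutch class): rewrites indexed URLs for display. */
final class DisplayUrlTranslator {
    // Ordered rules: indexed URL prefix -> externally visible URL prefix.
    private final Map<String, String> rules = new LinkedHashMap<>();

    DisplayUrlTranslator addRule(String indexedPrefix, String visiblePrefix) {
        rules.put(indexedPrefix, visiblePrefix);
        return this;
    }

    /** Replaces the first matching prefix; returns the URL unchanged if no rule matches. */
    String translate(String indexedUrl) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (indexedUrl.startsWith(rule.getKey())) {
                return rule.getValue() + indexedUrl.substring(rule.getKey().length());
            }
        }
        return indexedUrl;
    }

    public static void main(String[] args) {
        // Example prefixes are made up for this sketch.
        DisplayUrlTranslator t = new DisplayUrlTranslator()
            .addRule("file:///srv/docs/", "http://search.example.com/docs/");
        // prints http://search.example.com/docs/manual/intro.html
        System.out.println(t.translate("file:///srv/docs/manual/intro.html"));
        // prints http://other.example.org/page.html (no rule matches)
        System.out.println(t.translate("http://other.example.org/page.html"));
    }
}
```

Something like this could be called from the JSP that renders each hit, so the index itself keeps the original URLs.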

Last time I was doing filesystem indexing (with Nutch 0.5), I ran into a
bunch of minor problems:
- copying the entire filesystem into my segment directories was
undesirable, but mandatory
- limits on file size and number of outgoing links per "page" weren't
helpful
- if a directory name ended up in Nutch without a trailing slash
(file:///home/kragen rather than file:///home/kragen/), the relative
links from it were wrong.
- directories had links to "..", so three passes of crawling
from /home/kragen/a/b/c would index everything three levels down from
there, but also /home/kragen, /home/kragen/a/*,
and /home/kragen/a/b/*/*, which wasn't what I wanted.
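
The last two problems can be reproduced with plain java.net.URI, independent of Nutch. The sketch below is only an illustration under that assumption: FileLinkCheck and its methods are hypothetical names, and the containment test is one possible way to drop ".." links, not something Nutch does.

```java
import java.net.URI;

/** Hypothetical illustration (not Nutch code) of the trailing-slash and ".." issues. */
final class FileLinkCheck {
    /** Resolves a relative link against a base URL and returns the resulting path. */
    static String resolvePath(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).normalize().getPath();
    }

    /** True if path is rootDir itself or lies beneath it; rootDir must end with '/'. */
    static boolean isUnderRoot(String rootDir, String path) {
        String p = path.endsWith("/") ? path : path + "/";
        return p.startsWith(rootDir);
    }

    public static void main(String[] args) {
        // Without the trailing slash, "kragen" is treated as a file name and replaced.
        System.out.println(resolvePath("file:///home/kragen", "a"));   // prints /home/a
        System.out.println(resolvePath("file:///home/kragen/", "a"));  // prints /home/kragen/a

        // A ".." link leads out of the crawl root, so a containment check can reject it.
        String root = "/home/kragen/a/b/c/";
        String parent = resolvePath("file://" + root, "..");
        System.out.println(isUnderRoot(root, parent));                 // prints false
    }
}
```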

Also, Nutch was noticeably slower than Lucene, for whatever reason, and
that was more noticeable when the data was coming from a
300-megabit-per-second hard disk than a 1-megabit-per-second network
link.



Re: [Nutch-dev] filesystem indexing

Posted by Boris Kröger <bo...@cip.wiwi.uni-karlsruhe.de>.
Hi Doug,

>> Is anyone working on this issue? If not, I will take it on.
>> I suppose it is not hard to support "indexing locally and searching 
>> remotely".
>
>
> A simple way to implement this would be to change the protocol-file 
> plugin to handle http urls (add protocol-name="http" in plugin.xml), 
> then modify FileResponse.java to optionally accept http urls and 
> convert them to pathnames relative to some root directory.  Does that 
> make sense?

That sounds good. Ideally, the conversion to pathnames would 
be configurable in the global config.
Does this also cover the following scenario?
I want to index local files and also web sites on the same topic. 
Later I want to merge the two sets of segments and put them on the 
search server.
When searching, I want to be able to narrow the search to, for example, 
only my local files. As far as I understand, this would be done using the 
url option in the search.
But this would only work if the files are indexed this way.

Boris

Re: [Nutch-dev] filesystem indexing

Posted by Doug Cutting <cu...@nutch.org>.
Jason Tang wrote:
> Is anyone working on this issue? If not, I will take it on.
> I suppose it is not hard to support "indexing locally and searching remotely".

A simple way to implement this would be to change the protocol-file 
plugin to handle http urls (add protocol-name="http" in plugin.xml), 
then modify FileResponse.java to optionally accept http urls and convert 
them to pathnames relative to some root directory.  Does that make sense?
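
As a rough sketch of the conversion step only: the code below is not taken from FileResponse.java, and the class name HttpToFileMapper, the URL prefix, and the root directory are assumptions made for illustration. It maps an http URL to a path under a configured root directory and rejects URLs that would escape that root.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not Nutch code): maps http URLs to files under a root directory. */
final class HttpToFileMapper {
    private final String urlPrefix; // e.g. "http://docs.example.com/", must end with '/'
    private final Path rootDir;     // directory the URL prefix corresponds to

    HttpToFileMapper(String urlPrefix, String rootDir) {
        this.urlPrefix = urlPrefix;
        this.rootDir = Paths.get(rootDir).normalize();
    }

    /** Converts a URL under urlPrefix to a path under rootDir. */
    Path toPath(String url) {
        if (!url.startsWith(urlPrefix)) {
            throw new IllegalArgumentException("URL is outside the mapped prefix: " + url);
        }
        // getPath() decodes percent-escapes such as %20.
        int prefixPathLength = URI.create(urlPrefix).getPath().length();
        String relative = URI.create(url).getPath().substring(prefixPathLength);
        Path resolved = rootDir.resolve(relative).normalize();
        // Reject "../" tricks and absolute paths that leave the root directory.
        if (!resolved.startsWith(rootDir)) {
            throw new IllegalArgumentException("URL escapes the root directory: " + url);
        }
        return resolved;
    }

    public static void main(String[] args) {
        HttpToFileMapper mapper = new HttpToFileMapper("http://docs.example.com/", "/srv/docs");
        System.out.println(mapper.toPath("http://docs.example.com/manual/intro.html"));
    }
}
```

The root directory and URL prefix are the two values that would need to come from configuration.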

Doug
