Posted to dev@nutch.apache.org by Jason Tang <ja...@commcentral.com> on 2005/04/20 04:16:41 UTC
Re: [Nutch-dev] filesystem indexing
Hi
Is anyone working on this issue? If not, I will go ahead.
I suppose it is not hard to support "indexing locally and searching remotely".
Best regards,
/Jack
======= At 2005-04-14, 05:16:47 you wrote: =======
>Hi all,
>
>sorry to ask the same question on the user mailing list, but I didn't
>get any answer to my problem.
>
>I have a filesystem with files to index.
>-> no problem to index the files.
>I want to search them remotely via the WAR using Tomcat.
>-> no problem after moving the segments to the correct location
>When I search now, the links to the files all start with "file://..."
>-> This does not work remotely. :(
>
>My question is now:
>Is there a better way to solve this problem than modifying the JSP pages
>to replace "file://..." with "http://<server name>/..."?
>
>The preferred solution would be correct links in the index/segments so I can
>narrow the search by supplying part of the URL.
>I'm even thinking about indexing on one machine and then moving the
>segments to another machine that serves them.
>But for that I would have to modify the links to the documents as well ...
>
>Any ideas? Thank you ...
>_______________________________________________
>Nutch-developers mailing list
>Nutch-developers@lists.sourceforge.net
>https://lists.sourceforge.net/lists/listinfo/nutch-developers
Re: [Nutch-dev] filesystem indexing
Posted by Kragen Sitaker <ks...@commerce.net>.
On Wed, 2005-04-20 at 09:55 -0700, Doug Cutting wrote:
> Jason Tang wrote:
> > Is anyone working on this issue [hiding file URLs when doing a remote search]?
> > If not, I will go ahead.
> > I suppose it is not hard to support "indexing locally and searching remotely".
>
> A simple way to implement this would be to change the protocol-file
> plugin to handle http urls (add protocol-name="http" in plugin.xml),
> then modify FileResponse.java to optionally accept http urls and convert
> them to pathnames relative to some root directory. Does that make sense?
Modifying the JSP sounds simpler for any particular installation. For
more general use, there's probably a general need for
Nutch-visible-URL-to-externally-visible-URL translation at display time
too. For example, at one time we ran Nutch against an internal web
server with a mirror of a bunch of content that lived at some
externally-accessible URL; we wanted the search results to display the
externally-accessible URLs.
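
A display-time translation of this kind could be sketched as a simple prefix rewrite. This is only an illustration: the class DisplayUrlTranslator, its methods, and the example.com URLs are hypothetical and not part of Nutch; the search JSP (or whatever renders the hits) would have to call something like translate() on each hit URL before printing it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: rewrites indexed URL prefixes for display. */
final class DisplayUrlTranslator {
    // Insertion order is kept, so mappings registered earlier take precedence.
    private final Map<String, String> prefixes = new LinkedHashMap<>();

    /** Registers a mapping from an indexed prefix to an externally visible one. */
    DisplayUrlTranslator map(String indexedPrefix, String visiblePrefix) {
        prefixes.put(indexedPrefix, visiblePrefix);
        return this;
    }

    /** Returns the externally visible URL, or the input unchanged if no prefix matches. */
    String translate(String indexedUrl) {
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            if (indexedUrl.startsWith(e.getKey())) {
                return e.getValue() + indexedUrl.substring(e.getKey().length());
            }
        }
        return indexedUrl;
    }

    public static void main(String[] args) {
        // Assumed layout: files indexed under /data/docs are served by an
        // HTTP server under http://files.example.com/docs/.
        DisplayUrlTranslator t = new DisplayUrlTranslator()
            .map("file:///data/docs/", "http://files.example.com/docs/");
        System.out.println(t.translate("file:///data/docs/reports/q1.txt"));
    }
}
```

Since the rewrite happens only at display time, the index and segments keep the original URLs, so narrowing a search by URL would still have to use the indexed form.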
Last time I was doing filesystem indexing (with Nutch 0.5), I ran into a
bunch of minor problems:
- copying the entire filesystem into my segment directories was
undesirable, but mandatory
- limits on file size and number of outgoing links per "page" weren't
helpful
- if a directory name ended up in Nutch without a trailing slash
(file:///home/kragen rather than file:///home/kragen/), the relative
links from it were wrong.
- directories had links to "..", so three passes of crawling
from /home/kragen/a/b/c would index everything three levels down from
there, but also /home/kragen, /home/kragen/a/*,
and /home/kragen/a/b/*/*, which wasn't what I wanted.
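
The trailing-slash and ".." problems can be reproduced with plain java.net.URI, independent of Nutch. The sketch below is only an illustration; the class UrlScopeDemo and its helper methods are made up for this example, and staysUnder() is one possible scope filter, not something Nutch 0.5 provided as far as I know.

```java
import java.net.URI;

/** Hypothetical demo class, not part of Nutch. */
final class UrlScopeDemo {
    /** Resolves a relative link against a base URL and returns the resulting path. */
    static String resolvedPath(String base, String link) {
        return URI.create(base).resolve(link).getPath();
    }

    /** Appends a trailing slash so the URL is treated as a directory. */
    static String asDirectory(String url) {
        return url.endsWith("/") ? url : url + "/";
    }

    /** True if the candidate URL, once normalized, lies under the root directory URL. */
    static boolean staysUnder(String rootDirUrl, String candidateUrl) {
        String root = URI.create(asDirectory(rootDirUrl)).normalize().getPath();
        String path = URI.create(candidateUrl).normalize().getPath();
        return path.startsWith(root);
    }

    public static void main(String[] args) {
        // Without the trailing slash, "kragen" is treated as a file name and
        // replaced by the link, so the link resolves one level too high.
        System.out.println(resolvedPath("file:///home/kragen", "notes.txt"));
        System.out.println(resolvedPath("file:///home/kragen/", "notes.txt"));

        // A ".." link from the crawl root leads outside it; a filter like
        // staysUnder() could be used to drop such links before fetching.
        String root = "file:///home/kragen/a/b/c/";
        String parent = URI.create(root).resolve("..").toString();
        System.out.println(staysUnder(root, parent));
    }
}
```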
Also, Nutch was noticeably slower than Lucene, for whatever reason, and
that was more noticeable when the data was coming from a
300-megabit-per-second hard disk than a 1-megabit-per-second network
link.
Re: [Nutch-dev] filesystem indexing
Posted by Boris Kröger <bo...@cip.wiwi.uni-karlsruhe.de>.
Hi Doug,
>> Is anyone working on this issue? If not, I will go ahead.
>> I suppose it is not hard to support "indexing locally and searching
>> remotely".
>
>
> A simple way to implement this would be to change the protocol-file
> plugin to handle http urls (add protocol-name="http" in plugin.xml),
> then modify FileResponse.java to optionally accept http urls and
> convert them to pathnames relative to some root directory. Does that
> make sense?
That sounds good. Ideally, the conversion to pathnames would be
configurable in the global config.
Does this also cover the following scenario?
I want to index local files and web sites from the net that cover the
same topic. Later I want to merge these two segments and put them on the
search server.
During search I want to be able to narrow the search to only my local
files, for example. As far as I understand, this would be done using the
url option in the search.
But this would work only if the files are indexed this way.
Boris
Re: [Nutch-dev] filesystem indexing
Posted by Doug Cutting <cu...@nutch.org>.
Jason Tang wrote:
> Is anyone working on this issue? If not, I will go ahead.
> I suppose it is not hard to support "indexing locally and searching remotely".
A simple way to implement this would be to change the protocol-file
plugin to handle http urls (add protocol-name="http" in plugin.xml),
then modify FileResponse.java to optionally accept http urls and convert
them to pathnames relative to some root directory. Does that make sense?
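
The URL-to-pathname conversion might look roughly like the sketch below. It is a standalone illustration, not the actual FileResponse.java code: the class HttpUrlToPath, the method toLocalPath, the base URL, and the root directory are all assumptions made up for this example. It also rejects URLs that would resolve outside the root directory.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of Nutch: maps an http URL to a file under a root directory. */
final class HttpUrlToPath {
    /**
     * Converts a URL under baseUrl to a path under rootDir.
     * Throws IllegalArgumentException if the URL is not under baseUrl
     * or would resolve to a location outside rootDir.
     */
    static Path toLocalPath(URI url, URI baseUrl, Path rootDir) {
        // relativize() returns the URL unchanged (still absolute) when it is
        // not located under the base URL.
        URI relative = baseUrl.relativize(url);
        if (relative.isAbsolute()) {
            throw new IllegalArgumentException("URL is not under " + baseUrl + ": " + url);
        }
        Path root = rootDir.normalize();
        // getPath() decodes percent-escapes such as %20.
        Path resolved = root.resolve(relative.getPath()).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("URL escapes the root directory: " + url);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Assumed configuration: both values would come from the Nutch config.
        URI base = URI.create("http://files.example.com/docs/");
        Path root = Paths.get("/data/docs");
        System.out.println(
            toLocalPath(URI.create("http://files.example.com/docs/reports/q1.txt"), base, root));
    }
}
```

With a mapping like this, the index would contain http URLs from the start, so no rewriting is needed at display time and searches can be narrowed by URL.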
Doug