Posted to user@nutch.apache.org by Benny <be...@gmail.com> on 2005/08/22 20:53:51 UTC

Index local file.

Hi,

Can someone give me some hints on how to index local files?

I have a lot of plain HTML files (more than 50K pages, around
2-3k/page). I would prefer not to put them behind a web server and
index them by URL; I'd like Nutch to index them from the local hard
disk. Is that possible? If it is, what kind of URL do I need to inject
into the db? For example, with a web server we would use

http://domain/file.html 

What is the format for a file on the local hard disk? I assume it is
no longer "http", so what is the protocol supposed to be? The files
are still in plain HTML format.


Benny

Re: [Nutch-general] Index local file.

Posted by praveen pathiyil <pa...@gmail.com>.
Hi Benny,

Check out this mail thread

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00340.html

HTH,
Praveen.


Re: Index local file.

Posted by g e k k o k i d <me...@gekkokid.org.uk>.
You could index the HTML files directly with Lucene. There are a few
sample chapters at http://lucenebook.com to get you acquainted with
Lucene's API.

best of luck :)

gk
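
To make the "read the files straight from disk and index them" idea
concrete, here is a minimal sketch in Python using only the standard
library. It is a toy inverted index, not Lucene and not Lucene's API:
the names TextExtractor, html_to_text, build_index and search are
hypothetical, and it assumes the files end in .html and are readable as
UTF-8.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of script/style elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, text nodes joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_index(root):
    """Map each lowercase word to the set of file paths that contain it."""
    index = defaultdict(set)
    for path in Path(root).rglob("*.html"):
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(str(path))
    return index


def search(index, query):
    """Return the files that contain every word of the query (AND semantics)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

Calling build_index on the directory that holds the pages and then
search(index, "some words") returns the matching file paths. A real
Lucene index adds analysis, ranking and on-disk storage on top of the
same basic loop of reading each file, extracting its text and adding it
to the index.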


RE: Index local file.

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Benny,


Nutch is a generic web search engine with distributed file system
support (NFS, very large crawls and indexes), and it is also a
framework. It has plugins, so you can probably write and use a "FILE"
protocol plugin instead of "PROTOCOL-HTTP".

However, what about hyperlinks and anchors?

Indexing the presentation layer of third-party sites is difficult and
fuzzy. HTML carries formatting, and 95% of the extracted plain text
(select options, header, footer, menu, reviews, ...) does not really
need to be indexed.

If you need to index local files, the best approach is to use Lucene
directly, possibly with the org.apache.nutch.searcher package for a web
front end (if you really need one), especially if you have access to
the data layer and can bypass the presentation layer (such as HTML).

For all intranet-related tasks, use Lucene.

If you have only a small amount of HTML, you can index your web server
directly via HTTP without a performance impact. It's easy, but without
any filtering logic you will index everything.


Regards,
Fuad
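
On the original question of what the URLs look like when the protocol
is not "http": local files are normally addressed with the file: scheme,
for example file:///path/to/file.html. The sketch below, in Python with
only the standard library, walks a directory and produces one file: URL
per HTML page, which could then be written to a seed list. The function
name file_urls is hypothetical. It is also an assumption to verify
against your own Nutch version that a file protocol plugin (Nutch's
plugin is named protocol-file) is enabled in plugin.includes and that
your URL filters do not reject file: URLs.

```python
import tempfile
from pathlib import Path


def file_urls(root, suffixes=(".html", ".htm")):
    """Return sorted file: URLs for every HTML file under root."""
    root = Path(root).resolve()  # as_uri() only works on absolute paths
    return sorted(
        p.as_uri()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    # Demo on a throwaway directory; point file_urls() at the real
    # document root instead.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "index.html").write_text("<p>hello</p>", encoding="utf-8")
        for url in file_urls(tmp):
            print(url)
```

Path.as_uri() also percent-encodes characters such as spaces, so the
resulting lines are valid URLs even for awkward file names.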

