Posted to java-user@lucene.apache.org by Adrian Dumitru <ct...@altonsys.com> on 2004/05/26 23:31:17 UTC

classic scenario

I salute the Lucene community!
It would be a great help to me to get your opinions on the
following issue. I know I could have found more answers to my questions by
reading the documentation, but I did invest some time in this and still
have these questions:

I am (also) building a web crawler, a topic-specific one to be more
precise, for a vortal. I recently learned about Lucene and I'd very much
like to use it to handle keyword-specific searches on the info
that I collect.
I suspect this is a "classic" project, at least for Lucene, and something
like this has probably been addressed already on this discussion list. I'm
interested to hear any experience anyone might have with this subject.
My crawler goes out on the internet, extracts/parses/ranks and saves
websites. Most of the information is also categorized and stored in the
database, but I also save about 10 top pages from each site in the filesystem.
The first question is: should I care about indexing these files at the
time I extract them from the internet? Or should I index them later, when I
make them available for search?
If yes, then can I still name my files the way I want? (i.e. are there any
constraints on the filenames from Lucene's perspective?)
Is it an OK idea to have the same file repository (or index) where the
crawler writes (indexes files) and the search function searches? I guess
performance issues are important here.
Can I still organize the files that I save the way I want? (I planned to
write the files from each website into a separate folder, named after the
site's id in my database)
I maintain a taxonomy (list of categories)...each website will fall into
one or more of these categories, also each website will have a rank. Does
Lucene have something that I should be aware of related to what I said?

I guess that's it for now...this is more like a pet project for me, a pet
which keeps growing :) I wouldn't mind any help and opinions you can
provide, source code samples, etc.

Big thanks in advance and good luck with your work.
adrian.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: classic scenario

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hello,

Answers inlined.

--- Adrian Dumitru <ct...@altonsys.com> wrote:

> I am (also) building a web crawler, a topic-specific one to be more
> precise, for a vortal. I recently learned about Lucene and I'd very
> much like to use it to handle keyword-specific searches on the info
> that I collect.
> I suspect this is a "classic" project, at least for Lucene, and
> something like this has probably been addressed already on this
> discussion list. I'm interested to hear any experience anyone might
> have with this subject.

See http://www.nutch.org/
It may make sense to join Nutch, contribute patches that help you, etc.
instead of building your own crawler from scratch.

> My crawler goes out on the internet, extracts/parses/ranks and saves
> websites. Most of the information is also categorized and stored in
> the database, but I also save about 10 top pages from each site in the
> filesystem.
> The first question is: should I care about indexing these files at
> the time I extract them from the internet? Or should I index them
> later, when I make them available for search?

Lucene does not care about files and is not limited to indexing files. 
It sounds like you tried the Lucene demo that indexes files in the file
system.

However, indexing in batch instead of as you crawl may be a more
scalable, cleaner, and more manageable approach.  Nutch uses that
approach for a reason. :)
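
To make the batch idea concrete, here is a rough sketch in plain Java.
It is not Nutch or Lucene code.  It assumes the crawler has already
saved pages under pagesRoot/<siteId>/<fileName>, and PageIndexer is a
hypothetical callback standing in for whatever code adds a document to
your Lucene index.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical callback; a real one would add a document to the index. */
interface PageIndexer {
    void index(String siteId, String fileName, String content) throws IOException;
}

/** Batch pass: walks the crawler's output once and indexes every saved page. */
final class BatchIndexRun {

    /**
     * Assumes the layout pagesRoot/<siteId>/<fileName>, one folder per site.
     * Files at any other depth are ignored.
     * Returns the number of pages handed to the indexer.
     */
    static int run(Path pagesRoot, PageIndexer indexer) throws IOException {
        List<Path> pages = new ArrayList<>();
        try (Stream<Path> walk = Files.walk(pagesRoot, 2)) {
            walk.filter(Files::isRegularFile)
                .filter(p -> pagesRoot.relativize(p).getNameCount() == 2)
                .sorted()
                .forEach(pages::add);
        }
        for (Path page : pages) {
            // The parent folder name is taken to be the site's database id.
            String siteId = page.getParent().getFileName().toString();
            // Assumes pages were saved as UTF-8 text.
            String content = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
            indexer.index(siteId, page.getFileName().toString(), content);
        }
        return pages.size();
    }
}
```

The crawler never touches the index in this arrangement; it only writes
files, and the batch run can be scheduled or repeated independently.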

> If yes, then can I still name my files the way I want? (i.e. are
> there any constraints on the filenames from Lucene's perspective?)

No constraints.

> Is it an OK idea to have the same file repository (or index) where
> the crawler writes (indexes files) and the search function searches?

Not a good idea.  Keep your Lucene index directory clean, and use it
only as an index directory.  Write your files elsewhere, I would
suggest.

> I guess performance issues are important here.
> Can I still organize the files that I save the way I want? (I planned
> to write the files from each website into a separate folder, named
> after the site's id in my database)

That is up to you and your application.  I just suggest you keep that
outside the index directory, in order to keep things clean, well
organized, and such.
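
One possible way to lay this out, as a plain-Java sketch: the index and
the saved pages live in sibling directories under one base directory,
and each site gets a folder named after its database id.  The class
name and the "index" and "pages" directory names are only assumptions
for illustration; Lucene does not require any of them.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Keeps the index directory and the crawler's page store apart. */
final class CrawlLayout {
    private final Path indexDir;
    private final Path pagesRoot;

    CrawlLayout(Path baseDir) {
        // Sibling directories: nothing but index files goes into indexDir.
        this.indexDir = baseDir.resolve("index").normalize();
        this.pagesRoot = baseDir.resolve("pages").normalize();
    }

    Path indexDir() {
        return indexDir;
    }

    /** Folder for one website, named after its database id. */
    Path siteDir(long siteId) {
        return pagesRoot.resolve(Long.toString(siteId));
    }

    /** File for one saved page; rejects names that escape the site folder. */
    Path pageFile(long siteId, String fileName) {
        Path dir = siteDir(siteId);
        Path file = dir.resolve(fileName).normalize();
        if (!file.startsWith(dir)) {
            throw new IllegalArgumentException("file name escapes site folder: " + fileName);
        }
        return file;
    }
}
```

With this, new CrawlLayout(Paths.get("data")).pageFile(42, "home.html")
resolves to data/pages/42/home.html, and the index stays in data/index.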

> I maintain a taxonomy (list of categories)...each website will fall
> into one or more of these categories, also each website will have a
> rank. Does Lucene have something that I should be aware of related to
> what I said?

Lucene ranks search result items.  Look at the Similarity and
DefaultSimilarity classes.  It sounds like you may benefit from having
a custom Similarity that is aware of your categories.
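
To show the general idea of mixing your own rank and categories into
result ordering, here is a plain-Java sketch.  It does not use or
imitate Lucene's Similarity API; it re-sorts hits after the search.
The Hit class, the rankWeight parameter, and the formula
textScore * (1 + rankWeight * siteRank) are all assumptions made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Hypothetical search hit: a text relevance score plus your own site data. */
final class Hit {
    final String url;
    final float textScore;        // relevance score from the search engine
    final float siteRank;         // your own rank for the site, assumed >= 0
    final Set<String> categories; // taxonomy categories the site falls into

    Hit(String url, float textScore, float siteRank, Set<String> categories) {
        this.url = url;
        this.textScore = textScore;
        this.siteRank = siteRank;
        this.categories = categories;
    }
}

final class RankAwareSorter {

    /** Illustrative formula only: boosts the text score by the site rank. */
    static float combined(Hit h, float rankWeight) {
        return h.textScore * (1.0f + rankWeight * h.siteRank);
    }

    /** Keeps hits in the given category (null keeps all), best combined score first. */
    static List<Hit> rerank(List<Hit> hits, String category, float rankWeight) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : hits) {
            if (category == null || h.categories.contains(category)) {
                kept.add(h);
            }
        }
        kept.sort(Comparator.comparingDouble((Hit h) -> combined(h, rankWeight)).reversed());
        return kept;
    }
}
```

A custom Similarity would change how scores are computed inside the
search itself, while this sketch only reorders results afterwards.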

> I guess that's it for now...this is more like a pet project for me, a
> pet which keeps growing :) I wouldn't mind any help and opinions you
> can provide, source code samples, etc.

If this is really a pet project, perhaps joining Nutch will also be fun
for you.  Some recent Nutch contributors are also Lucene users.

Otis
