Posted to java-user@lucene.apache.org by Saravana <ms...@gmail.com> on 2007/03/01 16:20:38 UTC

Re: [Fwd: Re: indexing performance]

Hi,

> You need just the counts? And you want to do just whole-field matching, not
> word matching? In that case, Lucene might be overkill for you. Or, if you
> do use Lucene, make sure to use "keyword" (untokenized) fields, not
> "tokenized" fields.

Sorry for not explaining my requirement in more detail. Some of my fields
need word matching and some do not. I have used NO_NORMS for the
whole-value fields and TOKENIZED for the fields that need word matching. I
need the count, and I also need to show the fields that are indexed.
For example, the user can give the following criteria:

USER:john AND MSG:ftp

Here USER is a NO_NORMS field and MSG is a tokenized field. The original
log message looks like this:

2007 Jan 27 10:10:01 User John accessed ftp url images.html
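
To make the two kinds of matching concrete, here is a minimal plain-Java
sketch. It is not Lucene code: the LogRecord and FieldMatchSketch names are
hypothetical, and lowercasing both fields is my assumption so that the
query term "john" matches the logged "John". USER is compared as one whole
value, MSG is split on whitespace and matched word by word, and the
original line is kept so it can be shown with the hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java stand-in for one indexed log record (hypothetical, not Lucene). */
final class LogRecord {
    final String user;          // matched as one whole value (untokenized)
    final Set<String> msgTerms; // matched word by word (tokenized)
    final String original;      // kept so the hit can be displayed

    LogRecord(String user, String message) {
        // Assumption: both fields are lowercased so "john" matches "John".
        this.user = user.toLowerCase(Locale.ROOT);
        this.msgTerms = new HashSet<>(
            Arrays.asList(message.toLowerCase(Locale.ROOT).split("\\s+")));
        this.original = message;
    }
}

final class FieldMatchSketch {
    /** Returns the records that satisfy USER:user AND MSG:word. */
    static List<LogRecord> search(List<LogRecord> records, String user, String word) {
        String u = user.toLowerCase(Locale.ROOT);
        String w = word.toLowerCase(Locale.ROOT);
        List<LogRecord> hits = new ArrayList<>();
        for (LogRecord r : records) {
            // Whole-value equality for USER, term membership for MSG.
            if (r.user.equals(u) && r.msgTerms.contains(w)) {
                hits.add(r);
            }
        }
        return hits;
    }
}
```

The count is the size of the returned list, and each hit carries its
original line. Note that "Johnny" would not match USER:john here, because
the whole value is compared rather than individual words.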

Because the criteria are selected by the user at query time and are not
predefined, I cannot work out the counts in memory ahead of time. Moreover,
I have read the following thread, dated 2002.


Thread from 2002:

My experience is that writing to the index takes the most time, apart from
any parsing done by the user. I have been working on XML indexes, and there
collecting the data takes just as much time as writing it. To increase speed
I have done three things that reduced my indexing time from 11 hours to 2.5
hours for the same dataset (1.3 GB of XML documents).

1: I index 50 documents into a ramdir; when that limit is reached I merge
the ramdir into an fsdir and flush the ramdir. This speeds things up because
I use the fsdir less often, and the ramdir is much faster.

2: Merging a large index into a large index takes nearly as much time as
merging a small index into a large index, so I have 4 (any number will do)
fsdirs that I write ramdirs to, and I merge these fsdirs into one large
fsdir at the end of a large index run.

3: I multithreaded my application: each worker thread indexes into its own
separate ramdir and then flushes it into its own separate fsdir (hence I
have an fsdir for each worker thread), because only one thread can write to
a dir at a time.
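
The three steps above can be sketched in plain Java as follows. This is
only an illustration of the shape of the approach, not Lucene code: the
ParallelIndexSketch class and its methods are hypothetical, and ordinary
in-memory maps stand in for the ramdir, the per-worker fsdir, and the final
merged index. Only the batch size of 50 comes from the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelIndexSketch {
    static final int BUFFER_LIMIT = 50; // from the post: 50 documents per in-memory batch

    /** Appends every posting of 'from' to 'into' (stand-in for an index merge). */
    static void merge(Map<String, List<Integer>> into, Map<String, List<Integer>> from) {
        for (Map.Entry<String, List<Integer>> e : from.entrySet()) {
            into.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
        }
    }

    /** One worker: buffer in memory, flush into its own private index per batch. */
    static Map<String, List<Integer>> indexSlice(List<String> docs, int firstDocId) {
        Map<String, List<Integer>> workerIndex = new HashMap<>(); // stands in for this worker's fsdir
        Map<String, List<Integer>> buffer = new HashMap<>();      // stands in for the ramdir
        int buffered = 0;
        for (int i = 0; i < docs.size(); i++) {
            String[] words = docs.get(i).toLowerCase(Locale.ROOT).split("\\s+");
            // De-duplicate so each term records a document id at most once.
            for (String term : new LinkedHashSet<>(Arrays.asList(words))) {
                buffer.computeIfAbsent(term, k -> new ArrayList<>()).add(firstDocId + i);
            }
            if (++buffered == BUFFER_LIMIT) { // step 1: flush the full batch
                merge(workerIndex, buffer);
                buffer.clear();
                buffered = 0;
            }
        }
        merge(workerIndex, buffer); // flush the final partial batch
        return workerIndex;
    }

    /** Steps 2 and 3: one private index per worker, merged once at the end. */
    static Map<String, List<Integer>> indexAll(List<String> docs, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int sliceSize = (docs.size() + workers - 1) / workers;
            List<Future<Map<String, List<Integer>>>> parts = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += sliceSize) {
                final int from = start;
                final int to = Math.min(start + sliceSize, docs.size());
                parts.add(pool.submit(() -> indexSlice(docs.subList(from, to), from)));
            }
            Map<String, List<Integer>> merged = new HashMap<>();
            for (Future<Map<String, List<Integer>>> part : parts) {
                merge(merged, part.get()); // single merge pass at the end of the run
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

No worker ever writes to another worker's index, which mirrors the
one-writer-per-dir constraint in step 3. Whether this arrangement is
actually faster with a real on-disk index depends on where the time goes,
so it is worth measuring before adopting it.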

In the end this improved my indexing time a lot...

hope some of this can help you!

Regards, Karl Øie



Does this still hold good today? Thanks for your reply.

regards,
MSK

---------- Forwarded message ----------
> From: "Nadav Har'El" <ny...@math.technion.ac.il>
> To: java-user@lucene.apache.org
> Date: Thu, 1 Mar 2007 10:28:07 +0200
> Subject: Re: indexing performance
> On Tue, Feb 27, 2007, Saravana wrote about "indexing performance":
> > Hi,
> >
> > Is it possible to scale lucene indexing like 2000/3000 documents per
> > second?
>
> I don't know about the actual numbers, but one trick I've used in the past
> to get really fast indexing was to create several independent indexes in
> parallel. Simply, if you have, say, 4 CPUs and perhaps even several
> physical disks, run 4 indexing processes each indexing a 1/4 of the files
> and creating a separate index (on separate disks on separate IO channels,
> if possible).
>
> At the end, you have 4 indexes which you can actually search together
> without any real need to merge them, unless query performance is very
> important to you as well.
>
> > I need to index 10 fields each with 20 bytes long.  I should be
> > able to search by just giving any of the field values as criteria. I
> > need to get the count that has same field values.
>
> You need just the counts? And you want to do just whole-field matching,
> not word matching? In that case, Lucene might be overkill for you. Or, if
> you do use Lucene, make sure to use "keyword" (untokenized) fields, not
> "tokenized" fields.
>
> --
> Nadav Har'El                        |Thursday, Mar  1 2007, 11 Adar 5767
> IBM Haifa Research Lab              |-----------------------------------------
>                                     |Open your arms to change, but don't let
> http://nadav.harel.org.il           |go of your values.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

Re: [Fwd: Re: indexing performance]

Posted by Mike Klaas <mi...@gmail.com>.
On 3/1/07, Saravana <ms...@gmail.com> wrote:

> Does this still hold good today? Thanks for your reply.

Probably most of that still applies to some extent.  However, it is
unclear whether it will speed up your application.

The first thing is to find out what your bottleneck is.  Looking at the
stats on your machine during indexing, is it I/O-bound? CPU-bound? Mixed?
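
If OS-level stats are awkward to read, one rough in-process check is to
compare the CPU time a thread consumed with the wall-clock time that
elapsed. The sketch below is only a heuristic, and the BottleneckSketch
name is hypothetical: a ratio near 1 suggests the measured work is
CPU-bound, while a ratio near 0 suggests the thread spent most of its time
waiting, for example on I/O. It measures the calling thread only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class BottleneckSketch {
    /**
     * Runs the task on the calling thread and returns cpuTime / wallTime.
     * Throws UnsupportedOperationException on JVMs that do not support
     * per-thread CPU time measurement.
     */
    static double cpuFraction(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long cpuStart = threads.getCurrentThreadCpuTime(); // nanoseconds
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = threads.getCurrentThreadCpuTime() - cpuStart;
        return wall == 0 ? 0.0 : (double) cpu / wall;
    }
}
```

Wrapping one batch of indexing work in cpuFraction(...) gives a first
hint. Work done on other threads is not counted, so this is a starting
point rather than a substitute for the machine-level stats.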

There are various possible strategies, but they will come from
fine-tuning your process to address the bottlenecks you are
experiencing.  If you are CPU-bound, then perhaps you can use less
intensive analyzers, or purchase a multi-CPU machine and index
with multiple threads.  If you are I/O-bound, you could 1) buy faster
disks, 2) use a faster I/O backend (e.g. RAID-0), or 3) create indexes on
multiple independent disks and merge them later.

regards,
-Mike