Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2011/01/28 10:16:38 UTC

How do you know when index.optimize has finished ?

I'm building six different indexes in series. At the end of building each
index I call optimize() and then close() on the writer, then move on to the
next one.
I build them in series because they extract their data from a
database and I don't want to overload the database.
However, the optimization takes a while, and because it doesn't affect
the db I want to start building the next index while the previous one is
still being optimized, by calling optimize(false). If I do this, how do I
know when the optimization has finished so that I can close the writer?

thanks Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How do you know when index.optimize has finished ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
You can call IW.waitForMerges().

Mike
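
A minimal sketch of the flow this answer makes possible: build each index in series, hand its optimize step to the background, carry on with the next build, and wait for the background work before closing. It uses only java.util.concurrent so that it runs on its own; IndexJob and runAll are hypothetical names, not Lucene API. In real code the three methods would wrap an IndexWriter, with optimize(false) starting the merges and waitForMerges() playing the role that join() plays here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class OverlappedIndexBuild {

    /** Hypothetical stand-in for one index and its writer; not a Lucene type. */
    interface IndexJob {
        void build();     // reads from the database, so builds run one at a time
        void optimize();  // no database access, so it may overlap the next build
        void close();     // must only run once optimize() has finished
    }

    static void runAll(List<IndexJob> jobs) {
        // A single background thread: optimizations queue up behind each other.
        ExecutorService background = Executors.newSingleThreadExecutor();
        try {
            List<CompletableFuture<Void>> optimizing = new ArrayList<>();
            for (IndexJob job : jobs) {
                job.build();  // serial, on the calling thread
                optimizing.add(CompletableFuture.runAsync(job::optimize, background));
            }
            for (int i = 0; i < jobs.size(); i++) {
                optimizing.get(i).join();  // wait for this index's optimize to finish
                jobs.get(i).close();       // only now is it safe to close
            }
        } finally {
            background.shutdown();
        }
    }
}
```

The database only ever sees one build at a time, while the optimize of the previous index runs alongside the next build.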



Re: How to index part numbers

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: How to index part numbers
: References: <4D...@fastmail.fm>
: In-Reply-To: <4D...@fastmail.fm>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message; instead, start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to, so your question is "hidden" in that thread and gets less
attention. It also makes following discussions in the mailing list archives
particularly difficult.



-Hoss



Re: How to index part numbers

Posted by Karolina Bernat <ka...@googlemail.com>.
Oh, okay. For the XML part we use Apache Digester and define rules to
enclose the correct elements. But I can't tell what the best way to
proceed is in your case, sorry. The steps you listed sound reasonable to
me.

If you want to get search hits for a part number range and highlight
'A123-56' when searching for A124, you would need to create a new token for
A124 (and for every other number in the range) and copy all of its
information (offsets, docId, ...), except for the term text, from the
'A123-56' token (I think).
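
A small sketch of that idea, under stated assumptions: Token, expand and highlighted are hypothetical names for illustration, not Lucene classes. Each expanded term gets its own token, but every token keeps the character offsets of the original 'A123-56' span, so whatever term matches, the highlighted text is the original notation.

```java
import java.util.ArrayList;
import java.util.List;

final class SharedOffsetTokens {

    /** Hypothetical token model: term text plus character offsets into the source text. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** One token per expanded term, each copying the offsets of the original token. */
    static List<Token> expand(Token original, List<String> expandedTerms) {
        List<Token> tokens = new ArrayList<>();
        for (String term : expandedTerms) {
            tokens.add(new Token(term, original.startOffset, original.endOffset));
        }
        return tokens;
    }

    /** The stretch of source text a highlighter would mark for a matching token. */
    static String highlighted(String source, Token hit) {
        return source.substring(hit.startOffset, hit.endOffset);
    }
}
```

In a real analysis chain the same copying would presumably happen inside a token filter; that part is not shown here.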


On Fri, Jan 28, 2011 at 1:45 PM, Wulf Berschin <be...@dosco.de> wrote:

> Hi Karolina,
>
> yes (of course!) We have an XML element for the part numbers, but up to now
> they are not all tagged, so we need regex matching as well...

Re: How to index part numbers

Posted by Erick Erickson <er...@gmail.com>.
I wonder if you can define the problem away? It sounds like
you have essentially random input here. That is, the users
can put in whatever they want so whatever you do will be wrong
sometime. Could you sidestep the problem with auto-complete
and prefix queries (essentially adding * to the user's input)?

That way, the user would see the exact input (A123-56 in
your example).

This assumes there's some kind of GUI front end, so I may
be way off base....

You could still let them search free-form if they really wanted,
but you wouldn't then have to try to figure out what the user
meant when they added A123,5,7....

FWIW
Erick
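
A rough sketch of the auto-complete idea, assuming the part numbers are collected exactly as they appear in the documents. PrefixSuggester is a hypothetical helper built on a sorted set, not a Lucene class; it only illustrates what "adding * to the user's input" amounts to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class PrefixSuggester {

    // Part numbers stored verbatim, kept in sorted order.
    private final NavigableSet<String> terms = new TreeSet<>();

    void add(String term) {
        terms.add(term);
    }

    /** Returns up to limit stored terms that start with the given prefix. */
    List<String> suggest(String prefix, int limit) {
        List<String> matches = new ArrayList<>();
        // All terms sharing a prefix are adjacent in sorted order, so start at
        // the prefix and stop at the first term that no longer matches.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || matches.size() >= limit) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }
}
```

Typing "A12" would then offer "A123-56" in the form the authors wrote it, and the user picks it rather than guessing the notation.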


On Fri, Jan 28, 2011 at 7:45 AM, Wulf Berschin <be...@dosco.de> wrote:

> Hi Karolina,
>
> yes (of course!) We have an XML element for the part numbers, but up to now
> they are not all tagged, so we need regex matching as well...

Re: How to index part numbers

Posted by Wulf Berschin <be...@dosco.de>.
Hi Karolina,

yes (of course!) We have an XML element for the part numbers, but up to
now they are not all tagged, so we need regex matching as well...

On 28.01.2011 13:31, Karolina Bernat wrote:
> Hi Wulf,
>
> can I ask whether it is structured documentation (like XML or SGML) you're
> dealing with? I ask because I also work with technical documentation and we
> do exactly what you're asking for, but with XML data.


-- 

Kind regards,

Wulf Berschin

--

<!-- *****************************************************************
* Wulf Berschin                            Phone:   +49 6221 1486 16 *
* DOSCO Document Systems Consulting GmbH   Fax:     +49 6221 1486 19 *
* Mannheimer Strasse 1                     E-Mail: berschin@dosco.de *
* 69115 Heidelberg, Germany                http://www.dosco.de       *
* Commercial register: Heidelberg HRB 335122                         *
* Managing director: Robert Erfle                                    *
****************************************************************** -->




Re: How to index part numbers

Posted by Karolina Bernat <ka...@googlemail.com>.
Hi Wulf,

can I ask whether it is structured documentation (like XML or SGML) you're
dealing with? I ask because I also work with technical documentation and we
do exactly what you're asking for, but with XML data.


On Fri, Jan 28, 2011 at 1:05 PM, Wulf Berschin <be...@dosco.de> wrote:

> Hi,
>
> I'm poking in the dark and hope someone has some light...
>
> We have part numbers in technical documentation to retrieve. For now we
> have a (long) regular expression to find those in a string. The part numbers
> have letters, digits and (redundant) whitespace. Furthermore authors often
> used a compressed notation for number ranges with dashes or slashes, like
> A123-56 or A123/4.
>
> When searching for part numbers, users should be able to enter specific
> numbers like A126 (then the text "A123-56" should be found too) or wildcard
> searches like "A12?" or "A*". This part number search is a separate feature
> apart from regular full text search.
>
> As far as I can see, I have to
>
> - add an extra field for storing part numbers
>
> - create a Tokenizer which recognizes just the part numbers and skips all
> other text
>
> - create an Analyzer which expands ranges like A123-56 to A123, A124, ...,
> A156 and normalizes numbers by removing whitespace
>
> With this analyzer I hope to get the highlighting to work too (e.g.
> "A123-56" highlighted when "A126" was the search term).
>
> Is this the right way? What could I use as a starting point? (I found
> org.apache.lucene.analysis.miscellaneous.PatternAnalyzer, which does much
> more than I need...)
>
> Thanks for all hints!
>
> Wulf

How to index part numbers

Posted by Wulf Berschin <be...@dosco.de>.
Hi,

I'm poking in the dark and hope someone has some light...

We have part numbers in technical documentation to retrieve. For now we 
have a (long) regular expression to find those in a string. The part 
numbers have letters, digits and (redundant) whitespace. Furthermore 
authors often used a compressed notation for number ranges with dashes 
or slashes, like A123-56 or A123/4.

When searching for part numbers, users should be able to enter specific
numbers like A126 (then the text "A123-56" should be found too) or
wildcard searches like "A12?" or "A*". This part number search is a
separate feature apart from regular full text search.

As far as I can see, I have to

- add an extra field for storing part numbers

- create a Tokenizer which recognizes just the part numbers and skips
all other text

- create an Analyzer which expands ranges like A123-56 to A123, A124,
..., A156 and normalizes numbers by removing whitespace

With this analyzer I hope to get the highlighting to work too (e.g.
"A123-56" highlighted when "A126" was the search term).

Is this the right way? What could I use as a starting point? (I found
org.apache.lucene.analysis.miscellaneous.PatternAnalyzer, which does much
more than I need...)

Thanks for all hints!

Wulf
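
The expansion and normalization step described above could look roughly like the sketch below. It is plain Java rather than a Lucene Analyzer, and it rests on assumptions that the post does not spell out: a part number is letters followed by digits, the digits after "-" or "/" replace the tail of the first number, and both separators mean an inclusive range. PartNumberRanges and expand are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartNumberRanges {

    // Assumed notation: letters, digits, then optionally "-" or "/" and the
    // digits that replace the tail of the first number (A123-56 -> A123..A156).
    private static final Pattern COMPRESSED =
            Pattern.compile("([A-Za-z]+)(\\d{1,9})(?:[-/](\\d{1,9}))?");

    static List<String> expand(String raw) {
        // Normalize first: drop the redundant whitespace.
        String compact = raw.replaceAll("\\s+", "");
        Matcher m = COMPRESSED.matcher(compact);
        if (!m.matches() || m.group(3) == null) {
            return Collections.singletonList(compact);  // not a range
        }
        String prefix = m.group(1);
        String first = m.group(2);
        String tail = m.group(3);
        if (tail.length() > first.length()) {
            return Collections.singletonList(compact);  // cannot be a tail replacement
        }
        String last = first.substring(0, first.length() - tail.length()) + tail;
        int from = Integer.parseInt(first);
        int to = Integer.parseInt(last);
        if (to < from) {
            return Collections.singletonList(compact);  // descending range: leave as is
        }
        List<String> expanded = new ArrayList<>();
        // Zero-pad to the width of the first number so A007-9 yields A007..A009.
        String format = "%s%0" + first.length() + "d";
        for (int n = from; n <= to; n++) {
            expanded.add(String.format(Locale.ROOT, format, prefix, n));
        }
        return expanded;
    }
}
```

With this, expand("A123-56") yields A123 through A156, so a search for A126 can match. If a slash is meant as "A123 and A124 only" rather than a range, that branch would need its own rule.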

