Posted to java-user@lucene.apache.org by Thomas QUESTE <TQ...@sqli.com> on 2004/07/23 13:50:42 UTC

Searching for a word in a URL

Hello all, 



I need to search for words in URLs that have been indexed. 

For example, I have "www.jakarta.org". If I search for "jakarta", Lucene won't 
return a result. If I search for "www.jakarta*", Lucene returns the correct 
result. 

How should I proceed to make Lucene able to index "jakarta"? I think 
I could write an Analyzer that breaks www.jakarta.org into "www", "jakarta" 
and "org", but that may not be the best way. 

Thanks for your help, 

Thomas 



Re: Searching for a word in a URL

Posted by Thomas Plümpe <th...@gmx.de>.
> For exemple, I have "www.jakarta.org", If I search "jakarta", Lucene won't 
> return a result. If I search "www.jakarta*", Lucene returns me the correct 
> result.
> 
> How should I proceed to make Lucene to be able to index "jakarta" ? I think 
> I can write a Analyser that will break www.jakarta.org in "www", "jakarta" 
> and "org" but It won't be the best way. 
I think this article
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
(look for the heading "Analysis") might be helpful in understanding the
options, one of which would be using a StopAnalyzer.
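
As a quick way to see what an analyzer does to the URL, something like this
(untested; Lucene 1.x API, and the field name "url" is arbitrary) should print
"www", "jakarta" and "org" on separate lines, since StopAnalyzer tokenizes on
non-letter characters and lowercases:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class AnalyzerCheck {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StopAnalyzer();
            // Run the URL through the analyzer and print every term it emits.
            TokenStream ts = analyzer.tokenStream("url", new StringReader("www.jakarta.org"));
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText());
            }
            ts.close();
        }
    }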




Merging indexes

Posted by Rupinder Singh Mazara <rs...@ebi.ac.uk>.
Hi all 

  I have a problem with merging indexes.
  I had to split the indexing of my data into 20 different indexes (based on a primary key),
  and I want to merge them all into one master index.

  For example, I have
  /xxx/lucene/tmp/1001-1000
  /xxx/lucene/tmp/1001-2000
  /xxx/lucene/tmp/2001-3001 ....etc

  I want to merge them into the index
  /xxx/lucene/index/complete 

  I was hoping to merge the indexes as the jobs complete.

  I tried the following code. createNew is a boolean that is set to true for the
 first merge and to false for all subsequent merges;
 tmpRoot is the folder in which the split indexes are written;
 d1 is the index I want merged:

        // Open the source index produced by this job (false = do not create).
        Directory allIndexes[] = new Directory[1];
        allIndexes[0] = FSDirectory.getDirectory(new File(tmpRoot + d1), false);
        aLog.info("Dir's opened for merge ");

        // Open the master index and merge the source index into it.
        IndexWriter iWriter = new IndexWriter(FSDirectory.getDirectory(new File(op), true), myAnalyzer, createNew);
        // IndexWriter iWriter = new IndexWriter(FSDirectory.getDirectory(new File(op), true), myAnalyzer, false);
        aLog.fatal("Retrieved Indexes preparing to MERGE ");
        iWriter.addIndexes(allIndexes);
        aLog.fatal("Preparing to optimize ");
        iWriter.optimize();
        aLog.fatal("Closing indexes  ");
        iWriter.close();

  The problem is that the first run works fine, but the next job ends up deleting the contents of the master index directory.
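
  Could the hard-coded true in the FSDirectory.getDirectory(new File(op), true)
  call be the culprit? If that flag makes Lucene create (and therefore erase) the
  directory on every run, then an untested variation along these lines, reusing
  the same variables as above, may be what I actually need:

        // Only create (and wipe) the master directory on the very first merge;
        // on later runs open the existing index and append to it.
        Directory target = FSDirectory.getDirectory(new File(op), createNew);
        IndexWriter iWriter = new IndexWriter(target, myAnalyzer, createNew);
        iWriter.addIndexes(allIndexes);
        iWriter.optimize();
        iWriter.close();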



 




Re: Large index files

Posted by John Moylan <jo...@rte.ie>.
It depends. I index everything to RAM, optimize, and then dump to disk
(roughly as sketched below). I end up with three files:

segments
deletable
_1.cfs
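
Roughly like this, from memory (untested; Lucene 1.x API, and the path and
field name are only placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamThenDisk {
        public static void main(String[] args) throws Exception {
            // Build the whole index in RAM first.
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Text("contents", "some text to index"));
            ramWriter.addDocument(doc);
            ramWriter.optimize();
            ramWriter.close();

            // Then dump it to disk in one go and optimize down to a single segment.
            Directory fsDir = FSDirectory.getDirectory("/path/to/index", true);
            IndexWriter fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), true);
            fsWriter.addIndexes(new Directory[] { ramDir });
            fsWriter.optimize();
            fsWriter.close();
        }
    }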

John

On Fri, 2004-07-23 at 14:38, Joel Shellman wrote:
> I'm a little confused by this. I thought Lucene keeps creating new files 
> as the index gets bigger and any single file doesn't ever get all that 
> big. Is that not the case?
> 
> Thanks,
> 
> Joel Shellman
> 
> 
> John Moylan wrote:
> > As long as your kernel has "Large File Support", then you should be
> > fine. Most modern distro's support >2GB files now out of the box.
> > 
> > John 
> > 
> > On Fri, 2004-07-23 at 13:44, Karthik N S wrote:
> > 
> >>Hi
> >>
> >>  I think  (a) would be a better choice  [I have  done it on Linux  upt to
> >>7GB , it's pretty faster then doing the same on win2000 PF]
> >>
> >>
> >>with regards
> >>Karthik
> >>
> >>-----Original Message-----
> >>From: Rupinder Singh Mazara [mailto:rsmazara@ebi.ac.uk]
> >>Sent: Friday, July 23, 2004 5:55 PM
> >>To: Lucene Users List
> >>Subject: Large index files
> >>
> >>
> >>Hi all
> >>
> >>  I am using lucene to index a large dataset, it so happens 10% of this data
> >>yields indexes of
> >>  400MB, in all likelihood it is possible the index may go upto 7GB.
> >>
> >>  My deployment will be on a linux/tomcat  system, what will be a better
> >>solution
> >>  a) create one large index and hope linux does not mind
> >>  b) generate 7-10 indexes based on some criteria and glue them together
> >>using MultiReader, in this case I may cross the MAX file handles limit of
> >>Tomcat ?
> >>
> >> regards
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
-- 
John Moylan
RTÉ ePublishing,
Montrose House,
Donnybrook,
Dublin 4
T: +353 1 2083564
E: john.moylan@rte.ie





Re: Large index files

Posted by Praveen Peddi <pp...@contextmedia.com>.
Yes, Lucene may create new files as you add documents, but based on the merge
factor, minMergeDocs, optimize() and several other settings it will merge those
files back into fewer, larger ones. You may not always end up with a single
file, but in most cases there will be very few files.
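
For example, something along these lines (a rough, untested sketch assuming
the Lucene 1.4 API, where mergeFactor and minMergeDocs are public fields on
IndexWriter; the path and field name are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class TunedIndexing {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            writer.mergeFactor = 20;    // merge segments less often: faster indexing, more files open at once
            writer.minMergeDocs = 100;  // buffer this many documents in memory before writing a segment

            Document doc = new Document();
            doc.add(Field.Text("contents", "some text"));
            writer.addDocument(doc);

            writer.optimize();          // collapse everything down to a single (or very few) segment file(s)
            writer.close();
        }
    }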

Praveen
----- Original Message ----- 
From: "Joel Shellman" <jo...@ikestrel.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, July 23, 2004 9:38 AM
Subject: Re: Large index files


> I'm a little confused by this. I thought Lucene keeps creating new files
> as the index gets bigger and any single file doesn't ever get all that
> big. Is that not the case?
>
> Thanks,
>
> Joel Shellman
>
>
> John Moylan wrote:
> > As long as your kernel has "Large File Support", then you should be
> > fine. Most modern distro's support >2GB files now out of the box.
> >
> > John
> >
> > On Fri, 2004-07-23 at 13:44, Karthik N S wrote:
> >
> >>Hi
> >>
> >>  I think  (a) would be a better choice  [I have  done it on Linux  upt
to
> >>7GB , it's pretty faster then doing the same on win2000 PF]
> >>
> >>
> >>with regards
> >>Karthik
> >>
> >>-----Original Message-----
> >>From: Rupinder Singh Mazara [mailto:rsmazara@ebi.ac.uk]
> >>Sent: Friday, July 23, 2004 5:55 PM
> >>To: Lucene Users List
> >>Subject: Large index files
> >>
> >>
> >>Hi all
> >>
> >>  I am using lucene to index a large dataset, it so happens 10% of this
data
> >>yields indexes of
> >>  400MB, in all likelihood it is possible the index may go upto 7GB.
> >>
> >>  My deployment will be on a linux/tomcat  system, what will be a better
> >>solution
> >>  a) create one large index and hope linux does not mind
> >>  b) generate 7-10 indexes based on some criteria and glue them together
> >>using MultiReader, in this case I may cross the MAX file handles limit
of
> >>Tomcat ?
> >>
> >> regards
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>




RE: Large index files

Posted by Rupinder Singh Mazara <rs...@ebi.ac.uk>.
By optimizing the created index(es) you can reduce the multiple files to a
smaller set of files.

On some file systems it might also be a good idea to optimize once in a while.


>-----Original Message-----
>From: Joel Shellman [mailto:joel@ikestrel.com]
>Sent: 23 July 2004 14:38
>To: Lucene Users List
>Subject: Re: Large index files
>
>
>I'm a little confused by this. I thought Lucene keeps creating new files
>as the index gets bigger and any single file doesn't ever get all that
>big. Is that not the case?
>
>Thanks,
>
>Joel Shellman
>
>
>John Moylan wrote:
>> As long as your kernel has "Large File Support", then you should be
>> fine. Most modern distro's support >2GB files now out of the box.
>>
>> John
>>
>> On Fri, 2004-07-23 at 13:44, Karthik N S wrote:
>>
>>>Hi
>>>
>>>  I think  (a) would be a better choice  [I have  done it on
>Linux  upt to
>>>7GB , it's pretty faster then doing the same on win2000 PF]
>>>
>>>
>>>with regards
>>>Karthik
>>>
>>>-----Original Message-----
>>>From: Rupinder Singh Mazara [mailto:rsmazara@ebi.ac.uk]
>>>Sent: Friday, July 23, 2004 5:55 PM
>>>To: Lucene Users List
>>>Subject: Large index files
>>>
>>>
>>>Hi all
>>>
>>>  I am using lucene to index a large dataset, it so happens 10%
>of this data
>>>yields indexes of
>>>  400MB, in all likelihood it is possible the index may go upto 7GB.
>>>
>>>  My deployment will be on a linux/tomcat  system, what will be a better
>>>solution
>>>  a) create one large index and hope linux does not mind
>>>  b) generate 7-10 indexes based on some criteria and glue them together
>>>using MultiReader, in this case I may cross the MAX file handles limit of
>>>Tomcat ?
>>>
>>> regards
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>




Re: Large index files

Posted by Joel Shellman <jo...@ikestrel.com>.
I'm a little confused by this. I thought Lucene keeps creating new files 
as the index gets bigger and any single file doesn't ever get all that 
big. Is that not the case?

Thanks,

Joel Shellman


John Moylan wrote:
> As long as your kernel has "Large File Support", then you should be
> fine. Most modern distro's support >2GB files now out of the box.
> 
> John 
> 
> On Fri, 2004-07-23 at 13:44, Karthik N S wrote:
> 
>>Hi
>>
>>  I think  (a) would be a better choice  [I have  done it on Linux  upt to
>>7GB , it's pretty faster then doing the same on win2000 PF]
>>
>>
>>with regards
>>Karthik
>>
>>-----Original Message-----
>>From: Rupinder Singh Mazara [mailto:rsmazara@ebi.ac.uk]
>>Sent: Friday, July 23, 2004 5:55 PM
>>To: Lucene Users List
>>Subject: Large index files
>>
>>
>>Hi all
>>
>>  I am using lucene to index a large dataset, it so happens 10% of this data
>>yields indexes of
>>  400MB, in all likelihood it is possible the index may go upto 7GB.
>>
>>  My deployment will be on a linux/tomcat  system, what will be a better
>>solution
>>  a) create one large index and hope linux does not mind
>>  b) generate 7-10 indexes based on some criteria and glue them together
>>using MultiReader, in this case I may cross the MAX file handles limit of
>>Tomcat ?
>>
>> regards
>>
>>
>>
>>
>>
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org



RE: Large index files

Posted by John Moylan <jo...@rte.ie>.
As long as your kernel has "Large File Support", you should be fine.
Most modern distros support >2GB files out of the box now.

John 

On Fri, 2004-07-23 at 13:44, Karthik N S wrote:
> Hi
> 
>   I think  (a) would be a better choice  [I have  done it on Linux  upt to
> 7GB , it's pretty faster then doing the same on win2000 PF]
> 
> 
> with regards
> Karthik
> 
> -----Original Message-----
> From: Rupinder Singh Mazara [mailto:rsmazara@ebi.ac.uk]
> Sent: Friday, July 23, 2004 5:55 PM
> To: Lucene Users List
> Subject: Large index files
> 
> 
> Hi all
> 
>   I am using lucene to index a large dataset, it so happens 10% of this data
> yields indexes of
>   400MB, in all likelihood it is possible the index may go upto 7GB.
> 
>   My deployment will be on a linux/tomcat  system, what will be a better
> solution
>   a) create one large index and hope linux does not mind
>   b) generate 7-10 indexes based on some criteria and glue them together
> using MultiReader, in this case I may cross the MAX file handles limit of
> Tomcat ?
> 
>  regards
> 
> 
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
-- 
John Moylan
RTÉ ePublishing,
Montrose House,
Donnybrook,
Dublin 4
T: +353 1 2083564
E: john.moylan@rte.ie





RE: Large index files

Posted by Karthik N S <ka...@controlnet.co.in>.
Hi

  I think (a) would be the better choice. [I have done it on Linux up to
7GB; it is quite a bit faster than doing the same on win2000 PF.]


with regards
Karthik

-----Original Message-----
From: Rupinder Singh Mazara [mailto:rsmazara@ebi.ac.uk]
Sent: Friday, July 23, 2004 5:55 PM
To: Lucene Users List
Subject: Large index files


Hi all

  I am using lucene to index a large dataset, it so happens 10% of this data
yields indexes of
  400MB, in all likelihood it is possible the index may go upto 7GB.

  My deployment will be on a linux/tomcat  system, what will be a better
solution
  a) create one large index and hope linux does not mind
  b) generate 7-10 indexes based on some criteria and glue them together
using MultiReader, in this case I may cross the MAX file handles limit of
Tomcat ?

 regards









Large index files

Posted by Rupinder Singh Mazara <rs...@ebi.ac.uk>.
Hi all

  I am using Lucene to index a large dataset. As it happens, 10% of this data
already yields indexes of 400MB, so in all likelihood the full index may go up
to 7GB.

  My deployment will be on a Linux/Tomcat system. Which would be the better
solution:
  a) create one large index and hope Linux does not mind, or
  b) generate 7-10 indexes based on some criteria and glue them together using
MultiReader (a sketch of this follows below), in which case I may cross the max
file handle limit of Tomcat?
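
 For option (b), the kind of thing I have in mind is roughly the following (an
 untested sketch using MultiSearcher, the search-time way of gluing several
 indexes together; the paths, field name and query are only placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class SplitIndexSearch {
        public static void main(String[] args) throws Exception {
            // One searcher per partial index.
            Searchable[] searchers = new Searchable[] {
                new IndexSearcher("/xxx/lucene/index/part1"),
                new IndexSearcher("/xxx/lucene/index/part2"),
                new IndexSearcher("/xxx/lucene/index/part3")
            };
            MultiSearcher searcher = new MultiSearcher(searchers);

            // Search all partial indexes as if they were one.
            Query query = QueryParser.parse("jakarta", "contents", new StandardAnalyzer());
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " hits");
            searcher.close();
        }
    }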

 regards






