Posted to java-user@lucene.apache.org by Lokendra Singh <ls...@gmail.com> on 2011/02/25 07:26:52 UTC

Converting an existing index format to Lucene Index

Hi all,

I am seeking guidelines to directly convert an already existing
index into a Lucene index.
The index available to me is a set of <value1,value2> pairs, where each
pair is:
< word ,  fileName >
i.e. a word as 'value1', and 'value2' being the fileName containing
that word.

A word might appear in several fileNames, and the same file can contain
multiple copies of a word. For example, the following index is possible:
< "my"  , "file1" >
< "you" , "file2" >
< "my",  "file2" >
< "my", "file1">

My actual problem is that the index available to me is very large, so
I am a bit reluctant to create a 'Document' object for each file, because
for that I would have to read through all the pairs first and store them in
memory. Alternatively, I would have to 'update' the 'Document' object of a
particular file while iterating through the pairs of my index, and this
'update', again, is a costly operation.

Please correct me if my understanding of Lucene is wrong, or suggest
alternative approaches.

Regards
Lokendra

RE: Converting an existing index format to Lucene Index

Posted by Steven A Rowe <sa...@syr.edu>.
If you sort your old index file by filename, then iterate over the sorted file, your problem is solved, no?
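A minimal sketch of that sorted, single-pass approach (the class, Pair record, and emit callback here are illustrative, not Lucene API; in real code emit would build a Lucene Document and call IndexWriter.addDocument):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// With pairs sorted by fileName, all words for a given file arrive
// consecutively, so one pass builds one document per file without holding
// partial documents for every file in memory at once.
public class SortedPairIndexer {

    public record Pair(String word, String fileName) {}

    public static void index(Iterable<Pair> sortedPairs,
                             BiConsumer<String, List<String>> emit) {
        String currentFile = null;
        List<String> words = new ArrayList<>();
        for (Pair p : sortedPairs) {
            if (!p.fileName().equals(currentFile)) {
                if (currentFile != null) {
                    emit.accept(currentFile, words); // file complete: index it
                }
                currentFile = p.fileName();
                words = new ArrayList<>();
            }
            words.add(p.word()); // keep duplicates: term frequency matters for scoring
        }
        if (currentFile != null) {
            emit.accept(currentFile, words); // last file
        }
    }
}
```

Memory use is then bounded by the largest single file's word list rather than the whole index; the one-time external sort of the pair file is the only extra cost.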


Re: Converting an existing index format to Lucene Index

Posted by Lokendra Singh <ls...@gmail.com>.
Eddie and Aditya: thanks a lot for your suggestions!

The collect-and-commit approach looks good.

@Eddie: My old index contains on the order of a million pairs. Though I
haven't run any such tests, I will run them soon for analysis.

Regards
Lokendra



Re: Converting an existing index format to Lucene Index

Posted by Edward Drapkin <ed...@wolfram.com>.

Er, sorry for the blank email, hit the wrong button!

There are basically two ways to do this:

1) Buffer everything in RAM and then write all at once - this is 
probably the quickest way to do it, but the most resource intensive and 
prone to failure (OOM will lose all work, for example).
2) Iterate through the list, collecting some number of values and then 
periodically committing them to the index.

There's not really any other way: you either write it out in chunks or 
you write it out all at once.  However, there is some leeway in how you 
iterate through your old index.  Iterating through the entire index and 
buffering everything in RAM and writing it all out at once is, like you 
said, probably prohibitively resource intensive.  You could, on the 
other hand, iterate through the index and only collect values for a 
particular file, then commit that, then iterate again.  I would imagine 
this is a much slower approach, but it will be less memory intensive.

Personally, the way I'd approach this problem... I'd iterate through the 
old index in one pass.  Every time I encountered a new file, I'd create 
a new Document and store it somewhere (something trivial like 
Map<String, Document> where the key is the filename).   I'd also ensure 
that the Documents have a field called "file" so that I could easily 
query them later.  Every iteration, I'd continue to add to the Documents 
and every n iterations I'd commit all the Documents to the index 
(ostensibly calling IndexWriter.updateDocument).  By tuning the number
of iterations n that triggers an index write, you can adjust
the balance between RAM usage and CPU/IO time spent.  n=1 would
obviously be the most CPU/IO intensive and n=inf would be the most RAM 
intensive and the "sweet spot" for your requirements is very probably 
somewhere between those two points.
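A sketch of that one-pass, commit-every-n approach (the class and method names here are illustrative, not Lucene API; the flush callback stands in for turning each map entry into a Lucene Document with a "file" field and calling IndexWriter.updateDocument(new Term("file", name), doc), then commit()):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// One growing word list per file, kept in a Map keyed by filename; every
// flushEvery pairs the whole map is handed to the flush callback, which in
// real code would updateDocument each file (replacing its prior version)
// and commit.
public class BufferedPairIndexer {

    public static void index(Iterable<String[]> pairs,  // each element: {word, fileName}
                             int flushEvery,            // the "n" above: RAM vs CPU/IO knob
                             Consumer<Map<String, List<String>>> flush) {
        Map<String, List<String>> docs = new LinkedHashMap<>(); // fileName -> words so far
        int seen = 0;
        for (String[] p : pairs) {
            docs.computeIfAbsent(p[1], k -> new ArrayList<>()).add(p[0]);
            if (++seen % flushEvery == 0) {
                flush.accept(docs); // updateDocument replaces each file's prior version
            }
        }
        if (seen % flushEvery != 0) {
            flush.accept(docs); // final partial batch
        }
    }
}
```

RAM grows with the number of distinct files (the map is never cleared, since updateDocument must replace each file's full document), while flushEvery controls how often the CPU/IO cost of a commit is paid.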

How big is this old index, by the way?   Have you run tests to ensure 
that the memory limit or cpu cost in either method is actually a 
problem?  I think you may be surprised at the speeds you get, if you 
haven't run tests already.

Thanks,
Eddie

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Converting an existing index format to Lucene Index

Posted by findbestopensource <fi...@gmail.com>.
Hello Lokendra,

You could do updates frequently; it is a one-time job anyway.

My advice would be to do insertions and updates in batches:
1. Parse your file and read 1000 lines.
2. Do some aggregation and insert/update with Lucene.
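The two steps above can be sketched as follows (the "word<TAB>fileName" line format, class name, and indexBatch callback are assumptions for illustration; in real code the callback would insert/update each entry via Lucene's IndexWriter):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Read the pair file 1000 lines at a time, aggregate words per fileName
// within the chunk, then hand the chunk off in one insert/update round.
public class BatchLoader {

    private static final int BATCH_SIZE = 1000;

    public static void load(BufferedReader in,
                            Consumer<Map<String, List<String>>> indexBatch)
            throws IOException {
        Map<String, List<String>> batch = new HashMap<>();
        int n = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2); // parts[0] = word, parts[1] = fileName
            batch.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
            if (++n % BATCH_SIZE == 0) {
                indexBatch.accept(batch); // index this chunk, then start a fresh one
                batch = new HashMap<>();
            }
        }
        if (!batch.isEmpty()) {
            indexBatch.accept(batch); // trailing partial chunk
        }
    }
}
```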

Regards
Aditya
www.findbestopensource.com




