Posted to java-user@lucene.apache.org by br...@gmx.de on 2004/02/10 03:59:50 UTC

index: how to store binary data or objects ?

Hi Lucene Users!

Searching the documentation, API and this mailing list results in:
"no way to store objects or binary data in an UnIndexed
org.apache.lucene.document.Field to attach it to the index directly"

Is there a way to do this? What would you suggest to do?

Thank you very much for any help!
Best regards, Markus



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: index: how to store binary data or objects ?

Posted by petite_abeille <pe...@mac.com>.
On Feb 10, 2004, at 03:59, brosch@gmx.de wrote:

> Is there a way to do this?

Lucene deals with text. You could always serialize your objects into a 
byte array, hex-encode it or something, and store that in an 
appropriate field.
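
A minimal sketch of the serialize-then-hex-encode idea above, using only the JDK. The class and method names (HexObjectCodec, encode, decode) are made up for illustration and are not part of Lucene; the resulting string is what you would put into whatever stored field you choose.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper (not Lucene API): object <-> hex string.
final class HexObjectCodec {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Two lowercase hex digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0x0f]).append(HEX[b & 0x0f]);
        }
        return sb.toString();
    }

    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // Standard Java serialization, then hex-encoded so it is plain text.
    static String encode(Serializable object) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return toHex(buffer.toByteArray());
    }

    // Reverse of encode(). Only deserialize data you wrote yourself.
    static Object decode(String hex) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(fromHex(hex)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that hex encoding doubles the size of the data, which is one more reason to prefer the keep-it-elsewhere advice for anything large.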

> What would you suggest to do?

Don't store your objects in Lucene :)

As others have pointed out, you will be much better off storing your 
objects somewhere else (files, db, btree, whatever) and only using Lucene 
to store a reference to those objects.

For a concrete example of this approach, take a look at the ZOE [1] 
source code [2].

It uses Lucene for, er, indexing and JDBM [3] to store the 
corresponding object's binaries.

Cheers,

PA.

[1] http://zoe.nu/itstories/story.php?data=stories&num=16&sec=1
[2] http://zoe.nu/misc/Workspace20031122.tgz
[3] http://jdbm.sourceforge.net/




Re: how to "re-index"

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Update in Lucene means: delete the document and then re-add it.
This may be a FAQ.
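
To make the delete-then-re-add idea concrete, here is a toy in-memory stand-in for an index, written against the JDK only. ToyIndex and its methods are hypothetical and are not Lucene API; the sketch only shows why a plain add leaves the old entry behind and why an update has to delete by the document's unique id first.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for an index; none of these names are Lucene API.
final class ToyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();
    // id -> terms currently indexed under that id (needed for delete)
    private final Map<String, Set<String>> termsById = new HashMap<>();

    // Adding only appends: terms from an earlier add with the same id stay.
    void add(String id, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(id);
            termsById.computeIfAbsent(id, i -> new HashSet<>()).add(term);
        }
    }

    // Removes every posting that points at the given id.
    void delete(String id) {
        Set<String> terms = termsById.remove(id);
        if (terms == null) {
            return;
        }
        for (String term : terms) {
            Set<String> ids = postings.get(term);
            ids.remove(id);
            if (ids.isEmpty()) {
                postings.remove(term);
            }
        }
    }

    // "Update" is nothing more than delete followed by add.
    void update(String id, String text) {
        delete(id);
        add(id, text);
    }

    Set<String> search(String term) {
        Set<String> ids = postings.getOrDefault(
            term.toLowerCase(Locale.ROOT), Collections.emptySet());
        return new TreeSet<>(ids);
    }
}
```

With this sketch, adding "red apple" and then "green pear" under the same id leaves "red" searchable, while calling update with "green pear" does not.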

Otis

--- Markus Brosch <br...@gmx.de> wrote:
> > However, I have problems with "reindexing".
> > First, I index all my object contents. Then some of these objects can
> > change and need to be re-indexed.
> >
> > I did it with IndexWriter(Dir, Analyzer, FALSE). With the boolean value
> > "false" the new document will be added to the index, but the old
> > document still remains in the index :-/
>
> Sorry for the second mail, but maybe I should say that I am looking for
> an UPDATE of the index! What I am doing at the moment is adding (see
> above) and deleting with IndexReader ...
>
> Thanks ;-)


Re: how to "re-index"

Posted by Markus Brosch <br...@gmx.de>.
> However, I have problems with "reindexing". 
> First, I index all my object contents. Then some of these objects can
> change and need to be re-indexed.
> 
> I did it with IndexWriter(Dir, Analyzer, FALSE). With the boolean value
> "false" the new document will be added to the index, but the old document
> still remains in the index :-/ 

Sorry for the second mail, but maybe I should say that I am looking for an
UPDATE of the index! What I am doing at the moment is adding (see above) and
deleting with IndexReader ...

Thanks ;-)



how to "re-index"

Posted by Markus Brosch <br...@gmx.de>.
> When retrieving your documents, you can use this keyword to reference 
> your object.
> 
> > Another problem is that my objects can change their content and must
> > be "reindexed". Is it possible to remove the single index for that
> > object and build a new one without reindexing everything?
> 
> Yes.

Thank you for your answers!

However, I have problems with "reindexing". 
First, I index all my object contents. Then some of these objects can change
and need to be re-indexed. 

I did it with IndexWriter(Dir, Analyzer, FALSE). With the boolean value
"false" the new document will be added to the index, but the old document still
remains in the index :-/ 

Any suggestions? THANK YOU!



Re: index: how to store binary data or objects ?

Posted by petite_abeille <pe...@mac.com>.
On Feb 10, 2004, at 14:53, Markus Brosch wrote:

> My application will deal with "small" data sets. The problem is that
> I want to index the content (String) of some objects. I want to refer
> to that object once I have found it by a keyword or whatever. So, use
> a simple map or tree?

Something along these lines:

- When indexing your object, you create one Lucene document for it and 
store its unique identifier as a keyword alongside whatever you want 
to index.

- When retrieving your documents, you can use this keyword to reference 
your object.
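
The two steps above can be sketched with plain JDK collections. This is only an illustration under stated assumptions: ReferencedObjects is a made-up class, the word map stands in for the Lucene index, and the object map stands in for wherever the real objects live (files, db, btree, whatever). The point is that the "index" side holds nothing but the unique identifier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: the index keeps only ids, the objects live elsewhere.
final class ReferencedObjects<T> {
    // Stand-in for the external object store, keyed by unique id.
    private final Map<String, T> objectsById = new HashMap<>();
    // Stand-in for the index: searchable word -> ids (sorted for stable output).
    private final Map<String, TreeSet<String>> idsByWord = new HashMap<>();

    // Store the object under its id and index its text against that id.
    void index(String id, T object, String searchableText) {
        objectsById.put(id, object);
        for (String word : searchableText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                idsByWord.computeIfAbsent(word, w -> new TreeSet<>()).add(id);
            }
        }
    }

    // Search returns ids; each id is then resolved to the real object.
    List<T> find(String word) {
        List<T> hits = new ArrayList<>();
        TreeSet<String> ids = idsByWord.get(word.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (String id : ids) {
                hits.add(objectsById.get(id));
            }
        }
        return hits;
    }
}
```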

> Another problem is that my objects can change their content and must
> be "reindexed". Is it possible to remove the single index for that
> object and build a new one without reindexing everything?

Yes.

Cheers,

PA.




Re: index: how to store binary data or objects ?

Posted by Markus Brosch <br...@gmx.de>.
> 1. Store the binary data in files and store the path in Lucene. There are
> scalability issues here when you handle more than a few hundred
> thousand objects.

> 2. Store the binary data in a database and store a unique id in Lucene.
> This will scale better but fetching binary data from the db might be
> slow.

Thank you all for your comments!
In general I understand your suggestions - mostly because of scaling issues.

My application will deal with "small" data sets. The problem is that I want
to index the content (String) of some objects. I want to refer to that
object once I have found it by a keyword or whatever. So, use a simple map
or tree?

Another problem is that my objects can change their content and must be
"reindexed". Is it possible to remove the single index for that object and
build a new one without reindexing everything?

Thank you for help!
Best regards, Markus





Re: index: how to store binary data or objects ?

Posted by petite_abeille <pe...@mac.com>.
On Feb 10, 2004, at 09:32, Andrzej Bialecki wrote:

> Just a comment: for ext2fs and BSD FFS (dunno about NT) scalability 
> issues with this approach can be partially addressed by building a 
> tree of subdirectories, instead of using just one. I.e. a file named 
> "myThesis.pdf" would go into /m/y/t/myThesis.pdf. This way the time 
> needed to list the files in a given directory is reduced (both unixes 
> can already cache the inode numbers for name/inode lookup, so there is 
> no significant time increase to look up a longer path).

Yes. But you have to watch out for the overall path length limit though. 
An alternative strategy is to hash your keys and store that as the 
directory path. This is what some browsers do to store their cache.
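
A minimal sketch of the hashed-path strategy, assuming SHA-1 as the hash and two directory levels taken from the first four hex digits. Both choices, and the HashedPaths name, are assumptions made for illustration. Because the digest has a fixed length, the relative path length is fixed as well, no matter how long the key is.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: maps an arbitrary key to a fixed-length relative path.
final class HashedPaths {
    // Result looks like "a9/99/a9993e36...": two levels, then the full hash.
    static String pathFor(String key) {
        MessageDigest sha1;
        try {
            // SHA-1 is one of the algorithms every Java platform must provide.
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        byte[] digest = sha1.digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        String h = hex.toString();
        return h.substring(0, 2) + "/" + h.substring(2, 4) + "/" + h;
    }
}
```

The file name no longer tells you what the file is, so the original key has to be kept somewhere else, for example as a stored field next to the path.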

> FreeBSD also has a special kind of filesystem, which uses inodes in a 
> flat space (no directories). It was specifically designed for storing 
> large numbers of files efficiently. Recent versions of Java on FreeBSD 
> (1.4.2) seem to be very stable and performing well, so that could also 
> be an option.

Yes. The file system can be used to simulate a fairly reasonable 
database. The problem then is not so much the time it takes to look up 
those files, but rather opening, reading, and closing them.

> After all, a filesystem _is_ a kind of very specialized database... ;-)

This is true and works quite well to a certain extent.

But it suffers from one major flaw in my experience: you run out of 
file descriptors very quickly. And let's face it, it's quite slow also :)

Cheers,

PA.




Re: index: how to store binary data or objects ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dror Matalon wrote:

> On Tue, Feb 10, 2004 at 03:59:50AM +0100, brosch@gmx.de wrote:
> 
>>Hi Lucene Users!
>>
>>Searching the documentation, API and this mailing list results in:
>>"no way to store objects or binary data in an UnIndexed
>>org.apache.lucene.document.Field to attach it to the index directly"
>>
>>Is there a way to do this? What would you suggest to do?
> 
> 
> 1. Store the binary data in files and store the path in Lucene. There are
> scalability issues here when you handle more than a few hundred
> thousand objects.

Just a comment: for ext2fs and BSD FFS (dunno about NT) scalability 
issues with this approach can be partially addressed by building a tree 
of subdirectories, instead of using just one. I.e. a file named 
"myThesis.pdf" would go into /m/y/t/myThesis.pdf. This way the time 
needed to list the files in a given directory is reduced (both unixes 
can already cache the inode numbers for name/inode lookup, so there is 
no significant time increase to look up a longer path).
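
The subdirectory scheme described above can be sketched as follows. PrefixTree and the depth parameter are made up for illustration, and replacing characters that are not letters or digits with '_' is an assumption added so that names like "a.b" still yield usable directory names.

```java
// Hypothetical helper: spreads files over a tree keyed by leading characters.
final class PrefixTree {
    // pathFor("myThesis.pdf", 3) yields "/m/y/t/myThesis.pdf".
    static String pathFor(String fileName, int depth) {
        StringBuilder path = new StringBuilder();
        // Short names simply get fewer directory levels.
        int levels = Math.min(depth, fileName.length());
        for (int i = 0; i < levels; i++) {
            char c = Character.toLowerCase(fileName.charAt(i));
            // Assumption: anything that is not a letter or digit becomes '_'.
            path.append('/').append(Character.isLetterOrDigit(c) ? c : '_');
        }
        return path.append('/').append(fileName).toString();
    }
}
```

Real file names tend to cluster on a few prefixes, so the directories will not fill evenly.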

FreeBSD also has a special kind of filesystem, which uses inodes in a 
flat space (no directories). It was specifically designed for storing 
large numbers of files efficiently. Recent versions of Java on FreeBSD 
(1.4.2) seem to be very stable and performing well, so that could also 
be an option.

After all, a filesystem _is_ a kind of very specialized database... ;-)

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: index: how to store binary data or objects ?

Posted by Dror Matalon <dr...@zapatec.com>.
On Tue, Feb 10, 2004 at 03:59:50AM +0100, brosch@gmx.de wrote:
> Hi Lucene Users!
> 
> Searching the documentation, API and this mailing list results in:
> "no way to store objects or binary data in an UnIndexed
> org.apache.lucene.document.Field to attach it to the index directly"
> 
> Is there a way to do this? What would you suggest to do?

1. Store the binary data in files and store the path in Lucene. There are
scalability issues here when you handle more than a few hundred
thousand objects.
2. Store the binary data in a database and store a unique id in Lucene.
This will scale better but fetching binary data from the db might be
slow.

> 
> Thank you very much for any help!
> Best regards, Markus

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
