You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Shaghayegh Sahebie <sh...@yahoo.com> on 2006/08/12 10:12:30 UTC

Indexing Documents which has Attachments and are Refered many times!!

Hi all;
We have got a Document management system and we want to build a search on it. We have tree kind of content in our system: Refers, Documents and Attachments. A document can have multiple attachments and can be Refered to many users.
Our users want to be able to search on documents attachments and refers. for example they want to search the Documents which are created at "2006/07/06" date and have the word "Lucene" in it or their Refers and are Refered to Mr.x.
Our users want to be ale to search in all 8 possible selections of Document, Refer and Attachment, I mean they want to be able to search just in Refers, in both Refers and Documents, ...
How can we handle it? 
I thaught to store diferent kinds of Docs in a DB, search in the DB at first and search in Lucene based on DB results and phrases given to search (Handling Document, Refer or Attachments parts in a DB search). But the DB results maybe so big and i don't know if a Lucene query can have these much of search Terms.
Another way is to Index each document, refer and attachment in the index 8 times(all the possible selections of Refer, Document and Attachment) but this way has lots of redundancy even more than 8 times! 'cause each Document is indexed "8 * Refer number of Document" times.
I really don't know what to do, Any suggestions Please?

Thanks in advance

 			
---------------------------------
Do you Yahoo!?
 Everyone is raving about the  all-new Yahoo! Mail Beta.

Re: Indexing Documents which has Attachments and are Refered many times!!

Posted by Jason Polites <ja...@gmail.com>.

I think you can still achieve your desired outcome, but I'm not sure I fully
understand the use case.  Can you describe more clearly a specific example
of what you need to achieve?

You are correct that "joins" in lucene aren't really a strong point, but
this is often a by-product of thinking about Lucene in the same wy we think
about a database; which is a mistake (in my opinion).

I often have trouble trying to use Lucene in the same way I use a database
(joined queries etc).  Usually I need to re-think the way I am using Lucene
to avoid needing to do this.

On 8/19/06, Shaghayegh Sahebie <sh...@yahoo.com> wrote:
>
> thanks Jason and Steve;
> maybe i didn't understand your solution well, but in this system a
> document is refered many times (we have a refer description wich we should
> index it also) and each time a document is refered i should update it in the
> lucene index and so i should delete it and then index it again. and it means
> deleting and indexing this document many times. i think indexing is time
> consuming.
> and as another question can lucene have different value for one of it's
> fields? 'cause i refer a doc many time and i need many refer fields(but i
> don't know how many) in one documents fields. and i can not e.g. concat
> all the refers in one feld 'cause i may have other constraints on a refer
> and doument also (e.g. give me the documents which they have word "foo" in
> their refers and the refer of them which has this wordis on "yyyy/mm/dd"
> date. if i concat the refers i may find a refer which has the word in it but
> another refer of this document is on the given date. i mean i should know
> which refer is on which date and also other fields of a refer other than
> date. the same is true for attachments) on the date of a refer an
>
> i think the main problem i got is that lucene can not handle joins and i
> think i need joins.
>
> regards
> --Shaghayegh
>
> Steven Rowe <sa...@syr.edu> wrote: As Jason says, you can structure each
> Lucene document with one Field per
> content type, and index all data that way.  The database is not required.
>
> To address your search complexity concern, you can create queries that
> search only those Field(s) the user wants -- there is no need to have a
> Field for each possible combination of content type.
>
> Steve
>
> Jason Polites wrote:
> > Maybe I'm not understanding your requirement, but this should be fairly
> > simple in Lucene.
> >
> > Each document in your document management system would be represented by
> a
> > single Lucene document in the index.  Each lucene document will then
> have
> > several fields, each field representing the values of the "meta data"
> > associated with your document in the document management system.
> >
> > For example:
> >
> > Lets say you have a document which has the following structure:
> >
> > Title: Sample Document
> > Date: 01/01/2006
> > Attachment: Some attachment (this is the content of the attachment)
> > Attachment: Some other attachment
> > Refers: Mr X
> > Refers: Mr Y
> >
> > Your Lucene document would have the same structure.  All of these items
> in
> > your "real" document would simply be Fields in the lucene document.
> >
> > In the case of your attachments, you could also consider indexing them
> > separately (as well).  This way users could search for attachments
> without
> > needing the documents to which they are attached.
> >
> > If your attachments ARE documents (that is, your just have a foreign key
> > style relationship between two documents), then you would simple index
> each
> > "real" document as a separate document in lucene and add some sort of
> > reference field which contains the ID of all related documents.
> >
> > For Example:
> > -------------------
> > ID: 123
> > Title: Sample Document
> > Date: 01/01/2006
> > Attachment: 456
> > Attachment: 789
> > Refers: Mr X
> > Refers: Mr Y
> > --------------------
> > ID: 456
> > Title: Other Document
> > Date:.. etc etc...
> >
> > The one thing to be mindful of is re-indexing existing documents.  If
> you
> > have document that is already indexed and you want to make a change (eg
> you
> > want to add a new "refers" value), then you need to re-index the entire
> > document.  This means you need to either "store" all fields you want to
> > keep
> > during re-indexing (which is typically all of them), or you need to
> > re-index
> > the document from its source. Storing all the data in the index can have
> > adverse effects on the performance of the index however. (hope this
> makes
> > sense).
> >
> >
> >
> > On 8/12/06, Shaghayegh Sahebie  wrote:
> >>
> >> Hi all;
> >> We have got a Document management system and we want to build a search
> on
> >> it. We have tree kind of content in our system: Refers, Documents and
> >> Attachments. A document can have multiple attachments and can be
> >> Refered to
> >> many users.
> >> Our users want to be able to search on documents attachments and
> refers.
> >> for example they want to search the Documents which are created at
> >> "2006/07/06" date and have the word "Lucene" in it or their Refers and
> >> are
> >> Refered to Mr.x.
> >> Our users want to be ale to search in all 8 possible selections of
> >> Document, Refer and Attachment, I mean they want to be able to search
> >> just
> >> in Refers, in both Refers and Documents, ...
> >> How can we handle it?
> >> I thaught to store diferent kinds of Docs in a DB, search in the DB at
> >> first and search in Lucene based on DB results and phrases given to
> >> search
> >> (Handling Document, Refer or Attachments parts in a DB search). But
> >> the DB
> >> results maybe so big and i don't know if a Lucene query can have these
> >> much
> >> of search Terms.
> >> Another way is to Index each document, refer and attachment in the
> >> index 8
> >> times(all the possible selections of Refer, Document and Attachment)
> but
> >> this way has lots of redundancy even more than 8 times! 'cause each
> >> Document
> >> is indexed "8 * Refer number of Document" times.
> >> I really don't know what to do, Any suggestions Please?
> >>
> >> Thanks in advance
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
> ---------------------------------
> Do you Yahoo!?
> Everyone is raving about the  all-new Yahoo! Mail Beta.
>

Re: Indexing Documents which has Attachments and are Refered many times!!

Posted by Shaghayegh Sahebie <sh...@yahoo.com>.

thanks Jason and Steve;
maybe i didn't understand your solution well, but in this system a document is refered many times (we have a refer description wich we should index it also) and each time a document is refered i should update it in the lucene index and so i should delete it and then index it again. and it means deleting and indexing this document many times. i think indexing is time consuming. 
and as another question can lucene have different value for one of it's fields? 'cause i refer a doc many time and i need many refer fields(but i don't know how many) in one documents fields. and i can not e.g. concat all the refers in one feld 'cause i may have other constraints on a refer and doument also (e.g. give me the documents which they have word "foo" in their refers and the refer of them which has this wordis on "yyyy/mm/dd" date. if i concat the refers i may find a refer which has the word in it but another refer of this document is on the given date. i mean i should know which refer is on which date and also other fields of a refer other than date. the same is true for attachments) on the date of a refer an

i think the main problem i got is that lucene can not handle joins and i think i need joins.

regards 
--Shaghayegh

Steven Rowe <sa...@syr.edu> wrote: As Jason says, you can structure each Lucene document with one Field per
content type, and index all data that way.  The database is not required.

To address your search complexity concern, you can create queries that
search only those Field(s) the user wants -- there is no need to have a
Field for each possible combination of content type.

Steve

Jason Polites wrote:
> Maybe I'm not understanding your requirement, but this should be fairly
> simple in Lucene.
> 
> Each document in your document management system would be represented by a
> single Lucene document in the index.  Each lucene document will then have
> several fields, each field representing the values of the "meta data"
> associated with your document in the document management system.
> 
> For example:
> 
> Lets say you have a document which has the following structure:
> 
> Title: Sample Document
> Date: 01/01/2006
> Attachment: Some attachment (this is the content of the attachment)
> Attachment: Some other attachment
> Refers: Mr X
> Refers: Mr Y
> 
> Your Lucene document would have the same structure.  All of these items in
> your "real" document would simply be Fields in the lucene document.
> 
> In the case of your attachments, you could also consider indexing them
> separately (as well).  This way users could search for attachments without
> needing the documents to which they are attached.
> 
> If your attachments ARE documents (that is, your just have a foreign key
> style relationship between two documents), then you would simple index each
> "real" document as a separate document in lucene and add some sort of
> reference field which contains the ID of all related documents.
> 
> For Example:
> -------------------
> ID: 123
> Title: Sample Document
> Date: 01/01/2006
> Attachment: 456
> Attachment: 789
> Refers: Mr X
> Refers: Mr Y
> --------------------
> ID: 456
> Title: Other Document
> Date:.. etc etc...
> 
> The one thing to be mindful of is re-indexing existing documents.  If you
> have document that is already indexed and you want to make a change (eg you
> want to add a new "refers" value), then you need to re-index the entire
> document.  This means you need to either "store" all fields you want to
> keep
> during re-indexing (which is typically all of them), or you need to
> re-index
> the document from its source. Storing all the data in the index can have
> adverse effects on the performance of the index however. (hope this makes
> sense).
> 
> 
> 
> On 8/12/06, Shaghayegh Sahebie  wrote:
>>
>> Hi all;
>> We have got a Document management system and we want to build a search on
>> it. We have tree kind of content in our system: Refers, Documents and
>> Attachments. A document can have multiple attachments and can be
>> Refered to
>> many users.
>> Our users want to be able to search on documents attachments and refers.
>> for example they want to search the Documents which are created at
>> "2006/07/06" date and have the word "Lucene" in it or their Refers and
>> are
>> Refered to Mr.x.
>> Our users want to be ale to search in all 8 possible selections of
>> Document, Refer and Attachment, I mean they want to be able to search
>> just
>> in Refers, in both Refers and Documents, ...
>> How can we handle it?
>> I thaught to store diferent kinds of Docs in a DB, search in the DB at
>> first and search in Lucene based on DB results and phrases given to
>> search
>> (Handling Document, Refer or Attachments parts in a DB search). But
>> the DB
>> results maybe so big and i don't know if a Lucene query can have these
>> much
>> of search Terms.
>> Another way is to Index each document, refer and attachment in the
>> index 8
>> times(all the possible selections of Refer, Document and Attachment) but
>> this way has lots of redundancy even more than 8 times! 'cause each
>> Document
>> is indexed "8 * Refer number of Document" times.
>> I really don't know what to do, Any suggestions Please?
>>
>> Thanks in advance


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



 			
---------------------------------
Do you Yahoo!?
 Everyone is raving about the  all-new Yahoo! Mail Beta.

Re: Indexing Documents which has Attachments and are Refered many times!!

Posted by Steven Rowe <sa...@syr.edu>.

As Jason says, you can structure each Lucene document with one Field per
content type, and index all data that way.  The database is not required.

To address your search complexity concern, you can create queries that
search only those Field(s) the user wants -- there is no need to have a
Field for each possible combination of content type.

Steve

Jason Polites wrote:
> Maybe I'm not understanding your requirement, but this should be fairly
> simple in Lucene.
> 
> Each document in your document management system would be represented by a
> single Lucene document in the index.  Each lucene document will then have
> several fields, each field representing the values of the "meta data"
> associated with your document in the document management system.
> 
> For example:
> 
> Lets say you have a document which has the following structure:
> 
> Title: Sample Document
> Date: 01/01/2006
> Attachment: Some attachment (this is the content of the attachment)
> Attachment: Some other attachment
> Refers: Mr X
> Refers: Mr Y
> 
> Your Lucene document would have the same structure.  All of these items in
> your "real" document would simply be Fields in the lucene document.
> 
> In the case of your attachments, you could also consider indexing them
> separately (as well).  This way users could search for attachments without
> needing the documents to which they are attached.
> 
> If your attachments ARE documents (that is, your just have a foreign key
> style relationship between two documents), then you would simple index each
> "real" document as a separate document in lucene and add some sort of
> reference field which contains the ID of all related documents.
> 
> For Example:
> -------------------
> ID: 123
> Title: Sample Document
> Date: 01/01/2006
> Attachment: 456
> Attachment: 789
> Refers: Mr X
> Refers: Mr Y
> --------------------
> ID: 456
> Title: Other Document
> Date:.. etc etc...
> 
> The one thing to be mindful of is re-indexing existing documents.  If you
> have document that is already indexed and you want to make a change (eg you
> want to add a new "refers" value), then you need to re-index the entire
> document.  This means you need to either "store" all fields you want to
> keep
> during re-indexing (which is typically all of them), or you need to
> re-index
> the document from its source. Storing all the data in the index can have
> adverse effects on the performance of the index however. (hope this makes
> sense).
> 
> 
> 
> On 8/12/06, Shaghayegh Sahebie <sh...@yahoo.com> wrote:
>>
>> Hi all;
>> We have got a Document management system and we want to build a search on
>> it. We have tree kind of content in our system: Refers, Documents and
>> Attachments. A document can have multiple attachments and can be
>> Refered to
>> many users.
>> Our users want to be able to search on documents attachments and refers.
>> for example they want to search the Documents which are created at
>> "2006/07/06" date and have the word "Lucene" in it or their Refers and
>> are
>> Refered to Mr.x.
>> Our users want to be ale to search in all 8 possible selections of
>> Document, Refer and Attachment, I mean they want to be able to search
>> just
>> in Refers, in both Refers and Documents, ...
>> How can we handle it?
>> I thaught to store diferent kinds of Docs in a DB, search in the DB at
>> first and search in Lucene based on DB results and phrases given to
>> search
>> (Handling Document, Refer or Attachments parts in a DB search). But
>> the DB
>> results maybe so big and i don't know if a Lucene query can have these
>> much
>> of search Terms.
>> Another way is to Index each document, refer and attachment in the
>> index 8
>> times(all the possible selections of Refer, Document and Attachment) but
>> this way has lots of redundancy even more than 8 times! 'cause each
>> Document
>> is indexed "8 * Refer number of Document" times.
>> I really don't know what to do, Any suggestions Please?
>>
>> Thanks in advance


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Indexing Documents which has Attachments and are Refered many times!!

Posted by Jason Polites <ja...@gmail.com>.

Maybe I'm not understanding your requirement, but this should be fairly
simple in Lucene.

Each document in your document management system would be represented by a
single Lucene document in the index.  Each lucene document will then have
several fields, each field representing the values of the "meta data"
associated with your document in the document management system.

For example:

Lets say you have a document which has the following structure:

Title: Sample Document
Date: 01/01/2006
Attachment: Some attachment (this is the content of the attachment)
Attachment: Some other attachment
Refers: Mr X
Refers: Mr Y

Your Lucene document would have the same structure.  All of these items in
your "real" document would simply be Fields in the lucene document.

In the case of your attachments, you could also consider indexing them
separately (as well).  This way users could search for attachments without
needing the documents to which they are attached.

If your attachments ARE documents (that is, your just have a foreign key
style relationship between two documents), then you would simple index each
"real" document as a separate document in lucene and add some sort of
reference field which contains the ID of all related documents.

For Example:
-------------------
ID: 123
Title: Sample Document
Date: 01/01/2006
Attachment: 456
Attachment: 789
Refers: Mr X
Refers: Mr Y
--------------------
ID: 456
Title: Other Document
Date:.. etc etc...

The one thing to be mindful of is re-indexing existing documents.  If you
have document that is already indexed and you want to make a change (eg you
want to add a new "refers" value), then you need to re-index the entire
document.  This means you need to either "store" all fields you want to keep
during re-indexing (which is typically all of them), or you need to re-index
the document from its source. Storing all the data in the index can have
adverse effects on the performance of the index however. (hope this makes
sense).

On 8/12/06, Shaghayegh Sahebie <sh...@yahoo.com> wrote:
>
> Hi all;
> We have got a Document management system and we want to build a search on
> it. We have tree kind of content in our system: Refers, Documents and
> Attachments. A document can have multiple attachments and can be Refered to
> many users.
> Our users want to be able to search on documents attachments and refers.
> for example they want to search the Documents which are created at
> "2006/07/06" date and have the word "Lucene" in it or their Refers and are
> Refered to Mr.x.
> Our users want to be ale to search in all 8 possible selections of
> Document, Refer and Attachment, I mean they want to be able to search just
> in Refers, in both Refers and Documents, ...
> How can we handle it?
> I thaught to store diferent kinds of Docs in a DB, search in the DB at
> first and search in Lucene based on DB results and phrases given to search
> (Handling Document, Refer or Attachments parts in a DB search). But the DB
> results maybe so big and i don't know if a Lucene query can have these much
> of search Terms.
> Another way is to Index each document, refer and attachment in the index 8
> times(all the possible selections of Refer, Document and Attachment) but
> this way has lots of redundancy even more than 8 times! 'cause each Document
> is indexed "8 * Refer number of Document" times.
> I really don't know what to do, Any suggestions Please?
>
> Thanks in advance
>
>
> ---------------------------------
> Do you Yahoo!?
> Everyone is raving about the  all-new Yahoo! Mail Beta.
>