Posted to java-user@lucene.apache.org by "Shah, Yagnesh" <ys...@hwwilson.com> on 2009/03/30 16:46:05 UTC

What is an optimal approach?

Hello Lucene users,
We have all our XML documents stored in a content management system from MarkLogic. Is there a recommended approach for indexing these documents with Lucene?

Re: What is an optimal approach?

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.

As someone who earns a living writing a CMS integrated with Lucene,
I can tell you this is not that simple. You can of course index your
data, but be aware that all subsequent content repository (CR)
operations must be kept in sync with the index. What if a piece of
content is deleted from the CR? You probably don't want your search to
return deleted content, so you need to update your index to exclude
it. The same applies to all CRUD operations. What if you want a
clustered solution? What about atomicity? The list goes on...
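
To make the sync requirement concrete, here is a minimal sketch of routing
repository events to index operations. The class name IndexSync, the Op enum
and the in-memory map are all hypothetical stand-ins, not part of Lucene or
MarkLogic. With a real index the put would correspond roughly to
IndexWriter.updateDocument(Term, Document) and the remove to
IndexWriter.deleteDocuments(Term), both keyed on an id field. It does not
address clustering or atomicity.

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexSync {

    /** Kinds of content repository events this sketch understands. */
    public enum Op { CREATE, UPDATE, DELETE }

    // Stand-in for the search index: document id -> indexed text.
    private final Map<String, String> index = new TreeMap<>();

    /** Applies one repository event to the index so the two stay in step. */
    public void apply(Op op, String id, String content) {
        switch (op) {
            case CREATE:
            case UPDATE:
                // Replace by id, so an update never leaves a stale copy behind.
                index.put(id, content);
                break;
            case DELETE:
                // Deleted content must no longer be searchable.
                index.remove(id);
                break;
        }
    }

    /** Returns true if the given id is currently searchable. */
    public boolean contains(String id) {
        return index.containsKey(id);
    }

    /** Returns the indexed text for an id, or null if it is not indexed. */
    public String get(String id) {
        return index.get(id);
    }
}
```

The point of keying everything on one id is that create and update collapse
into a single "replace" operation, which keeps the event handler small.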

I can only second Mark: make sure you have exhausted all the search
possibilities your current system has to offer.

To answer your question, I know nothing about the MarkLogic API, but if
all your data is in XML, you can always parse it, select the nodes you
want indexed and create an org.apache.lucene.document.Document from
them. At least that's what we do.
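
As a rough illustration of the "parse, select nodes" step, the sketch below
uses only the JDK's DOM parser. The class name XmlFieldExtractor, the sample
XML and the tag names are made up for the example. Each map entry is what
would then be added as one field to an org.apache.lucene.document.Document;
that last step is left out so the code runs without Lucene on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlFieldExtractor {

    /** Parses an XML string and returns tag name -> text for each requested element. */
    public static Map<String, String> extractFields(String xml, String... tagNames)
            throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> fields = new LinkedHashMap<>();
        for (String tag : tagNames) {
            NodeList nodes = dom.getElementsByTagName(tag);
            if (nodes.getLength() > 0) {
                // Only the first matching element is used in this sketch.
                fields.put(tag, nodes.item(0).getTextContent().trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><title>Indexing XML</title>"
                + "<body>Some text.</body><internal>skip</internal></article>";
        // Only the selected nodes are kept; <internal> is ignored.
        System.out.println(extractFields(xml, "title", "body"));
    }
}
```

Tags that are not requested, or not present in the document, simply produce
no entry, so the caller decides which nodes end up in the index.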

Regards,
Mindaugas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: What is an optimal approach?

Posted by mark harwood <ma...@yahoo.co.uk>.

If it is only a performance benchmark you need (as opposed to ongoing syncing), then it would probably be easier to read the original XML files from the file system (or first export them from MarkLogic to the file system if they were created in MarkLogic).

From there it is a matter of iterating through all the files and using Java's DOM or SAX APIs to read the file content and create appropriate Lucene documents programmatically.
The example Java application that comes with Lucene shows how to traverse the file system. Make sure you review the indexing/searching performance tips on the Lucene wiki.
One of the Solr-literate people may jump in at this point with a word or two on how this can be mostly configured (rather than coded) using Solr.
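
For the hand-coded route, the file iteration plus SAX reading could look
something like the sketch below. It is JDK-only and the names
(XmlDirectoryScanner, TextCollector, the ".xml" filter, the target tag) are
assumptions for illustration. Where it builds a result string, a real indexer
would build a Lucene document and hand it to an IndexWriter instead.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XmlDirectoryScanner {

    /** SAX handler that collects the text inside every element named targetTag. */
    static class TextCollector extends DefaultHandler {
        private final String targetTag;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // greater than 0 while inside targetTag

        TextCollector(String targetTag) {
            this.targetTag = targetTag;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0 || qName.equals(targetTag)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString().trim();
        }
    }

    /** Walks root recursively and returns "fileName: text" for each .xml file, sorted by path. */
    public static List<String> scan(Path root, String targetTag) throws Exception {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(p -> p.toString().endsWith(".xml"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        SAXParserFactory factory = SAXParserFactory.newInstance();
        List<String> results = new ArrayList<>();
        for (Path file : files) {
            TextCollector handler = new TextCollector(targetTag);
            // A fresh parser per file keeps the sketch simple.
            factory.newSAXParser().parse(file.toFile(), handler);
            results.add(file.getFileName() + ": " + handler.text());
        }
        return results;
    }
}
```

SAX is used here rather than DOM because it streams each file instead of
loading it whole, which tends to matter more once the corpus is large.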

As ever, I would be interested in your results.

RE: What is an optimal approach?

Posted by "Shah, Yagnesh" <ys...@hwwilson.com>.

Hello Mr. Harwood,
I am aware of the built-in search capabilities, but I would like to get a performance benchmark. One way to do this is to retrieve the content and index it, but I was looking for a better approach in case someone has already faced a similar situation.

Re: What is an optimal approach?

Posted by mark harwood <ma...@yahoo.co.uk>.

That's probably more a question about MarkLogic APIs than it is about Lucene.
What APIs does MarkLogic provide for getting at the content? For example, does it provide a JSR-170 standard interface ( http://www.slideshare.net/uncled/introduction-to-jcr )?

I presume you have already ruled out the built-in MarkLogic search capabilities for some reason?



