
Posted to general@lucene.apache.org by spok <sp...@milkyweb.de> on 2014/04/04 12:27:45 UTC

relative document path in lucene index

Hi,

is it possible to use (and change) relative document paths in a Lucene
index, so that the index can be generated on system a with documents stored
in location a1, and then be moved, e.g. to a CD, with documents stored in
location cd1, where the relative position between index and document
location has changed?

Thanks in advance

spok



--
View this message in context: http://lucene.472066.n3.nabble.com/relative-document-path-in-lucene-index-tp4129096.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: relative document path in lucene index

Posted by Mark Bennett <ma...@lucidworks.com>.

Hello again,

Re. Moving the index, when you say:
- moving $regainhome/serchindex/index to /somewhereonthedisk/index
- regain can't search any longer because of no more index

To me this sounds like “regain” is looking for the index in a particular place.  Perhaps they have a message board?

Re. bigger question, changing the path names:

Stepping back and looking at the bigger picture, and reading your other comments, I’m guessing perhaps you are not a programmer?

I was assuming you were a Java programmer when I gave my answers.  Although what you want to do may be possible in theory, I think it will require some Java coding.

Have you considered looking for a programmer to help with this task?  Or, if you have some time, perhaps learning to program with Lucene and Solr yourself?

Mark

--
Mark Bennett / LucidWorks: Search & Big Data / mark.bennett@lucidworks.com
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513

On Apr 6, 2014, at 2:37 PM, spok <sp...@milkyweb.de> wrote:

> Mark,
> 
> thanks again, here is my answer:

Re: relative document path in lucene index

Posted by spok <sp...@milkyweb.de>.

Mark,

thanks again, here is my answer:

what I used for this ...
- regain 2.0.4 - http://regain.sourceforge.net/ - a desktop search engine
based on lucene
- luke 3.5.0 - http://code.google.com/p/luke/ - Lucene Index Toolbox
- carrot2 - 3.8.0 - http://project.carrot2.org/download.html - document
clustering workbench

what I tried ...
- having documents to index at location $doc_home
- remove index directory
- reindex documents
- regain built new index in $regainhome/serchindex/index
- search with regain - ok
- using carrot2 document clustering workbench pointing to
$regainhome/serchindex/index as input - ok
- 
- moving $regainhome/serchindex/index to /somewhereonthedisk/index
- regain can't search any longer because of no more index
- starting carrot2 clustering workbench and pointing to
/somewhereonthedisk/index as input - ok because index contains absolute path
values to indexed documents
-
- using luke one can see that the index contains a field named path_sort with
value = $doc_home
- using an editor one can see that the files *.fdt and *.tii in
$regainhome/serchindex/index contain the absolute path to $doc_home
- all paths start with drive letters c:\......
- 
- what I want are path names of the indexed documents relative to the
index directory, because then the index as well as the indexed documents
can be anywhere
- the luke website says that there is a way to "reconstruct the original
document fields, edit them and re-insert to the index" but I didn't find it
...
-
- what you mention is a way to do exactly what I'm looking for.

- "Then have an index-to-index writer to read from indexA, transform the
data, then write to indexB"
- and
- "For item 2, Solr lets you update specific fields in a document.  But in
the background, it’s actually still doing a full reindex, but for any fields
you don’t update it copies and reindexes them from the old copy"
- but - sorry - I'm new, as I mentioned - I don't know how
- do you perhaps mean this
"https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents"?
- and: can this be done for all documents in the index?
- but in any case I think I need path names without drive letters (with
fixed positions of $doc_home and lucene index relative to each other)
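
The path arithmetic behind this wish can be sketched as follows. This is only a minimal illustration in plain Python, not Lucene or regain code: the function names and the example paths are made up, and it assumes Windows-style paths like the `c:\` values mentioned above. The standard-library `ntpath` module applies Windows path rules on any operating system.

```python
import ntpath  # Windows path rules, usable on any host OS


def to_relative(doc_path: str, index_dir: str) -> str:
    """Express an absolute document path relative to the index directory."""
    # Raises ValueError if the two paths are on different drives.
    return ntpath.relpath(doc_path, start=index_dir)


def to_absolute(rel_path: str, index_dir: str) -> str:
    """Resolve a stored relative path against wherever the index lives now."""
    return ntpath.normpath(ntpath.join(index_dir, rel_path))


if __name__ == "__main__":
    # Hypothetical layout: documents and index are siblings under c:\data.
    rel = to_relative(r"c:\data\docs\report.pdf", r"c:\data\index")
    print(rel)
    # After moving both directories together to another drive letter:
    print(to_absolute(rel, r"e:\data\index"))
```

As long as the documents and the index keep the same position relative to each other, the stored relative value stays valid whatever drive letter the medium gets.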




Re: relative document path in lucene index

Posted by Mark Bennett <ma...@lucidworks.com>.

As I understand it (?), there are 2 technical things you’re asking about:

1: Can you move an index from one place to another, a different disk or drive letter, CD, etc - Generally yes, this is allowed.

2: Can you alter the path names stored within the index - in some cases yes, but I was suggesting a 2 index solution.

Taking each in turn:

For item 1, moving an index.  A Solr “core” includes its configuration and its index data.  The paths are almost always relative; all the cores sit under a common “solr home” directory, and they can be moved, copied, zipped/unzipped, etc.

With Lucene, the coder has more control so things can vary, but within the base index directory I don’t believe there are any absolute path names; all the files are in the same index directory.  So unless something very unusual has been done, it should be OK to move the files as a set.
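
Moving the files "as a set" can be sketched like this. It is a rough illustration only: the helper name is hypothetical and the file names are empty placeholders standing in for real index files, not an actual index.

```python
import shutil
import tempfile
from pathlib import Path


def copy_index_dir(src: Path, dst: Path) -> list:
    """Copy every file of the index directory and return the copied names."""
    shutil.copytree(src, dst)  # dst must not exist yet
    return sorted(p.name for p in dst.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "index"
        src.mkdir()
        # Empty placeholder files standing in for real index files.
        for name in ("segments.gen", "_0.fdt", "_0.tii"):
            (src / name).write_bytes(b"")
        print(copy_index_dir(src, Path(tmp) / "moved" / "index"))
```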

For item 2, Solr lets you update specific fields in a document.  In the background it’s actually still reindexing the whole document: any fields you don’t update are copied from the old copy and reindexed.  I believe these Solr calls map to Lucene calls, so the same rules should generally apply.  This requires that the fields be stored.  So you could do this with 1 index.  My only concern was that, with 1 index, you might only get one chance to get it right; the 2 index solution gives you some insurance to retry, iterate, adjust, etc.
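
As a rough sketch of what such a field update looks like in Solr's JSON atomic-update syntax: the `set` modifier replaces one field's value. The unique key `id`, the field name `path`, and the helper name are assumptions for illustration, and sending the payload to Solr's update handler is left out.

```python
import json
from typing import Dict, List


def atomic_path_updates(new_paths: Dict[str, str], field: str = "path") -> List[dict]:
    """Build one atomic-update entry per document id."""
    # {"set": value} asks Solr to replace just this field's value.
    return [{"id": doc_id, field: {"set": path}} for doc_id, path in new_paths.items()]


if __name__ == "__main__":
    # Hypothetical ids and relative paths, one entry per document to change.
    payload = atomic_path_updates({
        "doc1": "../docs/report.pdf",
        "doc2": "../docs/sub/notes.txt",
    })
    print(json.dumps(payload, indent=2))
```

Because the payload is just a list, the same call can cover every document in the index, provided their ids and new path values are known.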


On Apr 5, 2014, at 12:38 AM, spok <sp...@milkyweb.de> wrote:

> Mark, thanks for your answer, I'm rather new, so let me explain what I
> understood ...

Re: relative document path in lucene index

Posted by spok <sp...@milkyweb.de>.

Mark, thanks for your answer. I'm rather new, so let me explain what I
understood ...

Of course I can use solr within tomcat, move the whole server, and all will
be ok.

But is there a way to generate a lucene file system index at location "a"
for data files at "a1", move index files and data files to a movable
location like CD or USB stick, adapt index files to reflect the new
conditions, and then read the index with carrot2 workbench (to my knowledge
the only carrot2 implementation with Aduna clustering capabilities)?

What you propose is to have two lucene installations, and copy the index.
But will this work on a movable file system like USB stick which can have
different device letters?

So if your explanation is a solution for what I want to do, what
should I read to learn the details? Thanks in advance




Re: relative document path in lucene index

Posted by Mark Bennett <ma...@lucidworks.com>.

I’d suggest a slightly different approach: use a dual-index model.

So you get your rough data into Lucene or Solr indexA.

Then have an index-to-index writer to read from indexA, transform the data, then write to indexB

So indexA is “draft”, but your final application runs off of indexB.

The nice thing is you can do this as many times as needed.
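
The draft-to-final copy can be sketched with plain Python lists standing in for the two indexes. This is only an illustration of the read, transform, write loop: the `Document` type, the `path` field name, and both function names are assumptions, and a real version would read the stored fields from indexA and add the transformed documents to indexB through the Lucene or Solr APIs.

```python
import ntpath
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]  # stand-in for one document's stored fields


def relativize_path(doc: Document, index_dir: str, field: str = "path") -> Document:
    """Return a copy of doc whose path field is relative to index_dir."""
    out = dict(doc)  # copy, so the draft document is left unchanged
    out[field] = ntpath.relpath(doc[field], start=index_dir)
    return out


def copy_index(index_a: Iterable[Document],
               transform: Callable[[Document], Document]) -> List[Document]:
    """Read every document from the draft index and build the final one."""
    return [transform(doc) for doc in index_a]


if __name__ == "__main__":
    # Hypothetical draft index with absolute Windows paths.
    index_a = [
        {"title": "Report", "path": r"c:\data\docs\report.pdf"},
        {"title": "Notes", "path": r"c:\data\docs\sub\notes.txt"},
    ]
    index_b = copy_index(index_a, lambda d: relativize_path(d, r"c:\data\index"))
    for doc in index_b:
        print(doc["title"], doc["path"])
```

Since indexA is never modified, the transform can be adjusted and rerun until indexB looks right.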

Also, if you use Solr, you can leverage SolrJ and update chains to get the data exactly as you need it.

