Posted to slide-user@jakarta.apache.org by Bertrand Tignon <be...@capgemini.com> on 2005/04/12 09:03:16 UTC
Slide, Dasl and pdf
hello !
I'm working on a document library with Slide and Dasl.
I'm trying to do plain-text search with a basic DASL search inside .doc and .pdf files.
I've noticed that searching for a word with more than 2 characters doesn't return any results. For instance, I have a PDF document about XML: when I search for "xm", it says "xm" was found in this PDF, but when I search for "xml", nothing is found!
Do u understand that ?
Please help me.
I've also noticed that DASL "like" search is not working with Slide. Is it normal ? Is it gonna be fixed ?
Thanx a lot for your help.
Bertrand.
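For readers following along, here is a minimal sketch of the two kinds of DASL query mentioned above: a content search with DAV:contains and a pattern search with DAV:like. It only builds the request and does not send it. The element names follow the DAV:basicsearch grammar; the server URL, the /files scope, the choice of DAV:displayname as the property, and the class and method names are assumptions made for illustration, not something taken from Slide itself. It uses the java.net.http client from a current JDK, not a Slide client library.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) DASL SEARCH requests.
// The URL and scope used in main() are placeholders, not Slide defaults.
class DaslSearchSketch {

    // Escapes the characters that must not appear raw in XML text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // WHERE clause for a full-text match on resource content.
    static String containsClause(String word) {
        return "<D:contains>" + escape(word) + "</D:contains>";
    }

    // WHERE clause for a pattern match on DAV:displayname
    // ('%' and '_' are the wildcards in DAV:like patterns).
    static String likeClause(String pattern) {
        return "<D:like><D:prop><D:displayname/></D:prop>"
             + "<D:literal>" + escape(pattern) + "</D:literal></D:like>";
    }

    // Wraps a WHERE clause in a DAV:basicsearch request over the given scope.
    static String buildBody(String scopeHref, String whereClause) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<D:searchrequest xmlns:D=\"DAV:\">\n"
             + "  <D:basicsearch>\n"
             + "    <D:select><D:prop><D:displayname/></D:prop></D:select>\n"
             + "    <D:from><D:scope>\n"
             + "      <D:href>" + escape(scopeHref) + "</D:href>\n"
             + "      <D:depth>infinity</D:depth>\n"
             + "    </D:scope></D:from>\n"
             + "    <D:where>" + whereClause + "</D:where>\n"
             + "  </D:basicsearch>\n"
             + "</D:searchrequest>\n";
    }

    // Builds an HTTP request that uses the WebDAV SEARCH method.
    static HttpRequest buildRequest(String serverUrl, String body) {
        return HttpRequest.newBuilder(URI.create(serverUrl))
                .header("Content-Type", "text/xml; charset=UTF-8")
                .method("SEARCH", HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        String body = buildBody("/files", containsClause("xml"));
        HttpRequest request = buildRequest("http://localhost:8080/slide/", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.print(body);
        System.out.print(buildBody("/files", likeClause("%xml%")));
    }
}
```

Printing the body before sending it makes it easy to compare the "xm" and "xml" queries side by side and rule out a malformed request as the cause.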
This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
Re: Slide, Dasl and pdf
Posted by Eirikur Hrafnsson <ei...@idega.is>.
Good question...hope someone from the Slide team can answer that since
we also are forced to use the 2.2pre1 in a production environment...
(Slide 2.2pre1 is a pretty stable version though so don't be afraid to
use it )
On 13.4.2005, at 07:54, Bertrand Tignon wrote:
>
> And when will the 2.2 version be released ?
>
Best Regards
Eirikur S. Hrafnsson, eiki@idega.is
Chief Software Engineer
Idega Software
http://www.idega.com
---------------------------------------------------------------------
To unsubscribe, e-mail: slide-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-user-help@jakarta.apache.org
Re: Slide, Dasl and pdf
Posted by Bertrand Tignon <be...@capgemini.com>.
And when will the 2.2 version be released ?
----- Original Message -----
From: "Eirikur Hrafnsson" <ei...@idega.is>
To: "Slide Users Mailing List" <sl...@jakarta.apache.org>
Sent: Tuesday, April 12, 2005 6:28 PM
Subject: Re: Slide, Dasl and pdf
>
> On 12.4.2005, at 13:19, Edmund Urbani wrote:
> > Maybe I should ask a different question on this list:
> > Does the LuceneIndexer that is currently in CVS HEAD work?
> It does, the version in 2.1 does not.
>
> -Eiki
Re: Slide, Dasl and pdf
Posted by Eirikur Hrafnsson <ei...@idega.is>.
There is a document somewhere in the Slide HEAD which contains settings
for the indexers (property and content) and the extractors
(PDF, Office, ...).
It's called Extractor-Domain.xml. You could take a look at that. Here
are my working settings (from my Domain.xml):
.... (at the end of my <store> definition)
    <contentindexer classname="org.apache.slide.index.lucene.LuceneContentIndexer">
        <!-- indexpath is the name of the folder that Lucene creates under
             (if using Tomcat) your tomcat/bin/ folder. Unfortunately Slide
             does not support storing the Lucene index in the Slide
             repository :( so you must change this parameter if you have
             multiple contexts running, so they don't collide -->
        <parameter name="indexpath">store/index/content</parameter>
        <!-- asynchron makes the indexing happen in its own thread; not
             really needed unless you need really rapid writes to the Slide
             store but not simultaneous indexing -->
        <parameter name="asynchron">true</parameter>
    </contentindexer>
    <propertiesindexer classname="org.apache.slide.index.lucene.LucenePropertiesIndexer">
        <parameter name="indexpath">store/index/metadata</parameter>
        <parameter name="asynchron">true</parameter>
        <!-- Here you define all the custom properties (if you have any)
             that you add to a resource with propPatchMethod and want
             indexed; we have one extra property -->
        <configuration name="indexed-properties">
            <property name="ContentType" namespace="IW:">
                <text/>
                <is-defined/>
            </property>
        </configuration>
    </propertiesindexer>
</store>
And then later in Domain.xml I add the extractors and set them to the
paths we want to index, in our case just everything under /files
....
<parameter name="versioncontrol-exclude"/>
<parameter name="checkout-fork">forbidden</parameter>
<parameter name="checkin-fork">forbidden</parameter>
<!-- Extractor configuration -->
<extractors>
    <!--XML extractors-->
    <extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files">
        <!-- BTW this extractor does NOT work if you have a namespace in your
             xml document, I have these settings here because someday it will... -->
        <configuration>
            <instruction namespace="http://xmlns.idega.com/block/article/xml"
                         property="headline" xpath="/article/headline/text()" />
            <instruction namespace="http://xmlns.idega.com/block/article/xml"
                         property="teaser" xpath="/article/teaser/text()" />
            <instruction namespace="http://xmlns.idega.com/block/article/xml"
                         property="body" xpath="/article/body/text()" />
            <instruction namespace="http://xmlns.idega.com/block/article/xml"
                         property="author" xpath="/article/author/text()" />
            <instruction namespace="http://xmlns.idega.com/block/article/xml"
                         property="source" xpath="/article/source/text()" />
            <instruction namespace="http://xmlns.idega.com/block/article/xml"
                         property="comment" xpath="/article/comment/text()" />
        </configuration>
    </extractor>
    <extractor classname="org.apache.slide.extractor.XmlContentExtractor" uri="/files"/>
    <!--XML extractors-->
    <!--PDF extractors-->
    <extractor classname="org.apache.slide.extractor.PDFExtractor" uri="/files" />
    <!--PDF extractors-->
    <!--Text extractors-->
    <extractor classname="org.apache.slide.extractor.TextContentExtractor" uri="/files" />
    <!--Text extractors-->
    <!--Office extractors-->
    <extractor classname="org.apache.slide.extractor.OfficeExtractor" uri="/files">
        <configuration>
            <instruction property="author" id="SummaryInformation-0-4" />
            <instruction property="application" id="SummaryInformation-0-18" />
        </configuration>
    </extractor>
    <extractor classname="org.apache.slide.extractor.MSWordExtractor" uri="/files"/>
    <extractor classname="org.apache.slide.extractor.MSExcelExtractor" uri="/files"/>
    <extractor classname="org.apache.slide.extractor.MSPowerPointExtractor" uri="/files"/>
    <!--Office extractors-->
</extractors>
<!-- Event configuration -->
<events>
    <event classname="org.apache.slide.webdav.event.WebdavEvent" enable="true" />
    <event classname="org.apache.slide.event.ContentEvent" enable="true" />
...
And that's how we shave!
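Since each extractor above is bound to a uri, it can help to list which extractor classes cover which paths before hunting for indexing problems. The following is a minimal sketch that parses an <extractors> fragment with the JDK's DOM parser and prints uri/classname pairs. The class and method names are made up for this example, and it assumes the fragment is passed in as a well-formed XML string; it does not load Slide or validate the class names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: lists the extractors declared in a Domain.xml-style fragment.
class ExtractorListingSketch {

    // Returns one "uri -> classname" entry per <extractor> element, in document order.
    static List<String> listExtractors(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("extractor");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element extractor = (Element) nodes.item(i);
                result.add(extractor.getAttribute("uri")
                        + " -> " + extractor.getAttribute("classname"));
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or parser setup problems end up here.
            throw new IllegalStateException("Could not parse extractor configuration", e);
        }
    }

    public static void main(String[] args) {
        // Two entries copied from the configuration in this message.
        String fragment =
              "<extractors>"
            + "<extractor classname=\"org.apache.slide.extractor.PDFExtractor\" uri=\"/files\"/>"
            + "<extractor classname=\"org.apache.slide.extractor.MSWordExtractor\" uri=\"/files\"/>"
            + "</extractors>";
        for (String line : listExtractors(fragment)) {
            System.out.println(line);
        }
    }
}
```

A resource type or path that never shows up in this listing would be the first thing to check when its content cannot be found by a search.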
Best Regards
Eirikur S. Hrafnsson, eiki@idega.is
Chief Software Engineer
Idega Software
http://www.idega.com
On 13.4.2005, at 09:19, Edmund Urbani wrote:
> Thanks. That's good to hear. I was about to give up.
>
> Now I'd like to go back to the question I had earlier:
> Do I need to add anything to my Domain.xml other than the
> <contentindexer ..>
> element (as explained in the Wiki) to make the lucene indexer work?
>
> Edmund
Re: Slide, Dasl and pdf
Posted by Edmund Urbani <em...@liland.org>.
Eirikur Hrafnsson wrote:
>
> On 12.4.2005, at 13:19, Edmund Urbani wrote:
>> Maybe I should ask a different question on this list:
>> Does the LuceneIndexer that is currently in CVS HEAD work?
>
> It does, the version in 2.1 does not.
>
> -Eiki
>
Thanks. That's good to hear. I was about to give up.
Now I'd like to go back to the question I had earlier:
Do I need to add anything to my Domain.xml other than the <contentindexer ..>
element (as explained in the Wiki) to make the lucene indexer work?
Edmund
Re: Slide, Dasl and pdf
Posted by Eirikur Hrafnsson <ei...@idega.is>.
On 12.4.2005, at 13:19, Edmund Urbani wrote:
> Maybe I should ask a different question on this list:
> Does the LuceneIndexer that is currently in CVS HEAD work?
It does, the version in 2.1 does not.
-Eiki
Best Regards
Eirikur S. Hrafnsson, eiki@idega.is
Chief Software Engineer
Idega Software
http://www.idega.com
Re: Slide, Dasl and pdf
Posted by Edmund Urbani <em...@liland.org>.
Bertrand Tignon wrote:
> Thank u for replying Edmund.
> Well, I'm using Slide 2.1
> I didn't manage to get the Slide 2.2 via cvs. I read the "how-to" but I
> don't see the 2.2 version, is it called "Slide_HEAD_PRE_MERGE", or
> "SLIDE_HEAD_AFTER_EVENTS" or something like that ?
>
> About the wiki "DASL Configuration", I don't know how to get the lucene
> library needed (package org.apache.slide.index.*).
>
> thanx for your help
>
> Bertrand.
>
There is no 2.2 release yet. The closest you get to 2.2 is the current
CVS HEAD.
The org.apache.slide.index package is in slide-stores-2.x.jar. It's there
even in 2.1, even though it apparently does not work.
Maybe I should ask a different question on this list:
Does the LuceneIndexer that is currently in CVS HEAD work?
Edmund
Re: Slide, Dasl and pdf
Posted by Bertrand Tignon <be...@capgemini.com>.
Thank u for replying Edmund.
Well, I'm using Slide 2.1
I didn't manage to get the Slide 2.2 via cvs. I read the "how-to" but I
don't see the 2.2 version, is it called "Slide_HEAD_PRE_MERGE", or
"SLIDE_HEAD_AFTER_EVENTS" or something like that ?
About the wiki "DASL Configuration", I don't know how to get the lucene
library needed (package org.apache.slide.index.*).
thanx for your help
Bertrand.
----- Original Message -----
From: "Edmund Urbani" <em...@liland.org>
To: "Slide Users Mailing List" <sl...@jakarta.apache.org>
Sent: Tuesday, April 12, 2005 2:47 PM
Subject: Re: Slide, Dasl and pdf
> Are you using Slide 2.1 or the current CVS version?
> Content search should work better with Slide 2.2 and the Lucene-based indexer.
> There's a howto in the Wiki at http://wiki.apache.org/jakarta-slide/DaslConfiguration.
>
> However, I am currently struggling to get this to work right myself. I'm not sure
> whether that Wiki page really includes all you need.
> In my setup Slide creates the store/index/content dir with a 20-byte
> file called "segments" in it, but it never reacts to changes
> in the repository and never updates the index. Do I need to set up a listener and/or extractors
> as well for this to work?
>
> Looks like this is not a big help for you, Bertrand, but maybe someone will come along
> and help us both with our DASL problems.
>
> Edmund
Re: Slide, Dasl and pdf
Posted by Edmund Urbani <em...@liland.org>.
Bertrand Tignon wrote:
> hello !
> I'm working on a document library with Slide and Dasl.
>
> I'm trying to do plain-text search with a basic DASL search inside .doc and .pdf files.
> I've noticed that if I search a word with more than 2 characters it doesn't return any results. For instance I have a PDF document about XML, when I search the word "xm" it says "xm" was found in this PDF, but when I search the word "xml" it doesn't work !
> Do u understand that ?
> Please help me.
>
> I've also noticed that DASL "like" search is not working with Slide. Is it normal ? Is it gonna be fixed ?
>
> Thanx a lot for your help.
>
> Bertrand.
>
Are you using Slide 2.1 or the current CVS version?
Content search should work better with Slide 2.2 and the Lucene-based indexer.
There's a howto in the Wiki at http://wiki.apache.org/jakarta-slide/DaslConfiguration.
However, I am currently struggling to get this to work right myself. I'm not sure
whether that Wiki page really includes all you need.
In my setup Slide creates the store/index/content dir with a 20-byte
file called "segments" in it, but it never reacts to changes
in the repository and never updates the index. Do I need to set up a listener and/or extractors
as well for this to work?
Looks like this is not a big help for you, Bertrand, but maybe someone will come along
and help us both with our DASL problems.
Edmund
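The symptom described above (an index directory that holds nothing but a small "segments" file) is easy to check for from a script. The following is a minimal sketch that lists the directory and flags that case. The default path store/index/content is the indexpath value quoted in this thread and is resolved against the working directory; the class and method names are hypothetical, and treating "only segments" as "nothing indexed yet" is an assumption drawn from the report above, not from the Lucene documentation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch only: reports whether an index directory looks unpopulated.
class IndexDirCheckSketch {

    // Returns the sorted file names found directly inside dir.
    static List<String> fileNames(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(p -> p.getFileName().toString())
                          .sorted()
                          .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when the directory is empty or holds only a file named "segments".
    static boolean looksUnpopulated(List<String> names) {
        return names.isEmpty()
            || (names.size() == 1 && names.get(0).equals("segments"));
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "store/index/content");
        if (!Files.isDirectory(dir)) {
            System.out.println("No index directory at " + dir.toAbsolutePath());
            return;
        }
        List<String> names = fileNames(dir);
        System.out.println("Files in " + dir + ": " + names);
        System.out.println(looksUnpopulated(names)
                ? "Index looks unpopulated (nothing beyond 'segments')."
                : "Index contains more than the bare 'segments' file.");
    }
}
```

Running it before and after uploading a document gives a quick indication of whether the indexer reacted to the change at all.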