Posted to slide-user@jakarta.apache.org by Bertrand Tignon <be...@capgemini.com> on 2005/04/12 09:03:16 UTC

Slide, Dasl and pdf

hello !
I'm working on a document library with Slide and Dasl.

I'm trying to do a plain-text search inside .doc and .pdf files with a basic DASL search.
I've noticed that if I search for a word with more than 2 characters, no results are returned. For instance, I have a PDF document about XML: when I search for "xm", it says "xm" was found in this PDF, but when I search for "xml", nothing is found!
Do you understand why?
Please help me.

I've also noticed that the DASL "like" search is not working with Slide. Is that normal? Is it going to be fixed?

Thanks a lot for your help.

Bertrand.

This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient,  you are not authorized to read, print, retain, copy, disseminate,  distribute, or use this message or any part thereof. If you receive this  message in error, please notify the sender immediately and delete all  copies of this message.
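
For readers following the thread, a full-text query of the kind described above would look roughly like this in DASL basicsearch syntax. This is only a sketch: the scope /files and the selected property are assumptions for illustration, not details taken from Bertrand's setup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Body of a WebDAV SEARCH request (DASL basicsearch). -->
<!-- Assumption: the documents live under /files; adjust the href to your store. -->
<D:searchrequest xmlns:D="DAV:">
  <D:basicsearch>
    <D:select>
      <D:prop>
        <D:displayname/>
      </D:prop>
    </D:select>
    <D:from>
      <D:scope>
        <D:href>/files</D:href>
        <D:depth>infinity</D:depth>
      </D:scope>
    </D:from>
    <D:where>
      <!-- full-text match against the extracted document content -->
      <D:contains>xml</D:contains>
    </D:where>
  </D:basicsearch>
</D:searchrequest>
```

A "like" condition would replace the contains element with a D:like element wrapping a D:prop and a D:literal pattern such as %xml%. Whether a given Slide version honours either operator is exactly what this thread is asking.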

Re: Slide, Dasl and pdf

Posted by Eirikur Hrafnsson <ei...@idega.is>.
Good question... I hope someone from the Slide team can answer it, since
we are also forced to use 2.2pre1 in a production environment.
(Slide 2.2pre1 is a pretty stable version, though, so don't be afraid to
use it.)

On 13.4.2005, at 07:54, Bertrand Tignon wrote:

>
> And when will the 2.2 version be released ?
>
Best Regards

Eirikur S. Hrafnsson, eiki@idega.is
Chief Software Engineer
Idega Software
http://www.idega.com


---------------------------------------------------------------------
To unsubscribe, e-mail: slide-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-user-help@jakarta.apache.org


Re: Slide, Dasl and pdf

Posted by Bertrand Tignon <be...@capgemini.com>.
And when will the 2.2 version be released ?



Re: Slide, Dasl and pdf

Posted by Eirikur Hrafnsson <ei...@idega.is>.
There is a document somewhere in the Slide HEAD that contains settings
for the indexers (property and content) and the extractors
(PDF, Office...).
It's called Extractor-Domain.xml. You could take a look at that. Here
are my working settings (from my Domain.xml):

.... (at the end of my <store> definition)
    <contentindexer classname="org.apache.slide.index.lucene.LuceneContentIndexer">
        <!-- indexpath is the name of the folder that Lucene creates under your
             tomcat/bin/ folder (if you are using Tomcat). Unfortunately Slide
             does not support storing the Lucene index in the Slide repository :(
             so you must change this parameter if you have multiple contexts
             running, so that they don't collide -->
        <parameter name="indexpath">store/index/content</parameter>
        <!-- asynchron makes the indexing happen in its own thread. It is not
             really needed unless you need really rapid writes to the Slide
             store and can do without simultaneous indexing -->
        <parameter name="asynchron">true</parameter>
    </contentindexer>
    <propertiesindexer classname="org.apache.slide.index.lucene.LucenePropertiesIndexer">
        <parameter name="indexpath">store/index/metadata</parameter>
        <parameter name="asynchron">true</parameter>

        <!-- Here you define all the custom properties (if you have any) that
             you want to index and that you add to a resource with
             propPatchMethod; we have one extra property -->
        <configuration name="indexed-properties">
            <property name="ContentType" namespace="IW:">
                <text/>
                <is-defined/>
            </property>
        </configuration>
    </propertiesindexer>
</store>
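
The comment above mentions adding the custom property to a resource with a PROPPATCH. As a rough sketch, the request body for that could look like the following. Only the property name ContentType and the namespace IW: come from the settings above; the prefix "I" and the value "article" are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Body of a WebDAV PROPPATCH request that sets one custom property. -->
<!-- Assumption: the value "article" and the prefix "I" are hypothetical. -->
<D:propertyupdate xmlns:D="DAV:" xmlns:I="IW:">
  <D:set>
    <D:prop>
      <I:ContentType>article</I:ContentType>
    </D:prop>
  </D:set>
</D:propertyupdate>
```

The property only becomes searchable through the properties indexer if it is also listed under indexed-properties, as ContentType is here.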

And then later in Domain.xml I add the extractors and set them to the 
paths we want to index, in our case just everything under /files
....
    <parameter name="versioncontrol-exclude"/>
    <parameter name="checkout-fork">forbidden</parameter>
    <parameter name="checkin-fork">forbidden</parameter>

    <!-- Extractor configuration -->
    <extractors>
        <!--XML extractors-->
        <extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files">
            <!-- BTW this extractor does NOT work if you have a namespace in your
                 xml document, I have these settings here because someday it will... -->
            <configuration>
                <instruction namespace="http://xmlns.idega.com/block/article/xml" property="headline" xpath="/article/headline/text()" />
                <instruction namespace="http://xmlns.idega.com/block/article/xml" property="teaser" xpath="/article/teaser/text()" />
                <instruction namespace="http://xmlns.idega.com/block/article/xml" property="body" xpath="/article/body/text()" />
                <instruction namespace="http://xmlns.idega.com/block/article/xml" property="author" xpath="/article/author/text()" />
                <instruction namespace="http://xmlns.idega.com/block/article/xml" property="source" xpath="/article/source/text()" />
                <instruction namespace="http://xmlns.idega.com/block/article/xml" property="comment" xpath="/article/comment/text()" />
            </configuration>
        </extractor>
        <extractor classname="org.apache.slide.extractor.XmlContentExtractor" uri="/files"/>
        <!--XML extractors-->

        <!--PDF extractors-->
        <extractor classname="org.apache.slide.extractor.PDFExtractor" uri="/files" />
        <!--PDF extractors-->

        <!--Text extractors-->
        <extractor classname="org.apache.slide.extractor.TextContentExtractor" uri="/files" />
        <!--Text extractors-->

        <!--Office extractors-->
        <extractor classname="org.apache.slide.extractor.OfficeExtractor" uri="/files">
            <configuration>
                <instruction property="author" id="SummaryInformation-0-4" />
                <instruction property="application" id="SummaryInformation-0-18" />
            </configuration>
        </extractor>
        <extractor classname="org.apache.slide.extractor.MSWordExtractor" uri="/files"/>
        <extractor classname="org.apache.slide.extractor.MSExcelExtractor" uri="/files"/>
        <extractor classname="org.apache.slide.extractor.MSPowerPointExtractor" uri="/files"/>
        <!--Office extractors-->

    </extractors>

    <!-- Event configuration -->
    <events>
        <event classname="org.apache.slide.webdav.event.WebdavEvent" enable="true" />
        <event classname="org.apache.slide.event.ContentEvent" enable="true" />
...

And that's how we shave!
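
With a content indexer, a properties indexer and extractors set up along these lines, a single DASL query could in principle combine a content condition with a property condition. The following is a sketch under assumptions, not a tested query: the literal "article" is hypothetical, and whether a particular Slide build evaluates the combined expression through the Lucene indexes has not been confirmed in this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Body of a WebDAV SEARCH request combining content and property conditions. -->
<!-- From the settings: scope /files, property ContentType in namespace IW:. -->
<!-- Assumption: the literal "article" and the prefix "I" are made up. -->
<D:searchrequest xmlns:D="DAV:" xmlns:I="IW:">
  <D:basicsearch>
    <D:select>
      <D:prop>
        <D:displayname/>
        <I:ContentType/>
      </D:prop>
    </D:select>
    <D:from>
      <D:scope>
        <D:href>/files</D:href>
        <D:depth>infinity</D:depth>
      </D:scope>
    </D:from>
    <D:where>
      <D:and>
        <!-- served by the content index (text pulled out by the extractors) -->
        <D:contains>xml</D:contains>
        <!-- served by the properties index -->
        <D:eq>
          <D:prop>
            <I:ContentType/>
          </D:prop>
          <D:literal>article</D:literal>
        </D:eq>
      </D:and>
    </D:where>
  </D:basicsearch>
</D:searchrequest>
```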




On 13.4.2005, at 09:19, Edmund Urbani wrote:

> Eirikur Hrafnsson wrote:
>> It does, the version in 2.1 does not.
>> -Eiki
> Thanks. That's good to hear. I was about to give up.
>
> Now I'd like to go back to the question I had earlier:
> Do I need to add anything to my Domain.xml other than the
> <contentindexer ..>
> element (as explained in the Wiki) to make the Lucene indexer work?
>
>  Edmund
>


Re: Slide, Dasl and pdf

Posted by Edmund Urbani <em...@liland.org>.
Eirikur Hrafnsson wrote:
> 
> 
> It does, the version in 2.1 does not.
> 
> -Eiki
> 
Thanks. That's good to hear. I was about to give up.

Now I'd like to go back to the question I had earlier:
Do I need to add anything to my Domain.xml other than the <contentindexer ..>
element (as explained in the Wiki) to make the Lucene indexer work?

  Edmund



Re: Slide, Dasl and pdf

Posted by Eirikur Hrafnsson <ei...@idega.is>.
On 12.4.2005, at 13:19, Edmund Urbani wrote:

> Maybe I should ask a different question on this list:
> Does the LuceneIndexer that is currently in CVS HEAD work?
It does, the version in 2.1 does not.

-Eiki





Re: Slide, Dasl and pdf

Posted by Edmund Urbani <em...@liland.org>.
Bertrand Tignon wrote:
> Thank you for replying, Edmund.
> Well, I'm using Slide 2.1.
> I didn't manage to get Slide 2.2 via CVS. I read the "how-to", but I
> don't see the 2.2 version. Is it called "Slide_HEAD_PRE_MERGE" or
> "SLIDE_HEAD_AFTER_EVENTS" or something like that?
> 
> About the wiki page "DASL Configuration": I don't know how to get the Lucene
> library needed (package org.apache.slide.index.*).
> 
> Thanks for your help.
> 
> Bertrand.
> 
There is no 2.2 release yet. The closest you can get to 2.2 is the current
CVS HEAD.
The org.apache.slide.index package is in slide-stores-2.x.jar. It's there
even in 2.1, although it apparently does not work.

Maybe I should ask a different question on this list:
Does the LuceneIndexer that is currently in CVS HEAD work?

  Edmund



Re: Slide, Dasl and pdf

Posted by Bertrand Tignon <be...@capgemini.com>.
Thank you for replying, Edmund.
Well, I'm using Slide 2.1.
I didn't manage to get Slide 2.2 via CVS. I read the "how-to", but I
don't see the 2.2 version. Is it called "Slide_HEAD_PRE_MERGE" or
"SLIDE_HEAD_AFTER_EVENTS" or something like that?

About the wiki page "DASL Configuration": I don't know how to get the Lucene
library needed (package org.apache.slide.index.*).

Thanks for your help.

Bertrand.




Re: Slide, Dasl and pdf

Posted by Edmund Urbani <em...@liland.org>.
Are you using Slide 2.1 or the current CVS version?
Content search should work better with Slide 2.2 and the Lucene-based indexer.
There's a howto in the Wiki at http://wiki.apache.org/jakarta-slide/DaslConfiguration.

However, I am currently struggling to get this to work right myself. I'm not sure
whether that Wiki page really includes everything you need.
In my setup Slide creates the store/index/content dir with a 20-byte
file called "segments" in it, but it never reacts to changes
in the repository and never updates the index. Do I need to set up a listener and/or extractors
as well for this to work?

Looks like this is not a big help for you, Bertrand, but maybe someone will come along
and help us both with our DASL problems.

  Edmund
