Posted to users@jackrabbit.apache.org by Patrick Wider <pa...@yahoo.fr> on 2007/10/23 12:18:57 UTC

Re : Re : Binary Content Search Problem...

Hi,
I really don't think file 3 replaces the previous ones. I create one "top" node (called "Homepage"), to which I attach 3 different nodes using homepage.addNode(...) (typed as: wider:file > 'nt:file', 'mix:referenceable'; maybe there is something missing in my node type definition?). I also attach 3 different nt:resource nodes, one per file node. It goes like this:

   File fileTXT = new File("C:/JackRabbit/testresources/JackRabbittest.txt");
   File fileDOC = new File("C:/JackRabbit/testresources/JackRabbittest.doc");

   Node file1 = homepage.addNode("MyStringName", "wider:file");
   Node res1 = file1.addNode("jcr:content", "nt:resource");
   res1.setProperty("jcr:mimeType", mimetype);
   res1.setProperty("jcr:encoding", "");
   res1.setProperty("jcr:lastModified", cal);   
   res1.setProperty("jcr:data", "My String with MyKeyWord Content toto");
   session.save();

   Node file2 = homepage.addNode(fileTXT.getName(), "wider:file");
   Node res2 = file2.addNode("jcr:content", "nt:resource");
   res2.setProperty("jcr:mimeType", mimetype);
   res2.setProperty("jcr:encoding", "");
   res2.setProperty("jcr:lastModified", cal);
   InputStream inputTXT = new FileInputStream(fileTXT);
   res2.setProperty("jcr:data", inputTXT);
   session.save();

   Node file3 = homepage.addNode(fileDOC.getName(), "wider:file");
   Node res3 = file3.addNode("jcr:content", "nt:resource");
   res3.setProperty("jcr:mimeType", mimetype);
   res3.setProperty("jcr:encoding", "");
   res3.setProperty("jcr:lastModified", cal);
   InputStream inputDOC = new FileInputStream(fileDOC);
   res3.setProperty("jcr:data", inputDOC);
   session.save();


Yes, my query now returns one hit: the .doc file, even though MyKeyWord appears in all 3 contents.

Before, I got no results because of the missing jars. That problem is now resolved and the Word document is indexed!
But the simple text file is not... weird, isn't it?

BR, Patrick

----- Original Message ----
From: Ard Schrijvers <a....@hippo.nl>
To: users@jackrabbit.apache.org; Patrick Wider <pa...@yahoo.fr>
Sent: Tuesday, 23 October 2007, 11:55:29
Subject: RE: Re : Binary Content Search Problem...


Hello Patrick,

Could file 3 perhaps have replaced file 2 and file 1? Did you call session.save() after each file?

Do I understand correctly that you now at least get a hit for  

/jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]

where you did not have this one before?

Ard

> 
> Hi Ard,
> 
> Thanks for your answer, especially the part about the 
> logs. That is how I realized they were disabled... Shame 
> on me! ;-) Anyway, the logs showed me that some jars were 
> missing from the classpath.
> After fixing that, I re-created my repository with one 
> node to which I attached 3 files (that is, I created an 
> nt:file node with an nt:resource node for each attached file). 
> My files are:
> 1. jcr:data is set from a String, as you asked 
>    me to do. I put text/plain as the mime type (since the field is 
>    mandatory).
> 2. jcr:data is set from a stream on a simple 
>    text file (mime type: text/plain).
> 3. jcr:data is set from a stream on a Word 
>    document (mime type: application/msword).
> 
> I created these nodes, and here are extracts of the logs that I 
> got related to indexing (note that there are no error entries in 
> the whole log file, only debug).
> 
> file 1:
> DEBUG - persisting change log {#addedStates=15, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 172ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 31 ms.
> 
> file 2:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 79ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 0 ms.
> DEBUG - got EventStateCollection
> 
> file 3:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 125ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, application/msword, )
> DEBUG - onEvent: indexing finished in 78 ms.
> DEBUG - got EventStateCollection
> 
> 
> Checking the state of the index with Luke, I could see 
> that file 3 (Word) was tokenized, but the contents of 
> files 1 and 2 don't appear anywhere, even though the 
> respective properties and nodes do appear!
> Consequently, when I run the following XPath query:
> /jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]
> 
> The only result is the Word Document...
> 
> What happened with the 2 other files?
> Maybe the mimetype is wrong (text/plain) ?
> Or did I forget to define something ?
> Maybe I did something wrong in my filter definition, which is:
>    <param name="textFilterClasses" 
>    value="org.apache.jackrabbit.extractor.PlainTextExtractor,
>      org.apache.jackrabbit.extractor.MsWordTextExtractor,
>      org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>      org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>      org.apache.jackrabbit.extractor.PdfTextExtractor,
>      org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>      org.apache.jackrabbit.extractor.RTFTextExtractor,
>      org.apache.jackrabbit.extractor.HTMLTextExtractor,
>      org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> 
> 
> I thought that 
> org.apache.jackrabbit.extractor.PlainTextExtractor could 
> handle simple text files... 
> As you can see, it is getting better, but I still need a 
> little help ;-) so if you have any ideas, don't hesitate.
> 
> Thank you in advance,
> BR
> Patrick
> 
> 
> 
> ----- Original Message ----
> From: Ard Schrijvers <a....@hippo.nl>
> To: users@jackrabbit.apache.org; Patrick Wider <pa...@yahoo.fr>
> Sent: Monday, 22 October 2007, 14:59:53
> Subject: RE: Binary Content Search Problem...
> 
> Hello Patrick,
> 
> 
> > Patrick Wider wrote:
> > 
> > Of course the files contain somehow 'myKeyWord'... the text file 
> > contains it for sure, but in the Document, 'myKeyWord'
> > is wrapped by bold and italic styles. But I don't think the styles 
> > cause any problems... on the other hand, I have no idea how the 
> > extractors work ;-) it's just a guess....
> 
> Just for pinpointing the problem, what happens if:
> 
> 1) you search for a word that is not with bold or italic styles?
> 2) if you replace inputstr with "a string to test myKeyWord", 
> and then do the search again
> 
> You might want to turn on the logging for the indexing and 
> extractors, perhaps they reveal some problems. Furthermore 
> you might want to take a look at the latest created index 
> folder after adding a binary doc with luke [1] and see if the 
> binary data is present as tokens in the index
> 
> Regards Ard
> 
> [1] http://www.getopt.org/luke/
> 
> >
> 
> 



Re: Re : Re : Binary Content Search Problem...

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 10/23/07, Patrick Wider <pa...@yahoo.fr> wrote:
>    Node res1 = file1.addNode("jcr:content", "nt:resource");
>    res1.setProperty("jcr:mimeType", mimetype);
>    res1.setProperty("jcr:encoding", "");

Gotcha. The current PlainTextExtractor class will use the jcr:encoding
value as-is if the property is available, and fail silently (I guess
we should at least add some logging...) if the encoding is not
supported by the JVM.

You should either not set the jcr:encoding property at all or set it
to a valid encoding name.

BR,

Jukka Zitting
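
For readers who want to guard against this, here is a minimal sketch of the "valid encoding name" check described above. It is an illustration only: the class EncodingCheck and the method isUsableEncoding are hypothetical names, not part of Jackrabbit or the JCR API. It relies only on the standard java.nio.charset.Charset class to ask the JVM whether an encoding is supported.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of Jackrabbit or JCR.
final class EncodingCheck {
    private EncodingCheck() {}

    /** Returns true only if the JVM supports the named encoding. */
    static boolean isUsableEncoding(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Charset.isSupported throws for syntactically illegal names
            // (on current JDKs this includes the empty string).
            return false;
        }
    }

    public static void main(String[] args) {
        // Try the empty value from the thread, a common valid name,
        // and a well-formed but unknown name.
        for (String candidate : new String[] {"", "UTF-8", "no-such-charset"}) {
            System.out.println("'" + candidate + "' usable: " + isUsableEncoding(candidate));
        }
    }
}
```

With a check like this, the code from the thread would call setProperty("jcr:encoding", ...) only when the value passes, and otherwise leave the property unset, which matches the two options given above.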