Posted to dev@jackrabbit.apache.org by thomasg <th...@hotmail.com> on 2006/04/26 16:53:39 UTC

Restricting xpath query to document text

I have extended nt:resource so I can add various properties to the new node
type which is named axxia:resource.

I am running queries such as:

//element(*, axxia:resource)[jcr:contains(@axxia:title, 'Jackbunny')]

to search for words in the properties. It's all working well! The only
problem I have at present is when I want to search ONLY in the contents of
the added document. For that I am running the following query:

//element(*, axxia:resource)[jcr:contains(., 'Jackbunny')]

which is ok and returns a hit when the document contains the word. The
problem is it also returns a hit if the word is not in the document but is
in any of my properties.

How can I modify the query to return hits when the word is in the document
body and not if it is just in one of the properties?
Oh, plus another question while I'm at it. Is there any limit (absolute or
performance) to the number of clauses one can add to the [] (square bracket)
part of the query? Mine potentially could get very large.


P.S. I've just tested that a predicate like this works, but I hope there is a
shorter method:
[jcr:contains(., 'XXYYZZ') and jcr:contains(@axxia:subject, '-XXYYZZ') and
... etc, all my properties I want to exclude...]


Any suggestions would be cool, cheers Thomas



Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@gmx.net>.
thomasg wrote:
> Thanks for your reply. To clarify the situation a little: I was expecting to
> run a query such as:
> 
> //element(*, axxia:resource)[jcr:contains(@jcr:data, 'classes')]
> 
> to only search the contents of a document. This does not currently return an
> expected hit.

that's because the node indexer currently populates the node scope index 
(the one you can query with '.') with the text found in the jcr:data 
property.

> Will resolving the issue JCR-415 referred to make such a search
> possible?

yes, at least it will make it possible to replace the default 
implementation with an indexer that also provides an index for the 
jcr:data property.

regards
  marcel

Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Thanks for your reply. To clarify the situation a little: I was expecting to
run a query such as:

//element(*, axxia:resource)[jcr:contains(@jcr:data, 'classes')]

to only search the contents of a document. This does not currently return an
expected hit. Will resolving the issue JCR-415 referred to make such a search
possible?

Meanwhile I've realised that my idea of running a query similar to:

//element(*, axxia:resource)[jcr:contains(., 'classes') and
jcr:contains(@axxia:subject, '-classes') and ... etc, all my properties I
want to exclude...] 

doesn't work logically, as it won't return a hit if the word is in both the
document text and a property. I guess my only option is to use a different
node structure with the content and properties separated and run two
queries rather than one.
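
Roughly what I have in mind (untested; axxia:content and axxia:metadata are
placeholder node types for the split structure, with both nodes sitting under
a common document node):

import java.util.HashSet;
import java.util.Set;

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class TwoQuerySearch {

    /**
     * Finds documents whose body contains bodyWord AND whose metadata contains
     * metaWord, by running one query per sub-node and intersecting the parent
     * document paths. For a body-only search the first query alone is enough.
     */
    public Set findDocuments(Session session, String bodyWord, String metaWord)
            throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();

        Set bodyDocs = parentPaths(qm.createQuery(
                "//element(*, axxia:content)[jcr:contains(., '" + bodyWord + "')]",
                Query.XPATH).execute().getNodes());

        Set metaDocs = parentPaths(qm.createQuery(
                "//element(*, axxia:metadata)[jcr:contains(., '" + metaWord + "')]",
                Query.XPATH).execute().getNodes());

        // keep only documents matched by both queries
        bodyDocs.retainAll(metaDocs);
        return bodyDocs;
    }

    private Set parentPaths(NodeIterator nodes) throws RepositoryException {
        Set paths = new HashSet();
        while (nodes.hasNext()) {
            Node node = nodes.nextNode();
            paths.add(node.getParent().getPath());
        }
        return paths;
    }
}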

Cheers, Thomas 




Re: Restricting xpath query to document text

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 5/17/06, thomasg <th...@hotmail.com> wrote:
> One slight worry, have you visited www.textmining.org lately?
> Doesn't seem too healthy!

The site has been hacked since December. :-( Would it make sense to
consider alternatives? Some ideas that come to my mind:

a) Contact the Jakarta POI community for their suggestions.

b) Implement a generic text filter that pipes the binary stream through an
external application like catdoc and reads the output back as plain text to
be indexed (a rough sketch of this idea follows after this list).

c) Implement a text filter that uses an OpenOffice "server" through
the UNO API to manipulate Word and other types of documents.
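
A rough sketch of (b), assuming the converter reads the document from stdin
and prints plain text to stdout (the "catdoc -" invocation and the UTF-8
output encoding are assumptions to adjust for the tool you actually use):

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;

public class ExternalTextExtractor {

    /**
     * Pipes a binary document through an external converter and returns a
     * Reader over the converter's stdout, i.e. the plain text to index.
     */
    public Reader extract(InputStream document) throws Exception {
        Process proc = new ProcessBuilder("catdoc", "-").start();

        // feed the binary document to the converter's stdin
        OutputStream toProc = proc.getOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = document.read(buffer)) != -1) {
            toProc.write(buffer, 0, read);
        }
        toProc.close();

        return new InputStreamReader(proc.getInputStream(), "UTF-8");
    }
}

For large documents the stdin and stdout sides should be pumped on separate
threads, otherwise the process can block once the pipe buffers fill up.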

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftsmanship, JCR consulting, and Java development

Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Cheers,

No, there were no exceptions, just no text coming back. This is no longer a
problem now that I'm removing duplicate whitespace. The only problem I now
have is the one related to my other post: if a Word doc contains any index
formatting fields (like { XE "index entry" }), then large sections of the
text can go missing. You're right that this must come down to the way the
textmining or POI APIs are used, so I'll look at the forums. One slight
worry: have you visited www.textmining.org lately? It doesn't seem too healthy!


Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@gmx.net>.
thomasg wrote:
> When you said it got a bit nasty, were you referring to the spooling from
> Reader to String (method readerToString)?

yes, simply because text filters are built to provide a reader and not a 
string.

> It seems that you need to remove
> nasty parts of the returned string such as "\r\n" or the searches don't
> work. Doing this, I have success with smaller documents, but larger ones are
> not all working yet. Maybe there are more malicious characters to remove?

I don't know. When you say some larger documents don't work, do you see an
exception or any other indication of why it's 'not working'?

regards
  marcel

Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Hi,

I have implemented your suggestions with some success. The only change I
made to addBinaryValue() was to add the following code to the end of the
method; the original code remains:

Reader fullTextReader = (Reader) fields.get(FieldNames.FULLTEXT);
if (fullTextReader != null) {
    try {
        String text = readerToString(fullTextReader);
        addStringValue(doc, fieldName, text);
    } catch (IOException e) {
        // TODO logging etc.
        e.printStackTrace();
    }
}

When you said it got a bit nasty, were you referring to the spooling from
Reader to String (method readerToString)? It seems that you need to remove
nasty parts of the returned string such as "\r\n" or the searches don't work.
Doing this, I have success with smaller documents, but larger ones are not all
working yet. Maybe there are more malicious characters to remove?
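
For reference, this is the kind of whitespace clean-up I mean, folded directly
into readerToString() (a sketch only; it needs the same java.io.Reader and
java.io.IOException imports as the original method):

private String readerToString(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder(2048);
    boolean lastWasSpace = true; // also drops leading whitespace
    int c;
    while ((c = reader.read()) != -1) {
        if (Character.isWhitespace((char) c)) {
            // collapse every run of whitespace (including \r and \n)
            // into a single space while spooling
            if (!lastWasSpace) {
                sb.append(' ');
                lastWasSpace = true;
            }
        } else {
            sb.append((char) c);
            lastWasSpace = false;
        }
    }
    return sb.toString().trim();
}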

It's cool that I am now getting results when searching with jcr:like:
jcr:like(@jcr:data, '%are not intended to be%')

Cheers, Thomas


Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Thomas,

can you please post the source of your extended NodeIndexer class? Thanks.

regards
  marcel



Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
I've recently returned to finish a piece of work that was put on hold back in
May. I've been trying to make xpath searches that are restricted to the
content of indexed documents. I have followed the instructions in this
thread for extending NodeIndexer. I now have 
success running a query such as this:

//element(*, axxia:resource)[(jcr:contains(@jcr:data, 'house'))]

This query does not now return a hit if the word is in any property as this
query would:
//element(*, axxia:resource)[(jcr:contains(., 'house'))]



From the info posted in this thread:

"This way the jcr:data property is fulltext indexed twice, once in the 
node scoped fulltext index (the one you can use with 
jcr:contains(.,'whatever') ) and again in the field fulltext index, 
which allows you to use jcr:contains(@jcr:data,'whatever'). In 
addition the value is put a third time to the index as a whole 
(untokenized) which allows you to use jcr:like(@jcr:date, '%foo%')"

(Guessing and hoping that @jcr:date above is a typo of @jcr:data?)

I then tried wildcard queries. Wildcard searches on most properties work,
like this:

//element(*, axxia:resource)[(jcr:like(@axxia:keywords, 'w_a_e_l'))]

I expected running this query would do a wildcard search in document content
only:
//element(*, axxia:resource)[(jcr:like(@jcr:data, 'comp%'))]

This query does not return an expected hit. Maybe there is an additional
step required to enable use of the @jcr:data property with a jcr:like
constraint?

Any advice greatly appreciated.

Thanks, Thomas





Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@gmx.net>.
thomasg wrote:
>> node scoped fulltext index (the one you can use with
>> jcr:contains(.,'whatever') ) and again in the field fulltext index,
>> which allows you to use jcr:contains(@jcr:data,'whatever'). 
> 
> 1) Last time I checked searches such as jcr:contains(@jcr:data,'whatever')
> don't return expected hits,
> I believe issue 415 may help resolve this?

yes.

>> In addition the value is put a third time to the index as a whole
>> (untokenized) which allows you to use jcr:like(@jcr:date, '%foo%')
> 
> 2) Will implementing your suggestions enable document body searches with
> syntax:
> jcr:like(@jcr:data, '%cel Reute%')?

yes.

> 3) Is the solution of duplicating the document text as a string property to
> enable jcr:like
> likely to be significantly less performant than your suggestion?

Not if you disable text filtering in jackrabbit and move this functionality
completely into your application. That is, your application extracts the
text and adds it to a custom property.
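
Roughly like this (just a sketch, untested: extractText() is a placeholder for
whatever extraction library you use, and axxia:text is a hypothetical property
name your node type would have to allow):

import java.io.ByteArrayInputStream;
import java.util.Calendar;

import javax.jcr.Node;
import javax.jcr.RepositoryException;

public class ResourceWriter {

    public void storeResource(Node resource, byte[] content, String mimeType)
            throws RepositoryException {
        resource.setProperty("jcr:mimeType", mimeType);
        resource.setProperty("jcr:lastModified", Calendar.getInstance());
        resource.setProperty("jcr:data", new ByteArrayInputStream(content));
        // duplicate the extracted text in a plain STRING property, so queries
        // can use jcr:contains(@axxia:text, ...) and jcr:like(@axxia:text, ...)
        resource.setProperty("axxia:text", extractText(content, mimeType));
    }

    private String extractText(byte[] content, String mimeType) {
        // placeholder: plug in the POI / textmining.org extraction here
        return "";
    }
}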

document indexing will be a little slower than native jackrabbit indexing, 
because jackrabbit just tokenizes a document and does not store the text 
representation as a whole.

regards
  marcel

Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Thanks Marcel, that's very helpful.

Your suggested approach seems pretty doable on first inspection; there
are a couple of issues I'd like to clarify:

>node scoped fulltext index (the one you can use with
>jcr:contains(.,'whatever') ) and again in the field fulltext index,
>which allows you to use jcr:contains(@jcr:data,'whatever'). 

1) Last time I checked searches such as jcr:contains(@jcr:data,'whatever')
don't return expected hits,
I believe issue 415 may help resolve this?

>In addition the value is put a third time to the index as a whole
>(untokenized) which allows you to use jcr:like(@jcr:date, '%foo%')

2) Will implementing your suggestions enable document body searches with
syntax:
jcr:like(@jcr:data, '%cel Reute%')?

3) Is the solution of duplicating the document text as a string property to
enable jcr:like
likely to be significantly less performant than your suggestion?

Thanks again, Thomas





Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@gmx.net>.
thomasg wrote:
> Yep, I've just found the part in the JCR spec that says jcr:like is only
> supported for string properties. Do you have any idea how easy / difficult
> it would be for us to change this behaviour for our application? It looks
> like we need to enable wildcard search in the document text. I guess we
> could duplicate the doc text in a string property, but it would probably be
> better for us to modify the jcr:like behaviour.

you can try the following:

1) Write your own query handler that extends from
o.a.j.c.query.lucene.SearchIndex and override the method createDocument().
2) Write your own NodeIndexer extending from
o.a.j.c.query.lucene.NodeIndexer, which is then used in the above
mentioned method.
3) From here on it gets a bit nasty!! In your custom NodeIndexer,
override the method addBinaryValue(). You probably have to copy the
method and adapt it. After obtaining the map of fields returned by
the text filter, look for the Reader that was returned with the key
FieldNames.FULLTEXT. You then have to spool the reader into a string
value and call addStringValue().

This way the jcr:data property is fulltext indexed twice, once in the 
node scoped fulltext index (the one you can use with 
jcr:contains(.,'whatever') ) and again in the field fulltext index, 
which allows you to use jcr:contains(@jcr:data,'whatever'). In 
addition the value is put a third time to the index as a whole 
(untokenized) which allows you to use jcr:like(@jcr:date, '%foo%')
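
A very rough sketch of how steps 1 and 2 could be wired together (untested;
the createDocument() signature and the getContext()/getTextFilters() accessors
are assumptions that need to be checked against the SearchIndex sources of the
jackrabbit version you build against):

import java.util.List;

import javax.jcr.RepositoryException;

import org.apache.jackrabbit.core.query.lucene.NamespaceMappings;
import org.apache.jackrabbit.core.query.lucene.NodeIndexer;
import org.apache.jackrabbit.core.query.lucene.SearchIndex;
import org.apache.jackrabbit.core.state.ItemStateManager;
import org.apache.jackrabbit.core.state.NodeState;
import org.apache.lucene.document.Document;

// step 1: a query handler that builds documents with the custom indexer
public class MySearchIndex extends SearchIndex {

    protected Document createDocument(NodeState node, NamespaceMappings nsMappings)
            throws RepositoryException {
        return MyNodeIndexer.createDocument(
                node, getContext().getItemStateManager(), nsMappings, getTextFilters());
    }
}

// step 2: the custom NodeIndexer; addBinaryValue() is overridden as in step 3
class MyNodeIndexer extends NodeIndexer {

    protected MyNodeIndexer(NodeState node, ItemStateManager stateProvider,
                            NamespaceMappings mappings, List textFilters) {
        super(node, stateProvider, mappings, textFilters);
    }

    public static Document createDocument(NodeState node, ItemStateManager stateProvider,
                                          NamespaceMappings mappings, List textFilters)
            throws RepositoryException {
        return new MyNodeIndexer(node, stateProvider, mappings, textFilters).createDoc();
    }
}

The custom handler would then be enabled by pointing the class attribute of
the SearchIndex element in the workspace's search configuration at
MySearchIndex.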

this adds some overhead to the indexing process of nt:resource nodes, 
so be sure you really need it and are willing to pay the price...

regards
  marcel

Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Thanks Marcel, 

Yep, I've just found the part in the JCR spec that says jcr:like is only
supported for string properties. Do you have any idea how easy / difficult
it would be for us to change this behaviour for our application? It looks
like we need to enable wildcard search in the document text. I guess we
could duplicate the doc text in a string property, but it would probably be
better for us to modify the jcr:like behaviour.

Thanks for your help, Thomas


Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Thanks Marcel,

I just thought I would simplify things by indexing the text in lower case
and then turning all search terms to lower case. I've removed this now.

I've played around a little using a very simple Word document. Putting
breakpoints in my NodeIndexer confirms it goes into the new code. I still get
no hits for the following queries:

//element(*, axxia:resource)[(jcr:like(@jcr:data, '%fact%'))]

//element(*, axxia:resource)[(jcr:like(@jcr:data, 'Comp%'))]

The document starts with the word 'Comprehensively' and contains
'manufactured'.

This wildcard search on a property DOES return an expected hit:

//element(*, axxia:resource)[(jcr:like(@axxia:keywords, '%usa%'))]

jcr:contains queries work on the jcr:data property; it just seems that for
some reason jcr:like doesn't look at the jcr:data index.

Any further ideas greatly appreciated,
Thomas





Marcel Reutegger-3 wrote:
> 
> The class looks ok, except that you shouldn't lower-case the text
> retrieved from 
> the resource, or is there a specific reason why this is done?
> 
> Jackrabbit 1.2 will support the functions fn:lower-case() and
> fn:upper-case(), 
> so there is no need to lower-case the text when it is indexed.
> 
> The query you mentioned will return nt:resource nodes with content that
> starts 
> with 'comp':
> 
> //element(*, axxia:resource)[(jcr:like(@jcr:data, 'comp%'))]
> 
> Are you sure that the first word in the document starts with 'comp'?
> 
> regards
>   marcel
> 
> thomasg wrote:
>> Sure, theres probably an obvious error / omission. This is the code:
>> 
>> package com.axxia.dms.indexing.jackrabbit;
>> 
>> import java.io.IOException;
>> import java.io.Reader;
>> import java.util.Collections;
>> import java.util.Iterator;
>> import java.util.List;
>> import java.util.Map;
>> 
>> import javax.jcr.RepositoryException;
>> 
>> import org.apache.jackrabbit.core.PropertyId;
>> import org.apache.jackrabbit.core.query.TextFilter;
>> import org.apache.jackrabbit.core.query.lucene.FieldNames;
>> import org.apache.jackrabbit.core.query.lucene.NamespaceMappings;
>> import org.apache.jackrabbit.core.query.lucene.NodeIndexer;
>> import org.apache.jackrabbit.core.state.ItemStateException;
>> import org.apache.jackrabbit.core.state.ItemStateManager;
>> import org.apache.jackrabbit.core.state.NodeState;
>> import org.apache.jackrabbit.core.state.PropertyState;
>> import org.apache.jackrabbit.name.QName;
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> 
>> import com.axxia.dms.indexing.util.IndexingUtil;
>> 
>> public class AxxiaJackrabbitNodeIndexer extends NodeIndexer
>> {
>>     /**
>>      * Creates a new node indexer.
>>      *
>>      * @param node          the node state to index.
>>      * @param stateProvider the persistent item state manager to retrieve
>> properties.
>>      * @param mappings      internal namespace mappings.
>>      * @param textFilters   List of {@link
>> org.apache.jackrabbit.core.query.TextFilter}s.
>>      */
>>     protected AxxiaJackrabbitNodeIndexer(NodeState node,
>>                           ItemStateManager stateProvider,
>>                           NamespaceMappings mappings,
>>                           List textFilters) {
>>         super(node, stateProvider, mappings, textFilters);
>>     }
>> 
>>     
>>     
>>     /**
>>      * Creates a lucene Document from a node.
>>      *
>>      * @param node          the node state to index.
>>      * @param stateProvider the state provider to retrieve property
>> values.
>>      * @param mappings      internal namespace mappings.
>>      * @param textFilters   list of text filters to use for indexing
>> binary
>>      *                      properties.
>>      * @return the lucene Document.
>>      * @throws RepositoryException if an error occurs while reading
>> property
>>      *                             values from the
>> <code>ItemStateProvider</code>.
>>      */
>>     public static Document createDocument(NodeState node,
>>                                           ItemStateManager stateProvider,
>>                                           NamespaceMappings mappings,
>>                                           List textFilters)
>>             throws RepositoryException {
>>     	AxxiaJackrabbitNodeIndexer indexer = new
>> AxxiaJackrabbitNodeIndexer(node, stateProvider, mappings, textFilters);
>>         return indexer.createDoc();
>>     }    
>>     
>>     /**
>>      * Adds the binary value to the document as the named field.
>>      * <p/>
>>      * This implementation checks if this {@link #node} is of type
>> nt:resource
>>      * and if that is the case, tries to extract text from the data atom
>> using
>>      * the {@link #textFilters}.
>>      *
>>      * @param doc           The document to which to add the field
>>      * @param fieldName     The name of the field to add
>>      * @param internalValue The value for the field to add to the
>> document.
>>      */
>>     protected void addBinaryValue(Document doc, String fieldName, Object
>> internalValue) {
>>     	
>>         // 'check' if node is of type nt:resource
>>         try {
>>             String jcrData = mappings.getPrefix(QName.NS_JCR_URI) +
>> ":data";
>>             if (!jcrData.equals(fieldName)) {
>>                 // don't know how to index
>>                 return;
>>             }
>>             //NB node variabel is of type NodeState
>>             if (node.hasPropertyName(QName.JCR_MIMETYPE)) {
>>                 PropertyState dataProp = (PropertyState)
>> stateProvider.getItemState(
>>                         new PropertyId(node.getNodeId(),
>> QName.JCR_DATA));
>>                 PropertyState mimeTypeProp =
>>                         (PropertyState) stateProvider.getItemState(
>>                                 new PropertyId(node.getNodeId(),
>> QName.JCR_MIMETYPE));
>> 
>>                 // jcr:encoding is not mandatory
>>                 String encoding = null;
>>                 if (node.hasPropertyName(QName.JCR_ENCODING)) {
>>                     PropertyState encodingProp =
>>                             (PropertyState) stateProvider.getItemState(
>>                                     new PropertyId(node.getNodeId(),
>> QName.JCR_ENCODING));
>>                     encoding =
>> encodingProp.getValues()[0].internalValue().toString();
>>                 }
>> 
>>                 
>>                 
>> 
>>                 String mimeType =
>> mimeTypeProp.getValues()[0].internalValue().toString();
>>                 Map fields = Collections.EMPTY_MAP;
>>                 for (Iterator it = textFilters.iterator(); it.hasNext();)
>> {
>>                     TextFilter filter = (TextFilter) it.next();
>>                     // use the first filter that can handle the mimeType
>>                     if (filter.canFilter(mimeType)) {
>>                         fields = filter.doFilter(dataProp, encoding);
>>                         break;
>>                     }
>>                 }
>> 
>>                 for (Iterator it = fields.keySet().iterator();
>> it.hasNext();) {
>>                     String field = (String) it.next();
>>                     Reader r = (Reader) fields.get(field);
>>                     doc.add(Field.Text(field, r));
>>                 }
>>                 
>>             	//After obtaining the  map of fields returned by
>>             	//the text filter look for the Reader that was returned with
>> the key
>>             	//FieldNames.FULLTEXT. you then have to spool the reader
>> into a
>> string
>>             	//value and call addStringValue(). 
>>                 Reader fullTextReader = (Reader)
>> fields.get(FieldNames.FULLTEXT);
>>                 if (fullTextReader != null)
>>                 {
>>                 	try
>>                 	{                    
>>                 	    String text = readerToString(fullTextReader);
>>                 	    addStringValue(doc, fieldName, text.toLowerCase());
>>                 	}
>>                 	catch (IOException e)
>>                 	{
>>                 		//TODO Logging etc
>>                 		e.printStackTrace();
>>                 	}
>>                 }
>>             }
>>         } catch (ItemStateException e) {
>>         	//TODO
>>             //log.warn("Exception while indexing binary property: " +
>> e.toString());
>>             //log.debug("Dump: ", e);
>>         } catch (RepositoryException e) {
>>         	//TODO
>>             //log.warn("Exception while indexing binary property: " +
>> e.toString());
>>             //log.debug("Dump: ", e);
>>         }
>>     }
>>     
>>    /**
>>     * Spools a reader object to string representation.
>>     * @param reader The reader to convert to a string.
>>     * @return String representation of the reader
>>     * @throws IOException
>>     */
>>     private String readerToString(Reader reader) throws IOException 
>>     {
>> 	    int charValue = 0;
>> 	    StringBuilder sb = new StringBuilder(2024);
>> 	    
>> 	    while ((charValue = reader.read()) != -1) {
>> 	    	sb.append((char)charValue);
>> 	    }
>> 	    String result = sb.toString();
>> 	    return result;
>>     }      
>>     
>> }
> 
> 
> 



Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@day.com>.
The class looks ok, except that you shouldn't lower-case the text retrieved from 
the resource, or is there a specific reason why this is done?

Jackrabbit 1.2 will support the functions fn:lower-case() and fn:upper-case(), 
so there is no need to lower-case the text when it is indexed.
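
A case-insensitive match could then be written roughly like this (untested
against 1.2, so double-check the exact function syntax):

//element(*, axxia:resource)[jcr:like(fn:lower-case(@axxia:title), '%jackrabbit%')]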

The query you mentioned will return nt:resource nodes with content that starts 
with 'comp':

//element(*, axxia:resource)[(jcr:like(@jcr:data, 'comp%'))]

Are you sure that the first word in the document starts with 'comp'?

regards
  marcel



Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Sure, there's probably an obvious error / omission. This is the code:

package com.axxia.dms.indexing.jackrabbit;

import java.io.IOException;
import java.io.Reader;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import javax.jcr.RepositoryException;

import org.apache.jackrabbit.core.PropertyId;
import org.apache.jackrabbit.core.query.TextFilter;
import org.apache.jackrabbit.core.query.lucene.FieldNames;
import org.apache.jackrabbit.core.query.lucene.NamespaceMappings;
import org.apache.jackrabbit.core.query.lucene.NodeIndexer;
import org.apache.jackrabbit.core.state.ItemStateException;
import org.apache.jackrabbit.core.state.ItemStateManager;
import org.apache.jackrabbit.core.state.NodeState;
import org.apache.jackrabbit.core.state.PropertyState;
import org.apache.jackrabbit.name.QName;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import com.axxia.dms.indexing.util.IndexingUtil;

public class AxxiaJackrabbitNodeIndexer extends NodeIndexer
{
    /**
     * Creates a new node indexer.
     *
     * @param node          the node state to index.
     * @param stateProvider the persistent item state manager to retrieve properties.
     * @param mappings      internal namespace mappings.
     * @param textFilters   List of {@link org.apache.jackrabbit.core.query.TextFilter}s.
     */
    protected AxxiaJackrabbitNodeIndexer(NodeState node,
                                         ItemStateManager stateProvider,
                                         NamespaceMappings mappings,
                                         List textFilters) {
        super(node, stateProvider, mappings, textFilters);
    }

    /**
     * Creates a lucene Document from a node.
     *
     * @param node          the node state to index.
     * @param stateProvider the state provider to retrieve property values.
     * @param mappings      internal namespace mappings.
     * @param textFilters   list of text filters to use for indexing binary properties.
     * @return the lucene Document.
     * @throws RepositoryException if an error occurs while reading property
     *                             values from the <code>ItemStateProvider</code>.
     */
    public static Document createDocument(NodeState node,
                                          ItemStateManager stateProvider,
                                          NamespaceMappings mappings,
                                          List textFilters)
            throws RepositoryException {
        AxxiaJackrabbitNodeIndexer indexer =
                new AxxiaJackrabbitNodeIndexer(node, stateProvider, mappings, textFilters);
        return indexer.createDoc();
    }

    /**
     * Adds the binary value to the document as the named field.
     * <p/>
     * This implementation checks if this {@link #node} is of type nt:resource
     * and if that is the case, tries to extract text from the data atom using
     * the {@link #textFilters}.
     *
     * @param doc           The document to which to add the field
     * @param fieldName     The name of the field to add
     * @param internalValue The value for the field to add to the document.
     */
    protected void addBinaryValue(Document doc, String fieldName, Object internalValue) {

        // 'check' if node is of type nt:resource
        try {
            String jcrData = mappings.getPrefix(QName.NS_JCR_URI) + ":data";
            if (!jcrData.equals(fieldName)) {
                // don't know how to index
                return;
            }
            // NB the node variable is of type NodeState
            if (node.hasPropertyName(QName.JCR_MIMETYPE)) {
                PropertyState dataProp = (PropertyState) stateProvider.getItemState(
                        new PropertyId(node.getNodeId(), QName.JCR_DATA));
                PropertyState mimeTypeProp = (PropertyState) stateProvider.getItemState(
                        new PropertyId(node.getNodeId(), QName.JCR_MIMETYPE));

                // jcr:encoding is not mandatory
                String encoding = null;
                if (node.hasPropertyName(QName.JCR_ENCODING)) {
                    PropertyState encodingProp = (PropertyState) stateProvider.getItemState(
                            new PropertyId(node.getNodeId(), QName.JCR_ENCODING));
                    encoding = encodingProp.getValues()[0].internalValue().toString();
                }

                String mimeType = mimeTypeProp.getValues()[0].internalValue().toString();
                Map fields = Collections.EMPTY_MAP;
                for (Iterator it = textFilters.iterator(); it.hasNext();) {
                    TextFilter filter = (TextFilter) it.next();
                    // use the first filter that can handle the mimeType
                    if (filter.canFilter(mimeType)) {
                        fields = filter.doFilter(dataProp, encoding);
                        break;
                    }
                }

                for (Iterator it = fields.keySet().iterator(); it.hasNext();) {
                    String field = (String) it.next();
                    Reader r = (Reader) fields.get(field);
                    doc.add(Field.Text(field, r));
                }

                // After obtaining the map of fields returned by the text filter,
                // look for the Reader that was returned with the key
                // FieldNames.FULLTEXT. You then have to spool the reader into a
                // string value and call addStringValue().
                Reader fullTextReader = (Reader) fields.get(FieldNames.FULLTEXT);
                if (fullTextReader != null) {
                    try {
                        String text = readerToString(fullTextReader);
                        addStringValue(doc, fieldName, text.toLowerCase());
                    } catch (IOException e) {
                        // TODO logging etc.
                        e.printStackTrace();
                    }
                }
            }
        } catch (ItemStateException e) {
            // TODO
            //log.warn("Exception while indexing binary property: " + e.toString());
            //log.debug("Dump: ", e);
        } catch (RepositoryException e) {
            // TODO
            //log.warn("Exception while indexing binary property: " + e.toString());
            //log.debug("Dump: ", e);
        }
    }

    /**
     * Spools a reader object to string representation.
     * @param reader The reader to convert to a string.
     * @return String representation of the reader
     * @throws IOException
     */
    private String readerToString(Reader reader) throws IOException
    {
        int charValue = 0;
        StringBuilder sb = new StringBuilder(2024);

        while ((charValue = reader.read()) != -1) {
            sb.append((char) charValue);
        }
        String result = sb.toString();
        return result;
    }

}


Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@gmx.net>.
thomasg wrote:
> Hi Marcel,
> 
> I've just been looking at some more typical queries that our system will
> need to run and am finding a (related) but more serious problem using
> jcr:like. 
> A query such as this works fine:
> String queryString = "//element(*, axxia:resource)[jcr:like(@axxia:title,
> '%inking%Java%')]"; 
> 
> But how can you specify that you want the document text to be like
> '%inking%Java%'? Is it currently the case that jcr:like can only search
> properties and not the document text?

Yes, in JSR-170 the jcr:like() function is specified to only work on String
properties. And I think this makes sense, because a like operation on a
potentially large text which is not tokenized may be quite expensive. You
should instead use jcr:contains, which also works on fulltext-indexed binary
properties.

> Using jcr:contains the situation is ok since jcr:contains(., 'word')
> searches the document text, even though you can't specify 'ONLY the document
> text and not other properties'. 

I think this is a reasonable extension that we should consider in jackrabbit.

it is tracked with this issue:
http://issues.apache.org/jira/browse/JCR-415

regards
  marcel

Re: Restricting xpath query to document text

Posted by thomasg <th...@hotmail.com>.
Hi Marcel,

I've just been looking at some more typical queries that our system will
need to run and am finding a (related) but more serious problem using
jcr:like. 
A query such as this works fine:
String queryString = "//element(*, axxia:resource)[jcr:like(@axxia:title,
'%inking%Java%')]"; 

But how can you specify that you want the document text to be like
'%inking%Java%'? Is it currently the case that jcr:like can only search
properties and not the document text?

Using jcr:contains the situation is ok since jcr:contains(., 'word')
searches the document text, even though you can't specify 'ONLY the document
text and not other properties'. 

Any advice would be appreciated.
Thanks, Thomas


Re: Restricting xpath query to document text

Posted by Marcel Reutegger <ma...@gmx.net>.
thomasg wrote:
> How can I modify the query to return hits when the word is in the document
> body and not if it is just in one of the properties?

with the current implementation the only way to achieve this is to 
prohibit the term in another clause for the excluded properties.

Currently the text representation of a jcr:data property is indexed as
part of the node scope fulltext index, but in some cases (such as yours)
it is also desirable to have a fulltext index on the jcr:data property
itself. This, however, adds some overhead to the node indexing. I think
in the end this is something that should be configurable because it is
not always needed.

I've created a jira issue that describes this enhancement:
http://issues.apache.org/jira/browse/JCR-415

> Oh, plus another question while I'm at it. Is there any limit (absolute or
> performance) to the number of clauses one can add to the [] (square bracket)
> part of the query? Mine potentially could get very large.

Lucene has a maximum clause limit of 1024, though this can be set to a
higher value. Jackrabbit does not yet allow you to configure this value.
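
As a sketch only (this is a plain lucene call, nothing jackrabbit exposes
yet), the limit could be raised once during application startup:

import org.apache.lucene.search.BooleanQuery;

public class LuceneTuning {

    public static void raiseClauseLimit() {
        // 4096 is an arbitrary example; queries over the limit would otherwise
        // fail with a BooleanQuery.TooManyClauses exception
        BooleanQuery.setMaxClauseCount(4096);
    }
}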

wrt performance, the more clauses you have the longer the query will 
take to execute. in the end this completely depends on lucene. you might 
be able to find performance information on the lucene website.


regards
  marcel