Posted to users@jackrabbit.apache.org by Orlando Palis <or...@gmail.com> on 2013/06/07 02:58:25 UTC

jackrabbit 2.6.0 Full Text Search

Hi Folks,

I'm new to Jackrabbit and I'm trying out full-text search using Jackrabbit
2.6.0 (with Tika 1.3). I have a custom node type that allows me to store
some custom properties and multiple HTML files (stored as binary). I have
the following configuration:

*workspace.xml:*

<?xml version="1.0" encoding="UTF-8"?>
<Workspace name="default">
    <!--
        virtual file system of the workspace:
        class: FQN of class implementing the FileSystem interface
    -->
    <FileSystem class="org.apache.jackrabbit.core.fs.db.OracleFileSystem">
        <param name="dataSourceName" value="ds1"/>
        <param name="schemaObjectPrefix" value="fs_${wsp.name}_"/>
    </FileSystem>
    <!--
        persistence manager of the workspace:
        class: FQN of class implementing the PersistenceManager interface
    -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.OraclePersistenceManager">
        <param name="dataSourceName" value="ds1"/>
        <param name="schemaObjectPrefix" value="pm_${wsp.name}_"/>
    </PersistenceManager>
    <!--
        Search index and the file system it uses.
        class: FQN of class implementing the QueryHandler interface
    -->
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
        <param name="path" value="${wsp.home}/index"/>
        <param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <param name="queryClass" value="org.apache.jackrabbit.core.query.QueryImpl"/>
        <param name="excerptProviderClass" value="org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt"/>
        <param name="supportHighlighting" value="true"/>
        <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
    </SearchIndex>
</Workspace>


*tika-config.xml:*

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml" magic="false"/>
    <parsers>
        <parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
            <mime>text/html</mime>
            <mime>application/xhtml+xml</mime>
            <mime>application/x-asp</mime>
        </parser>
    </parsers>
</properties>

*JCR-SQL2 queries tested:*

1) SELECT * FROM [nt:file] as file WHERE CONTAINS(file.*, 'This')

2) SELECT * FROM [nt:file] as file WHERE CONTAINS(file.*, 'This*')

3)
SELECT file.*, resource.* FROM [nt:file] AS file
INNER JOIN [nt:resource] AS resource ON ISCHILDNODE(resource, file)
WHERE resource.[jcr:mimeType] = 'text/html'
AND CONTAINS(file.*, 'This')

4)
SELECT file.*, resource.* FROM [nt:file] AS file
INNER JOIN [nt:resource] AS resource ON ISCHILDNODE(resource, file)
WHERE resource.[jcr:mimeType] = 'text/html'
AND CONTAINS(file.*, 'This*')

*Result:*
None of these queries returns any rows. If I remove the CONTAINS() clause,
all of them return rows, and for queries #3 and #4 I can see that
resource.[jcr:data] contains the text ("This") I am searching for when I
dump the result to the log file. I've also tried deleting the index folder
so that the repository is re-indexed, but full-text search still returns
nothing.

What am I missing?  In addition, is there any documentation on how to
configure tika (tika-config.xml)?


Thanks and Regards,
Orlando

Re: jackrabbit 2.6.0 Full Text Search

Posted by Orlando Palis <or...@gmail.com>.
*The following JCR-SQL2 queries don't work either:*

5) SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This')

6) SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This*')

7) SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, '*This*')




Re: jackrabbit 2.6.0 Full Text Search

Posted by hsp <pi...@ibest.com.br>.
I added the tikaConfigPath parameter to the search section of workspace.xml
and supplied a tika-config.xml with the MIME types I want to extract and index:

<properties>
	<detectors>
		<detector class="org.apache.tika.detect.DefaultDetector" />
	</detectors>
	<parsers>
		<parser class="org.apache.tika.parser.DefaultParser" />
		<parser class="org.apache.tika.parser.EmptyParser">
			<mime>application/pdf</mime>
		</parser>
	</parsers>
</properties>

But in debug mode the extracted text was blank, and the search did not
return the PDF item.
Again, is there a way to test whether this is my fault or not?
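
One thing that can be checked without a running repository is which MIME types a given tika-config.xml hands to org.apache.tika.parser.EmptyParser. That parser produces no text, and the default configuration quoted elsewhere in this thread lists only archive and image types under it, so an application/pdf entry there would plausibly explain blank extraction. The sketch below uses only the JDK's XML API; the class and method names (EmptyParserMimes, mimesRoutedToEmptyParser) are made up for illustration and are not part of Jackrabbit or Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper; not part of Jackrabbit or Tika.
class EmptyParserMimes {

    static final String EMPTY_PARSER = "org.apache.tika.parser.EmptyParser";

    /** Returns every <mime> value declared under an EmptyParser entry. */
    static List<String> mimesRoutedToEmptyParser(String tikaConfigXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            tikaConfigXml.getBytes(StandardCharsets.UTF_8)));
            List<String> mimes = new ArrayList<>();
            NodeList parsers = doc.getElementsByTagName("parser");
            for (int i = 0; i < parsers.getLength(); i++) {
                Element parser = (Element) parsers.item(i);
                if (!EMPTY_PARSER.equals(parser.getAttribute("class"))) {
                    continue; // only EmptyParser entries are of interest
                }
                NodeList mimeNodes = parser.getElementsByTagName("mime");
                for (int j = 0; j < mimeNodes.getLength(); j++) {
                    mimes.add(mimeNodes.item(j).getTextContent().trim());
                }
            }
            return mimes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse tika-config XML", e);
        }
    }

    public static void main(String[] args) {
        // The configuration from the message above, inlined as a string.
        String config =
                "<properties>"
              + "<detectors>"
              + "<detector class=\"org.apache.tika.detect.DefaultDetector\"/>"
              + "</detectors>"
              + "<parsers>"
              + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
              + "<parser class=\"org.apache.tika.parser.EmptyParser\">"
              + "<mime>application/pdf</mime>"
              + "</parser>"
              + "</parsers>"
              + "</properties>";
        System.out.println(mimesRoutedToEmptyParser(config)); // prints [application/pdf]
    }
}
```

If a type you expect to be searchable shows up in that list, removing it from the EmptyParser entry is probably the first thing to try.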

Regards
Helio.



--
View this message in context: http://jackrabbit.510166.n4.nabble.com/jackrabbit-2-6-0-Full-Text-Search-tp4658832p4659003.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.

Re: jackrabbit 2.6.0 Full Text Search

Posted by hsp <pi...@ibest.com.br>.
I see this in org.apache.jackrabbit.core.query.lucene.NodeIndexer:
    /**
     * Returns <code>true</code> if the provided type is among the types
     * supported by the Tika parser we are using.
     *
     * @param type  the type to check.
     * @return whether the type is supported by the Tika parser we are
     *         using.
     */
    protected boolean isSupportedMediaType(final String type) {
        if (supportedMediaTypes == null) {
            supportedMediaTypes = parser.getSupportedTypes(null);
        }
        return supportedMediaTypes.contains(MediaType.parse(type));
    }

supportedMediaTypes ends up loaded with:
application/x-tar,
application/x-bzip,
application/x-bzip2,
image/x-icon,
image/vnd.wap.wbmp,
image/vnd.adobe.photoshop,
application/x-cpio,
image/x-xcf,
application/zip,
image/x-ms-bmp,
image/jpeg,
image/png,
application/x-gtar,
application/x-archive,
image/gif,
application/x-gzip

This way the MIME types I have (txt, office, pdf) will never be extracted...

But where is the configuration for this? The default tika-config.xml is:
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.EmptyParser">

      <mime>application/x-archive</mime>
      <mime>application/x-bzip</mime>
      <mime>application/x-bzip2</mime>
      <mime>application/x-cpio</mime>
      <mime>application/x-gtar</mime>
      <mime>application/x-gzip</mime>
      <mime>application/x-tar</mime>
      <mime>application/zip</mime>

      <mime>image/bmp</mime>
      <mime>image/gif</mime>
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
      <mime>image/vnd.wap.wbmp</mime>
      <mime>image/x-icon</mime>
      <mime>image/x-psd</mime>
      <mime>image/x-xcf</mime>
    </parser>
  </parsers>
</properties>

I feel I am almost there.
The documentation is a bit lacking on this...

Best Regards
Helio




Re: jackrabbit 2.6.0 Full Text Search

Posted by hsp <pi...@ibest.com.br>.
Neither 2.6.x nor 2.4.x is indexing PDF for full-text search over
nt:resource/jcr:data. Searches with contains() over nt:file/my:property are
OK.
What can I do? Is there some trick? I have already run a consistency
check/fix and a reindex, and nothing has any effect.
Thanks in advance.
Helio.




Re: jackrabbit 2.6.0 Full Text Search

Posted by hsp <pi...@ibest.com.br>.
I am on Jackrabbit 2.6.2, searching with XPath, and PDF files are not
full-text searchable, whereas with the old Jackrabbit version they are
returned normally. I recreated the indexes, to no effect.
Is the default tika-config.xml enough, or do I have to rewrite it with
other parsers and declare its path in repository.xml?
Regards
Helio.




Re: jackrabbit 2.6.0 Full Text Search

Posted by hsp <pi...@ibest.com.br>.
The dependency 
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.3</version>
        </dependency>
must be included in the application for the extraction to work.
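
A quick way to confirm whether the parser classes are actually visible to the application is to try loading them by name. This is only a sketch: the helper class ParserClasspathCheck is hypothetical, and the PDF parser class name is my assumption about what tika-parsers 1.3 ships (the HTML one appears in the tika-config.xml earlier in this thread). In a web application the relevant class loader may differ from the one used here.

```java
// Hypothetical diagnostic; not part of Jackrabbit or Tika.
class ParserClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names from tika-parsers; adjust to the parsers you rely on.
        String[] candidates = {
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.html.HtmlParser"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the classes are reported as missing, that would be consistent with the short supportedMediaTypes list shown earlier in the thread, which contains only archive and image types.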
Regards
Helio.




Re: jackrabbit 2.6.0 Full Text Search

Posted by "orlando.p" <or...@gmail.com>.
Ard wrote
> Did you check other words than 'this'? 'This' is a stopword and gets
> filtered out while fulltext indexing / querying. You are using the
> StandardAnalyzer which uses org.apache.lucene.analysis.StopAnalyzer
> which contains a default ENGLISH_STOP_WORDS array containing 'this'
> 
> Regards Ard

Yes.  In fact, I added the sentence "The quick brown fox jumps over the lazy
dog." and searched for each word just now to re-test.  I can find them using
2.4.3 but not 2.6.0.

regards,
Orlando




Re: jackrabbit 2.6.0 Full Text Search

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Jun 17, 2013 at 9:11 AM, orlando.p <or...@gmail.com> wrote:
> I've tested the same query in 2.6.2 and it didn't work either.  I rolled
> back to 2.4.3 and everything worked fine.  There seems to be a full-text
> search issue in versions 2.6.0 and 2.6.2.

Did you check other words than 'this'? 'This' is a stopword and gets
filtered out while fulltext indexing / querying. You are using the
StandardAnalyzer which uses org.apache.lucene.analysis.StopAnalyzer
which contains a default ENGLISH_STOP_WORDS array containing 'this'
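
To illustrate the effect, here is a minimal, self-contained sketch of stop-word filtering. It does not use Lucene: the word list is my recollection of Lucene's default English stop words and the tokenizer is a crude stand-in for the real analyzer, so treat both as assumptions. StopWordDemo and indexedTerms are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified illustration only; the real analysis is done by Lucene.
class StopWordDemo {

    // Assumed copy of Lucene's default English stop words.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it", "no", "not", "of",
            "on", "or", "such", "that", "the", "their", "then", "there",
            "these", "they", "this", "to", "was", "will", "with"));

    /** Lower-cases the text, splits on non-alphanumerics, drops stop words. */
    static List<String> indexedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexedTerms("This"));              // prints []
        System.out.println(indexedTerms("This is the body.")); // prints [body]
        System.out.println(indexedTerms(
                "The quick brown fox jumps over the lazy dog."));
        // prints [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```

Under these assumptions a query for 'This' has nothing left to match, so a test word such as 'body' or 'quick' is a safer probe.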

Regards Ard




--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: jackrabbit 2.6.0 Full Text Search

Posted by "orlando.p" <or...@gmail.com>.
I've tested the same query in 2.6.2 and it didn't work either.  I rolled
back to 2.4.3 and everything worked fine.  There seems to be a full-text
search issue in versions 2.6.0 and 2.6.2.




Re: jackrabbit 2.6.0 Full Text Search

Posted by "orlando.p" <or...@gmail.com>.
Hi Torsten,
Yes, I have seen the link and have tested the following queries as well but
none worked.

SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This')
SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This*')
SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, '*This*')

My other JCR-SQL2 queries work just fine until I add the CONTAINS clause. I
don't see errors in the log file either.

When I dump the result of the query without the CONTAINS clause, I am able
to see the text being searched.  Below is an example of the dumped data with
some lines omitted to reduce size.

[rt.jcr:configuration] ---> []
[rt.jcr:activity] ---> []
[rt.jcr:uuid] ---> [5183ee77-3aa9-497f-928a-4e61b2cac14c]
<<lines omitted>>
[rt.jcr:baseVersion] ---> [ee23fbe3-ec5e-4b29-bed1-24fd5a0a78c1]
[rt.jcr:isCheckedOut] ---> [true]
[rt.jcr:primaryType] ---> [rt:RuleTariff]
[rt.rt:EffectiveStartDate] ---> [20140101000000.000]
<<lines omitted>>
[rt.jcr:versionHistory] ---> [ef428650-639e-4153-9c56-26c82b5fe26b]
<<lines omitted>>
[file.jcr:createdBy] ---> [admin]
[file.jcr:created] ---> [2013-06-12T13:36:36.746-07:00]
[file.jcr:primaryType] ---> [nt:file]
[resource.jcr:lastModifiedBy] ---> [admin]
[resource.jcr:uuid] ---> [3cc13d9e-5f52-49b3-a85d-e626f691305d]
[resource.jcr:data] ---> [<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title></title>
<<lines omitted>>
</head>
<body>
<<lines omitted>>

<p class="c4">This is the body.</p>
<<lines omitted>>
</body>
</html>
]
[resource.jcr:encoding] ---> [UTF-8]
[resource.jcr:mimeType] ---> [text/html]
[resource.jcr:lastModified] ---> [2013-06-12T13:36:36.756-07:00]

regards,
Orlando




Re: jackrabbit 2.6.0 Full Text Search

Posted by Torsten Stolpmann <st...@verit.de>.
Hi Orlando,

Did you read the answer from Alexander Klimetschek here: 
http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201207.mbox/%3C9ADA9A33-5AA7-49DB-B32A-AF41610E335D@adobe.com%3E 
?

On 27.06.2012, at 17:19, Furst, Carl wrote:

 > > So here's the sql I use:
 > >
 > > select * from [nt:resource] where  contains([jcr:data], 'include');
 >
 > The full text index for binary properties is by default aggregated on
 > the node itself, not on the jcr:data property. You address that with
 > "*" and you need a selector (s in this case):
 >
 > select * from [nt:resource] as s where contains(s.*, 'include')
 >
 > (In the former SQL1 you could simply do CONTAINS(., 'include') to
 > address the node itself).
 >
 > See my recent mail (about xpath, but same index is used):
 > http://markmail.org/message/oc6uootrpxepso4d
 >
 > Cheers,
 > Alex

Hope this helps,

Torsten




-- 
Torsten Stolpmann
Geschäftsführender Gesellschafter

verit Informationssysteme GmbH
Europaallee 10
67657 Kaiserslautern

E-Mail: stolpmann@verit.de
Telefon: +49 631 520 840 00
Fax: +49 631 520 840 01
Web: http://www.verit.de/

Registergericht: Amtsgericht Kaiserslautern
Registernummer: HRB 3751
Geschäftsleitung: Claudia Könnecke, Torsten Stolpmann