Posted to users@jackrabbit.apache.org by Orlando Palis <or...@gmail.com> on 2013/06/07 02:58:25 UTC
jackrabbit 2.6.0 Full Text Search
Hi Folks,
I'm new to Jackrabbit and I'm trying out full-text search using Jackrabbit
2.6.0 (with Tika 1.3). I have a custom node type that lets me store
some custom properties and multiple HTML files (stored as binary). I have
the following configuration:
*workspace.xml:*
<?xml version="1.0" encoding="UTF-8"?>
<Workspace name="default">
    <!--
        virtual file system of the workspace:
        class: FQN of class implementing the FileSystem interface
    -->
    <FileSystem class="org.apache.jackrabbit.core.fs.db.OracleFileSystem">
        <param name="dataSourceName" value="ds1"/>
        <param name="schemaObjectPrefix" value="fs_${wsp.name}_"/>
    </FileSystem>
    <!--
        persistence manager of the workspace:
        class: FQN of class implementing the PersistenceManager interface
    -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.OraclePersistenceManager">
        <param name="dataSourceName" value="ds1"/>
        <param name="schemaObjectPrefix" value="pm_${wsp.name}_"/>
    </PersistenceManager>
    <!--
        Search index and the file system it uses.
        class: FQN of class implementing the QueryHandler interface
    -->
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
        <param name="path" value="${wsp.home}/index"/>
        <param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <param name="queryClass" value="org.apache.jackrabbit.core.query.QueryImpl"/>
        <param name="excerptProviderClass" value="org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt"/>
        <param name="supportHighlighting" value="true"/>
        <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
    </SearchIndex>
</Workspace>
*tika-config.xml:*
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml" magic="false"/>
    <parsers>
        <parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
            <mime>text/html</mime>
            <mime>application/xhtml+xml</mime>
            <mime>application/x-asp</mime>
        </parser>
    </parsers>
</properties>
*JCR-SQL2 queries tested:*
1) SELECT * FROM [nt:file] as file WHERE CONTAINS(file.*, 'This')
2) SELECT * FROM [nt:file] as file WHERE CONTAINS(file.*, 'This*')
3)
SELECT file.*, resource.* FROM [nt:file] AS file
INNER JOIN [nt:resource] AS resource ON ISCHILDNODE(resource, file)
WHERE resource.[jcr:mimeType] = 'text/html'
AND CONTAINS(file.*, 'This')
4)
SELECT file.*, resource.* FROM [nt:file] AS file
INNER JOIN [nt:resource] AS resource ON ISCHILDNODE(resource, file)
WHERE resource.[jcr:mimeType] = 'text/html'
AND CONTAINS(file.*, 'This*')
*Result:*
None of the queries returns any rows. If I remove the CONTAINS() clause,
I get rows from all of the queries above, and for queries #3 and #4 I
can see that the field resource.[jcr:data] contains the text ("This") I am
searching for when I dump the result to the log file. I've also tried
deleting the index folder so that the repository is re-indexed, but
full-text search still returns nothing.
What am I missing? In addition, is there any documentation on how to
configure Tika (tika-config.xml)?
Thanks and Regards,
Orlando
Re: jackrabbit 2.6.0 Full Text Search
Posted by Orlando Palis <or...@gmail.com>.
*The following JCR-SQL2 queries don't work either:*
5) SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This')
6) SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This*')
7) SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, '*This*')
Re: jackrabbit 2.6.0 Full Text Search
Posted by hsp <pi...@ibest.com.br>.
I added the tikaConfigPath element to the search section of workspace.xml
and supplied a tika-config.xml with the MIME types I want to extract and index:
<properties>
    <detectors>
        <detector class="org.apache.tika.detect.DefaultDetector" />
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser" />
        <parser class="org.apache.tika.parser.EmptyParser">
            <mime>application/pdf</mime>
        </parser>
    </parsers>
</properties>
But in debug mode the extracted text was blank, and the search did not
return the PDF item.
Again, is there a way to test whether this is my fault or not?
Regards
Helio.
--
View this message in context: http://jackrabbit.510166.n4.nabble.com/jackrabbit-2-6-0-Full-Text-Search-tp4658832p4659003.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
Re: jackrabbit 2.6.0 Full Text Search
Posted by hsp <pi...@ibest.com.br>.
I see this in org.apache.jackrabbit.core.query.lucene.NodeIndexer:
    /**
     * Returns <code>true</code> if the provided type is among the types
     * supported by the Tika parser we are using.
     *
     * @param type the type to check.
     * @return whether the type is supported by the Tika parser we are using.
     */
    protected boolean isSupportedMediaType(final String type) {
        if (supportedMediaTypes == null) {
            supportedMediaTypes = parser.getSupportedTypes(null);
        }
        return supportedMediaTypes.contains(MediaType.parse(type));
    }
In my case supportedMediaTypes ends up loaded with:
application/x-tar,
application/x-bzip,
application/x-bzip2,
image/x-icon,
image/vnd.wap.wbmp,
image/vnd.adobe.photoshop,
application/x-cpio,
image/x-xcf,
application/zip,
image/x-ms-bmp,
image/jpeg,
image/png,
application/x-gtar,
application/x-archive,
image/gif,
application/x-gzip
This way the MIME types I have (txt, office, pdf) will never be extracted...
But where is the configuration for this? I ask because the default
tika-config.xml is:
<properties>
    <detectors>
        <detector class="org.apache.tika.detect.DefaultDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.EmptyParser">
            <mime>application/x-archive</mime>
            <mime>application/x-bzip</mime>
            <mime>application/x-bzip2</mime>
            <mime>application/x-cpio</mime>
            <mime>application/x-gtar</mime>
            <mime>application/x-gzip</mime>
            <mime>application/x-tar</mime>
            <mime>application/zip</mime>
            <mime>image/bmp</mime>
            <mime>image/gif</mime>
            <mime>image/jpeg</mime>
            <mime>image/png</mime>
            <mime>image/vnd.wap.wbmp</mime>
            <mime>image/x-icon</mime>
            <mime>image/x-psd</mime>
            <mime>image/x-xcf</mime>
        </parser>
    </parsers>
</properties>
I feel I am almost there.
The documentation is a bit lacking on this...
Best Regards
Helio
Re: jackrabbit 2.6.0 Full Text Search
Posted by hsp <pi...@ibest.com.br>.
Neither 2.6.x nor 2.4.x is indexing PDF for full-text search over
nt:resource/jcr:data. Searches with contains() over nt:file/my:property are
OK.
What can I do, is there some trick? I have already run a consistencyCheck/fix
and a reindex, and nothing has any effect.
Thanks in advance.
Helio.
Re: jackrabbit 2.6.0 Full Text Search
Posted by hsp <pi...@ibest.com.br>.
I am on Jackrabbit 2.6.2, searching with XPath, and the PDF files are not
full-text searchable, whereas with the old Jackrabbit version they are
returned normally. I recreated the indexes, but to no effect.
Is the default tika-config.xml enough, or do I have to rewrite it with
other parsers and declare its path in repository.xml?
Regards
Helio.
Re: jackrabbit 2.6.0 Full Text Search
Posted by hsp <pi...@ibest.com.br>.
The dependency
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.3</version>
</dependency>
must be included in the application for the extraction to work.
Regards
Helio.
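One way to check whether the parser classes are actually visible to the application is a small classpath probe. The following is only a sketch: the class name TikaClasspathCheck and the helper isOnClasspath are made up for illustration, and the three parser class names are the ones I believe ship in tika-parsers 1.3. In a web application the relevant class loader may differ from the one used here.

```java
/**
 * Sketch: reports whether some Tika parser classes can be loaded.
 * TikaClasspathCheck and isOnClasspath are hypothetical names.
 */
final class TikaClasspathCheck {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class (or one of its own dependencies) is not available.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed to be provided by the tika-parsers artifact.
        String[] parserClasses = {
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.html.HtmlParser",
            "org.apache.tika.parser.microsoft.OfficeParser"
        };
        for (String name : parserClasses) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the classes are reported as missing, the DefaultParser in tika-config.xml presumably has no concrete parsers to delegate to, which would match the short supportedMediaTypes list seen earlier in the thread.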
Re: jackrabbit 2.6.0 Full Text Search
Posted by "orlando.p" <or...@gmail.com>.
Ard wrote
> Did you check other words than 'this'? 'This' is a stopword and gets
> filtered out while fulltext indexing / querying. You are using the
> StandardAnalyzer which uses org.apache.lucene.analysis.StopAnalyzer
> which contains a default ENGLISH_STOP_WORDS array containing 'this'
>
> Regards Ard
Yes. In fact, I added the sentence "The quick brown fox jumps over the lazy
dog." and searched for each word just now to re-test. I can find them using
2.4.3 but not 2.6.0.
regards,
Orlando
Re: jackrabbit 2.6.0 Full Text Search
Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Jun 17, 2013 at 9:11 AM, orlando.p <or...@gmail.com> wrote:
> I've tested the same query in 2.6.2 and it didn't work as well. I rolled
> back to 2.4.3 and everything worked fine. There seem to be full text search
> issue in versions 2.6.0 and 2.6.2.
Did you check other words than 'this'? 'This' is a stopword and gets
filtered out while fulltext indexing / querying. You are using the
StandardAnalyzer which uses org.apache.lucene.analysis.StopAnalyzer
which contains a default ENGLISH_STOP_WORDS array containing 'this'
Regards Ard
--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
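To illustrate the stop-word point from Ard's reply, here is a minimal Java sketch that does not use Lucene at all. The class StopWordSketch, the method indexedTerms and the simple tokenizer are hypothetical, and the word list is assumed to match Lucene's default English stop words. It only shows why a term that is filtered out at indexing time can never be matched by CONTAINS().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified illustration of stop-word filtering; not Lucene's implementation. */
final class StopWordSketch {

    // Assumption: mirrors Lucene's default English stop-word list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with"));

    /** Lowercases the text, splits on non-alphanumerics and drops stop words. */
    static List<String> indexedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexedTerms("This is the body."));
        System.out.println(indexedTerms("The quick brown fox jumps over the lazy dog."));
    }
}
```

Under these assumptions only "body" survives from "This is the body.", so a test query should use a word such as "body" or "fox" rather than "This".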
Re: jackrabbit 2.6.0 Full Text Search
Posted by "orlando.p" <or...@gmail.com>.
I've tested the same query in 2.6.2 and it didn't work either. I rolled
back to 2.4.3 and everything worked fine. There seems to be a full-text
search issue in versions 2.6.0 and 2.6.2.
Re: jackrabbit 2.6.0 Full Text Search
Posted by "orlando.p" <or...@gmail.com>.
Hi Torsten,
Yes, I have seen the link and have tested the following queries as well but
none worked.
SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This')
SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, 'This*')
SELECT * FROM [nt:resource] as resource WHERE CONTAINS(resource.*, '*This*')
My other JCR-SQL2 queries work just fine until I add the CONTAINS clause. I
don't see errors in the log file either.
When I dump the result of the query without the CONTAINS clause, I am able
to see the text being searched. Below is an example of the dumped data with
some lines omitted to reduce size.
[rt.jcr:configuration] ---> []
[rt.jcr:activity] ---> []
[rt.jcr:uuid] ---> [5183ee77-3aa9-497f-928a-4e61b2cac14c]
<<lines omitted>>
[rt.jcr:baseVersion] ---> [ee23fbe3-ec5e-4b29-bed1-24fd5a0a78c1]
[rt.jcr:isCheckedOut] ---> [true]
[rt.jcr:primaryType] ---> [rt:RuleTariff]
[rt.rt:EffectiveStartDate] ---> [20140101000000.000]
<<lines omitted>>
[rt.jcr:versionHistory] ---> [ef428650-639e-4153-9c56-26c82b5fe26b]
<<lines omitted>>
[file.jcr:createdBy] ---> [admin]
[file.jcr:created] ---> [2013-06-12T13:36:36.746-07:00]
[file.jcr:primaryType] ---> [nt:file]
[resource.jcr:lastModifiedBy] ---> [admin]
[resource.jcr:uuid] ---> [3cc13d9e-5f52-49b3-a85d-e626f691305d]
[resource.jcr:data] ---> [<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title></title>
<<lines omitted>>
</head>
<body>
<<lines omitted>>
<p class="c4">This is the body.</p>
<<lines omitted>>
</body>
</html>
]
[resource.jcr:encoding] ---> [UTF-8]
[resource.jcr:mimeType] ---> [text/html]
[resource.jcr:lastModified] ---> [2013-06-12T13:36:36.756-07:00]
regards,
Orlando
Re: jackrabbit 2.6.0 Full Text Search
Posted by Torsten Stolpmann <st...@verit.de>.
Hi Orlando,
Did you read the answer from Alexander Klimetschek here:
http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201207.mbox/%3C9ADA9A33-5AA7-49DB-B32A-AF41610E335D@adobe.com%3E
?
On 27.06.2012, at 17:19, Furst, Carl wrote:
> > So here's the sql I use:
> >
> > select * from [nt:resource] where contains([jcr:data], 'include');
>
> The full text index for binary properties is by default aggregated on
> the node itself, not on the jcr:data property. You address that with
> "*" and you need a selector (s in this case):
>
> select * from [nt:resource] as s where contains(s.*, 'include')
>
> (In the former sql1 you could simply do CONTAINS(., 'include') to
> address the node itself).
>
> See my recent mail (about xpath, but same index is used):
> http://markmail.org/message/oc6uootrpxepso4d
>
> Cheers,
> Alex
Hope this helps,
Torsten
--
Torsten Stolpmann
Geschäftsführender Gesellschafter
verit Informationssysteme GmbH
Europaallee 10
67657 Kaiserslautern
E-Mail: stolpmann@verit.de
Telefon: +49 631 520 840 00
Fax: +49 631 520 840 01
Web: http://www.verit.de/
Registergericht: Amtsgericht Kaiserslautern
Registernummer: HRB 3751
Geschäftsleitung: Claudia Könnecke, Torsten Stolpmann
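The selector form quoted from Alexander Klimetschek in this thread can be sketched as a small Java helper that builds the statement text. This is only an illustration: ContainsQuerySketch and fullTextQuery are hypothetical names, and the helper escapes only single quotes for the SQL string literal; it does not validate the full-text search expression itself.

```java
/** Sketch: builds a JCR-SQL2 full-text statement that targets the node via "selector.*". */
final class ContainsQuerySketch {

    /**
     * @param nodeType         node type name, for example nt:resource
     * @param selector         selector name used in the CONTAINS() call
     * @param searchExpression full-text search expression, passed through as-is
     */
    static String fullTextQuery(String nodeType, String selector, String searchExpression) {
        // A single quote inside a JCR-SQL2 string literal is written as two quotes.
        String escaped = searchExpression.replace("'", "''");
        return "SELECT * FROM [" + nodeType + "] AS " + selector
            + " WHERE CONTAINS(" + selector + ".*, '" + escaped + "')";
    }

    public static void main(String[] args) {
        // The resulting string would typically be passed to
        // QueryManager.createQuery(statement, Query.JCR_SQL2) from javax.jcr.query.
        System.out.println(fullTextQuery("nt:resource", "s", "include"));
    }
}
```

Using "s.*" rather than "[jcr:data]" matters because, as the quoted reply explains, the extracted binary text is aggregated on the node and not indexed under the jcr:data property.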