Posted to users@jackrabbit.apache.org by anjan <po...@gmail.com> on 2013/06/17 07:38:55 UTC

Full text indexing under OSGi environment (Sling) is not working

Hi, I built Apache Sling successfully (after checking out the latest stable
version from Jenkins) and deployed it in Tomcat.  I then connected to Sling
over WebDAV and added a few documents (PDF, Word, text files, etc.), but
full text indexing is not happening.  I confirmed this with the Luke tool:
only metadata (created by, MIME type) is getting indexed.  As far as I can
see, the Sling build uses Jackrabbit 2.4.2 as its embedded repository.  So I
tried to reproduce the problem by downloading the standalone Jackrabbit
2.4.2 jar, running it, connecting to it via WebDAV and adding a few
documents.  There, full text indexing works perfectly fine (confirmed by
inspecting the indexes with Luke).

When I use an earlier version of Sling (the Sling 6 war file), full text
indexing works fine.  Sling 6, though, uses Apache Tika 0.6 (I believe
Jackrabbit uses Tika internally for metadata and text extraction).  Also, in
Sling 6 all of Tika (Core and Parsers) is packaged as a single OSGi bundle,
whereas the latest Sling build uses Tika 1.0 and deploys Tika Core and Tika
Parsers as separate OSGi bundles.  I did a lot of debugging without much
success.

I am not sure whether this issue is related to Jackrabbit being deployed in
an OSGi environment.  I have already raised it on the Sling mailing list,
but wanted to post here as well to get the experts' opinion.  Please advise.
Search is an important feature and it is not working.



--
View this message in context: http://jackrabbit.510166.n4.nabble.com/Full-text-indexing-under-OSGi-environment-Sling-is-not-working-tp4658882.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.

Re: Full text indexing under OSGi environment (Sling) is not working

Posted by anjan <po...@gmail.com>.
After I posted the above comment, I changed the Tika dependency from version
1.0 to 1.2 and rebuilt Sling.  After deploying the rebuilt Sling in Tomcat,
full text indexing works fine.  I tested with pdf, doc, docx and xlsx files,
and all of them are getting indexed.

I hope this version change will not have any impact on other areas.




Re: Full text indexing under OSGi environment (Sling) is not working

Posted by anjan <po...@gmail.com>.
I forgot to mention that the Sling build uses version 1.0 of both the Apache
Tika core (org.apache.tika.core) and the Apache Tika OSGi bundle
(org.apache.tika.bundle) bundles.




Re: Full text indexing under OSGi environment (Sling) is not working

Posted by anjan <po...@gmail.com>.
When I add a document (say, a PDF file) to Sling via WebDAV, the
org.apache.tika.parser.AutoDetectParser.parse method is called first.
Below is a snippet of code from the parse method.
********************************
// Automatically detect the MIME type of the document
MediaType type = detector.detect(tis, metadata);
********************************
The call above tries to determine the content type, taking 'metadata' as one
of its inputs.  I verified that 'metadata' is correctly set to
'Content-Type=application/pdf', but the detector.detect method still returns
"application/octet-stream".

I found that 'detector' is an instance of
org.apache.tika.detect.CompositeDetector whose "detectors" field is
initialized to an *empty* list, so "application/octet-stream" is returned as
the default value.

Since the returned type is always "application/octet-stream", no Tika parser
is called; org.apache.tika.parser.EmptyParser is invoked instead.
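
To make the failure mode above easier to follow, here is a minimal
stand-alone sketch of it.  It does not use Tika and is not Tika's actual
code: the DetectorSketch class, its Detector interface, the selectParser
helper and the "PDFParser" name are all hypothetical and only mimic the
behaviour I saw in the debugger (an empty detector list gives the default
type, and the default type falls through to an empty parser).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the behaviour described above; NOT Tika's code.
final class DetectorSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Hypothetical detector contract: return a MIME type,
    // or OCTET_STREAM when the type is unknown.
    interface Detector {
        String detect(String declaredContentType);
    }

    // Ask each registered detector in turn and keep the last non-default
    // answer. With an empty list the loop body never runs, so the default
    // type is returned regardless of the declared content type.
    static String detect(List<Detector> detectors, String declaredContentType) {
        String type = OCTET_STREAM;
        for (Detector d : detectors) {
            String candidate = d.detect(declaredContentType);
            if (!OCTET_STREAM.equals(candidate)) {
                type = candidate;
            }
        }
        return type;
    }

    // Pick a parser name by MIME type, falling back to a parser that
    // extracts no text (the role EmptyParser plays in the thread).
    static String selectParser(Map<String, String> parsers, String type) {
        return parsers.getOrDefault(type, "EmptyParser");
    }

    public static void main(String[] args) {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/pdf", "PDFParser"); // illustrative name only

        // Case 1: no detectors registered, as observed in the Sling build.
        List<Detector> detectors = new ArrayList<>();
        String type = detect(detectors, "application/pdf");
        System.out.println(type + " -> " + selectParser(parsers, type));

        // Case 2: one detector that trusts the declared Content-Type.
        detectors.add(declared -> declared != null ? declared : OCTET_STREAM);
        type = detect(detectors, "application/pdf");
        System.out.println(type + " -> " + selectParser(parsers, type));
    }
}
```

Running it prints "application/octet-stream -> EmptyParser" for the empty
list and "application/pdf -> PDFParser" once a detector is registered, which
matches what I see in Sling versus standalone Jackrabbit.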




Re: Full text indexing under OSGi environment (Sling) is not working

Posted by anjan <po...@gmail.com>.
Upon further debugging (in Eclipse debug mode), I see that the parse methods
of the classes below (from Tika Parsers) are called when I add documents
(txt and doc files respectively) to Jackrabbit (the war file deployed to
Tomcat) via WebDAV:
org.apache.tika.parser.txt.TXTParser
org.apache.tika.parser.microsoft.OfficeParser

But when I add the same documents to Sling via WebDAV, the above methods are
not called.  What could be the issue?



