Posted to oak-dev@jackrabbit.apache.org by Bertrand Delacretaz <bd...@apache.org> on 2014/05/21 17:28:48 UTC

My repository is not indexing PDFs, what am I missing?

Hi,

I'm upgrading the OakSlingRepositoryManager used for Sling tests to
Oak 1.0, and it no longer indexes PDFs, although it did with Oak 0.8.

After uploading a text file to /tmp, the
/jcr:root/foo//*[jcr:contains(.,'some word')] query finds it, but the
same doesn't work with a PDF.
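
For anyone reproducing this, the query above can be assembled for an arbitrary path and search text. What follows is only a minimal sketch: FulltextXPath is a hypothetical helper, not part of Oak or Sling. It assumes that a single quote inside an XPath string literal is escaped by doubling it, and it does not handle node names that need ISO 9075 encoding or the special characters of the jcr:contains search syntax.

```java
// Hypothetical helper, not part of Oak or Sling: builds the XPath statement
// used in this thread for a given absolute path and search text.
final class FulltextXPath {

    private FulltextXPath() {
    }

    /** Doubles single quotes so the text can sit inside an XPath string literal. */
    static String escapeLiteral(String text) {
        return text.replace("'", "''");
    }

    /** Returns /jcr:root<path>//*[jcr:contains(.,'<text>')]. */
    static String build(String absPath, String text) {
        if (absPath == null || !absPath.startsWith("/")) {
            throw new IllegalArgumentException("path must be absolute: " + absPath);
        }
        // Drop a trailing slash so "/" does not produce "/jcr:root///*".
        String path = absPath.endsWith("/")
                ? absPath.substring(0, absPath.length() - 1)
                : absPath;
        return "/jcr:root" + path + "//*[jcr:contains(.,'" + escapeLiteral(text) + "')]";
    }

    public static void main(String[] args) {
        // Prints the statement for the path and text used in this thread.
        System.out.println(build("/foo", "some word"));
    }
}
```

The resulting string could then be passed to the JCR QueryManager as an XPath statement; that step needs a live repository session and is left out here.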

My repository setup is in the OakSlingRepositoryManager [1] - am I
missing something in there?

-Bertrand

[1] https://svn.apache.org/repos/asf/sling/trunk/bundles/jcr/oak-server/src/main/java/org/apache/sling/oak/server/OakSlingRepositoryManager.java

Re: My repository is not indexing PDFs, what am I missing?

Posted by Alex Parvulescu <al...@gmail.com>.

Hi Bertrand,

Don't you also have to add the Tika dependencies (tika-core and
tika-parsers) to the pom.xml?

best,
alex
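
One quick way to check the point above is to test whether the Tika classes can be loaded at all. This is a minimal sketch, not something taken from the Sling or Oak code: TikaClasspathCheck is a hypothetical name, and the two class names are assumptions about where Tika 1.x keeps its auto-detecting parser (tika-core) and its PDF parser (tika-parsers). It only consults the class loader of the class that runs it, so inside an OSGi container the result depends on the bundle it is called from.

```java
// Hypothetical diagnostic: reports whether Tika classes are visible to this
// class's class loader. It uses only the JDK, so it compiles without Tika.
final class TikaClasspathCheck {

    private TikaClasspathCheck() {
    }

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names: the first from tika-core, the second from tika-parsers.
        String[] names = {
            "org.apache.tika.parser.AutoDetectParser",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isPresent(name) ? "found" : "MISSING"));
        }
    }
}
```

If the second class is reported missing while the first is found, the PDF parser is probably not available to the code doing the text extraction.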



Re: My repository is not indexing PDFs, what am I missing?

Posted by Bertrand Delacretaz <bd...@apache.org>.

Hi Chetan,

On Thu, May 22, 2014 at 6:52 AM, Chetan Mehrotra
<ch...@gmail.com> wrote:
> ...This might be due to OAK-1462. We had to disable the
> LuceneIndexProvider from getting registered as an OSGi service...

Would that mean that the LuceneIndexEditor is still called, but the
result isn't used?

I'm asking because when adding a PDF, LuceneIndexEditor.addOrUpdate
does call context.getWriter().updateDocument with a Document that does
contain the PDF's full text in a field named :fulltext, so the text
extraction is working (thanks Alex for the tika-parsers hint).

But the query mentioned earlier in this thread still finds only .txt
documents, not .pdf.

Adding a .txt also causes LuceneIndexEditor.addOrUpdate to call
context.getWriter().updateDocument, but maybe the text is also indexed
in another way?

-Bertrand

Re: My repository is not indexing PDFs, what am I missing?

Posted by Chetan Mehrotra <ch...@gmail.com>.

Hi Bertrand,

This might be due to OAK-1462. We had to stop the
LuceneIndexProvider from being registered as an OSGi service, to
handle the case where it was registered twice (once by default and
once for the Aggregate case). I will try to resolve this by next week,
and then it should work fine.

Chetan Mehrotra
