
Posted to dev@jackrabbit.apache.org by "Rojas Buitrago, Sergio" <sr...@indra.es> on 2010/12/16 13:09:01 UTC

FullText Indexing

Hello.

I'm new to Jackrabbit.

I'm trying to index the content of different types of documents (Word, PDF, XML, ...).

I've configured the SearchIndex in my workspace.xml like this:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
    <param name="supportHighlighting" value="true"/>
    <param name="textFilterClasses" value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
        org.apache.jackrabbit.extractor.MsExcelTextExtractor,
        org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
        org.apache.jackrabbit.extractor.PdfTextExtractor,
        org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
        org.apache.jackrabbit.extractor.RTFTextExtractor,
        org.apache.jackrabbit.extractor.HTMLTextExtractor,
        org.apache.jackrabbit.extractor.XMLTextExtractor"/>
</SearchIndex>


When I create a document in the repository, I add the content in this way:

contenido = nodo.addNode("jcr:content", "nt:resource");
contenido.setProperty("jcr:data", J_OperacionesSesion
        .getValueFactory().createBinary(is));

MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
String mime = mimetypes.getContentType(nodo.getName());
contenido.setProperty("jcr:mimeType", "application/pdf");
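
Note that the snippet above computes `mime` from the node name but then stores the literal "application/pdf", so every document is labelled as a PDF regardless of its real type. A minimal sketch of deriving the type from the file name instead follows. `MimeTypeGuesser` and its extension table are hypothetical and are not part of Jackrabbit or of the original code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Jackrabbit class): maps a file name to a MIME type. */
final class MimeTypeGuesser {
    private static final String FALLBACK = "application/octet-stream";

    // Small illustrative table; extend it for the formats you actually store.
    private static final Map<String, String> BY_EXTENSION = new HashMap<String, String>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
        BY_EXTENSION.put("rtf", "application/rtf");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("xml", "application/xml");
        BY_EXTENSION.put("txt", "text/plain");
    }

    /** Returns the MIME type for the file's extension, or a generic binary type. */
    static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all, or a trailing dot
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = BY_EXTENSION.get(extension);
        return type != null ? type : FALLBACK;
    }
}
```

With such a helper, the last line of the snippet could become `contenido.setProperty("jcr:mimeType", MimeTypeGuesser.guess(nodo.getName()));`, or it could simply pass the already computed `mime` variable.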

After creating the document, this warning is logged:

16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
org.apache.tika.exception.TikaException: Unable to extract PDF content
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
      at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
      at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
      at java.util.concurrent.FutureTask.run(FutureTask.java:123)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
      at java.lang.Thread.run(Thread.java:595)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
      at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
      at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
      at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
      ... 13 more
Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
      at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
      ... 16 more
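
One possible reading of this trace, offered as an assumption rather than a confirmed diagnosis: the `org.apache.pdfbox` engine is being asked to instantiate an operator class from the older `org.pdfbox` package. That would be consistent with two generations of PDFBox being on the same classpath. The sketch below lists every classpath location that provides a given class, which may help confirm or rule that out. `ClassLocations` is a hypothetical helper. The two class names in `main` are taken from the trace above.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic helper: shows where a class can be loaded from. */
final class ClassLocations {

    /** Returns every classpath location (jar or directory URL) that provides the named class. */
    static List<String> find(String className) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        List<String> locations = new ArrayList<String>();
        Enumeration<URL> urls = loader.getResources(resource);
        while (urls.hasMoreElements()) {
            locations.add(urls.nextElement().toString());
        }
        return locations;
    }

    public static void main(String[] args) throws IOException {
        // Class names copied from the stack trace; an empty list means "not on the classpath".
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine"
        };
        for (String name : names) {
            System.out.println(name + " -> " + find(name));
        }
    }
}
```

If both names resolve to different jars, removing the older jar from the web application or repository classpath would be the first thing to try. Run the check with the same classpath as the repository, since a standalone run may see different jars.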

Later, when I search for the document, filtering by content, in this way:

String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS(documento.*, 'ubicacion')";

(arch:documento extends nt:file.)

No documents were found.
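
For reference, a minimal sketch of building the same JCR-SQL2 statement from variables rather than a hand-written string. `FullTextQuery` is a hypothetical helper. It assumes that a single quote inside a JCR-SQL2 string literal is escaped by doubling it, and it does not handle the special characters of the full-text search expression itself.

```java
/** Hypothetical helper: builds a JCR-SQL2 full-text query string. */
final class FullTextQuery {

    /**
     * Builds "SELECT * FROM [nodeType] AS selector WHERE CONTAINS(selector.*, 'term')".
     * Assumption: embedded single quotes are escaped by doubling them.
     */
    static String build(String nodeType, String selector, String term) {
        String escaped = term.replace("'", "''");
        return "SELECT * FROM [" + nodeType + "] AS " + selector
                + " WHERE CONTAINS(" + selector + ".*, '" + escaped + "')";
    }
}
```

The resulting string would then be passed to `QueryManager.createQuery(statement, Query.JCR_SQL2)`. Note that the query can only match text that was actually extracted, so while the extraction warning above persists an empty result is expected.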


Can you help me, please?

Thanks and regards.

________________________________
This email and any file attached to it (when applicable) contain(s) confidential information that is exclusively addressed to its recipient(s). If you are not the indicated recipient, you are informed that reading, using, disseminating and/or copying it without authorisation is forbidden in accordance with the legislation in effect. If you have received this email by mistake, please immediately notify the sender of the situation by resending it to their email address.
Avoid printing this message if it is not absolutely necessary.

RE: FullText Indexing

Posted by "Rojas Buitrago, Sergio" <sr...@indra.es>.

It doesn't occur only with this PDF. It occurs with other .pdf and .doc documents too.

In the case of a .doc document, for example, the following warning is logged:

16.12.2010 15:24:10 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
java.lang.NoSuchMethodError: org.apache.poi.hwpf.extractor.WordExtractor.getFootnoteText()[Ljava/lang/String;
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:95)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
        at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
16.12.2010 15:24:14 *INFO * IndexMerger: merged 37 documents in 265 ms into _l. (IndexMerger.java, line 533)

If I execute a search, it doesn't return any matches.
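
A `NoSuchMethodError` usually means that the class loaded at run time is a different version from the one the caller was compiled against. Here that would suggest, as an assumption, that the POI jar on the classpath is not the version Tika expects. The sketch below reports whether a public no-argument method exists on a class and where that class was loaded from. `MethodProbe` is a hypothetical helper. The class and method names in the usage note come from the warning above.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper: checks for a method and reports the class's origin. */
final class MethodProbe {

    /** Describes whether className has a public no-arg method called methodName. */
    static String describe(String className, String methodName) {
        Class<?> type;
        try {
            type = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return "class not found";
        }
        // Classes loaded by the bootstrap loader have no code source.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String where = (source == null || source.getLocation() == null)
                ? "bootstrap/unknown"
                : source.getLocation().toString();
        try {
            type.getMethod(methodName);
            return "present (loaded from " + where + ")";
        } catch (NoSuchMethodException e) {
            return "missing (loaded from " + where + ")";
        }
    }
}
```

Calling `MethodProbe.describe("org.apache.poi.hwpf.extractor.WordExtractor", "getFootnoteText")` inside the application would show which jar supplies `WordExtractor` and whether it has the method that `OfficeParser` calls.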

Regards

-----Original Message-----
From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
Sent: Thursday, 16 December 2010 14:09
To: dev@jackrabbit.apache.org
Subject: Re: FullText Indexing

PS: please use users@jackrabbit.apache.org for non-dev issues.

Regards Ard

On Thu, Dec 16, 2010 at 2:08 PM, Ard Schrijvers
<a....@onehippo.com> wrote:
> Hello,
>
> This seems to me to be a PDFBox issue. What happens if you try a different PDF?
> If other PDFs work and only a single one fails, it would be better to post
> the question to one of the PDFBox mailing lists:
> http://pdfbox.apache.org/mail-lists.html
>
> Regards Ard

--
Hippo
Europe  *  Amsterdam  Oosteinde 11  *  1017 WT Amsterdam  *  +31 (0)20 522 4466
USA  * San Francisco 755 Baywood Drive, Second Floor *  Petaluma, CA.
94954 *  +1 877 414 4776 (toll free)
Canada    *   Montréal  5369 Boulevard St-Laurent #430 *  Montréal QC
H2T 1S5  *  +1 (514) 316 8966
www.onehippo.com  *  www.onehippo.org  *  info@onehippo.com
________________________________________________________________
This e-mail may be privileged and/or confidential, and the sender does
not waive any related rights and obligations. Any distribution, use or
copying of this e-mail or the information it contains by other than an
intended recipient is unauthorized. If you received this e-mail in
error, please advise me (by return e-mail or otherwise) immediately.


RE: FullText Indexing

Posted by "Rojas Buitrago, Sergio" <sr...@indra.es>.

Oh, excuse me.

It was a mistake.

Regards

Sergio Rojas Buitrago
Software Development
Document Management

Ronda de Toledo s/n
13003. Ciudad Real
Spain
T +34 926 27 08 49
Ext: 237849

srojas@indra.es
www.indra.es


Re: FullText Indexing

Posted by Ard Schrijvers <a....@onehippo.com>.

PS: please use users@jackrabbit.apache.org for non-dev issues.

Regards Ard

On Thu, Dec 16, 2010 at 2:08 PM, Ard Schrijvers
<a....@onehippo.com> wrote:
> Hello,
>
> This seems to me to be a PDFBox issue. What happens if you try a different PDF?
> If other PDFs work and only a single one fails, it would be better to post
> the question to one of the PDFBox mailing lists:
> http://pdfbox.apache.org/mail-lists.html
>
> Regards Ard
>
> On Thu, Dec 16, 2010 at 1:09 PM, Rojas Buitrago, Sergio <sr...@indra.es> wrote:
>> Hello.
>>
>>
>>
>> I’m a newbie in Jackrabbit.
>>
>>
>>
>> I’m trying to index some content of different types of documents (word, pdf,
>> xml, …).
>>
>>
>>
>> I’ve configured the searchIndex in my workspace.xml in this way:
>>
>>
>>
>> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>>
>>             <param name="path" value="${wsp.home}/index"/>
>>
>>             <param name="supportHighlighting" value="true"/>
>>
>>                                                <param
>> name="textFilterClasses"
>> value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
>>
>>
>>    org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>>
>>
>>    org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>>
>>
>>    org.apache.jackrabbit.extractor.PdfTextExtractor,
>>
>>
>>    org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>>
>>
>>    org.apache.jackrabbit.extractor.RTFTextExtractor,
>>
>>
>>                    org.apache.jackrabbit.extractor.HTMLTextExtractor,
>>
>>
>>    org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>>
>>         </SearchIndex>
>>
>>
>>
>>
>>
>> When I create a document in the repository, I add the content in this way:
>>
>>
>>
>> contenido = nodo.addNode("jcr:content", "nt:resource");
>>
>>                   contenido.setProperty("jcr:data", J_OperacionesSesion
>>
>>                              .getValueFactory().createBinary(is));
>>
>>
>>
>>                   MimetypesFileTypeMap mimetypes = new
>> MimetypesFileTypeMap();
>>
>>                   String mime = mimetypes.getContentType(nodo.getName());
>>
>>                   contenido.setProperty("jcr:mimeType", "application/pdf");
>>
>>
>>
>> Afer creating the document, this warning is thrown:
>>
>>
>>
>> 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text
>> from a binary property (LazyTextExtractorField.java, line 180)
>>
>> org.apache.tika.exception.TikaException: Unable to extract PDF content
>>
>>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>>
>>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>>
>>       at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>>
>>       at
>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>>
>>       at
>> org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>>
>>       at
>> org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>>
>>       at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
>>
>>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
>>
>>       at java.util.concurrent.FutureTask.run(FutureTask.java:123)
>>
>>       at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
>>
>>       at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
>>
>>       at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
>>
>>       at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
>>
>>       at java.lang.Thread.run(Thread.java:595)
>>
>> Caused by: org.apache.pdfbox.exceptions.WrappedIOException:
>> OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be
>> instantiated
>>
>>       at
>> org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>>
>>       at
>> org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>>
>>       at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>>
>>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>>
>>       ... 13 more
>>
>> Caused by: java.lang.ClassCastException:
>> org.pdfbox.util.operator.ShowTextGlyph
>>
>>       at
>> org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>>
>>       ... 16 more
>>
>>
>>
>> Later, when I search for the document, filtering by content, in this way:
>>
>>
>>
>> String consulta = "SELECT * FROM [arch:documento] AS documento WHERE
>> CONTAINS ( documento.*, 'ubicacion')"; (arch:document extends from nt:file)
>>
>>
>>
>> No documents were found.
>>
>>
>>
>>
>>
>> Can you help me please??.
>>
>>
>>
>>
>>
>> Thanks and regards.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>



-- 
Hippo
Europe  •  Amsterdam  Oosteinde 11  •  1017 WT Amsterdam  •  +31 (0)20 522 4466
USA  • San Francisco 755 Baywood Drive, Second Floor •  Petaluma, CA.
94954 •  +1 877 414 4776 (toll free)
Canada    •   Montréal  5369 Boulevard St-Laurent #430 •  Montréal QC
H2T 1S5  •  +1 (514) 316 8966
www.onehippo.com  •  www.onehippo.org  •  info@onehippo.com
________________________________________________________________
This e-mail may be privileged and/or confidential, and the sender does
not waive any related rights and obligations. Any distribution, use or
copying of this e-mail or the information it contains by other than an
intended recipient is unauthorized. If you received this e-mail in
error, please advise me (by return e-mail or otherwise) immediately.

Re: FullText Indexing

Posted by Ard Schrijvers <a....@onehippo.com>.

Hello,

This looks like a PDFBox issue to me. What happens if you try a different PDF?
If other PDFs work and only this one fails, you are better off posting
the question to one of the PDFBox mailing lists:
http://pdfbox.apache.org/mail-lists.html
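
A possible first check, offered as a sketch rather than a diagnosis: the
trace in the original message names both org.pdfbox.util.operator.ShowTextGlyph
and org.apache.pdfbox.util.PDFStreamEngine, so it may be worth seeing which
JAR each of those classes is loaded from. The class name ClassOrigin and the
helper locate() are made up for illustration; only standard JDK calls are used.

```java
import java.security.CodeSource;

// Hypothetical helper: reports where a class is loaded from.
class ClassOrigin {

    // Returns the JAR or directory a class comes from, or a short note
    // when the class is missing or has no code source.
    static String locate(String className) {
        try {
            // Load without running static initializers.
            Class<?> c = Class.forName(className, false,
                    ClassOrigin.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                // Typical for classes that ship with the JDK itself.
                return "(no code source, probably a JDK class)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on the classpath)";
        }
    }

    public static void main(String[] args) {
        // Both names are taken from the stack trace in the original message.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

The result depends on the class loader that runs it, so inside a web
application the same check would have to run from within that application
(for example from a servlet) to reflect what Jackrabbit actually sees.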

Regards Ard
