Posted to users@jackrabbit.apache.org by taha ben salah <ta...@gmail.com> on 2010/07/22 13:13:03 UTC

How to know if a document is well indexed or not

Hi,
I found that some documents fail to be indexed in Lucene.
In particular, some Office 2003 documents cannot be parsed by the Tika
Office parser.
The stack trace is at the end of this message.
Is there a way to catch that exception? Indexing is done in an
asynchronous thread, and the error is only written to the log.
It would be even better if a public API exposed the indexing status of
each document (indexed / not yet indexed / failed to index).
Any suggestions are very welcome.
Thanks in advance.
Taha



org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@ced1ac
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:122)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
        at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
Caused by: org.apache.poi.hpsf.HPSFRuntimeException: Value type of property ID 1 is not VT_I2 but 2048.
        at org.apache.poi.hpsf.Section.<init>(Section.java:262)
        at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:452)
        at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:247)
        at org.apache.tika.parser.microsoft.OfficeParser.parseSummaryEntryIfExists(OfficeParser.java:148)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:71)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
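
The trace shows the extraction running as a task on a thread pool, so a
failure never reaches the thread that saved the document. As a rough
illustration of the indexed / not yet / failed status asked about above,
here is a minimal sketch using only java.util.concurrent. ExtractionTracker,
its Status enum and its methods are hypothetical names invented for this
example; they are not part of the Jackrabbit or Tika API, and the extraction
itself is stood in for by an arbitrary Callable.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Jackrabbit or Tika class). */
class ExtractionTracker {
    enum Status { PENDING, INDEXED, FAILED }

    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final Map<String, Exception> errors = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Runs the extraction in the background and records how it ended. */
    void submit(String documentId, Callable<String> extraction) {
        statuses.put(documentId, Status.PENDING);
        executor.execute(() -> {
            try {
                extraction.call();
                statuses.put(documentId, Status.INDEXED);
            } catch (Exception e) {
                // Keep the exception so callers can inspect it later,
                // instead of it only appearing in a log file.
                errors.put(documentId, e);
                statuses.put(documentId, Status.FAILED);
            }
        });
    }

    /** Returns null for a document that was never submitted. */
    Status status(String documentId) {
        return statuses.get(documentId);
    }

    /** Returns the recorded failure, or null if there is none. */
    Exception error(String documentId) {
        return errors.get(documentId);
    }

    /** Stops accepting work and waits for queued tasks; false on timeout or interrupt. */
    boolean shutdownAndWait(long seconds) {
        executor.shutdown();
        try {
            return executor.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would submit each document under some identifier and poll status()
later. Wiring something like this into the actual indexing thread would
still need a change inside Jackrabbit itself.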

Re: How to know if a document is well indexed or not

Posted by Paco Avila <mo...@gmail.com>.

I'm also interested in a solution. A source code modification is probably
needed, so I will have to dig into the source code to find a reasonable
solution. The main problem is that the text extractor does not know the
source file name, which it would need in order to write entries to a
possible text_extraction_error.log file :(
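
One possible shape for such a modification, sketched with plain Java only:
wrap the extraction task together with the path of its source, so that any
failure carries the name along. NamedExtraction is a hypothetical class made
up for this illustration, not an existing Jackrabbit class, and the sketch
assumes the path is available at the point where the task is created.

```java
import java.util.concurrent.Callable;

/** Hypothetical wrapper (not a Jackrabbit class): pairs a task with its source path. */
class NamedExtraction implements Callable<String> {
    private final String sourcePath;
    private final Callable<String> delegate;

    NamedExtraction(String sourcePath, Callable<String> delegate) {
        this.sourcePath = sourcePath;
        this.delegate = delegate;
    }

    @Override
    public String call() throws Exception {
        try {
            return delegate.call();
        } catch (Exception e) {
            // Rethrow with the path in the message and the original failure
            // as the cause, so whatever logs it can name the document.
            throw new Exception("Text extraction failed for " + sourcePath, e);
        }
    }
}
```

Whoever catches the exception could then write its message to a dedicated
error log while the original parser failure stays attached as the cause.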




-- 
OpenKM
http://www.openkm.com
http://www.guia-ubuntu.org