Posted to users@jackrabbit.apache.org by LukashP <lu...@poczta.onet.pl> on 2009/11/15 15:54:57 UTC

Extracting content from document

Hi,
It's my first post here, so please be tolerant of any mistakes :).
I'm importing a large group of Word (*.doc) files into a Jackrabbit
repository as a batch operation. I've set up Jackrabbit so that content is
extracted immediately on import (strictly speaking, when the transaction is
committed).
Most of the files are fine, and MsWordTextExtractor extracts their text
content successfully (which lets me use full-text search later).
For some of them, however, the content can't be extracted for whatever
reason. That's OK, since some of them may be in the wrong format or similar,
but I would like to know about such a problem immediately.
The trouble is that when MsWordTextExtractor is not able to extract content,
it only logs a warning (and I think that's all; the log is below, trimmed to
the significant lines). Is there any way I could learn about an extraction
failure immediately, while importing?

[15:27:50,699] [WARN ] [http-8080-3][PzuSA,demu,BRAK][MsWordTextExtractor.extractText()] Failed to extract Word text content
java.lang.ArrayIndexOutOfBoundsException: 59730
	at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:475)
...
org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
...
org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:701)
...
org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at xxxDocumentRepository.addAsImported(xxxDocumentRepository.java:288)

I would be thankful for any help.

Regards, 
Luke

-- 
View this message in context: http://n4.nabble.com/Extracting-content-from-document-tp621776p621776.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.

Re: Extracting content from document

Posted by LukashP <lu...@poczta.onet.pl>.
Quick update:
- None of these solutions turned out to be workable.
As it turned out, the classes that call the extractors (higher up the call
chain) catch everything, even Errors :/
All I got was a different warning.
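
Since anything thrown from an extractor is swallowed by its callers, one
possible workaround is to record the failure somewhere the importing code can
look after the commit, instead of relying on an exception reaching it. The
sketch below is only an illustration of that idea: ExtractionFailureLog,
record() and drain() are hypothetical names, not Jackrabbit API, and it
assumes that a custom extractor can still see the failure before the callers
swallow it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical side channel for extraction failures (not part of Jackrabbit). */
final class ExtractionFailureLog {

    // Thread-safe, in case extraction runs on a different thread than the import.
    private static final Queue<String> FAILURES = new ConcurrentLinkedQueue<>();

    private ExtractionFailureLog() {
    }

    /** Called from a custom extractor when extraction fails. */
    static void record(String description, Throwable cause) {
        FAILURES.add(description + ": " + cause);
    }

    /** Called by the importing code after commit; returns and clears all recorded failures. */
    static List<String> drain() {
        List<String> drained = new ArrayList<>();
        String entry;
        while ((entry = FAILURES.poll()) != null) {
            drained.add(entry);
        }
        return drained;
    }

    public static void main(String[] args) {
        // Simulates a caller that catches everything, as described above.
        try {
            try {
                throw new ArrayIndexOutOfBoundsException(59730);
            } catch (RuntimeException e) {
                record("application/msword", e); // record before rethrowing
                throw e;
            }
        } catch (Throwable swallowed) {
            // The upstream code would only log a warning here.
        }

        // The importer checks the side channel once the commit has returned.
        List<String> failures = drain();
        if (!failures.isEmpty()) {
            System.out.println("Extraction failed: " + failures);
        }
    }
}
```

In a batch import, drain() would be called after each commit so that failures
can be matched to the document just imported. That only holds if extraction
really runs inside the commit, as the stack trace in the original post
suggests for this setup.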

Regards





Re: Extracting content from document

Posted by LukashP <lu...@poczta.onet.pl>.
Is there a direct way? That actually is the question ;)

The problem with writing my own text extractors is that I would have to
override every single one I use. That is not a problem technically, but I
find the solution somewhat ugly ;). What is more, the javadoc here:
http://jackrabbit.apache.org/api/1.4/org/apache/jackrabbit/extractor/MsWordTextExtractor.html
says that this method should only throw an exception on transient errors.

I thought about hacking
org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor (then there
might be only one class to change), but I think that would force me to patch
the Jackrabbit jar files.

So, all in all, I think I will stick with overriding every extractor I use.

Thank you for your reply.

Regards, Luke




Re: Extracting content from document

Posted by Dave Brosius <db...@mebigfatguy.com>.
If there's no direct way...   :)

I suppose you could create your own text extractor that derives from
MsWordTextExtractor, overrides extractText, and delegates to super in a
try/catch block.

Then specify this extractor in your repository.xml file.
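
A minimal sketch of that idea, under some assumptions: so that it compiles on
its own, WordExtractorStandIn is a hypothetical stand-in for
MsWordTextExtractor, and the extractText(InputStream, String, String)
signature is what I understand the Jackrabbit 1.x TextExtractor method to
look like. StrictWordExtractor is a made-up name, and wrapping the failure in
an IOException is just one choice for what the catch block might do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

final class StrictExtractorSketch {

    /** Hypothetical stand-in for MsWordTextExtractor; fails on empty input. */
    static class WordExtractorStandIn {
        public Reader extractText(InputStream stream, String type, String encoding)
                throws IOException {
            if (stream.available() == 0) {
                throw new ArrayIndexOutOfBoundsException("simulated parse failure");
            }
            return new StringReader("extracted text");
        }
    }

    /** Overrides extractText and delegates to super inside a try/catch block. */
    static class StrictWordExtractor extends WordExtractorStandIn {
        @Override
        public Reader extractText(InputStream stream, String type, String encoding)
                throws IOException {
            try {
                return super.extractText(stream, type, encoding);
            } catch (RuntimeException e) {
                // Surface the parse failure instead of silently continuing.
                throw new IOException("Failed to extract text for type " + type, e);
            }
        }
    }

    public static void main(String[] args) {
        try {
            new StrictWordExtractor().extractText(
                    new ByteArrayInputStream(new byte[0]), "application/msword", null);
        } catch (IOException e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause() + ")");
        }
    }
}
```

One caveat: the warning in the original post is logged by
MsWordTextExtractor.extractText() itself, which suggests the stock extractor
may handle the exception internally. If it does, the catch block above would
never fire, and the subclass would need another way to detect the failure.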
