Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/04/04 17:01:34 UTC

Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1

After I downloaded and replaced the following jars, I no longer have a 
character encoding problem:
pdfbox-1.5.0.jar
fontbox-1.5.0.jar
jempbox-1.5.0.jar
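
For anyone repeating the swap, here is a minimal Python sketch of it, not a
tested tool. It assumes the jars live in a single lib directory and follow
the usual <artifact>-<version>.jar naming; the function name replace_jars
and the regular expression are made up for illustration. Older versions of
the same artifact are deleted so that two versions never sit on the
classpath together.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: "<artifact>-<version>.jar", version starting with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def replace_jars(lib_dir: Path, new_jars: list) -> list:
    """Copy new_jars into lib_dir and delete other versions of the same artifacts.

    Returns the sorted names of the jars that were removed.
    """
    removed = []
    for new_jar in new_jars:
        match = JAR_NAME.match(new_jar.name)
        if match is None:
            raise ValueError(f"cannot parse jar name: {new_jar.name}")
        artifact = match.group("artifact")
        for old_jar in lib_dir.glob(f"{artifact}-*.jar"):
            old = JAR_NAME.match(old_jar.name)
            # Compare the parsed artifact so "poi-3.7.jar" never matches
            # "poi-ooxml-3.7.jar" by accident.
            same_artifact = old is not None and old.group("artifact") == artifact
            if same_artifact and old_jar.name != new_jar.name:
                old_jar.unlink()
                removed.append(old_jar.name)
        shutil.copy2(new_jar, lib_dir / new_jar.name)
    return sorted(removed)
```

Calling replace_jars(Path("lib"), [Path("downloads/pdfbox-1.5.0.jar"), ...])
with the three jars above would drop the 1.3.1 versions and leave the rest
of the directory alone; both paths are placeholders.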

Erlend

On 31.03.11 14.35, Karl Wright wrote:
> It might be worth cross-posting this to the Tika user or dev list.
> Jukka Zitting is one of the principal Tika developers and he's also a
> committer for MCF, but I'm not sure he'll notice it go by otherwise.
>
> In case you're wondering how to update the MCF FAQ, it's in the Wiki
> so all you need to do is sign up and you'll be able to update it.
> https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ
>
> Karl
>
> On Thu, Mar 31, 2011 at 6:59 AM, Erlend Garåsen<e....@usit.uio.no>  wrote:
>>
>> Oh, there's more, unfortunately. Some of the Tika dependencies need to be
>> updated further. I couldn't parse the date from PDF documents correctly. I'm
>> not quite sure which of the extraction libraries is causing this problem
>> (probably pdfbox). Anyway, I can now extract content from the following
>> document formats without any problems:
>> - HTML
>> - RTF
>> - DOC
>> - DOCX
>> - ODT
>> - XLSX
>> - XLS
>> - SXW
>> - PDF
>>
>> I'm using the following jars:
>> apache-solr-cell-1.4.2-dev.jar
>> geronimo-stax-api_1.0_spec-1.0.1.jar
>> poi-scratchpad-3.7.jar
>> asm-3.1.jar
>> icu4j-4_6.jar
>> rome-0.9.jar
>> bcmail-jdk15-1.45.jar
>> jempbox-1.3.1.jar
>> tagsoup-1.2.jar
>> bcprov-jdk15-1.45.jar
>> metadata-extractor-2.4.0-beta-1.jar
>> tika-core-0.8.jar
>> boilerpipe-1.1.0.jar
>> netcdf-4.2.jar
>> tika-parsers-0.8.jar
>> commons-compress-1.1.jar
>> pdfbox-1.3.1.jar
>> commons-logging-1.1.1.jar
>> poi-3.7.jar
>> xercesImpl-2.8.1.jar
>> dom4j-1.6.1.jar
>> poi-ooxml-3.7.jar
>> xml-apis-1.0.b2.jar
>> fontbox-1.3.1.jar
>> poi-ooxml-schemas-3.7.jar
>> xmlbeans-2.3.0.jar
>>
>> But I still have some problems with PDF documents[1]. I'm not sure whether
>> it is a pdfbox bug, but Norwegian characters like æ, ø and å are not
>> displayed correctly after Solr has indexed the document. Each of them is
>> replaced by a question mark.
>>
>> [1] http://ridder.uio.no/dokument.pdf
>>
>> Erlend
>>
>> On 30.03.11 18.09, Karl Wright wrote:
>>>
>>> Certainly it makes sense to start with the FAQ, especially for places
>>> where you are tripping over known bugs.  We can always do a site page
>>> later.
>>>
>>> Thanks!
>>> Karl
>>>
>>> On Wed, Mar 30, 2011 at 12:07 PM, Erlend Garåsen
>>> <e....@usit.uio.no>    wrote:
>>>>
>>>> On 30.03.11 18.00, Karl Wright wrote:
>>>>>
>>>>> It would be great if this information went at least into the FAQ, and
>>>>> even better if we added a page to the site documentation.  I'm
>>>>> thinking maybe a whole page titled "Integrating with Solr", which
>>>>> would walk you through the process and the pitfalls.  What do you
>>>>> think?
>>>>
>>>> Yes, I think so.
>>>>
>>>> The next version of Solr will probably be released soon, and then it will be
>>>> much easier to integrate Solr. Maybe it is sufficient to add the information
>>>> into the FAQ since the problem mentioned only affects 1.4.1?
>>>>
>>>> Erlend
>>>>
>>>> --
>>>> Erlend Garåsen
>>>> Center for Information Technology Services
>>>> University of Oslo
>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>
>>
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1

Posted by Erlend Garåsen <e....@usit.uio.no>.

On 05.04.11 16.30, Karl Wright wrote:
> That's great news!  There was also some rumor that the extracting
> update request handler had been moved to contrib, and thus you had to
> do special stuff in order to get ManifoldCF to work with Solr.  I'm
> hoping that was just rumor, though - any comments?

The extracting request handler, or Solr Cell, is in the contrib folder 
in Solr 3.1, and it was there in Solr 1.4.1 as well. And yes, you have 
to do a few things unless you run the example (Jetty) version of Solr:

1. Configure a lib directory in <solr.home>
2. Make sure that this lib directory will be read (edit solrconfig.xml)
3. Copy all jars from contrib/extraction/lib/ to lib
4. Copy dist/apache-solr-cell-3.1.0.jar to lib (Solr Cell is not 
included in solr.war).

That's it!
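
Steps 1, 3 and 4 can be scripted. The following is a minimal Python sketch
under a few assumptions: solr_dist is the unpacked Solr distribution (the
directory that contains contrib/ and dist/), solr_home is <solr.home>, and
the function name install_solr_cell is made up for illustration. Step 2 is
not covered, since the script does not touch solrconfig.xml.

```python
import shutil
from pathlib import Path


def install_solr_cell(solr_dist: Path, solr_home: Path) -> list:
    """Copy the extraction jars and the Solr Cell jar into <solr.home>/lib.

    Returns the names of the copied jars, in copy order.
    """
    # Step 1: make sure <solr.home>/lib exists.
    lib_dir = solr_home / "lib"
    lib_dir.mkdir(parents=True, exist_ok=True)

    # Step 3: Tika and its dependencies from contrib/extraction/lib/.
    sources = sorted((solr_dist / "contrib" / "extraction" / "lib").glob("*.jar"))
    # Step 4: the Solr Cell jar itself, which is not part of solr.war.
    sources += sorted((solr_dist / "dist").glob("apache-solr-cell-*.jar"))

    for jar in sources:
        shutil.copy2(jar, lib_dir / jar.name)
    return [jar.name for jar in sources]
```

A call would look like install_solr_cell(Path("/opt/apache-solr-3.1.0"),
Path("/var/solr")), where both paths are placeholders for your own layout.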

I tested this on Resin 4.0.15 today and managed to parse a lot of 
different document formats without any problems.

Erlend
-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1

Posted by Karl Wright <da...@gmail.com>.

That's great news!  There was also some rumor that the extracting
update request handler had been moved to contrib, and thus you had to
do special stuff in order to get ManifoldCF to work with Solr.  I'm
hoping that was just rumor, though - any comments?

Karl

On Tue, Apr 5, 2011 at 8:59 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
> On 04.04.11 17.32, Erlend Garåsen wrote:
>
>> It will probably work out of the box.
>
> Yes, Solr 3.1 is the version we have been waiting for. It works almost
> out of the box. The only remaining problem is the encoding issue with
> Norwegian characters in some PDF documents that I mentioned earlier.
>
> I will open a Jira ticket (Solr) about this, because I think they should
> upgrade the version of PDFBox to 1.5.0.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>

Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1

Posted by Erlend Garåsen <e....@usit.uio.no>.

On 04.04.11 17.32, Erlend Garåsen wrote:

> It will probably work out of the box.

Yes, Solr 3.1 is the version we have been waiting for. It works almost 
out of the box. The only remaining problem is the encoding issue with 
Norwegian characters in some PDF documents that I mentioned earlier.
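
To illustrate the symptom: the short Python sketch below shows one generic
way æ, ø and å can end up as question marks, namely a charset that cannot
represent them somewhere in the pipeline. It is only an illustration. This
thread does not establish where the characters are lost in the PDF
extraction, and the helper name round_trip is made up.

```python
def round_trip(text: str, charset: str) -> str:
    """Encode text with charset and decode it again.

    Characters the charset cannot represent are replaced by "?".
    """
    return text.encode(charset, errors="replace").decode(charset)


# ASCII has no æ, ø or å, so each one becomes a question mark.
print(round_trip("blåbærsyltetøy", "ascii"))   # bl?b?rsyltet?y
# UTF-8 can represent every character, so nothing is lost.
print(round_trip("blåbærsyltetøy", "utf-8"))   # blåbærsyltetøy
```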

I will open a Jira ticket (Solr) about this, because I think they should 
upgrade the version of PDFBox to 1.5.0.

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1

Posted by Erlend Garåsen <e....@usit.uio.no>.

Thanks for the information about the latest Solr release. I will start 
looking at it tomorrow.

It will probably work out of the box.

Erlend


On 04.04.11 17.26, Karl Wright wrote:
> Good to know that it can be made to work. ;-)
>
> We should probably look at Lucene/Solr 3.1, which was just released,
> and is the next Solr version after 1.4.1, and see whether anything
> special is needed there.
>
> Karl
>
>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1

Posted by Karl Wright <da...@gmail.com>.

Good to know that it can be made to work. ;-)

We should probably look at Lucene/Solr 3.1, which was just released,
and is the next Solr version after 1.4.1, and see whether anything
special is needed there.

Karl

