Posted to solr-user@lucene.apache.org by ashokc <as...@qualcomm.com> on 2009/07/29 00:14:58 UTC

Indexing TIKA extracted text. Are there some issues?

I am finding that the search results based on indexing Tika extracted text
are very different from results based on indexing the text extracted via
other means. This shows up, for example, with a Chinese web site that I am
trying to index.

I created the documents (for posting to Solr) in two ways. The source text
of the web pages is full of HTML entities like &#12345;, with some English
characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The
resulting content field looks like

<field name="content_china">Who We Are &#20844;&#21496;&#21382;&#21490;
&#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
&#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376; Innovation
&#21019; etc...     </field>

I posted these documents to a SOLR instance
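
A minimal sketch of why document (a) survives posting regardless of byte encoding: the numeric character references are plain ASCII, and they only become Chinese characters once a parser resolves them. Python's html.unescape stands in here for whatever XML parsing happens on the receiving side, and the helper name decode_references is hypothetical.

```python
import html

# Numeric character references copied from the Perl-extracted field.
raw = "Who We Are &#20844;&#21496;&#21382;&#21490;"


def decode_references(text: str) -> str:
    """Replace character references with the code points they name."""
    return html.unescape(text)


decoded = decode_references(raw)
print(decoded)
# Show the code points of the last four characters.
print([hex(ord(ch)) for ch in decoded[-4:]])
```

Because the references themselves are ASCII, a wrong charset guess anywhere along the way cannot damage them, which may be why the Perl route behaves well.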

(b) Used Tika (command line). The resulting content field looks like

<field name="content_china">Who We Are Ã¥ Â¬Ã¥ÂÂ¸Ã¥ÂŽÂ†Ã¥ÂÂ²
Ã¦Â‚Â¨Ã§ÂšÂ„Ã¦ÂˆÂÃ¥ÂŠÂŸÃ¦Â¡
ÂˆÃ¤Â¾Â‹ Ã©Â¢Â†Ã¥Â¯Â¼Ã¥Â›Â¢Ã©Â˜ÂŸ Ã¤Â¸ÂšÃ¥ÂŠÂ¡Ã©ÂƒÂ¨Ã©Â—Â¨ Ã‚ Innovation Ã¥Â
etc... </field>

I posted these documents to a different instance
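
The garbled field in (b) has the typical look of UTF-8 bytes that were decoded with a single-byte charset. The sketch below only illustrates that mechanism; it is an assumption that ISO-8859-1 was the wrong charset involved, and the archived text may have been through more than one such round, so this is not an exact reconstruction. The function names are hypothetical.

```python
def misdecode(text: str, actual: str = "utf-8", assumed: str = "iso-8859-1") -> str:
    """Encode text with its real encoding, then decode the bytes with the wrong one."""
    return text.encode(actual).decode(assumed)


def repair(garbled: str, actual: str = "utf-8", assumed: str = "iso-8859-1") -> str:
    """Undo misdecode(); this only works if no bytes were lost along the way."""
    return garbled.encode(assumed).decode(actual)


original = "\u516c\u53f8"  # the first two characters of the Perl-extracted field
garbled = misdecode(original)

# Each 3-byte UTF-8 sequence turns into three unrelated Latin-1 characters.
print(len(original), len(garbled))
print(repair(garbled) == original)
```

An index built from the garbled form contains tokens that a correctly typed Chinese query never produces, which would be consistent with getting no hits.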

When I search the first instance for a string (that I copied and pasted from
the web site), I find a number of hits, including the page from which I
copied the string. But when I do the same on the instance with the
Tika-extracted text, I get nothing.

Has anyone seen this? I believe it may have to do with encoding. In both
cases the posted documents were UTF-8 compliant.

Thanks for your insights.

- ashok

-- 
View this message in context: http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing TIKA extracted text. Are there some issues?

Posted by ashokc <as...@qualcomm.com>.

Could very well be... I will rectify it and try again. Thanks

- ashok



Re: Indexing TIKA extracted text. Are there some issues?

Posted by Robert Muir <rc...@gmail.com>.

It appears there is an encoding problem. In the screenshot I can see that
the title is mangled, and if I open the URL in IE or Firefox, both
browsers think it is ISO-8859-1.

I think this is why (from the W3C validator):

Character Encoding mismatch!

The character encoding specified in the HTTP header (iso-8859-1) is
different from the value in the <meta> element (utf-8). I will use the
value from the HTTP header (iso-8859-1) for this validation.
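
The mismatch the validator reports can be checked before extraction. This is a rough sketch, not what Tika or the validator actually does: the regular expressions are simplified, the function names are hypothetical, and the sample header and page bytes are invented for illustration.

```python
import re
from typing import Optional


def charset_from_header(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html_bytes: bytes) -> Optional[str]:
    """Find a charset declared in a <meta> element near the top of the page."""
    head = html_bytes[:4096].decode("ascii", errors="replace")
    match = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charsets_disagree(content_type: str, html_bytes: bytes) -> bool:
    """True when both declarations exist and name different charsets."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html_bytes)
    return header is not None and meta is not None and header != meta


# Invented sample values mirroring the situation described by the validator.
sample_header = "text/html; charset=ISO-8859-1"
sample_page = (b'<html><head><meta http-equiv="Content-Type" '
               b'content="text/html; charset=utf-8"></head><body></body></html>')
print(charsets_disagree(sample_header, sample_page))
```

When the two disagree, any tool that trusts the HTTP header will decode the UTF-8 bytes as ISO-8859-1, so fixing the server header (or decoding with the charset from the meta element before extraction) is the likely remedy.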

-- 
Robert Muir
rcmuir@gmail.com

Re: Indexing TIKA extracted text. Are there some issues?

Posted by ashokc <as...@qualcomm.com>.

Sure.

The Java command I use with Tika to extract text from a URL is:

java -jar tika-0.3-standalone.jar -t $url

I have also attached the screenshots of the web page, post documents
produced in the two different ways (Perl & Tika) for that web page, and the
screenshots of the search result for a string contained in that web page.
The index in each case contains just this one URL. To keep everything else
identical, I used the same instance for creating the index in each case.
First I posted the Tika document, checked for the results, emptied the
index, posted the Perl document, and checked the results.
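
For reference, a post document like the ones described can be produced so that the text is escaped and serialized as UTF-8 in one step. This is only a sketch of the general idea, not the script used here: build_add_xml is a hypothetical helper, only the content_china field from this thread is used, and a real schema will usually require further fields such as a unique key.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_add_xml(fields: Dict[str, str]) -> bytes:
    """Build an <add><doc>...</doc></add> update message encoded as UTF-8."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="utf-8")


payload = build_add_xml({"content_china": "Who We Are \u516c\u53f8\u5386\u53f2 & more"})
print(len(payload), "bytes")
```

The important part is that the string handed to the serializer already holds the correct characters; if the extracted text was decoded with the wrong charset, a well-formed UTF-8 document will still carry the garbled text.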

Debug query for Tika:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:"高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
</str>

Debug query for Perl:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:"高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
</str>
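
In both debug outputs the content_china clause appears to be split into overlapping two-character tokens. Assuming the field uses bigram-style CJK analysis (the schema is not shown in this thread, so that is a guess), the tokenization can be sketched like this; cjk_bigrams is a hypothetical name and the four-character sample string is only an illustration.

```python
from typing import List


def cjk_bigrams(text: str) -> List[str]:
    """Return every overlapping pair of adjacent characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


sample = "\u9ad8\u901a\u516c\u53f8"  # four CJK characters
print(cjk_bigrams(sample))
```

Since the query side produces the same tokens in both runs, a difference in hits would have to come from the indexed text rather than from the query.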

The screenshots
http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx 

Perl extracted doc
http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml 

Tika extracted doc
http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml 


Re: Indexing TIKA extracted text. Are there some issues?

Posted by Grant Ingersoll <gs...@apache.org>.

Hmm, looks very much like an encoding problem.  Can you post a sample  
showing it, along with the commands you invoked?

Thanks,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search