Posted to solr-user@lucene.apache.org by Charles Wardell <ch...@bcsolution.com> on 2011/03/28 18:52:55 UTC

problems indexing web content

Hi Everyone,

I set up a server and began to index my data. I have two questions I am hoping someone can help me with. Many of my files index without any problems; for others, I get a host of different errors. I am indexing primarily web-based content and have defined my text field type as follows:
 
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
                <charfilter class="solr.HTMLStripCharFilterFactory"/>	
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldtype>


q1) Errors while indexing.

* SimplePostTool: WARNING: Unexpected response from Solr: '<result status="0"></result>' does not contain '<int name="status">0</int>'

* SEVERE: Error processing "legacy" update command:com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ' ' (code 32) in content after '<' (malformed start element?). at [row,col {unknown-source}]: [1591,90] at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)

* Although I can't find the actual error, I recall Solr giving me an error when it came across the string &What. The error was something like: expecting semicolon after "What"


q2) If my file has 1000 documents and I submit it with post.jar, if it comes across any of the above errors, will it break the processing of the whole file, or just the document with the error?


Thanks in advance. 
Your help is very much appreciated.

Charlie

Re: problems indexing web content

Posted by Markus Jelsma <ma...@openindex.io>.

The order of elements in the analyzer doesn't really matter here: char filters
are always executed first, regardless of their position in the analyzer.
Multiple filters of the same type, however, are affected by order. Also, your
error is not caused by a faulty analyzer; there is something wrong in your XML.

Anyway, according to your error, check row 1591, column 90 of your XML input;
there seems to be a stray space right after a '<' somewhere.
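
One way to find such a spot before posting is to run the file through an XML parser locally. The following is only a sketch using Python's standard library, not anything from post.jar or Solr; the function name `find_xml_error` and the sample input are made up for illustration, and the column reported by Python's parser may not match the one Woodstox prints.

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_text):
    """Return (row, col, offending_line) for the first well-formedness
    error in xml_text, or None if the text parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        row, col = err.position  # row is 1-based, col is 0-based
        lines = xml_text.splitlines()
        offending = lines[row - 1] if 0 < row <= len(lines) else ""
        return row, col, offending
    return None


# Hypothetical sample: a raw '<' followed by a space inside field content.
sample = '<add>\n<doc>\n<field name="title">a < b</field>\n</doc>\n</add>'
result = find_xml_error(sample)
if result is not None:
    row, col, line = result
    print("not well-formed at row %d, col %d: %s" % (row, col, line))
```

Running this over each generated file before posting it should point at the same kind of row the Solr log mentions.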

> Jan,
> 
> Thank you for such a quick reply. I have a feed coming in that I convert to
> an <add><doc></doc><doc></doc></add> file. Here is the type for text,
> including index and query analyzers, with the changes suggested.
> 
> 
>         <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>             <analyzer type="index">
>                 <charfilter class="solr.HTMLStripCharFilterFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             </analyzer>
>             <analyzer type="query">
>                 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>                 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             </analyzer>
>         </fieldtype>
> 
> 
> Here is a snippet of the file I generate.
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
> <doc>
> <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
> <field name="title">E X I T</field>
> <field name="authorName">uswautis (Hasanah Uswa)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/uswautis</field>
> <field name="lang">U</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">E X I T</field>
> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> <doc>
> <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> <field name="title">I want the sweater i saw in mango sooooo bad.</field>
> <field name="authorName">imsuperangelica (angelica marie)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/imsuperangelica</field>
> <field name="lang">en</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> </add>
> 
> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
> > Hi,
> > 
> > I assume you try to post HTML files from post.jar, and use
> > HTMLStripCharFilter to sanitize the HTML.
> > 
> > But you refer to "my file" as if you have multiple docs in one file? XML
> > or HTML? Multiple files? To what UpdateRequestHandler are you posting?
> > /update/xml or /update/extract ? For us to understand what you're trying
> > to achieve, please describe your project in more detail.
> > 
> > 
> > To give some concrete feedback too: First off, your analyzer for "text"
> > is wrong. All charFilters need to come before the tokenizer. You also
> > lack an analyzer with type="query". If I were you, I'd try the simplest
> > case first: get rid of MappingCharFilter, StopFilter, WordDelimiterFilter
> > and the stemmer, do the most basic stuff you can, and go from there.
> > 
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com

Re: problems indexing web content

Posted by Markus Jelsma <ma...@openindex.io>.

> I have about 1000 documents per xml file. I am not really doing anything
> with the data other than putting the xml tags around it. So essentially
> the data is okay with the exception of a few documents that are causing
> the errors.
> 
> Let's say document # 47 in the xml file has a problem, is the whole file
> skipped when using post.jar? I will add the CDATA to my xml generator.

I am not sure, actually. I have never tried it, but I think the file is thrown away.

> 
> Sometimes the data will come in as a string of pretty funky looking
> characters. I am assuming this is UTF-8. Is there any specialized data
> type I need to declare for this data?

Well, all data needs to be UTF-8 encoded. Anyway, incorrectly encoded text is 
just indexed as is and won't throw an error. Except for entities, of course.

> 
> One other thing I noticed is that sometimes I may get data in binary
> compressed format, like an image or something. Obviously I am not looking
> to index it, but is there a data type this can be stored as in Solr so I
> can retrieve and render it easily?

Yes, use the binary field type [1]. You have to base64 encode the data.

[1]: http://lucene.apache.org/solr/api/org/apache/solr/schema/BinaryField.html
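
As a rough illustration of the base64 step, here is a sketch of how an XML generator could emit such a field. The helper `binary_field` and the field name `thumbnail` are hypothetical, and the field would still have to be declared with the binary type in schema.xml.

```python
import base64
from xml.sax.saxutils import quoteattr


def binary_field(name, payload):
    """Build a <field> element whose content is the base64 text of payload."""
    # base64 output is plain ASCII and contains no '<' or '&',
    # so it needs no further XML escaping.
    encoded = base64.b64encode(payload).decode("ascii")
    return "<field name=%s>%s</field>" % (quoteattr(name), encoded)


# Hypothetical usage with three raw bytes standing in for image data.
print(binary_field("thumbnail", b"\x00\x01\x02"))
```

On the way back out, the stored value would need to be base64-decoded again before rendering.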


Re: problems indexing web content

Posted by Charles Wardell <ch...@bcsolution.com>.

I have about 1000 documents per xml file. I am not really doing anything with the data other than putting the xml tags around it.
So essentially the data is okay with the exception of a few documents that are causing the errors.

Let's say document # 47 in the xml file has a problem, is the whole file skipped when using post.jar?
I will add the CDATA to my xml generator.

Sometimes the data will come in as a string of pretty funky looking characters. I am assuming this is UTF-8. Is there any specialized data type I need to declare for this data?

One other thing I noticed is that sometimes I may get data in binary compressed format, like an image or something. Obviously I am not looking to index it, but is there a data type this can be stored as in Solr so I can retrieve and render it easily?



Re: problems indexing web content

Posted by Markus Jelsma <ma...@openindex.io>.

Also, don't forget to encode entities or wrap them in CDATA.
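
For a generator that only wraps feed data in XML tags, both options could look roughly like the sketch below. It uses Python's standard library; the helper names `text_field` and `cdata_field` are made up for illustration, and neither helper deals with control characters that are not allowed in XML at all.

```python
from xml.sax.saxutils import escape, quoteattr


def text_field(name, value):
    """Escape &, < and > so raw feed text cannot break the XML."""
    return "<field name=%s>%s</field>" % (quoteattr(name), escape(value))


def cdata_field(name, value):
    """Alternative: wrap the value in CDATA, splitting any ']]>' inside it."""
    safe = value.replace("]]>", "]]]]><![CDATA[>")
    return "<field name=%s><![CDATA[%s]]></field>" % (quoteattr(name), safe)


# A bare '&What' like the one from the error report becomes '&amp;What'.
print(text_field("title", "Fish &What < chips"))
```

Either form should avoid both the "expecting semicolon" error and the stray '<' error, since the parser no longer sees raw markup characters in the field content.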

> Jan,
> 
> thank you for such a quick reply. I have a feed coming in that I convert to
> an <add><doc></doc><doc></doc> Here is the type for text including index
> and query with the changes suggested.
> 
> 
>         <fieldtype name="text" class="solr.TextField"
> positionIncrementGap="100"> <analyzer type="index">
>                 <charfilter class="solr.HTMLStripCharFilterFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/> <filter
> class="solr.RemoveDuplicatesTokenFilterFactory"/> <tokenizer
> class="solr.WhitespaceTokenizerFactory"/> </analyzer>
>             <analyzer type="query">
>                 <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter
> class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0"/> <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/> <filter
> class="solr.RemoveDuplicatesTokenFilterFactory"/> <tokenizer
> class="solr.WhitespaceTokenizerFactory"/> </analyzer>
>         </fieldtype>
> 
> 
> Here is the snippit of the file I generate.
> 
> ?xml version="1.0" encoding="UTF-8"?>
> <add>
> <doc>
> <field
> name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
> <field name="title">E X I T</field>
> <field name="authorName">uswautis (Hasanah Uswa)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/uswautis</field>
> <field name="lang">U</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field
> name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</fie
> ld> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">E X I T</field>
> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> <doc>
> <field
> name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080<
> /field> <field name="title">I want the sweater i saw in mango sooooo
> bad.</field> <field name="authorName">imsuperangelica (angelica
> marie)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/imsuperangelica</field>
> <field name="lang">en</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field
> name="feedURL">http://twitter.com/imsuperangelica/statuses/519973640508620
> 80</field> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">I want the sweater i saw in mango sooooo
> bad.</field> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> </add>
> 

Re: problems indexing web content

Posted by Charles Wardell <ch...@bcsolution.com>.

Jan,

Thank you for such a quick reply. I have a feed coming in that I convert to an <add><doc></doc><doc></doc></add> file.
Here is the field type for text, including the index and query analyzers, with the suggested changes.


        <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <charfilter class="solr.HTMLStripCharFilterFactory"/>	
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
            <analyzer type="query">
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
        </fieldtype>


Here is a snippet of the file I generate.

<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
<field name="title">E X I T</field>
<field name="authorName">uswautis (Hasanah Uswa)</field>
<field name="authorEmail"></field>
<field name="authorLinkMimeType"></field>
<field name="authorLink">http://twitter.com/uswautis</field>
<field name="lang">U</field>
<field name="publishDate">2011-03-27T13:21:52Z</field>
<field name="aquiDate">2011-03-27T13:22:13Z</field>
<field name="source"></field>
<field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
<field name="feedContentMimeType">text/html</field>
<field name="feedContentEncoding"></field>
<field name="feedContent">null</field>
<field name="inboundLinks">0</field>
<field name="publisherType">MICROBLOG</field>
<field name="postTitle">E X I T</field>
<field name="postBodyMimeType">text/html</field>
<field name="postBodyEncoding">zlib</field>
<field name="postBody">mime_type: "text/html"
data: ""
</field>
<field name="tags">[]</field>
</doc>

<doc>
<field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
<field name="title">I want the sweater i saw in mango sooooo bad.</field>
<field name="authorName">imsuperangelica (angelica marie)</field>
<field name="authorEmail"></field>
<field name="authorLinkMimeType"></field>
<field name="authorLink">http://twitter.com/imsuperangelica</field>
<field name="lang">en</field>
<field name="publishDate">2011-03-27T13:21:52Z</field>
<field name="aquiDate">2011-03-27T13:22:13Z</field>
<field name="source"></field>
<field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
<field name="feedContentMimeType">text/html</field>
<field name="feedContentEncoding"></field>
<field name="feedContent">null</field>
<field name="inboundLinks">0</field>
<field name="publisherType">MICROBLOG</field>
<field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
<field name="postBodyMimeType">text/html</field>
<field name="postBodyEncoding">zlib</field>
<field name="postBody">mime_type: "text/html"
data: ""
</field>
<field name="tags">[]</field>
</doc>

</add>
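
For reference, the two parser errors quoted in this thread (an unexpected space after '<', and a missing semicolon after "&What") are what an XML parser typically reports when field values contain unescaped markup characters. The following is a minimal sketch, assuming that is the cause here. The document, its guid and its text are hypothetical; only the field names come from the file above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <!-- hypothetical document: the guid and text are made up for illustration -->
    <field name="guid">http://example.com/statuses/1</field>
    <!-- inside a field value, a literal ampersand is written as &amp; and a literal less-than sign as &lt; -->
    <field name="title">Tom &amp; Jerry: 3 &lt; 5</field>
    <!-- alternatively, raw HTML can be wrapped in a CDATA section, as long as it never contains the CDATA end marker -->
    <field name="postBody"><![CDATA[<p>&What is this < that</p>]]></field>
  </doc>
</add>
```

Either form keeps the file well-formed, so whatever generates the feed would need to apply one of them to every field value before the file is posted.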









Re: problems indexing web content

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

I assume you are trying to post HTML files with post.jar and use HTMLStripCharFilter to sanitize the HTML.

But you refer to "my file" as if you have multiple docs in one file? XML or HTML? Multiple files?
To what UpdateRequestHandler are you posting? /update/xml or /update/extract ?
For us to understand what you're trying to achieve, please describe your project in more detail.


To give some concrete feedback too: first off, your analyzer for "text" is wrong. All charFilters need to come before the tokenizer. You also lack an analyzer with type="query". If I were you I'd try the simplest case first: get rid of MappingCharFilter, StopFilter, WordDelimiterFilter and the stemmer, do the most basic stuff you can, and go from there.
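
A pared-down version of the "text" type along those lines might look like the sketch below. It is only an illustration: the choice of a lowercase filter as the single token filter, and leaving HTML stripping out of the query analyzer, are assumptions rather than a recommendation from this thread. Note the spelling charFilter with a capital F; as far as I know Solr matches that element name case-sensitively, so a lower-case charfilter may be silently ignored.

```xml
<!-- Sketch only: a minimal "text" type; the filter selection is an assumption -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- charFilters work on the raw character stream, so they are listed first -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- exactly one tokenizer, placed after the charFilters -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- token filters follow the tokenizer and are applied in the order listed -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Once this indexes cleanly, the other filters can be added back one at a time, each after the tokenizer.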

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
