Posted to solr-user@lucene.apache.org by Paden <ru...@gmail.com> on 2015/06/18 16:39:12 UTC

Error when submitting PDF to Solr w/text fields using SolrJ

Hello, 

I'm using Solr to pull information from a Database and a file system
simultaneously. The database houses the file path of the file in the file
system. It pulls all of those just fine. In fact, it combines the metadata
from the database and the metadata from the file system great. The problem
occurs when I try to index the text. The error does not occur at the point
when it tries to add the field "text" to the document. The error occurs when
I try to submit that document to Solr. It gives me this error, 


org.apache.solr.common.SolrException: Exception writing document id
/some/filepath to the index; possible analysis error. 


This is how the field is defined in schema:

<field name="text" type="string" indexed="true" stored="false"
required="false" multiValued="true" /> 

and this is the code I use to add it to the document:

File file = new File(filepath); 

ContentHandler textHandler = new BodyContentHandler(); 

Metadata metadata = new Metadata();

ParseContext context = new ParseContext();

InputStream input = new FileInputStream(file);

try{

 autoParser.parse(input, textHandler, metadata, context); 

} catch (Exception e) { 

  //prints out error message

 continue;

} 

if(textHandler != null){

  doc.addField("text",textHandler.toString()); 

} 

try{
 
    server.add(doc); 

} catch (Exception ex){ 

 //logmessage

 continue; 

} 

I think it has something to do with how the field is defined in the schema, but I
don't know. All the files that produce error messages are PDFs, if that helps.
There are .doc files in the file system, but they don't error out.






--
View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Erick Erickson <er...@gmail.com>.

This may be another forehead-slapper (man, you don't know how often
I've injured myself that way).

Did you commit at the end of the SolrJ indexing to Testcore2? DIH automatically
commits at the end of the run, and depending on how your SolrJ program is
written it may not have. Or just set autoCommit (with openSearcher=true) in
your solrconfig file. Or set autoSoftCommit there. In either case, wait until
the interval has expired after your indexing has run.

Or, for that matter, you can ensure you've committed by using curl or just
entering something like
..../Testcore2/update?commit=true
in a URL.
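
For illustration, the explicit commit request described above could also be
issued from plain Java. This is only a minimal sketch using the JDK's
java.net.http client (Java 11 or later). The base URL
http://localhost:8983/solr is an assumption, since the message elides it, and
the class and method names are hypothetical, not part of SolrJ.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of SolrJ: issues the same request as
// entering .../<core>/update?commit=true in a browser.
class ExplicitCommit {

    // Builds the update URL with commit=true for the given core.
    static URI commitUri(String solrBaseUrl, String core) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return URI.create(base + "/" + core + "/update?commit=true");
    }

    // Sends the request and returns the HTTP status code.
    static int sendCommit(URI uri) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        // Assumed base URL; replace it with your own Solr address.
        URI uri = commitUri("http://localhost:8983/solr", "Testcore2");
        System.out.println("Commit URL: " + uri);
        // Only contact the server when explicitly asked to.
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println("HTTP status: " + sendCommit(uri));
        }
    }
}
```

Run without arguments it only prints the URL; pass --send to actually contact
the server.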

And another one that'll make you cringe is if your SolrJ program looks like:

while (more docs) {
   create a solr doc and add it to my list
   if (list > 100) {
      send list to Solr
      clear list
  }
}
end of program.

As the program exits, there'll still be docs in the list that haven't
been sent to Solr.
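
A sketch of the usual fix for the loop above, in plain Java: flush whatever is
left once the loop has finished. The BatchSender class is hypothetical, and
its Consumer is only a stand-in for whatever call actually sends a batch to
Solr (in a real indexer that would be the SolrJ add call).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching helper. The Consumer stands in for whatever
// actually sends a batch of documents to Solr.
class BatchSender<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender;
    private final List<T> pending = new ArrayList<>();

    BatchSender(int batchSize, Consumer<List<T>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Queue one document and send the batch once it is full.
    void add(T doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is still queued. Call this once after the loop.
    void flush() {
        if (!pending.isEmpty()) {
            sender.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        BatchSender<String> batcher = new BatchSender<>(100, received::addAll);
        for (int i = 0; i < 281; i++) {
            batcher.add("doc-" + i);
        }
        System.out.println("Sent before final flush: " + received.size()); // 200
        batcher.flush(); // without this, the last partial batch is never sent
        System.out.println("Sent after final flush: " + received.size());  // 281
    }
}
```

With 281 documents and a batch size of 100, the final 81 documents only reach
the receiver because of the flush() call after the loop.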

Alessandro's question hints at things like this: the question is whether
all the docs got sent to Solr or not. Second question is whether they're
analyzed differently in the two cores. Third question....

Best,
Erick



On Fri, Jun 19, 2015 at 8:32 AM, Alessandro Benedetti
<be...@gmail.com> wrote:
> So, the first I can say is if that is true : "it almost killed Solr with
> 280 files" you are doing something wrong for sure.
> At least if you are not trying to index 4k full movies xD
>
> Joking apart :
> 1) You should carefully design your analyser.
> 2) You should store your fields initially to verify you index what you were
> supposed to ( in number and in content)
> Assuming you are a beginner storing the fields will make easier for you to
> check, as they will pop out of the results.
>
> is at least the number of docs indexed correct ?
>
>
> 2015-06-19 15:34 GMT+01:00 Paden <ru...@gmail.com>:
>
>> Yeah, actually changing the field to "text_en" or "text_en_splitting"
>> actually made it so my indexer indexed all my files. The only problem is, I
>> don't think it's doing it well.
>>
>> I have two Cores that I'm working with. Both of them have indexed the same
>> set of files. The first core, which I will refer to as Testcore, I used a
>> DIH configuration that indexed the files with their metadata. (It indexed
>> everything fine but it almost killed Solr with 280 files I would hate to
>> see
>> what would happen with say, 10,000 files.). When I query Testcore on some
>> random common word like "a" it returns like 279 files. A good margin I can
>> accept that.
>>
>> The second core, which I will refer to as Testcore2, I used my own indexer
>> that I created and use SolrJ as the client. It indexes everything. However,
>> when I query on the same word "a" it only returns 208 of the 281 files.
>> Which is weird cause I'm using the exact same Querying handler for both. So
>> I don't think a comprehensive indexed text is being sent to Solr.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Erick Erickson <er...@gmail.com>.

You really, really, really want to get friendly with the
admin/analysis page for questions like:

bq: You're probably right though. I probably have to create a better analyzer

really ;).

It shows you exactly what each link in your analysis chain does to the
input. Perhaps 75% of the questions about "why am I getting the results
I'm seeing" are answered there IMO.
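
As a rough mental model of what that page displays, here is a toy chain in
plain Java that prints the tokens after each stage. It is only a sketch and
does not use Solr or Lucene classes: the three stages loosely imitate a
whitespace tokenizer, a lowercase filter and a stop filter, and the stop word
set is an assumption made up for the demo.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model of an analysis chain; it does not use Solr or Lucene classes.
class ToyAnalysisChain {
    // Assumed stop words for the demo, not Solr's stopwords_en.txt.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is");

    // Stage 1: split the input on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Stage 2: lowercase every token.
    static List<String> lowercase(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Stage 3: drop tokens that are in the stop word set.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "A PDF is a Document";
        List<String> step1 = tokenize(input);
        List<String> step2 = lowercase(step1);
        List<String> step3 = removeStopWords(step2);
        System.out.println("tokenizer:   " + step1);
        System.out.println("lowercase:   " + step2);
        System.out.println("stop filter: " + step3);
    }
}
```

In this toy chain the word "a" never survives the last stage. If the real
stop word file also lists "a", which is an assumption here, a query for "a"
is not a very informative test of what got indexed.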

Best,
Erick

On Fri, Jun 19, 2015 at 9:38 AM, Paden <ru...@gmail.com> wrote:
> Yes the number of indexed documents is correct. But the queries I perform
> fall short of what they should be. You're probably right though. I probably
> have to create a better analyzer.
>
> And I'm not really worried about the other fields. I've already check to see
> if it's storing them correctly and it is. I'm mostly worried about the text
> fields and how they're being indexed by Solr when submitted.
>
> BTW: Because of your comment, I went back and checked my core that used the
> DIH configuration. I increased the RAM on the Linux virtual machine I'm
> using and it worked like a dream. Thanks! You might have just helped me
> finish this project.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Paden <ru...@gmail.com>.

Yes the number of indexed documents is correct. But the queries I perform
fall short of what they should be. You're probably right though. I probably
have to create a better analyzer. 

And I'm not really worried about the other fields. I've already checked to see
if it's storing them correctly and it is. I'm mostly worried about the text
fields and how they're being indexed by Solr when submitted. 

BTW: Because of your comment, I went back and checked my core that used the
DIH configuration. I increased the RAM on the Linux virtual machine I'm
using and it worked like a dream. Thanks! You might have just helped me
finish this project.



--
View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212967.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Alessandro Benedetti <be...@gmail.com>.

So, the first thing I can say is that if this is true: "it almost killed Solr with
280 files", you are doing something wrong for sure.
At least if you are not trying to index 4K full movies xD

Joking apart:
1) You should carefully design your analyser.
2) You should store your fields initially to verify that you index what you were
supposed to (in number and in content).
Assuming you are a beginner, storing the fields will make it easier for you to
check, as they will pop up in the results.

Is at least the number of docs indexed correct?


2015-06-19 15:34 GMT+01:00 Paden <ru...@gmail.com>:

> Yeah, actually changing the field to "text_en" or "text_en_splitting"
> actually made it so my indexer indexed all my files. The only problem is, I
> don't think it's doing it well.
>
> I have two Cores that I'm working with. Both of them have indexed the same
> set of files. The first core, which I will refer to as Testcore, I used a
> DIH configuration that indexed the files with their metadata. (It indexed
> everything fine but it almost killed Solr with 280 files I would hate to
> see
> what would happen with say, 10,000 files.). When I query Testcore on some
> random common word like "a" it returns like 279 files. A good margin I can
> accept that.
>
> The second core, which I will refer to as Testcore2, I used my own indexer
> that I created and use SolrJ as the client. It indexes everything. However,
> when I query on the same word "a" it only returns 208 of the 281 files.
> Which is weird cause I'm using the exact same Querying handler for both. So
> I don't think a comprehensive indexed text is being sent to Solr.



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Paden <ru...@gmail.com>.

Yeah, changing the field to "text_en" or "text_en_splitting"
actually made it so my indexer indexed all my files. The only problem is, I
don't think it's doing it well.

I have two cores that I'm working with. Both of them have indexed the same
set of files. For the first core, which I will refer to as Testcore, I used a
DIH configuration that indexed the files with their metadata. (It indexed
everything fine, but it almost killed Solr with 280 files. I would hate to see
what would happen with, say, 10,000 files.) When I query Testcore on some
random common word like "a" it returns something like 279 files. That's a
good margin; I can accept that.

For the second core, which I will refer to as Testcore2, I used my own indexer
that I created, with SolrJ as the client. It indexes everything. However,
when I query on the same word "a" it only returns 208 of the 281 files,
which is weird because I'm using the exact same query handler for both. So
I don't think the complete text is being sent to Solr.





--
View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212933.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Alessandro Benedetti <be...@gmail.com>.

We would like more information, but the first thing I notice is that it hardly
makes any sense to use a "string" type for file content.

Can you give more details about the exception ?
Have you debugged a little bit ?
How does the solr input document look before it is sent to Solr ?

Furthermore, please give us the whole stack trace. The message you posted is
almost useless without all the details ...
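
To make "the whole stack trace" concrete, here is a small sketch in plain Java
of walking an exception's cause chain instead of logging only the top-level
message. The CauseChain class and the simulated exceptions are hypothetical.
For errors raised on the Solr server, the deeper causes may only be visible in
the server log rather than in the client-side exception.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists every exception in a cause chain, which is
// where the "Caused by:" entries of a printed stack trace come from.
class CauseChain {

    // Returns one line per throwable, outermost first.
    static List<String> describe(Throwable error) {
        List<String> lines = new ArrayList<>();
        Throwable current = error;
        while (current != null) {
            lines.add(current.getClass().getName() + ": " + current.getMessage());
            current = current.getCause();
        }
        return lines;
    }

    public static void main(String[] args) {
        // Simulated failure: a generic outer error wrapping the real cause.
        Exception root = new IllegalArgumentException("bytes can be at most 32766 in length");
        Exception outer = new RuntimeException("possible analysis error", root);
        try {
            throw outer;
        } catch (Exception e) {
            // e.getMessage() alone would hide the root cause.
            for (String line : describe(e)) {
                System.out.println(line);
            }
            // e.printStackTrace() prints the same chain with full frames.
        }
    }
}
```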

2015-06-18 15:39 GMT+01:00 Paden <ru...@gmail.com>:

> Hello,
>
> I'm using Solr to pull information from a Database and a file system
> simultaneously. The database houses the file path of the file in the file
> system. It pulls all of those just fine. In fact, it combines the metadata
> from the database and the metadata from the file system great. The problem
> occurs when I try to index the text. The error does not occur at the point
> when it tries to add the field "text" to the document. The error occurs
> when
> I try to submit that document to Solr. It gives me this error,
>
>
> org.apache.solr.common.SolrException: Exception writing document id
> /some/filepath to the index; possible analysis error.
>
>
> This is how the field is defined in schema:
>
> <field name="text" type="string" indexed="true" stored="false"
> required="false" multiValued="true" />
>
> and this is the code I use to add it to the document:
>
> File file = new File(filepath);
>
> ContentHandler textHandler = new BodyContentHandler();
>
> Metadata metadata = new Metadata();
>
> ParseContext context = new ParseContext();
>
> InputStream input = new FileInputStream(file);
>
> try{
>
>  autoParser.parse(input, textHandler, metadata, context);
>
> } catch (Exception e) {
>
>   //prints out error message
>
>  continue;
>
> }
>
> if(textHandler != null){
>
>   doc.addField("text",textHandler.toString());
>
> }
>
> try{
>
>     server.add(doc);
>
> } catch (Exception ex){
>
>  //logmessage
>
>  continue;
>
> }
>
> I think it has something to do with how the field is defined in schema but
> I
> don't know. All the files that get error messages are PDF's if that helps.
> There are .doc s in the file system but they don't error out.




Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Paden <ru...@gmail.com>.

USING Solr 5.1.0

This is the schema file

<?xml version="1.0" encoding="UTF-8" ?>


<schema name="example" version="1.5">
  
   <field name="_version_" type="long" indexed="true" stored="true"/>
 
   <field name="_root_" type="string" indexed="true" stored="false"/>

   <field name="id" type="string" indexed="true" stored="true"
required="false" multiValued="false" />
   <field name="filepath" type="string" indexed="true" stored ="true"
required="false" multiValued="false" />  
   <field name="title" type="string" indexed="true" stored ="true"
required="false" multiValued="false" />  
   <field name="author" type="string" indexed="true" stored ="true"
required="false" multiValued="false" />  
   <field name="text" type="string" indexed="true" stored ="false"
required="false" multiValued="true" />  
   <field name="key" type="string" indexed="true" stored ="false"
required="false" multiValued="false" /> 

 
   
   <dynamicField name="*_name"  type="text_general"   multiValued="false"
indexed="true"  stored="true" />

   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
   <dynamicField name="*_is" type="int"    indexed="true"  stored="true" 
multiValued="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true" />
   <dynamicField name="*_ss" type="string"  indexed="true"  stored="true"
multiValued="true"/>
   <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
   <dynamicField name="*_ls" type="long"   indexed="true"  stored="true" 
multiValued="true"/>
   <dynamicField name="*_t"  type="text_general"    indexed="true" 
stored="true"/>
   <dynamicField name="*_txt" type="text_general"   indexed="true" 
stored="true" multiValued="true"/>
   <dynamicField name="*_en"  type="text_en"    indexed="true" 
stored="true" multiValued="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true" stored="true"/>
   <dynamicField name="*_bs" type="boolean" indexed="true" stored="true" 
multiValued="true"/>
   <dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
   <dynamicField name="*_fs" type="float"  indexed="true"  stored="true" 
multiValued="true"/>
   <dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
   <dynamicField name="*_ds" type="double" indexed="true"  stored="true" 
multiValued="true"/>


   <dynamicField name="*_coordinate"  type="tdouble" indexed="true" 
stored="false" />

   <dynamicField name="*_dt"  type="date"    indexed="true"  stored="true"/>
   <dynamicField name="*_dts" type="date"    indexed="true"  stored="true"
multiValued="true"/>
   <dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

 
   <dynamicField name="*_ti" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tl" type="tlong"   indexed="true"  stored="true"/>
   <dynamicField name="*_tf" type="tfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_td" type="tdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_tdt" type="tdate"  indexed="true"  stored="true"/>

   <dynamicField name="*_c"   type="currency" indexed="true" 
stored="true"/>

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true"
stored="true" multiValued="true"/>

   <dynamicField name="random_*" type="random" />



 <uniqueKey>filepath</uniqueKey>


    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />


    <fieldType name="boolean" class="solr.BoolField"
sortMissingLast="true"/>

    <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0"
positionIncrementGap="0"/>

    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8"
positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
positionIncrementGap="0"/>

 
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
positionIncrementGap="0"/>

  
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
positionIncrementGap="0"/>


    <fieldType name="binary" class="solr.BinaryField"/>

 
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />


    <fieldType name="text_ws" class="solr.TextField"
positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
        
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
    
 
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
    
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
    
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

  
    <fieldType name="text_en_splitting" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

   
    <fieldType name="text_en_splitting_tight" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" generateNumberParts="0" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

   
    <fieldType name="text_general_rev" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory"
withOriginal="true"
           maxPosAsterisk="3" maxPosQuestion="2"
maxFractionAsterisk="0.33"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

   
    <fieldType name="alphaOnlySort" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
      <analyzer>
      
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      
        <filter class="solr.LowerCaseFilterFactory" />
      
        <filter class="solr.TrimFilterFactory" />
       
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />
      </analyzer>
    </fieldType>


    <fieldType name="lowercase" class="solr.TextField"
positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

  
    <fieldType name="ignored" stored="false" indexed="false"
multiValued="true" class="solr.StrField" />

   
    <fieldType name="point" class="solr.PointType" dimension="2"
subFieldSuffix="_d"/>

  
    <fieldType name="location" class="solr.LatLonType"
subFieldSuffix="_coordinate"/>

   
    <fieldType name="location_rpt"
class="solr.SpatialRecursivePrefixTreeFieldType"
        geo="true" distErrPct="0.025" maxDistErr="0.001"
distanceUnits="kilometers" />

   
    <fieldType name="bbox" class="solr.BBoxField"
               geo="true" distanceUnits="kilometers"
numberType="_bbox_coord" />
    <fieldType name="_bbox_coord" class="solr.TrieDoubleField"
precisionStep="8" docValues="true" stored="false"/>

  
    <fieldType name="currency" class="solr.CurrencyField" precisionStep="8"
defaultCurrency="USD" currencyConfig="currency.xml" />

</schema>

ENTIRE STACK TRACE

/home/paden/Documents/LWP_Files/BIGDATA/5974412.pdf
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/Testcore3: Exception writing document id /home/paden/Documents/LWP_Files/BIGDATA/5974412.pdf to the index; possible analysis error.
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:556)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:233)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:225)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:174)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:139)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:153)
    at TikaSqlIndexer.Index(TikaSqlIndexer.java:238)
    at TikaSqlIndexer.main(TikaSqlIndexer.java:85)



--
View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212736.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Alessandro Benedetti <be...@gmail.com>.

Silly thing … Maybe the immense token was generated because you set
"string" as the field type for your text?
Could that be it?
Can you wipe out the index, set a proper type for your text, and index
again?
No worries about the incomplete stack trace,
we learn and do wrong things every day :)
Errare humanum est

Cheers
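
A small sketch of why the field type matters, in plain Java. An untokenized
"string" (solr.StrField) field indexes the whole value as a single term, so
the full text has to fit under the 32766-byte limit quoted in the error
message in this thread, while a tokenized field only needs each token to fit.
The class name is hypothetical, and whitespace splitting is only a stand-in
for a real tokenizer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check; it does not use Solr or Lucene classes.
class TermLengthCheck {
    // Limit taken from the error message in this thread.
    static final int MAX_TERM_BYTES = 32766;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // An untokenized field indexes the whole value as one term.
    static boolean fitsAsSingleTerm(String text) {
        return utf8Length(text) <= MAX_TERM_BYTES;
    }

    // Whitespace splitting stands in for a real tokenizer here.
    static int longestTokenBytes(String text) {
        int longest = 0;
        for (String token : text.trim().split("\\s+")) {
            longest = Math.max(longest, utf8Length(token));
        }
        return longest;
    }

    public static void main(String[] args) {
        // Stand-in for extracted PDF text: many short words.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8000; i++) {
            sb.append("word ");
        }
        String text = sb.toString();
        System.out.println("Total UTF-8 bytes:   " + utf8Length(text));
        System.out.println("Fits as one term:    " + fitsAsSingleTerm(text));
        System.out.println("Longest token bytes: " + longestTokenBytes(text));
    }
}
```

The sample text is 40000 bytes, so it does not fit as a single term, yet its
longest whitespace-separated token is only 4 bytes.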

2015-06-19 14:31 GMT+01:00 Paden <ru...@gmail.com>:

> Yeah I'm just gonna say hands down this was a totally bad question. My
> fault,
> mea culpa. I'm pretty new to working in an IDE environment and using a
> stack
> trace (I just finished my first year of CS at University and now I'm
> interning). I'm actually kind of embarrassed by how long it took me to
> realize I wasn't looking at the entire stack trace. Idiot moment of the
> week
> for sure. Thanks for the patience guys but when I looked at the entire
> stack
> trace it gave me this.
>
> Caused by: java.lang.IllegalArgumentException: Document contains at least
> one immense term in field="text" (whose UTF8 encoding is longer than the
> max
> length 32766), all of which were skipped.  Please correct the analyzer to
> not produce such terms.  The prefix of the first immense term is: '[84,
> 104,
> 101, 32, 73, 78, 76, 32, 105, 115, 32, 97, 32, 85, 46, 83, 46, 32, 68, 101,
> 112, 97, 114, 116, 109, 101, 110, 116, 32, 111]...', original message:
> bytes
> can be at most 32766 in length; got 44360
>         at
>
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
>         at
>
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
>         at
>
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
>         at
>
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
>         at
>
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
>         at
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1350)
>         at
>
> org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
>         at
>
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
>         ... 40 more
> Caused by:
> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes
> can be at most 32766 in length; got 44360
>         at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
>         at
> org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
>         at
>
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
>         ... 47 more
>
>
> And it took me all of two seconds to realize what had gone wrong. Now I'm
> just trying to figure out how to index the text content without truncating
> all the info or filtering it out entirely, thereby messing up my searching
> capabilities.




Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Paden <ru...@gmail.com>.

Yeah I'm just gonna say hands down this was a totally bad question. My fault,
mea culpa. I'm pretty new to working in an IDE environment and using a stack
trace (I just finished my first year of CS at University and now I'm
interning). I'm actually kind of embarrassed by how long it took me to
realize I wasn't looking at the entire stack trace. Idiot moment of the week
for sure. Thanks for the patience guys but when I looked at the entire stack
trace it gave me this. 

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[84, 104, 101, 32, 73, 78, 76, 32, 105, 115, 32, 97, 32, 85, 46, 83, 46, 32, 68, 101, 112, 97, 114, 116, 109, 101, 110, 116, 32, 111]...', original message: bytes can be at most 32766 in length; got 44360
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1350)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
	... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 44360
	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
	... 47 more


And it took me all of two seconds to realize what had gone wrong. Now I'm
just trying to figure out how to index the text content without truncating
it or filtering it out entirely, since either would mess up my search
capabilities.
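
For readers following along, here is a minimal sketch (not from the original
post) of what the exception is reporting. The "text" field in the first
message is declared with type="string", and Solr indexes a string field as
one untokenized term, so the whole extracted PDF text has to fit in 32766
UTF-8 bytes. Decoding the byte prefix from the message shows that the
"immense term" starts like ordinary document text, which fits that reading.
The class and method names below are hypothetical helpers for illustration;
only the limit and the byte values come from the stack trace.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or SolrJ.
final class TermLengthCheck {
    // Limit quoted in the exception message ("max length 32766").
    static final int MAX_TERM_BYTES = 32766;

    /** Turns the decimal byte values printed in the error back into text. */
    static String decodePrefix(int[] byteValues) {
        byte[] bytes = new byte[byteValues.length];
        for (int i = 0; i < byteValues.length; i++) {
            bytes[i] = (byte) byteValues[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** True if the value, indexed as a single untokenized term, is over the limit. */
    static boolean exceedsTermLimit(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        // Prefix copied from the exception message above.
        int[] prefix = {84, 104, 101, 32, 73, 78, 76, 32, 105, 115, 32, 97, 32, 85, 46,
                        83, 46, 32, 68, 101, 112, 97, 114, 116, 109, 101, 110, 116, 32, 111};
        System.out.println(decodePrefix(prefix)); // prints "The INL is a U.S. Department o"
    }
}
```

A check like exceedsTermLimit(textHandler.toString()) before server.add(doc)
would only identify which documents trip the limit; it does not fix the
indexing. The usual approach, presumably, is a tokenized field type such as
text_general, applied to an index that has been rebuilt from scratch.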



--
View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212919.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Alessandro Benedetti <be...@gmail.com>.

I definitely agree with Erick: the stack trace you posted is, again, not
complete.
Here is an example of the same kind of problem reported with a complete,
meaningful stack trace:
"
Stack trace you provided:

org.apache.solr.common.SolrException: Exception writing document id 12345 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:870)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1024)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:693)
	…
------> Important stack trace follows !!
Caused by: java.lang.IllegalArgumentException: input AttributeSource must not be null
	at org.apache.lucene.util.AttributeSource.<init>(AttributeSource.java:94)
	at org.apache.lucene.analysis.TokenStream.<init>(TokenStream.java:106)
	at org.apache.lucene.analysis.TokenFilter.<init>(TokenFilter.java:33)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.<init>(FilteringTokenFilter.java:70)
	at org.apache.lucene.analysis.core.StopFilter.<init>(StopFilter.java:60)
	at org.apache.lucene.analysis.core.StopFilterFactory.create(StopFilterFactory.java:127)
	at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:67)
	at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:102)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180)
	at org.apache.lucene.document.Field.tokenStream(Field.java:554)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:597)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:222)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:240)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
	... 35 more
"


If you give us the complete stack trace, I am pretty sure we can help.
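
As a rough illustration of what "complete" means here: the useful part of a
Java stack trace is usually in the nested "Caused by" entries, not in the
outermost exception. A minimal sketch, assuming the exception object is
available in your own catch block (the class and method names are
hypothetical, and the messages in main are stand-ins modeled on this thread):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or SolrJ.
final class CauseChain {
    /** Collects the exception and every nested cause, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        // Stop at the end of the chain; the size cap guards against a cyclic chain.
        for (Throwable cur = t; cur != null && lines.size() < 50; cur = cur.getCause()) {
            lines.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in exceptions for illustration only.
        Exception root = new IllegalArgumentException("bytes can be at most 32766 in length; got 44360");
        Exception outer = new RuntimeException("possible analysis error", root);
        for (String line : describe(outer)) {
            System.out.println(line);
        }
    }
}
```

Calling ex.printStackTrace() in the catch block prints the same chain with
all frames. Note that when the failure happens inside the Solr server, the
client-side exception may not carry the server's causes at all, in which
case the Solr log is the place to look.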


Cheers


2015-06-19 5:31 GMT+01:00 Erick Erickson <er...@gmail.com>:

> The stack trace is what gets returned to the client, right? It's often
> much more informative to see the Solr log output, the error message
> is often much more helpful there. By the time the exception bubbles
> up through the various layers vital information is sometimes not returned
> to the client in the error message.
>
> One precaution I would take since you've changed the schema is to
> _completely_ remove the index.
> 1> shut down Solr
> 2> rm -rf coreX/data
> 3> restart Solr.
> 4> try it again.
>
> Lucene doesn't really care at all whether a field gets indexed one way in
> one document and another way in the next document and occasionally
> having fields indexed different ways (string and text) in different
> documents
> at the same time confuses things.
>
> Best,
> Erick
>
> On Thu, Jun 18, 2015 at 10:31 AM, Paden <ru...@gmail.com> wrote:
> > Just rolling out a little bit more information as it is coming. I
> changed the
> > field type in the schema to text_general and that didn't change a thing.
> >
> > Another thing is that it's consistently submitting/not submitting the
> same
> > documents. I will run over it one time and it won't index a set of
> > documents. When I clear the index and run the program again it
> > submits/doesn't submit the same documents.
> >
> > And it will index certain PDF's it just won't index others. Which is
> weird
> > because I printed the strings that are submitted to Solr and the ones
> that
> > get submitted are really similar to the ones that aren't submitted.
> >
> > I can't post the actual strings for sensitivity reasons.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212757.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>




Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Erick Erickson <er...@gmail.com>.

The stack trace is what gets returned to the client, right? It's often
much more informative to look at the Solr log output; the error message
there is usually more helpful. By the time the exception bubbles
up through the various layers, vital information is sometimes missing from
the error message returned to the client.

One precaution I would take since you've changed the schema is to
_completely_ remove the index.
1> shut down Solr
2> rm -rf coreX/data
3> restart Solr.
4> try it again.

Lucene doesn't really care whether a field gets indexed one way in
one document and another way in the next, but having a field
indexed different ways (string and text) in different documents
at the same time occasionally confuses things.

Best,
Erick

On Thu, Jun 18, 2015 at 10:31 AM, Paden <ru...@gmail.com> wrote:
> Just rolling out a little bit more information as it is coming. I changed the
> field type in the schema to text_general and that didn't change a thing.
>
> Another thing is that it's consistently submitting/not submitting the same
> documents. I will run over it one time and it won't index a set of
> documents. When I clear the index and run the program again it
> submits/doesn't submit the same documents.
>
> And it will index certain PDF's it just won't index others. Which is weird
> because I printed the strings that are submitted to Solr and the ones that
> get submitted are really similar to the ones that aren't submitted.
>
> I can't post the actual strings for sensitivity reasons.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212757.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

Posted by Paden <ru...@gmail.com>.

Just rolling out a little more information as it comes in. I changed the
field type in the schema to text_general and that didn't change a thing.

Another thing is that it's consistent about which documents it submits.
I will run over it one time and it won't index a set of
documents. When I clear the index and run the program again, it
submits/doesn't submit the same documents.

It will index certain PDFs but not others, which is weird
because I printed the strings that are submitted to Solr and the ones that
get submitted are really similar to the ones that aren't submitted.

I can't post the actual strings for sensitivity reasons. 



--
View this message in context: http://lucene.472066.n3.nabble.com/Error-when-submitting-PDF-to-Solr-w-text-fields-using-SolrJ-tp4212704p4212757.html
Sent from the Solr - User mailing list archive at Nabble.com.