Posted to solr-user@lucene.apache.org by Chantal Ackermann <c....@it-agenten.com> on 2012/07/20 16:44:01 UTC

NumberFormatException while indexing TextField with LengthFilter and then copying to tfloat

Hi all,

I'm trying to index optional float values from an XML input file, and I'm having trouble avoiding a NumberFormatException (NFE).
I'm using Solr 3.6.



Index input:
- XML using DataImportHandler with XPathProcessor

Data:
Optional, Float, CDATA like: <estimated_hours>2.0</estimated_hours> or <estimated_hours/>

Original Problem:
Empty values would cause a NumberFormatException when being loaded directly into a "tfloat" type field.

Processing chain (to avoid NFE):
via XPath loaded into a field of type text with a trim and length filter, then via copyField directive into the tfloat type field

data-config.xml:
<field column="s_estimated_hours" xpath="/issues/issue/estimated_hours" />

schema.xml:
<types>...
		<fieldtype name="text_not_empty" class="solr.TextField">
			<analyzer>
				<tokenizer class="solr.KeywordTokenizerFactory" />
				<filter class="solr.TrimFilterFactory" />
				<filter class="solr.LengthFilterFactory" min="1" max="20" />
			</analyzer>
		</fieldtype>
</types>

<fields>...
		<field name="estimated_hours" type="tfloat" indexed="true" stored="true" required="false" />
		<field name="s_estimated_hours" type="text_not_empty" indexed="false" stored="false" />
</fields>

	<copyField source="s_estimated_hours" dest="estimated_hours" />

Problem:
Well, yet another NFE. But this time reported on the text field "s_estimated_hours":

WARNUNG: Error creating document : SolrInputDocument[{id=id(1.0)={2930}, s_estimated_hours=s_estimated_hours(1.0)={}}]
org.apache.solr.common.SolrException: ERROR: [doc=2930] Error adding field 's_estimated_hours'=''
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:333)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
	at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
	at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:66)
	at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:723)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.NumberFormatException: empty String
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:992)
	at java.lang.Float.parseFloat(Float.java:422)
	at org.apache.solr.schema.TrieField.createField(TrieField.java:410)
	at org.apache.solr.schema.FieldType.createFields(FieldType.java:289)
	at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:107)
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:312)
	... 11 more


It looks as if the empty value, which should not make it through the LengthFilter of "s_estimated_hours", is copied to the tfloat field "estimated_hours" anyway. How can I avoid this? Or is there another way to make the indexer ignore empty values when creating the tfloat fields? It would help if it at least created the document with the other values. (onError="continue" does not help, as this is only a warning; I've tried it.)


BTW: I did try with the XPath that should only select those nodes with text: /issues/issue/estimated_hours[text()]
The result was that no values would make it into the tfloat fields while all documents would be indexed without warnings or errors. (I discarded this option thinking that the xpath was not correctly evaluated.)


Thank you for any suggestions!
Chantal

Re: NumberFormatException while indexing TextField with LengthFilter and then copying to tfloat

Posted by Chantal Ackermann <c....@it-agenten.com>.
Here are the working solutions for:


3.6.1 (or lower probably)
****************************

via ScriptTransformer in data-config.xml:

		function prepareData(row) {
			var cols = new java.util.ArrayList();
			cols.add("spent_hours");
			cols.add("estimated_hours");
			cols.add("story_points");
			cols.add("pos");
			for (var i=0; i<cols.size(); i++) {
				var no = row.get(cols.get(i));
				if (no != null && no.trim().length() == 0) {
					row.remove(cols.get(i));
				}
			}
			return row;
		}

In the XPathEntityProcessor, add the ScriptTransformer:
 transformer="script:prepareData,…"

XPATHs:

			<field column="spent_hours"     xpath="/issues/issue/spent_hours" />
			<field column="estimated_hours" xpath="/issues/issue/estimated_hours" />
			<field column="story_points"    xpath="/issues/issue/story_points" />
			<field column="pos"             xpath="/issues/issue/position" />

All of these fields are of type tfloat, required="false". They will only get a value if it is not empty or null.
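
To check the blank-value logic of the ScriptTransformer function above outside of Solr, a minimal standalone sketch could look like the following. It is an assumption-laden variant, not the script that runs inside DataImportHandler: a plain JavaScript Map stands in for the java.util.Map row, and NUMERIC_COLUMNS is a name introduced here for illustration.

```javascript
// Columns that map to optional tfloat fields (same list as in the script above).
// NUMERIC_COLUMNS is a hypothetical name used only in this sketch.
const NUMERIC_COLUMNS = ["spent_hours", "estimated_hours", "story_points", "pos"];

// Plain-JavaScript variant of prepareData. `row` is a Map that stands in for
// the java.util.Map which DataImportHandler hands to a ScriptTransformer.
function prepareData(row) {
  for (const col of NUMERIC_COLUMNS) {
    const value = row.get(col);
    // Drop the column if a value is present but blank after trimming.
    if (value != null && String(value).trim().length === 0) {
      row.delete(col);
    }
  }
  return row;
}

// Example: the empty estimated_hours column is removed, the others are kept.
const row = prepareData(new Map([
  ["id", "2930"],
  ["estimated_hours", ""],
  ["spent_hours", " 2.0 "],
]));
console.log([...row.keys()]); // remaining keys: id, spent_hours
```

Columns that are missing or null are left untouched, as in the original function; only present-but-blank values are removed.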



4.0-ALPHA
**************

No ScriptTransformer required, XPATH as above, same field type, required="false".

In the dataimporthandler configuration section in solrconfig.xml specify:

	<updateRequestProcessorChain name="emptyFieldChain">
		<processor class="solr.RemoveBlankFieldUpdateProcessorFactory" />
		<processor class="solr.LogUpdateProcessorFactory" />
		<processor class="solr.RunUpdateProcessorFactory" />
	</updateRequestProcessorChain>

	<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
		<lst name="defaults">
			<str name="update.chain">emptyFieldChain</str>
			<str name="config">data-config.xml</str>
			<str name="clean">true</str>
			<str name="commit">true</str>
			<str name="optimize">true</str>
		</lst>
	</requestHandler>


Re: NumberFormatException while indexing TextField with LengthFilter and then copying to tfloat

Posted by Chantal Ackermann <c....@it-agenten.com>.
Hi Hoss,

thank you for the quick response and the explanations!

> My suggestion would be to modify the XPath expression you are using to 
> pull data out of your original XML files and ignore  "<estimated_hours/>"
> 

I don't think this is possible. That would require text() in the XPath, which the XPathRecordReader does not handle. I've checked the code as well, and the JavaDoc does not list this possibility either. I've tried these patterns:

/issues/issue/estimated_hours[text()]
/issues/issue/estimated_hours/text()

No value at all will be added for that field for any of the documents (including those that do have a value in the XML).

> Alternatively: there are some new UpdateProcessors available in 4.0 that 
> let you easily prune field values based on various criteria (update 
> processors happen well before copyField)...
> 
> http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html

Thanks for pointing me to it. I've switched to 4.0.0-ALPHA (hoping the ALPHA doesn't show itself too often ;-) ).

For anyone interested, my DataImportHandler Setup in solrconfig.xml now reads:

	<updateRequestProcessorChain name="emptyFieldChain">
		<processor class="solr.RemoveBlankFieldUpdateProcessorFactory" />
	</updateRequestProcessorChain>

	<requestHandler name="/dataimport"
		class="org.apache.solr.handler.dataimport.DataImportHandler">
		<lst name="defaults">
			<str name="update.chain">emptyFieldChain</str>
			<str name="config">data-config.xml</str>
			<str name="clean">true</str>
			<str name="commit">true</str>
			<str name="optimize">true</str>
		</lst>
	</requestHandler>

Works as expected!

And kudos to those working on the admin frontend, as well! The new admin is indeed slick!



> But i can certainly understand the confusion, i've opened SOLR-3657 to try 
> and improve on this.  Ideally the error message should make it clear that 
> the "value" from "source" field was copied to "dest" field which then 
> encountered "error"
> 

Thank you! Good Exception messages are certainly helpful!

Chantal


Re: NumberFormatException while indexing TextField with LengthFilter and then copying to tfloat

Posted by Chris Hostetter <ho...@fucit.org>.
: Processing chain (to avoid NFE): via XPath loaded into a field of type 
: text with a trim and length filter, then via copyField directive into 
: the tfloat type field

The root of the problem you are seeing is that copyField directives are 
applied to the *raw* field values -- the analyzer used on your "source" 
field won't have any effect on the values given to your "dest" field.
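
As a rough illustration of that behaviour, here is a toy model in JavaScript. It is only a sketch under stated assumptions: toySchema, buildDocument and parseFloatStrict are hypothetical names, not Solr APIs, and the model merely mimics the idea that the destination field receives the raw source value before any analyzer could filter it.

```javascript
// Toy model only: none of these names are Solr APIs.

// Rough stand-in for Java's Float.parseFloat, which rejects blank input.
function parseFloatStrict(raw) {
  if (raw.trim() === "") throw new Error("NumberFormatException: empty String");
  const n = Number(raw);
  if (Number.isNaN(n)) throw new Error("NumberFormatException: " + raw);
  return n;
}

const toySchema = {
  // Conversion applied to each field while the document is being built.
  convert: {
    s_estimated_hours: (raw) => raw,   // text field: its analyzer is not involved here
    estimated_hours: parseFloatStrict, // tfloat: the raw value is parsed immediately
  },
  copyFields: { s_estimated_hours: ["estimated_hours"] },
};

function buildDocument(input, schema) {
  const doc = {};
  for (const [name, raw] of Object.entries(input)) {
    doc[name] = schema.convert[name](raw);
    // copyField hands the *raw* value to each destination field.
    for (const dest of schema.copyFields[name] || []) {
      try {
        doc[dest] = schema.convert[dest](raw);
      } catch (e) {
        // The error names the field the client submitted, not the destination.
        throw new Error(`Error adding field '${name}'='${raw}': ${e.message}`);
      }
    }
  }
  return doc;
}

console.log(buildDocument({ s_estimated_hours: "2.0" }, toySchema));
try {
  buildDocument({ s_estimated_hours: "" }, toySchema);
} catch (e) {
  console.log(e.message);
}
```

In this model the length filter never gets a chance to drop the empty string, so the float parse fails and the failure is reported against the source field.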

My suggestion would be to modify the XPath expression you are using to 
pull data out of your original XML files and ignore  "<estimated_hours/>"

Alternatively: there are some new UpdateProcessors available in 4.0 that 
let you easily prune field values based on various criteria (update 
processors happen well before copyField)...

http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html


: Problem:
: Well, yet another NFE. But this time reported on the text field "s_estimated_hours":

I believe this is intentional, but i can understand how it might be 
confusing.

I think the point here is that since the field submitted by the client was 
named "s_estimated_hours" that's the field used in the error reported back 
to the client when something goes wrong with the copyField -- if the error 
message referred to "estimated_hours" the client may not have any idea 
why/where that field came from.

But i can certainly understand the confusion, i've opened SOLR-3657 to try 
and improve on this.  Ideally the error message should make it clear that 
the "value" from "source" field was copied to "dest" field which then 
encountered "error"

: 
: WARNUNG: Error creating document : SolrInputDocument[{id=id(1.0)={2930}, s_estimated_hours=s_estimated_hours(1.0)={}}]
: org.apache.solr.common.SolrException: ERROR: [doc=2930] Error adding field 's_estimated_hours'=''
: 	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:333)
	...
: Caused by: java.lang.NumberFormatException: empty String
: 	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:992)
: 	at java.lang.Float.parseFloat(Float.java:422)
: 	at org.apache.solr.schema.TrieField.createField(TrieField.java:410)
: 	at org.apache.solr.schema.FieldType.createFields(FieldType.java:289)
: 	at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:107)
: 	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:312)
: 	... 11 more


-Hoss