Posted to solr-user@lucene.apache.org by Konrad Lötzsch <ko...@antibodies-online.com> on 2012/12/21 13:25:15 UTC

Can DataImportHandler ignore Missing Tags in XML?

Hi,

we are trying to import MEDLINE into a Solr core. Everything works fine,
except that certain tags are sometimes missing from the MEDLINE XML
files. If we define them in the data-config.xml file for our core, the
DataImportHandler throws an exception for every tag that is missing:

SEVERE: Exception while solr commit.
java.lang.IllegalArgumentException: no such field ChemicalNameOfSubstance
  at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
  at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
  at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
  at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
  at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
  at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
  at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
  at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
  at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
  at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
  at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
  at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
  at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
  at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
  at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
  at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
  at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
  at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
  at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
  at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)

Can we tell the DataImportHandler that it should write a default value 
if the tag is missing?

Here is our data-config.xml (for simplicity, most of the lines that work
are omitted):

<dataConfig>
  <dataSource name="medline" type="FileDataSource" encoding="UTF-8" />
  <document name="MedlineCitations">
    <entity name="file" processor="FileListEntityProcessor" baseDir="/home/" fileName=".*xml" recursive="true" rootEntity="false" dataSource="null">
      <entity name="MedlineCitation"
              processor="XPathEntityProcessor"
              stream="true"
              forEach="/MedlineCitationSet/MedlineCitation"
              url="${file.fileAbsolutePath}">

        <field column="PMID" xpath="/MedlineCitationSet/MedlineCitation/PMID" />

        <field column="CreationYear" xpath="/MedlineCitationSet/MedlineCitation/DateCreated/Year" />
        <field column="CreationMonth" xpath="/MedlineCitationSet/MedlineCitation/DateCreated/Month" />
        <field column="CreationDay" xpath="/MedlineCitationSet/MedlineCitation/DateCreated/Day" />

        <!-- These cause DataImportHandler exceptions!
        <field column="RevisionYear" xpath="/MedlineCitationSet/MedlineCitation/DateRevised/Year" />
        <field column="RevisionMonth" xpath="/MedlineCitationSet/MedlineCitation/DateRevised/Month" />
        <field column="RevisionDay" xpath="/MedlineCitationSet/MedlineCitation/DateRevised/Day" />
        -->
      </entity>
    </entity>
  </document>
</dataConfig>




With kind regards,
Konrad Lötzsch.

-- 
*Konrad Loetzsch*
Dipl. Math

*antibodies-online GmbH*
Schloß-Rahe-Str. 15
DE-52072 Aachen

Tel.: +49(0)241 9367-2544
konrad.loetzsch@antibodies-online.com 
<ma...@antibodies-online.com>
www.antikoerper-online.de <http://www.antikoerper-online.de> | 
www.antibodies-online.com <http://www.antibodies-online.com>

Registered at the Aachen District Court (Amtsgericht Aachen) under HRB 13919
Managing Directors: Dr. Tim Hiddemann, Dr. Andreas Kessell

RE: Can DataImportHandler ignore Missing Tags in XML?

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Judging from your stack trace, your XML document has a value for "ChemicalNameOfSubstance", yet no such field is defined in schema.xml.  Is this your problem?

The easiest way to get Solr to ignore extra fields that you do not wish to index or store is to add a "catch-all" dynamic field to your schema.xml:

<fields>
...all your fields go here...

<!--last line at the end of "fields" -->
<dynamicField name="*" type="string" indexed="false" stored="false" multiValued="true" />
</fields>

This tells Solr to accept any field name that isn't explicitly defined and simply ignore it.  This overrides Solr's default behavior of throwing an exception in such cases.
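
On the default-value part of the question, one possible schema-side approach is sketched below. Fields in schema.xml accept a "default" attribute, which Solr applies when a document arrives without a value for that field. The field names are taken from the commented-out lines of the data-config.xml in this thread; the "string" type and the placeholder value "unknown" are assumptions for illustration only and should be adapted to the real schema.

```xml
<fields>
  <!-- ...existing field definitions stay as they are... -->

  <!-- Assumption: optional MEDLINE tags declared explicitly.
       "default" supplies a value when the source XML lacks the tag.
       The type and the value "unknown" are placeholders. -->
  <field name="RevisionYear"  type="string" indexed="true" stored="true" default="unknown" />
  <field name="RevisionMonth" type="string" indexed="true" stored="true" default="unknown" />
  <field name="RevisionDay"   type="string" indexed="true" stored="true" default="unknown" />

  <!-- The catch-all dynamic field stays last; explicitly declared
       fields take precedence over dynamic field patterns. -->
  <dynamicField name="*" type="string" indexed="false" stored="false" multiValued="true" />
</fields>
```

With this in place, a citation that has no DateRevised element would be indexed with "unknown" in the three revision fields instead of leaving them empty. Whether a placeholder is preferable to simply omitting the value depends on how the fields are queried.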

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

