Posted to solr-user@lucene.apache.org by Konrad Lötzsch <ko...@antibodies-online.com> on 2012/12/21 13:25:15 UTC
Can DataImportHandler ignore Missing Tags in XML?
Hi,
we are trying to import Medline into a Solr core. Everything works fine,
except that certain tags are sometimes missing from the Medline XML files.
If we define them in the data-config.xml file for our core, the
DataImportHandler throws an exception for every missing tag:
SEVERE: Exception while solr commit.
java.lang.IllegalArgumentException: no such field ChemicalNameOfSubstance
    at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
    at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
    at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
    at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
    at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
    at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
    at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
    at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
    at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)
Can we tell the DataImportHandler that it should write a default value
if the tag is missing?
Here is our data-config.xml (for simplicity, most of the lines that work
are omitted):
<dataConfig>
  <dataSource name="medline" type="FileDataSource" encoding="UTF-8" />
  <document name="MedlineCitations">
    <entity name="file" processor="FileListEntityProcessor" baseDir="/home/" fileName=".*xml" recursive="true" rootEntity="false" dataSource="null">
      <entity name="MedlineCitation"
              processor="XPathEntityProcessor"
              stream="true"
              forEach="/MedlineCitationSet/MedlineCitation"
              url="${file.fileAbsolutePath}"
              >
        <field column="PMID" xpath="/MedlineCitationSet/MedlineCitation/PMID" />
        <field column="CreationYear" xpath="/MedlineCitationSet/MedlineCitation/DateCreated/Year" />
        <field column="CreationMonth" xpath="/MedlineCitationSet/MedlineCitation/DateCreated/Month" />
        <field column="CreationDay" xpath="/MedlineCitationSet/MedlineCitation/DateCreated/Day" />
        <!-- These cause DataImportHandler exceptions!
        <field column="RevisionYear" xpath="/MedlineCitationSet/MedlineCitation/DateRevised/Year" />
        <field column="RevisionMonth" xpath="/MedlineCitationSet/MedlineCitation/DateRevised/Month" />
        <field column="RevisionDay" xpath="/MedlineCitationSet/MedlineCitation/DateRevised/Day" />
        -->
      </entity>
    </entity>
  </document>
</dataConfig>
With kind regards,
Konrad Lötzsch.
--
*Konrad Loetzsch*
Dipl. Math
*antibodies-online GmbH*
Schloß-Rahe-Str. 15
DE-52072 Aachen
Tel.: +49(0)241 9367-2544
konrad.loetzsch@antibodies-online.com
www.antikoerper-online.de | www.antibodies-online.com
Registered with the Aachen District Court (Amtsgericht Aachen) under HRB 13919
Managing Directors: Dr. Tim Hiddemann, Dr. Andreas Kessell
RE: Can DataImportHandler ignore Missing Tags in XML?
Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Your stack trace suggests that your XML documents supply a value for "ChemicalNameOfSubstance", yet no such field is defined in schema.xml. Is this your problem?
The easiest way to get Solr to ignore extra fields that you do not wish to index or store is to add a "catch-all" dynamic field to your schema.xml:
<fields>
...all your fields go here...
<!--last line at the end of "fields" -->
<dynamicField name="*" type="string" indexed="false" stored="false" multiValued="true" />
</fields>
This tells Solr to accept any field name that isn't explicitly defined, but to ignore its value rather than index or store it. It overrides Solr's default behavior of throwing an exception in such cases.
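If a default value is wanted instead of ignoring the field, one option is to declare the field explicitly in schema.xml and give it a "default" attribute. The following is only a sketch: it assumes a "string" field type already exists in the schema, and the value "unknown" is a placeholder chosen for illustration, not something taken from the original configuration.

```xml
<fields>
  <!-- ...all your other fields go here... -->
  <!-- Assumption: explicit definition for the field named in the stack trace.
       Solr fills in the "default" value when a document arrives without one.
       "unknown" is a hypothetical placeholder value. -->
  <field name="ChemicalNameOfSubstance" type="string" indexed="true" stored="true" default="unknown" />
</fields>
```

Explicitly defined fields take precedence over a catch-all dynamic field, so both can coexist in the same schema.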
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311