Posted to solr-user@lucene.apache.org by Fergus McMenemie <fe...@twig.me.uk> on 2009/01/19 11:44:38 UTC

Can't get HTMLStripTransformer's stripHTML to work in DIH.

Hello all,

I have the following DIH data-config.xml file. Adding
HTMLStripTransformer and the associated stripHTML on the
para field seems to have broken things. I am using a nightly
build from 12-jan-2009.

The /record/sect1/para contains HTML sub tags which need
to be discarded. Is my use of stripHTML correct?

<dataConfig>
  <dataSource name="myfilereader" type="FileDataSource"/>
  <document>
    <entity name="jcurrent"
            processor="FileListEntityProcessor"
            fileName=".*xml"
            newerThan="'NOW-1000DAYS'"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir="/Volumes/spare/ts/jxml/data/news/groups">

      <entity name="x"
              dataSource="myfilereader"
              processor="XPathEntityProcessor"
              url="${jcurrent.fileAbsolutePath}"
              stream="false"
              forEach="/record"
              transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">

        <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
        <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="fileAbsePath"/>
        <field column="title"    xpath="/record/title" />
        <field column="para"     xpath="/record/sect1/para" stripHTML="true" />
        <field column="subject"  xpath="/record/metadata/subject[@qualifier='fullTitle']" />
        <field column="pubname"  xpath="/record/metadata/subject[@qualifier='publication']" />
        <field column="pubdate"  xpath="/record/metadata/date[@qualifier='pubDate']" dateTimeFormat="yyyyMMdd" />
      </entity>
    </entity>
  </document>
</dataConfig>
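
As a rough picture of what stripHTML="true" asks for on a column such as para, which may hold several values per record, here is a minimal Java sketch. It is not the DIH HTMLStripTransformer source: the class StripHtmlSketch, its method names, and the crude regex-based tag removal are all assumptions made for illustration. It skips null entries, because passing null to java.io.StringReader (or calling a method on it) throws a NullPointerException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; hypothetical class, not part of Solr or DIH. */
class StripHtmlSketch {

    /** Crude tag removal with a regex. Real HTML stripping is more involved. */
    static String strip(String value) {
        if (value == null) {
            return null; // nothing to strip
        }
        return value.replaceAll("<[^>]*>", "");
    }

    /** Accepts a column value that is either a single value or a List of values. */
    static Object stripColumn(Object columnValue) {
        if (columnValue instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object v : (List<?>) columnValue) {
                // Skip null entries rather than handing them to the stripper.
                if (v != null) {
                    out.add(strip(v.toString()));
                }
            }
            return out;
        }
        return columnValue == null ? null : strip(columnValue.toString());
    }

    public static void main(String[] args) {
        // Hypothetical para values; the null stands for an empty match.
        List<String> paras = Arrays.asList("<p>First <b>para</b></p>", null, "plain text");
        System.out.println(stripColumn(paras)); // prints [First para, plain text]
    }
}
```

Dropping null entries is only one possible choice; keeping them as nulls or as empty strings would preserve the positions of the values.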

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: Can't get HTMLStripTransformer's stripHTML to work in DIH.

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>Hi Fergus,
>
>It seems a field it is expecting is missing from the XML.

You mean there is some field in the document we are indexing
that is missing?

><field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
><field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
>sourceColName="*fileAbsePath*"/>
>
>I guess "fileAbsePath" is a typo? Can you check if that is the cause?
Well spotted. I had made a mess of sanitizing the config file I sent
to you. In future I will make sure that what I am working with matches
what I send to the list. However, there is no typo in the underlying
file; at least not on that line :-)



Re: Can't get HTMLStripTransformer's stripHTML to work in DIH.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Hi Fergus,

It seems a field it is expecting is missing from the XML.

<field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
sourceColName="*fileAbsePath*"/>

I guess "fileAbsePath" is a typo? Can you check if that is the cause?
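
To make the concern concrete, here is a minimal Java sketch of what a lookup under a misspelled source column name yields. It is not DIH's RegexTransformer code: the class SourceColumnSketch, the method regexField, and the file name example.xml are hypothetical, and the sketch makes no claim about how DIH itself reacts to the missing value.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only; hypothetical class, not part of Solr or DIH. */
class SourceColumnSketch {

    /**
     * Applies regex/replaceWith to the value stored under sourceColName.
     * Returns null when the row has no such column.
     */
    static String regexField(Map<String, Object> row, String sourceColName,
                             String regex, String replaceWith) {
        Object src = row.get(sourceColName);
        if (src == null) {
            return null; // column name did not match anything in the row
        }
        return src.toString().replaceAll(regex, replaceWith);
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        // Hypothetical file path under the baseDir from the config.
        row.put("fileAbsPath", "/Volumes/spare/ts/jxml/data/news/groups/example.xml");

        String regex = "/Volumes/spare/ts/(.*)";
        // Correct column name: the prefix is removed.
        System.out.println(regexField(row, "fileAbsPath", regex, "$1"));
        // prints jxml/data/news/groups/example.xml

        // Misspelled column name: nothing is found.
        System.out.println(regexField(row, "fileAbsePath", regex, "$1"));
        // prints null
    }
}
```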


On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:

> Shalin
>
> Downloaded the nightly for 21jan and tried DIH again. It's better but
> still broken. Dozens of embedded tags are stripped from documents
> but it now fails every few documents for no reason I can see. Manually
> removing embedded tags causes a given problem document to be indexed,
> only to have it fail on one of the next few documents. I think the
> problem is still in stripHTML.
>
> Here is the traceback.
>
> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
> INFO: Server startup in 3377 ms
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
> INFO: Read dataimport.properties
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
> INFO: Starting Full Import
> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
> INFO: SolrDeletionPolicy.onInit: commits:num=2
>  commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>  commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
> INFO: last commit = 1232539612131
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
> SEVERE: Exception while processing: jc document : null
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
> Processing Document # 9
>        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>        ... 9 more
> Caused by: java.util.NoSuchElementException
>        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>        ... 10 more
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
> SEVERE: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
> Processing Document # 9
>        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>        ... 9 more
> Caused by: java.util.NoSuchElementException
>        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>        ... 10 more
> Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
>
>
>
> >Ah, it needs a null check for multi-valued fields. I've committed a fix to
> >trunk. The next nightly build should have it. You can check out and build
> >from the trunk if you need this immediately.
> >
> >On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie <fe...@twig.me.uk>
> wrote:
> >
> >> Hmmm,
> >>
> >> Just to clarify I retested the thing using the nightly as of today
> >> 18-jan-2009. The problem is still there and this traceback is from
> >> that nightly.
> >>
> >> >>This looks fine. Can you post the stack trace?
> >> >>
> >> >Yep, here is the juicy bit. Let me know if you need more.
> >> >
> >> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
> >> >INFO: Server startup in 2390 ms
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
> >> >INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
> >> >INFO: Read dataimport.properties
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
> >> >INFO: Starting Full Import
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
> >> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
> >> >INFO: SolrDeletionPolicy.onInit: commits:num=2
> >> > commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
> >> > commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
> >> >INFO: last commit = 1232363283059
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
> >> >WARNING: transformer threw error
> >> >java.lang.NullPointerException
> >> >       at java.io.StringReader.<init>(StringReader.java:33)
> >> >       at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >       at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >       at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >       at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >       at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >       at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >       at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
> >> >SEVERE: Exception while processing: janescurrent document : null
> >> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
> >> >       at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
> >> >       at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
> >> >       at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >       at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >       at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >       at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Caused by: java.lang.NullPointerException
> >> >       at java.io.StringReader.<init>(StringReader.java:33)
> >> >       at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >       at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >       at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >       ... 9 more
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
> >> >SEVERE: Full Import failed
> >> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
> >> >       at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
> >> >       at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
> >> >       at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >       at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >       at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >       at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Caused by: java.lang.NullPointerException
> >> >       at java.io.StringReader.<init>(StringReader.java:33)
> >> >       at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >       at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >       at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >       ... 9 more
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
> >> rollback
> >> >INFO: start rollback
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
> >> rollback
> >> >INFO: end_rollback



-- 
Regards,
Shalin Shekhar Mangar.

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

Posted by Fergus McMenemie <fe...@twig.me.uk>.
Shalin

Downloaded the nightly for 21-jan and tried DIH again. It's better but
still broken. Dozens of embedded tags are stripped from documents,
but it now fails every few documents for no reason I can see. Manually
removing the embedded tags lets a given problem document be indexed,
only to have it fail on one of the next few documents. I think the
problem is still in stripHTML.

Here is the traceback.

Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3377 ms
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13 
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
	commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
	commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232539612131
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: jc document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
	at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
	... 9 more
Caused by: java.util.NoSuchElementException
	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
	at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
	at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
	at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
	at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
	... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
	at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
	... 9 more
Caused by: java.util.NoSuchElementException
	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
	at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
	at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
	at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
	at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
	... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
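
In the trace above, the innermost cause is a java.util.NoSuchElementException raised from the StAX reader's next() (com.ctc.wstx.sr.BasicStreamReader) under XPathRecordReader; no HTMLStripTransformer frame appears. The thread does not establish why. One way to narrow it down is to read the same XML with the JDK's own StAX API outside DIH, always guarding next() with hasNext(). The following is only a sketch under that assumption: the class and method names are made up for illustration, and it takes the XML as a string so it is self-contained (for a real file, a FileReader could be passed instead).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Solr or DIH.
final class StaxCheck {

    // Returns true if the XML can be read through to the end of the document.
    static boolean parsesCleanly(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // Calling next() when hasNext() is false throws
            // NoSuchElementException, so always check hasNext() first.
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            // The parser rejected the input (for example mismatched tags).
            return false;
        }
    }

    public static void main(String[] args) {
        String sample =
            "<record><sect1><para>Some <b>bold</b> text</para></sect1></record>";
        System.out.println(parsesCleanly(sample));
    }
}
```

If a problem file passes a check like this, the file itself is probably well-formed and the failure is more likely in how the records are being streamed.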



>Ah, it needs a null check for multivalued fields. I've committed a fix to
>trunk. The next nightly build should have it. You can check out and build
>from the trunk if you need this immediately.

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Ah, it needs a null check for multivalued fields. I've committed a fix to
trunk. The next nightly build should have it. You can check out and build
from the trunk if you need this immediately.
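
The committed change itself is not shown in this thread. As a rough illustration of the kind of null check meant here, the sketch below strips tags from a value that may be a single string or a list with null entries, skipping nulls instead of passing them to a StringReader (which is where the NullPointerException in the trace originates). Everything in it is an assumption made for illustration: the class and method names are hypothetical, and the regex is a crude stand-in, not Solr's HTML stripping code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the HTMLStripTransformer source.
final class NullSafeStrip {

    // Crude tag removal for demonstration only (assumption, not Solr's stripper).
    static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    // Handles a single value or a multivalued (List) column, passing nulls through.
    static Object stripValue(Object value) {
        if (value == null) {
            return null; // nothing to strip
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                // The null check: a multivalued field may contain null entries.
                out.add(item == null ? null : stripTags(item.toString()));
            }
            return out;
        }
        return stripTags(value.toString());
    }

    public static void main(String[] args) {
        List<Object> para = Arrays.asList("Some <b>bold</b> text", null, "plain");
        System.out.println(stripValue(para));
    }
}
```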

-- 
Regards,
Shalin Shekhar Mangar.

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

Posted by Fergus McMenemie <fe...@twig.me.uk>.
Hmmm,

Just to clarify, I retested the thing using the nightly as of today
18-jan-2009. The problem is still there and this traceback is from
that nightly.

>>This looks fine. Can you post the stack trace?
>>
>Yep, here is the juicy bit. Let me know if you need more.
>
>Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
>INFO: Server startup in 2390 ms
>Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
>INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12 
>Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
>INFO: Read dataimport.properties
>Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
>INFO: Starting Full Import
>Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
>INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
>Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
>INFO: SolrDeletionPolicy.onInit: commits:num=2
>	commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
>	commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
>Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
>INFO: last commit = 1232363283059
>Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
>WARNING: transformer threw error
>java.lang.NullPointerException
>	at java.io.StringReader.<init>(StringReader.java:33)
>	at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>	at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>SEVERE: Exception while processing: janescurrent document : null
>org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
>	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>Caused by: java.lang.NullPointerException
>	at java.io.StringReader.<init>(StringReader.java:33)
>	at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>	at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>	... 9 more
>Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
>SEVERE: Full Import failed
>org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
>	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>Caused by: java.lang.NullPointerException
>	at java.io.StringReader.<init>(StringReader.java:33)
>	at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>	at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>	... 9 more
>Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>INFO: start rollback
>Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>INFO: end_rollback
>
>
>>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:
>>
>>> Hello all,
>>>
>>> I have the following DIH data-config.xml file. Adding
>>> HTMLStripTransformer and the associated stripHTML on the
>>> para tag seems to have broke things. I am using a nightly
>>> build from 12-jan-2009
>>>
>>> The /record/sect1/para contains HTML sub tags which need
>>> to be discarded. Is my use of stripHTML correct?
>>>
>>> <dataConfig>
>>>  <dataSource name="myfilereader" type="FileDataSource"/>
>>>  <document>
>>>     <entity name="jcurrent"
>>>        processor="FileListEntityProcessor"
>>>        fileName=".*xml"
>>>        newerThan="'NOW-1000DAYS'"
>>>        recursive="true"
>>>        rootEntity="false"
>>>        dataSource="null"
>>>        baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>>>
>>>        <entity name="x"
>>>           dataSource="myfilereader"
>>>           processor="XPathEntityProcessor"
>>>           url="${jcurrent.fileAbsolutePath}"
>>>           stream="false"
>>>           forEach="/record"
>>>           transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>>>
>>>           <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
>>>           <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="fileAbsePath"/>
>>>           <field column="title"    xpath="/record/title" />
>>>           <field column="para"     xpath="/record/sect1/para" stripHTML="true" />
>>>           <field column="subject"  xpath="/record/metadata/subject[@qualifier='fullTitle']"   />
>>>           <field column="pubname"  xpath="/record/metadata/subject[@qualifier='publication']" />
>>>           <field column="pubdate"  xpath="/record/metadata/date[@qualifier='pubDate']" dateTimeFormat="yyyyMMdd"   />
>>>           </entity>
>>>        </entity>
>>>     </document>
>>>  </dataConfig>
>>>
>>> --
>>>
>>> ===============================================================
>>> Fergus McMenemie               Email:fergus@twig.me.uk
>>> Techmore Ltd                   Phone:(UK) 07721 376021
>>>
>>> Unix/Mac/Intranets             Analyst Programmer
>>> ===============================================================
>>>
>>
>>
>>
>>-- 
>>Regards,
>>Shalin Shekhar Mangar.
>
>-- 
>
>===============================================================
>Fergus McMenemie               Email:fergus@twig.me.uk
>Techmore Ltd                   Phone:(UK) 07721 376021
>
>Unix/Mac/Intranets             Analyst Programmer
>===============================================================

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>This looks fine. Can you post the stack trace?
>
Yep, here is the juicy bit. Let me know if you need more.

Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 2390 ms
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12 
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
	commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
	commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232363283059
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
WARNING: transformer threw error
java.lang.NullPointerException
	at java.io.StringReader.<init>(StringReader.java:33)
	at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
	at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: janescurrent document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.NullPointerException
	at java.io.StringReader.<init>(StringReader.java:33)
	at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
	at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
	... 9 more
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.NullPointerException
	at java.io.StringReader.<init>(StringReader.java:33)
	at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
	at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
	at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
	... 9 more
Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
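
Reading the trace, the NullPointerException is raised inside the
java.io.StringReader constructor, called from
HTMLStripTransformer.stripHTML. That would be consistent with the
transformer being handed a null value for the column, for example if
a record has no /record/sect1/para, though that cause is only a guess
from the trace. The Java sketch below is not the Solr code: the class
NullSafeStripSketch and its helpers stripTags and readerRejects are
hypothetical, and the regex is a crude stand-in for real HTML
stripping. It only shows that StringReader rejects null and what a
null guard in front of it might look like.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Solr.
final class NullSafeStripSketch {

    // Crude stand-in for HTML stripping: drops anything that looks like a tag.
    // The null guard is the point of the sketch: a missing column is passed through.
    static String stripTags(String value) {
        if (value == null) {
            return null;
        }
        return value.replaceAll("<[^>]*>", "");
    }

    // Reports whether constructing a StringReader from the value throws a
    // NullPointerException, as seen at the top of the stack trace.
    static boolean readerRejects(String value) {
        try {
            new StringReader(value);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Assumed shape of a row: column name -> value, with "para" absent.
        Map<String, Object> row = new HashMap<>();
        row.put("title", "Some <i>title</i>");

        Object para = row.get("para"); // null, because the column is missing
        System.out.println("StringReader rejects null: " + readerRejects((String) para));
        System.out.println("guarded para: " + stripTags((String) para));
        System.out.println("guarded title: " + stripTags((String) row.get("title")));
    }
}
```

If the guess is right, records that lack the para element would trip
the transformer while records that have it would not.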


>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:
>
>> Hello all,
>>
>> I have the following DIH data-config.xml file. Adding
>> HTMLStripTransformer and the associated stripHTML on the
>> para tag seems to have broke things. I am using a nightly
>> build from 12-jan-2009
>>
>> The /record/sect1/para contains HTML sub tags which need
>> to be discarded. Is my use of stripHTML correct?
>>
>> <dataConfig>
>>  <dataSource name="myfilereader" type="FileDataSource"/>
>>  <document>
>>     <entity name="jcurrent"
>>        processor="FileListEntityProcessor"
>>        fileName=".*xml"
>>        newerThan="'NOW-1000DAYS'"
>>        recursive="true"
>>        rootEntity="false"
>>        dataSource="null"
>>        baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>>
>>        <entity name="x"
>>           dataSource="myfilereader"
>>           processor="XPathEntityProcessor"
>>           url="${jcurrent.fileAbsolutePath}"
>>           stream="false"
>>           forEach="/record"
>>           transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>>
>>           <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
>>           <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="fileAbsePath"/>
>>           <field column="title"    xpath="/record/title" />
>>           <field column="para"     xpath="/record/sect1/para" stripHTML="true" />
>>           <field column="subject"  xpath="/record/metadata/subject[@qualifier='fullTitle']"   />
>>           <field column="pubname"  xpath="/record/metadata/subject[@qualifier='publication']" />
>>           <field column="pubdate"  xpath="/record/metadata/date[@qualifier='pubDate']" dateTimeFormat="yyyyMMdd"   />
>>           </entity>
>>        </entity>
>>     </document>
>>  </dataConfig>
>>
>> --
>>
>> ===============================================================
>> Fergus McMenemie               Email:fergus@twig.me.uk
>> Techmore Ltd                   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets             Analyst Programmer
>> ===============================================================
>>
>
>
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
This looks fine. Can you post the stack trace?

On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:

> Hello all,
>
> I have the following DIH data-config.xml file. Adding
> HTMLStripTransformer and the associated stripHTML on the
> para tag seems to have broke things. I am using a nightly
> build from 12-jan-2009
>
> The /record/sect1/para contains HTML sub tags which need
> to be discarded. Is my use of stripHTML correct?
>
> <dataConfig>
>  <dataSource name="myfilereader" type="FileDataSource"/>
>  <document>
>     <entity name="jcurrent"
>        processor="FileListEntityProcessor"
>        fileName=".*xml"
>        newerThan="'NOW-1000DAYS'"
>        recursive="true"
>        rootEntity="false"
>        dataSource="null"
>        baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>
>        <entity name="x"
>           dataSource="myfilereader"
>           processor="XPathEntityProcessor"
>           url="${jcurrent.fileAbsolutePath}"
>           stream="false"
>           forEach="/record"
>           transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>
>           <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
>           <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="fileAbsePath"/>
>           <field column="title"    xpath="/record/title" />
>           <field column="para"     xpath="/record/sect1/para" stripHTML="true" />
>           <field column="subject"  xpath="/record/metadata/subject[@qualifier='fullTitle']"   />
>           <field column="pubname"  xpath="/record/metadata/subject[@qualifier='publication']" />
>           <field column="pubdate"  xpath="/record/metadata/date[@qualifier='pubDate']" dateTimeFormat="yyyyMMdd"   />
>           </entity>
>        </entity>
>     </document>
>  </dataConfig>
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:fergus@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>



-- 
Regards,
Shalin Shekhar Mangar.