Posted to solr-user@lucene.apache.org by Daniel Cohen <da...@gmail.com> on 2009/09/10 21:52:19 UTC
Trouble Indexing HTML Files
Hi there,

I'm trying to get the DataImportHandler working to recursively parse the
content of a root directory, which contains several other directories beneath
it. The indexing seems to encounter errors with the DOCTYPE tag in my
source files.

I've provided my schema.xml with the appropriate fields, and I've added the
dataimport requestHandler to solrconfig.xml. Does anyone know what I am
doing wrong, or perhaps a better way to attempt this?

dataconfig.xml:
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="file"
            processor="FileListEntityProcessor"
            baseDir="exampledocs/dylan"
            fileName=".*htm"
            recursive="true"
            rootEntity="false"
            dataSource="null">
      <entity name="song"
              processor="XPathEntityProcessor"
              forEach="/html"
              transformer="HTMLStripTransformer"
              url="${file.fileAbsolutePath}">
        <field column="name" xpath="//h1[@class='songtitle']"/>
        <field column="album" xpath="//a[@class='recordlink']"/>
        <field column="body" xpath="//body" stripHTML="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
Stack trace:
java.lang.RuntimeException: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:226)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:163)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:311)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
Caused by: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
    ... 10 more
Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
    at java.net.URL.openStream(URL.java:1007)
    at com.ctc.wstx.util.URLUtil.optimizedStreamFromURL(URLUtil.java:113)
    at com.ctc.wstx.io.DefaultInputResolver.sourceFromURL(DefaultInputResolver.java:256)
    at com.ctc.wstx.io.DefaultInputResolver.resolveEntity(DefaultInputResolver.java:96)
    at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:468)
    at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
    at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
    ... 13 more
Sample .htm file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Hazel</title>
<link rel="stylesheet" type="text/css" href="../css/general.css" />
</head>
<body>
<h1 class="songtitle">Hazel</h1>
<p>Words and music Bob Dylan<br />
Released on <a class="recordlink" href="index.htm">Planet Waves</a>
(1974)<br />
Tabbed by Eyolf Østrem</p>
<p>The song could equally well be played with C chords and a capo on the
4th fret. Such a version is appended at the end.</p>
<p>The intro is played rather freely (which is a nice way of saying that
they aren't exactly tight...) – and with both a bass and a guitar. The
tab below is just a suggestion of an approximation.</p>
<hr />
<pre class="tab">
E B A E/g# F#m E
|--------7-------|--------2-------|----------------|
|9-----9---9-----|4-----4---4-----|2---0-----------|
|9---9-------9---|4---4-------4---|2---1---2---1---|
|9---------------|4---------------|2---2---4---2---|
|7---------------|2---------------|0-------4---2---|
|----------------|----------------|----4---2---0---|
</pre>
<pre class="verse">
E G#
Hazel, dirty-blonde hair
A F#7
I wouldn't be ashamed to be seen with you anywhere.
E G# C#m E/b A
You got something I want plenty of
E B A G#m F#m E
Ooh, a little touch of your love.
Hazel, stardust in your eye
You're goin' somewhere and so am I.
I'd give you the sky high above
Ooh, for a little touch of your love.
</pre>
<pre class="bridge">
G# C#m
Oh no, I don't need any reminder
G# C#m
To know how much I really care
F#
But it's just making me blinder and blinder
B A G#m F#m
Because I'm up on a hill and still you're not there.
</pre>
<pre class="verse">
Hazel, you called and I came,
Now don't make me play this waiting game.
You've got something I want plenty of
Ooh, a little touch of your love.
</pre>
<hr />
<h2 class="songversion">Version with capo on 4th fret</h2>
<pre class="verse">
C E
Hazel, dirty-blonde hair
F D7
I wouldn't be ashamed to be seen with you anywhere.
C E Am /g F
You got something I want plenty of
C G F Em Dm C
Ooh, a little touch of your love.
...
</pre>
<pre class="bridge">
E Am
Oh no, I don't need any reminder
E Am
To know how much I really care
But it's just making me blinder and blinder
G F Em Dm
Because I'm up on a hill and still you're not there.
</pre>
</body></html>
Re: Trouble Indexing HTML Files
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen <
daniel.michael.cohen@gmail.com> wrote:
> Hi there,
>
> I'm trying to get the DataImportHandler working to recursively parse the
> content of a root directory, which contains several other directories
> beneath it. The indexing seems to encounter errors with the DOCTYPE tag
> in my source files.
>
> Stack trace:
>
> java.lang.RuntimeException: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:226)
>
In trunk, DataImportHandler ignores DTDs [1]. If you are using Solr 1.3,
there is unfortunately no workaround other than removing the DTD declarations
from the files before indexing them through DIH.
[1] - See https://issues.apache.org/jira/browse/SOLR-964
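
A minimal sketch of the DTD-removal workaround described above, offered as
one possible approach rather than anything shipped with Solr. It assumes the
files are UTF-8 (as in the sample), that each file has at most one DOCTYPE
declaration, and that rewriting the files in place is acceptable (running it
on a copy of the directory is safer). The helper names strip_doctype and
strip_doctype_in_tree are hypothetical and exist only for this illustration.

```python
import re
from pathlib import Path

# Matches a DOCTYPE declaration, including one that spans several lines and
# an optional internal subset in square brackets, plus trailing whitespace.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*(?:\[[^\]]*\])?\s*>\s*", re.IGNORECASE)


def strip_doctype(text: str) -> str:
    """Return text with the first DOCTYPE declaration removed."""
    return DOCTYPE_RE.sub("", text, count=1)


def strip_doctype_in_tree(base_dir: str, pattern: str = "*.htm") -> int:
    """Rewrite matching files under base_dir in place; return how many changed."""
    changed = 0
    for path in Path(base_dir).rglob(pattern):  # rglob descends into subdirectories
        original = path.read_text(encoding="utf-8")  # assumption: files are UTF-8
        cleaned = strip_doctype(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

For the directory in the original configuration, this might be called as
strip_doctype_in_tree("exampledocs/dylan") before starting the full import.
With the DOCTYPE gone, the parser has no external DTD URL to fetch, which is
the request that fails with the 503 in the stack trace.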
--
Regards,
Shalin Shekhar Mangar.
Re: Trouble Indexing HTML Files
Posted by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>.
Hey, XPathEntityProcessor does not work with wildcard XPaths like '//a[@class'.
If you just wish to index HTML, use a PlainTextEntityProcessor with
HTMLStripTransformer.
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com