Posted to solr-user@lucene.apache.org by Daniel Cohen <da...@gmail.com> on 2009/09/10 21:52:19 UTC

Trouble Indexing HTML Files

Hi there,

I'm trying to get the DataImportHandler to recursively parse the content of a
root directory that contains several other directories beneath it. The
indexing seems to fail on the DOCTYPE declaration in my source files.

I've defined the appropriate fields in my schema.xml and added the dataimport
requestHandler to solrconfig.xml. Does anyone know what I am doing wrong, or
is there perhaps a better way to attempt this?

dataconfig.xml:
<dataConfig>
    <dataSource type="FileDataSource" />
    <document>
        <entity name="file"
                processor="FileListEntityProcessor"
                baseDir="exampledocs/dylan"
                fileName=".*htm"
                recursive="true"
                rootEntity="false"
                dataSource="null">
            <entity name="song"
                    processor="XPathEntityProcessor"
                    forEach="/html"
                    transformer="HTMLStripTransformer"
                    url="${file.fileAbsolutePath}">
                <field column="name" xpath="//h1[@class='songtitle']" />
                <field column="album" xpath="//a[@class='recordlink']" />
                <field column="body" xpath="//body" stripHTML="true" />
            </entity>
        </entity>
    </document>
</dataConfig>


Stack trace:

java.lang.RuntimeException: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:226)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:163)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:311)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
Caused by: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
    ... 10 more
Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
    at java.net.URL.openStream(URL.java:1007)
    at com.ctc.wstx.util.URLUtil.optimizedStreamFromURL(URLUtil.java:113)
    at com.ctc.wstx.io.DefaultInputResolver.sourceFromURL(DefaultInputResolver.java:256)
    at com.ctc.wstx.io.DefaultInputResolver.resolveEntity(DefaultInputResolver.java:96)
    at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:468)
    at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
    at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
    ... 13 more


Sample .htm file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

<head>
<title>Hazel</title>
<link rel="stylesheet" type="text/css" href="../css/general.css" />
</head>

<body>

<h1 class="songtitle">Hazel</h1>


<p>Words and music Bob Dylan<br />
Released on <a class="recordlink" href="index.htm">Planet Waves</a>
(1974)<br />
Tabbed by Eyolf &Oslash;strem</p>

<p>The song could equally well be played with C chords and a capo on the
4th fret. Such a version is appended at the end.</p>

<p>The intro is played rather freely (which is a nice way of saying that
they aren't exactly tight...) &ndash; and with both a bass and a guitar. The
tab below is just a suggestion of an approximation.</p>

<hr />

<pre class="tab">
 E                B                A  E/g# F#m E
|--------7-------|--------2-------|----------------|
|9-----9---9-----|4-----4---4-----|2---0-----------|
|9---9-------9---|4---4-------4---|2---1---2---1---|
|9---------------|4---------------|2---2---4---2---|
|7---------------|2---------------|0-------4---2---|
|----------------|----------------|----4---2---0---|
</pre>

<pre class="verse">
E      G#
Hazel, dirty-blonde hair
A                           F#7
I wouldn't be ashamed to be seen with you anywhere.
E                   G#          C#m   E/b  A
You got something I want plenty of
E             B             A   G#m  F#m  E
Ooh, a little touch of your love.

Hazel, stardust in your eye
You're goin' somewhere and so am I.
I'd give you the sky high above
Ooh, for a little touch of your love.
</pre>

<pre class="bridge">
G#                        C#m
Oh no, I don't need any reminder
G#                        C#m
To know how much I really care
F#
But it's just making me blinder and blinder
            B       A        G#m              F#m
Because I'm up on a hill and still you're not there.
</pre>

<pre class="verse">
Hazel, you called and I came,
Now don't make me play this waiting game.
You've got something I want plenty of
Ooh, a little touch of your love.
</pre>

<hr />

<h2 class="songversion">Version with capo on 4th fret</h2>
<pre class="verse">
C      E
Hazel, dirty-blonde hair
F                           D7
I wouldn't be ashamed to be seen with you anywhere.
C                   E          Am   /g  F
You got something I want plenty of
C             G             F   Em  Dm  C
Ooh, a little touch of your love.

...
</pre>

<pre class="bridge">
E                         Am
Oh no, I don't need any reminder
E                         Am
To know how much I really care

But it's just making me blinder and blinder
            G       F        Em               Dm
Because I'm up on a hill and still you're not there.
</pre>
</body></html>

Re: Trouble Indexing HTML Files

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen <
daniel.michael.cohen@gmail.com> wrote:

> Hi there,
>
> I'm trying to get the DataImportHandler to recursively parse the content of a
> root directory that contains several other directories beneath it. The
> indexing seems to fail on the DOCTYPE declaration in my source files.
>
> Stack trace:
>
> java.lang.RuntimeException: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:226)

On trunk, DataImportHandler ignores the DTD [1]. If you are using Solr 1.3,
there is unfortunately no workaround other than removing the DOCTYPE
declarations from the files before indexing them through DIH.

[1] - See https://issues.apache.org/jira/browse/SOLR-964
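
For Solr 1.3, the DOCTYPE removal described above could be scripted along
these lines. This is only a sketch and not part of DIH: the function names
(strip_doctype, numeric_entities, clean_tree) are made up for illustration,
it assumes UTF-8 files and a DOCTYPE without an internal subset, and it
rewrites files in place, so it is best run on a copy. It also rewrites named
entities such as &Oslash; as numeric references, on the assumption that a
parser which never reads the DTD may not recognise them; whether DIH's parser
actually needs that step is not confirmed here.

```python
import re
from html.entities import name2codepoint
from pathlib import Path

# Assumption: the DOCTYPE has no internal subset ("[ ... ]"), as in the sample file.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>\s*", re.IGNORECASE)
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Entities every XML parser knows without a DTD; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def strip_doctype(text):
    """Remove the first DOCTYPE declaration and the whitespace after it."""
    return DOCTYPE_RE.sub("", text, count=1)


def numeric_entities(text):
    """Rewrite known HTML named entities (e.g. &ndash;) as numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names alone
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, text)


def clean_tree(base_dir, pattern="*.htm"):
    """Clean every matching file under base_dir in place; return files changed."""
    changed = 0
    for path in Path(base_dir).rglob(pattern):
        original = path.read_text(encoding="utf-8")
        cleaned = numeric_entities(strip_doctype(original))
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Calling clean_tree("exampledocs/dylan") on a copy of the baseDir from the
original config would leave files that no longer point the parser at
www.w3.org.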

-- 
Regards,
Shalin Shekhar Mangar.

Re: Trouble Indexing HTML Files

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
Hey, XPathEntityProcessor does not work with wildcard XPath expressions
like '//a[@class'

If you just wish to index HTML, use a PlainTextEntityProcessor with
HTMLStripTransformer.



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com