Posted to solr-user@lucene.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/07/15 16:01:14 UTC

SimplePostTool error (UNCLASSIFIED)

CLASSIFICATION: UNCLASSIFIED

How do I correct this error when running the simple post tool against a website?
The tool successfully indexed for about 30 mins before throwing this error and terminating.

[Fatal Error] :642:15: XML document structures must start and end within the same entity.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 642; columnNumber: 15; XML document structures must start and end within the same entity.
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1219)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:601)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:548)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:351)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:182)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:167)
Caused by: org.xml.sax.SAXParseException; lineNumber: 642; columnNumber: 15; XML document structures must start and end within the same entity.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
        at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1028)
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1201)
        ... 9 more

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~




Re: SimplePostTool error (UNCLASSIFIED)

Posted by Yonik Seeley <ys...@gmail.com>.
On Fri, Jul 15, 2016 at 12:29 PM, Erick Erickson
<er...@gmail.com> wrote:
> SimplePostTool is just that, simple. It's intended to get you started.
> It is not a full-featured web crawler. As such, if you're encountering
> wonky web pages that are not well-formed HTML, there's no guarantee
> that it'll handle them gracefully.

HTML is not necessarily well-formed XML, though. Hopefully we're not
using an XML parser to try to parse HTML?
The error message "XML document structures must start and end within
the same entity." is a rule for XML, but not for HTML.

-Yonik
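
The stack trace in the original post shows the page going through javax.xml.parsers.DocumentBuilder, so the point above can be reproduced outside Solr. What follows is only a minimal sketch, not SimplePostTool's code: the class name HtmlAsXmlDemo, the method parsesAsXml, and the sample markup are made up for illustration. It assumes nothing beyond the JDK's built-in XML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Solr.
class HtmlAsXmlDemo {

    /** Returns true if the markup parses as well-formed XML with the JDK DOM parser. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing the
            // "[Fatal Error]" line that the default handler writes to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface here as SAXParseException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Markup that happens to be valid XML as well as HTML.
        System.out.println(parsesAsXml("<html><body><p>ok</p></body></html>"));
        // Ordinary HTML: <br> has no closing tag, which XML does not allow.
        System.out.println(parsesAsXml("<html><body><p>line<br></p></body></html>"));
        // A truncated page: open elements are never closed before end of input.
        System.out.println(parsesAsXml("<html><body><p>truncated"));
    }
}
```

Only the first sample parses. The other two are pages a browser would render without complaint, which is why a strict XML parser is a fragile way to read arbitrary web pages.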

RE: [Non-DoD Source] Re: SimplePostTool error (UNCLASSIFIED)

Posted by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil>.
CLASSIFICATION: UNCLASSIFIED

Thanks Yonik and Erick,

If I set -filetypes csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,rtf,htm,html,txt, would this prevent indexing of XML files?

Why does the simple post tool index .cfm files with this setting or with the defaults?

Thanks,
Kris


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Friday, July 15, 2016 12:30 PM
To: solr-user <so...@lucene.apache.org>
Subject: [Non-DoD Source] Re: SimplePostTool error (UNCLASSIFIED)

SimplePostTool is just that, simple. It's intended to get you started.
It is not a full-featured web crawler. As such, if you're encountering wonky web pages that are not well-formed HTML, there's no guarantee that it'll handle them gracefully.

Crawling websites is a pain, so if you require something robust I'd investigate Nutch (which integrates with Solr/Lucene) or similar.

Best,
Erick


Re: SimplePostTool error (UNCLASSIFIED)

Posted by Erick Erickson <er...@gmail.com>.
SimplePostTool is just that, simple. It's intended to get you started.
It is not a full-featured web crawler. As such, if you're encountering
wonky web pages that are not well-formed HTML, there's no guarantee
that it'll handle them gracefully.

Crawling websites is a pain, so if you require something robust
I'd investigate Nutch (which integrates with Solr/Lucene) or
similar.

Best,
Erick
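
To make "handle them gracefully" concrete, here is one possible shape for it: a link extractor that skips a page it cannot parse instead of letting the exception end the whole crawl. This is a hedged sketch and not how SimplePostTool is implemented. The class name TolerantLinkExtractor and the method extractLinks are hypothetical, and it still uses the JDK XML parser, so it only avoids the crash and does not recover links from malformed pages.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; not part of Solr.
class TolerantLinkExtractor {

    /** Returns the href values of all <a> elements, or an empty list if the page is not well-formed. */
    static List<String> extractLinks(String markup) {
        List<String> links = new ArrayList<>();
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new ByteArrayInputStream(
                markup.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                // getAttribute returns "" when the attribute is absent.
                String href = ((Element) anchors.item(i)).getAttribute("href");
                if (!href.isEmpty()) {
                    links.add(href);
                }
            }
        } catch (SAXException e) {
            // Malformed page: report it and carry on with no links from this page.
            System.err.println("Skipping links on malformed page: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String good = "<html><body><a href=\"/a\">x</a><a href=\"/b\">y</a></body></html>";
        String bad = "<html><body><a href=\"/a\">x</a><p>cut off";
        System.out.println(extractLinks(good));
        System.out.println(extractLinks(bad));
    }
}
```

A crawler built this way keeps going after a bad page, at the cost of silently losing that page's outgoing links. A dedicated crawler such as Nutch is still the more robust route.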
