Posted to solr-user@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2010/01/31 06:00:39 UTC

DataImportHandler problem - reading XML from a file

This DataImportHandler script does not find any documents in this HTML
file. The DIH definitely opens the file, but either the
XPathEntityProcessor gets no data or it does not recognize the xpaths
described. Any hints? (I'm using a recent Solr 1.5-dev build.)

Thanks!

Lance


xhtml-data-config.xml:

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="xhtml"
                        forEach="/html/head | /html/body"
                        processor="XPathEntityProcessor" pk="id"
                        transformer="TemplateTransformer"
                        url="/cygwin/tmp/ch05-tokenizers-filters-Solr1.4.html"
                        >
                <field column="head_s" xpath="/html/head"/>
                <field column="body_s" xpath="/html/body"/>
        </entity>
        </document>
</dataConfig>

Sample data file: "cygwin/tmp/ch05-tokenizers-filters-Solr1.4.html"

<?xml version="1.0" encoding="UTF-8" ?>
<html >
  <head >
    <meta content="en-US" name="DC.language" />
  </head>
  <body>
    <div id="header">
     <a href="ch05-tokenizers-filters-Solr1.4.html">First</a>
        <span class="nolink">Previous</span>
        <a href="ch05-tokenizers-filters-Solr1.41.html">Next</a>
        <a href="ch05-tokenizers-filters-Solr1.460.html">Last</a>
    </div>
    <div dir="ltr" id="content" style="background-color:transparent">
      <h1 id="toc0">
        <span class="SectionNumber">1</span>
        <a id="RefHeading36402771"></a>
        <a id="bkmRefHeading36402771"></a>
        Understanding Analyzers, Tokenizers, and Filters
      </h1>
    </div>
  </body>
</html>



-- 
Lance Norskog
goksron@gmail.com

Re: DataImportHandler problem - reading XML from a file

Posted by Lance Norskog <go...@gmail.com>.
I would like to create a text string from the complete node tree,
expressed in XML. So, /html/body would supply a string which starts:
'<div id="header">'. Is this possible?

In general, I'm attempting to take the HTML body node and index it as
a text string. Then I can fetch that text body and highlight words.
The reason I want to save only the body part is that I can then pull
multiple body parts and string them together into a page. This is how
www.lucidimagination.com/search serves our Solr reference guide
book.
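
To make the desired output concrete, here is a minimal sketch of "the body
node as an XML string", done outside Solr with Python's standard
xml.etree.ElementTree. It is not a DIH feature and says nothing about what
XPathEntityProcessor can do; the helper name inner_xml and the trimmed
sample are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file from the original post (illustration only).
SAMPLE = """<html>
  <head><meta content="en-US" name="DC.language" /></head>
  <body>
    <div id="header">
      <span class="nolink">Previous</span>
    </div>
  </body>
</html>"""


def inner_xml(elem):
    """Serialize everything inside elem (text and child elements) as one string."""
    pieces = [elem.text or ""]
    for child in elem:
        # tostring() also emits the child's tail text, so text between
        # sibling elements is preserved.
        pieces.append(ET.tostring(child, encoding="unicode"))
    return "".join(pieces).strip()


body = ET.fromstring(SAMPLE).find("body")
body_markup = inner_xml(body)
print(body_markup)
```

A string produced this way starts with '<div id="header">', which is the
kind of value the body_s field would need to hold for the highlighting
use case described above.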

Anyway, /html/body/div/span should supply the text 'Previous', but it
does not. I changed this to use a ContentStreamDataSource and post the
data, and I get the response below. What does
"Total Requests made to DataSource" = 0 mean?

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">124</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">xhtml-data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse" />
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2010-01-31 21:58:50</str>
    <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
    <str name="Committed">2010-01-31 21:58:50</str>
    <str name="Optimized">2010-01-31 21:58:50</str>
    <str name="Total Documents Processed">0</str>
    <str name="Time taken">0:0:0.124</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

2010/1/31 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> It is clear that the xpaths provided won't fetch anything, because there
> is no data in those paths. What do you really wish to be indexed?



-- 
Lance Norskog
goksron@gmail.com

Re: DataImportHandler problem - reading XML from a file

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
It is clear that the xpaths provided won't fetch anything, because there
is no data in those paths. What do you really wish to be indexed?
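
The point about "no data in those paths" can be checked outside Solr. The
sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy
of the sample file and looks only at text sitting directly inside <head> and
<body>, ignoring descendants. Whether XPathEntityProcessor collects text
exactly this way is an assumption here, and direct_text is a hypothetical
helper, not part of DIH.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file from the original post (illustration only).
SAMPLE = """<html>
  <head>
    <meta content="en-US" name="DC.language" />
  </head>
  <body>
    <div id="header">
      <a href="ch05-tokenizers-filters-Solr1.4.html">First</a>
      <span class="nolink">Previous</span>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element


def direct_text(elem):
    """Text that belongs to elem itself, ignoring text inside child elements."""
    parts = [elem.text or ""] + [child.tail or "" for child in elem]
    return "".join(parts).strip()


head_text = direct_text(root.find("head"))
body_text = direct_text(root.find("body"))
# Paths are relative to <html>, so this corresponds to /html/body/div/span.
span_texts = [span.text for span in root.findall("body/div/span")]

print(repr(head_text), repr(body_text), span_texts)
```

Both head_text and body_text come out empty because those elements contain
only whitespace and child elements, while the deeper span path does carry
text. This only illustrates the structure of the sample file; it does not
explain the zero "Total Requests made to DataSource" count in the response.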






-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com