Posted to dev@nutch.apache.org by "Rida Benjelloun (JIRA)" <ji...@apache.org> on 2006/01/24 17:47:09 UTC

[jira] Created: (NUTCH-185) XMLParser is configurable plugin. It use XPath and namespaces to do the mapping between the XML elements and Lucene fields.

XMLParser is configurable plugin. It use XPath and namespaces to do the mapping between the XML elements and Lucene fields. 
----------------------------------------------------------------------------------------------------------------------------

         Key: NUTCH-185
         URL: http://issues.apache.org/jira/browse/NUTCH-185
     Project: Nutch
        Type: New Feature
  Components: fetcher, indexer  
    Versions: 0.7.2-dev    
 Environment: OS Independent
    Reporter: Rida Benjelloun


XMLParser is a configurable plugin. It uses XPath and namespaces to map XML elements to Lucene fields.

Information:

1- Copy "xmlparser-conf.xml" to the nutch/conf directory.

2- To index your custom XML files, modify "xmlparser-conf.xml".
The parser uses namespaces and XPath to parse XML content.
The config file maps XML nodes (selected with XPath) to Lucene fields.
Example : <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />

3- An xmlIndexerProperties element groups a set of fields associated with a namespace.
If that namespace is found in the XML document, the fields defined for it are indexed.
Example :
<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
  <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
  <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
</xmlIndexerProperties>


4- You can define a default namespace, which is applied when the parser
finds no namespace in the document, or when the namespace found in the XML document does not match any namespace defined in an xmlIndexerProperties element.
Example :
<xmlIndexerProperties type="filePerDocument" namespace="default">
  <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
</xmlIndexerProperties>
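
The selection rule described in points 3 and 4 can be sketched roughly as follows. This is only an illustration in Python, not the plugin's code: FIELD_GROUPS is a hypothetical in-memory stand-in for xmlparser-conf.xml, the function names are invented, and the paths are written as ".//" because Python's ElementTree supports only a subset of XPath and rejects absolute "//" paths on elements. The type and boost attributes are left out.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical stand-in for xmlparser-conf.xml:
# namespace URI (or "default") -> list of (field name, path).
FIELD_GROUPS = {
    DC_NS: [("dctitle", ".//dc:title"), ("dccreator", ".//dc:creator")],
    "default": [("xmlcontent", ".//*")],
}

# Prefix map used to resolve "dc:" in the paths above.
PREFIXES = {"dc": DC_NS}


def namespaces_in(root):
    """Collect the namespace URIs used by element tags in the document."""
    found = set()
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            found.add(el.tag[1:].split("}", 1)[0])
    return found


def extract_fields(xml_text):
    """Return {field name: [values]} using the namespace-matching rule."""
    root = ET.fromstring(xml_text)
    present = namespaces_in(root)
    # Use every configured group whose namespace occurs in the document;
    # fall back to the "default" group when none matches.
    groups = [ns for ns in FIELD_GROUPS if ns in present] or ["default"]
    fields = {}
    for ns in groups:
        for name, path in FIELD_GROUPS[ns]:
            for node in root.findall(path, PREFIXES):
                text = (node.text or "").strip()
                if text:
                    fields.setdefault(name, []).append(text)
    return fields
```

With this sketch, a document that declares the Dublin Core namespace yields the dctitle and dccreator fields, while a document with no namespace, or with an unrelated one, falls through to the "default" group and has the text of its descendant elements collected under xmlcontent.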


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Re: [jira] Commented: (NUTCH-185) XMLParser is configurable plugin. It use XPath and namespaces to do the mapping between the XML elements and Lucene fields.

Posted by Rida Benjelloun <ri...@doculibre.com>.

Hi Philippe,

Thanks for your comments. I have already added multiple values to a field in
Lucene. I will try it with the Nutch plugin.

Best regards.




On 1/26/06, Philippe EUGENE (JIRA) <ji...@apache.org> wrote:
>
>    [
> http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12364087]
>
> Philippe EUGENE commented on NUTCH-185:
> ---------------------------------------
>
> Great plugin. Thanks!
> I successfully tested this plugin on Nutch 0.7.1.
> I only have a problem with structures like this:
> <authors>
> <author>author1</author>
> <author>author2</author>
> <author>author3</author>
> </authors>
>
> In my Lucene index I see only the author3 value for this field.
> I'm not sure the problem is in the plugin.
> I don't know whether a field can have multiple values in Nutch 0.7.1.
>


--
----------------------------------------
Rida Benjelloun
President and CEO
DocuLibre inc.
Telephone: (418) 262-3222
Website: http://www.doculibre.com
Email: rida.benjelloun@doculibre.com
----------------------------------------

[jira] Commented: (NUTCH-185) XMLParser is configurable plugin. It use XPath and namespaces to do the mapping between the XML elements and Lucene fields.

Posted by "Philippe EUGENE (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12364087 ] 

Philippe EUGENE commented on NUTCH-185:
---------------------------------------

Great plugin. Thanks!
I successfully tested this plugin on Nutch 0.7.1.
I only have a problem with structures like this:
<authors>
 <author>author1</author>
 <author>author2</author>
 <author>author3</author>
</authors>

In my Lucene index I see only the author3 value for this field.
I'm not sure the problem is in the plugin.
I don't know whether a field can have multiple values in Nutch 0.7.1.
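
One way a symptom like this can arise, offered only as an assumption since the thread does not identify the cause, is that each matched node overwrites the previous value stored under the same field name. The Python sketch below contrasts that with collecting every match. The function names and the plain dict used as a "document" are hypothetical and are not taken from the plugin or from Nutch.

```python
import xml.etree.ElementTree as ET

AUTHORS_XML = """<authors>
  <author>author1</author>
  <author>author2</author>
  <author>author3</author>
</authors>"""


def single_valued(root, name, path):
    """Keep one value per field: each match replaces the previous one."""
    doc = {}
    for node in root.findall(path):
        doc[name] = node.text  # earlier matches are lost here
    return doc


def multi_valued(root, name, path):
    """Keep every match by appending to a list under the field name."""
    doc = {}
    for node in root.findall(path):
        doc.setdefault(name, []).append(node.text)
    return doc


root = ET.fromstring(AUTHORS_XML)
last_only = single_valued(root, "author", ".//author")
all_values = multi_valued(root, "author", ".//author")
```

Here last_only holds just the final match, which mirrors the reported behaviour, while all_values keeps the three authors in document order.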


[jira] Commented: (NUTCH-185) XMLParser is configurable xml parser plugin.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12371671 ] 

Chris A. Mattmann commented on NUTCH-185:
-----------------------------------------

I propose that either this issue be closed and the patch files moved to NUTCH-23, or that NUTCH-23 be closed, as the two are duplicate issues. Comments?


[jira] Updated: (NUTCH-185) XMLParser is configurable xml parser plugin.

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-185?page=all ]

Rida Benjelloun updated NUTCH-185:
----------------------------------

        Summary: XMLParser is configurable xml parser plugin.   (was: XMLParser is configurable plugin. It use XPath and namespaces to do the mapping between the XML elements and Lucene fields.)
    Description: 
The XML parser is a configurable plugin. It uses XPath and namespaces to map XML elements to Lucene fields.

Information:

1- Copy "xmlparser-conf.xml" to the nutch/conf directory.

2- To index your custom XML files, modify "xmlparser-conf.xml".
The parser uses namespaces and XPath to parse XML content.
The config file maps XML nodes (selected with XPath) to Lucene fields.
Example : <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />

3- An xmlIndexerProperties element groups a set of fields associated with a namespace.
If that namespace is found in the XML document, the fields defined for it are indexed.
Example :
<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
  <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
  <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
</xmlIndexerProperties>


4- You can define a default namespace, which is applied when the parser
finds no namespace in the document, or when the namespace found in the XML document does not match any namespace defined in an xmlIndexerProperties element.
Example :
<xmlIndexerProperties type="filePerDocument" namespace="default">
  <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
</xmlIndexerProperties>


  was:
XMLParser is a configurable plugin. It uses XPath and namespaces to map XML elements to Lucene fields.

(The rest of the previous description, steps 1 to 4 and their examples, was the same as in the new description above.)




[jira] Updated: (NUTCH-185) XMLParser is configurable plugin. It use XPath and namespaces to do the mapping between the XML elements and Lucene fields.

Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-185?page=all ]

Rida Benjelloun updated NUTCH-185:
----------------------------------

    Attachment: parse-xml.zip

Version 1.0
