Posted to user@nutch.apache.org by payo <pa...@yahoo.com> on 2007/10/29 17:59:03 UTC

Re: XMLParser for Nutch

Hello,

I am trying to install the XML Parser, but when I run ant in steps 7 and 8
it shows me this message:

BUILD FAILED

C:\nutch-0.9\build.xml:61: Specify at least one source--a file or a
resource collection.

Why?




Rida Benjelloun wrote:
> 
> Hi,
> Here are the steps to install the Xml Parser plugin:
> 1- Copy parse-xml into the src/plugin directory
> 
> 2- Copy xmlparser-conf.xml into the conf directory
> 3- Add the following property to nutch-site.xml (conf directory)
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|xml|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
> 
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> 
> 4- Modify parse-plugins.xml (conf directory)
>     <mimeType name="text/xml">
>         <plugin id="parse-xml" />
>         <plugin id="parse-text" />
>         <plugin id="parse-html" />
>         <plugin id="parse-rss" />
>     </mimeType>
> 
> 5- Modify build.xml in the root directory: add parse-xml
> 6- Modify build.xml in src/plugin: add parse-xml
> 7- Execute ant in the src/plugin directory
> 8- Execute ant in the root directory
> 9- Copy the parse-xml directory located in nutch-0.8.1/build/plugins to
> nutch-0.8.1/plugins
> 
> Best regards
> 
> Rida Benjelloun
> 
> 
> 
> 
> On 11/7/06, Jim Wilson <wi...@gmail.com> wrote:
>>
>> I think you should stop sending *bump* emails.
>>
>> -- Jim
>>
>> On 11/7/06, Jayant Kumar Gandhi <ja...@gmail.com> wrote:
>> >
>> > *bump*
>> >
>> > Any thoughts, anyone?
>> >
>> > Thanks,
>> > Jayant
>> >
>> > On 11/6/06, Jayant Kumar Gandhi <ja...@gmail.com> wrote:
>> > > Hello,
>> > >
>> > > I have been working on it since then. I have found one problem. It
>> > > seems the parse-xml plugin is not loading.
>> > >
>> > > One thing I did was put the plugin in the parse-plugins.xml to enable
>> > > nutch-0.8.1 to detect that parse-xml is the plugin to be used for xml
>> > > content. This is not given in the instructions for the plugin though.
>> > >
>> > > Because of it I started to get the following error in hadoop.log:-
>> > >
>> > > 2006-11-06 15:12:33,156 WARN  parse.ParserFactory - ParserFactory:
>> > > Plugin: parse-xml mapped to contentType text/xml via
>> > > parse-plugins.xml, but not enabled via plugin.includes in
>> > > nutch-default.xml
>> > >
>> > > The issue is that I have the plugin enabled in the nutch-site.xml. I
>> > > also tried to enable the plugin in nutch-default.xml but I still get
>> > > the same error.
>> > >
>> > > Any thoughts/ pointers on how to make the plugin work?
>> > >
>> > > Thanks and Best Regards,
>> > > Jayant Gandhi
>> > >
>> > >
>> > > On 11/5/06, Jayant Kumar Gandhi <ja...@gmail.com> wrote:
>> > > > I am using the default xmlparser-conf.xml, just copied into the
>> > > > nutch/conf dir. To test it I used the XML file given in the sample
>> > > > directory, xmltest.xml, which is uploaded at
>> > > > http://www.jkg.in/xmltest.xml .
>> > > >
>> > > > I do not get any errors while indexing or parsing. The crawl log is
>> > > > attached. I am able to get the XML file in the results when I search
>> > > > for 'XPath', but when I click the explain link, it doesn't show me the
>> > > > field dctitle in the index, which it should.
>> > > >
>> > > > I just noticed that hadoop.log has some error for handling XML files
>> > > > and I cannot see parse-xml loaded, but I have it enabled in my
>> > > > nutch-site.conf. I am new to nutch-0.8 and hadoop, so I have no idea
>> > > > whether this is expected behaviour or how to fix it.
>> > > >
>> > > > Thanks and Best Regards,
>> > > > Jayant
>> > > >
>> > > > On 11/5/06, Nutch Newbie <nu...@gmail.com> wrote:
>> > > > > Can you post your "xmlparser-conf.xml" from the nutch/conf dir ?
>> > > > > Also what kind of error message do you get when you index?
>> > > > > You can use Luke to see the index...
>> > > > >
>> > > > > Regards,
>> > > > >
>> > > > > On 11/4/06, Jayant Kumar Gandhi <ja...@gmail.com> wrote:
>> > > > > > Hello Everyone,
>> > > > > >
>> > > > > > I have just installed nutch-0.8.1 on my dev machine. I installed a new
>> > > > > > plugin called XML Parser, available at
>> > > > > > http://issues.apache.org/jira/browse/NUTCH-185
>> > > > > > The issue is that I am unable to get it to work.
>> > > > > > I copied the parse-xml folder to the src/plugin folder. I made the
>> > > > > > corresponding deploy/clean entries in the build xml file.
>> > > > > >
>> > > > > > Also, I have edited the nutch conf to enable the xml plugin.
>> > > > > > The plugin is still not working. After compiling using ant, I started
>> > > > > > indexing. After the indexing was finished and a query done, I couldn't
>> > > > > > see the indexed fields on the explain page.
>> > > > > >
>> > > > > > Any inputs?
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Jayant
>> > > > > >
>> > > > >
>> > > >
>> > > > --
>> > > > www.jkg.in | http://www.jkg.in/contact-me/
>> > > > Jayant Kr. Gandhi
>> > >
>> > > --
>> > > www.jkg.in | http://www.jkg.in/contact-me/
>> > > Jayant Kr. Gandhi
>> > >
>> >
>> >
>> > --
>> > www.jkg.in | http://www.jkg.in/contact-me/
>> > Jayant Kr. Gandhi
>> > M.Tech. Computer Tech. Class of 2007,
>> > IIT Delhi
>> >
>>
>>
> 
> 
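
The warning quoted in this thread ("mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes") suggests comparing the two settings from steps 3 and 4 directly. The following is only a rough offline sketch in Python, not Nutch code: it assumes a plugin counts as enabled when its whole id matches the plugin.includes expression, and the `<parse-plugins>` wrapper element and the helper names are made up for the illustration. It also shows what happens when the value picks up a stray line break inside `urlfilter-regex`, which can happen when the property is copied out of a wrapped email.

```python
import re
import xml.etree.ElementTree as ET

# plugin.includes value from step 3, written on a single line.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|xml|html|js)|index-basic"
    "|query-(basic|site|url)|summary-basic|scoring-opic"
)

# Hypothetical damaged copy: a line break ended up inside "urlfilter-regex".
BROKEN_INCLUDES = PLUGIN_INCLUDES.replace("urlfilter-regex", "urlfilter\n-regex")

# mimeType mapping from step 4. The <parse-plugins> wrapper is an assumption
# added only so that the fragment parses as a standalone document.
PARSE_PLUGINS = """
<parse-plugins>
  <mimeType name="text/xml">
    <plugin id="parse-xml" />
    <plugin id="parse-text" />
    <plugin id="parse-html" />
    <plugin id="parse-rss" />
  </mimeType>
</parse-plugins>
"""


def is_included(plugin_id, includes):
    """Assumption: a plugin is enabled when the whole id matches the regex."""
    return re.fullmatch(includes, plugin_id) is not None


def mapped_but_not_enabled(parse_plugins_xml, includes):
    """Return plugin ids that are mapped to a mime type but fail the regex."""
    root = ET.fromstring(parse_plugins_xml)
    ids = [plugin.get("id") for plugin in root.iter("plugin")]
    return [pid for pid in ids if not is_included(pid, includes)]


if __name__ == "__main__":
    # Plugins listed for text/xml that the include expression does not cover.
    print(mapped_but_not_enabled(PARSE_PLUGINS, PLUGIN_INCLUDES))
    # Whether urlfilter-regex still matches once the value contains a newline.
    print(is_included("urlfilter-regex", BROKEN_INCLUDES))
```

With the values from steps 3 and 4, parse-xml passes the check and only parse-rss is reported as mapped but not enabled. With the damaged value, urlfilter-regex no longer matches. Whether Nutch itself reacts the same way to a newline inside the value is not confirmed here.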

-- 
View this message in context: http://www.nabble.com/XMLParser-for-Nutch-tf2575183.html#a13471028
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: XMLParser for Nutch

Posted by payo <pa...@yahoo.com>.
I was able to install parse-xml with nutch-0.8.1.

When I run the crawl

./bin/nutch crawl urls -dir crawl -depth 3 -topN 50

it shows me these errors:


Exception in thread "main" java.io.IOException: Input directory C:/cygwin/home/nutch-8/url in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

What is the problem?
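
The exception complains about the input directory handed to the injector, so one thing to rule out is that the seed directory is missing or empty. Note that the command names `urls` while the path in the exception ends in `url`. That may just be a slip when pasting, or it may be the actual mismatch. Here is a minimal Python sketch of such a check. It is not part of Nutch; the function name is made up, and it assumes seed files are plain text with one URL per line and `#` for comments.

```python
import sys
from pathlib import Path


def check_seed_dir(path):
    """Return a list of problems found with a seed directory (empty if none)."""
    seed_dir = Path(path)
    if not seed_dir.is_dir():
        return [f"{seed_dir} is not an existing directory"]

    files = [p for p in seed_dir.iterdir() if p.is_file()]
    if not files:
        return [f"{seed_dir} contains no files"]

    # Assumption: one URL per line, lines starting with '#' are comments.
    for seed_file in files:
        for line in seed_file.read_text(errors="replace").splitlines():
            if line.strip() and not line.lstrip().startswith("#"):
                return []  # found at least one candidate URL line

    return [f"{seed_dir} has files but no non-comment lines"]


if __name__ == "__main__":
    # Pass the same directory name that is given to "bin/nutch crawl".
    target = sys.argv[1] if len(sys.argv) > 1 else "urls"
    for problem in check_seed_dir(target):
        print(problem)
```

Run it from the same working directory as the crawl command, since a relative name such as `urls` is resolved against the current directory.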



Sebastian Steinmetz wrote:
> 
> Hi,
> 
> compiling from the source package is somewhat tricky. You've got 2  
> options to solve this:
> 
> 1. export the missing config/*.template files from the SVN-Repository
> 2. edit build.xml:61 so that it doesn't want to copy these *.template  
> files.
> 
> hope that helps.
> 
> cya,
> 	Sebastian Steinmetz
> 
> 

-- 
View this message in context: http://www.nabble.com/XMLParser-for-Nutch-tf2575183.html#a13630600
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: XMLParser for Nutch

Posted by Sebastian Steinmetz <s....@mederi-research.de>.
Hi,

Compiling from the source package is somewhat tricky. You've got two
options to solve this:

1. Export the missing config/*.template files from the SVN repository.
2. Edit build.xml:61 so that it doesn't try to copy these *.template
files.

Hope that helps.

cya,
	Sebastian Steinmetz
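
To see whether option 1 applies, it may help to check first whether any *.template files are present in the unpacked source package. This is only a small Python sketch with a made-up helper name. The directory to inspect is passed in as an argument because the exact location in a given Nutch release is an assumption here.

```python
import sys
from pathlib import Path


def find_templates(conf_dir):
    """Return the sorted names of *.template files directly inside conf_dir."""
    directory = Path(conf_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.template") if p.is_file())


if __name__ == "__main__":
    # "conf" is only a guess at the directory name; pass the real one instead.
    target = sys.argv[1] if len(sys.argv) > 1 else "conf"
    names = find_templates(target)
    if names:
        print("template files found:", ", ".join(names))
    else:
        print("no *.template files in", target)
```

An empty result would be consistent with the "Specify at least one source" failure described above, though it does not prove that this is the cause.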

