Posted to user@nutch.apache.org by kemical <mi...@gmail.com> on 2013/01/14 11:23:48 UTC

Re: Using Nutch with Boilerpipe

Hi,

I've just installed Nutch 1.6 and Solr 3.6.2, and I'd like to know if it's
possible to use boilerpipe with it (or if I should install an older version
of Nutch with the patch mentioned above).





--
View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033101.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Using Nutch with Boilerpipe

Posted by Lewis John Mcgibbney <le...@gmail.com>.

> Nevertheless, I should commit it, and more patches, anyway.
>

+1
Lewis

> Cheers,
>
> -----Original message-----
>> From:kemical <mi...@gmail.com>
>> Sent: Mon 14-Jan-2013 22:11
>> To: user@nutch.apache.org
>> Subject: Re: Using Nutch with Boilerpipe
>>
>> Hi,
>>
>> I've just installed Nutch 1.6 and Solr 3.6.2, and I'd like to know if it's
>> possible to use boilerpipe with it (or if I should install an older version
>> of Nutch with the patch mentioned above).
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033101.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>

-- 
*Lewis*

Re: how to crawl image document only with nutch ?

Posted by Tejas Patil <te...@gmail.com>.

If you just want to crawl images and don't want any HTML pages, add
rules to regex-urlfilter.txt so that it accepts only image URLs
(jpg / gif / png / ico / bmp) and rejects the rest.
Remove all the existing rules from the file and add this:

+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$

-.
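
For anyone unsure how these two rules are evaluated, here is a minimal Python sketch of the first-match behaviour as I understand it for regex-urlfilter.txt: rules are tried top to bottom, the first pattern found in the URL decides ("+" accepts, "-" rejects), and a URL that matches no rule is rejected. The function names and example.com URLs are hypothetical; this only emulates the filter and is not Nutch code.

```python
import re

# The two rules from the message above, in file order.
RULES_TEXT = r"""
+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the decision of the first rule whose pattern is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    # Hypothetical URLs, only to exercise the rules.
    for url in (
        "http://example.com/logo.PNG",
        "http://example.com/index.html",
        "http://example.com/",
        "http://example.com/pic.jpg?size=2",
    ):
        print(url, "->", "accept" if accepts(url, RULES) else "reject")
```

In this emulation an HTML seed URL such as http://example.com/ is rejected as well, and so is an image URL that carries a query string, because of the "$" anchor. If the real filter behaves the same way, that may be related to the "no documents to fetch" symptom described in the original question.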


Thanks,

Tejas Patil



On Fri, Jan 18, 2013 at 10:43 AM, Eyeris Rodriguez Rueda <er...@uci.cu> wrote:

> Hi all.
>
> I'm trying to crawl image documents only (jpg, gif, png, ico, bmp), but
> unfortunately some HTML pages are being included in my index too. I have
> used the suffix-urlfilter.txt plugin to exclude .html, .php and .xml, but
> some HTML pages have no extension, and these are being inserted into my
> Solr index. I have also tried rejecting everything in regex-urlfilter.txt
> and permitting only these image types, but then Nutch says there are no
> documents to fetch. I'm using Nutch 1.4 and Solr 3.6.
> Can anybody help me, or point me in the right direction, to crawl only the
> documents that I want?
> Thanks in advance.
>
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
> INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
>

how to crawl image document only with nutch ?

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.

Hi all.

I'm trying to crawl image documents only (jpg, gif, png, ico, bmp), but unfortunately some HTML pages are being included in my index too. I have used the suffix-urlfilter.txt plugin to exclude .html, .php and .xml, but some HTML pages have no extension, and these are being inserted into my Solr index. I have also tried rejecting everything in regex-urlfilter.txt and permitting only these image types, but then Nutch says there are no documents to fetch. I'm using Nutch 1.4 and Solr 3.6.
Can anybody help me, or point me in the right direction, to crawl only the documents that I want?
Thanks in advance.

10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

Re: Using Nutch with Boilerpipe

Posted by "J. Gobel" <jj...@gmail.com>.

Hi,

How can I use boilerpipe with Nutch 2.1?

This is what I have so far (these instructions are for 1.6; I cannot find
anything for 2.1):


4. Delete the following lines from runtime/local/conf/parse-plugins.xml:
        <mimeType name="text/html">
                <plugin id="parse-tika" />
        </mimeType>

        <mimeType name="application/xhtml+xml">
                <plugin id="parse-tika" />
        </mimeType>

5. Add the following lines to runtime/local/conf/nutch-site.xml:
        <property>
                <name>tika.boilerpipe</name>
                <value>true</value>
        </property>
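
To double-check that the two edits above took effect, something like the following Python sketch could be used. It assumes the usual layout of these files: a `<configuration>` root holding `<property>` entries in nutch-site.xml, and `<mimeType>` elements holding `<plugin id="...">` children in parse-plugins.xml. The helper names and the inline sample XML (including the rss/feed mapping) are hypothetical, and `tika.boilerpipe` is simply the property name quoted in step 5.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for runtime/local/conf/nutch-site.xml after step 5.
NUTCH_SITE = """<configuration>
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
</configuration>"""

# Hypothetical stand-in for parse-plugins.xml after step 4: the text/html
# entry has been deleted, and one unrelated mapping is left for illustration.
PARSE_PLUGINS = """<parse-plugins>
  <mimeType name="application/rss+xml">
    <plugin id="feed" />
  </mimeType>
</parse-plugins>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.iter("property")
    }


def plugins_for(xml_text, mime_type):
    """Return the plugin ids explicitly mapped to mime_type, in file order."""
    root = ET.fromstring(xml_text)
    return [
        plugin.get("id")
        for mapping in root.iter("mimeType")
        if mapping.get("name") == mime_type
        for plugin in mapping.findall("plugin")
    ]


if __name__ == "__main__":
    print("tika.boilerpipe =", read_properties(NUTCH_SITE).get("tika.boilerpipe"))
    print("text/html plugins:", plugins_for(PARSE_PLUGINS, "text/html"))
```

With real files, the two strings would be replaced by the contents of the files under runtime/local/conf. An empty list for text/html only means that no explicit mapping is left for that MIME type.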

I test with:

bin/nutch parsechecker -dumpText http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html

But that doesn't give me the desired result.

Thanks in advance,

Jaap

On Wed, Jan 16, 2013 at 3:54 PM, kemical <mi...@gmail.com> wrote:

> Outlink extraction is not mandatory for me, since what matters most is the
> main content.
>
> Also, are there any options for the plugin to extract HTML tags rather than
> raw plain text without line breaks (sometimes I get tags, but most of the
> time I don't), or at least to convert breaks to "\n", so that the displayed
> main content is more readable?
>
> When I try the URL here: http://boilerpipe-web.appspot.com/ I do get them.
>
> (But I guess that could be because the Tika boilerpipe version is an older one.)
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033868.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

RE: Using Nutch with Boilerpipe

Posted by kemical <mi...@gmail.com>.

Outlink extraction is not mandatory for me, since what matters most is the
main content.

Also, are there any options for the plugin to extract HTML tags rather than
raw plain text without line breaks (sometimes I get tags, but most of the
time I don't), or at least to convert breaks to "\n", so that the displayed
main content is more readable?

When I try the URL here: http://boilerpipe-web.appspot.com/ I do get them.

(But I guess that could be because the Tika boilerpipe version is an older one.)



--
View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033868.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Using Nutch with Boilerpipe

Posted by Markus Jelsma <ma...@openindex.io>.

Keep in mind that you don't have proper outlink extraction now. 
 
-----Original message-----
> From:kemical <mi...@gmail.com>
> Sent: Tue 15-Jan-2013 10:57
> To: user@nutch.apache.org
> Subject: RE: Using Nutch with Boilerpipe
> 
> Hi and thanks Markus!
> 
> Indeed, the patch works fine with 1.6 :)
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033416.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

RE: Using Nutch with Boilerpipe

Posted by kemical <mi...@gmail.com>.

Hi and thanks Markus!

Indeed, the patch works fine with 1.6 :)



--
View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033416.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Using Nutch with Boilerpipe

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

The patch should work fine with 1.6; I don't remember any changes to parse-tika that would cause errors when patching. If it does fail, you can either work it out and submit a new patch, or I'll take a look when I get to it.

Nevertheless, I should commit it, and more patches, anyway.

Cheers,
 
-----Original message-----
> From:kemical <mi...@gmail.com>
> Sent: Mon 14-Jan-2013 22:11
> To: user@nutch.apache.org
> Subject: Re: Using Nutch with Boilerpipe
> 
> Hi,
> 
> I've just installed Nutch 1.6 and Solr 3.6.2, and I'd like to know if it's
> possible to use boilerpipe with it (or if I should install an older version
> of Nutch with the patch mentioned above).
> 
> 
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033101.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>