Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2017/05/24 15:07:23 UTC

Re: [MASSMAIL]Problems with crawling images (pretty basic stuff)

Hello Filip.
I have also used Nutch as an image crawler, and I am using Nutch 1.12 too.
The seed list is only your initial page(s).
In my (limited) experience you need to crawl all URLs, because that is the only way to discover links to images. Normally you don't have a list of image URLs up front, so you need to parse the pages in order to extract the image links.
We implemented an extra parser because, as far as I know, Nutch is not able to make thumbnails of images.
I agree that you need an indexing filter to index only images (you can use mimetype-filter): with this configuration, deny everything and allow a document only if its MIME type contains "image".
-
image
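The two lines above appear to be the filter file itself: "-" denies everything by default, and "image" then allows any MIME type containing that string. A sketch of what the setup might look like (the file name, its deny/allow syntax, and the plugin list shown here are assumptions based on this description, not verified against the plugin's docs):

```
# conf/mimetype-filter.txt (file name and syntax assumed):
# deny all MIME types by default...
-
# ...then allow any MIME type containing "image" (image/jpeg, image/png, ...)
image
```

The plugin must also appear in `plugin.includes` in nutch-site.xml, for example:

```
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|mimetype-filter|indexer-solr</value>
</property>
```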

Nutch is very good at extracting links from pages.
Tell me if you want to know more; maybe I can help you.
Best.






----- Original Message -----
From: "Filip Stysiak" <st...@gmail.com>
To: user@nutch.apache.org
Sent: Wednesday, May 24, 2017 6:58:29
Subject: [MASSMAIL]Problems with crawling images (pretty basic stuff)

Hello everyone,

It's the first time I'm posting here, so I hope I won't embarrass
myself too much (I don't consider myself an advanced Nutch user yet).

I'm developing an application that uses Nutch as a complementary tool to
extract and index images on chosen websites. I wrote a plugin that indexes
the images in Solr (I'm using Nutch 1.12 and Solr 4.10.4 at the moment), and
it was a relatively easy task. However, now that I test the crawler on
various websites, I've realized that while the plugin successfully works on
direct image addresses (that is, if I put them in seed.txt), problems arise
if the site refers to the images via a gallery or even non-standard HTML
attributes. For example, on a trusted website I test on, I didn't retrieve
the images that sat in <div> tags with the image URL stored like:

 <div src="http://placeholder-blank-image-that-nutch-reached.jpg" data-src="http://link to actual image.jpg">
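For what it's worth, the general idea of walking the HTML tree yourself can be illustrated outside Nutch. The sketch below is plain Python with the standard-library parser, not the Nutch plugin API, and the attribute names beyond data-src are just common lazy-loading conventions, not anything Nutch knows about; in a real parse filter the collected URLs would be added as outlinks so the fetcher retrieves the images in a later round.

```python
from html.parser import HTMLParser

class ImageLinkExtractor(HTMLParser):
    """Collect candidate image URLs from src, data-src and similar attributes."""

    # Attribute names commonly used by lazy-loading galleries
    # (an assumption; adjust for the sites you actually crawl).
    CANDIDATE_ATTRS = ("src", "data-src", "data-original", "data-lazy-src")

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        for name, value in attrs:
            if name in self.CANDIDATE_ATTRS and value and value.lower().endswith(
                (".jpg", ".jpeg", ".png", ".gif", ".webp")
            ):
                self.image_urls.append(value)

html = (
    '<div src="http://example.com/placeholder.jpg" '
    'data-src="http://example.com/actual-image.jpg"></div>'
)
parser = ImageLinkExtractor()
parser.feed(html)
print(parser.image_urls)
# → ['http://example.com/placeholder.jpg', 'http://example.com/actual-image.jpg']
```

Note that both the placeholder and the real image are collected; deciding which one to keep (e.g. preferring data-src when both are present) is site-specific.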

As a matter of fact, most of the images stored on any kind of website don't
seem to get crawled unless I inject their URLs directly (not just the main
page that hosts them).

When it comes to configuration, I did the basic steps I could find
outside this mailing list, namely:
- I commented out the image extensions in regex-urlfilter.txt
- I turned off the content size limits
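For reference, those two tweaks usually amount to something like the following; the suffix list in the default rule varies between Nutch versions, so treat this as a sketch rather than the exact shipped file.

```
# conf/regex-urlfilter.txt: the default rule that skips image/media
# suffixes, commented out so image URLs survive the filter:
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# accept anything else
+.
```

```
<!-- nutch-site.xml: disable content truncation (-1 = no limit)
     so full image bytes are fetched -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```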

I also turned off external links (at least in the development stage) but
I'm positive it's not an issue for the test site.

I feel like the issue I'm having shouldn't require anything beyond the
ordinary Nutch extension points; what am I missing here? Am I supposed to
traverse the HTML tree myself to extract all the interesting links? If so,
which extension point do I use? And since I want a separate document for
each image in my index, how do I generate more documents for indexing (for
example, if I were to find the image links in a parsing plugin)?

Thanks in advance,
Filip
The @universidad_uci is Fidel. We, the young, will not fail.
#HastaSiempreComandante
#HastalaVictoriaSiempre


Re: [MASSMAIL]Problems with crawling images (pretty basic stuff)

Posted by Filip Stysiak <st...@gmail.com>.
I do know that normally I don't possess a list of specific image URLs.
The thing is that Nutch does not seem to extract the images properly. Like
I said, when testing it on different websites it's usually hit-or-miss
whether the image URLs get fetched or not, and I'm interested in
consistency.

All that mimetype-filter does is limit the indexing step to the URLs that
link to the images, and only those documents end up in Solr.
