Posted to user@nutch.apache.org by Filip Stysiak <st...@gmail.com> on 2017/05/24 10:58:29 UTC

Problems with crawling images (pretty basic stuff)

Hello everyone,

It's the first time I'm posting here, so I hope I won't embarrass
myself too much (I don't consider myself an advanced Nutch user yet).

I'm developing an application that uses Nutch as a complementary tool to
extract and index images on chosen websites. I wrote a plugin that indexes
the images in Solr (I'm using Nutch 1.12 and Solr 4.10.4 at the moment),
and that was a relatively easy task. However, now that I'm testing the
crawler on various websites, I've realized that while the plugin works
successfully on direct image addresses (that is, if I put them in
seed.txt), problems arise if the site refers to the images via a gallery
or non-standard HTML attributes. For example, on a trusted website I test
on, I didn't retrieve the images that sat in <div> tags with the image URL
stored like

  <div src="http://placeholder-blank-image-that-nutch-reached.jpg"
       data-src="http://link to actual image.jpg">

As a matter of fact, most of the images stored on any kind of website
don't seem to get crawled unless I inject their URLs directly (rather
than only injecting the main page that hosts them).

When it comes to configuration, I did the basic steps I could find
outside this mailing list, namely:
- I commented out the image extensions in regex-urlfilter
- I turned off the content size limits

I also turned off external links (at least during development), but
I'm positive that's not the issue for the test site.
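
Concretely, those two steps amount to something like the following (the
property name is from the stock Nutch 1.x configuration; the exact default
line in regex-urlfilter.txt varies between versions):

```xml
<!-- nutch-site.xml: lift the download size cap so images are
     fetched in full (-1 means no limit) -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```

and, in conf/regex-urlfilter.txt, putting a leading # on the default
suffix-exclusion line (the one beginning with -\.(gif|GIF|jpg|JPG|...))
so that image URLs pass the filter.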

I feel like the issue I'm having shouldn't require anything beyond
ordinary Nutch extensions; what am I missing here? Am I supposed to
traverse the HTML tree and extract all the interesting links myself? If
so, which extension point do I use? And since I want a separate document
for each image in my index, how do I generate additional documents for
indexing (for example, if I were to find the image links in a parsing
plugin)?

Thanks in advance,
Filip

Re: Problems with crawling images (pretty basic stuff)

Posted by BlackIce <bl...@gmail.com>.
www.andrzejstysiak.com and andrzejstysiak.com are two different sites
according to Internet logic, even if they "internally" reside on the same
webserver.
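
As a plain-Java illustration (URLs taken from this thread), the
distinction shows up in the host component, which is essentially what
Nutch compares when ignoring external links:

```java
import java.net.URI;

public class HostCheck {
    public static void main(String[] args) throws Exception {
        String page  = "http://andrzejstysiak.com/category/all";
        String image = "http://www.andrzejstysiak.com/wp-content/uploads/pic.jpg";

        String pageHost  = new URI(page).getHost();   // "andrzejstysiak.com"
        String imageHost = new URI(image).getHost();  // "www.andrzejstysiak.com"

        // Different hosts: with external links ignored, the image link is
        // discarded even though both live on the same domain.
        System.out.println(pageHost.equals(imageHost)); // false
    }
}
```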

On Wed, May 24, 2017 at 5:54 PM, Filip Stysiak <st...@gmail.com>
wrote:

> [...]

Re: Problems with crawling images (pretty basic stuff)

Posted by Filip Stysiak <st...@gmail.com>.
The image address is definitely an internal link. To be specific, I'm
testing Nutch on the site www.andrzejstysiak.com (I use it because it's
mostly a gallery). For example, when you look at
http://andrzejstysiak.com/category/all
you see images that refer you to URLs like
http://andrzejstysiak.com/wp-content/uploads/2017/03/obraz_74_78x60-1280x1682.jpg
(so it should be an internal link).

Thing is, when you look inside the site's source, you see the URLs are
there. But Nutch doesn't fetch them (it DOES fetch the blank.gif
placeholder image that you can also see in the <div>s that hold the
images, like I wrote in the original message).

What I'd like to do is somehow ensure that Nutch DOES reach those images,
or, in fact, any images that exist within the HTML source of the sites I
inject. Therefore I'd like to know whether there is a way to take control
over the discovery of new links, especially whether it's possible within
the realm of plugins.
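
A minimal, self-contained sketch of that extraction step (the class name
and the regex approach are illustrative; a real Nutch HtmlParseFilter
would walk the DocumentFragment the parser already provides and append
the found URLs to the page's outlinks):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DataSrcExtractor {
    // Matches data-src="..." attributes, the non-standard attribute that
    // lazy-loading galleries use instead of src.
    private static final Pattern DATA_SRC =
        Pattern.compile("data-src\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = DATA_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<div src=\"http://example.com/blank.gif\" "
                    + "data-src=\"http://example.com/real-image.jpg\"></div>";
        // The blank.gif placeholder in src is skipped; only the real image
        // URL is surfaced for the crawler to follow.
        System.out.println(extract(html)); // [http://example.com/real-image.jpg]
    }
}
```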

2017-05-24 17:08 GMT+02:00 BlackIce <bl...@gmail.com>:

> [...]

Re: Problems with crawling images (pretty basic stuff)

Posted by BlackIce <bl...@gmail.com>.
Hi Filip,

You mentioned that you commented out external links. What do the links
that point to the images look like? Do they start with "www.server.com"
or something like "images.server.com"? With external links turned off,
Nutch will interpret those links as external sites and thus not follow
them. Logic dictates that if the images are not on the same site, i.e.
"www.server.com", Nutch will not retrieve them unless they are in the
seed list, because "images.server.com" and "www.server.com" are two
different sites even though they are on the same domain.

Generally speaking, in Internet development a "site" and a "domain" are
two different things; even protocols are separate, i.e.
http://www.server.com is different from https://www.server.com.

Just my 2 cents; hope it's of use.

Greetings

Ralf

PS: could I have a look at your image retrieval plug-in? Thank you!

Re: [MASSMAIL]Problems with crawling images (pretty basic stuff)

Posted by Filip Stysiak <st...@gmail.com>.
I do know that normally I don't possess a list of specific image URLs.
Thing is, Nutch does not seem to extract the images properly. Like I
said, when testing it on different websites it's usually hit-or-miss
whether the image URLs get fetched, and I'm interested in consistency.

All the mimetype-filter does is limit the indexing step to the URLs that
link to the images, so that only those documents end up in Solr.

2017-05-24 17:07 GMT+02:00 Eyeris Rodriguez Rueda <er...@uci.cu>:

> [...]

Re: [MASSMAIL]Problems with crawling images (pretty basic stuff)

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Hello Filip.
I have also used Nutch as an image crawler; I am using Nutch 1.12 too.
The seed is only your initial page(s).
In my limited experience you need to crawl all URLs, because that is the
only way to discover links to images. Normally you don't have a list of
image URLs, so you need to process pages in order to extract image links.
We have implemented an extra parser, because I think Nutch is not able
(or I don't know how) to make thumbnails for images.
I agree that you need an indexing filter to index only images (you can
use mimetype-filter). With this configuration, deny everything and permit
a document only if its mimetype contains "image":

-
image

Nutch is very good at extracting links in pages.
Tell me if you want to know more; maybe I can help you.
Best.
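
That deny-all / allow-images rule boils down to a predicate like the
following (a standalone sketch; in a real Nutch indexing filter the MIME
type would come from the parse metadata, and a rejected document is
dropped by returning null):

```java
public class ImageOnlyRule {
    // Deny everything by default; let a document through only when its
    // MIME type marks it as an image (e.g. "image/jpeg", "image/png").
    public static boolean shouldIndex(String mimeType) {
        return mimeType != null && mimeType.startsWith("image/");
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("image/jpeg")); // true
        System.out.println(shouldIndex("text/html"));  // false
    }
}
```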






----- Original message -----
From: "Filip Stysiak" <st...@gmail.com>
To: user@nutch.apache.org
Sent: Wednesday, May 24, 2017 6:58:29
Subject: [MASSMAIL]Problems with crawling images (pretty basic stuff)

[...]