Posted to user@nutch.apache.org by ytthet <ye...@gmail.com> on 2013/02/23 02:52:18 UTC

Crawling URLs with query string while limiting only web pages

Hi Folks,

I have a question on crawling URLs with query strings. I am crawling about
10,000 sites. Some of the sites use query strings to serve their content, while
others use simple URLs. For example, I have the following cases:

Case 1:

site1.com/article1
site1.com/article2

Case 2:
site2.com/?pid=123
site2.com/?pid=124

The only way to crawl and fetch web pages/articles in case 2 is to fetch URLs
with a query string ("?"), while for case 1 I can choose NOT to fetch URLs
containing "?". So, to make my crawler fetch URLs with query strings, I
commented out the following line in my regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=] 

The above setting causes the crawler to fetch all URLs, including URLs with
query strings, so pages such as downloads, logins, comments, search queries,
printer-friendly pages, zoom-in views and other low-value pages are being
fetched. Practically, the crawler is going into the deep web. The undesirable
consequences of this are as follows:

1. Duplicate pages are being fetched, causing the crawl DB to bloat
- Printer friendly view, zoom in view
e.g. site1.com/article1
e.g. site1.com/article1/?view=printerfriendly
e.g. site1.com/article1/?zoom=large
e.g. site1.com/article1/?zoom=extralarge

2. Download pages are being fetched, causing the segments to become too large
e.g. site1/com/getcontentID?id=1&format=pdf
e.g. site1/com/getcontentID?id=1&format=doc

3. Crawling takes a very long time (10 days for depth 5) since it is going into
the deep web.

My current solution to the problem is to add additional regexes to
regex-urlfilter.txt to prevent the crawler from fetching undesired pages.
Now I have other problems:
1. The set of regexes excluding undesired URL patterns is never exhaustive,
since there are many sites and many patterns. So the crawler is still going
into the deep web.
2. The exclusion filter list is getting too long: so far 50 regexes to exclude
URL patterns.
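
To illustrate the kind of exclusion rule I mean (the parameter names are just
the ones from the examples above, not a complete list), the entries look
roughly like this in regex-urlfilter.txt, placed above the catch-all accept
rule at the bottom of the file:

# skip printer-friendly and zoom views (illustrative patterns)
-[?&](view|zoom)=
# skip document download/export URLs
-[?&]format=(pdf|doc)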

I hope I am not the only one with this problem and that someone knows a
smarter way to solve it. Does anybody have a solution or suggestion on how
to approach it? Some tips or direction would be very much appreciated.

Btw, I am using Nutch 1.2, but I believe the crawler principle is pretty much
the same.

Warm Regards,

Ye





--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Crawling URLs with query string while limiting only web pages

Posted by Ye T Thet <ye...@gmail.com>.
Feng Lu: Thanks for the tip. I will definitely try the approach. I appreciate
your help.

Tejas: I am using the grepping approach, filtering out some keywords from
the fetch log. So far so good; I observed that 20% of the fetched list was
filled with not-so-important URLs. I hope an optimized filter can do some
good for my crawler's performance.

Thanks for your directions.

Cheers,

Ye




Re: Crawling URLs with query string while limiting only web pages

Posted by Tejas Patil <te...@gmail.com>.
@Ye, you need not look at each URL. Random sampling will be better: it won't
be accurate, but it is the practical thing to do. Even while going through the
logs, extract the URLs and sort them so that all of those belonging to the same
host lie in the same group.

@feng lu: +1. Good trick to remove the bad URLs using normalization. The
main problem in front of the OP will still be coming up with such rules by
manually observing the logs.

Thanks,
Tejas Patil



Re: Crawling URLs with query string while limiting only web pages

Posted by feng lu <am...@gmail.com>.
Hi Ye

You can add this pattern to the regex-normalize.xml configuration file for the
RegexURLNormalizer class:

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

It will remove session IDs, as well as the view and zoom parameters, from URLs.

e.g. site1.com/article1/?view=printerfriendly
e.g. site1.com/article1/?zoom=large
e.g. site1.com/article1/?zoom=extralarge

to

e.g. site1.com/article1
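
If other noisy parameters with the same key=value shape show up in your logs,
a second rule of the same form could be added next to it. A rough sketch, where
the parameter names (print, format) are only examples taken from this thread
and should be adjusted to what your logs actually show:

<!-- strips other presentation/format parameters the same way (names are illustrative) -->
<regex>
  <pattern>([;_]?((?i)print|format)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$3</substitution>
</regex>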








-- 
Don't Grow Old, Grow Up... :-)

Re: Crawling URLs with query string while limiting only web pages

Posted by Ye T Thet <ye...@gmail.com>.
Tejas,

Thanks for your pointers. They are really helpful. As of now my approach
follows your directions 1, 2 and 3. Since my sites are only around 10k in
number, I hope it will be manageable for the near future.

I might need to apply your directions 4 and 5 in the future as well, but I
believe getting those right might be out of my league.

Some extra information on my approach: most of my target sites are using a
CMS, and quite a number of them DO NOT use pretty URLs. I have been grepping
the log to identify the patterns of redundant or non-important URLs and adding
regex rules to regex-urlfilter.txt. 2 million URLs is quite hard to process for
one man though. Phew!

I will share if I can find an approach that could benefit us all.

Regards,

Ye


Re: Crawling URLs with query string while limiting only web pages

Posted by Tejas Patil <te...@gmail.com>.
One correction to my message below: in point 5, the cross-reference should read #4, not #5.


Re: Crawling URLs with query string while limiting only web pages

Posted by Tejas Patil <te...@gmail.com>.
I think that what you have done till now is logical. Typically in Nutch
crawls people don't want URLs with query strings, but nowadays things have
changed. For instance, category #2 you pointed out may capture some vital
pages. I once ran into a similar issue. A crawler can't be made intelligent
beyond a certain point, and I had to go through the crawl logs to check which
URLs were being fetched and then refine my regex rules.

Some things that I had considered doing:
1. Start off with rules which are less restrictive and observe the logs to see
which URLs are visited. This will give you an idea about the bad URLs and
the good ones. As you have already crawled for 10 days, you are (just !!)
left with studying the logs.
2. After #1 is done, launch crawls with accept rules for the good URLs and
put a "-." at the end to reject the bad URLs (see the sketch after this list).
3. Having a huge list of regexes is a bad thing, because comparing URLs
against regexes is a costly operation that is done for every URL; a URL that
matches early saves this time. So put the patterns which capture a huge set
of URLs at the top of the regex urlfilter file.
4. Sometimes you don't want the parser to extract URLs from certain areas
of the page, because you know they are not going to yield anything good for
you. Let's say that the "print" or "zoom" URLs are coming from some specific
tags of the HTML source. It is better not to parse those things and thus not
have those URLs in the first place. The profit here is that the regex rules
to be defined are reduced.
5. An improvement over #4 is that if you know the nature of the pages being
crawled, you can tweak the parsers to extract URLs from specific tags only.
This reduces noise and gives a much cleaner fetch list.
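
A minimal sketch of such a filter file, assuming the good patterns have
already been identified from the logs (site1.com/article and site2.com/?pid=
are just the example URLs from this thread):

# accept rules for the known good URLs first, broadest patterns at the top
+^http://site1\.com/article
+^http://site2\.com/\?pid=\d+
# reject everything else
-.

Since the first matching rule in regex-urlfilter.txt decides the fate of a URL,
anything not explicitly accepted above falls through to the final "-." and is
skipped.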

As far as I can tell, this problem won't have an automated solution like
modifying some config setting. A decent amount of human intervention is
required to get things right. Knowing the nature of the pages you plan to
crawl is vital for making smart decisions.

Thanks,
Tejas Patil

