Posted to user@nutch.apache.org by glumet <ja...@gmail.com> on 2013/07/04 11:43:42 UTC

New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

Hello everybody,

I am trying to crawl a few websites from my seed.txt with the new Nutch 2.1
crawl script bin/crawl. The problem is that every time I run the script, it
does not fetch or parse anything (no URLs); every URL is reported with the
message "Skipping [concrete url]; different batch id ([some batch id])".

Here is some output from the log:

Start old crawling linked TV:
InjectorJob: starting
InjectorJob: urlDir: /opt/ir/nutch/urls
InjectorJob: finished

It looks like the injection of URLs was OK...

Sun Jun 30 22:18:07 CEST 2013 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: topN: 50000
GeneratorJob: done
GeneratorJob: generated batch id: 1372623488-1201848586
InjectorJob: starting
InjectorJob: urlDir: /opt/ir/nutch/urls
InjectorJob: finished
Fetching : 
FetcherJob: starting
FetcherJob: batchId: 1372623487-26323
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1372624103280
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0

... and it continues likewise through FetcherThread48 ...

Fetcher: throughput threshold: -1
-finishing thread FetcherThread49, activeThreads=0
-finishing thread FetcherThread36, activeThreads=0
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing : 
ParserJob: starting
ParserJob: resuming:	false
ParserJob: forced reparse:	false
ParserJob: batchId:	1372623487-26323
Skipping http://www.brugge.be/internet/en/musea/bruggemuseum/stadhuis/index.htm; different batch id (1372590913-1016555835)
Skipping http://www.galloromeinsmuseum.be/; different batch id (1372590913-1016555835)
Skipping http://www.museumdrguislain.be/; different batch id (1372590913-1016555835)
Skipping http://www.muzee.be/; different batch id (1372590913-1016555835)
Skipping http://musea.sint-niklaas.be/; different batch id (1372590913-1016555835)
Skipping http://www.the-athenaeum.org/; different batch id (1372590913-1016555835)
Skipping http://the-athenaeum.org/; different batch id (1372590913-1016555835)
Skipping http://viaf.org/; different batch id (1372590913-1016555835)

... and it skips more URLs from my seed ... yes, from the seed, because my
seed.txt contains exactly this:

http://www.brugge.be/internet/en/musea/bruggemuseum/stadhuis/index.htm
http://www.galloromeinsmuseum.be/
http://www.museumdrguislain.be/

etc.

ParserJob: success
CrawlDB update
DbUpdaterJob: starting
Limit reached, skipping further inlinks for de.ard.www:http/
Limit reached, skipping further inlinks for de.rbb-online.mediathek:http/
Limit reached, skipping further inlinks for de.rbb-online.www:http/
DbUpdaterJob: done

Do you know where the problem is, please? I have read here
http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_logging_shows_Skipping_http:.2F.2FmyurlForParsing.com.3B_different_batch_id_.28null.29
about the second inject:

"Null values are possible, too, think about these steps: inject -> generate
-> inject -> fetch. The second inject will leave entries in the db without
fetchmarks seen by the fetcher later. "

but that seems to apply to "different batch id (null)" and it is not my case...
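For what it is worth, the skip itself is easy to reproduce outside Nutch: the generator stamps each selected row with the batch id it generated, and the fetcher/parser only process rows whose stored mark equals the batch id they were given. A rough Python model of that check (the dict fields and function names here are mine for illustration, not Nutch's actual API):

```python
# Rough model of Nutch 2.x batch-id marks (illustrative only; the field
# and function names are made up, not Nutch's actual API).

def generate(rows, batch_id):
    """GeneratorJob: stamp every selected row with the new batch id."""
    for row in rows:
        row["batch_id"] = batch_id

def fetch(rows, batch_id):
    """FetcherJob/ParserJob: process only rows stamped with *this* id."""
    processed, skipped = [], []
    for row in rows:
        if row.get("batch_id") == batch_id:
            processed.append(row["url"])
        else:
            # This is the "Skipping ...; different batch id (...)" case.
            skipped.append((row["url"], row.get("batch_id")))
    return processed, skipped

# A row stamped by an earlier generate, processed with a newer batch id:
rows = [{"url": "http://www.muzee.be/", "batch_id": "1372590913-1016555835"}]
done, skipped = fetch(rows, "1372623487-26323")
```

So whenever the crawl script passes the fetcher/parser a different id than the one the generator actually wrote, every row is skipped, which matches the log above.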

JB



--
View this message in context: http://lucene.472066.n3.nabble.com/New-script-bin-crawl-skipping-urls-different-batch-id-XXXXXXXX-YYYYYYYYY-tp4075441.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Re:Re: Re:Re: New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

Posted by glumet <ja...@gmail.com>.
Maybe I have found a solution. The problem is in the integration. I am trying
to integrate Nutch 2.2.1 with HtmlUnit 2.12 because I am working on crawling
video and podcasts... so I need the rendered source code of every single
page I want to crawl. And this is the pain... it fails (I think) because of
a library conflict (between httpclient-4.2.5.jar and htmlunit-2.12.jar):

Caused by: java.lang.RuntimeException: java.lang.NoSuchMethodException:
org.apache.http.conn.ssl.SSLSocketFactory.createDefaultSSLContext()
        at com.gargoylesoftware.htmlunit.HtmlUnitSSLSocketFactory.createSSLContext(HtmlUnitSSLSocketFactory.java:119)
        at com.gargoylesoftware.htmlunit.HtmlUnitSSLSocketFactory.<init>(HtmlUnitSSLSocketFactory.java:102)
        at com.gargoylesoftware.htmlunit.HtmlUnitSSLSocketFactory.buildSSLSocketFactory(HtmlUnitSSLSocketFactory.java:77)
        at com.gargoylesoftware.htmlunit.HttpWebConnection.configureHttpsScheme(HttpWebConnection.java:608)
        at com.gargoylesoftware.htmlunit.HttpWebConnection.createHttpClient(HttpWebConnection.java:555)
        at com.gargoylesoftware.htmlunit.HttpWebConnection.getHttpClient(HttpWebConnection.java:518)
        at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:155)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1486)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1403)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:305)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
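One way to confirm a conflict like this is to check whether more than one jar on the classpath bundles the class the NoSuchMethodException points at. A hedged diagnostic sketch (the lib directory path is an assumption; adjust it for your layout, e.g. runtime/local/lib for a local Nutch 2.x build):

```shell
# List every jar in a lib directory that bundles the SSLSocketFactory class
# HtmlUnit is failing on. More than one hit, or an unexpected version, points
# to a classpath conflict. The directory argument is an assumption -- adjust
# it for your own layout.
find_sslfactory_jars() {
  local libdir="$1"
  local jar
  for jar in "$libdir"/*.jar; do
    [ -e "$jar" ] || continue
    if unzip -l "$jar" 2>/dev/null \
         | grep -q 'org/apache/http/conn/ssl/SSLSocketFactory.class'; then
      echo "$jar"
    fi
  done
}

# usage: find_sslfactory_jars runtime/local/lib
```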





Re:Re: Re:Re: New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

Posted by RS <ti...@163.com>.
Oh,
    Please paste your regex-urlfilter.txt content, and show the error log and the complete command you use.


Thanks
HeChuan




At 2013-07-05 14:46:02, glumet <ja...@gmail.com> wrote:
>Thanks for your reply. Unfortunately, I have to report that it did not help :(.

Re: Re:Re: New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

Posted by glumet <ja...@gmail.com>.
Thanks for your reply. Unfortunately, I have to report that it did not help :(.




Re:Re: New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

Posted by RS <ti...@163.com>.
Hi:
    You have written the wrong rules in the conf/regex-urlfilter.txt file.
    You should change it like this:
     +^http://www.eisbaeren.de/*
    Then you will get log output like this:
    fetching http://www.eisbaeren.de/club/partner/ (queue crawl delay=5000ms)
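A rule like this can be sanity-checked outside Nutch: a `+` rule accepts a URL when its regex matches. A quick approximation in Python (this models only the single regex, not Nutch's full RegexURLFilter rule chain):

```python
import re

# The suggested rule above, checked in isolation (an approximation of a
# single "+" rule, not Nutch's actual filter).
rule = re.compile(r"^http://www.eisbaeren.de/*")

# The anchored prefix matches the site root and any page under it:
assert rule.match("http://www.eisbaeren.de/")
assert rule.match("http://www.eisbaeren.de/club/partner/")

# A URL on another host does not match:
assert rule.match("http://www.example.com/") is None
```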


Thanks 
HeChuan





Re: New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

Posted by glumet <ja...@gmail.com>.
OK, as I have written, the problem was the old version of Nutch (2.1).
After updating to 2.2.1 the message about a different batch id disappeared,
but I have a new problem now.

Every time I start the script bin/crawl, it fetches only the URLs from the
seed (no further pages):

fetching http://www.museumhetvalkhof.nl/ (queue crawl delay=5000ms)
fetching http://www.eisbaeren.de/ (queue crawl delay=5000ms)
fetching http://www.s-bahn-berlin.de/ (queue crawl delay=5000ms)

...but I want it to also fetch and then parse

fetching http://www.museumhetvalkhof.nl/something.html
fetching http://www.eisbaeren.de/something/something.html

etc...

Where is the problem, please?

The URLs in my seed are defined like:

http://www.funkhauseuropa.de/
http://www.swr.de/
http://www.swrmediathek.de/

And regex-urlfilter.txt:

+^http://([a-z0-9]*\.)*funkhauseuropa.de/
+^http://([a-z0-9]*\.)*swr.de/
+^http://([a-z0-9]*\.)*swrmediathek.de/
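A quick way to check whether these rules are what limits the crawl to the seeds is to try them outside Nutch (a rough approximation of `+` rules only, not Nutch's actual RegexURLFilter):

```python
import re

# The three "+" rules from regex-urlfilter.txt, checked in isolation.
rules = [
    re.compile(r"^http://([a-z0-9]*\.)*funkhauseuropa.de/"),
    re.compile(r"^http://([a-z0-9]*\.)*swr.de/"),
    re.compile(r"^http://([a-z0-9]*\.)*swrmediathek.de/"),
]

def accepted(url):
    """A URL passes if any '+' rule matches it."""
    return any(r.match(url) for r in rules)

# The seeds pass, and so do pages below them -- so the filter alone
# should not restrict fetching to the seed URLs:
assert accepted("http://www.swr.de/")
assert accepted("http://www.swr.de/something.html")
assert not accepted("http://www.eisbaeren.de/")
```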





Re: New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

Posted by glumet <ja...@gmail.com>.
I forgot to say that I am using Nutch version 2.1 ...


