Posted to user@nutch.apache.org by Mohamed Parvez <pa...@gmail.com> on 2009/09/01 23:25:57 UTC

Nutch truncating URL to 318 Chars

I am trying to index a site that has URLs 325 chars long, and it is
failing.


I started with 2 URLs in the urls/seed.txt file, both 325 chars long; the
only difference between them is the last 3 chars on the right.

I ran the following 2 commands:

$ bin/nutch inject crawl/crawldb urls
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done

$ bin/nutch readdb crawl/crawldb -dump dump
CrawlDb dump: starting
CrawlDb db: crawl/crawldb
CrawlDb dump: done


I opened the part-00000 file in the dump folder, and there is only ONE URL
in it, and it has been truncated to 318 chars.


How can I make Nutch accept URLs longer than 318 chars?

----
Thanks/Regards,
Parvez

Re: Nutch truncating URL to 318 Chars

Posted by Alexey Torochkov <al...@gmail.com>.
See regex-urlnormalizer.xml. By default it strips
(sid|phpsessid|sessionid)=.*? (case-insensitive).
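
To illustrate the effect, here is a minimal Python sketch. The pattern is
only an approximation of the rule described above, rewritten for Python's
re module; it is an assumption for illustration and is not copied from any
Nutch configuration file. The helper name strip_session_ids and the
shortened example.com URLs are hypothetical as well.

```python
import re

# Assumed approximation of the session-ID rule: a case-insensitive
# "sid", "phpsessid" or "sessionid" parameter name, its value, and the
# delimiter (or end of string) that follows the value.
SESSION_ID_RULE = re.compile(r"(?i:sid|phpsessid|sessionid)=.*?(\?|&|#|$)")


def strip_session_ids(url: str) -> str:
    """Remove anything that looks like a session-ID parameter, keeping the delimiter."""
    return SESSION_ID_RULE.sub(r"\1", url)


# Hypothetical shortened URLs that differ only in the productsId value.
urls = [
    "http://example.com/smb?_nfpb=true&MarketPlacePFController_1productsId=386",
    "http://example.com/smb?_nfpb=true&MarketPlacePFController_1productsId=443",
]

normalized = [strip_session_ids(u) for u in urls]
for before, after in zip(urls, normalized):
    print(len(before), "->", len(after), after)

# Both inputs collapse to the same string, so a store keyed by URL keeps one entry.
print(len(set(normalized)))
```

Because "productsId=" ends in "sId=", a case-insensitive match on "sid="
fires inside the longer parameter name. That would be consistent with a URL
losing exactly 7 chars (325 down to 318) when "sId=386" is removed.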
-- 
Alexey Torochkov

Re: Nutch truncating URL to 318 Chars

Posted by Mohamed Parvez <pa...@gmail.com>.
It truncates "sId=386".

It looks like the URL is not getting truncated; instead, the "sId=386" part
is being removed from all URLs.

I tried using the string type for the url field in conf/schema.xml, but got
the same results.

I tried injecting http://business.verizon.net/, but when the crawl reaches
these URLs later during parsing, it stores only one even though there are
many, because the truncated URLs are all the same.

I am sure the web server does not limit it, as I can see the full URL in the
browser.

Contents of urls/seed.txt :
-------------------------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=/pageflows/verizon/smb/portal/marketPlacePF/getProductDetails&MarketPlacePFController_1productsId=443
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=49


Contents of dump/part-00000 :
-------------------------------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1product
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Sep 01 17:18:05 CDT 2009
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:

http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=/pageflows/verizon/smb/portal/marketPlacePF/getProductDetails&MarketPlacePFController_1product
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Sep 01 17:18:05 CDT 2009
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:



----
Thanks/Regards,
Parvez
GV : 786-693-2228


On Tue, Sep 1, 2009 at 5:16 PM, Fuad Efendi <fu...@efendi.ca> wrote:

> What does it truncate: 'http://' or 'sId=386'? Or something inside the URL?
>
>
> Just inject http://business.verizon.net/ ... Nutch should find the rest...
>
> I believe Nutch doesn't have any limit on URL length, although some web
> servers are limited to 4000...
>
>
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=386
> >
> > Thanks/Regards,
> > Parvez
> >
> >
> >
> > On Tue, Sep 1, 2009 at 4:43 PM, Fuad Efendi <fu...@efendi.ca> wrote:
> >
> > > > I opened the part-00000 file in the dump folder and there, is only ONE url
> > > > and it has been truncated to 318 chars
> > > > How make Nutch consider URLs with length more than 318 chars
> > >
> > > Please provide original (before truncating) sample of such URL
> > > Thanks

RE: Nutch truncating URL to 318 Chars

Posted by Fuad Efendi <fu...@efendi.ca>.
What does it truncate: 'http://' or 'sId=386'? Or something inside the URL?


Just inject http://business.verizon.net/ ... Nutch should find the rest...

I believe Nutch doesn't have any limit on URL length, although some web
servers are limited to 4000...


> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=386
> 
> Thanks/Regards,
> Parvez
> 
> 
> 
> On Tue, Sep 1, 2009 at 4:43 PM, Fuad Efendi <fu...@efendi.ca> wrote:
> 
> > > I opened the part-00000 file in the dump folder and there, is only ONE url
> > > and it has been truncated to 318 chars
> > > How make Nutch consider URLs with length more than 318 chars
> >
> > Please provide original (before truncating) sample of such URL
> > Thanks



Re: Nutch truncating URL to 318 Chars

Posted by Mohamed Parvez <pa...@gmail.com>.
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=386

Thanks/Regards,
Parvez



On Tue, Sep 1, 2009 at 4:43 PM, Fuad Efendi <fu...@efendi.ca> wrote:

> > I opened the part-00000 file in the dump folder and there, is only ONE url
> > and it has been truncated to 318 chars
> > How make Nutch consider URLs with length more than 318 chars
>
> Please provide original (before truncating) sample of such URL
> Thanks

RE: Nutch truncating URL to 318 Chars

Posted by Fuad Efendi <fu...@efendi.ca>.
> I opened the part-00000 file in the dump folder and there, is only ONE url
> and it has been truncated to 318 chars
> How make Nutch consider URLs with length more than 318 chars

Please provide an original (untruncated) sample of such a URL.
Thanks