Posted to agent@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/09/02 18:00:38 UTC
downloading zip/exe files
Hi,
I am trying to use the Nutch fetcher to download EXE/ZIP files from web pages.
I've removed those suffixes from the exclusion rule in regex-urlfilter.txt and
automaton-urlfilter.txt (the two files are identical):
regex-urlfilter.txt:
--------------------------------------------------------------------------------------------------------
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept anything else
+.
------------------------------------------------------------------------------------------------------------------
When trying to download EXE:
http://www.xtodvd.com/apodvdcopy.exe
the fetch fails:
found segment crawl/segments/20070902084928
Fetching now the urls..
Fetcher: starting
Fetcher: segment: crawl/segments/20070902084928
Fetcher: threads: 1000
fetching http://www.xtodvd.com/apodvdcopy.exe
Error parsing: http://createdvd.net/apodvdcopy.exe: failed(2,200):
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/x-dosexec url=http://createdvd.net/apodvdcopy.exe
Fetcher: done
Fetching a ZIP file works, but how can I tell Nutch to save the ZIP to a
folder on the file system? Do I need to write a plugin?
thanks!
Eyal Edri
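
Editor's note: a minimal sketch of how the rules above would treat the EXE URL, assuming the filter applies rules top to bottom, lets the first matching rule decide, and matches each regex anywhere in the URL. UrlFilterSketch and fromThread() are hypothetical names for illustration and are not Nutch classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a "+regex" / "-regex" URL filter; not Nutch code.
class UrlFilterSketch {
    private final List<Boolean> accept = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    UrlFilterSketch(String... ruleLines) {
        for (String line : ruleLines) {
            // Skip blank lines and comments, as in the rule file.
            if (line.isEmpty() || line.startsWith("#")) continue;
            accept.add(line.charAt(0) == '+');
            patterns.add(Pattern.compile(line.substring(1)));
        }
    }

    // Assumption: the first rule whose regex is found in the URL decides.
    boolean accepts(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) return accept.get(i);
        }
        return false; // no rule matched
    }

    // The rules exactly as posted in this thread.
    static UrlFilterSketch fromThread() {
        return new UrlFilterSketch(
            "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$",
            "# skip URLs containing certain characters as probable queries, etc.",
            "-[?*!@=]",
            "# skip URLs with slash-delimited segment that repeats 3+ times, to break loops",
            "-.*(/.+?)/.*?\\1/.*?\\1/",
            "# accept anything else",
            "+.");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = fromThread();
        System.out.println(filter.accepts("http://www.xtodvd.com/apodvdcopy.exe")); // true
        System.out.println(filter.accepts("http://example.com/logo.gif"));          // false
    }
}
```

Under these assumptions the EXE URL passes the filter, which is consistent with the log: the URL is fetched and the failure only appears at the parsing step.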
Re: downloading zip/exe files
Posted by eyal edri <ey...@gmail.com>.
Hi,
I've been digging into the parse-zip plugin all day, trying to figure out how
I can change or add something so that it downloads the zip file to a local
folder on the filesystem.
I tried writing the "content.getContent()" byte array to a file, but I always
get an error on parsing the file, and nothing is reported about saving the
byte array (none of the System.out.println(...) calls I added show up
either... :( )
Does anyone have an idea how this can be done?
thanks,
eyal
On 9/4/07, Sagar Naik <sa...@visvo.com> wrote:
>
> Hey Eyal,
>
> There is no parser for "application/x-dosexec". You will have to write a
> plugin to parse exe files (have a look at the parse-zip plugin).
>
> Storing unzipped contents:
> Option 1:
> I think you can modify the parse-zip plugin's ZipParser class to store
> the unzipped contents at some desired location.
> Option 2:
> Or write a separate job to get the parse-text contents and store them
> at some desired location.
>
> - Sagar Naik
--
Eyal Edri
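
Editor's note: the attempt described in this message, writing the "content.getContent()" byte array to a file, might look roughly like the sketch below. It uses only the JDK. ContentSaver, save() and the "download.bin" fallback name are hypothetical, and the byte array is assumed to be whatever content.getContent() returns. Letting the IOException propagate, rather than relying on println output, makes a failed write visible to the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: stores fetched bytes under a local folder.
class ContentSaver {
    // Writes the bytes to targetDir, using the last path segment of the URL as
    // the file name. URI.create throws IllegalArgumentException for malformed URLs.
    static Path save(byte[] content, String url, Path targetDir) throws IOException {
        String path = URI.create(url).getPath();
        String name = (path == null) ? "" : path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "download.bin"; // assumed fallback when the URL has no usable file name
        }
        Files.createDirectories(targetDir);
        return Files.write(targetDir.resolve(name), content);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nutch-downloads");
        // Placeholder bytes standing in for content.getContent().
        byte[] bytes = {'P', 'K', 3, 4};
        Path saved = save(bytes, "http://createdvd.net/apodvdcopy.exe", dir);
        System.out.println("saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```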
Re: downloading zip/exe files
Posted by Sagar Naik <sa...@visvo.com>.
Hey Eyal,
There is no parser for "application/x-dosexec". You will have to write a
plugin to parse exe files (have a look at the parse-zip plugin).
Storing unzipped contents:
Option 1:
I think you can modify the parse-zip plugin's ZipParser class to store
the unzipped contents at some desired location.
Option 2:
Or write a separate job to get the parse-text contents and store them
at some desired location.
- Sagar Naik
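
Editor's note: Option 1 could be approached along the lines of the sketch below, which unpacks a zip held in a byte array into a local directory using only java.util.zip. It is not the ZipParser code from parse-zip. ZipBytesExtractor and extract() are hypothetical names, and where the call would go inside the plugin is left open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Nutch: unpacks in-memory zip bytes to disk.
class ZipBytesExtractor {
    // Extracts every entry below targetDir and returns the number of files written.
    static int extract(byte[] zipBytes, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        int files = 0;
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside targetDir.
                if (!out.startsWith(root)) {
                    throw new IOException("entry escapes target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies only the current entry; the zip stream stays open.
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory as a stand-in for fetched content.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buf)) {
            zout.putNextEntry(new ZipEntry("docs/readme.txt"));
            zout.write("hello".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        Path dir = Files.createTempDirectory("unzipped");
        System.out.println(extract(buf.toByteArray(), dir) + " file(s) written to " + dir);
    }
}
```

The startsWith check is there because entry names come from the downloaded archive and should not be trusted to stay inside the target directory.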