
Posted to agent@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/09/02 18:00:38 UTC

downloading zip/exe files

Hi,

I am trying to use the Nutch fetcher to download EXE/ZIP files from web pages.
I've removed those suffixes from regex-urlfilter and
automaton-urlfilter (the two files are identical):


regex-urlfilter.txt:
--------------------------------------------------------------------------------------------------------
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.

------------------------------------------------------------------------------------------------------------------
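
One way to confirm that a rule set like the one above lets .exe and .zip URLs through is to replay the rules outside Nutch. The sketch below is a stand-alone approximation, not Nutch's own filter code. It assumes that rules are tried in file order, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is rejected. The class name UrlFilterCheck and the example.com URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterCheck {
    /** One filter rule: '+' accepts, '-' rejects a URL in which the pattern is found. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    // The four rules from the regex-urlfilter.txt quoted above, in file order.
    static final List<Rule> RULES = new ArrayList<>();
    static {
        RULES.add(new Rule(false, "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz"
                + "|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$"));
        RULES.add(new Rule(false, "[?*!@=]"));
        RULES.add(new Rule(false, ".*(/.+?)/.*?\\1/.*?\\1/"));
        RULES.add(new Rule(true, "."));
    }

    /** Returns true if the first rule that matches the URL is an accept rule. */
    public static boolean accepts(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // assumption: a URL that matches no rule is rejected
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.xtodvd.com/apodvdcopy.exe",
            "http://example.com/archive.zip",
            "http://example.com/logo.gif",
            "http://example.com/get?file=setup.exe"
        };
        for (String url : urls) {
            System.out.println((accepts(url) ? "accept " : "reject ") + url);
        }
    }
}
```

Under these assumptions the .exe and .zip URLs are accepted, while the .gif URL and the URL containing "?" are rejected, so the filter itself does not block the download. Note that an .exe link with a query string would still be dropped by the second rule.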

When trying to download EXE:
http://www.xtodvd.com/apodvdcopy.exe

the fetch fails:
found segment crawl/segments/20070902084928
Fetching now the urls..
Fetcher: starting
Fetcher: segment: crawl/segments/20070902084928
Fetcher: threads: 1000
fetching http://www.xtodvd.com/apodvdcopy.exe
Error parsing: http://createdvd.net/apodvdcopy.exe: failed(2,200):
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/x-dosexec url=http://createdvd.net/apodvdcopy.exe
Fetcher: done

Fetching a ZIP file works, but how can I tell Nutch to save the
zip to a directory on the file system? Do I need to write a
plugin?

thanks!






Eyal Edri

Re: downloading zip/exe files

Posted by eyal edri <ey...@gmail.com>.

hi,

I've been digging into the parse-zip plugin all day, trying to figure out
what I can change or add so that it downloads the zip file to a local folder
on the filesystem.

I tried writing the "content.getContent()" byte array to a file, but it
always gives me an error on parsing the file and says nothing about saving
the byte array (it also doesn't show any of the System.out.println(...)
calls I added... :( )

Does anyone have an idea how this can be done?

thanks,

eyal
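
For reference, the write step described in this message (dumping the fetched byte array to a local file) might look roughly like the sketch below. It is plain Java with no Nutch classes: the byte array stands in for whatever content.getContent() returns, and the class name RawContentSaver, the method names, the fallback name "download.bin", and the output directory are all hypothetical. Where such a call would be hooked into the parse-zip plugin is not shown, since the thread does not settle that.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawContentSaver {
    /**
     * Derives a file name from the last path segment of the URL,
     * falling back to "download.bin" when no usable segment exists.
     */
    public static String fileNameFor(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            return "download.bin";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        // "." and ".." would point at a directory, not a file.
        if (name.equals(".") || name.equals("..")) {
            return "download.bin";
        }
        return name;
    }

    /** Writes the bytes to outputDir/<name derived from url> and returns the file written. */
    public static Path save(byte[] data, String url, Path outputDir) {
        try {
            Files.createDirectories(outputDir);
            Path target = outputDir.resolve(fileNameFor(url));
            return Files.write(target, data);
        } catch (IOException e) {
            // Rethrow unchecked so a failed write is visible instead of silently lost.
            throw new UncheckedIOException("could not save " + url, e);
        }
    }

    public static void main(String[] args) {
        byte[] fake = {'P', 'K', 3, 4}; // stand-in for the fetched content bytes
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "nutch-raw");
        Path written = save(fake, "http://example.com/files/archive.zip", dir);
        System.out.println("saved " + written);
    }
}
```

Throwing on failure rather than printing makes a failed write show up in whatever error reporting the caller has, which may help when console output from a plugin is not visible.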



On 9/4/07, Sagar Naik <sa...@visvo.com> wrote:
>
> Hey eyal,
>
> There is no parser for "application/x-dosexec". You will have to write a
> plugin to parse exe files (have a look at the parse-zip plugin).
>
> Storing unzipped contents:
>     Option 1:
>             I think you can modify the parse-zip plugin's ZipParser class to
>             store the unzipped contents at some desired location.
>     Option 2:
>             Or write a separate job to get the parse-text contents and store
>             them at some desired location.
>
> - Sagar Naik
>
>


-- 
Eyal Edri

Re: downloading zip/exe files

Posted by Sagar Naik <sa...@visvo.com>.

Hey eyal,

There is no parser for "application/x-dosexec". You will have to write a
plugin to parse exe files (have a look at the parse-zip plugin).

Storing unzipped contents:
    Option 1:
            I think you can modify the parse-zip plugin's ZipParser class to
            store the unzipped contents at some desired location.
    Option 2:
            Or write a separate job to get the parse-text contents and store
            them at some desired location.
       
 - Sagar Naik
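
As an illustration of the "store the unzipped contents" step in Option 1, here is a minimal sketch that unpacks an in-memory zip into a directory using only java.util.zip. It is not the ZipParser code from the parse-zip plugin. The class name ZipExtractor, both method names, and the target directory are assumptions made for the example, and singleEntryZip exists only so the demo has an archive to work on.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipExtractor {
    /** Extracts every file entry of the in-memory zip into targetDir; returns the files written. */
    public static List<Path> extract(byte[] zipBytes, Path targetDir) {
        List<Path> written = new ArrayList<>();
        Path root = targetDir.toAbsolutePath().normalize();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            Files.createDirectories(root);
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../evil" that would land outside targetDir.
                if (!out.startsWith(root)) {
                    throw new IOException("entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies up to the end of the current entry; the stream stays open.
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                    written.add(out);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("could not extract zip", e);
        }
        return written;
    }

    /** Builds a one-entry zip in memory; only here to make the demo self-contained. */
    public static byte[] singleEntryZip(String name, byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buf)) {
            zout.putNextEntry(new ZipEntry(name));
            zout.write(data);
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = singleEntryZip("docs/readme.txt", "hello".getBytes(StandardCharsets.UTF_8));
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "nutch-unzipped");
        System.out.println(extract(zip, dir));
    }
}
```

The check against the target directory matters for crawled archives, since entry names come from untrusted sites and may contain "../" segments.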


