Posted to dev@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/10/09 11:30:24 UTC

Re: Downloading file types to file system

Can anyone help with this?
Is there another Java I/O class I can use to save the byte array?

Eyal.
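
A note on the size difference described in the quoted message below:
ObjectOutputStream is meant for Java serialization, so it writes a stream
header and block-data markers around whatever is passed to write(). That
would explain a file slightly larger than the wget copy. A minimal sketch,
assuming the goal is only to dump the byte array unchanged, is to write
through a plain FileOutputStream instead. The names RawBytesDemo, saveRaw
and serializedForm are hypothetical and are not part of Nutch; the sample
bytes stand in for content.getContent().

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

public class RawBytesDemo {

    // Writes the bytes exactly as given; the file length equals data.length.
    public static void saveRaw(byte[] data, File target) throws IOException {
        OutputStream out = new FileOutputStream(target);
        try {
            out.write(data);
        } finally {
            out.close();
        }
    }

    // For comparison only: what ObjectOutputStream emits for the same bytes
    // (serialization header and block markers plus the payload).
    public static byte[] serializedForm(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ObjectOutputStream obj = new ObjectOutputStream(sink);
        obj.write(data);
        obj.close();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stand-in data; in the fetcher this would be
        // the byte array returned by content.getContent().
        byte[] data = {0x50, 0x4B, 0x03, 0x04};
        File target = File.createTempFile("download", ".bin");
        target.deleteOnExit();
        saveRaw(data, target);
        System.out.println("raw file bytes:   " + target.length());
        System.out.println("serialized bytes: " + serializedForm(data).length);
    }
}
```

In the fetcher code quoted below, this would amount to dropping the
ObjectOutputStream and calling out.write(content.getContent()) directly on
the FileOutputStream, then closing it.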

On 9/22/07, eyal edri <ey...@gmail.com> wrote:
>
> Am I catching this content byte array too late (in the code)?
>
> Is there an earlier data field that holds the page content before the
> content byte array?
>
> Thanks,
>
>
> On 9/20/07, eyal edri <ey...@gmail.com> wrote:
> >
> > Hi,
> >
> > I've made some progress with downloading files (EXE/ZIP).
> > I'm not using the plugin system yet; for now I've just injected code into
> > "fetcher.java" to test it.
> > I've written the following code (placed after the line:  Content content =
> > output.getContent(); ):
> >
> >
> >     // Save the file to the file system.
> >     // Define a regex to capture the domain name and the file name.
> >     Pattern regex = Pattern.compile("http://([^/]*).*/([^/]*)$");
> >     Matcher urlMatcher = regex.matcher(content.getUrl());
> >
> >     String domain = null;
> >     String fileLast = null;
> >     // Get the $1 and $2 backreferences from the regex.
> >     while (urlMatcher.find()) {
> >         domain = urlMatcher.group(1);
> >         fileLast = urlMatcher.group(2);
> >     }
> >     LOG.info("filename " + fileLast);
> >     LOG.info("domain " + domain);
> >     File downloadDir = new File("/home/eyale/nutch/DOWNLOADS/" + domain);
> >     // Check whether the directory exists.
> >     if (!downloadDir.exists())
> >         downloadDir.mkdir();
> >     String filename = downloadDir + "/" + fileLast;
> >
> >     FileOutputStream out = new FileOutputStream(new File(filename));
> >     ObjectOutputStream obj = new ObjectOutputStream(out);
> >
> >     // content.getContent() returns a byte array.
> >     obj.write(content.getContent());
> >     obj.close();
> >
> > After downloading this file, I found that it is slightly bigger than the
> > original file (compared with the file retrieved with wget).
> > Why is that? Does this byte array contain extra information/data?
> > How can I get only the real file data?
> >
> > Thanks,
> >
> >
> > On 9/11/07, Martin Kuen <martin.kuen@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I don't think that Nutch can be configured to store each downloaded file
> > > as a file (one file downloaded - one file on your local disk).
> > > The "byte array called content" can be stored directly, I think, and
> > > that's worth a try. The fetcher uses (binary) streams to handle the
> > > downloaded content, so I think it *should* be okay.
> > >
> > > Another approach (my two cents):
> > > 1. Run the fetcher with the -noParse option (most likely not even
> > >    necessary)
> > > 2. Check whether the fetcher is configured to store the content (there
> > >    is a property in nutch-default.xml)
> > > 3. Create a dump with the "readseg" command and the "-dump" option
> > > 4. Process the dump file and cut out what is necessary
> > >
> > > I'm just interested in whether that could work. However:
> > > I had a look at the class implementing the readseg command and found that
> > > the dump file is created with a "PrintWriter". This will cause trouble
> > > for binary content, I think, since a PrintWriter writes characters rather
> > > than raw bytes. Maybe you can modify the SegmentReader (use an
> > > OutputStream).
> > > As for the fetcher, it uses a binary stream to store the content
> > > (FSDataOutputStream).
> > >
> > >
> > > Cheers,
> > >
> > > Martin
> > >
> > >
> > > On 9/11/07, eyal edri <eyal.edri@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I've asked this question before on a different mailing list, with no
> > > > real response.
> > > > I hope someone has seen the need for this and can help.
> > > >
> > > > I'm trying to configure Nutch to download certain file types (exe/zip)
> > > > to the file system while crawling.
> > > > I know Nutch doesn't have a parse-exe plugin, so I'll focus on ZIP
> > > > (once I understand the logic, I will write a parse-exe plugin).
> > > >
> > > > I want to know whether Nutch supports downloading files inherently
> > > > (using only conf files) or, if not, how I can alter the parse-zip
> > > > plugin in order to download the file.
> > > > (I saw that the parser gets a byte array called "content"; can I save
> > > > this to the fs?)
> > > >
> > > > Thanks,
> > > >
> > > >
> > > > --
> > > > Eyal Edri
> > > >
> > >
> >
> >
> >
> > --
> > Eyal Edri
>
>
>
>
> --
> Eyal Edri




-- 
Eyal Edri