Posted to user@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/09/11 10:41:14 UTC

Downloading file types to file system

Hi,

I've asked this question before on a different mailing list, with no real
response. I hope someone here sees the need for this and can help.

I'm trying to configure Nutch to download certain file types (exe/zip) to
the file system while crawling.
I know Nutch doesn't have a parse-exe plugin, so I'll focus on ZIP (once I
understand the logic, I'll write a parse-exe plugin).

I want to know whether Nutch supports downloading files out of the box
(using only conf files) or, if not, how I can alter the parse-zip plugin
to download the file.
(I saw the parser gets a byte array called "content"; can I save this to
the fs?)
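
One conf-level point to check first, either way: the stock URL filters skip
these suffixes entirely, so the fetcher never even downloads them. In
conf/crawl-urlfilter.txt (or regex-urlfilter.txt, depending on how you run
the crawl) there is a rule roughly like the one below, and zip/exe would
have to be removed from it; the exact suffix list varies by Nutch version:

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$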

thanks,


-- 
Eyal Edri

Re: Downloading file types to file system

Posted by eyal edri <ey...@gmail.com>.
Can anyone help with this?
Is there another Java I/O class I can use for saving the byte array?

Eyal.

On 9/22/07, eyal edri <ey...@gmail.com> wrote:
> [...]




-- 
Eyal Edri

Re: Downloading file types to file system

Posted by eyal edri <ey...@gmail.com>.
Am I catching this content byte array too late (in the code)?

Is there an earlier data field that holds the page content before the
content byte array?

thanks,


On 9/20/07, eyal edri <ey...@gmail.com> wrote:
> [...]




-- 
Eyal Edri

Re: Downloading file types to file system

Posted by eyal edri <ey...@gmail.com>.
Hi,

I've made some progress with downloading files (EXE/ZIP).
I'm not using the plugin system yet; for now I've just injected code into
Fetcher.java to test it.
I've written the following code (it runs right after this line:
Content content = output.getContent();):


    // --- save the fetched file to the local file system ---
    // (needs java.io.* and java.util.regex.* imported)
    // regex to capture the domain name and the file name from the URL
    Pattern regex = Pattern.compile("http://([^/]*).*/([^/]*)$");
    Matcher urlMatcher = regex.matcher(content.getUrl());

    String domain = null;
    String fileLast = null;
    // pull out the $1 and $2 groups
    while (urlMatcher.find()) {
        domain = urlMatcher.group(1);
        fileLast = urlMatcher.group(2);
    }
    LOG.info("filename " + fileLast);
    LOG.info("domain " + domain);

    File downloadDir = new File("/home/eyale/nutch/DOWNLOADS/" + domain);
    // create the per-domain directory if it does not exist yet
    if (!downloadDir.exists())
        downloadDir.mkdir();
    String filename = downloadDir + "/" + fileLast;

    FileOutputStream out = new FileOutputStream(new File(filename));
    ObjectOutputStream obj = new ObjectOutputStream(out);

    // content.getContent() returns the fetched bytes as a byte array
    obj.write(content.getContent());
    obj.close();

After downloading a file this way, I've found that it is slightly bigger
than the original file (compared with the same file retrieved with wget).
Why is that? Does this byte array contain extra information/data?
How can I get only the real file data?
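
A likely explanation: ObjectOutputStream is meant for Java object
serialization, so it writes a stream header (the magic bytes AC ED 00 05)
and wraps each write in block-data markers; those framing bytes are the
extra data. Writing the array straight to the FileOutputStream should give
a byte-for-byte copy. A minimal sketch, reusing the filename and content
variables from the snippet above:

    // write the raw fetched bytes with no serialization framing
    FileOutputStream out = new FileOutputStream(new File(filename));
    try {
        out.write(content.getContent());
    } finally {
        out.close();
    }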

thanks,


On 9/11/07, Martin Kuen <ma...@gmail.com> wrote:
> [...]



-- 
Eyal Edri

Re: Downloading file types to file system

Posted by Martin Kuen <ma...@gmail.com>.
hi,

I don't think Nutch can be configured to store each downloaded file as a
file of its own (one file downloaded, one file on your local disk).
The "byte array called content" can be stored directly, I think; that's
worth a try. The fetcher uses (binary) streams to handle the downloaded
content, so it *should* be okay.

Another approach (my two cents):
 1. Run the fetcher with the -noParse option (most likely not even
necessary).
 2. Check whether the fetcher is advised to store the content (there is a
property for this in nutch-default.xml; see the sketch after this list).
 3. Create a dump with the "readseg" command and its "-dump" option.
 4. Process the dump file and cut out what is necessary.
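
For step 2, the property in question should be fetcher.store.content; a
minimal sketch of pinning it in conf/nutch-site.xml (the property name is
taken from nutch-default.xml, the value is an assumption about what you
want here):

    <property>
      <name>fetcher.store.content</name>
      <value>true</value>
      <description>If true, the fetcher stores the raw content in the
      segment.</description>
    </property>

For step 3, the dump would then be created with something like this
(segment path hypothetical):

    bin/nutch readseg -dump crawl/segments/20070911123456 dumpdir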

Just interested in whether that could work . . . however:
I had a look at the class implementing the readseg command and found that
the dump file is created with a PrintWriter. I think that will cause
trouble for binary content; maybe you can modify SegmentReader to use an
OutputStream instead. The fetcher itself, by contrast, stores the content
through a binary stream (FSDataOutputStream).
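
A minimal sketch of pulling the raw Content records out of a segment
directly, instead of going through the PrintWriter-based dump; the segment
path and output directory are hypothetical, and it assumes the segment's
content directory is a Hadoop MapFile of Text keys (URLs) to Content
values, whose data file can be read as a plain SequenceFile:

    import java.io.File;
    import java.io.FileOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class DumpRawContent {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // hypothetical segment path; the MapFile's records live in "data"
        Path data = new Path(
            "crawl/segments/20070911123456/content/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        new File("/tmp/dump").mkdirs();
        while (reader.next(url, content)) {
          // write each record's raw fetched bytes, unmodified
          String name = url.toString().replaceAll("[^A-Za-z0-9.]", "_");
          FileOutputStream out = new FileOutputStream("/tmp/dump/" + name);
          out.write(content.getContent());
          out.close();
        }
        reader.close();
      }
    }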


Cheers,

Martin


On 9/11/07, eyal edri <ey...@gmail.com> wrote:
> [...]