
Posted to user@nutch.apache.org by "Matthias W." <Ma...@e-projecta.com> on 2009/01/13 13:17:50 UTC

nutch crawling with java (not shellscript)

Hi,
is there a tutorial, or can anyone explain whether and how I can run the Nutch
crawler from Java instead of through the shell script?
Furthermore, I don't need to crawl, because I already have a list of URLs (PDF,
Word, Excel, ... documents) that I have to index.
-> In my case Nutch only has to create the index from the URL list.

Until now I've had a shell script that calls "bin/nutch crawl ...".

But if possible, I want to use Java code instead of the "bin/nutch"
crawl script.

Are there Java classes and methods to do this?

To make it clearer, this is how I imagine starting the crawl (or rather the
indexing process):
    "java Crawl"
so that I can set the crawl options in Java code and not in a
shell script.

Is this possible?

Thanks!
Matthias
-- 
View this message in context: http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch crawling with java (not shellscript)

Posted by "Matthias W." <Ma...@e-projecta.com>.

Thanks, I'll look at it.

But it doesn't support Office 2007 yet, does it?



Julien Nioche-4 wrote:
> 
> Matthias,
> 
> Have a look at Apache Tika. It provides a simple and unified API over PDFBOX
> and POI etc... and Mimetype facilities.
> That should greatly simplify your code.
> 
> Julien
> 

-- 
View this message in context: http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21455529.html

Re: nutch crawling with java (not shellscript)

Posted by Julien Nioche <li...@gmail.com>.

Matthias,

Have a look at Apache Tika. It provides a simple, unified API over PDFBox,
POI, etc., as well as MIME type detection facilities.
That should greatly simplify your code.

Julien

2009/1/14 Matthias W. <Ma...@e-projecta.com>

>
> Ok thanks!
>
> But I decided against using the nutch crawler.
>
> It will be the better way to build the index directly with Lucene, because I
> do not need to crawl.
> (I'm also searching with Lucene)
>
> Now I use the parsers PDFBox for PDF-Documents and the Apache POI for MS
> Office Documents.
>
> There's little problem remaining: Mimetype checking.
> I tried this:
> String mimetype = new MimetypesFileTypeMap().getContentType( file );
> but I always get the type application/octet-stream
>
> Does anybody know a good Mimetype class in java?
>
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: nutch crawling with java (not shellscript)

Posted by "Matthias W." <Ma...@e-projecta.com>.

Ok thanks!

But I decided against using the Nutch crawler.

It is better to build the index directly with Lucene, because I
do not need to crawl.
(I'm also searching with Lucene.)

I now use PDFBox to parse PDF documents and Apache POI for MS
Office documents.

There's one small problem remaining: MIME type checking.
I tried this:
String mimetype = new MimetypesFileTypeMap().getContentType( file );
but I always get the type application/octet-stream.

Does anybody know a good MIME type class in Java?
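
MimetypesFileTypeMap looks types up by file extension, so anything missing from its tables comes back as application/octet-stream. One alternative is to look at the first few bytes of each file instead. The following is only a minimal sketch of that idea and was not part of the original thread: the class MimeSniffer and its methods are hypothetical names, the label returned for legacy Office files is simply the one chosen here, and only three signatures are covered. A ZIP result can only hint at an Office 2007 document, since telling .docx from .xlsx would need a look inside the archive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: guesses a coarse MIME type from a file's leading bytes. */
final class MimeSniffer {

    private MimeSniffer() {}

    /** Returns true if data begins with the given magic bytes. */
    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies a header (the first bytes of a file) by its signature. */
    static String sniff(byte[] header) {
        if (startsWith(header, '%', 'P', 'D', 'F')) {
            return "application/pdf";
        }
        // OLE2 compound file: the container used by legacy .doc/.xls/.ppt files.
        if (startsWith(header, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return "application/x-ole-storage";
        }
        // ZIP container: used by Office 2007 files (.docx/.xlsx/.pptx), among others.
        if (startsWith(header, 'P', 'K', 0x03, 0x04)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static String sniff(Path file) throws IOException {
        byte[] header = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < header.length) {
                int r = in.read(header, n, header.length - n);
                if (r < 0) {
                    break; // file is shorter than 8 bytes
                }
                n += r;
            }
        }
        return sniff(Arrays.copyOf(header, n));
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + " -> " + sniff(Paths.get(arg)));
        }
    }
}
```

Running it as "java MimeSniffer somefile" prints the guessed type for each file named on the command line. A library such as Apache Tika covers far more formats than this sketch does.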



Otis Gospodnetic-2 wrote:
> 
> Hi Matthias,
> 
> Several years ago when I did crawling/parsing/indexing of full-page
> content for Simpy.com I used Nutch in exactly that manner.
> 
> For example (this is outdated code, but you'll get the idea):
> 
>        System.out.println("Urls to fetch: " + _urls.size());
> 
>         if (_urls.size() == 0)
>             return;
> 
>         // clean up and prepare the FS
>         prepareFS();
> 
>         // create the URL file
>         String urlFile = createURLFile();
> 
>         // create the fetch list from the URL file
>         createFetchList(urlFile);
> 
>         // start the fetcher
>         _segmentDir = getLastSegmentDirectory(_rootDir);
>         String[] params = new String[] {   // THIS IS WHAT YOU ARE AFTER
>             "-local",
>             _segmentDir
>         };
>         org.apache.nutch.fetcher.Fetcher.main(params);   // THIS IS WHAT YOU ARE AFTER
> 
> 
> If you look at bin/nutch script, you will see it really just calls Nutch's
> Java classes, so you just have to figure out what parameters those classes
> take and then call them as above, or even more directly using ctor and
> methods other than main.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21454646.html

Re: nutch crawling with java (not shellscript)

Posted by Otis Gospodnetic <og...@yahoo.com>.

Hi Matthias,

Several years ago when I did crawling/parsing/indexing of full-page content for Simpy.com I used Nutch in exactly that manner.

For example (this is outdated code, but you'll get the idea):

        System.out.println("Urls to fetch: " + _urls.size());

        if (_urls.size() == 0)
            return;

        // clean up and prepare the FS
        prepareFS();

        // create the URL file
        String urlFile = createURLFile();

        // create the fetch list from the URL file
        createFetchList(urlFile);

        // start the fetcher
        _segmentDir = getLastSegmentDirectory(_rootDir);
        String[] params = new String[] {                           // THIS IS WHAT YOU ARE AFTER
            "-local",
            _segmentDir
        };
        org.apache.nutch.fetcher.Fetcher.main(params);   // THIS IS WHAT YOU ARE AFTER


If you look at the bin/nutch script, you will see that it really just calls Nutch's Java classes, so you only have to figure out what parameters those classes take and then call them as above, or even more directly through their constructors and methods other than main.
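
The general pattern of calling a command-line class from your own Java code, rather than from a shell script, can be sketched as follows. This is only an illustration under stated assumptions: DemoTool is a made-up stand-in for a class with a main method (it is not a Nutch class), ToolLauncher is a hypothetical helper, and "segments/demo" is a placeholder path. For a real tool you would check which arguments its main method expects.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

/** Stand-in for a command-line tool with a main method; NOT a Nutch class. */
final class DemoTool {
    static String[] lastArgs = new String[0]; // records what main received

    private DemoTool() {}

    public static void main(String[] args) {
        lastArgs = args.clone();
        System.out.println("DemoTool called with " + Arrays.toString(args));
    }
}

/** Hypothetical helper that runs a tool's main method inside the current JVM. */
final class ToolLauncher {

    private ToolLauncher() {}

    /** Looks up the static main(String[]) of className and invokes it. */
    static void run(String className, String... args) {
        try {
            Method main = Class.forName(className).getMethod("main", String[].class);
            // The cast passes the array as the single String[] parameter
            // instead of spreading it across reflective varargs.
            main.invoke(null, (Object) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not run " + className, e);
        }
    }

    public static void main(String[] args) {
        // Direct call: simplest when the class is known at compile time.
        DemoTool.main(new String[] {"-local", "segments/demo"});

        // Reflective call: useful when the class name comes from configuration.
        run(DemoTool.class.getName(), "-local", "segments/demo");
    }
}
```

One thing to keep in mind with this approach: a main method written for command-line use may call System.exit, which would end your whole program, so calling a tool's constructors and other methods directly can be the safer route where they are available.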

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Matthias W. <Ma...@e-projecta.com>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, January 13, 2009 7:17:50 AM
> Subject: nutch crawling with java (not shellscript)
> 
> 
> Hi,
> is there a tutorial or can anyone explain if and how I can run the nutch
> crawler via java and not with the shellscript?
> Furthermore I don't need to crawl, because I've got a list of URLs (PDF,
> Word, Excel, ... Documents) which I have to index
> -> In my case nutch only has to create the index from the urls list.
> 
> Till now I've got a shellscript which calls "bin/nutch crawl ..."
> 
> But if it is possible, I want to use java code instead of the "bin/nutch"
> crawlscript.
> 
> Are there Java classes and methods to do this?
> 
> For better understanding, my association to start the crawl respectively the
> index process:
>     "java Crawl"
> That I'm able to set options for crawling in the java code and not in a
> shellscript.
> 
> Is this possible?
> 
> Thanks!
> Matthias
> -- 
> View this message in context: 
> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
> Sent from the Nutch - User mailing list archive at Nabble.com.