Posted to user@nutch.apache.org by alessio crisantemi <al...@gmail.com> on 2012/03/10 17:42:52 UTC

Re: nutch crawling file system SOLVED

My problem is partially solved: following the tutorial, I configured
Nutch to crawl a local file system,
thank you.

But I have a doubt: why do all the tutorials and guides about Nutch talk
about a 'crawl-urlfilter.txt' file, when the default Nutch config doesn't
have this file? If I put the rules that the guides give for
crawl-urlfilter into regex-urlfilter instead, everything works.
I would like to understand why.
thank you
alessio

On 4 March 2012 at 17:02, alessio crisantemi <
alessio.crisantemi@gmail.com> wrote:

> Hi all,
> I need to crawl a directory with a lot of PDF files.
> But I only know the step-by-step procedure for crawling a website.
> How can I do it for a directory root?
> thank you for helping me
> alessio
>

Re: nutch crawling file system SOLVED

Posted by dpverma <pa...@gmail.com>.
These are the steps I have done so far:
1. download the two missing libraries  from:
  http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

I downloaded the additional JARs from the URL in step 1, but instead of
putting them in the "src/plugin/parse-pdf/lib" folder, I put them in the
"plugins" folder. I modified the plugin.xml in the same folder as per the
instructions in it. Then I enabled the 'parse-pdf' plugin in
'nutch-site.xml' as shown below (I just added 'parse-pdf' at the end). I
did not think I needed to rebuild Nutch.
 
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
</property>

Then I got the following error:
WARN  parse.Parser - Error parsing:
http://viterbi.usc.edu/aviation/assets/002/74092.pdf: failed(2,202): Content
truncated at 66251 bytes. Parser can't handle incomplete pdf file.

Solution: for this I changed file.content.limit and http.content.limit to
-1 in both nutch-default.xml and nutch-site.xml.
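
For reference, this is roughly what the override looks like in
nutch-site.xml (-1 means no limit; both property names are from
nutch-default.xml):

<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>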

Then I fixed the following error:
The parsing plugins: [org.apache.nutch.parse.pdf.PdfParser] are enabled via
the plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it  in the
parse-plugins.xml file

Solution: in parse-plugins.xml under the Nutch conf directory, I
uncommented the PDF section.
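
The section in question maps the PDF MIME type to the parser plugin; it
looks roughly like this (as in the stock parse-plugins.xml):

<mimeType name="application/pdf">
<plugin id="parse-pdf" />
</mimeType>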

After doing this, I see no errors or warnings in the log file.
But there is still no text in the content section.

I have given a direct link to the PDF file in regex-urlfilter.txt:
+^http://([a-z0-9\-A-Z]*\.)*viterbi.usc.edu/aviation/assets/002/79884.pdf([a-z0-9\-A-Z]*\/)*

The only thing I have not done is rebuild Nutch. Is that the reason no
text is getting extracted from the PDF?
If rebuilding Nutch is a crucial step, can you please guide me on how to
do it?

Thanks




Re: nutch crawling file system SOLVED

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

Take a look at the ftp.content.limit property in nutch-default.xml and
set it accordingly in nutch-site.xml.
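
For example, a minimal override sketch (-1 disables the limit, same as
with file.content.limit):

<property>
<name>ftp.content.limit</name>
<value>-1</value>
</property>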

Thanks

Lewis

On Tue, Sep 11, 2012 at 12:20 AM, dpverma <pa...@gmail.com> wrote:
> Can you please let me know how you solved your problem?
> I am also getting the same error you had:
> getting the index with the PDF file names but not their content.



-- 
Lewis

Re: nutch crawling file system SOLVED

Posted by dpverma <pa...@gmail.com>.
Can you please let me know how you solved your problem?
I am also getting the same error you had:
getting the index with the PDF file names but not their content.





Re: nutch crawling file system SOLVED

Posted by alessio crisantemi <al...@gmail.com>.
Dear All,
now everything works: crawling, parsing and indexing.

I have one last problem: I can't restrict the crawl to my directory only;
Nutch also crawls all the parent directories.
How can I limit it?
thank you
alessio

On 17 March 2012 at 21:11, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Alessio,
>
> On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi <
> alessio.crisantemi@gmail.com> wrote:
>
> >
> >
> > suggestions?
> >
>
> For what?
>

Re: nutch crawling file system SOLVED

Posted by alessio crisantemi <al...@gmail.com>.
this is my schema configuration:

<fields>
<field name="id" type="string" stored="true" indexed="true" />
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false" />
<field name="digest" type="string" stored="true" indexed="false" />
<field name="boost" type="float" stored="true" indexed="false" />
<!-- fields for index-basic plugin -->
<field name="host" type="url" stored="false" indexed="true" />
<field name="site" type="string" stored="false" indexed="true" />
<field name="url" type="url" stored="true" indexed="true" required="true" />
<field name="content" type="text" stored="true" indexed="true" />
<field name="title" type="text" stored="true" indexed="true" />
<field name="cache" type="string" stored="true" indexed="false" />
<field name="tstamp" type="date" stored="true" indexed="false" />
<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true" />
<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="true" multiValued="true" />
<field name="contentLength" type="long" stored="true" indexed="false" />
<field name="lastModified" type="date" stored="true" indexed="false" />
<field name="date" type="date" stored="true" indexed="true" />
<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="true" />
<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true" indexed="true" multiValued="true" />
<!-- fields for feed plugin (tag is also used by microformats-reltag) -->
<field name="author" type="string" stored="true" indexed="true" />
<field name="tag" type="string" stored="true" indexed="true" multiValued="true" />
<field name="feed" type="string" stored="true" indexed="true" />
<field name="publishedDate" type="date" stored="true" indexed="true" />
<field name="updatedDate" type="date" stored="true" indexed="true" />
</fields>

Is this suitable for full-text search of PDF files?
thank you
alessio
On 17 March 2012 at 21:32, alessio crisantemi <
alessio.crisantemi@gmail.com> wrote:

> I would like the result of my search to be the text of my PDF files, and
> not the list of documents in the directory and their path addresses..
>
>
>
>
> On 17 March 2012 at 21:11, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> Hi Alessio,
>>
>> On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi <
>> alessio.crisantemi@gmail.com> wrote:
>>
>> >
>> >
>> > suggestions?
>> >
>>
>> For what?
>>
>
>

Re: nutch crawling file system SOLVED

Posted by alessio crisantemi <al...@gmail.com>.
I would like the result of my search to be the text of my PDF files, and
not the list of documents in the directory and their path addresses..




On 17 March 2012 at 21:11, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Alessio,
>
> On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi <
> alessio.crisantemi@gmail.com> wrote:
>
> >
> >
> > suggestions?
> >
>
> For what?
>

Re: nutch crawling file system SOLVED

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alessio,

On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi <
alessio.crisantemi@gmail.com> wrote:

>
>
> suggestions?
>

For what?

Re: nutch crawling file system SOLVED

Posted by alessio crisantemi <al...@gmail.com>.
this is what comes back after crawling with Nutch and indexing into Solr:

<doc>
<float name="boost">0.298293</float>
<str name="content">
Index of C:\Documents and Settings\Alessio\Documenti Index of C:\Documents
and Settings\Alessio\Documenti ../ - - - 003_C_001_Alessio_2004_08_13.dvf
Tue, 17 Aug 2004 20:09:52 GMT 151552 13542 Christmas on Stars Affiliate
Banners 21.11/ Fri, 02 Dec 2011 20:45:37 GMT - 5 GETTONI 22 EURO.doc Mon,
08 Sep 2003 16:18:58 GMT 20480 adesione 191 con password.doc Fri, 29 Aug
2003 13:45:00 GMT 116736 art inceneritore.doc Sun, 18 Feb 2007 18:08:54 GMT
32768 articoli/ Wed, 26 Apr 2006 08:12:43 GMT - auguri pcplayer.it.jpg Sun,
24 Dec 2006 20:56:48 GMT 13888 Bluetooth Exchange Folder/ Wed, 22 Mar 2006
19:50:24 GMT - BluetoothXpSp2.pdf Mon, 11 Jul 2005 08:24:58 GMT 120812
Brevettato il disco volante.doc Wed, 30 Aug 2006 13:57:19 GMT 20992 Busta
Rhymes feat maria c.mp3 Sat, 05 Jul 2003 12:17:58 GMT 2595736 CALENDARIO
FANTA2005.doc Wed, 08 Sep 2004 13:13:20 GMT 141824 Cartella Scambio
Bluetooth/ Mon, 20 Mar 2006 22:03:45 GMT - cc_20111224_234440.reg Sat, 24
Dec 2011 22:44:44 GMT 1860 CD musicali 01 -01- 2003.xls Mon, 27 Jan 2003
00:10:00 GMT 515584 CLASSIFICA fantacalcio 2005.doc Wed, 08 Sep 2004
13:15:38 GMT 43008 Collegamento a My Shared Folder.lnk Thu, 29 Sep 2005
16:56:56 GMT 533 conte tagliaferri.doc Thu, 09 Oct 2008 00:00:36 GMT 29184
Corel User Files/ Sun, 16 Apr 2006 11:56:24 GMT - Curriculum ANGELO
CONTILI.doc Wed, 19 Jan 2005 22:35:34 GMT 42496 currriculum Alessio
agg2004.doc Thu, 27 Jan 2005 21:56:06 GMT 950784 currriculum Alessio.doc
Thu, 10 Apr 2003 19:29:28 GMT 44544 Default.rdp Mon, 11 Sep 2006 17:13:17
GMT 1166 desktop.ini Wed, 28 Sep 2011 20:52:02 GMT 75 DNSadsl.txt Mon, 01
Aug 2005 13:21:52 GMT 942 DNStabella.xls Mon, 01 Aug 2005 13:16:26 GMT
33792 Download/ Fri, 09 Mar 2012 23:03:54 GMT - Eseguibili JAVA.doc Mon, 20
Jun 2005 11:58:42 GMT 23552 FANTACALCIO/ Mon, 20 Mar 2006 22:04:34 GMT -
Fax/ Mon, 20 Mar 2006 22:04:40 GMT - File ricevuti/ Mon, 19 Oct 2009
12:50:30 GMT - FINALE TORNEO 06.doc Tue, 08 May 2007 17:47:41 GMT 49664
Finest/ Sat, 06 Mar 2010 15:05:46 GMT - FORMAZIONItipo2005.doc Mon, 13 Sep
2004 20:26:56 GMT 49664 free3gp/ Tue, 25 May 2010 16:23:43 GMT - Futurando/
Mon, 20 Mar 2006 09:59:24 GMT - GOL/ Mon, 20 Mar 2006 22:24:46 GMT -
guidadownloadconmirc.doc Sun, 20 Feb 2005 10:29:40 GMT 264704 HAPPY DAYS/
Mon, 20 Mar 2006 21:45:36 GMT - Happy Days2007/ Sun, 27 Jan 2008 15:53:33
GMT - hijackthis.log Fri, 04 Jul 2008 08:49:37 GMT 8573 Immagini/ Wed, 28
Sep 2011 20:52:03 GMT - Immagini.lnk Fri, 15 Aug 2008 16:26:58 GMT 375
intervisteEnada/ Mon, 20 Mar 2006 10:36:53 GMT - IP Pentima.txt Fri, 01 Jul
2005 08:19:56 GMT 99 L'AUTOMATICO/ Sun, 14 Jan 2007 17:30:41 GMT -
lavatr1h.mp3 Wed, 03 Oct 2001 15:19:52 GMT 2586624 lionsleeps_hq.wmv Tue,
17 May 2005 13:29:02 GMT 1842905 lista flip.docx Sat, 29 Mar 2008 14:04:24
GMT 13559 Masterizzare giochi con NERO BURNING ROM.doc Sun, 06 Mar 2005
19:32:28 GMT 23040 masterizzarre CD protetti.txt Thu, 20 Jan 2005 20:36:00
GMT 2326 Matlab 65 serial.txt Thu, 09 Oct 2003 22:34:00 GMT 86
MessageLog.xsl Sun, 21 Dec 2008 20:45:03 GMT 12160 mirc istruz.txt Sun, 09
Mar 2003 16:44:00 GMT 1123 Musica/ Wed, 28 Sep 2011 20:52:04 GMT - My Skype
Content/ Sat, 06 May 2006 12:04:04 GMT - My Skype Pictures/ Wed, 27 Apr
2011 19:54:03 GMT - My Skype Received Files/ Thu, 18 May 2006 16:50:34 GMT
- natale_flip.jpg Sat, 23 Dec 2006 17:18:52 GMT 118507 niagara.JPG Fri, 18
Aug 2006 16:53:48 GMT 1017782 niagara2.JPG Fri, 18 Aug 2006 16:53:44 GMT
988143 Norton AntiVirus_Key.txt Sun, 31 Oct 2004 19:28:24 GMT 357
postepay.txt Wed, 16 Jul 2008 07:48:38 GMT 16 presentazione_FB.pdf Thu, 09
Mar 2006 08:58:00 GMT 700629 richiesta.doc Sun, 16 Nov 2003 18:14:44 GMT
124928 ROSE FANTACALCIO 2005.doc Wed, 08 Sep 2004 13:59:54 GMT 45568
scudettoicona.ico Mon, 22 Sep 2003 19:55:10 GMT 13502
serial_akkxMDYwMTE0ODM5.txt Tue, 05 Aug 2003 20:19:26 GMT 155 Siti Web/
Sun, 03 Jun 2007 13:21:44 GMT - SitoTernanaGiochi/ Fri, 25 May 2007
19:52:51 GMT - sitoTGver1.1.pub Sun, 06 Mar 2005 20:23:56 GMT 1637888
starry(d).jpg Sun, 02 Apr 2006 10:10:26 GMT 2138166 suonerie/ Fri, 15 Aug
2008 16:29:09 GMT - Symantec/ Sun, 13 Aug 2006 12:09:04 GMT - Thumbs.db
Sun, 11 Feb 2007 14:45:34 GMT 71168 vecchioDocumenti/ Wed, 14 Jul 2010
15:42:28 GMT - virtualDub/ Mon, 20 Mar 2006 10:16:47 GMT - Voice Files/
Mon, 27 Mar 2006 11:57:35 GMT - ZbThumbnail.info Mon, 09 Jun 2008 08:25:30
GMT 2920 zurigo.doc Thu, 13 Apr 2006 15:24:45 GMT 27648
</str>
<str name="digest">6717a734c4f78c7f7f2dbc9a7324199e</str>
<str name="id">file:/C:/Documents and Settings/Alessio/Documenti/</str>
<str name="segment">20120317175631</str>
<str name="title">
Index of C:\Documents and Settings\Alessio\Documenti
</str>
<date name="tstamp">2012-03-17T16:56:39.014Z</date>
<str name="url">file:/C:/Documents and Settings/Alessio/Documenti/</str>
</doc>

suggestions?
tx
alessio

On 12 March 2012 at 09:39, alessio crisantemi <
alessio.crisantemi@gmail.com> wrote:

> I added the path of my directory to regex-urlfilter, but Nutch crawls
> other directories too...
>
> What's more: I followed your suggestions and indexed my root again, but
> I still have an index with the names of my PDF files and not their
> content.
>
> I don't understand..
> alessio
>
> On 12 March 2012 at 06:06, remi tassing <ta...@gmail.com> wrote:
>
>> Using crawl-urlfilter (or regex-urlfilter depending on which one you're
>> using), you should be able to solve this. Unless you're not clear on what
>> folders to exclude...?
>>
>> On Sunday, March 11, 2012, alessio crisantemi <
>> alessio.crisantemi@gmail.com>
>> wrote:
>> > thank you Remi for your precious help. I will try again and write you
>> > the results.
>> > But I have another little question: how can I limit the crawling to
>> > only my selected root?
>> >
>> > Because every time, Nutch also crawls the parent directories. I read
>> > that "The code that is responsible for this is in
>> > org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
>> > f)."
>> >
>> > And someone suggested changing the following line:
>> > this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
>> > true);
>> >
>> > to
>> > this.content = list2html(f.listFiles(), path, false);
>> >
>> > and recompiled.
>> >
>> > But in my class file I already have exactly this line... and that's
>> > not a simple approach.
>> >
>> > Is there another way, perhaps?
>> >
>> > thank you
>> >
>> > alessio
>> >
>> >
>> >
>> > On 11 March 2012 at 18:32, Lewis John Mcgibbney <
>> > lewis.mcgibbney@gmail.com> wrote:
>> >
>> >> Please see below
>> >>
>> >> On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi <
>> >> alessio.crisantemi@gmail.com> wrote:
>> >>
>> >> >
>> >> > [1]
>> >>
>> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
>> >> >
>> >>
>> >> I've now updated this link, thanks for pointing this out.
>> >>
>> >>
>> >> > And Now, I have another problem:
>> >> > I crawled my local file system: a directory with a lot of Pdf files.
>> All
>> >> > works, and nutch index on Solr the results.
>> >> >
>> >>
>> >> OK
>> >>
>> >>
>> >> > But this is the problem: when I submit a query on Solr, I can see
>> >> > only a list of files, and not the PDF contents.
>> >> > why, in your opinion?
>> >> >
>> >>
>> >> Well this might be to do with your file.content.limit in
>> >> nutch-site.xml; maybe your documents are being truncated if they are
>> >> too large. Additionally, your Solr mappings and/or schema
>> >> configuration may need to be tweaked slightly to let you view
>> >> snippets of the PDF content within your Solr search results. In your
>> >> schema configuration for index-basic, try changing
>> >>
>> >> <field name="content" type="text" stored="false" indexed="true"/>
>> >>
>> >> to
>> >>
>> >> <field name="content" type="text" stored="true" indexed="true"/>
>> >>
>> >>
>> >> You will need to reindex your content if you wish to see the results
>> >> through Solr.
>> >>
>> >
>>
>
>

Re: nutch crawling file system SOLVED

Posted by alessio crisantemi <al...@gmail.com>.
I added the path of my directory to regex-urlfilter, but Nutch crawls
other directories too...

What's more: I followed your suggestions and indexed my root again, but I
still have an index with the names of my PDF files and not their content.

I don't understand..
alessio

On 12 March 2012 at 06:06, remi tassing <ta...@gmail.com> wrote:

> Using crawl-urlfilter (or regex-urlfilter depending on which one you're
> using), you should be able to solve this. Unless you're not clear on what
> folders to exclude...?
>
> On Sunday, March 11, 2012, alessio crisantemi <
> alessio.crisantemi@gmail.com>
> wrote:
> > thank you Remi for your precious help. I will try again and write you
> > the results.
> > But I have another little question: how can I limit the crawling to
> > only my selected root?
> >
> > Because every time, Nutch also crawls the parent directories. I read
> > that "The code that is responsible for this is in
> > org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
> > f)."
> >
> > And someone suggested changing the following line:
> > this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
> > true);
> >
> > to
> > this.content = list2html(f.listFiles(), path, false);
> >
> > and recompiled.
> >
> > But in my class file I already have exactly this line... and that's
> > not a simple approach.
> >
> > Is there another way, perhaps?
> >
> > thank you
> >
> > alessio
> >
> >
> >
> > On 11 March 2012 at 18:32, Lewis John Mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> >> Please see below
> >>
> >> On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi <
> >> alessio.crisantemi@gmail.com> wrote:
> >>
> >> >
> >> > [1]
> >> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> >> >
> >>
> >> I've now updated this link, thanks for pointing this out.
> >>
> >>
> >> > And Now, I have another problem:
> >> > I crawled my local file system: a directory with a lot of Pdf files.
> All
> >> > works, and nutch index on Solr the results.
> >> >
> >>
> >> OK
> >>
> >>
> >> > But this is the problem: when I submit a query on Solr, I can see
> >> > only a list of files, and not the PDF contents.
> >> > why, in your opinion?
> >> >
> >>
> >> Well this might be to do with your file.content.limit in
> >> nutch-site.xml; maybe your documents are being truncated if they are
> >> too large. Additionally, your Solr mappings and/or schema
> >> configuration may need to be tweaked slightly to let you view
> >> snippets of the PDF content within your Solr search results. In your
> >> schema configuration for index-basic, try changing
> >>
> >> <field name="content" type="text" stored="false" indexed="true"/>
> >>
> >> to
> >>
> >> <field name="content" type="text" stored="true" indexed="true"/>
> >>
> >>
> >> You will need to reindex your content if you wish to see the results
> >> through Solr.
> >>
> >
>

Re: nutch crawling file system SOLVED

Posted by remi tassing <ta...@gmail.com>.
Using crawl-urlfilter (or regex-urlfilter depending on which one you're
using), you should be able to solve this. Unless you're not clear on what
folders to exclude...?
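
For example, a minimal regex-urlfilter sketch that accepts only one
directory tree and rejects everything else (the path here is hypothetical;
adapt it to yours, note it replaces the default catch-all +. rule, and
this assumes you've already removed the default rule that skips file:
URLs, as in the filesystem tutorials):

# accept anything under the target directory
+^file:/C:/mydocs/
# reject everything else
-.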

On Sunday, March 11, 2012, alessio crisantemi <al...@gmail.com>
wrote:
> thank you Remi for your precious help. I will try again and write you
> the results.
> But I have another little question: how can I limit the crawling to
> only my selected root?
>
> Because every time, Nutch also crawls the parent directories. I read
> that "The code that is responsible for this is in
> org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
> f)."
>
> And someone suggested changing the following line:
> this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
> true);
>
> to
> this.content = list2html(f.listFiles(), path, false);
>
> and recompiled.
>
> But in my class file I already have exactly this line... and that's not
> a simple approach.
>
> Is there another way, perhaps?
>
> thank you
>
> alessio
>
>
>
> On 11 March 2012 at 18:32, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Please see below
>>
>> On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi <
>> alessio.crisantemi@gmail.com> wrote:
>>
>> >
>> > [1]
>> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
>> >
>>
>> I've now updated this link, thanks for pointing this out.
>>
>>
>> > And Now, I have another problem:
>> > I crawled my local file system: a directory with a lot of Pdf files.
All
>> > works, and nutch index on Solr the results.
>> >
>>
>> OK
>>
>>
>> > But this is the problem: when I submit a query on Solr, I can see
>> > only a list of files, and not the PDF contents.
>> > why, in your opinion?
>> >
>>
>> Well this might be to do with your file.content.limit in nutch-site.xml;
>> maybe your documents are being truncated if they are too large.
>> Additionally, your Solr mappings and/or schema configuration may need to
>> be tweaked slightly to let you view snippets of the PDF content within
>> your Solr search results. In your schema configuration for index-basic,
>> try changing
>>
>> <field name="content" type="text" stored="false" indexed="true"/>
>>
>> to
>>
>> <field name="content" type="text" stored="true" indexed="true"/>
>>
>>
>> You will need to reindex your content if you wish to see the results
>> through Solr.
>>
>

Re: nutch crawling file system SOLVED

Posted by alessio crisantemi <al...@gmail.com>.
thank you Remi for your precious help. I will try again and write you the
results.
But I have another little question: how can I limit the crawling to only
my selected root?

Because every time, Nutch also crawls the parent directories. I read that
"The code that is responsible for this is in
org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
f)."

And someone suggested changing the following line:
this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
true);

to
this.content = list2html(f.listFiles(), path, false);

and recompiled.
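
(If I understand correctly, the third argument of list2html controls
whether the generated directory listing includes a link to the parent
directory; hard-coding it to false should stop Nutch from climbing
upwards. That's my reading of the code, not something I have verified.)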

But in my class file I already have exactly this line... and that's not a
simple approach.

Is there another way, perhaps?

thank you

alessio



On 11 March 2012 at 18:32, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Please see below
>
> On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi <
> alessio.crisantemi@gmail.com> wrote:
>
> >
> > [1]
> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> >
>
> I've now updated this link, thanks for pointing this out.
>
>
> > And Now, I have another problem:
> > I crawled my local file system: a directory with a lot of Pdf files. All
> > works, and nutch index on Solr the results.
> >
>
> OK
>
>
> > But this is the problem: when I submit a query on Solr, I can see only
> > a list of files, and not the PDF contents.
> > why, in your opinion?
> >
>
> Well this might be to do with your file.content.limit in nutch-site.xml;
> maybe your documents are being truncated if they are too large.
> Additionally, your Solr mappings and/or schema configuration may need to
> be tweaked slightly to let you view snippets of the PDF content within
> your Solr search results. In your schema configuration for index-basic,
> try changing
>
> <field name="content" type="text" stored="false" indexed="true"/>
>
> to
>
> <field name="content" type="text" stored="true" indexed="true"/>
>
>
> You will need to reindex your content if you wish to see the results
> through Solr.
>

Re: nutch crawling file system SOLVED

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please see below

On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi <
alessio.crisantemi@gmail.com> wrote:

>
> [1]http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
>

I've now updated this link, thanks for pointing this out.


> And Now, I have another problem:
> I crawled my local file system: a directory with a lot of Pdf files. All
> works, and nutch index on Solr the results.
>

OK


> But this is the problem: when I submit a query on Solr, I can see only a
> list of files, and not the PDF contents.
> why, in your opinion?
>

Well this might be to do with your file.content.limit in nutch-site.xml;
maybe your documents are being truncated if they are too large.
Additionally, your Solr mappings and/or schema configuration may need to
be tweaked slightly to let you view snippets of the PDF content within
your Solr search results. In your schema configuration for index-basic,
try changing

<field name="content" type="text" stored="false" indexed="true"/>

to

<field name="content" type="text" stored="true" indexed="true"/>


You will need to reindex your content if you wish to see the results
through Solr.
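
On a 1.x checkout the reindex is typically something like this (the paths
are just examples from the tutorial layout; adjust to your crawl
directory):

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*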

Re: nutch crawling file system SOLVED

Posted by remi tassing <ta...@gmail.com>.
You're probably looking for the "Highlighting" feature

http://wiki.apache.org/solr/HighlightingParameters
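
For example, assuming a default local Solr and that your text is in the
stored "content" field, a query like this returns highlighted snippets:

http://localhost:8983/solr/select?q=content:nutch&hl=true&hl.fl=content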

Remi

On Sun, Mar 11, 2012 at 6:10 PM, alessio crisantemi <
alessio.crisantemi@gmail.com> wrote:

> Thank you Lewis for your explanation: I suspected as much, and posted my
> question on the mailing list to get confirmation.
> But is there an updated tutorial about Nutch configuration for the file
> system?
>
>
> I am following the suggestions I obtained from the mailing list...
>
> [1]http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> [2]
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> [3]
>
> http://stackoverflow.com/questions/941519/how-to-make-nutch-crawl-file-system
> And now, I have another problem:
> I crawled my local file system: a directory with a lot of PDF files.
> Everything works, and Nutch indexes the results in Solr.
> But this is the problem: when I submit a query on Solr, I can see only a
> list of files, and not the PDF contents.
> Why, in your opinion?
>
> thank you
> alessio
>
>
> On 11 March 2012 at 17:59, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> > Hi Alessio,
> >
> > If you check out our official tutorial you will see no mention of
> > crawl-urlfilter, this was deprecated after Nutch 1.2 IIRC.
> >
> > I can only suggest that any other tutorial you are using is in need of an
> > update.
> >
> > http://wiki.apache.org/nutch/NutchTutorial
> >
> > On Sat, Mar 10, 2012 at 4:42 PM, alessio crisantemi <
> > alessio.crisantemi@gmail.com> wrote:
> >
> > > My problem is partially solved: following the tutorial, I configured
> > > Nutch to crawl a local file system, thank you.
> > >
> > > But I have a doubt: why do all the tutorials and guides about Nutch
> > > talk about a 'crawl-urlfilter.txt' file, when the default Nutch
> > > config doesn't have this file? If I put the rules that the guides
> > > give for crawl-urlfilter into regex-urlfilter instead, everything
> > > works.
> > > I would like to understand why.
> > > thank you
> > > alessio
> > >
> > > On 4 March 2012 at 17:02, alessio crisantemi <
> > > alessio.crisantemi@gmail.com> wrote:
> > >
> > > > Hi all,
> > > > I need to crawl a directory with a lot of PDF files.
> > > > But I only know the step-by-step procedure for crawling a website.
> > > > How can I do it for a directory root?
> > > > thank you for helping me
> > > > alessio
> > > >
> > >
> >
> >
> >
> > --
> > Lewis
> >
>

Re: nutch crawling file system SOLVED

Posted by alessio crisantemi <al...@gmail.com>.
Thank you Lewis for your explanation: I suspected as much, and posted my
question on the mailing list to get confirmation.
But is there an updated tutorial about Nutch configuration for the file
system?


I am following the suggestions I obtained from the mailing list...

[1]http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
[2]http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
[3]
http://stackoverflow.com/questions/941519/how-to-make-nutch-crawl-file-system
And now, I have another problem:
I crawled my local file system: a directory with a lot of PDF files.
Everything works, and Nutch indexes the results in Solr.
But this is the problem: when I submit a query on Solr, I can see only a
list of files, and not the PDF contents.
Why, in your opinion?

thank you
alessio


On 11 March 2012 at 17:59, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Alessio,
>
> If you check out our official tutorial you will see no mention of
> crawl-urlfilter, this was deprecated after Nutch 1.2 IIRC.
>
> I can only suggest that any other tutorial you are using is in need of an
> update.
>
> http://wiki.apache.org/nutch/NutchTutorial
>
> On Sat, Mar 10, 2012 at 4:42 PM, alessio crisantemi <
> alessio.crisantemi@gmail.com> wrote:
>
> > My problem is partially solved: following the tutorial, I configured
> > Nutch to crawl a local file system, thank you.
> >
> > But I have a doubt: why do all the tutorials and guides about Nutch
> > talk about a 'crawl-urlfilter.txt' file, when the default Nutch config
> > doesn't have this file? If I put the rules that the guides give for
> > crawl-urlfilter into regex-urlfilter instead, everything works.
> > I would like to understand why.
> > thank you
> > alessio
> >
> > On 4 March 2012 at 17:02, alessio crisantemi <
> > alessio.crisantemi@gmail.com> wrote:
> >
> > > Hi all,
> > > I need to crawl a directory with a lot of PDF files.
> > > But I only know the step-by-step procedure for crawling a website.
> > > How can I do it for a directory root?
> > > thank you for helping me
> > > alessio
> > >
> >
>
>
>
> --
> Lewis
>

Re: nutch crawling file system SOLVED

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alessio,

If you check out our official tutorial you will see no mention of
crawl-urlfilter, this was deprecated after Nutch 1.2 IIRC.

I can only suggest that any other tutorial you are using is in need of an
update.

http://wiki.apache.org/nutch/NutchTutorial

On Sat, Mar 10, 2012 at 4:42 PM, alessio crisantemi <
alessio.crisantemi@gmail.com> wrote:

> My problem is partially solved: following the tutorial, I configured
> Nutch to crawl a local file system, thank you.
>
> But I have a doubt: why do all the tutorials and guides about Nutch talk
> about a 'crawl-urlfilter.txt' file, when the default Nutch config
> doesn't have this file? If I put the rules that the guides give for
> crawl-urlfilter into regex-urlfilter instead, everything works.
> I would like to understand why.
> thank you
> alessio
>
> On 4 March 2012 at 17:02, alessio crisantemi <
> alessio.crisantemi@gmail.com> wrote:
>
> > Hi all,
> > I need to crawl a directory with a lot of PDF files.
> > But I only know the step-by-step procedure for crawling a website.
> > How can I do it for a directory root?
> > thank you for helping me
> > alessio
> >
>



-- 
Lewis