Posted to solr-user@lucene.apache.org by pankaj bhatt <pa...@gmail.com> on 2011/01/25 11:29:53 UTC
DIH From various File system locations
Hi All,
I need to index the documents present in my file system at various
locations (e.g. C:\docs, d:\docs).
Is there any way I can specify this in my DIH configuration?
Here is my configuration:
<document>
  <entity name="sd"
          processor="FileListEntityProcessor"
          fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
          baseDir="G:\\Desktop\\"
          recursive="false"
          rootEntity="true"
          transformer="DateFormatTransformer"
          onError="continue">
    <entity name="tikatest"
            processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
            url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="Author" name="author" meta="true"/>
      <field column="Content-Type" name="title" meta="true"/>
      <!-- field column="title" name="title" meta="true"/ -->
      <field column="text" name="all_text"/>
    </entity>
    <!-- field column="fileLastModified" name="date"
         dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" / -->
    <field column="fileSize" name="size"/>
    <field column="file" name="filename"/>
  </entity>
  <!--baseDir="../site"-->
</document>
/ Pankaj Bhatt.
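
Editor's note: one way this might be done within DIH itself, as an untested sketch rather than a confirmed answer from the thread, is to declare one root entity per location, each with its own baseDir. The entity names "docsC", "docsD", "tikaC" and "tikaD" are hypothetical; the paths come from the question above, and the "bin" data source is assumed to be defined elsewhere in the existing data-config, as in the original post.

```xml
<document>
  <!-- First location. The entity name "docsC" is a hypothetical label. -->
  <entity name="docsC"
          processor="FileListEntityProcessor"
          fileName="docx$|doc$|pdf$"
          baseDir="C:\docs"
          recursive="false"
          rootEntity="true">
    <!-- The variable prefix must match the enclosing entity's name. -->
    <entity name="tikaC"
            processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
            url="${docsC.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="text" name="all_text"/>
    </entity>
    <field column="file" name="filename"/>
  </entity>
  <!-- Second location: same structure, different baseDir. -->
  <entity name="docsD"
          processor="FileListEntityProcessor"
          fileName="docx$|doc$|pdf$"
          baseDir="d:\docs"
          recursive="false"
          rootEntity="true">
    <entity name="tikaD"
            processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
            url="${docsD.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="text" name="all_text"/>
    </entity>
    <field column="file" name="filename"/>
  </entity>
</document>
```

As far as the DIH documentation describes it, a full-import with no "entity" request parameter runs every entity directly under <document>, while passing entity=docsC should restrict the run to that one location. The fileName pattern is shortened here only to keep the sketch small.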
Re: DIH From various File system locations
Posted by Adam Estrada <es...@gmail.com>.
I take that back... Use version 1.2, which I am currently using, and make
sure that the latest versions of Tika and PDFBox are in the contrib folder.
1.3 is structured a bit differently and it doesn't look like there is
a contrib directory. Maybe one of the Nutch contributors can comment
on this?
Adam
Re: DIH From various File system locations
Posted by Adam Estrada <es...@gmail.com>.
There are a few tutorials out there.
1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
3. Build the latest from branch
   http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
   this one:
   http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/
   but add the "-solr" parameter at the end:
   bin/nutch crawl urls -depth 5 -topN 100 -solr http://localhost:8983/solr
This will automatically add the data Nutch collected to Solr. For
larger files I would also increase your JAVA_OPTS env to something
like JAVA_OPTS='-Xmx2048m'
Adam
Re: DIH From various File system locations
Posted by pankaj bhatt <pa...@gmail.com>.
Thanks Adam. It seems like Nutch would solve most of my concerns.
It would be great if you could share resources for Nutch with us.
/ Pankaj Bhatt.
Re: DIH From various File system locations
Posted by Estrada Groups <es...@gmail.com>.
I would just use Nutch and specify the -solr param on the command line. That will add the extracted content to your instance of Solr.
Adam