Posted to user@nutch.apache.org by Savannah Beckett <sa...@yahoo.com> on 2010/09/29 21:56:24 UTC

How to Index Pure Text into Separate Fields?

Hi,
  I am using XPath to index different parts of HTML pages into different
fields.  Now I have some pure text documents that have no HTML, so I can't use
XPath.  How do I index these pure text documents into different fields of the
index?  How do I make Nutch/Solr understand that these different parts belong
to different fields?  Maybe I can use the existing content in the fields of my
index?
Thanks.

Re: How to Index Pure Text into Separate Fields?

Posted by Savannah Beckett <sa...@yahoo.com>.

No, these new documents are not HTML; they are pure text, like the ones you see
in Notepad or Microsoft Word.  I have no problem indexing HTML, but I am stuck
with these pure text documents.

________________________________
From: Scott Gonyea <sc...@aitrus.org>
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 1:20:20 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Break your HTML pages into the desired fields and format them as follows:

http://wiki.apache.org/solr/UpdateXmlMessages

And away you go.  You may want to search and review the wiki.  Also, if
you're indexing websites and want to place them in Solr, you should look
at Nutch.  It can do all that work for you, and more.

Scott

Re: How to Index Pure Text into Separate Fields?

Posted by Scott Gonyea <sc...@aitrus.org>.

Break your HTML pages into the desired fields and format them as follows:

http://wiki.apache.org/solr/UpdateXmlMessages

And away you go.  You may want to search and review the wiki.  Also, if
you're indexing websites and want to place them in Solr, you should look
at Nutch.  It can do all that work for you, and more.

Scott
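
The same update format also applies to plain text, once the text has been split
into fields by some rule of your own.  A minimal sketch in Python, assuming a
hypothetical layout where the first non-empty line of the file is the title and
the rest is the body.  The field names "id", "title" and "content" are
assumptions for illustration and would have to match your schema.xml.

```python
import xml.etree.ElementTree as ET


def split_plain_text(text):
    """Split a plain text document into fields.

    Assumed (hypothetical) layout: the first non-empty line is the title,
    everything after it is the body.
    """
    lines = text.strip().splitlines()
    title = lines[0].strip() if lines else ""
    body = "\n".join(lines[1:]).strip()
    return {"title": title, "content": body}


def build_add_message(doc_id, fields):
    """Build a Solr <add><doc>...</doc></add> update message as a string."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" is an assumed unique key name; use whatever your schema defines.
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    # ElementTree escapes characters such as "&" and "<" in the field text.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample = "Quarterly report\n\nSales were up & costs were down.\n"
    print(build_add_message("report-1", split_plain_text(sample)))
```

The resulting string is what you would POST to Solr's update handler, followed
by a commit.  The splitting rule is the part that depends entirely on how your
text files are laid out.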

Re: How to Index Pure Text into Separate Fields?

Posted by Lance Norskog <go...@gmail.com>.

Simple text .txt files and MS Office .doc files are very, very different beasts.
You can handle simple .txt files with a few more lines in your
DataImportHandler script.
With .doc files it is easiest to use the extracting request handler,
"/extract".  This is on the wiki.
If you want to do this inside the DataImportHandler, you need to use
3.x or the trunk, and it has bugs.
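
As a rough illustration of the extracting request handler route, here is a
sketch in Python that sends one file to Solr and sets some fields directly with
literal.<fieldname> parameters.  The base URL, the handler path (often mapped
as /update/extract in the example solrconfig.xml) and the "category" field are
assumptions; adjust them to your own setup and schema.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumptions: Solr base URL and handler path; adjust to your install.
SOLR_URL = "http://localhost:8983/solr"
EXTRACT_PATH = "/update/extract"


def build_extract_request(path, doc_id, literals=None, commit=True):
    """Build a POST request that sends one file to the extracting handler.

    literal.<field>=<value> parameters set index fields directly; the
    extracted body text goes wherever the handler is configured to put it.
    """
    params = {"literal.id": doc_id}
    for name, value in (literals or {}).items():
        params["literal." + name] = value
    if commit:
        params["commit"] = "true"
    with open(path, "rb") as fh:
        data = fh.read()
    url = SOLR_URL + EXTRACT_PATH + "?" + urlencode(params)
    return Request(
        url,
        data=data,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


if __name__ == "__main__":
    # "report.doc" and the "category" field are hypothetical examples.
    req = build_extract_request("report.doc", "report-1", {"category": "reports"})
    with urlopen(req) as resp:  # needs a running Solr instance
        print(resp.status)
```

This only covers getting metadata into separate fields alongside the extracted
text.  Splitting the extracted text itself into several fields would still need
a rule of your own.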

On Wed, Sep 29, 2010 at 3:55 PM, Savannah Beckett
<sa...@yahoo.com> wrote:
> No, I am using XPath for HTML; that is not the question.  I am indexing pure
> text in addition to the HTML that I was already indexing.  Pure text like a
> TXT file or a Microsoft Word doc.  So, with no XPath for TXT, how do I index a
> TXT file into different fields in my index, the way I use XPath to index HTML
> into different fields?
>
> My question refers to pure text like .txt files and Microsoft Word, not
> HTML.  I am completely fine with HTML.
> Thanks.

-- 
Lance Norskog
goksron@gmail.com

Re: How to Index Pure Text into Separate Fields?

Posted by Savannah Beckett <sa...@yahoo.com>.

No, I am using XPath for HTML; that is not the question.  I am indexing pure
text in addition to the HTML that I was already indexing.  Pure text like a
TXT file or a Microsoft Word doc.  So, with no XPath for TXT, how do I index a
TXT file into different fields in my index, the way I use XPath to index HTML
into different fields?

My question refers to pure text like .txt files and Microsoft Word, not
HTML.  I am completely fine with HTML.
Thanks.

________________________________
From: Erick Erickson <er...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 2:59:26 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Can you provide a few more details? You mention xpath, which leads me
to believe that you are using DIH. Is that true? How are you getting
your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a
filesystem, you could use two fileDataSources, one that works only on
files with a particular extension (xml, say) and another that processes
.txt files.

But that said, if you're trying to index "just the text" of a Word document,
you have to parse it quite differently than a plain text file. Take a look
at Tika.

All of which may not help you at all, because I'm guessing...

So I think a more complete problem statement would help us help you.

Best
Erick

Re: How to Index Pure Text into Separate Fields?

Posted by Erick Erickson <er...@gmail.com>.

Can you provide a few more details? You mention xpath, which leads me
to believe that you are using DIH. Is that true? How are you getting
your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a
filesystem, you could use two fileDataSources, one that works only on
files with a particular extension (xml, say) and another that processes
.txt files.

But that said, if you're trying to index "just the text" of a Word document,
you have to parse it quite differently than a plain text file. Take a look
at Tika.

All of which may not help you at all, because I'm guessing...

So I think a more complete problem statement would help us help you.

Best
Erick
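
The "one source per file type" idea can be illustrated outside DIH as well.
The following is only a sketch in plain Python, not a DIH configuration: it
walks a directory and routes each file by extension, reading .txt files
directly and leaving .doc files as a placeholder for a real parser such as
Tika.  All function names here are hypothetical.

```python
from pathlib import Path


def read_plain_text(path):
    """Plain .txt files can be read directly."""
    return path.read_text(encoding="utf-8", errors="replace")


def needs_extraction(path):
    """Binary formats such as Word need a parser like Tika; not handled here."""
    return None


# Map each extension to the function that turns the file into text.
HANDLERS = {
    ".txt": read_plain_text,
    ".doc": needs_extraction,
}


def route_files(root):
    """Yield (path, text) for every file under root with a known extension."""
    for path in sorted(Path(root).rglob("*")):
        handler = HANDLERS.get(path.suffix.lower())
        if path.is_file() and handler is not None:
            yield path, handler(path)


if __name__ == "__main__":
    for path, text in route_files("."):
        status = "needs a parser" if text is None else f"{len(text)} characters"
        print(path, status)
```

Files with other extensions are skipped, so each file type gets exactly one
processing path, which is the point of splitting the work by extension.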
