Posted to java-user@lucene.apache.org by Gaurav Kumar <ga...@gmail.com> on 2009/05/28 12:22:24 UTC

Help Needed...

Hi everyone,

I am working on a project using Lucene in which I need to index HTML files.
I am using Tika to parse them, but I need to index the files according to
their tags: the text inside each HTML tag (such as <p> or <a>) should be
stored in its own field. Can I do that? If yes, how? Also, can I assign
different weights to the tokens in different fields? If yes, how?

Re: Help Needed...

Posted by Anshum <an...@gmail.com>.
Indexing and storing are at the developer's discretion. You can choose
whether or not to store a field, depending on your requirements.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


On Thu, May 28, 2009 at 4:22 PM, Alexander Aristov <
alexander.aristov@gmail.com> wrote:

> You will need to develop a parser and an indexer.
>
> But remember that in the current implementation the content is not stored
> in the Lucene index: it is indexed, yes, but not stored.
>
> Best Regards
> Alexander Aristov

Re: Help Needed...

Posted by Alexander Aristov <al...@gmail.com>.
You will need to develop a parser and an indexer.

But remember that in the current implementation the content is not stored
in the Lucene index: it is indexed, yes, but not stored.

Best Regards
Alexander Aristov



Re: Help Needed...

Posted by Karl Wettin <ka...@gmail.com>.
You might want to explain what you are trying to achieve with this. I
suspect you might want to use payloads rather than indexing the tokens in
multiple fields.


      karl



Re: Help Needed...

Posted by Paul Libbrecht <pa...@activemath.org>.
Kumar,

you'll have to build your own documents after parsing the HTML yourself
(e.g. with NekoHTML into a DOM).
As for the token weights, in addition to IDF, you can set them per field,
i.e. when you add a field to the document.

paul
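
The DOM-walking step suggested above could look roughly like the sketch
below. It is only an illustration under stated assumptions: the input is
taken to be well-formed XHTML so that the JDK's own XML parser can build the
DOM (real-world HTML would first need a tolerant parser such as NekoHTML),
and the class name TagFieldCollector and its collect method are hypothetical
names, not part of Lucene, Tika or NekoHTML. Text is grouped under the name
of its nearest enclosing element, so each map entry corresponds to one
candidate field.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: groups text by the tag that directly encloses it.
class TagFieldCollector {

    // Returns tag name -> concatenated text found directly inside that tag.
    // Assumes well-formed XHTML; anything else is rejected.
    public static Map<String, String> collect(String xhtml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        walk(dom.getDocumentElement(), buffers);

        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            fields.put(e.getKey(), e.getValue().toString());
        }
        return fields;
    }

    // Depth-first walk: text nodes are appended to the buffer of their parent
    // element, child elements are visited recursively.
    private static void walk(Node element, Map<String, StringBuilder> buffers) {
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (text.isEmpty()) {
                    continue; // skip whitespace-only text between tags
                }
                String tag = element.getNodeName().toLowerCase(Locale.ROOT);
                StringBuilder sb = buffers.computeIfAbsent(tag, k -> new StringBuilder());
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk(child, buffers);
            }
        }
    }
}
```

Each entry of the returned map could then become one field of a Lucene
document, with the tag name as the field name. In the Lucene releases of
that period a per-field weight was set with Field.setBoost(float) before
adding the field to the document; check this against the version in use,
since index-time boosts were removed in later major versions.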

