Posted to user@nutch.apache.org by Raghavendra Prabhu <rr...@gmail.com> on 2006/02/01 17:19:00 UTC

indexing issue

Hi

I also have some files to index.

How do I set a particular parser as the default?

Currently the text parser does not work well for the file type I have.

I want to make the doc (Word) parser the default one (in the sense that if
no parser is found, Word should be used as the default parser and not the
text parser).

How do I do it?

Rgds
Prabhu

Re: indexing issue

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi Chris

I am working with the committed version, but I am still getting the problem.

I will try to find out what the issue is and get back to you.

I hope your code is also committed to the trunk soon.

Thanks

Rgds
Prabhu



RE: indexing issue

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Prabhu,

Okay, it seems that NUTCH-135 fixed this issue in cached.jsp. Jerome C. and I
are working on the larger issue of case-insensitive metadata and standard
metadata (NUTCH-139). I hope to have a patch out within the next few days
that addresses all the issues; I've been kind of busy, sorry about that. If
everyone approves the patch, it could be committed by the end of the week.

Thanks,
  Chris




______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: indexing issue

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi Chris

Please refer to NUTCH-123

Null Pointer exception for cached.jsp


Rgds
Prabhu


Re: indexing issue

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi Chris

I sometimes cannot access the cached page. It throws an exception (this is
what I get for file system indexing).

I read a comment in JIRA saying that this is because of content-type,
specifically mismatches in the letter case of content-type.

Since you were addressing this issue, I thought your fix would also address
the problem of the cached page throwing an exception.

I will try to find the exact cause of the failure and let you know.

Rgds

Prabhu





RE: indexing issue

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Prabhu,

> Also, on the cached page, I get frequent errors for file system indexing.
> 
> Is it because of the content-type bug (which you are working on)?

Not sure. What errors are you getting? I fixed a bug in cached.jsp that had
to do with an absolute versus relative link (see NUTCH-112). Jerome C.
committed that a while back. Was your problem with cached.jsp related to
absolute versus relative links?

Thanks,
  Chris



Re: indexing issue

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi Chris

Thanks.

Also, on the cached page, I get frequent errors for file system indexing.

Is it because of the content-type bug (which you are working on)?


Rgds

Prabhu


RE: indexing issue

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Raghavendra,

  Pop open your $NUTCH_HOME/conf/parse-plugins.xml file. Look for the
mimeType name="*" portion of the file. Now, look at the parser tag
underneath it. Change that parser id to the one you want to use for your
default parser, i.e., in your case, parse-msword.
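
As a rough sketch of the change described above (an illustration, not copied
from anyone's actual file): the catch-all entry would end up looking something
like the fragment below. It assumes the element under mimeType is named
"plugin" and carries the id attribute, so "the parser tag" and "parser id"
refer to that element; the previous value parse-text is likewise assumed from
the text parser being the current default. Check your own copy of
parse-plugins.xml for the exact names.

```xml
<!-- Sketch of the catch-all entry in $NUTCH_HOME/conf/parse-plugins.xml.
     Assumption: the child element is named "plugin"; verify in your file. -->
<mimeType name="*">
  <!-- assumed previous default: <plugin id="parse-text" /> -->
  <plugin id="parse-msword" />
</mimeType>
```

The parse-msword plugin presumably also has to be enabled in your
configuration (for example via the plugin.includes property) for this to take
effect; that part is an assumption and is not covered in this thread.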

Hope that helps!

Cheers,
  Chris



