Posted to user@nutch.apache.org by remi tassing <ta...@gmail.com> on 2012/02/15 14:26:33 UTC

tstamp vs. lastModified ...

Hello all,

What does tstamp represent? I can see it shown in Solr results after indexing.

I'm interested in showing the "last modified" meta-data in Solr results but
I'm not sure if Nutch does retrieve this value.

Thanks in advance for the help!

Remi

Re: tstamp vs. lastModified ...

Posted by remi tassing <ta...@gmail.com>.
It would be interesting to find out what exactly causes such a huge speed
difference. For me the speed increase is on the order of 10x...crazy!

On Wed, Feb 15, 2012 at 9:35 PM, Markus Jelsma
<ma...@openindex.io>wrote:

>
> > You're both correct, after changing the type for tstamp and lastModified
> > from long to date, no error anymore.
> >
> > Next thing I need to do is set up cygwin/svn to be able to get fresh
> > svn/trunk code...it's so cool to be up-to-date. Nutch-1.4 is just
> > ridiculously faster than 1.2 :-)
> >
>
> Is it faster? I read such a thing before somewhere on the list but i really
> don't know why it would be faster. Must be a case of bad settings in 1.2 i
> guess.
>
>

Re: tstamp vs. lastModified ...

Posted by Markus Jelsma <ma...@openindex.io>.
> You're both correct, after changing the type for tstamp and lastModified
> from long to date, no error anymore.
>
> Next thing I need to do is set up cygwin/svn to be able to get fresh
> svn/trunk code...it's so cool to be up-to-date. Nutch-1.4 is just
> ridiculously faster than 1.2 :-)
> 

Is it faster? I read such a thing before somewhere on the list, but I really
don't know why it would be faster. Must be a case of bad settings in 1.2, I
guess.




Re: tstamp vs. lastModified ...

Posted by remi tassing <ta...@gmail.com>.
You're both correct: after changing the type for tstamp and lastModified
from long to date, the error is gone.

Next thing I need to do is set up cygwin/svn to be able to get fresh
svn/trunk code...it's so cool to be up-to-date. Nutch-1.4 is just
ridiculously faster than 1.2 :-)

Thanks!!

Remi

On Wed, Feb 15, 2012 at 9:14 PM, Markus Jelsma
<ma...@openindex.io>wrote:

> That was likely an old schema. In trunk (or was it already in 1.4?) it is of
> type date.
> http://svn.apache.org/viewvc/nutch/trunk/conf/schema.xml?view=markup
>
> > Remi, I had a similar problem but for a custom field that I was trying to
> > post to Solr (via solrindex) as a type="date" in the schema.xml. Turns out
> > my date string was formatted incorrectly (it was missing the trailing Z).
> > From the error message it appears that perhaps the field into which this
> > field is going in is set as long or int. If you set it to type="date" it
> > should take it (and you can do Solr's date arithmetic on it).

Re: tstamp vs. lastModified ...

Posted by Markus Jelsma <ma...@openindex.io>.
That was likely an old schema. In trunk (or was it already in 1.4?) it is of
type date.
http://svn.apache.org/viewvc/nutch/trunk/conf/schema.xml?view=markup

> Remi, I had a similar problem but for a custom field that I was trying to
> post to Solr (via solrindex) as a type="date" in the schema.xml. Turns out
> my date string was formatted incorrectly (it was missing the trailing Z).
> From the error message it appears that perhaps the field into which this
> field is going in is set as long or int. If you set it to type="date" it
> should take it (and you can do Solr's date arithmetic on it).

Re: tstamp vs. lastModified ...

Posted by SUJIT PAL <su...@comcast.net>.
Remi, I had a similar problem, but for a custom field that I was trying to
post to Solr (via solrindex) as type="date" in schema.xml. It turned out my
date string was formatted incorrectly (it was missing the trailing Z). From
the error message, it appears that the field this value is going into is
declared as long or int. If you set it to type="date", it should take it
(and you can then do Solr's date arithmetic on it).
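
To make the failure concrete, here is a minimal Java sketch of the two cases described above. It uses plain java.time as a stand-in (an assumption; it is not Solr's own field parser), and the class and method names are hypothetical. It only shows that the string from the error message cannot be read as a long, that it does parse as a UTC timestamp, and that dropping the trailing Z makes the timestamp parse fail too.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

/** Hypothetical helper, not part of Nutch or Solr. */
final class DateFieldCheck {

    /** True if the value can be read as a long, as a numeric field would require. */
    static boolean parsesAsLong(String value) {
        try {
            Long.parseLong(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** True if the value is a complete ISO-8601 instant, including its zone designator. */
    static boolean parsesAsInstant(String value) {
        try {
            Instant.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "2012-02-08T14:40:09.416Z"; // the string from the error message
        System.out.println("as long:         " + parsesAsLong(value));
        System.out.println("as date:         " + parsesAsInstant(value));
        System.out.println("as date, no 'Z': " + parsesAsInstant("2012-02-08T14:40:09.416"));
    }
}
```

The first check mirrors the NumberFormatException in the log; the last one mirrors the missing-Z problem with the custom field.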

On Feb 15, 2012, at 11:01 AM, remi tassing wrote:

> Awesome!
> 
> Pushing this to Solr gives me an error (solrindex):
> SEVERE: java.lang.NumberFormatException: For input string:
> "2012-02-08T14:40:09.416Z"
>        at java.lang.NumberFormatException.forInputString(Unknown Source)

Re: tstamp vs. lastModified ...

Posted by remi tassing <ta...@gmail.com>.
Awesome!

Pushing this to Solr gives me an error (solrindex):
SEVERE: java.lang.NumberFormatException: For input string:
"2012-02-08T14:40:09.416Z"
        at java.lang.NumberFormatException.forInputString(Unknown Source)

But I'll try to figure this out on my own

I really appreciate your help!

Remi

On Wed, Feb 15, 2012 at 8:18 PM, Markus Jelsma
<ma...@openindex.io>wrote:

> sure, use the indexchecker tool.
>
> > Is there any quick way to see the impact of index-more? I deleted the
> > parse-related folders in the segment and re-parsed it, but when I run
> > readseg there is no difference...

Re: tstamp vs. lastModified ...

Posted by Markus Jelsma <ma...@openindex.io>.
sure, use the indexchecker tool.

> Is there any quick way to see the impact of index-more? I deleted the
> parse-related folders in the segment and re-parsed it, but when I run
> readseg there is no difference...

Re: tstamp vs. lastModified ...

Posted by remi tassing <ta...@gmail.com>.
Is there any quick way to see the impact of index-more? I deleted the
parse-related folders in the segment and re-parsed it, but when I run
readseg there is no difference...


Re: tstamp vs. lastModified ...

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Wed, Feb 15, 2012 at 4:00 PM, remi tassing <ta...@gmail.com> wrote:

> tstamp shows a string of digits like 20020123123212
>

This is OK. It reads as yyyy-MM-dd-HH-mm-ss with the separators dropped
(here 2002-01-23 12:32:12). It is, however, hellishly old!
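
As an illustration of how such a digit string maps onto a readable date, here is a small Java sketch. It assumes the value is a 14-digit yyyyMMddHHmmss timestamp in UTC (the time zone is an assumption; the thread does not state it), and the class and method names are hypothetical rather than anything from Nutch.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper, not part of Nutch. */
final class TstampFormat {

    // Compact form with no separators, e.g. 20020123123212.
    private static final DateTimeFormatter COMPACT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Converts the compact digits to ISO-8601 with a trailing Z, treating them as UTC. */
    static String toIsoUtc(String compact) {
        LocalDateTime local = LocalDateTime.parse(compact, COMPACT);
        return local.toInstant(ZoneOffset.UTC).toString();
    }

    public static void main(String[] args) {
        System.out.println(toIsoUtc("20020123123212")); // prints 2002-01-23T12:32:12Z
    }
}
```

The ISO-8601 form with the trailing Z is the shape that a Solr field of type date expects.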


>
> Never heard of the plugin "index-more" and it's poorly documented.

Well, it's been included from 1.2 onwards, so I'm very surprised at that. If
you feel like it, then please feel free to add documentation; this is always
something we are after and would be a great help to the community.

> After adding this to plugin.includes, I'll need to run solrindex or is it
> necessary to re-parse or recrawl (I think this less likely IMO)?
>
If you wish to have the fields we are able to extract with index-more, e.g.

  <!-- fields for index-more plugin -->
  <field name="type" type="string" stored="true" indexed="true"
         multiValued="true"/>
  <field name="contentLength" type="long" stored="true" indexed="false"/>
  <field name="lastModified" type="long" stored="true" indexed="true"/>
  <field name="date" type="string" stored="true" indexed="true"/>

then you'll need to add the plugin. I would rebuild the project if it is
possible, but this is not essential; then index your content. And yes, I
would expect the parsers need to be re-run to extract the lastModified
value from pages.

Re: tstamp vs. lastModified ...

Posted by remi tassing <ta...@gmail.com>.
Hi,

tstamp shows a string of digits like 20020123123212

I had never heard of the "index-more" plugin, and it's poorly documented.
After adding it to plugin.includes, do I just need to run solrindex, or is
it necessary to re-parse or recrawl (less likely, IMO)?

Thanks again

Remi


Re: tstamp vs. lastModified ...

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Remi,

On Wed, Feb 15, 2012 at 1:51 PM, remi tassing <ta...@gmail.com> wrote:

> Thanks for the clarification!
>
nb

>
> For tstamp, I can actually see it in Solr results (even though the format
> is weird)
>
what is the format?


>
> How can I get Last-Modified value in Solr as well? Does Nutch need to be
> configured in some way?
>
You can get this by using index-more and by changing the indexed value to
true in schema.xml

Thanks

Re: tstamp vs. lastModified ...

Posted by remi tassing <ta...@gmail.com>.
Hey Lewis,

Thanks for the clarification!

For tstamp, I can actually see it in Solr results (even though the format
is weird)

How can I get Last-Modified value in Solr as well? Does Nutch need to be
configured in some way?

Remi

On Wed, Feb 15, 2012 at 3:46 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> iirc time stamp represents when page was last fetched. Yes you should be
> able to specify this value in your schema and get it mapped to solr index.
>
> Last modified is when the actual page was last modified e.g. when there was
> a change to the page source or something.

Re: tstamp vs. lastModified ...

Posted by Lewis John Mcgibbney <le...@gmail.com>.
IIRC, tstamp represents when the page was last fetched. Yes, you should be
able to specify this field in your schema and get it mapped to the Solr index.

Last modified is when the actual page was last modified, e.g. when there was
a change to the page source or something.
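
A short Java sketch of the distinction, under two stated assumptions: that the last-modified value originates from the server's HTTP Last-Modified header, and that the fetch time is simply the clock reading when the crawler retrieved the page. The header value, the fetch time, and the class name below are all made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper, not part of Nutch. */
final class LastModifiedDemo {

    /** Parses an HTTP-style date such as "Wed, 08 Feb 2012 14:40:09 GMT". */
    static Instant parseLastModified(String headerValue) {
        return ZonedDateTime
                .parse(headerValue, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
    }

    public static void main(String[] args) {
        // When the server says the page content last changed (made-up value).
        Instant lastModified = parseLastModified("Wed, 08 Feb 2012 14:40:09 GMT");
        // When the crawler fetched the page (made-up value).
        Instant fetched = Instant.parse("2012-02-15T13:26:33Z");

        System.out.println("lastModified = " + lastModified);
        System.out.println("tstamp       = " + fetched);
        System.out.println("page age at fetch: "
                + Duration.between(lastModified, fetched).toDays() + " days");
    }
}
```

The two values answer different questions, so a page fetched today can still report a last-modified date from years ago.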

On Wed, Feb 15, 2012 at 1:26 PM, remi tassing <ta...@gmail.com> wrote:

> Hello all,
>
> What does tstamp represent? I can see it shown in Solr results after indexing.
>
> I'm interested in showing the "last modified" meta-data in Solr results but
> I'm not sure if Nutch does retrieve this value.
>
> Thanks in advance for the help!
>
> Remi
>



-- 
*Lewis*