Posted to solr-user@lucene.apache.org by Manohar Kanuri <so...@kanuri.org> on 2014/09/22 22:16:24 UTC

Formatting dates

Hello,

I am a non-techie who decided to download and install Solr 5.0 to parse data for my community activism. I got it installed and running, and updated the example schema and installation with a bunch of CSV data. Then I went back to deal with the first of the two fields I had deferred till later: dates and location data.

The CSV data file for Jan - August 2014 is about 650 MB with about 1.25 million records/rows. I split it into 5 pieces and changed MM/DD/YYYY HH:MM:SS AM/PM to the YYYY-MM-DDTHH:MM:SSZ format required by Solr, using TextWrangler. That is the tool I know, and it is a step up from the Mac Numbers spreadsheet, which does the conversion very easily but would require breaking the file into pieces smaller than 25-30 MB. Random fields can get updated months after the record was created, so I have to find an easier way than breaking the CSV file into smaller bits and reformatting manually. Each record/row has 4 date fields, so potentially there are up to 5 million fields to be reformatted in 8 months' worth of data.

I did a Google search (didn't see a Solr search page) on the mailing list archives and the internet, but it seems my question is either too simple or it's staring me in the face and I'm just missing it: Is there a simple way to reformat the dates to Solr-style in a 650 MB-1 GB CSV file? Or, ideally, to have the dates and times automatically reformatted as the Solr index gets updated with the latest data (I recall reading this was not possible)? Is there a widget/gadget/gizmo/script that would do this?

thanks,
manohar

Re: Formatting dates

Posted by Erick Erickson <er...@gmail.com>.
Alexandre:

Honest, I looked for that but was in a rush and couldn't find it and
thought I was remembering something _else_.

That's definitely a better approach, thanks! Perhaps this time I'll
remember....

Erick

On Mon, Sep 22, 2014 at 3:23 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> You could try - for your ideal scenario - creating an
> UpdateRequestProcessor (URP) chain that includes
> ParseDateFieldUpdateProcessorFactory:
>
> https://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html
>
> Notice that it has been designed for the dynamic field scenario, so by
> default it looks at everything and tries to make it a date. But its
> parent class has some parameters for restricting it to specific fields:
>
> https://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html
>
> You can see an example in the schemaless config example:
>
> https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/example-schemaless/solr/collection1/conf/solrconfig.xml#L1584
>
> Just remember that when you are creating a URP chain:
> 1) You need to keep two (or three) of the update request processors in
> the chain, not just your date one. The details are here:
> https://wiki.apache.org/solr/UpdateRequestProcessor . The example
> above uses three, to deal with the cloud situation.
> 2) You need to refer to that chain in the request handler to make sure
> it is actually used:
>
> https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/example-schemaless/solr/collection1/conf/solrconfig.xml#L1014
>
> I THINK this should work, and it would classify as configuration, not
> customization and definitely not programming.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

Re: Formatting dates

Posted by Manohar Kanuri <so...@kanuri.org>.
Thanks Alex,

I will try your "not programming" :) solution.  Really appreciate your time and effort. 

manohar



Re: Formatting dates

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You could try - for your ideal scenario - creating an
UpdateRequestProcessor (URP) chain that includes
ParseDateFieldUpdateProcessorFactory:
https://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html

Notice that it has been designed for the dynamic field scenario, so by
default it looks at everything and tries to make it a date. But its
parent class has some parameters for restricting it to specific fields:
https://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html

You can see an example in the schemaless config example:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/example-schemaless/solr/collection1/conf/solrconfig.xml#L1584

Just remember that when you are creating a URP chain:
1) You need to keep two (or three) of the update request processors in
the chain, not just your date one. The details are here:
https://wiki.apache.org/solr/UpdateRequestProcessor . The example
above uses three, to deal with the cloud situation.
2) You need to refer to that chain in the request handler to make sure
it is actually used:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/example-schemaless/solr/collection1/conf/solrconfig.xml#L1014

I THINK this should work, and it would classify as configuration, not
customization and definitely not programming.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
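
As a rough illustration of the behaviour described above (try every field by default, or only named fields when told to), here is a minimal Python sketch. It is not Solr's implementation and uses no Solr API: the function names, the field names in the usage note, and the list of input patterns are all assumptions made for the example. It also assumes the source timestamps are already in UTC, since the trailing Z in Solr's format denotes UTC.

```python
from datetime import datetime

# Candidate input patterns, tried in order (illustrative; extend as needed).
CANDIDATE_FORMATS = ["%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y"]
SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_solr_date(text):
    """Return text in Solr's date format, or None if no pattern matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        # Assumption: the source timestamp is already UTC, so only the
        # layout changes and no time zone conversion is applied.
        return parsed.strftime(SOLR_FORMAT)
    return None


def parse_date_fields(doc, field_names=None):
    """Return a copy of doc with date-like string values rewritten.

    With field_names=None every string value is tried, similar in spirit
    to the default described above. Passing a set of names restricts the
    rewrite to those fields.
    """
    result = dict(doc)
    for name, value in doc.items():
        if field_names is not None and name not in field_names:
            continue
        if not isinstance(value, str):
            continue
        converted = to_solr_date(value)
        if converted is not None:
            result[name] = converted
    return result
```

For example, parse_date_fields({"id": "1", "created": "01/15/2014 03:45:10 PM"}, {"created"}) leaves id alone and rewrites created as 2014-01-15T15:45:10Z. Values that match none of the patterns are left unchanged.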



Re: Formatting dates

Posted by Manohar Kanuri <so...@kanuri.org>.
Thanks Erick,

I expected to hear the dreaded word "programming" at some point and I guess that point has arrived. Now that I know where and what to tinker with..... 

And I should have said 4.10 below, not 5.0.

On Sep 22, 2014, at 4:44 PM, Erick Erickson <er...@gmail.com> wrote:

> I think this'll help:
> 
> http://wiki.apache.org/solr/ScriptUpdateProcessor
> 
> Essentially, each time a document comes into Solr,
> this will get invoked on it. You'll have to do some
> fiddling to get it right: you have to remove the field from
> the doc, transform it, then put it back. None of this
> is hard, but it'll require a bit of programming. Fortunately
> not too much.....
> 
> Best,
> Erick


Re: Formatting dates

Posted by Erick Erickson <er...@gmail.com>.
I think this'll help:

http://wiki.apache.org/solr/ScriptUpdateProcessor

Essentially, each time a document comes into Solr,
this will get invoked on it. You'll have to do some
fiddling to get it right: you have to remove the field from
the doc, transform it, then put it back. None of this
is hard, but it'll require a bit of programming. Fortunately
not too much.....

Best,
Erick
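
The same take-the-field-out, transform, put-it-back step can also be done before the data ever reaches Solr. The following is a minimal Python sketch of that idea, not the script processor itself. It reads the CSV one row at a time, so a 650 MB file should not need to be split. The function names, the column names, and the file names in the usage note are hypothetical. It assumes the file has a header row and that the timestamps are already in UTC.

```python
import csv
from datetime import datetime

SOURCE_FORMAT = "%m/%d/%Y %I:%M:%S %p"   # e.g. 01/15/2014 03:45:10 PM
SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"       # e.g. 2014-01-15T15:45:10Z


def reformat_value(value):
    """Convert one timestamp; return unparseable values unchanged."""
    try:
        parsed = datetime.strptime(value.strip(), SOURCE_FORMAT)
    except ValueError:
        return value
    # Assumption: the timestamp is already UTC, so only the layout changes.
    return parsed.strftime(SOLR_FORMAT)


def reformat_csv(in_file, out_file, date_columns):
    """Stream rows from in_file to out_file, rewriting only date_columns."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for column in date_columns:
            # Skip missing or empty cells.
            if row.get(column):
                row[column] = reformat_value(row[column])
        writer.writerow(row)


def reformat_path(in_path, out_path, date_columns):
    """Convenience wrapper that opens the input and output files."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reformat_csv(src, dst, date_columns)
```

A call such as reformat_path("input.csv", "output.csv", ["created_date", "closed_date"]) would write a converted copy and leave the original file untouched. A row with more cells than the header makes the writer raise an error, so malformed rows surface early instead of being silently rewritten.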
