Posted to solr-user@lucene.apache.org by Pulkit Singhal <pu...@gmail.com> on 2011/09/15 22:54:09 UTC

Generating large datasets for Solr proof-of-concept

Hello Everyone,

I have a goal of populating Solr with a million unique products in
order to create a test environment for a proof of concept. I started
out by using DIH with Amazon RSS feeds but I've quickly realized that
there's no way I can glean a million products from one RSS feed. And
I'd go mad if I just sat at my computer all day looking for feeds and
punching them into DIH config for Solr.

Has anyone ever had to create large mock/dummy datasets for test
environments or for POCs/Demos to convince folks that Solr was the
wave of the future? Any tips would be greatly appreciated. I suppose
it sounds a lot like crawling even though it started out as innocent
DIH usage.

- Pulkit

Re: Generating large datasets for Solr proof-of-concept

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2011-09-15 at 22:54 +0200, Pulkit Singhal wrote:
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future?

Yes, but I did it badly. The problem is that real data are not random, so
any simple random string generator is likely to produce data where the
distribution of words has little in common with real-world data.


Zipf's law seems like the way to go:
https://secure.wikimedia.org/wikipedia/en/wiki/Zipf%27s_law

A little searching reveals things like
https://wiki.apache.org/pig/DataGeneratorHadoop
http://diveintodata.org/2009/09/zipf-distribution-generator-in-java/


Unfortunately, most non-techies will be confused by seeing computer-generated
words, so a combination of Zipf's law to calculate the word distribution and a
dictionary to provide the words themselves might be best.

That still leaves confusing computer-generated sentences if one wants larger
text fields in the index, but opting for something that generates text that
looks like real sentences collides with a proper distribution of the words.
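
For illustration only, here is a rough Java sketch of that Zipf-plus-dictionary
idea; the dictionary path and the skew exponent below are placeholder
assumptions, not anything discussed in this thread:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Random;

/** Emits dictionary words with a Zipf-like rank/frequency distribution. */
public class ZipfWordGenerator {

    private final List<String> words;   // ranked word list, most frequent first
    private final double[] cumulative;  // cumulative Zipf probability per rank
    private final Random random = new Random();

    public ZipfWordGenerator(String dictionaryPath, double skew) throws IOException {
        this.words = Files.readAllLines(Paths.get(dictionaryPath), StandardCharsets.UTF_8);
        this.cumulative = new double[words.size()];
        double norm = 0.0;
        for (int rank = 1; rank <= words.size(); rank++) {
            norm += 1.0 / Math.pow(rank, skew);          // normalisation constant
        }
        double running = 0.0;
        for (int rank = 1; rank <= words.size(); rank++) {
            running += (1.0 / Math.pow(rank, skew)) / norm;
            cumulative[rank - 1] = running;
        }
    }

    /** Low ranks (frequent words) come back far more often than high ranks. */
    public String nextWord() {
        double u = random.nextDouble();
        // a binary search over the cumulative array would be faster for big dictionaries
        for (int i = 0; i < cumulative.length; i++) {
            if (u <= cumulative[i]) {
                return words.get(i);
            }
        }
        return words.get(words.size() - 1);
    }

    public static void main(String[] args) throws IOException {
        // /usr/share/dict/words and skew 1.0 are just placeholders
        ZipfWordGenerator gen = new ZipfWordGenerator("/usr/share/dict/words", 1.0);
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            text.append(gen.nextWord()).append(' ');
        }
        System.out.println(text.toString().trim());
    }
}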


Re: Generating large datasets for Solr proof-of-concept

Posted by Daniel Skiles <da...@docfinity.com>.
I've done it using SolrJ and a *lot* of parallel processes feeding dummy
data into the server.
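
A rough, hedged sketch of that kind of feeder is below; the core URL, field
names, thread count and batch size are illustrative assumptions, and the exact
client class name differs between SolrJ releases:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

/** Feeds dummy product documents to Solr from several threads in parallel. */
public class DummyFeeder {

    public static void main(String[] args) throws Exception {
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        int threads = 8;
        int docsPerThread = 125_000;               // 8 x 125,000 = 1,000,000 docs
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    for (int i = 0; i < docsPerThread; i++) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", UUID.randomUUID().toString());
                        doc.addField("name", "Dummy product " + i);
                        doc.addField("price", Math.round(Math.random() * 10000) / 100.0);
                        batch.add(doc);
                        if (batch.size() == 1000) { // send in batches, not one by one
                            solr.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        solr.add(batch);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        solr.commit();
        solr.close();
    }
}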

On Thu, Sep 15, 2011 at 4:54 PM, Pulkit Singhal <pu...@gmail.com> wrote:

> Hello Everyone,
>
> I have a goal of populating Solr with a million unique products in
> order to create a test environment for a proof of concept. I started
> out by using DIH with Amazon RSS feeds but I've quickly realized that
> there's no way I can glean a million products from one RSS feed. And
> I'd go mad if I just sat at my computer all day looking for feeds and
> punching them into DIH config for Solr.
>
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future? Any tips would be greatly appreciated. I suppose
> it sounds a lot like crawling even though it started out as innocent
> DIH usage.
>
> - Pulkit
>

Re: Generating large datasets for Solr proof-of-concept

Posted by Lance Norskog <go...@gmail.com>.
http://aws.amazon.com/datasets

DBPedia might be the easiest to work with:
http://aws.amazon.com/datasets/2319

Amazon has a lot of these things.
Infochimps.com is a marketplace for free & pay versions.


Lance

On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal <pu...@gmail.com> wrote:

> Ah missing } doh!
>
> BTW I still welcome any ideas on how to build an e-commerce test base.
> It doesn't have to be Amazon, that was just my approach. Anyone?
>
> - Pulkit
>
> On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal <pu...@gmail.com>
> wrote:
> > Thanks for all the feedback thus far. Now to get a little technical about
> it :)
> >
> > I was thinking of putting all the Amazon tags that each yield close to
> > 50,000 results into a file and then running my RSS DIH off of that. I
> > came up with the following config, but something is amiss; can someone
> > please point out what is off about it?
> >
> >    <document>
> >        <entity name="amazonFeeds"
> >                processor="LineEntityProcessor"
> >                url="file:///xxx/yyy/zzz/amazonfeeds.txt"
> >                rootEntity="false"
> >                dataSource="myURIreader1"
> >                transformer="RegexTransformer,DateFormatTransformer"
> >                >
> >            <entity name="feed"
> >                    pk="link"
> >                    url="${amazonFeeds.rawLine"
> >                    processor="XPathEntityProcessor"
> >                    forEach="/rss/channel | /rss/channel/item"
> >
> >
> transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> > ...
> >
> > The rawLine should feed into the url key, but instead I get:
> >
> > Caused by: java.net.MalformedURLException: no protocol:
> > null${amazonFeeds.rawLine
> >        at
> org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2
> rollback
> > INFO: start rollback
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter
> rollback
> > SEVERE: Exception while solr rollback.
> >
> > Thanks in advance!
> >
> > On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
> > <ma...@openindex.io> wrote:
> >> If we want to test with huge amounts of data we feed portions of the
> internet.
> >> The problem is it takes a lot of bandwidth and lots of computing power to
> get
> >> to a `reasonable` size. On the positive side, you deal with real text so
> it's
> >> easier to tune for relevance.
> >>
> >> I think it's easier to create a simple XML generator with mock data,
> prices,
> >> popularity rates etc. It's fast to generate millions of mock products
> and once
> >> you have a large quantity of XML files, you can easily index, test,
> change
> >> config or schema and reindex.
> >>
> >> On the other hand, the sample data that comes with the Solr example is a
> good
> >> set as well as it proves the concepts well, especially with the stock
> Velocity
> >> templates.
> >>
> >> We know Solr will handle enormous sets but quantity is not always a part
> of a
> >> PoC.
> >>
> >>> Hello Everyone,
> >>>
> >>> I have a goal of populating Solr with a million unique products in
> >>> order to create a test environment for a proof of concept. I started
> >>> out by using DIH with Amazon RSS feeds but I've quickly realized that
> >>> there's no way I can glean a million products from one RSS feed. And
> >>> I'd go mad if I just sat at my computer all day looking for feeds and
> >>> punching them into DIH config for Solr.
> >>>
> >>> Has anyone ever had to create large mock/dummy datasets for test
> >>> environments or for POCs/Demos to convince folks that Solr was the
> >>> wave of the future? Any tips would be greatly appreciated. I suppose
> >>> it sounds a lot like crawling even though it started out as innocent
> >>> DIH usage.
> >>>
> >>> - Pulkit
> >>
> >
>



-- 
Lance Norskog
goksron@gmail.com

Re: Generating large datasets for Solr proof-of-concept

Posted by Pulkit Singhal <pu...@gmail.com>.
Ah missing } doh!
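
For anyone skimming the thread later, that was the whole problem: the url
attribute in the inner entity was missing its closing brace, so the placeholder
never resolved. The corrected line is simply:

                    url="${amazonFeeds.rawLine}"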

BTW I still welcome any ideas on how to build an e-commerce test base.
It doesn't have to be Amazon, that was just my approach. Anyone?

- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal <pu...@gmail.com> wrote:
> Thanks for all the feedback thus far. Now to get a little technical about it :)
>
> I was thinking of putting all the Amazon tags that each yield close to
> 50,000 results into a file and then running my RSS DIH off of that. I came
> up with the following config, but something is amiss; can someone please
> point out what is off about it?
>
>    <document>
>        <entity name="amazonFeeds"
>                processor="LineEntityProcessor"
>                url="file:///xxx/yyy/zzz/amazonfeeds.txt"
>                rootEntity="false"
>                dataSource="myURIreader1"
>                transformer="RegexTransformer,DateFormatTransformer"
>                >
>            <entity name="feed"
>                    pk="link"
>                    url="${amazonFeeds.rawLine"
>                    processor="XPathEntityProcessor"
>                    forEach="/rss/channel | /rss/channel/item"
>
> transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> ...
>
> The rawLine should feed into the url key, but instead I get:
>
> Caused by: java.net.MalformedURLException: no protocol:
> null${amazonFeeds.rawLine
>        at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> SEVERE: Exception while solr rollback.
>
> Thanks in advance!
>
> On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
>> If we want to test with huge amounts of data we feed portions of the internet.
>> The problem is it takes a lot of bandwidth and lots of computing power to get
>> to a `reasonable` size. On the positive side, you deal with real text so it's
>> easier to tune for relevance.
>>
>> I think it's easier to create a simple XML generator with mock data, prices,
>> popularity rates etc. It's fast to generate millions of mock products and once
>> you have a large quantity of XML files, you can easily index, test, change
>> config or schema and reindex.
>>
>> On the other hand, the sample data that comes with the Solr example is a good
>> set as well as it proves the concepts well, especially with the stock Velocity
>> templates.
>>
>> We know Solr will handle enormous sets but quantity is not always a part of a
>> PoC.
>>
>>> Hello Everyone,
>>>
>>> I have a goal of populating Solr with a million unique products in
>>> order to create a test environment for a proof of concept. I started
>>> out by using DIH with Amazon RSS feeds but I've quickly realized that
>>> there's no way I can glean a million products from one RSS feed. And
>>> I'd go mad if I just sat at my computer all day looking for feeds and
>>> punching them into DIH config for Solr.
>>>
>>> Has anyone ever had to create large mock/dummy datasets for test
>>> environments or for POCs/Demos to convince folks that Solr was the
>>> wave of the future? Any tips would be greatly appreciated. I suppose
>>> it sounds a lot like crawling even though it started out as innocent
>>> DIH usage.
>>>
>>> - Pulkit
>>
>

Re: Generating large datasets for Solr proof-of-concept

Posted by Pulkit Singhal <pu...@gmail.com>.
Thanks for all the feedback thus far. Now to get a little technical about it :)

I was thinking of putting all the Amazon tags that each yield close to
50,000 results into a file and then running my RSS DIH off of that. I came
up with the following config, but something is amiss; can someone please
point out what is off about it?

    <document>
        <entity name="amazonFeeds"
                processor="LineEntityProcessor"
                url="file:///xxx/yyy/zzz/amazonfeeds.txt"
                rootEntity="false"
                dataSource="myURIreader1"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <entity name="feed"
                    pk="link"
                    url="${amazonFeeds.rawLine"
                    processor="XPathEntityProcessor"
                    forEach="/rss/channel | /rss/channel/item"

transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
...

The rawLine should feed into the url key, but instead I get:

Caused by: java.net.MalformedURLException: no protocol:
null${amazonFeeds.rawLine
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.

Thanks in advance!

On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> If we want to test with huge amounts of data we feed portions of the internet.
> The problem is it takes a lot of bandwidth and lots of computing power to get
> to a `reasonable` size. On the positive side, you deal with real text so it's
> easier to tune for relevance.
>
> I think it's easier to create a simple XML generator with mock data, prices,
> popularity rates etc. It's fast to generate millions of mock products and once
> you have a large quantity of XML files, you can easily index, test, change
> config or schema and reindex.
>
> On the other hand, the sample data that comes with the Solr example is a good
> set as well as it proves the concepts well, especially with the stock Velocity
> templates.
>
> We know Solr will handle enormous sets but quantity is not always a part of a
> PoC.
>
>> Hello Everyone,
>>
>> I have a goal of populating Solr with a million unique products in
>> order to create a test environment for a proof of concept. I started
>> out by using DIH with Amazon RSS feeds but I've quickly realized that
>> there's no way I can glean a million products from one RSS feed. And
>> I'd go mad if I just sat at my computer all day looking for feeds and
>> punching them into DIH config for Solr.
>>
>> Has anyone ever had to create large mock/dummy datasets for test
>> environments or for POCs/Demos to convince folks that Solr was the
>> wave of the future? Any tips would be greatly appreciated. I suppose
>> it sounds a lot like crawling even though it started out as innocent
>> DIH usage.
>>
>> - Pulkit
>

Re: Generating large datasets for Solr proof-of-concept

Posted by Markus Jelsma <ma...@openindex.io>.
If we want to test with huge amounts of data we feed portions of the internet. 
The problem is it takes a lot of bandwidth and lots of computing power to get
to a `reasonable` size. On the positive side, you deal with real text so it's 
easier to tune for relevance.

I think it's easier to create a simple XML generator with mock data, prices, 
popularity rates etc. It's fast to generate millions of mock products and once 
you have a large quantity of XML files, you can easily index, test, change 
config or schema and reindex.

On the other hand, the sample data that comes with the Solr example is a good
set too, since it demonstrates the concepts well, especially with the stock
Velocity templates.

We know Solr will handle enormous sets but quantity is not always a part of a 
PoC.
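
To make the mock-XML-generator idea concrete, here is a rough sketch; the file
name, document count, field names and value ranges below are just illustrative
assumptions (the fields happen to line up loosely with the example schema):

import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.Random;

/** Writes mock product documents in Solr's XML update (add/doc) format. */
public class MockProductXmlGenerator {

    public static void main(String[] args)
            throws FileNotFoundException, UnsupportedEncodingException {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 1000000;
        Random random = new Random();
        try (PrintWriter out = new PrintWriter("mock-products.xml", "UTF-8")) {
            out.println("<add>");
            for (int i = 0; i < count; i++) {
                out.println("  <doc>");
                out.println("    <field name=\"id\">SKU-" + i + "</field>");
                out.println("    <field name=\"name\">Mock product " + i + "</field>");
                // price in the 0.01 - 999.00 range, popularity 0-9
                out.println("    <field name=\"price\">"
                        + (1 + random.nextInt(99900)) / 100.0 + "</field>");
                out.println("    <field name=\"popularity\">"
                        + random.nextInt(10) + "</field>");
                out.println("  </doc>");
            }
            out.println("</add>");
        }
    }
}

The resulting file can then be posted to the /update handler in the usual way,
e.g. with something like
curl 'http://localhost:8983/solr/update?commit=true' --data-binary @mock-products.xml -H 'Content-type: text/xml'
(or split into several smaller files first if one huge file is unwieldy).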

> Hello Everyone,
> 
> I have a goal of populating Solr with a million unique products in
> order to create a test environment for a proof of concept. I started
> out by using DIH with Amazon RSS feeds but I've quickly realized that
> there's no way I can glean a million products from one RSS feed. And
> I'd go mad if I just sat at my computer all day looking for feeds and
> punching them into DIH config for Solr.
> 
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/Demos to convince folks that Solr was the
> wave of the future? Any tips would be greatly appreciated. I suppose
> it sounds a lot like crawling even though it started out as innocent
> DIH usage.
> 
> - Pulkit

Re: Generating large datasets for Solr proof-of-concept

Posted by Pulkit Singhal <pu...@gmail.com>.
Thanks Hoss. I agree that the way you restated the question is better
for getting results. BTW I think you've tipped me off to exactly what
I needed with this URL: http://bbyopen.com/

Thanks!
- Pulkit

On Fri, Sep 16, 2011 at 4:35 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : Has anyone ever had to create large mock/dummy datasets for test
> : environments or for POCs/Demos to convince folks that Solr was the
> : wave of the future? Any tips would be greatly appreciated. I suppose
> : it sounds a lot like crawling even though it started out as innocent
> : DIH usage.
>
> the better question to ask is where you can find good sample data sets for
> building proof of concept implementations.
>
> If you want an example of product data, the best buy product catalog is
> available for developers using either an API or a bulk download of xml
> files...
>
>        http://bbyopen.com/
>
> ...last time i looked (~1 year ago) there were about 1 million products in
> the data dump.
>
>
> -Hoss
>

Re: Generating large datasets for Solr proof-of-concept

Posted by Chris Hostetter <ho...@fucit.org>.
: Has anyone ever had to create large mock/dummy datasets for test
: environments or for POCs/Demos to convince folks that Solr was the
: wave of the future? Any tips would be greatly appreciated. I suppose
: it sounds a lot like crawling even though it started out as innocent
: DIH usage.

the better question to ask is where you can find good sample data sets for 
building proof of concept implementations.

If you want an example of product data, the best buy product catalog is 
available for developers using either an API or a bulk download of xml 
files...

	http://bbyopen.com/

...last time i looked (~1 year ago) there were about 1 million products in 
the data dump.


-Hoss