Posted to general@lucene.apache.org by ntsrikanth <nt...@gmail.com> on 2011/09/15 10:46:36 UTC

Index creation from multiple data sources

Hi,

  I work for a travel company and I am doing a feasibility study with
Solr.
We have two different data sources: one for the accommodation
details (MySQL) and another for the availability (CSV). We need to merge
them so that we get one record containing data from each
source. For example, the name, description and facilities of an accommodation
from MySQL and the price details from the CSV file need to be merged to
create a single record in Solr.

I searched the wiki, the Solr book and the forums but couldn't find an answer.
Has anyone got a similar setup, and if so, how did you design it?


Thanks,
Srikanth


--
View this message in context: http://lucene.472066.n3.nabble.com/Index-creation-from-multiple-data-sources-tp3338344p3338344.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: Index creation from multiple data sources

Posted by ntsrikanth <nt...@gmail.com>.
Hi Ted,

After a little struggle I figured out a way of joining XML files with the
database, but for some reason it is not working. After the import, only the
content from the XML is present in my index. The MySQL contents are missing.

To debug, I replaced the parametrized query with a simple select statement
and it worked well. As a next step, I purposely introduced a syntax error in
the SQL and tried again. This time the import failed as expected, and the
log file showed the query with the substituted values.

What I found interesting is that all the values, e.g. brochure_id, are
substituted into the query enclosed in square brackets. For example:
*and brochure_id = '[55]'*

I have the following in the schema.xml
    
    613    
    614    


And my data configuration:


dataconfig.xml
----------------------
<?xml version="1.0" encoding="UTF-8"?>

     

    
    
        
            
               
               
               
               

               

                    
            
        
    



Any idea why I am getting this weird substitution ?

Thanks,
Srikanth


--
View this message in context: http://lucene.472066.n3.nabble.com/Index-creation-from-multiple-data-sources-tp3338344p3348603.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: Index creation from multiple data sources

Posted by Ted Dunning <te...@gmail.com>.
This is a common need and there are a number of solutions.

One common method for joining large semi-structured data sets is to use
map-reduce, for example with Apache Hadoop.

However, since you already have one side of your data in MySQL, you should
test whether simply scanning the CSV data while querying MySQL is fast
enough for you.  This is very likely to be sufficient if your MySQL data is
small enough to fit into memory.  If the CSV data is small enough that large
portions of it fit in memory, then sorting those portions so that the MySQL
database can be scanned in order may improve your performance enormously.
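
A minimal sketch of the in-memory case described above, assuming the
accommodation rows have already been fetched from MySQL into a list of
dictionaries. The join key brochure_id is taken from the follow-up message;
the other field names and the sample values are hypothetical.

```python
import csv
import io


def hash_join(accommodations, csv_file, key="brochure_id"):
    """Yield one merged record per CSV row that has a matching accommodation."""
    # Build an in-memory lookup table from the (small) database side.
    # Keys are stringified because CSV values are always read as strings.
    by_key = {str(row[key]): row for row in accommodations}
    # Stream the CSV side one row at a time.
    for price_row in csv.DictReader(csv_file):
        match = by_key.get(price_row[key])
        if match is None:
            continue  # no accommodation for this price row, so skip it
        # Database values win if both sides define the same field.
        yield {**price_row, **match}


# Stand-in for rows fetched from MySQL (hypothetical sample data).
accommodations = [
    {"brochure_id": 55, "name": "Sea View Hotel", "facilities": "pool"},
    {"brochure_id": 56, "name": "Hill Lodge", "facilities": "sauna"},
]
# Stand-in for the availability CSV file (hypothetical sample data).
prices_csv = io.StringIO("brochure_id,price\n55,120.00\n57,80.00\n")

joined = list(hash_join(accommodations, prices_csv))
```

Only the lookup table has to fit in memory here; the CSV file is read row by
row, so its size does not matter much.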

There is a considerable amount of tuning that you can do to the join process
to make it work well.

Sometimes simply dumping data to an external sort/merge works just as well
as anything.
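
As a rough illustration of the sort/merge idea, the sketch below sorts both
sides by the join key and then walks them in a single pass. It assumes the
accommodation side has unique keys and, for brevity, sorts in memory; a real
external sort would sort the dumped files on disk first. The field names and
sample values are hypothetical.

```python
def merge_join(left, right, key):
    """Join two lists of dicts that are both sorted by `key`.

    Assumes keys are unique on the left side; the right side may repeat them.
    """
    out = []
    i = 0
    for r in right:
        # Advance the left cursor until it reaches or passes the right key.
        while i < len(left) and left[i][key] < r[key]:
            i += 1
        if i < len(left) and left[i][key] == r[key]:
            out.append({**left[i], **r})
    return out


# Hypothetical sample data, deliberately unsorted.
accommodations = [
    {"brochure_id": 56, "name": "Hill Lodge"},
    {"brochure_id": 55, "name": "Sea View Hotel"},
]
prices = [
    {"brochure_id": 55, "price": 120.0},
    {"brochure_id": 57, "price": 80.0},
    {"brochure_id": 55, "price": 95.0},
]

by_id = lambda row: row["brochure_id"]
merged = merge_join(sorted(accommodations, key=by_id),
                    sorted(prices, key=by_id),
                    "brochure_id")
```

Because each side is scanned exactly once after sorting, the merge step
itself needs very little memory.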

Regardless of how you do it, the result should be a bunch of joined records.
 From there, you just do the normal Lucene thing to index them.
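
For Solr specifically, one way to hand over the joined records is the XML
update message format (an add element containing doc and field elements).
The sketch below only builds that message; the field names are hypothetical
and would have to match your schema.xml, and posting the message to your
Solr update handler is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Serialize joined records as a Solr <add> update message."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # One <field name="..."> element per record entry.
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical joined record.
xml_message = to_solr_add_xml(
    [{"id": "55", "name": "Sea View Hotel", "price": "120.00"}]
)
```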

The reason that the Lucene books don't talk about this is that there are
a gazillion different places data can come from and the exact method for
joining the data will vary.  Once joined, the references you mention will
help you.
