Posted to solr-user@lucene.apache.org by Francisco Fernandez <fr...@gmail.com> on 2013/09/27 12:28:13 UTC
Pubmed XML indexing
Hi, I'm a newbie trying to index PubMed texts obtained as XML with a structure similar to:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=23864173,22073418
The nodes I need to extract, expressed as XPaths, are:
//PubmedArticle/MedlineCitation/PMID
//PubmedArticle/MedlineCitation/DateCreated/Year
//PubmedArticle/MedlineCitation/Article/ArticleTitle
//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText
//PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading
I think one way to index them in Solr is to create another XML structure similar to:
<add>
<doc>
<field name="id">PMID</field>
<field name="year_i">Year</field>
<field name="name">ArticleTitle</field>
<field name="abstract_s">AbstractText</field>
<field name="cat">MeshHeading1</field>
<field name="cat">MeshHeading2</field>
</doc>
</add>
Here "PMID" = '23864173', "ArticleTitle" = 'Cost-effectiveness of low-molecular-weight heparin compared with aspirin for prophylaxis against venous thromboembolism after total joint arthroplasty', and so on.
With that structure, I would post the files to Solr by running the following command in the documents folder:
java -jar post.jar *.xml
I'm wondering whether there is a more direct way to perform the same task that does not involve an 'iterate->parse->restructure->write to disk->post' cycle.
Many thanks
Francisco
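Editor's sketch: the parse, restructure and post steps described above can be done in memory, with no intermediate files on disk. The following is a minimal, non-authoritative example using only the Python standard library. The function names, the default update URL and the choice of DescriptorName as the text of each MeshHeading are assumptions made for illustration; the field names are the ones proposed in the message above.

```python
import urllib.request
import xml.etree.ElementTree as ET


def text_of(node):
    """Return all text under a node (including inline markup), or '' if missing."""
    return "".join(node.itertext()).strip() if node is not None else ""


def pubmed_to_solr_add(pubmed_xml):
    """Convert efetch PubMed XML (str or bytes) into a Solr <add> message (bytes)."""
    root = ET.fromstring(pubmed_xml)
    add = ET.Element("add")
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        doc = ET.SubElement(add, "doc")

        def field(name, value):
            # Skip empty values so Solr does not receive blank fields.
            if value:
                ET.SubElement(doc, "field", {"name": name}).text = value

        field("id", text_of(citation.find("PMID")))
        field("year_i", text_of(citation.find("DateCreated/Year")))
        field("name", text_of(citation.find("Article/ArticleTitle")))
        # An abstract may have several AbstractText sections; join them.
        sections = citation.findall("Article/Abstract/AbstractText")
        field("abstract_s", " ".join(text_of(s) for s in sections))
        # Assumption: index each heading's DescriptorName as one "cat" value.
        for heading in citation.findall("MeshHeadingList/MeshHeading"):
            field("cat", text_of(heading.find("DescriptorName")))
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(payload,
                 update_url="http://localhost:8983/solr/update?commit=true"):
    """POST an <add> message to Solr. The default URL is a hypothetical local instance."""
    request = urllib.request.Request(
        update_url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Given the efetch response as a string or bytes in xml_text, a call such as post_to_solr(pubmed_to_solr_add(xml_text)) would send all articles in one request. Splitting the work into a pure conversion function and a separate posting function keeps the conversion easy to check without a running Solr.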
Re: Pubmed XML indexing
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Did you look at DataImportHandler? There is also Flume, I think.
Regards,
Alex
Re: Pubmed XML indexing
Posted by Francisco Fernandez <fr...@gmail.com>.
Many thanks to both Mike and Alexandre.
I'll take a look at those tools.
Lux seems like a good option.
Thanks again,
Francisco
On 27/09/2013, at 09:33, Michael Sokolov wrote:
> You might be interested in Lux (http://luxdb.org), which is designed for indexing and querying XML using Solr and Lucene. It can run index-supported XPath/XQuery over your documents, and you can define arbitrary XPath indexes.
>
> -Mike
Re: Pubmed XML indexing
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
You might be interested in Lux (http://luxdb.org), which is designed for
indexing and querying XML using Solr and Lucene. It can run
index-supported XPath/XQuery over your documents, and you can define
arbitrary XPath indexes.
-Mike