Posted to solr-user@lucene.apache.org by adiyaksa kevin <ad...@gmail.com> on 2018/10/30 04:00:13 UTC

Indexing PDF file in Apache SOLR via Apache TIKA

Hello there, let me introduce myself. My name is Mohammad Kevin Putra (you
can call me Kevin), from Indonesia. I am a beginner backend developer. I
use Linux Mint, Apache SOLR 7.5.0 and Apache TIKA 1.91.0.

I have a small problem with how to index a PDF file via Apache TIKA. I
understand how SOLR and TIKA each work, but I don't know how the two are
integrated.
As far as I know, TIKA can extract the PDF file I upload and parse it
into data/metadata automatically, and I then just have to copy and paste
that into the "Documents" tab of the SOLR core.
My questions are:
1. Can I upload a PDF file to SOLR via TIKA from the GUI, or is it only
possible from the CLI? If only from the CLI, can you please explain how?
2. Is it possible to show a text result in the "Query" tab?

The background to my question: I want to index PDFs on my local system by
simply uploading them to SOLR, "drag & drop" style (is that possible?).
Then, when I type something into the search box, the result should look
like this:
(Title of doc)
blablablabla (matched text highlighted in yellow) blablabla.
The blablabla text is a couple of sentences. That's all I need.
Sorry for my bad English.
Thanks for reading and replying; it will be very helpful to me.
Thanks a lot

RE: Indexing PDF file in Apache SOLR via Apache TIKA

Posted by Phil Scadden <P....@gns.cri.nz>.

I will second the SolrJ method. You don’t want to be doing this extraction on your SOLR instance. One question is whether your PDFs are scanned or already searchable. I use tesseract offline to convert all scanned PDFs into searchable PDFs, so I don’t want Tika to be doing OCR. The core of my code is:
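    // Imports needed by this fragment:
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // Fragment from inside a method that returns boolean. filename, title,
    // author, idString, access, datasource, year, opfyear, password and
    // solr (a SolrClient) are defined elsewhere.
    File f = new File(filename);
    ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    // Note: this matches any file name that contains "pdf".
    if (filename.toLowerCase().contains("pdf")) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false);
        // Remove this line (in fact, remove the whole PDFParserConfig block)
        // if you want Tika to OCR.
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(Parser.class, parser);
    }
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream(f)) {
        parser.parse(input, textHandler, metadata, context);
    } catch (Exception e) {
        e.printStackTrace();
        return false;
    }
    SolrInputDocument up = new SolrInputDocument();
    // Fall back to the PDF metadata when no title/author was supplied
    if (title == null) title = metadata.get("title");
    if (author == null) author = metadata.get("author");
    up.addField("id", f.getCanonicalPath()); // load up whatever fields you are using
    up.addField("location", idString);
    up.addField("access", access);
    up.addField("datasource", datasource);
    up.addField("title", title);
    up.addField("author", author);
    if (year > 0) up.addField("year", year);
    if (opfyear > 0) up.addField("opfyear", opfyear);
    String content = textHandler.toString();
    up.addField("_text_", content);
    UpdateRequest req = new UpdateRequest();
    req.add(up);
    req.setBasicAuthCredentials("solrAdmin", password);
    req.process(solr, "prindex");
    req.commit(solr, "prindex");
    return true;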
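
For the highlighted-snippet part of the original question, a minimal sketch of a matching query is below. It only builds the request URL, using Solr's standard highlighting parameters (hl, hl.fl, hl.snippets, hl.fragsize) against the `_text_` field and the `prindex` collection from the code above. The base URL, the search terms, the snippet count and fragment size, and the class and method names are assumptions for illustration. Highlighting also assumes that `_text_` is a stored field in the schema, which the thread does not confirm, and the sketch does not handle the basic authentication used above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class HighlightQuery {

    // Builds a /select URL that asks Solr for highlighted snippets.
    // Assumes the searched field (_text_) is stored; otherwise Solr
    // has no text to build snippets from.
    public static URI build(String solrBase, String collection, String terms) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", terms);            // the user's search terms
        params.put("df", "_text_");        // default field to search
        params.put("fl", "id,title");      // fields returned per document
        params.put("hl", "true");          // turn highlighting on
        params.put("hl.fl", "_text_");     // field to take snippets from
        params.put("hl.snippets", "2");    // assumed: at most two snippets per document
        params.put("hl.fragsize", "200");  // assumed: about 200 characters per snippet
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return URI.create(solrBase + "/" + collection + "/select?" + query);
    }

    public static void main(String[] args) {
        // Hypothetical base URL and search terms.
        System.out.println(build("http://localhost:8983/solr", "prindex", "geothermal energy"));
    }
}
```

The response then carries a separate "highlighting" section keyed by document id, with the matched terms wrapped in <em> tags by default; a front end can style those tags yellow.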


Re: Indexing PDF file in Apache SOLR via Apache TIKA

Posted by ☼ R Nair <ra...@gmail.com>.

I have a production implementation of this that has been running for the
last four months without any issue, with just a weekly restart of all
components.

http://blog.cloudera.com/blog/2015/10/how-to-index-scanned-pdfs-at-scale-using-fewer-than-50-lines-of-code/


Best, Ravion

Re: Indexing PDF file in Apache SOLR via Apache TIKA

Posted by Erick Erickson <er...@gmail.com>.

All of the above work, but for robust production situations you'll
want to consider a SolrJ client, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog
combines indexing from a DB and using Tika, but those are independent.

Best,
Erick

Re: Indexing PDF file in Apache SOLR via Apache TIKA

Posted by Kamuela Lau <ka...@gmail.com>.

Hi there,

Here are a couple of ways I'm aware of:

1. Extract-handler / post tool
You can use the curl command with the extract handler or bin/post to upload
a single document.
Reference:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html

2. DataImportHandler
This could be used for, say, uploading multiple documents with Tika.
Reference:
https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor

You should also be able to do it via the admin page, so long as you define
and modify the extract handler in solrconfig.xml.
Reference:
https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-upload
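
As a rough sketch of the first option, the same request that curl would send to the extract handler can be made from Java with the JDK's built-in HTTP client (Java 11 or later). This assumes the /update/extract handler is already defined in solrconfig.xml. The base URL, the core name "mycore", the document id and the class and method names are all placeholders for illustration, not anything from a real setup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class ExtractUpload {

    // Builds the URL for the extract handler. literal.id supplies the
    // document's unique key; commit=true commits right after the upload.
    public static URI extractUri(String solrBase, String core, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(solrBase + "/" + core
                + "/update/extract?literal.id=" + id + "&commit=true");
    }

    // Sends the raw PDF bytes as the request body; Solr passes them to
    // Tika for parsing. Returns the HTTP status code.
    public static int upload(URI uri, Path pdf) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical values: adjust the base URL, core name and id.
        URI uri = extractUri("http://localhost:8983/solr", "mycore", "doc 1");
        System.out.println(uri);
        if (args.length > 0) {
            // Only contacts Solr when a PDF path is given on the command line.
            System.out.println("HTTP " + upload(uri, Path.of(args[0])));
        }
    }
}
```

Committing on every upload is convenient for a single test document; for many files it is usually better to leave commit=true off and commit once at the end.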

Hope this helps!
