Posted to user@tika.apache.org by Kamesh Joshi <ka...@gmail.com> on 2017/01/05 07:08:39 UTC

Fwd: Tika not parsing underlines

I am trying to parse the attached PDF, but the output does not show
where the underlines are; it just returns plain text. How can I also get
the underlines present in the PDF, or find some other way to split the
text based on them?

I am running the following on the command line:

  curl -T Downloads/kameshjoshi.pdf http://localhost:9998/tika --header "Accept: text/plain"

RE: Fwd: Tika not parsing underlines

Posted by "Allison, Timothy B." <ta...@mitre.org>.
+1 to John's feedback

Another option, if you want to get into the weeds, is to write your own subclass of PDFBox's PDFTextStripper, override its methods, and use the TextPosition objects (x/y coordinates on the page) to do your own custom zoning. This will be specific to your application and document stream, though.
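
As a rough illustration of the zoning idea, here is a minimal Python sketch of the grouping step only. It does not use PDFBox: TextRun is a hypothetical stand-in for whatever a PDFTextStripper subclass would collect from TextPosition, the coordinates are made up, and the y positions of the separator lines are assumed to be already known.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical stand-in for text plus the x/y position a stripper reports."""
    text: str
    x: float
    y: float  # distance from the top of the page, in points (assumed)


def zone_by_rules(runs, rule_ys):
    """Group text runs into zones separated by horizontal rules.

    rule_ys holds the y coordinate of each rule; finding those values is
    out of scope here and they are assumed to be known.
    """
    boundaries = sorted(rule_ys)
    zones = [[] for _ in range(len(boundaries) + 1)]
    # Reading order: top to bottom, then left to right.
    for run in sorted(runs, key=lambda r: (r.y, r.x)):
        # Zone index = number of rules lying above this run.
        index = sum(1 for b in boundaries if b < run.y)
        zones[index].append(run.text)
    return [" ".join(zone) for zone in zones]


# Made-up positions, deliberately out of order.
runs = [
    TextRun("Experience", 50, 140),
    TextRun("kameshpjoshi@gmail.com", 50, 100),
    TextRun("Skills", 50, 300),
    TextRun("Developer", 50, 160),
]
zones = zone_by_rules(runs, rule_ys=[120, 280])
print(zones)
# ['kameshpjoshi@gmail.com', 'Experience Developer', 'Skills']
```

Each zone could then be handled as a separate section of the document. In a real stripper subclass the direction of the y axis and the units would need checking against the PDF in question.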

Re: Fwd: Tika not parsing underlines

Posted by John Patrick <nh...@gmail.com>.
Okay, so I'm looking at the right part of the PDF. As I said previously,
those visual elements might have started life as underscores, but in
the PDF they are some form of image, so I would not expect them to be
returned when you ask for text.

So in my view the Tika server's text/plain output is working correctly.

Are you able to go back to the original, change the images back to
underscores without letting your word processor make them look pretty,
and then save it as a PDF?

You could potentially write your own PDF parser, or extend an existing
one, and work out how those images are represented in the PDF. But this
can be done in multiple ways: the images might actually be a
background image, they might be images with absolute page coordinates
given, or they might be embedded in the right location. Depending on the
PDF version and what extra metadata was put into the PDF, you might be
able to write code to correctly detect the image and replace it with
underscores.

I've done PDF processing several times with Tika, and the source PDF
can be your biggest issue, as there are several ways of doing the same
thing and several versions of the PDF spec.
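
To make the "detect the image and replace it with underscores" step concrete, here is a small Python sketch of one possible heuristic. Everything in it is an assumption for illustration: the Box type, the thresholds, and the sample coordinates are hypothetical, and extracting the bounding boxes from the PDF is not shown.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box of a graphic on the page, in points."""
    x: float
    y: float
    width: float
    height: float


def looks_like_rule(box, page_width, min_span=0.5, max_height=3.0):
    """Guess whether a graphic is a horizontal separator line.

    Both thresholds are illustrative assumptions, not values from any spec.
    """
    return box.width >= min_span * page_width and box.height <= max_height


def rule_as_text(box, page_width, columns=40):
    """Render a rule as a run of underscores scaled to a text width."""
    count = max(1, round(columns * box.width / page_width))
    return "_" * count


page_width = 612.0  # US Letter width in points
separator = Box(x=36, y=120, width=540, height=1)  # wide and thin
logo = Box(x=36, y=20, width=80, height=80)        # square, not a rule

print(looks_like_rule(separator, page_width))  # True
print(looks_like_rule(logo, page_width))       # False
print(rule_as_text(separator, page_width))     # 35 underscores
```

A wide, thin graphic is treated as a separator and anything else is left alone. Real documents would likely need the thresholds tuned, since the same visual line can be stored in several different ways.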


Re: Fwd: Tika not parsing underlines

Posted by Kamesh Joshi <ka...@gmail.com>.
The breaks I am trying to parse are the lines that appear before
*Experience* or *Skills & Expertise* (in the attached PDF), but there is no
indication of these lines when I parse the PDF through Tika.

Re: Fwd: Tika not parsing underlines

Posted by John Patrick <nh...@gmail.com>.
When you say underline, are you talking about the visual breaks, like
the one between "kameshpjoshi@gmail.com" and "Experience"? How were they
created?

Is it because they are images in the PDF, not text?

I downloaded the PDF and opened it on my Mac. I tried searching for _
and - and only found 4 matches for -.

Personally I would say Tika is returning what I would expect it to
return, if the visual breaks mentioned in my opening sentence are
what you mean by underscores, i.e. _ not hyphen -.

If you mean something else by underscores, are you able to identify
where in the PDF you're talking about?

Cheers,
John


Re: Fwd: Tika not parsing underlines

Posted by Kamesh Joshi <ka...@gmail.com>.
I already tried that, but it does not give me any indication of the
underlines in the text; it just gives me the text data in <p></p>
tags.

Re: Fwd: Tika not parsing underlines

Posted by Nick Burch <ap...@gagravarr.org>.
You need to ask Tika to give you the HTML version to be able to spot
markup like underlines. Swap that Accept header to text/html and you
should then be able to see them.

Nick
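
For anyone scripting this rather than using curl, the change amounts to sending the same PUT request with a different Accept header. The Python sketch below only builds the request so it can be inspected without a server; build_tika_request is a hypothetical helper name, the URL is the one from the original post, and the bytes are a placeholder rather than a real PDF.

```python
import urllib.request


def build_tika_request(pdf_bytes, accept="text/html",
                       url="http://localhost:9998/tika"):
    """Build (but do not send) a PUT request similar to curl -T."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": accept},
    )


# Placeholder bytes; in practice read the PDF file in binary mode.
request = build_tika_request(b"%PDF-1.4 placeholder bytes")
print(request.get_method(), request.full_url, request.get_header("Accept"))
# PUT http://localhost:9998/tika text/html

# Sending it requires a Tika server listening on that URL:
# with urllib.request.urlopen(request) as response:
#     html = response.read().decode("utf-8")
```

Passing accept="text/plain" reproduces the original request. As reported elsewhere in this thread, the HTML output for this particular PDF still contained only <p></p> paragraphs, so the header change alone may not expose the separator lines.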