Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2022/06/29 10:20:28 UTC

Re: regarding the data bank of test PDF files (pdfs_202011) . . .

Hi Albretch,
  Thank you for the pointer.  The PDFs that I packaged there were gathered
from various bug trackers [1] and [2].  I'm not necessarily against
gathering the Regents exams, but they would serve a different purpose.

        Best,

               Tim


[1] https://www.pdfa.org/a-new-stressful-pdf-corpus/
[2] https://www.pdfa.org/stressful-pdf-corpus-grows/

On Tue, Jun 28, 2022 at 8:21 PM Albretch Mueller <lb...@gmail.com> wrote:

>  kept at: https://corpora.tika.apache.org/base/packaged/pdfs/pdfs_202011/
>
>  I think copies of the archived NYS Regents exams:
>
>  https://www.nysl.nysed.gov/regentsexams.htm
>
>  then click on the link in the one liner: "Browse all available Regents
> Exams"
>
>
> https://nysl.ptfs.com/knowvation/app/consolidatedSearch/#search/v=list,c=1,q=qs%3D%5B*%5D%2Cfacet-fields%3D%5Bbrowse1_ss%3A%22All%20Government%20Collections%22%3E%3Ebrowse2_ss%3A%22New%20York%20State%20Government%20Documents%22%3E%3Ebrowse3_ss%3A%22Education%20Department%22%3E%3Ebrowse4_ss%3A%22Office%20of%20Elementary%2C%20Middle%2C%20Secondary%20and%20Continuing%20Education%22%3E%3Ebrowse5_ss%3A%22Office%20of%20Standards%2C%20Assessment%20and%20Reporting%22%3E%3Ebrowse6_ss%3A%22Regents%20high%20school%20examinations%22%5D%2CqueryType%3D%5B16%5D,sm=s,b=t,bs=ALPH%3AASC,sb=1%3Atitle%3AASC,l=library1_lib
>
>  and, why not, more recent versions of the Regents exams
> (nysedregents.org) should be included as well.
> Legally, they are public domain.
>
>  As part of my own research I am interested in corpora of
> multi-encoded texts containing not only "natural language", but also
> formulas, graphs, descriptive pictures, structural formulas of
> chemical compounds, ... The nysl ptfs site obfuscates links behind a
> JavaScript wall. I think those links or the links' content should be
> more descriptive. Who else would like to work on that?
>
>  lbrtchx
>