Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2020/06/02 19:57:44 UTC

New file vm WAS: Re: Release 2.0.20 ?

>proper domain for https access

I just pinged infra on slack.

If they're able to do it, what would we want?

file-corpora.apache.org
corpora.apache.org
corpora-pdfbox.apache.org
corpora-tika.apache.org

Something else?  I'm also happy to buy a domain if that won't work.  There
are a couple available that are close enough.

On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

>
> > AMD ryzen looks fantastic.  Others would be great as well.
> >
> > If ubuntu is possible at all, that's what I've been working with most
> > recently.
>
> OK - will set up with that distro
>
> >
> > Other than that, ssh access and sudo privileges would be all I'd need.
> >
> > Are you ok if we set up apache httpd to host files for the public or will
> > this be a community only resource?
>
> It can be used for whatever we want, so if you consider public file
> sharing useful, of course we can do that. It would be good if we got a
> proper domain for https access. Would that be something infra can do?
>
> >
> > If this is corporate sponsored, please let me know how/if we should
> > mention the sponsorship.
>
> no need to mention it - happy to help.
>
> >
> > Again...wow.  Thank you!
> >
> > Best,
> >
> >       Tim
> >
> > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sa...@fileaffairs.de>
> > wrote:
> >
> > > Could fund either:
> > >
> > > AMD Ryzen 5 3600
> > > 64 GB RAM
> > > 2x2TB
> > >
> > > or
> > >
> > > AMD Ryzen 7 3700X based Server
> > > 64 GB RAM
> > > 2x8TB
> > >
> > > or
> > > Intel® Core™ i9-9900K
> > > 64 GB RAM
> > > 2x8TB
> > >
> > > All are root servers so one has to vote for taking care of them
> > > (I can do the initial setup).
> > >
> > >
> > >
> > > BR
> > > Maruan
> > >
> > > > There are two use cases.
> > > >
> > > > 1) host shared data so that we can all point to and work from the
> > > > same data, ideally both literal docs and also extracts
> > > > (text/metadata .json files representing extracted information).
> > > >
> > > > 2) a modest vm to allow all of us to run the regression tests
> > > >
> > > > We could use help with either or both.
> > > >
> > > > What we had before:
> > > > 8 GB RAM
> > > > 8 cores
> > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > >
> > > > We can always use more RAM and more cores up to the point of I/O
> > > > bottlenecks.
> > > >
> > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahyoun@fileaffairs.de>
> > > > wrote:
> > > >
> > > > > Is that a storage box only, or does it need to do some computing
> > > > > too?
> > > > >
> > > > > Maybe you could write a small spec for the server requirement?
> > > > >
> > > > > BR
> > > > > Maruan
> > > > >
> > > > >
> > > > > > Still haven’t had time to put the server in a DMZ. Ugh.
> > > > > >
> > > > > > Yes, more than happy to share.
> > > > > >
> > > > > > If anyone has recommendations for file hosting for a couple of TB,
> > > > > > let me know.
> > > > > >
> > > > > > One option would be to work with CommonCrawl to bump the max file
> > > > > > size one crawl a year...
> > > > > >
> > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <THausherr@t-online.de>
> > > > > > wrote:
> > > > > >
> > > > > > > Can we / I access these files? Most differences are improvements
> > > > > > > or not meaningful, but there are a few I'd like to have a look
> > > > > > > at, e.g.
> > > > > > >
> > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > > >
> > > > > > > the word "antrag" loses the first "a". Although maybe the "a" was
> > > > > > > a big one and gets assigned to another line.
> > > > > > >
> > > > > > > Tilman
> > > > > > >
> > > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
> > > > > > > > > > Reports are available here:
> > >
> https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > > > > > Looks like there are trivial differences in content with a
> > > > > > > > slight improvement over 2.0.19.  I don't see any differences in
> > > > > > > > exceptions or attachments.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > >          Tim
> > > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> > > > > > > For additional commands, e-mail: dev-help@pdfbox.apache.org
> > > > > > >
> > > > > > >
> > > --
> > > Maruan Sahyoun
> > >
> > > FileAffairs GmbH
> > > Josef-Schappe-Straße 21
> > > 40882 Ratingen
> > >
> > > Tel: +49 (2102) 89497 88
> > > Fax: +49 (2102) 89497 91
> > > sahyoun@fileaffairs.de
> > > www.fileaffairs.de
> > >
> > > Geschäftsführer: Maruan Sahyoun
> > > Handelsregister: AG Düsseldorf, HRB 53837
> > > UST.-ID: DE248275827
> > >
> > >
>
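
Two details from the quoted messages can be made concrete with a small Python sketch. It is illustrative only and rests on assumptions the thread does not confirm: that the two disks in each offer would either be pooled (capacities simply added) or mirrored (RAID 1, one disk's worth usable), and that corpus files sit in a directory named after the first two characters of the file name, as the example path Tilman cites suggests. The names usable_tb, covers_previous_layout and shard_path are hypothetical, and filesystem overhead is ignored.

```python
from pathlib import PurePosixPath

# Figures quoted in the thread.
PREVIOUS_LAYOUT_TB = {"docs": 2, "extracts": 1, "staging": 1}
OFFERED_DISKS_TB = {
    "AMD Ryzen 5 3600": (2, 2),
    "AMD Ryzen 7 3700X": (8, 8),
    "Intel Core i9-9900K": (8, 8),
}


def usable_tb(disks, mirrored):
    """Usable capacity in TB.

    Assumption: mirrored means RAID 1 (smallest disk is usable),
    otherwise the disk capacities are simply added.
    """
    return min(disks) if mirrored else sum(disks)


def covers_previous_layout(disks, mirrored):
    """True if the usable capacity is at least the old 4 TB allocation."""
    return usable_tb(disks, mirrored) >= sum(PREVIOUS_LAYOUT_TB.values())


def shard_path(root, name):
    """Assumed layout: <root>/<first two characters of name>/<name>."""
    return str(PurePosixPath(root) / name[:2] / name)


if __name__ == "__main__":
    for server, disks in OFFERED_DISKS_TB.items():
        print(server,
              "pooled:", covers_previous_layout(disks, mirrored=False),
              "mirrored:", covers_previous_layout(disks, mirrored=True))
    print(shard_path("commoncrawl3/commoncrawl3",
                     "XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T"))
```

Under these assumptions, the 2x2TB option only matches the old 4 TB allocation when the disks are not mirrored, while either 2x8TB option covers it both ways.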

Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Tim Allison <ta...@apache.org>.
>I'd go for corpora.tika.apache.org too.

Infra ticket updated.  Thank you, all!

On Wed, Jun 3, 2020 at 2:07 AM Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

>
> > On 02.06.20 at 23:29, Tim Allison wrote:
> > > https://issues.apache.org/jira/browse/INFRA-20372
> > >
> > > On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
> > > happy with corpora.pdfbox.apache.org or anything else.  Please let us
> > > know what you think over on that ticket.
> > IMHO it should be either corpora.pdfbox.apache.org or
> > corpora.tika.apache.org. I would prefer the latter, as Tika is the tool
> > that is mainly used here.
>
> I'd go for corpora.tika.apache.org too.
> BR
> Maruan
>
> >
> > Andreas
> >
> > > Thank you, again!
> > >
> > > Cheers,
> > >
> > >               Tim
> > >

Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
 
> On 02.06.20 at 23:29, Tim Allison wrote:
> > https://issues.apache.org/jira/browse/INFRA-20372
> > 
> > On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
> > happy with corpora.pdfbox.apache.org or anything else.  Please let us know
> > what you think over on that ticket.
> IMHO it should be either corpora.pdfbox.apache.org or corpora.tika.apache.org. I
> would prefer the latter, as Tika is the tool that is mainly used here.

I'd go for corpora.tika.apache.org too.
BR
Maruan

> 
> Andreas
> 
> > Thank you, again!
> > 
> > Cheers,
> > 
> >               Tim
> > 


Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 02.06.20 at 23:29, Tim Allison wrote:
> https://issues.apache.org/jira/browse/INFRA-20372
> 
> On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
> happy with corpora.pdfbox.apache.org or anything else.  Please let us know
> what you think over on that ticket.
IMHO it should be either corpora.pdfbox.apache.org or corpora.tika.apache.org. I
would prefer the latter, as Tika is the tool that is mainly used here.

Andreas

> 
> Thank you, again!
> 
> Cheers,
> 
>               Tim
> 


Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Tim Allison <ta...@apache.org>.
https://issues.apache.org/jira/browse/INFRA-20372

On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
happy with corpora.pdfbox.apache.org or anything else.  Please let us know
what you think over on that ticket.

Thank you, again!

Cheers,

             Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <ta...@apache.org> wrote:

> >proper domain for https access
>
> I just pinged infra on slack.
>
> If they're able to do it, what would we want?
>
> file-corpora.apache.org
> corpora.apache.org
> corpora-pdfbox.apache.org
> corpora-tika.apache.org
>
> Something else?  I'm also happy to buy a domain if that won't work.  There
> are a couple available that are close enough.
>
> On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sa...@fileaffairs.de>
> wrote:
>
>>
>> > AMD ryzen looks fantastic.  Others would be great as well.
>> >
>> > If ubuntu is possible at all, that's what I've been working with most
>> > recently.
>>
>> OK - will setup with that distro
>>
>> >
>> > Other than that, ssh access and sudo privileges would be all I'd need.
>> >
>> > Are you ok if we set up apache httpd to host files for the public or
>> will
>> > this be a community only resource?
>>
>> it can be used for whatever we want it to - so if you consider public
>> file sharing useful of course we can do that. Would be
>> good if we get a proper domain for https access. Would that be something
>> infra can do?
>>
>> >
>> > If this is corporate sponsored, please let me know how/if we should
>> mention
>> > the sponsorship.
>>
>> no need to mention it - happy to help.
>>
>> >
>> > Again...wow.  Thank you!
>> >
>> > Best,
>> >
>> >       Tim
>> >
>> > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sa...@fileaffairs.de>
>> > wrote:
>> >
>> > > Could fund either:
>> > >
>> > > AMD Ryzen 5 3600
>> > > 64 GB RAM
>> > > 2x2TB
>> > >
>> > > or
>> > >
>> > > AMD Ryzen 7 3700X based Server
>> > > 64 GB RAM
>> > > 2x8TB
>> > >
>> > > or
>> > > Intel® Core™ i9-9900K
>> > > 64 GB RAM
>> > > 2x8TB
>> > >
>> > > All are root servers so one has to vote for taking care of them (I
>> can do
>> > > the initial setup).
>> > >
>> > >
>> > >
>> > > BR
>> > > Maruan
>> > >
>> > > > There are two use cases.
>> > > >
>> > > > 1) host shared data so that we can all point to and work from the
>> same
>> > > > data, ideally both literal docs and also extracts (text/metadata
>> .json
>> > > > files representing extracted information).
>> > > >
>> > > > 2) a modest vm to allow all of us to run the regression tests
>> > > >
>> > > > We could use help with either or both.
>> > > >
>> > > > What we had before:
>> > > > 8 GB RAM
>> > > > 8 cores
>> > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
>> > > >
>> > > > We can always use more RAM and more cores up to the point of I/O
>> > > > bottlenecks.
>> > > >
>> > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <
>> sahyoun@fileaffairs.de>
>> > > > wrote:
>> > > >
>> > > > > is that a storage box only or does it need to do some computings
>> too?
>> > > > >
>> > > > > Maybe you could write a small spec for the server requirement?
>> > > > >
>> > > > > BR
>> > > > > Maruan
>> > > > >
>> > > > >
>> > > > > > Still haven’t had time to put the server in a dmz. Ugh.
>> > > > > >
>> > > > > >  Yes, more than happy to share.
>> > > > > >
>> > > > > > If anyone has recommendations for file hosting for a couple of
>> TB,
>> > > let me
>> > > > > > know.
>> > > > > >
>> > > > > > One option would be to work with CommonCrawl to bump the max
>> file
>> > > size
>> > > > > one
>> > > > > > crawl a year...
>> > > > > >
>> > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <THausherr@t-online.de>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Can we / I access these files? Most differences are improvements or not
>> > > > > > > meaningful, but there are a few I'd like to have a look at, e.g.
>> > > > > > >
>> > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
>> > > > > > >
>> > > > > > > the word "antrag" loses the first "a". Although maybe the "a" was a big
>> > > > > > > one and gets assigned to another line.
>> > > > > > >
>> > > > > > > Tilman
>> > > > > > >
>> > > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
>> > > > > > > > > > Reports are available here:
>> > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
>> > > > > > > > Looks like there are trivial differences in content with a slight
>> > > > > > > > improvement over 2.0.19.  I don't see any differences in exceptions or
>> > > > > > > > attachments.
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > >
>> > > > > > > >          Tim
>> > > > > > > >
>> > > ---------------------------------------------------------------------
>> > > > > > > To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
>> > > > > > > For additional commands, e-mail: dev-help@pdfbox.apache.org
>> > > > > > >
>> > > > > > >
>> > > --
>> > > Maruan Sahyoun
>> > >
>> > > FileAffairs GmbH
>> > > Josef-Schappe-Straße 21
>> > > 40882 Ratingen
>> > >
>> > > Tel: +49 (2102) 89497 88
>> > > Fax: +49 (2102) 89497 91
>> > > sahyoun@fileaffairs.de
>> > > www.fileaffairs.de
>> > >
>> > > Geschäftsführer: Maruan Sahyoun
>> > > Handelsregister: AG Düsseldorf, HRB 53837
>> > > UST.-ID: DE248275827
>> > >
>> > >

Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
 
> On 02.06.20 at 22:20, Maruan Sahyoun wrote:
> >   
> > > Maruan,
> > >    To confirm, you're ok if we grant access to the server to our colleagues
> > > on Tika and POI?
> > 
> > to be clear - my company is only sponsoring the box. It's the projects' decision who needs access, not mine. So feel free.
> Thanks a lot, Maruan! Should we mention your company somewhere as a sponsor?

I'm glad that I can give something back to the projects. Mentioning the sponsorship is not needed. With the corpora files, PDFBox,
Tika, POI, and others have one of the best collections of real-world files. I remember that when I gave a presentation at PDF Days some time ago,
people were really impressed by our testing.
BR
Maruan

> 
> Andreas
> 
> 
> > BR
> > Maruan
> > 
> > 
> > >    Again, wow, THANK YOU!
> > > 
> > >                 Best,
> > > 
> > >                            Tim
> > > 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827




Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 02.06.20 at 22:20, Maruan Sahyoun wrote:
>   
>> Maruan,
>>    To confirm, you're ok if we grant access to the server to our colleagues
>> on Tika and POI?
> 
> to be clear - my company is only sponsoring the box. It's the projects' decision who needs access, not mine. So feel free.
Thanks a lot, Maruan! Should we mention your company somewhere as a sponsor?

Andreas


> 
> BR
> Maruan
> 
> 
>>    Again, wow, THANK YOU!
>>
>>                 Best,
>>
>>                            Tim
>>


Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
 
> I'm rsync'ing the data over now.  I probably won't get around to setting up httpd this week, but if anyone else wants to take it, go for it.  This will at least get team members access to the files asap.

I can take care of httpd but would prefer to wait until the subdomain/cert is done, as I'd go for https only. If access is needed
sooner, let me know - I'd do an initial setup in that case.

BR
Maruan

> 
> I've disabled login via password. 
> 
> If anyone feels that I'm doing something wrong, please let me know!
> 
> Cheers and thank you Maruan!
> 
>     Tim
> 
> On Tue, Jun 2, 2020 at 4:20 PM Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> >  
> > > Maruan,
> > >   To confirm, you're ok if we grant access to the server to our colleagues
> > > on Tika and POI?
> > 
> > to be clear - my company is only sponsoring the box. It's the projects' decision who needs access, not mine. So feel free.
> > 
> > BR
> > Maruan
> > 
> > 
> > >   Again, wow, THANK YOU!
> > > 
> > >                Best,
> > > 
> > >                           Tim
> > > 


Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Tim Allison <ta...@apache.org>.
I'm rsync'ing the data over now.  I probably won't get around to setting up
httpd this week, but if anyone else wants to take it, go for it.  This will
at least get team members access to the files asap.

I've disabled login via password.

If anyone feels that I'm doing something wrong, please let me know!

Cheers and thank you Maruan!

    Tim

On Tue, Jun 2, 2020 at 4:20 PM Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

>
> > Maruan,
> >   To confirm, you're ok if we grant access to the server to our
> colleagues
> > on Tika and POI?
>
> to be clear - my company is only sponsoring the box. It's the projects'
> decision who needs access, not mine. So feel free.
>
> BR
> Maruan
>
>
> >   Again, wow, THANK YOU!
> >
> >                Best,
> >
> >                           Tim
> >

Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
 
> Maruan,
>   To confirm, you're ok if we grant access to the server to our colleagues
> on Tika and POI?

To be clear: my company is only sponsoring the box. Who needs access is the project's decision, not mine, so feel free.

BR
Maruan


>   Again, wow, THANK YOU!
> 
>                Best,
> 
>                           Tim
> 
> On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <ta...@apache.org> wrote:
> 
> > > proper domain for https access
> > 
> > I just pinged infra on slack.
> > 
> > If they're able to do it, what would we want?
> > 
> > file-corpora.apache.org
> > corpora.apache.org
> > corpora-pdfbox.apache.org
> > corpora-tika.apache.org
> > 
> > Something else?  I'm also happy to buy a domain if that won't work.  There
> > are a couple available that are close enough.
> > 
> > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sa...@fileaffairs.de>
> > wrote:
> > 
> > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > 
> > > > If ubuntu is possible at all, that's what I've been working with most
> > > > recently.
> > > 
> > > OK - will setup with that distro
> > > 
> > > > Other than that, ssh access and sudo privileges would be all I'd need.
> > > > 
> > > > > Are you ok if we set up apache httpd to host files for the public or will
> > > > > this be a community only resource?
> > > 
> > > it can be used for whatever we want it to - so if you consider public
> > > file sharing useful of course we can do that. Would be
> > > good if we get a proper domain for https access. Would that be something
> > > infra can do?
> > > 
> > > > > If this is corporate sponsored, please let me know how/if we should mention
> > > > > the sponsorship.
> > > 
> > > no need to mention it - happy to help.
> > > 
> > > > Again...wow.  Thank you!
> > > > 
> > > > Best,
> > > > 
> > > >       Tim
> > > > 
> > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sa...@fileaffairs.de>
> > > > wrote:
> > > > 
> > > > > Could fund either:
> > > > > 
> > > > > AMD Ryzen 5 3600
> > > > > 64 GB RAM
> > > > > 2x2TB
> > > > > 
> > > > > or
> > > > > 
> > > > > AMD Ryzen 7 3700X based Server
> > > > > 64 GB RAM
> > > > > 2x8TB
> > > > > 
> > > > > or
> > > > > Intel® Core™ i9-9900K
> > > > > 64 GB RAM
> > > > > 2x8TB
> > > > > 
> > > > > > All are root servers so one has to vote for taking care of them (I can do
> > > > > > the initial setup).
> > > > > 
> > > > > 
> > > > > 
> > > > > BR
> > > > > Maruan
> > > > > 
> > > > > > There are two use cases.
> > > > > > 
> > > > > > > 1) host shared data so that we can all point to and work from the same
> > > > > > > data, ideally both literal docs and also extracts (text/metadata .json
> > > > > > > files representing extracted information).
> > > > > > 
> > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > 
> > > > > > We could use help with either or both.
> > > > > > 
> > > > > > What we had before:
> > > > > > 8 GB RAM
> > > > > > 8 cores
> > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > 
> > > > > > We can always use more RAM and more cores up to the point of I/O
> > > > > > bottlenecks.
> > > > > > 
> > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahyoun@fileaffairs.de>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > is that a storage box only or does it need to do some computings too?
> > > > > > > > Maybe you could write a small spec for the server requirement?
> > > > > > > 
> > > > > > > BR
> > > > > > > Maruan
> > > > > > > 
> > > > > > > 
> > > > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > > > 
> > > > > > > >  Yes, more than happy to share.
> > > > > > > > 
> > > > > > > > > If anyone has recommendations for file hosting for a couple of TB, let me
> > > > > > > > > know.
> > > > > > > > > 
> > > > > > > > > One option would be to work with CommonCrawl to bump the max file size one
> > > > > > > > > crawl a year...
> > > > > > > > > 
> > > > > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <THausherr@t-online.de>
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > > Can we / I access these files? Most differences are improvements or not
> > > > > > > > > > meaningful, but there are a few I'd like to have a look, e.g.
> > > > > > > > > > 
> > > > > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > > > > > > 
> > > > > > > > > > the word "antrag" loses the first "a". Although maybe the "a" was a big
> > > > > > > > > > one and gets assigned to another line.
> > > > > > > > > 
> > > > > > > > > Tilman
> > > > > > > > > 
> > > > > > > > > Am 02.06.2020 um 02:58 schrieb Tim Allison:
> > > > > > > > > > > > > Reports are available here:
> > > > > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > > > > > > > > Looks like there are trivial differences in content with a slight
> > > > > > > > > > > improvement over 2.0.19.  I don't see any differences in exceptions or
> > > > > > > > > > > attachments.
> > > > > > > > > > 
> > > > > > > > > > Cheers,
> > > > > > > > > > 
> > > > > > > > > >          Tim
> > > > > > > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> > > > > > > > > For additional commands, e-mail: dev-help@pdfbox.apache.org
> > > > > > > > > 
> > > > > > > > > 
> > > > > --
> > > > > Maruan Sahyoun
> > > > > 
> > > > > FileAffairs GmbH
> > > > > Josef-Schappe-Straße 21
> > > > > 40882 Ratingen
> > > > > 
> > > > > Tel: +49 (2102) 89497 88
> > > > > Fax: +49 (2102) 89497 91
> > > > > sahyoun@fileaffairs.de
> > > > > www.fileaffairs.de
> > > > > 
> > > > > Geschäftsführer: Maruan Sahyoun
> > > > > Handelsregister: AG Düsseldorf, HRB 53837
> > > > > UST.-ID: DE248275827
> > > > > 
> > > > > 
> > > --
> > > Maruan Sahyoun
> > > 
> > > FileAffairs GmbH
> > > Josef-Schappe-Straße 21
> > > 40882 Ratingen
> > > 
> > > Tel: +49 (2102) 89497 88
> > > Fax: +49 (2102) 89497 91
> > > sahyoun@fileaffairs.de
> > > www.fileaffairs.de
> > > 
> > > Geschäftsführer: Maruan Sahyoun
> > > Handelsregister: AG Düsseldorf, HRB 53837
> > > UST.-ID: DE248275827
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> > > For additional commands, e-mail: dev-help@pdfbox.apache.org
> > > 
> > > 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827



Re: New file vm WAS: Re: Release 2.0.20 ?

Posted by Tim Allison <ta...@apache.org>.
Maruan,
  To confirm, you're ok if we grant access to the server to our colleagues
on Tika and POI?
  Again, wow, THANK YOU!

               Best,

                          Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <ta...@apache.org> wrote:

> > proper domain for https access
>
> I just pinged infra on slack.
>
> If they're able to do it, what would we want?
>
> file-corpora.apache.org
> corpora.apache.org
> corpora-pdfbox.apache.org
> corpora-tika.apache.org
>
> Something else?  I'm also happy to buy a domain if that won't work.  There
> are a couple available that are close enough.
>
> On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sa...@fileaffairs.de>
> wrote:
>
>>
>> > AMD ryzen looks fantastic.  Others would be great as well.
>> >
>> > If ubuntu is possible at all, that's what I've been working with most
>> > recently.
>>
>> OK - will setup with that distro
>>
>> >
>> > Other than that, ssh access and sudo privileges would be all I'd need.
>> >
>> > Are you ok if we set up apache httpd to host files for the public or will
>> > this be a community only resource?
>>
>> it can be used for whatever we want it to - so if you consider public
>> file sharing useful of course we can do that. Would be
>> good if we get a proper domain for https access. Would that be something
>> infra can do?
>>
>> >
>> > If this is corporate sponsored, please let me know how/if we should mention
>> > the sponsorship.
>>
>> no need to mention it - happy to help.
>>
>> >
>> > Again...wow.  Thank you!
>> >
>> > Best,
>> >
>> >       Tim
>> >
>> > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sa...@fileaffairs.de>
>> > wrote:
>> >
>> > > Could fund either:
>> > >
>> > > AMD Ryzen 5 3600
>> > > 64 GB RAM
>> > > 2x2TB
>> > >
>> > > or
>> > >
>> > > AMD Ryzen 7 3700X based Server
>> > > 64 GB RAM
>> > > 2x8TB
>> > >
>> > > or
>> > > Intel® Core™ i9-9900K
>> > > 64 GB RAM
>> > > 2x8TB
>> > >
>> > > All are root servers so one has to vote for taking care of them (I can do
>> > > the initial setup).
>> > >
>> > >
>> > >
>> > > BR
>> > > Maruan
>> > >
>> > > > There are two use cases.
>> > > >
>> > > > 1) host shared data so that we can all point to and work from the same
>> > > > data, ideally both literal docs and also extracts (text/metadata .json
>> > > > files representing extracted information).
>> > > >
>> > > > 2) a modest vm to allow all of us to run the regression tests
>> > > >
>> > > > We could use help with either or both.
>> > > >
>> > > > What we had before:
>> > > > 8 GB RAM
>> > > > 8 cores
>> > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
>> > > >
>> > > > We can always use more RAM and more cores up to the point of I/O
>> > > > bottlenecks.
>> > > >
>> > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahyoun@fileaffairs.de>
>> > > > wrote:
>> > > >
>> > > > > is that a storage box only or does it need to do some computings too?
>> > > > >
>> > > > > Maybe you could write a small spec for the server requirement?
>> > > > >
>> > > > > BR
>> > > > > Maruan
>> > > > >
>> > > > >
>> > > > > > Still haven’t had time to put the server in a dmz. Ugh.
>> > > > > >
>> > > > > >  Yes, more than happy to share.
>> > > > > >
>> > > > > > If anyone has recommendations for file hosting for a couple of TB, let me
>> > > > > > know.
>> > > > > >
>> > > > > > One option would be to work with CommonCrawl to bump the max file size one
>> > > > > > crawl a year...
>> > > > > >
>> > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <THausherr@t-online.de>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Can we / I access these files? Most differences are improvements or not
>> > > > > > > meaningful, but there are a few I'd like to have a look, e.g.
>> > > > > > >
>> > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
>> > > > > > >
>> > > > > > > the word "antrag" loses the first "a". Although maybe the "a" was a big
>> > > > > > > one and gets assigned to another line.
>> > > > > > >
>> > > > > > > Tilman
>> > > > > > >
>> > > > > > > Am 02.06.2020 um 02:58 schrieb Tim Allison:
>> > > > > > > > > > Reports are available here:
>> > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
>> > > > > > > > Looks like there are trivial differences in content with a slight
>> > > > > > > > improvement over 2.0.19.  I don't see any differences in exceptions or
>> > > > > > > > attachments.
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > >
>> > > > > > > >          Tim
>> > > > > > > >