PDF: Still unfit for human consumption, 20 years later

jcrawfordor · on Aug 10, 2020

Any complaint about the PDF format tends to be hard to address because the PDF format is so complicated and so flexible---except, of course, for the argument that the PDF format is too complicated and flexible, which tends to be the one enduring criticism since it has led to a history of various security, compatibility, and performance issues related to PDFs.

The major attempts to replace PDF have largely failed, though. DjVu is relatively limited in scope. Postscript (as a document display format) has never been well-supported on Windows and is increasingly poorly supported on Linux due to rarity. XPS is perhaps the most direct "PDF replacement" but is nearly equally complicated (being based on the MS Office OOXML formats, giving it a similar cursed heritage to PDF's basis in the Photoshop PSD format), and there was never really a compelling argument to switch to it.

What I don't get is the suggestion that PDF should be replaced by HTML. The purposes of the two formats are basically orthogonal and replacing one with the other is doomed to failure. The author's argument seems more akin to "print-layout documents should be replaced by hypertext," and perhaps this is true in some cases, but it's definitely a different matter and one that the author's arguments don't really support that well.

In my opinion, hopefully more humble than the author's, PDF's main downside is the remarkable unevenness of the quality of the creation and reading tools, considering its supposedly "reads everywhere" nature. The "reference implementation" is a commercial product and supports a huge list of features that are rarely or never supported by third-party commercial or open-source implementations. The Linux toolchain still widely used with PDF (e.g. Ghostscript) is decidedly outdated and hard to work with, but there's not a lot of momentum towards development of more modern tools. All of these issues are likely rooted in the basic fact that the PDF format is extremely complicated, and so thoroughly implementing it is a massive undertaking.

The author's complaints about performance in particular reflect the flexibility and complexity of the format. Web browsers have mostly switched over to using pdf.js to render PDFs, which is completely satisfactory for documents that consist of text or images (like scanned documents), but can be absolutely unusable when dealing with extremely vector-heavy PDFs like GIS exports.

Even printing PDFs can become rather frustrating as the complexity of the format means that parse-related printing issues are relatively common. Even Acrobat, for a long time, would munge certain characters when printing due to some sort of inconsistency with how different generators and readers implemented font embedding leading to Acrobat not being able to locate the embedded character font. This seemed most common with the letter "l" but maybe I'm imagining that... but also maybe it reflects some frightening detail of the format or implementation behavior.

One of the most common issues around PDF consistency comes down to file size... different PDF generators are prone to create representations of the same document that are significantly different sizes. Scanners are often an extreme example: some combination of not "knowing the tricks" for PDF optimization and a probably very low-performance compression implementation means that low-end network scanners often produce PDFs that are hilariously large. Opening them in Acrobat and using the "optimize file" tool can reduce file size by 90% without apparent visual impact... the whole fact that Acrobat has an "optimize" tool (and that Acrobat Distiller used to exist) speaks to the scale of this problem. Inspecting PDFs that are "optimized" by Acrobat can be an alarming experience, as well. You may remember that this played a strange role in Obama's birth certificate some years back, as Acrobat seems to normally split PDFs into all kinds of different layers and apply strange transformations to them when it "optimizes." It's hard to know how much of this is actually "best practice" versus just a result of Acrobat accumulating decades of eccentricities.

So the bottom line is... PDF is too complicated for its own good, but then so are a great deal of other formats in widespread usage, like modern webpages which require complex parsing of multiple formats to render, and a great deal of historic cruft brought along with them. I'm not sure that there's any sound technical argument that PDF or web pages are a "better format," it's all a matter of opinion over whether you prefer print-format documents or hypertext, and that's going to be very application-specific.

Thorentis · on Aug 11, 2020

> PDF's main downside is the remarkable unevenness of the quality of the creation and reading tools

Funnily enough, I think one of the reasons PDF became so popular is that it was originally seen as a "difficult / impossible to modify file that can be downloaded as a file and read in a static way". The lack of editing tools in the most popular PDF reader for a long time (Acrobat Reader) was the reason it became such a widely used format. Especially compared to distributing a .doc or .docx where the user can easily accidentally change something.

ubermonkey · on Aug 11, 2020

That's absolutely what we use it for. It's a "terminal" format for documentation we distribute to customers. We don't want to send out .docx.

aj7 · on Aug 12, 2020

Exactly.

DrAwdeOccarim · on Aug 11, 2020

That's why I use it at work. If I don't pay for my employees to get Acrobat Pro, or allow them to install software outside of a helpdesk ticket, then I know a PDF they generate using our lab management software is unadulterated. It's part of our data verification policy. It's not that I think my employees will change data nefariously, it's that they may want to edit the layout and accidentally change a numerical value.

davidgerard · on Aug 11, 2020

LibreOffice Draw is pretty good for editing PDFs - if you have all the matching fonts. So I wouldn't go assuming people can't edit PDFs.

DrAwdeOccarim · on Aug 12, 2020

They'd need to put in a ticket to get it installed, which I would need to approve. But also, this isn't to stop malice; it's to stop accidental edits. I can't stop my employees from messing with things if they want to. That's where the trust part and common goals come into play.

drawkbox · on Aug 11, 2020

The document can be tampersealed to help prevent or at least notify of that.

In a way a locked tampersealed document is a pretty decent template for parsing data as well. It is messy as PDFs always are but a decent lib and a consistent source with sealed docs can be used for decently verifiable scraping.

For instance some sort of license, certification or official document like tax forms, it can be generated with a common output, tampersealed and then reliably parsed after verification.

frandroid · on Aug 12, 2020

Don't these users have a recent version of MS Word?

chipotle_coyote · on Aug 10, 2020

> What I don't get is the suggestion that PDF should be replaced by HTML. The purposes of the two formats are basically orthogonal and replacing one with the other is doomed to failure.

Isn't "the purposes of the two formats are basically orthogonal" actually the entire point the article is making? Literally the first line of the summary:

> Research spanning 20 years proves PDFs are problematic for online reading. Yet they’re still prevalent and users continue to get lost in them.

From the second paragraph:

> The [PDF] format is intended and optimized for print. It’s inherently inaccessible, unpleasant to read, and cumbersome to navigate online.

The bolded statement in the second paragraph that's clearly meant to be the One Important Thing to Take Away:

> Do not use PDFs to present digital content that could and should otherwise be a web page.

Your comment here is eloquent, but the article's argument is not "print-layout documents should be replaced by hypertext," it's "print-layout documents are a poor fit for reading on screen-layout devices." When you conclude:

> It's all a matter of opinion over whether you prefer print-format documents or hypertext, and that's going to be very application-specific.

Aren't you essentially restating the article's thesis?

I don't want to read an article online that's a PDF for largely the same reason that I don't want to print the web version of the same article rather than a PDF. It's generally going to be clunky. The print page size and dimensions are not going to be my screen/window size and dimensions. I certainly don't want to read two- or three-column text on screen, which may require zooming in and out and scrolling back and forth on the same "print" page. And God help me if I'm trying to do that on my phone or iPad mini.

The article isn't saying "PDF is terrible and nobody should ever use it"; it's saying "PDFs were meant for specific applications and in nearly all circumstances, online reading is not it."

lmm · on Aug 11, 2020

People who use PDFs generally do it because they want to have a fixed layout. If you tell those people to use HTML, they'll find a way to produce a non-reflowable webpage.

bryanrasmussen · on Aug 11, 2020

Or they use them because they have a publishing flow implemented somewhere in the byzantine processes of their company that spits out a nice looking pdf at the end (and maybe a crappy looking 1999 html document)

chipotle_coyote · on Aug 11, 2020

That's actually been more my experience, yes. :)

dragonwriter · on Aug 12, 2020

People who use PDFs often do it because the content they use it for is of a type (often a series) that has long been produced in PDF, and the reasoning for that in many cases is because in the 1990s people would print it out and read it. In many cases, that's not how people use it now, but PDF is still used because that's the way it has always been done.

feteru · on Aug 12, 2020

I use PDFs for engineering drawings of mechanical parts. It's been done for a long time and is a good fit. There's a specific sheet format and scale, and it is meant to be printed easily.

jfk13 · on Aug 12, 2020

...although you need to beware of printouts getting silently scaled. It's really common, when someone asks to print (e.g.) an A4-sized PDF to A4 paper, for the printer driver to rescale it so that the entire document (including margins) fits within the (slightly smaller) printable area of the device. "Shrink to fit [within printable area]" seems to be a common default setting.

(If you're using PDFs for precisely-scaled engineering drawings, I expect you're well aware of this and have a workflow that avoids scaling, but I see people trip over it all too often.)
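
To put a rough number on the shrink-to-fit rescaling described above, here is a small worked example. The 5 mm unprintable margin is an assumption picked for illustration (real devices vary), and `shrink_to_fit_scale` is a hypothetical helper name, not part of any driver.

```python
# Sketch: how much a "shrink to fit" print setting scales an A4 page.
# The 5 mm unprintable margin is an assumed figure; real printers vary.

A4_WIDTH_MM = 210.0
A4_HEIGHT_MM = 297.0


def shrink_to_fit_scale(page_w, page_h, margin):
    """Return the uniform scale that fits the full page inside the printable area."""
    printable_w = page_w - 2 * margin
    printable_h = page_h - 2 * margin
    # A uniform scale must satisfy both axes, so the tighter one wins.
    return min(printable_w / page_w, printable_h / page_h)


scale = shrink_to_fit_scale(A4_WIDTH_MM, A4_HEIGHT_MM, margin=5.0)
printed_length = 100.0 * scale  # a feature drawn as 100 mm on the sheet

print(f"scale factor: {scale:.4f}")
print(f"100 mm prints as {printed_length:.1f} mm")
```

Under that assumed margin the width is the limiting axis, so everything on the sheet comes out close to 5% small, which is plenty to ruin a 1:1 drawing.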

dragonwriter · on Aug 12, 2020

> I use PDFs for engineering drawings of mechanical parts. It's been done for a long time and is a good fit.

Good for you, but that doesn't change that most of the uses of PDF on the web aren't for applications like that.

scelerat · on Aug 11, 2020

Granted, but a surprising number of people (still, in 2020) envision a very static, print-like experience for all web pages. i.e. yes, they "want" a fixed layout but in many cases their reasoning is mis- or uninformed.

GoblinSlayer · on Aug 12, 2020

They use PDFs because of the five monkeys experiment.

KineticLensman · on Aug 12, 2020

This experiment, which apparently never happened [0], just means 'because tradition'.

[0] http://www.throwcase.com/2014/12/21/that-five-monkeys-and-a-...

GoblinSlayer · on Aug 12, 2020

Serious documents are traditionally published in pdf format, that's why they are published in pdf format.

_8ljf · on Aug 11, 2020

> The [PDF] format is intended and optimized for print. It’s inherently inaccessible, unpleasant to read, and cumbersome to navigate online.

PDF format is perfectly capable of holding structured text and other machine-readable/accessibility data, alongside the print-ready representation. Ask the developers of document authoring tools (starting with MS Office) and the various PDF generation libraries why they don’t include all that data as standard.

One can argue the appropriateness of a print-derived format in a constantly fluid digital world, but it does what it was designed to do and does it pretty well, and it could provide a lot more if developers and users could be bothered to do it.

And yes, HTML is its own exercise in awfulness that is equally bad at everything. I’d rather set my feet on fire than propagate that horror further.

Honestly, what we really need is a 21st-century Donald Knuth. I only wish she’d hurry up.

alanburger88 · on Aug 13, 2020

PDFs are usually used for their immutability. Not all PDFs are created equal. Some are harder to change than others - but there are many tools and options to change them, if you need to.

However, it is possible to turn HTML5 into statement-of-record documents (not PDFs) and make them immutable, encrypted and authenticated. An HTML5 document can have the features we need from PDF (immutability, encryption, authentication, pixel-perfect print, etc.) while still allowing the resulting document to be interactive and responsive (working well on mobile and web).

Effectively the best of both worlds.

mavhc · on Aug 11, 2020

Tagged PDFs that allow reflow have existed for at least 8 years

chipotle_coyote · on Aug 11, 2020

Properly tagging a PDF involves marking up literally every element in the document in a similar way that you'd mark up that document in HTML (e.g., everything must be described semantically; the tags are really designed for accessibility reasons and for helping screen readers). Sometimes it might be the right choice, but this doesn't change my basic argument: if the final destination of your document is intended to be a web site, then HTML is almost always the right delivery format.

gjvc · on Aug 11, 2020

interesting. what's that format / feature called?

michieldotv · on Aug 11, 2020

It's called just that: Tagged PDF. It was added to the ISO spec quite some time ago and allows you to annotate your documents structurally.

JadeNB · on Aug 10, 2020

> What I don't get is the suggestion that PDF should be replaced by HTML.

What I don't get is the authors' assumption that replacing with HTML means replacing with HTML that correctly uses "color, contrast, document structure, tags, and much more", leaves users in "a familiar context", is not "excruciatingly slow to load both on desktop and mobile", correctly employs "chunking, using bullets, subheadlines, anchor links, and accordions", and "show[s] a standard navigation", as opposed to … all that stuff that's actually out there. I don't know about the authors, but, if you give me a choice between a typical web site's idea about the flashy, JavaScript-heavy, animated, ad-laden way that I want to consume information on one hand, and a PDF on the other hand, then I'll take the PDF every time.

Voliokis · on Aug 12, 2020

I have to agree. People will have to pry PDFs out of my cold, dead fingers. No way in hell am I going to ever switch to a web-based format. A PDF means that I know what I'm getting. A static document that can't do any fuckery around preventing me from copying text or doing their own weird implementation of scrolling or breaking back/forwards navigation or anything else that modern websites love to do. If I get a PDF, I know what it is and the tools that I use to consume that PDF don't change just because the author of that PDF happened to find a new PDF framework and was so blinded by the shininess that he just had to implement as much of it as he could. No, a PDF is a PDF. My PDF readers all behave in a consistent way and give me the exact same functionality and the same interface and show consistent performance behavior, whether it's a PDF from the year 2000 or a PDF from today, regardless of what eccentric tastes the author might have, and the author has zero say in how I interact with the PDF.

That's how I want it. I'm tired of new formats and new frameworks and new tools releasing every single year that pretend they're better. The best thing about PDF is that it doesn't change. Yes, whatever, Adobe adds new features, but any PDF ebook I download has nothing to do with that, nor that PDF scientific paper I downloaded this morning. I don't want random authors to be able to dictate how I interact with a medium, because the vast majority of them are, frankly, idiots when it comes to this domain and they have no sense of good user interface design.

JadeNB · on Aug 12, 2020

> A static document that can't do any fuckery around preventing me from copying text or doing their own weird implementation of scrolling or breaking back/forwards navigation or anything else that modern websites love to do.

You can, and I do, hope for this, and it's often what you get, but it's not at all guaranteed by the format. For example, PDF allows JavaScript: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdf... . I think most readers other than Acrobat probably don't support it (which, of course, Adobe spins as a feature of Acrobat!), and not too many PDFs require it—although I do find that fillable PDFs, including from hopefully capable authors like the IRS, can be very finicky in anything other than Acrobat—but I fear it's only a matter of time until it becomes as difficult to "browse the PDF without JavaScript" as it currently is to browse the web without it.
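
For anyone who wants a quick look at whether a given file carries script actions, a crude check is to scan the raw bytes for the `/JavaScript` and `/JS` name tokens. This is only a heuristic sketch: it will miss names inside compressed object streams or written with `#xx` hex escapes, and it can false-positive on ordinary text. `count_js_tokens` is a hypothetical helper name, and the two byte strings are hand-written fragments, not complete PDF files.

```python
import re

# Name tokens that PDF uses for script actions. Matching raw bytes is a
# heuristic: it misses names inside compressed object streams and names
# written with #xx hex escapes, and it can also match ordinary text.
JS_TOKEN = re.compile(rb"/(?:JavaScript|JS)\b")


def count_js_tokens(pdf_bytes: bytes) -> int:
    """Count raw /JavaScript and /JS name tokens in a PDF byte string."""
    return len(JS_TOKEN.findall(pdf_bytes))


# Hypothetical hand-written fragments, not complete PDF files.
with_script = b"1 0 obj << /S /JavaScript /JS (app.alert('hi');) >> endobj"
without_script = b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj"

print(count_js_tokens(with_script))
print(count_js_tokens(without_script))
```

A nonzero count is a reason to look closer, not proof of anything; a zero count on a file that uses object streams proves even less.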

> I don't want random authors to be able to dictate how I interact with a medium

On this score you're out of luck even now—if the author, intentionally or unintentionally, did a bad job creating the PDF, you're out of luck. This is one area where I'll give HTML the win: it is, or can be, good at things like dynamic reflow, whereas PDF not only isn't but, I believe, effectively can't be.

Voliokis · on Aug 13, 2020

> On this score you're out of luck even now—if the author, intentionally or unintentionally, did a bad job creating the PDF, you're out of luck. This is one area where I'll give HTML the win: it is, or can be, good at things like dynamic reflow, whereas PDF not only isn't but, I believe, effectively can't be.

I'm not sure what you mean. I've never had an ebook or paper have any influence on how my PDF reader works or how I interact with it.

JadeNB · on Aug 14, 2020

> I'm not sure what you mean. I've never had an ebook or paper have any influence on how my PDF reader works or how I interact with it.

I was responding to a quote which I think, in retrospect, I misread:

> > I don't want random authors to be able to dictate how I interact with a medium

To me, random authors of PDF files do dictate how I interact with the content of the PDF file—I'm not sure whether or not to call that "the medium"—because they lay it out, and there is very little that I can do after the fact if, for example, I don't like their line breaks or their text layout, or if I want to be able to do a proper full text search, or cut and paste, etc. etc.

However, on re-reading (including your response above), it seems clear that what you meant, and what I should have understood, was not about being locked into the author's presentational choices, but about the UI of the PDF reader itself. I agree that what I said is irrelevant to this.

msla · on Aug 11, 2020

> In my opinion, hopefully more humble than the author's, PDF's main downside is the remarkable unevenness of the quality of the creation and reading tools, considering its supposedly "reads everywhere" nature.

Improving the tooling will make PDFs load faster and (possibly) be easier to navigate. (Unless they're JPEGs stitched together in a booklet, in which case they're pretty much hopeless from a navigation standpoint in any event.) It won't, however, address the core concern, which you touch on here:

> What I don't get is the suggestion that PDF should be replaced by HTML. The purposes of the two formats are basically orthogonal and replacing one with the other is doomed to failure. The author's argument seems more akin to "print-layout documents should be replaced by hypertext," and perhaps this is true in some cases, but it's definitely a different matter and one that the author's arguments don't really support that well.

I mostly agree with you:

I think the arguments support it well enough: PDFs are sized for print, laid out for print, and fundamentally do not flow. HTML, despite the best efforts of some, is still malleable enough you can have a single page which gracefully resizes itself to a range of screens, a minor technical miracle the march of UX progress still hasn't fully taken from us.

My response to the author is this:

PDF looks the same on the screen as it does on the page. That is its blessing. That is its curse. Some people absolutely demand that as a hard requirement, and will not brook anything which adapts to different environments. If they didn't have PDFs, they'd go back to making websites where all of the content is in a series of JPEG images scaled to look right on their screens. I've seen it happen. Therefore, replacing PDF with HTML is not socially viable. It doesn't solve the "soft" problem, which is a harder constraint than any technical problem.

TheOtherHobbes · on Aug 12, 2020

It's not just that some people demand it, it's that some PDF use cases require it. PDF is often used for semi-complex forms, and while of course you can design forms that reflow for mobile/HTML, the results are often... not great.

The meta-UX point is that a fixed layout allows you to use spatial relationships to indicate how fields are related. This is such an intuitive thing even inexperienced or amateur designers tend to do it by default.

Reflow doesn't allow this, and with some forms, reflow can literally make the layout - and the content - incomprehensible. There are workarounds, but it's often impossible to create a dynamic design that has the same mix of information density and spatial hinting as a static layout.

bryanrasmussen · on Aug 11, 2020

>PDF's main downside is the remarkable unevenness of the quality of the creation and reading tools

I can't help but think that HTML's third downside is the remarkable unevenness of its quality.

The second downside is the remarkable unevenness of the quality of the CSS that is used with it.

The primary downside is the remarkable dangerousness of much of the JavaScript that is found bound to the HTML, that if turned off means that you often see a message that this site requires you turn on our dangerous JavaScript. At the best you end up back with the second and third downsides moving up a level.

on edit: perhaps a little facetious, but given the problems with quality found with websites that probably most of us are aware of it seems a bit much to complain about the quality of PDF. Maybe this is just some silly whataboutism on my part though.

dvdkhlng · on Aug 11, 2020

> The primary downside is the remarkable dangerousness of much of the JavaScript that is found bound to the HTML.

Are you aware that PDF format has also featured JavaScript support (and related vulnerabilities) for a long time now [1]?

[1] https://us-cert.cisa.gov/ncas/alerts/TA09-133B

bryanrasmussen · on Aug 11, 2020

yes, although I forgot about it, never seem to have a use for it myself.

jabroni_salad · on Aug 11, 2020

The finance industry loves it. Their bureaucracy ran on documents, and it still does, but now those documents can run code.

bXVsbGVy · on Aug 12, 2020

> You may remember that this played a strange role in Obama's birth certificate some years back

Just to clarify, the issue was with the compression algorithm used by the scanner, not with the PDF format itself.

There are a few videos of David Kriesel explaining the bug for the curious: https://www.youtube.com/watch?v=c0O6UXrOZJo

bla3 · on Aug 11, 2020

> Web browsers have mostly switched over to using pdf.js

As far as I know, Firefox is the only browser that uses pdf.js. Chrome uses PDFium, Safari uses the macOS system pdf libraries, and Edge probably does what Chrome does.

leephillips · on Aug 11, 2020

I use qutebrowser, where the choices are: download or pdf.js.

I guess the extension to use pdf.js with Chrome is pretty popular.

leephillips · on Aug 10, 2020

Interesting observations, but I don’t understand why you distinguish between PDF and hypertext, since of course PDF can contain hypertext.

jcrawfordor · on Aug 10, 2020

That might be part of the problem. :)

Really though I might not be using the best term, esp. with the definition of hypertext being one of those things that's a little historic now. I'm mostly just comparing between print formats and formats where layout is done by the viewer to reflect user preferences (which is kind of a dead concept with HTML anyway, but...)

cryptonector · on Aug 10, 2020

Not that the specification for the Web is less complex than that of PDF...

GoblinSlayer · on Aug 12, 2020

There are simplified profiles: https://www.w3.org/TR/xhtml-basic/

akx · on Aug 11, 2020

> heritage to PDF's basis in the Photoshop PSD format

Is that really so? PSDs are through and through binary, while PDFs smell more like PostScript with extras...

simonh · on Aug 11, 2020

Quite right, PDF is mostly a 'flattened' subset of the PostScript format containing tokenised and interpreted data generated from the PostScript code, plus the subset of assets such as fonts that are actually used, in a bundle structure.

It also has some optimisations for its specific use case, such as that individual pages are completely described independently, whereas in PostScript the code generating any page can affect the content of any succeeding page. This is why in PDF files you can easily re-order pages or efficiently jump directly to and render any page.

jondubois · on Aug 11, 2020

The purposes of HTML and PDF are not orthogonal, there is a great deal of overlap.

The real advantage of PDF is that the images which are used inside the document are bundled into the same file... Whereas HTML has historically required the image files to be loaded from elsewhere which made it not portable. That said, now with HTML, you can define images with base64 data, so it could in fact replace PDF.
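
A minimal sketch of the base64 approach mentioned above, using only the Python standard library. The tiny SVG is a stand-in for a real image file, and `to_data_uri` is a hypothetical helper name.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw file bytes as a base64 data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# A tiny inline SVG stands in for a real image file here (an assumption
# for the example; any PNG or JPEG bytes would work the same way).
image_bytes = (
    b'<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
    b'<rect width="10" height="10" fill="red"/></svg>'
)

uri = to_data_uri(image_bytes, "image/svg+xml")
html = f'<!DOCTYPE html>\n<html><body><img src="{uri}" alt="red square"></body></html>'

# The page now carries its image with it; no second file is needed.
print(html)
```

One cost worth keeping in mind: base64 turns every 3 bytes into 4 characters, so embedded images grow by roughly a third before any transport compression.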

solarmist · on Aug 11, 2020

The real advantage of PDF is that it is a final form document format. Truly WYSIWYG. Which is literally the antithesis of HTML which is completely separated (theoretically) from the display/formatting.

shkkmo · on Aug 11, 2020

> Truly WYSIWYG.

PDFs are treated that way, but it isn't really true. Due to the complexity of the format, there are many PDFs that will display differently in different viewers.

solarmist · on Aug 11, 2020

Sure, but in comparison to other formats and especially HTML it is 90%+ there.

jondubois · on Aug 11, 2020

I don't agree with this view. HTML with CSS can support either fixed or fluid (and responsive) layouts... So it supports all the features required by PDF and more.

I don't see a problem with giving the document creator the option to go with a fluid or fixed layout and make the software default to fixed layout.

solarmist · on Aug 11, 2020

"Supporting all the features required" and being able to rely on them on most platforms and in most readers are a very different thing.

HTML and CSS theoretically have these properties, but if you asked someone in the publishing industry to lay out a book with them they would either quit on the spot or hate you until their dying breath. That or they're a masochist and want to see if they could actually do it because it is theoretically possible.

tabtab · on Aug 11, 2020

Coordinate-based HTML/CSS implementation is too inconsistent across implementations and versions to be relied upon, especially with regard to fonts.

What are some good WYSIWYG vector standards, including the font department? It could be useful for a GUI markup standard also so that we can have platform-neutral GUIs and GUIs over HTTP.

The server side may still have a dynamic/flow layout engine, but calculated coordinates are sent to the client, keeping the client simpler and more predictable.

GoblinSlayer · on Aug 12, 2020

But most publications don't need WYSIWYG.

solarmist · on Aug 13, 2020

Need is a relative term. Talk to any designer/artist who’s published something and they’ll disagree with you all day.

nyanpasu64 · on Aug 10, 2020

> Web browsers have mostly switched over to using pdf.js to render PDFs, which is completely satisfactory for documents that consist of text or images (like scanned documents)

Except pdf.js is not satisfactory. Every now and then I come across a PDF file where text is invisible, because Firefox uses a blank font instead of an external font.

avasthe · on Aug 11, 2020

> What I don't get is the suggestion that PDF should be replaced by HTML

If it is a limited subset, I am okay with it. With increasing ubiquity of mobile devices, reflowing PDF is hard. I rather like EPUB these days. Which consists of XHTML. And reader support is also fairly ubiquitous with many Open Source software supporting Epub.

Mikhail_Edoshin · on Aug 11, 2020

I've read XPS specs [1] and from the looks of it it's a very sane format, which I cannot say about other Microsoft XML formats I've seen (such as MS Word XML). I'm not that familiar with PDF internals, but I really doubt XPS has much inessential complexity. And, being a new format, it has zero legacy issues. The common complaint against PDF is that it's hard to extract text from it, but with XPS it seems to be rather easy and can be done with the standard XML toolchain. Besides, it has a good support for document structure: it has not only document outline, but also stories, sections, tables, lists, figures, etc.

[1] http://www.ecma-international.org/publications/standards/Ecm...

tonyedgecombe · on Aug 12, 2020

I've read both specs and the PDF one leaves you feeling everything is somewhat vague and waiting to trip you up.

It doesn't help that nobody in Adobe could say no to every new feature that might possibly help to sell another upgrade to Acrobat.

Akronymus · on Aug 11, 2020

It'd be awesome if it were possible to host (La)TeX files directly over the web.

Although, for most people a non-WYSIWYG editor is too cumbersome.

dash2 · on Aug 11, 2020

This would be a horrible nightmare. LaTeX is an atrocious mess that deserves to die. Source: personal experience trying to build a decent programmatic table creator for LaTeX.

CryoLogic · on Aug 11, 2020

Most of the publishing industry still uses .docx, some of the more advanced publishers have moved over to AsciiDoc - personally I think Markdown is the easiest to use for a big project (having written and published a technical book prior).

floatboth · on Aug 11, 2020

Except for scientific publishing, where LaTeX is king.

bjoli · on Aug 11, 2020

Most people I know use org mode with inline latex for stuff that needs to be latex.

Akronymus · on Aug 11, 2020

Hence the (La) part being optional. Normal TeX isn't bad at all, from what I can see.

jgalt212 · on Aug 11, 2020

> Scanners are often an extreme example

I have yet to find an Android scanner app that makes PDFs weighing less than 350K per page. I have tried MS Lens, CamScanner, and a few others I cannot recall at this time.

Has anyone had success in this arena?

kanox · on Aug 11, 2020

What would it take to replace PDF with a zipped html?

As far as I know the SVG format has comparable capabilities for graphics, all that is missing is a "page model" for HTML which would have to be invented.

pbsurf · on Aug 11, 2020

One way that SVG could be used for multipage documents is with a convention that the top-level <svg> tag is the document and child <svg> tags are the pages - this is what I do in my app. I also use the fact that gzip files can be created with independently decompressible blocks to create svgz files with page-level random read access [1].

But another barrier is that browsers refuse to support SVG fonts. One supposed reason for this, the lack of hinting support in SVG fonts, is less relevant now with high DPI displays - macOS no longer does hinting at all I believe. The additional effort to support SVG fonts is really minimal [2], so it seems strange that it's intentionally omitted.

[1] https://github.com/styluslabs/ulib/blob/master/miniz_gzip.h

[2] https://github.com/styluslabs/usvg
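For anyone curious what page-level random access over gzip can look like, here is a minimal sketch. It assumes the simplest variant: each chunk is compressed as its own gzip member and the members are concatenated, with a side index of byte offsets. That may well differ from what the linked header does, and the names `pack_chunks` and `read_chunk` are made up for illustration.

```python
import gzip

def pack_chunks(chunks):
    """Compress each chunk as its own gzip member.

    Returns (blob, index), where index[i] is the (offset, length)
    of chunk i's member inside blob.
    """
    blob = bytearray()
    index = []
    for chunk in chunks:
        member = gzip.compress(chunk.encode("utf-8"))
        index.append((len(blob), len(member)))
        blob += member
    return bytes(blob), index

def read_chunk(blob, index, i):
    """Decompress only chunk i, without touching the other members."""
    offset, length = index[i]
    return gzip.decompress(blob[offset:offset + length]).decode("utf-8")

# Hypothetical two-page document: a top-level <svg> wrapping one child <svg> per page.
chunks = [
    '<svg xmlns="http://www.w3.org/2000/svg">',
    '<svg class="page" width="210mm" height="297mm"><text y="20">Page 1</text></svg>',
    '<svg class="page" width="210mm" height="297mm"><text y="20">Page 2</text></svg>',
    '</svg>',
]
blob, index = pack_chunks(chunks)
second_page = read_chunk(blob, index, 2)
```

Because concatenated gzip members still form a valid gzip stream, an ordinary decompressor reading `blob` from the start gets the whole document back, while a reader that knows the index can jump straight to one page.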

chriswarbo · on Aug 11, 2020

https://en.wikipedia.org/wiki/EPUB

dragonwriter · on Aug 12, 2020

> What would it take to replace PDF with a zipped html

EPUB. It's already zipped HTML and supports fixed layout.

Or, perhaps more simply, maybe just HTML + paged.js.

> all that is missing is a "page model" for html which would have to be invented.

CSS paged media is already a thing. What else do you need?

achr2 · on Aug 11, 2020

There are already page-break attributes for CSS.

Bombthecat · on Aug 11, 2020

Maybe the author meant an HTML-based DSL for documents? A bit like, I think, the ebook format whose name I can't remember.

phre4k · on Aug 11, 2020

Epub uses lots of XML.

dragonwriter · on Aug 12, 2020

> Epub uses lots of XML.

EPUB3 uses an XML packaging document, and the XML serialization of HTML5, so it doesn't really use lots of XML as opposed to HTML.

greggman3 · on Aug 11, 2020

I would argue print is essentially dead or should die. I read things on my phone, my tablet, and various monitor sizes with various sized windows. I think, though I could be wrong, that's true for most people on the planet in 2020. Even poor people on the other side of the world probably read stuff on feature phones if nothing else.

The world has also gotten more international and a PDF designed for US Letter Size doesn't fit A4 paper used in many other countries.

PDF is left over from the early 90s when print was still the main way we communicated. We didn't yet email each other, at least not the masses. We didn't have lots of different devices. Our screens were low-res so it was much easier to read paper than screens (some people might still find that true). Heck, when PDF came out in '93 most PCs still ran DOS.

Nowadays, though, it does nothing but get in the way. Sure, some rare PDFs can be reflowed but basically PDF wasn't designed for that and it's certainly not used that way. We need a format that reflows for all the various devices we might be reading something on. For the most part HTML seems to fit that bill. Maybe a version with better image/diagram embedding would be good but we arguably do not need something brand new from scratch.

TL;DR: the world changed. PDF is designed for the world from 30 years ago. May it rest in peace.

bepvte · on Aug 10, 2020

I agree with the size complaint. It's unbelievably annoying when I have to embed a bitmap image in a PDF and turn some 75 KB JPEG into a 26 MB PDF for compat reasons.

enriquto · on Aug 10, 2020

This sounds like very bad software for PDF creation. Note that PDF can hold JPEG data directly and supports the same Deflate compression that PNG uses, so such images can be embedded in a PDF file without ballooning in size.

hnick · on Aug 11, 2020

Yes, literally just copy everything between stream/endstream for a DCTDecode image and you can save it as a JPEG. Sounds like the tool's creator didn't know about that encoding.
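A rough sketch of that copy-between-stream/endstream trick, for the curious. It is deliberately naive: it assumes uncompressed, unencrypted objects and that the image bytes never happen to contain `endstream` or `endobj`; a robust tool would honor the `/Length` entry instead. The function name `extract_jpegs` is made up for this example.

```python
import re

# "stream" + end-of-line, the raw payload, an optional end-of-line, then "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
OBJECT_RE = re.compile(rb"\d+ \d+ obj(.*?)endobj", re.DOTALL)

def extract_jpegs(pdf_bytes):
    """Return the raw payload of every stream whose dictionary names DCTDecode."""
    jpegs = []
    for obj in OBJECT_RE.finditer(pdf_bytes):
        body = obj.group(1)
        # Everything before the first "stream" keyword is the object's dictionary.
        dictionary, keyword, _ = body.partition(b"stream")
        if not keyword or b"/DCTDecode" not in dictionary:
            continue
        match = STREAM_RE.search(body)
        if match:
            jpegs.append(match.group(1))
    return jpegs

# Synthetic stand-in for a PDF image object; the "JPEG" is just marker bytes.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"fake image data" + b"\xff\xd9"
fake_pdf = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /XObject /Subtype /Image /Filter /DCTDecode /Length "
    + str(len(fake_jpeg)).encode("ascii")
    + b" >>\nstream\n" + fake_jpeg + b"\nendstream\nendobj\n"
)
images = extract_jpegs(fake_pdf)
```

Each entry in `images` can be written to a `.jpg` file as-is, which is the point: the PDF wraps the JPEG bytes rather than re-encoding them.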

GnarfGnarf · on Aug 10, 2020

Although I agree that PDFs (and screens in general) are not the best for reading, the PDF file format is a minor miracle. It is a thing of beauty, combining text and graphics to preserve the author's design.

I have built a business on PDF. I develop graphics software, enabling my customers to create large charts (36" x 96" and bigger) in PDF format, which they can take to the print shop for printing on large-format plotters and printers.

The sharp crispness of PDF text and vector graphics allows unlimited zooming while never pixellating (except the photos, of course).

If you are familiar with the technical specifications of PDF (1,300 pages 2006 ed.), you will appreciate the sophistication and power of the internal structure of PDF.

As an exchange medium, PDF has made huge contributions to commerce, technology and culture.

SilasX · on Aug 10, 2020

Okay, but that’s not disagreeing with the article’s point, which is that it’s a bad UI for communicating digital content to non technical end users. The author supports its use for its primary case, and how you use it, which is printing.

fastaguy88 · on Aug 10, 2020

That may be the author's point, but what he says is that PDF is unfit for human consumption. Which is absurd. As a scientist, PDF journal articles are almost always easier to read than the HTML version -- the graphics are of much higher quality, two column printing is more common, scientific equations and fonts are rendered better, etc. etc.

If the article title were "PDF -- unfit for web presentation" the author might have a stronger case.

rumanator · on Aug 11, 2020

> That may be the author's point, but what he says is that PDF is unfit for human consumption. Which is absurd.

You're quoting the click-baity title, but failing to actually read past the article summary's first sentence.

The summary's very first sentence states "Research spanning 20 years proves PDFs are problematic for online reading." This sentence alone frames the problem, and explains the whole point of it.

barbs · on Aug 11, 2020

Sounds like we can all agree that this page should have a better, more accurate title.

rumanator · on Aug 11, 2020

No, it sounds like people shouldn't criticize a text when they clearly haven't read it.

SilasX · on Aug 11, 2020

> That may be the author's point, but what he says is that PDF is unfit for human consumption. Which is absurd. As a scientist, PDF journal articles are almost always easier to read than the HTML version

I don’t think the author was disagreeing with that either! They were saying, rather, that this effect is due to a collective failure to use HTML properly. If all you want to do is reproduce the physical pages of an article onto a digital device, and gain no more functionality (like text search, hyperlinking, reformatting), the author agrees PDF is great for that. But if you want to exploit all the features digital devices and the web offer, PDF constantly gets in your way.

elcritch · on Aug 11, 2020

I still tend to agree with fastaguy88, PDF's are still my preferred format for reading science data, even on my digital devices. HTML data just doesn't have the font quality, layout rendering, and ability to ensure all the appropriate data is saved without requiring some remote server (and login!). My preferred digital device is a 12" iPad Pro as it shows the pdf in native 8 1/2 by 11 size. That makes me happy! Change the device form to match PDF.

DrAwdeOccarim · on Aug 11, 2020

Yes, completely agree. Scientific articles on a phone are terrible--open the PDF and zoom/scroll around? You know exactly what to do and how it works.

rumanator · on Aug 11, 2020

> But if you want to exploit all the features digital devices and the web offer, PDF constantly gets in your way.

I agree, PDFs are indeed poor and ill-suited for online reading. They are not reflowable, and PDF authors force a pagination format that more often than not is only readable on a device by chance, or if readers use a large-format device such as a large tablet. Hell, some PDFs are even unreadable and impractical to read on 13" laptop screens with 1980*1200 resolution.

dnhz · on Aug 11, 2020

You should give HTML journal reading another chance. The journals that let you see citations without losing your place in the article are great.

GoblinSlayer · on Aug 12, 2020

> As a scientist, PDF journal articles are almost always easier to read than the HTML version -- the graphics are of much higher quality

You mean they look better on paper than on screen? And what magic improves quality of graphics?

bonoboTP · on Aug 21, 2020

The auto-reflow of HTML often puts things in weird places. I'm sure it can be done properly as well, but auto-converting a PDF or LaTeX document to HTML results in crappy layout. I always download the PDF when given the option, because the actual authors spent a lot of time making sure everything is in the right position and everything looks as intended. Sure, if the culture changed and scientists now learned web dev instead of LaTeX and put the same effort into producing production-quality, polished HTML instead of PDF, it may work as well. But scientists need to send LaTeX and PDF to journals and conferences, so you can't expect them to put in double effort. Also, making sure that things look good in all browsers is just a way bigger job than making sure a single PDF looks as it should.

Gibbon1 · on Aug 10, 2020

I find that modern web UIs are increasingly shitty on the desktop. In particular increasingly dog slow. But well designed pdf documents are still very good. And you can save a pdf locally.

Seriously html is used mostly for delivering spam and porn. And pdf's excel for technical documents.

Wowfunhappy · on Aug 10, 2020

> Seriously html is used mostly for delivering spam and porn.

Oh come now. I know we all like to hate on the modern web but this just isn't true. Wikipedia, CNN, Amazon, etc all use HTML.

rumanator · on Aug 11, 2020

> But well designed pdf documents are still very good.

You're just spewing a tautology. I mean, a well designed thing is still very good? Come on.

> And you can save a pdf locally.

You can also save epub and even HTML docs locally. That doesn't add much to the discussion.

> Seriously html is used mostly for delivering spam and porn.

It sounds like you're trying to force a morality-based argument to compensate for your lack of meaningful, rational points to make in favour of PDFs.

> And pdf's excel for technical documents.

They really don't. PDFs show good results in documents intended to be printed on paper following a very specific format, or whose main purpose is to deliver high-resolution vector graphics content intended to be printed.

Once your use case consists of consumption with an electronic device, which involves delivering reflowable content that reflects personalized settings such as reader-specific accessibility settings and device properties, PDF fails to be an adequate option.

egwor · on Aug 11, 2020

I’m going to bite. The web site we’re discussing is full of poor points too. The person you are responding to makes some good points.

- once you’ve saved it as a pdf you can email it to anyone. Most (all?) phones support it out of the box. They can then print it out and you can be sure that the entire page is rendered correctly. If you try and save that as an ePub, lots of people won’t know how to open it. For HTML files, the default save format is often browser-specific. Even more often there’s some dependency that means that the page doesn’t actually render after the web site changes (e.g. it missed a dependency or because of JavaScript use or because the format changed between versions). This is one of the reasons why tickets for events come via pdf.

- read only. Want to send your CV and make it difficult for the recruiter to edit. Perfect solution is to use pdf. You also know that the pdf will look the same to the people reading it (and printing it) as when you created it.

- want to create a pdf really easily? Print to pdf. Done. Want to reliably do that to HTML. Good luck. Often there are weird issues that pop up. It often doesn’t render properly once the complexity of the document becomes slightly involved.

- PDF’s excel for technical documents. Yeah, I agree with this. They’re great at providing professionally rendered files, guaranteeing a rendered look across technical fields. This is why they’re used in research. Your argument boils down to ‘the pdf requires a very specific format’. Yes. That’s the intent behind pdf. It doesn’t provide ways to re-render the document, and I don’t think that it was intended to be used that way. Should it, so that less able folk are able to access content: yes.

- All my lecturers used to use ps/pdf for their lecture slides. It worked very well. It got the info across and we could get a copy of the notes. All done.

I feel like the only strong argument here is that it would be nice for the file to render for accessibility and for differing sized screens. That would be nice. Sounds like a great challenge.

Overall this article is a bait article, and I don’t think it should be on hackernews. The stuffed with fluff argument basically boils down to PDF’s are bad because for the PDF’s we’ve looked at “Authors don’t use bullet points”. Cummon! Let’s have some intelligently thought out arguments for a sensible discussion.

Spooky23 · on Aug 11, 2020

I think that you're 100% correct. IMO the general grousing about PDF is the same "get off my lawn" nonsense as people grousing about the ribbon in Microsoft Office.

PDF is not the ideal mechanism for making a webpage or generally browsable thing. It's great for creating portable documents that look and perform the same over time. You can go into an archive in the UK and if preserved, read a legal filing submitted in the 1600s, and understand what it says. Likewise, if preserved, our successors will be able to look at digital PDF/A US Federal court filings in the year 2400 and understand what it says.

We already have content that is essentially lost from 15-40 years ago due to file format issues.

xscott · on Aug 10, 2020

Other than fillable forms, which is sometimes very important, I don't see how PDF is much of an improvement over PostScript (which came first).

jcrawfordor · on Aug 10, 2020

A very simple answer: Xorg has integrated PostScript support which makes rendering .ps files very easy on Linux. Very few tools were ever developed to do this on Windows, to the extent that using Ghostscript ported from Linux is still a common approach. It's still a pain to deal with PostScript files on Windows, and obvious tricks like using ports of Linux viewers that support .ps generally don't work on Windows because those viewers were just leaning on Xorg to do the hard part.

PDF would have a similar problem, but Adobe leveraged their previous work on other products so they basically already had the rendering engine for Windows and it gained traction there.

Keep in mind that both Postscript and PDF were principally designed by Adobe. Adobe designed both because they were intended for different purposes, and this stands today.

kbr2000 · on Aug 11, 2020

Hey, I'd like to read more about this integrated PostScript support for Xorg, can you point me in the right direction please? Tnx

lmz · on Aug 11, 2020

There you go: https://docs.oracle.com/cd/E19683-01/816-0279/dps-91433/inde...

Except it's not Xorg but Sun's own server, and it's not Linux but Solaris.

jcrawfordor · on Aug 15, 2020

The feature is called Display PostScript (DPS). It is basically gone today and was not widespread on Linux, but Adobe maintained a version which they commonly packaged with their products and which was a key part of the genesis of the PDF format. These days it is mostly replaced by Ghostscript.

thrtythreeforty · on Aug 10, 2020

Embedded fonts are a big one. PDF lets you embed straight TrueType/OpenType, preserving all the ligatures, kerning, etc. With Postscript you have to convert. Maybe this would be hassle free, but I'd be willing to bet there'd be a lot of edge cases that prevent it from Just Working.

microcolonel · on Aug 10, 2020

> preserving all the ligatures, kerning

Far as I can tell, most of the time (all of the time?) PDFs just seem to throw out this information, and manually select and place glyphs.

rietta · on Aug 10, 2020

PDFs also serve as a decent interchange format for dropping "signatures" onto a document. These are acceptable in court and in a format that regular business people understand. This is important for digitized contracts that can then also be archived by both parties.

ThePadawan · on Aug 10, 2020

My $0.02 on this:

A work colleague worked on a document signing solution for a client once. Legally, at the time (and I hope this has improved), when a person added their digital signature to a document, that meant that they signed that exact version (read: hash of all the bytes) of the document.

That meant that PDF was sort-of problematic for the use case that the customer required: Giving the customer an A4 version to keep for their documentation was important - but having an A4 version on screen made for terrible scaling UX on mobile and tablet devices.

The fact that PDF is more than just text+formatting in that manner was a real hindrance at that point in time (2017).

(I'd be happy to know if I got any of this wrong after only hearing about it second-hand. This was in Switzerland, if this affects which laws were relevant at the time.).

pfranz · on Aug 11, 2020

I think it's been around since the early 2000s, but a few months ago I got very tripped up by this for the first time. I got a PDF I needed to sign and send back (I usually drop an image of my signature and fax or reply). I actually needed to submit a bunch of these forms with varying dates. I couldn't open the PDF with macOS' Preview or any web reader (which happens from time to time). I reluctantly downloaded Acrobat Reader. Filled it out, "signed it" and tried to change the date and Save As. It wouldn't let me and it wasn't clear why. I thought I just didn't know how to undo or select and delete the signature because I'm used to PDFs being text + images.

It turns out to be a feature where adding a "signature" locks the document. They suggest if you need to modify it you request a new document.

I'm probably reading the spec wrong, but it might have been added in PDF 1.5 (Aug 2003?)

quietbritishjim · on Aug 11, 2020

The signature is cryptographically based on the exact document (and that person's certificate) so you can prove precisely what they signed. That is very much the whole point of the signature feature. It wouldn't be much use if, when you tried to challenge someone later, they could say "oh no, that's not the version I saw, you must have changed it after I signed".
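To make the "exact document" part concrete, here is a small sketch of just the digest step. It is not a PDF signature: as far as I know a real one also involves the signer's certificate and a signature container embedded in the file, none of which is shown here. The byte strings and the function name `document_digest` are invented for the example.

```python
import hashlib

def document_digest(document_bytes: bytes) -> str:
    """Hash the exact bytes of a document; any change yields a different digest."""
    return hashlib.sha256(document_bytes).hexdigest()

# Stand-ins for two versions of a form that differ only in the date.
signed_version = b"%PDF-1.4 ... agreement dated 2020-08-10 ..."
edited_version = signed_version.replace(b"2020-08-10", b"2020-08-11")

signed_digest = document_digest(signed_version)
edited_digest = document_digest(edited_version)
unchanged = signed_digest == edited_digest  # False: the edit is detectable
```

This is why changing the date after signing cannot simply be allowed: the signature covers a digest of the bytes that were signed, so an edited file no longer matches it.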

pfranz · on Aug 11, 2020

Sure, my frustration is that a) it required Adobe's notoriously terrible software, b) the implications of what I was doing weren't obvious, and c) it wasn't reversible/undoable.

I think the concept is great and was just passing along that it existed in PDFs to the parent. Personally, having signed probably 100s of PDFs over the years I had never encountered it. I've only seen the web-based DocuSign. In every case (including this one) faxing the document back was acceptable, which breaks this. I am all for improving chain of trust, but it's not very helpful if the user doesn't understand what they're doing and as someone who likes tech I tend to want a bit more control than most people.

izacus · on Aug 11, 2020

The behaviour you're describing is literally the reason why PDF is still widely used - once you cryptographically SIGN the document you can verify that its content wasn't changed anymore after signing it and both parties can check that.

Incredibly important in many business processes not to mention signing contracts.

sjy · on Aug 11, 2020

In my experience digital signatures are not widely used in “business processes” facilitated by people sending around PDFs. Most are signed with a pen and scanned, or signed by embedding an image. Such signatures have value as a signal of the intent to form a contract, even though it is well known that they do not guarantee authenticity or non-repudiation. Digital signatures don’t add much unless the signer publishes their public key and can prove that a new key with their name on it is inauthentic, and the authenticity of the message cannot be inferred from the surrounding circumstances.

rietta · on Aug 11, 2020

I can attest to the same. I've signed security contracts with government agencies by taking their PDF, opening it in Apple Preview, attaching an image of my signature, and sending it back. Same with Xournal in Linux. This is how real world, big money contracts get signed. The email history probably attests more to the bona fides of the document than the document itself. It's up to the signers to spot differences visually.

Another tactic is contracts signed by another authorizing action, with language like "your check is as good as your signature" or "Under the U.S. Uniform Electronic Transactions Act (UETA), this Agreement is executed electronically when both parties agree via e-mail, an Internet web page, or other electronic means and the Client pays the deposit as set forth in..."

A simple email saying I agree is acceptable under US law. No cryptographic PDF features necessary.

jdhawk · on Aug 11, 2020

Preview's signature import via WebCam makes it even easier.

I'd never heard of Xournal, thanks!

izacus · on Aug 11, 2020

I worked with PDF libraries professionally and we had plenty of customers that requested (and paid for) such functionality for their business processes.

Also note that those "hand signed" PDFs tend to not hold up to legal scrutiny unless they're also digitally signed.

sjy · on Aug 11, 2020

I'm not sure what jurisdiction you are from but this is certainly not how it works in any legal system I've heard of. In this article published by an Australian law firm [1], the "electronic signatures" routinely used in commerce are clearly distinguished from true "digital signatures" in footnote 1. Digital signatures are only convincing if you understand how they work, and most people executing transactions don't.

[1] https://www.allens.com.au/globalassets/pdfs/insights/xmedia/...

floatboth · on Aug 11, 2020

For business, you'd usually use the government's PKI for authenticity.

hnick · on Aug 11, 2020

One problem is that Postscript is Turing complete, so it can be unreliable to parse. Not a great outcome if you just want to view a document.

However, in reality I've often found it easier to write throwaway perl scripts to analyse/modify PS files rather than PDF. Writing a tool to target a set of PS files made by the same process usually isn't too hard unless there is something really unique going on - but it can be a problem generalising it to any random PS file.

PDF is more structured so it can be easier to make general purpose tools, but in my line of work we prefer version 1.4 since the later versions add bloat that isn't necessary for print. It's also usually easier to consider them append-only, it's trivial to add content but editing is a lot harder due to offsets and references.
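A toy illustration of the offsets problem, under heavy assumptions: this lays out PDF-style indirect objects and a classic cross-reference table, but it is not a complete valid PDF (no trailer, no `startxref`), and the helper names `layout` and `xref_table` are made up.

```python
def layout(objects):
    """Serialize PDF-style indirect objects and record each one's byte offset."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    return bytes(out), offsets

def xref_table(offsets):
    """Format offsets as a classic cross-reference table (20 bytes per entry)."""
    lines = [b"xref\n", b"0 %d\n" % (len(offsets) + 1), b"0000000000 65535 f \n"]
    lines += [b"%010d 00000 n \n" % offset for offset in offsets]
    return b"".join(lines)

original = b"<< /Type /Catalog /Pages 2 0 R >>"
grown = b"<< /Type /Catalog /Pages 2 0 R /Lang (en) >>"
pages = b"<< /Type /Pages /Kids [] /Count 0 >>"

_, before = layout([original, pages])
_, after = layout([grown, pages])
shift = after[1] - before[1]  # object 2 moved by exactly the bytes added to object 1
```

Editing object 1 in place moves every later object, so every later entry in the table has to be rewritten. Appending a new revision of the object at the end of the file, with a new table pointing at it, leaves all existing offsets alone, which is why append-only is the easy path.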

dragontamer · on Aug 10, 2020

PDFs work on Windows. Postscript just never had a big Windows adoption rate.

From my understanding, PDF is largely based on Postscript. PDF is to Postscript as HTML5 is to XHTML or HTML4.

p1necone · on Aug 11, 2020

I find PDFs easier than other formats for reading long texts - I can have a page (or two) displayed at a time, they're always cleanly broken up because everyone sees the same page size, and I can press a keyboard key to move to the next page.

smoe · on Aug 10, 2020

> 4. Stuffed with fluff. PDFs tend to lack real substance, compared to regular web pages.

The exact opposite is the case in my experience. Unfortunately the actual substance is often in a PDF and all the web pages pointing to it are superficial, copy and pasted and/or clickbaity fluff.

They then go on about how in web sites the content can be better structured and navigated. Unless I'm misunderstanding the word in English, what has that to do with whether the content has substance?

> [...] This leads to overwhelmingly long and inane PDFs

You mean something akin to a book?

red_admiral · on Aug 10, 2020

Couldn't agree more! I've yet to see a PDF that has animated ads, pops up a subscribe-to-our-newsletter modal halfway through, or even autoplays video a few seconds after you've started reading it.

I mean I suppose you could do all that with embedded JS, in theory, but one of the nice things about PDF is it mostly works absolutely fine with scripts turned off.

ljm · on Aug 10, 2020

I've seen web based documentation that is infinitely worse than the 200 page PDF your bank in the EU will give you to implement SEPA integration.

That stuff is thorough as hell and you even get schema definitions for all of it.

I'd pick that over poorly explained or 'discoverable' alternatives.

TedDoesntTalk · on Aug 10, 2020

They also wrote:

> and boring to read.

Not only is that subjective, but how is that relevant?

ImaCake · on Aug 11, 2020

Nielsen Norman Group has clearly chosen their opinion and then found evidence that supports it. It is a shame too. If they discarded their bias, maybe they could take some of the real problem points about PDFs and make a solid persuasive argument why we should try to fix those problems.

agumonkey · on Aug 11, 2020

yeah I want to see what internet the author is browsing because for the last 5 years all I could see is vacuum on the web. headers footer side-ads privacy-popups massive-intro-photo and somewhere lost in all of this, a paragraph. Quite often the content could fit in a tweet.

notriddle · on Aug 10, 2020

> The exact opposite is the case in my experience. Unfortunately the actual substance is often in a PDF and all the web pages pointing to it are superficial, copy and pasted and/or clickbaity fluff.

That's entirely cultural crap, and has little to do with either format. Or do you think that this HN comments page would be better distributed in PDF form?

smoe · on Aug 10, 2020

I'm not saying one format is better than the other, or it having to do anything with the format.

The word "Unfortunately" is there on purpose. I often have to sift through PDFs where it doesn't make sense to have the information only there.

What I'm disagreeing with is that PDFs unlike web pages lack substance. In my opinion the substance is often in the PDFs not because of the format, but how the information is produced.

E.g. within a government or enterprise the content could come from anywhere within the org structure, often multiple intermediaries away from the people putting stuff on the website. Everyone knows basic MS Word. Onboarding potentially hundreds or thousands of employees to a CMS and sending them to a "how to craft effective digital content" course, which is what the Nielsen article is ultimately selling, is not always feasible. Only select pieces get a web treatment; the rest gets summarized if not just linked. Newspapers have pipelines from Word to digital publishing tools to print / online because of this. But this setup is not easy either.

I just recently needed some specific information about traveling to and quarantine in Switzerland. The news sites were useless, the linked government web page was useless. Only the original PDF at the end of the link chain contained the information.

I'd prefer having this information more easily accessible/searchable. But as it stands, the substance is often in the PDFs, not the web pages.

nattaylor · on Aug 10, 2020

Like the authors, I find that content published as a PDF is often extremely verbose, almost like the authors are paid by the page.

Government reports, or those prepared by consultants, are often the worst offenders.

bachmeier · on Aug 10, 2020

Honestly, that's just silly. If they'd use html instead of pdf, you'd still have the same content. pdf is a format. It has nothing to do with the content.

znpy · on Aug 10, 2020

If we used html instead of pdf the whole society would collapse.

Didn't anyone notice that it's basically impossible to save an html page today and have it load and render correctly and offline tomorrow?

Tade0 · on Aug 10, 2020

I've put together a small tool that, using *.har files, by accident also does this[0], but it only works for simple sites like Wikipedia.

[0] https://github.com/Tade0/emergency-poncho

enriquto · on Aug 11, 2020

It's funny that you mention wikipedia, since one of my favorite wikipedia features is the "download as PDF" link, by which you can obtain any article in a beautiful and readable form.

ljcn · on Aug 10, 2020

Not all html pages - what about hackernews? It's mostly tables with minimal CSS (a bit of padding and font*/color), I bet it continues to be perfectly rendered practically indefinitely. At least snapshots on archive.org from 2007 still look perfect.

j88439h84 · on Aug 10, 2020

SingleFile can do it.

kevincox · on Aug 10, 2020

Perfect fidelity isn't there but all popular browsers have a "save page" functionality which seems to work really well.

oconnor663 · on Aug 10, 2020

That's exactly what the parent is criticizing. The problem with save page is that the HTML you save still contains tons of links to server resources, particularly CSS and JS. Of course those links will work if you look at the saved page immediately after you save it. The problem is that if you come back later, sometimes even just the next day, they no longer work. A lot of JS file names are auto-generated random numbers, produced by packaging systems rather than humans, which change whenever the developers edit their JS. They aren't designed to be stable.

There are tools that try to fetch those links and update the HTML to point to the local copy. But those tools can only go so far. JS is allowed to fetch new files dynamically, and there's no reliable way to look at a piece of code and automatically figure out what it's going to fetch when you run it.
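As a rough way to see the first half of this problem for yourself, here is a sketch that lists which resources in a saved page still point at a server. The tag/attribute table is a simplification, the sample markup and URL are invented, and by construction it can only see what is written in the HTML, not what a script will fetch at runtime.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class RemoteResourceFinder(HTMLParser):
    """Collect src/href values on resource tags that still name a remote host."""

    RESOURCE_ATTRS = {"script": "src", "img": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.remote = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            # A non-empty netloc means the URL names a host (http://, https://, or //cdn...).
            if name == wanted and value and urlparse(value).netloc:
                self.remote.append(value)

def remote_resources(html_text):
    finder = RemoteResourceFinder()
    finder.feed(html_text)
    return finder.remote

# Hypothetical saved page: two resources were localized, one script was not.
saved_page = (
    '<html><head><link rel="stylesheet" href="page_files/style.css">'
    '<script src="https://example.com/static/app.3f9a1c.js"></script></head>'
    '<body><img src="page_files/logo.png"></body></html>'
)
still_remote = remote_resources(saved_page)
```

Anything in `still_remote` is a dependency that can disappear the next time the site is redeployed. The dynamically fetched files are the harder half, since they never appear in the markup at all.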

kindofastrawman · on Aug 10, 2020

> JS is allowed to fetch new files dynamically, and there's no reliable way to look at a piece of code and automatically figure out what it's going to fetch when you run it.

You've diverged from the context and are no longer doing an apples-to-apples comparison. The things you're describing are all opt-in and amount to having to deal with an adversarial input. There's nothing inherent to the medium that requires those things.

In other words, a person publishing a PDF is already abstaining from certain things. (Namely, the sorts of things you're bringing up that would make for a pathological case.) If the person who publishes a PDF does a straightforward translation into a web page, then you end up with something that doesn't exhibit any of the downsides you're discussing.

anoncake · on Aug 11, 2020

No, but the medium allows these things. And that's a problem.

oconnor663 · on Aug 10, 2020

Good point, and also relevant user name :)

kevincox · on Aug 10, 2020

No, most browsers will save the resources as well and rewrite the HTML to reference them. You can have problems with dynamically loaded things but I have found that it works very well in practice. I have had maybe one page that was significantly broken saving from Firefox over the years.

znpy · on Aug 13, 2020

Thanks dude, it's nice to see that there still are people that can read a text and understand the point.

spear · on Aug 10, 2020

I've found the best way to save a page on a browser is to print it ... to PDF.

_Microft · on Aug 11, 2020

Absolutely, depending on how much I care about the content, I either print it directly from the reader mode (which gives pretty bland results) or I touch up the page itself with things like "column-count: 2" and a few changes to headlines, to give it the look of a proper print article. Either way, printing to PDFs is a great way to archive/save web content for later.

stOneskull · on Aug 10, 2020

it's quite nice this way. much better than the old .mht file even. it skips the junk.

znpy · on Aug 11, 2020

This is brilliant... I hadn't thought about it.

a-priori · on Aug 10, 2020

The constraints and expectations of the medium strongly influence the content.

If I had an idea and communicated it by recorded video, by live video, by blog post, by Twitter thread, and by HN comment, the same idea would be presented in very different ways.

In the same way, a writer who publishes something by HTML (blog post, etc.) will produce a very different document than if they intend to publish it by PDF (ebook, etc.). They tailor their message to the constraint and expectations of the medium.

formerly_proven · on Aug 10, 2020

In Gov't or consulting reports the content tends to start somewhere around page 20-30.

MattGaiser · on Aug 10, 2020

I had a capstone project in university where the clearest way to generate a better grade was to generate more pages.

We put all sorts of rubbish in the report to make sure it made a "thunk" sound when we handed it in. It was nearly 300 pages when it should have been 90.

The problem is, plenty will judge a report on its thickness. "It is thick, so it must be comprehensive." What percentage of government reports are read cover to cover and what percentage are just ctrl+F'd through?

II2II · on Aug 10, 2020

I ran into a similar situation, but decided to submit a short report anyhow. It earned one of my best grades in university. When I asked about the grade, since I ignored multiple guidelines, the response was that I said a lot more than most people even though I wrote less. It probably had something to do with my admiration of concise writing. It is something that I wish that I could accomplish more often.

leephillips · on Aug 10, 2020

Good for your professor. I always refused to impose length requirements, but I would say what a typical length for the assignment would be (always in number of words, never “pages”). If you did the job in significantly fewer words, that earned you extra points. If you went long, but every word counted, you also got extra points. But any padding, wasting my time with unnecessary words, meant a penalty.

fwip · on Aug 10, 2020

If reports are often ctrl-F'd through for relevant information, it seems likely that many people consuming it are reading far fewer than 90 pages in total - and wouldn't have read the full shorter report.

Perhaps it is better to be comprehensive in government reports than concise, to accommodate a variety of readers who want to drill into different aspects of the report.

(Of course, a PDF may not be the best structure for this! A well-formatted HTML reference with appropriate hyperlinks may be much more useful.)

asdff · on Aug 10, 2020

Or a PDF document with a table of contents. Look at this clickable beauty containing a wealth of information in tidy categories:

http://media.metro.net/about_us/vision-2028/report_metro_vis...

formerly_proven · on Aug 10, 2020

Yeah, but who is actually going to read a 7000 page report on torture or surveillance? (Assuming these reports were actually published, which they were not)

hrktb · on Aug 10, 2020

I think that's less on the format and more on the intent. For reports and what is traditionally viewed as "written form", reader is expected to have high tolerance for length and boilerplate.

I see it as the counterpart of the "younger generations can't read anymore" critique, where lengthy, rambling and diluted prose is becoming harder and harder to parse and focus on.

On the other side page load speed and attention grabbing metrics are thoroughly studied for web pages and people value terseness, to the point of loathing click baits and endless listicles.

leephillips · on Aug 10, 2020

Have you ever visited https://arxiv.org/?

smoe · on Aug 10, 2020

Content is often written to a predefined length, no matter how it is going to be published. E.g. some newspapers, even if there is no print version anymore, still decide upfront how long a story is going to be. And food bloggers will fluff up a basic oatmeal recipe with 3 pages worth of childhood stories before getting to the point.

Even if verbose, those PDFs are often still the only place where the relevant substance is together. The websites referring to them then cherry-pick from it. I spent a whole lot of time sifting through government PDFs over the last couple of months because it was the only way to get to the information I needed.

It would be much easier if the content were available in different formats.

rayiner · on Aug 10, 2020

Incorrect. Modern web pages are garbage and PDFs are far better. No auto-play animations, no animations at all, no bizarre hijacking of scrolling, etc. A multi-hundred page PDF loads in a blink of an eye compared to an advertising tracker-loaded web page.

Screen size-adaptability and reflow remain a problem. It would be better to fix that on the PDF end than to move those uses over to inferior web technologies.

pcwalton · on Aug 10, 2020

I don't think reflow is really "fixable" in PDF. PDF's model is fundamentally based on absolute coordinates and transforms for everything, as it's descended from a language for printers. Adding client-side layout to that radically alters the entire design in ways that would make it not PDF anymore.

When you say "a multi-hundred page PDF loads in a blink of an eye compared to an advertising tracker-loaded web page", consider why that is. The basic reason is that every page in PDF can be rendered individually. (In fact, the top-level grouping in PDF is the physical page instead of the semantic model of HTML.) This is only possible because PDF has no layout! When you introduce client-side layout, the client must lay out every preceding page to render any given page, because the locations of page breaks depend on characteristics of the client device, creating a sequential dependency. If you were to somehow add layout to PDF, the sequential dependency would be there too; there's nothing magical about PDF that would prevent it from inheriting the problems of HTML.

Finally, PDF does have animations and scripting (with multiple JavaScript engines). In fact, it even has 3D (old-school VRML-style 3D, not the flexible immediate-mode GPU APIs browsers have). You'd be amazed how bloated PDF is!
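
To make the sequential dependency concrete, here is a minimal Python sketch. It is a toy model, not how any real PDF reader or browser engine works: `render_fixed_page` and `reflow_page` are hypothetical names, "pages" are just lists of text lines, and the standard library's `textwrap` stands in for a layout engine.

```python
import textwrap


def render_fixed_page(pages, index):
    """PDF-like model: every page was laid out ahead of time,
    so fetching page `index` touches nothing else."""
    return pages[index]


def reflow_page(paragraphs, width, lines_per_page, page_index):
    """Reflow model: page breaks depend on the client's width,
    so everything before the requested page must be laid out first."""
    lines = []
    needed = (page_index + 1) * lines_per_page
    for paragraph in paragraphs:
        lines.extend(textwrap.wrap(paragraph, width=width))
        if len(lines) >= needed:
            break  # later content cannot move earlier page breaks in this toy model
    start = page_index * lines_per_page
    return lines[start:start + lines_per_page]


paragraphs = ["alpha beta gamma delta", "epsilon zeta eta theta"]
narrow = reflow_page(paragraphs, width=11, lines_per_page=2, page_index=1)
wide = reflow_page(paragraphs, width=22, lines_per_page=2, page_index=0)
```

The fixed-page lookup costs the same for page 3 as for page 300, while the reflow version has to wrap every paragraph up to the requested page, and the result changes whenever `width` does.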

scrollaway · on Aug 10, 2020

I'd like to see you try to have a conversation on a tech & startup news aggregator built in PDF, see how quickly your reader loads it then. You're talking about PDF like the only documents you've seen are printed from LaTeX / Chrome, but PDF supports forms, javascript, 3D models and more.

PDF is an atrociously bad format, and I don't know what "multi-hundred page PDF loads in the blink of an eye" means for you, but even a 100 blank page PDF takes nearly a second to fully load on my beefy rig (I did the test a few months back to prove a point). [Edit: Other commenters made the clarification below, but single page render time is not the same as document render time]

Clearly extracting text from a PDF is nearly as difficult as extracting it from a photo. Digitally extracting information from PDFs in general is awful, which makes the format awful for the various things it's used for.

Not to mention that many uninformed users today still install the garbage / malware PDF readers such as Acrobat because they don't know any better.

rayiner · on Aug 10, 2020

> I'd like to see you try to have a conversation on a tech & startup news aggregator built in PDF, see how quickly your reader loads it then.

Sure. I agree we shouldn’t replace interactive web apps with PDF.

gspr · on Aug 10, 2020

> I don't know what "multi-hundred page PDF loads in the blink of an eye" for you but even a 100 blank page PDF takes nearly a second to fully load on my beefy rig (I did the test a few months back to prove a point).

The manual for PGF/TikZ [1] is a huge PDF I frequently open. It's more than 1300 pages and has lots of graphics. It opens and navigates in the blink of an eye on my 3 year old laptop (with the Okular reader). PDFs aren't perfect, but they sure feel spiffy compared to modern webpages.

I do agree with some of the article's complaints, but not this one.

[1] http://mirrors.ctan.org/graphics/pgf/base/doc/pgfmanual.pdf

7952 · on Aug 10, 2020

The speed depends a lot on how the PDF is structured. If you export a complex CAD drawing you may have a ridiculous amount of detail that has to be fully rendered before the page can be viewed. Or you can have very simple PDFs that are just a few images.

tehabe · on Aug 10, 2020

That is created using LuaTeX and I'm sure the sources behind that PDF document are carefully crafted and LuaTeX works really well. But if you did the same document with the same amount of images in Microsoft Word and created a PDF document, it would be much, much bigger and it wouldn't load that quickly.

I will take the last part back, if someone can prove that I'm wrong about Word and PDF documents.

gspr · on Aug 10, 2020

In that case it sounds like a problem with Word and not with PDF.

I wouldn't know – most PDFs I consume are generated by some variant of TeX. I gave a random 300-page datasheet I have lying around a go. It says it was made with Acrobat Distiller and "C2 Rendition". Feels just as spiffy as the PGF/TikZ manual.

tehabe · on Aug 10, 2020

All that I wanted to say is: not all PDF documents are created equal, some are really well made and some are just awful.

athriren · on Aug 10, 2020

What reader do you use?

gspr · on Aug 10, 2020

Okular [1]. It's strange; I'm a KDE user and big fan of the core DE, but I find almost all the KDE software outside of that core DE nearly unusable. Except Okular – it's by far the best PDF reader I know. I guess credit goes to Poppler for the heavy lifting [2].

[1] https://okular.kde.org/

[2] https://poppler.freedesktop.org/

leephillips · on Aug 10, 2020

Try zathura, if you haven’t. Super fast, and keyboard oriented.

savingsPossible · on Aug 10, 2020

thanks! really cool

andrepd · on Aug 10, 2020

I use qpdfview and it works very well.

andrepd · on Aug 10, 2020

I'm pretty sure he's opening it in the browser.

avereveard · on Aug 10, 2020

weird, my Alfa Romeo user manual is 270 pages filled with graphics (literally, they are jpeg scanned to a pdf) and loads instantly even on my mobile phone

corty · on Aug 10, 2020

The first page is rendered instantly you mean. PDF, at least when generated by a sane generator, can be parsed pagewise. HTML cannot, you always have to parse everything in a page to do layout, because later objects can change or overlay earlier ones.

Santosh83 · on Aug 10, 2020

This is being partly addressed in the latest draft CSS specifications...

https://news.ycombinator.com/item?id=24093273

goto11 · on Aug 11, 2020

> HTML cannot, you always have to parse everything in a page to do layout, because later objects can change or overlay earlier ones.

HTML is progressively rendered by default. This has been a feature since Netscape 1.0! It is only if you use certain types of layout that this is not possible. For example, an adaptive table has to be fully loaded before the width of the columns can be calculated.

im3w1l · on Aug 11, 2020

And it was a very important feature too, back when internet was slow.

avereveard · on Aug 10, 2020

https://streamable.com/erlsy6

I rest my case

JKCalhoun · on Aug 10, 2020

Is PDF still unstreamable? AFAIK, the TOC (catalog?) in a PDF was located at the end of the file, meaning the whole PDF had to come down in order to parse the PDF. (With the exception of the first page, as you say — some aspect of the PDF spec allowed for a self-contained page 1.)

hnick · on Aug 11, 2020

Linearized PDF has existed as a concept since v1.2 which I think was released in 1996.

You can see it mentioned in the v1.4 spec at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pd... [follow the link from the contents].
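
For the "table at the end of the file" point, a rough Python sketch of the two checks involved. This is a heuristic under stated assumptions, not a parser: a conventional PDF ends with a `startxref` keyword, a byte offset, and `%%EOF`, while a linearized file carries a `/Linearized` dictionary near the start. The function names, the 1024-byte windows, and the toy document are all made up for illustration.

```python
import re


def build_toy_pdf() -> bytes:
    """Assemble a tiny PDF-shaped byte string (illustrative only, not a usable document)."""
    body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_offset = len(body)  # the cross-reference table starts right after the body
    xref = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
    tail = b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
    return body + xref + tail


def find_startxref(pdf_bytes: bytes, tail_size: int = 1024) -> int:
    """Read the cross-reference offset recorded near the END of the file."""
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near end of file")
    return int(matches[-1])  # the last marker wins if there are several


def looks_linearized(pdf_bytes: bytes, head_size: int = 1024) -> bool:
    """Heuristic: look for a /Linearized key near the START of the file."""
    return b"/Linearized" in pdf_bytes[:head_size]


pdf = build_toy_pdf()
offset = find_startxref(pdf)
```

A reader that only has the first few kilobytes of a non-linearized file cannot run `find_startxref` yet, which is the streaming problem described above; the linearization hint at the front is what lets a viewer start on page 1 before the rest arrives.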

nogridbag · on Aug 10, 2020

There are libraries to linearize PDFs but not all PDFs can be converted. Some of the more popular open source PDF libraries do not support it though.

MaxBarraclough · on Aug 10, 2020

Even PDFs can be inexplicably bloated. This cropped up in an HN discussion a week ago. There was an 11MB bloated PDF, and a 500KB PDF, of the same article, with no visible difference between the two.

https://news.ycombinator.com/item?id=24035955

> No auto-play animations, no animations at all, no bizarre hijacking of scrolling

HackerNews commits none of these sins. They aren't universal to the modern web, even if they're annoyingly prevalent. Given sufficient incompetence, both PDFs and websites can be bloated monstrosities.

ImaCake · on Aug 11, 2020

I like your conclusion here.

I've never needed an ad blocker for a PDF. But I also don't have a good pdf reader for all of my devices.

SV_BubbleTime · on Aug 10, 2020

Surely the technology exists to get all the things you mentioned without an insane spec that Adobe allowed to bloom out of control.... I mean... right!?

Please remember that PDFs are absolutely capable of running code and can be used to deploy the advertising / tracking you listed as an issue with webpages.

If you are part of Adobe's premier advertising / tracking club (whatever it's called), and the user is viewing with Acrobat, you can see what people printed, where they highlighted, how long they stayed on a page, where they accessed, etc etc.

That's more of a problem with Adobe than PDF itself (never use Acrobat!), but that's hardly a rare theme when it comes to Adobe.

kevin_thibedeau · on Aug 10, 2020

XSL-FO withered on the vine and we still don't have a suitable PDF replacement.

rietta · on Aug 10, 2020

I have to admit, after a decade of tablets, I am back to printing some PDFs, reading, making notes, and scanning back if I want. It's actually cheaper than continually upgrading the iPad ;-p I still have the tablet but it's not always my first choice.

tokai · on Aug 10, 2020

Both pdfs and modern js webpages can be bad for online reading at the same time.

onion2k · on Aug 10, 2020

> No auto-play animations, no animations at all, no bizarre hijacking of scrolling...

As annoying and obnoxious as animations and scrolljacking are, I prefer them to things like embedded viruses and SMB attacks that PDFs with embedded JS will happily run. https://www.sentinelone.com/blog/malicious-pdfs-revealing-te...

jrockway · on Aug 10, 2020

That's actually not true! PDFs can do all sorts of annoying stuff. Here's a PDF that I ran into that told me my PDF reader wasn't good enough: https://twitter.com/jrockway/status/1247153472895664128

GoblinSlayer · on Aug 12, 2020

Depends on what web pages you see. https://timsong-cpp.github.io/cppwp/n3337/ - how about this?

crazygringo · on Aug 10, 2020

Hard disagree. Also the author is arguing against a strawman.

Normal PDF's are simple, reliable, and interoperable.

In contrast to webpages which are actually more often the "clunky", "slow", "stuffed with fluff", and "disorienting" (with scroll hijacking) alternative.

But the strawman is people creating PDF content as an alternative to HTML. Practically nobody is doing that. Virtually every PDF out there is designed to be a printable document first, that is then made available on the web. Nobody is saying "how should we architect our new site -- I know, let's make all our pages PDF's!"

What a truly bizarre article.

visarga · on Aug 10, 2020

> Virtually every PDF out there is designed to be a printable document first, that is then made available on the web.

Tell that to Arxiv. Most papers never get printed. Everything is consumed on screen. Yet the layout is completely wrong for screens.

I think browsers should offer PDF reflow as HTML, to adapt to any screen width with optimal font size.

anigbrowl · on Aug 10, 2020

2 pages side-by-side on a sufficiently large screen looks great. I've only seen a few websites that flow text in columns and make graphics pleasant to interact with. Sure, many web browsers have reader mode but it's limited, clunky, and hard to configure.

Web designers have the idea that I want a big column of text running down the center and lots of whitespace to the sides, perhaps with sub-menus. This would look OK if I had my main monitor oriented vertically, but I don't and almost nobody does. As a result only about 50% of my screen space is working and I am constantly scrolling back and forth on long pages if I want to look back more than a paragraph or two.

I've developed a deep dislike of commercial graphic designers as a class of people because they took everything that was annoying about magazines and put it on steroids. Many graphic designers hate text and now we have a million interfaces that look superficially interesting but are deeply unpleasant to read.

ew6082 · on Aug 10, 2020

The use case for 99% of pdfs is email transfer. They are absolutely superior to sending a clunky, bloated MS Word or CAD document. The web archive is just the final resting place in the process that made them.

crazygringo · on Aug 10, 2020

Exactly.

With academic articles, I virtually never want to simply read them online.

I need to save them for future reference, read them later when I've set aside time, annotate them, refer back to my annotations four months later...

Arxiv (or JSTOR or wherever else) is just where you get the papers. It's not where most academics are going to be consuming them.

(For consumption, a full-size tablet like an iPad, with a stylus or Apple Pencil, is absolutely ideal.)

jpindar · on Aug 10, 2020

Same with electronic part datasheets. I need to be able to mark up and save datasheets along with the other documents which make up the design of a product.

ImaCake · on Aug 11, 2020

Can I ask what app you use on your iPad for reading PDFs? I use Adobe reader which is pretty good on iOS, but I find Preview a better experience.

sjy · on Aug 11, 2020

I have been maintaining a PDF library this way using GoodReader for about 10 years now. You can connect it to most cloud storage services or any SFTP or WebDAV server, and sync annotations with Acrobat, Preview, Okular, etc. on the desktop. I have still yet to find something this good for HTML or EPUB documents.

kindofastrawman · on Aug 10, 2020

> What a truly bizarre article.

Not at all. It would be bizarre if the uses of PDF that the article is meant to address didn't exist, but they do. Just look at https://berkshirehathaway.com for one example.

For a reference, we're a little over a week into the month so far. Yet when I check my browser history for PDFs, there are around 50 entries for August alone. Most of those instances are exactly what the author describes: cases where the format choice led to a worse experience than if that content had existed on a web page instead (or multiple ones). And as annoying as it is to try grappling with the format on a desktop screen, doing it on a smartphone would have been a non-starter, i.e. near 100% bounce rate.

s1artibartfast · on Aug 10, 2020

I wish more websites were organized the way berkshire's is. A simple dashboard directory followed by PDFs where appropriate.

My only objection is that I wish more of the content was in PDF, or at least had a PDF option.

https://www.sec.gov/Archives/edgar/data/1081316/000108131619...

When I download a 10-K, I don't want HTML to review on my phone. I want a PDF to read.

kindofastrawman · on Aug 10, 2020

This comment reads like it was written by a person who has taken a special case, something like having a 10-K in PDF format (and even then one that only appears to contradict the "other side", even though it really doesn't), and then constructed an entire (and entirely hypothetical) ideal out of it, just so they can relish spiting the person they're responding to. It's a crummy way to have a discussion and a crummy interaction to force on other people in general.