
That simply does not work. You can't sanitise, escape and reproduce correctly all at the same time.

Say you run a blog. I post a comment saying "But in this case, B<A!"

This is clearly dangerous input! But it is also exactly what I wanted to say. How do you sanitise this? Change < to &lt; in the database? Now you have to remember to NOT escape that again when outputting! And you have to make sure that, say, your text resources in your UI are all also escaped the exact same way, or you have to remember to escape them DIFFERENTLY than user-provided input.

Or maybe you "sanitise" by stripping out dangerous characters like "<". Now you have broken my comment.

The only strategy that is at all maintainable is to store the comment as received, and to escape on output. Anything else is massively fragile or broken.
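A minimal Python sketch of the store-as-received, escape-on-output strategy described above. The names `stored_comments`, `save_comment` and `render_comment` are hypothetical, and the list is only a stand-in for a real database; `html.escape` is the one real standard-library call here.

```python
import html

# Hypothetical stand-in for a database table of comments.
stored_comments = []

def save_comment(text: str) -> None:
    # Store the comment exactly as received: no stripping, no escaping.
    stored_comments.append(text)

def render_comment(text: str) -> str:
    # Escape once, at the point where the text is placed into HTML.
    return "<p>" + html.escape(text) + "</p>"

save_comment("But in this case, B<A!")
rendered = render_comment(stored_comments[0])
print(rendered)  # <p>But in this case, B&lt;A!</p>

# The fragility mentioned above: text that was already escaped on the
# way into storage gets escaped a second time on the way out.
double_escaped = html.escape(html.escape("B<A!"))
print(double_escaped)  # B&amp;lt;A!
```

The stored value stays byte-for-byte what the user typed, so a different output format can apply its own escaping later.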



> You can't sanitise, escape and reproduce correctly all at the same time.

That's why you do them at different times...

Let's go to the example:

> This is clearly dangerous input!

That's not clear at all. There is a set of values allowed for a comment, and this one is probably within it, while, for example, an empty value usually isn't, nor is an invalid UTF-8 sequence. This one should pass sanitization as is.

> Change < to &lt; in the database?

You escape it when converting into HTML. It's not the same as sanitization.
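A rough Python sketch of the input check this comment seems to describe: decide whether a value belongs to the allowed set for a comment, and leave HTML escaping to the conversion step. The function name `accept_comment` and the exact rules (valid UTF-8, not empty) are assumptions taken from the examples in the comment, not a complete policy.

```python
def accept_comment(raw: bytes) -> str:
    """Return the comment text unchanged, or raise ValueError if it is not allowed."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("comment is not valid UTF-8") from exc
    if not text.strip():
        raise ValueError("comment is empty")
    # Note that "<" is not treated specially: it is a legal character here.
    return text

accepted = accept_comment(b"But in this case, B<A!")
print(accepted)  # But in this case, B<A!
```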


I think what you are saying is to validate, not sanitize.

Sanitize (as I have understood it) usually means to "modify to be safe", while you are talking about rejecting invalid responses.


If your rules say comments will be truncated to 1000 chars, you do it on the input. If your rules say all prices are in dollars, but your frontend accepts other currencies, you convert on the input, and overstaff customer support.

Honestly, those names mean a lot of different things to different people. It's not good that there are so many; it's more a consequence of how widespread bad practices are.


I think what people who read the terms like me (sanitize means modify, validate means accept/reject) will think is that if your rule is "comments will be 1000 chars or less", then the validate reasoning would say to reject the 1001-char comment, while the sanitize reasoning would say to truncate it to 1000 chars.

Do you have a different reading of the terms?
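To make the two readings concrete, here is a small Python sketch of the 1000-char rule under each one. The names `validate_comment` and `sanitize_comment` are hypothetical and just mirror the terms as used in this comment.

```python
MAX_LEN = 1000  # the limit from the example rule

def validate_comment(text: str) -> str:
    # "Validate" reading: accept or reject, never modify.
    if len(text) > MAX_LEN:
        raise ValueError(f"comment is {len(text)} chars, limit is {MAX_LEN}")
    return text

def sanitize_comment(text: str) -> str:
    # "Sanitize" reading: modify the value until it fits the rule.
    return text[:MAX_LEN]

long_comment = "x" * 1001

print(len(sanitize_comment(long_comment)))  # 1000

try:
    validate_comment(long_comment)
except ValueError as exc:
    print("rejected:", exc)
```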


You seem to be agreeing with me. I am saying to escape, not sanitise. What I described is sanitising.


Maybe I'm overengineering, but couldn't you store the sanitized version as the normal value, and also store and make publicly available the original unsanitized value in an ominously and obviously named key (say, dangerouslyUnsanitizedValue) that happens to be easily greppable/lintable?


I think you are overengineering ;)

Plain text can contain anything and it shall be treated as such, it is that simple.

As for security, don't assume everything in your database came from a trusted source. Maybe there are remains from an old version of your code that didn't sanitize, maybe you improperly used admin tools that bypassed checks.


The idea that one string is more dangerous than another is the problem.


How would you determine which value to display? It seems to suffer from the same issue where if you display the sanitized value then the comment is still missing necessary characters, but if you use the unsanitized value then your application will be vulnerable to XSS.


In most cases, that would be overengineering, but it is an entirely plausible solution if you happen to have a case where you need to allow the user to enter things like angle brackets, and for some reason you cannot escape them.


Thanks for explaining this better than I could.


"Change < to &lt; in the database?"

Of course not. The fact that “<” is risky isn’t part of the string, it’s part of the output format (HTML).

If you were to write that string to JSON or CSV, you would have to special-case double quotes. In POSIX shell, asterisks and question marks need special attention, etc.
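As an illustration of that point, a short Python sketch that takes one unmodified string and encodes it for four output formats using only standard-library calls. The sample string and the helper name `to_csv_row` are made up for the example.

```python
import csv
import html
import io
import json
import shlex

text = 'say "B<A" *'  # stored as received; nothing in it is "dangerous" by itself

def to_csv_row(value: str) -> str:
    # Let the csv module decide how to quote the field.
    buf = io.StringIO()
    csv.writer(buf).writerow([value])
    return buf.getvalue()

as_html = html.escape(text)   # "<" and '"' are the special characters
as_json = json.dumps(text)    # '"' is special, "<" is not
as_csv = to_csv_row(text)     # '"' is doubled and the field is quoted
as_shell = shlex.quote(text)  # "*" and spaces are special, so the whole
                              # argument gets single-quoted

for label, value in [("html", as_html), ("json", as_json),
                     ("csv", as_csv.rstrip("\r\n")), ("shell", as_shell)]:
    print(f"{label:5} {value}")
```

Each encoder handles a different set of special characters, which is why the escaping cannot be baked into the stored string ahead of time.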


> This is clearly dangerous input! ...

You are missing the point.

You should sanitize the input when possible, so that numbers are really numbers, strings are really strings, slugs and similar are cleaned... But of course you can't clean text so that it will be safe when displayed. After all, `<` is only problematic if you are displaying the text as HTML, which, while common, is not a given.

When displaying anything, you should however use a _framework_ that doesn't allow you to display anything that would not be safe (unless you use some function with "UNSAFE" or "DANGEROUS" in its name). For example React does that, and others too.

There are many different kinds of attacks, and the less leeway an attacker has, the safer you are. So sanitize both input and output.


The solution: don't try to reproduce input exactly. That's a weird thing to want in general anyhow - what exactly are you reproducing so exactly? Text? Including markup? How about some animation thrown in? Maybe interactivity? Hey, let's just reproduce arbitrary executable code accurately?

The whole point of sane sanitization is that you don't need to reproduce all that stuff exactly. Pick a small domain, and reproduce that. Often, it's OK to reproduce approximately; e.g. not worrying about things like retaining multiple consecutive whitespaces, or perhaps leading/trailing whitespace, or whatever.

The point of sanitization is to make it easier not to make a dangerous mistake accidentally. If you have an input that needs to support layout, that's a pain. But if you can live with just text - so much the better. If you do need to support markup, then I don't see the wisdom in sanitizing it late; that's just asking for bugs to lead to security issues.

Frankly the whole tradeoff is nonsensical. These aren't mutually exclusive alternatives, and don't even really address the same issues. Yes, you should sanitize (and validate) your input. And you also need to escape output as appropriate.

If the point is that it's not wise to skip escaping because you "know" the input is safe due to sanitization - then sure, while theoretically sometimes sound, that's practically a nasty bug waiting to happen. Don't do that, sure.


So are you saying I should just not ever want to say "A < B"?


No; that's just you picking an absurd example rather than being practical.

Pick a reasonable domain for each input field, considering what kind of input is useful, and what kind of usage in output (i.e. plain text output is likely much less risky than rich text). There's rarely a reason to ban < in plain text; but retaining stuff like zero-width joiners or rtl-ltr-transitions is likely less valuable, and potentially an issue in things like usernames or email addresses (because they make it trivial to make apparently identical usernames). Similarly, if you're storing a telephone number and want to retain spaces - are you going to retain nul-chars too?

Not all input should allow arbitrary plain text. I'd guess most don't, and lots of input is at least rich text nowadays (not to mention images and other media - you think it's a good idea to just reproduce an arbitrary image exactly?).
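A rough Python sketch of picking a narrower domain per field, using the two examples from this comment. The function names and the exact rules are assumptions: usernames drop Unicode format characters (category `Cf`, which includes zero-width joiners and the bidi direction controls), and phone numbers are accepted only if they consist of digits, spaces and a few punctuation characters.

```python
import re
import unicodedata

def clean_username(name: str) -> str:
    # Drop invisible format characters (Unicode category "Cf"), e.g.
    # U+200D ZERO WIDTH JOINER and U+202E RIGHT-TO-LEFT OVERRIDE.
    return "".join(ch for ch in name if unicodedata.category(ch) != "Cf")

# Assumed phone-number domain: digits, spaces, "+", "-", "(" and ")".
PHONE_PATTERN = re.compile(r"[0-9+() -]+")

def is_valid_phone(number: str) -> bool:
    return PHONE_PATTERN.fullmatch(number) is not None

print(clean_username("al\u200dice") == clean_username("alice"))  # True
print(is_valid_phone("+1 555 0100"))    # True
print(is_valid_phone("555\x000100"))    # False: contains a NUL char
```

Neither rule touches `<`: a free-text comment field would keep a much wider domain and rely on escaping at output time.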



