That simply does not work. You can't sanitise, escape and reproduce correctly al...

marcosdumay · on Feb 27, 2020

> You can't sanitise, escape and reproduce correctly all at the same time.

That's why you do them at different times...

Let's go to the example:

> This is clearly dangerous input!

That's not clear at all. There is a set of values allowed for a comment, and this one is probably within it, while, for example, an empty value usually isn't, and neither is an invalid UTF-8 sequence. This one should pass sanitization as is.
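A minimal sketch of that kind of input check, assuming comments arrive as raw bytes and that the only rules are "valid UTF-8" and "not empty" (the function name `validate_comment` is hypothetical):

```python
def validate_comment(raw: bytes) -> str:
    """Accept or reject a comment; never modify its content."""
    try:
        # bytes.decode is strict by default, so invalid sequences raise.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("comment is not valid UTF-8") from exc
    if not text.strip():
        raise ValueError("comment is empty")
    # Angle brackets are allowed comment content, so they pass unchanged.
    return text
```

Under these assumed rules a comment containing `<script>` is accepted as is, because nothing about it falls outside the allowed set of values.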

> Change < to &lt; in the database?

You escape it when converting into HTML. It's not the same as sanitization.
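As a rough illustration, assuming the comment is stored unmodified and only escaped at the moment it is turned into HTML (the function name `render_comment` is made up for this sketch):

```python
import html


def render_comment(stored: str) -> str:
    """Escape at output time; the stored value itself is never changed."""
    # html.escape replaces &, <, > and (by default) both quote characters.
    return "<p>" + html.escape(stored) + "</p>"


stored_comment = "A < B"  # what sits in the database, untouched
page_fragment = render_comment(stored_comment)
```

The database keeps `A < B`, and only the HTML fragment contains `&lt;`.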

SahAssar · on Feb 27, 2020

I think what you are saying is to validate, not sanitize.

Sanitize (as I have understood it) usually means to "modify to be safe", while you are talking about rejecting invalid responses.

marcosdumay · on Feb 27, 2020

If your rules say comments will be truncated to 1000 chars, you do it on the input. If your rules say all prices are in dollars, but your frontend accepts other currencies, you convert on the input, and overstaff the consumer support.

Honestly, those names mean a lot of different things to different people. It's not good that there are so many; it's more a consequence of how widespread bad practices are.

SahAssar · on Feb 27, 2020

I think people who read the terms the way I do (sanitize means modify, validate means accept/reject) would say that if your rule is "comments will be 1000 chars or less", then the validate reasoning rejects the 1001-char comment, while the sanitize reasoning truncates it to 1000 chars.
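The two readings of that rule could be sketched like this (the function names and the `MAX_LEN` constant are illustrative, not taken from any library):

```python
MAX_LEN = 1000


def validate_length(comment: str) -> str:
    """Validate: accept or reject, never modify."""
    if len(comment) > MAX_LEN:
        raise ValueError(f"comment is longer than {MAX_LEN} chars")
    return comment


def sanitize_length(comment: str) -> str:
    """Sanitize: modify the value until it fits the rule."""
    return comment[:MAX_LEN]
```

With a 1001-char comment, `validate_length` raises, while `sanitize_length` silently returns the first 1000 chars.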

Do you have a different reading of the terms?

DagAgren · on Feb 28, 2020

You seem to be agreeing with me. I am saying to escape, not sanitise. What I described is sanitising.

dictum · on Feb 27, 2020

Maybe I'm overengineering, but couldn't you store the sanitized version as the normal value, and also store and make publicly available the original unsanitized value in an ominously and obviously named key (say, dangerouslyUnsanitizedValue) that happens to be easily greppable/lintable?

GuB-42 · on Feb 27, 2020

I think you are overengineering ;)

Plain text can contain anything and it shall be treated as such, it is that simple.

As for security, don't assume everything in your database came from a trusted source. Maybe there are remains from an old version of your code that didn't sanitize, maybe you improperly used admin tools that bypassed checks.

inimino · on Feb 27, 2020

The idea that one string is more dangerous than another is the problem.

asheroth · on Feb 27, 2020

How would you determine which value to display? It seems to suffer from the same issue where if you display the sanitized value then the comment is still missing necessary characters, but if you use the unsanitized value then your application will be vulnerable to XSS.

rossdavidh · on Feb 27, 2020

In most cases, that would be overengineering, but it is an entirely plausible solution if you happen to have a case where you need to allow the user to enter things like angle brackets, and for some reason you cannot escape them.

kochthesecond · on Feb 27, 2020

Thanks for explaining this better than I could.

Someone · on Feb 27, 2020

“Change < to &lt; in the database?”

Of course not. The fact that “<” is risky isn’t part of the string, it’s part of the output format (HTML).

If you were to write that string to JSON or CSV, you would have to special-case double quotes. In POSIX shell, asterisks and question marks need special attention, etc.
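A small sketch of the same stored string being written to four output formats, using only Python standard-library encoders (the helper names are invented for the example):

```python
import csv
import html
import io
import json
import shlex


def to_html(s: str) -> str:
    return html.escape(s)  # handles &, <, > and quotes


def to_json(s: str) -> str:
    return json.dumps(s)  # backslash-escapes double quotes


def to_csv_field(s: str) -> str:
    buf = io.StringIO()
    csv.writer(buf).writerow([s])  # quotes the field, doubles inner quotes
    return buf.getvalue().rstrip("\r\n")


def to_shell_arg(s: str) -> str:
    return shlex.quote(s)  # single-quotes anything the shell could expand


stored = 'say "hi" < *'  # the stored value is identical in every case
```

Each encoder treats a different subset of characters as special, which is why the escaping cannot be decided at storage time.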

amenod · on Feb 27, 2020

> This is clearly dangerous input! ...

You are missing the point.

You should sanitize the input when possible, so that numbers are really numbers, strings are really strings, slugs and similar are cleaned... But of course you can't clean text so that it will be safe when displayed. After all, `<` is only problematic if you are displaying the text as HTML, which, while common, is not a given.
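For instance, input-side cleaning of a number and a slug might look roughly like this (both function names and the exact rules are assumptions made for the sketch):

```python
import re


def parse_quantity(raw: str) -> int:
    """Make sure a 'number' field really is a non-negative integer."""
    candidate = raw.strip()
    # Only plain ASCII digits are accepted; anything else is rejected.
    if not re.fullmatch(r"[0-9]+", candidate):
        raise ValueError(f"not a number: {raw!r}")
    return int(candidate)


def clean_slug(raw: str) -> str:
    """Reduce free text to lowercase ASCII letters, digits and hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", raw.strip().lower())
    return slug.strip("-")
```

Neither function attempts to make free text "safe for HTML"; that remains the job of whatever produces the output.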

When displaying anything, you should however use a _framework_ that doesn't allow you to display anything that would not be safe (unless you use some function with "UNSAFE" or "DANGEROUS" in its name). For example React does that, and others too.

There are many different kinds of attacks and the less leeway an attacker has, the safer you are. So sanitize both input and output.

emn13 · on Feb 27, 2020

The solution: don't try to reproduce input exactly. That's a weird thing to want in general anyhow - what exactly are you reproducing so exactly? Text? Including Markup? How about some animation thrown in? Maybe Interactivity? Hey, let's just reproduce arbitrary executable code accurately?

The whole point of sane sanitization is that you don't need to reproduce all that stuff exactly. Pick a small domain, and reproduce that. Often, it's OK to reproduce approximately; e.g. not worrying about things like retaining multiple consecutive whitespaces, or perhaps leading/trailing whitespace, or whatever.

The point of sanitization is to make it easier not to make a dangerous mistake accidentally. If you have an input that needs to support layout, that's a pain. But if you can live with just text - so much the better. If you do need to support markup, then I don't see the wisdom in sanitizing it late; that's just asking for bugs to lead to security issues.

Frankly the whole tradeoff is nonsensical. These aren't mutually exclusive alternatives, and don't even really address the same issues. Yes, you should sanitize (and validate) your input. And you also need to escape output as appropriate.

If the point is that it's not wise to skip escaping because you "know" the input is safe due to sanitization - then sure, while theoretically sometimes sound, that's practically a nasty bug waiting to happen. Don't do that, sure.

DagAgren · on Feb 28, 2020

So are you saying I should just not ever want to say "A < B"?

emn13 · on Feb 29, 2020

No; that's just you picking an absurd example rather than being practical.

Pick a reasonable domain for each input field, considering what kind of input is useful, and what kind of usage in output (i.e. plain text output is likely much less risky than rich text). There's rarely a reason to ban < in plain text; but retaining stuff like zero-width joiners or rtl-ltr-transitions is likely less valuable, and potentially an issue in things like usernames or email addresses (because they make it trivial to make apparently identical usernames). Similarly, if you're storing a telephone number and want to retain spaces - are you going to retain nul-chars too?
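One possible way to narrow those domains, assuming it is acceptable to drop every Unicode format character (category `Cf`, which covers zero-width joiners and the bidi controls) and every control character (category `Cc`, which covers NUL); the function names are hypothetical:

```python
import unicodedata

INVISIBLE_CATEGORIES = {"Cf", "Cc"}  # format and control characters


def clean_username(raw: str) -> str:
    """Drop invisible characters that make distinct names look identical."""
    return "".join(
        ch for ch in raw if unicodedata.category(ch) not in INVISIBLE_CATEGORIES
    )


def clean_phone(raw: str) -> str:
    """Keep ASCII digits, '+' and spaces; drop everything else."""
    return "".join(ch for ch in raw if ch in "0123456789+ ")
```

A `<` in a username would survive `clean_username`; only the characters with no visible value in that field are removed.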

Not all input should allow arbitrary plain text. I'd guess most don't, and lots of input is at least rich text nowadays (not to mention images and other media - you think it's a good idea to just reproduce an arbitrary image exactly?).