Stop saying input validation

by pravir on Friday, August 3, 2007

It seems like almost everywhere you turn for advice about securing programs or resolving known security problems, you find a ‘security guy’ telling you something along the lines of ‘well, you have to validate your inputs to prevent these kinds of issues’.

Perhaps I’ve heard it too many times or perhaps I’m just jaded, but I’m throwing the BS card. Of course, I’d never leave it at just that… I think I’ve got a pretty good case for why it’s BS.

Consider my favorite red-headed stepchild, cross-site scripting (XSS). The mechanics of this problem are simple: an application accepts some input data and then offers that data back to a user as output without checking the content of the data along the way (this is fundamentally the case for both reflected and stored XSS problems).

Now, consider a small flashback I’m about to have to a Computer Networking & Communications class from my undergrad days. I know serial communications are sooo last week, but does anyone remember putting together simple protocols to transfer data over a line? In a simple message-based protocol, you’d pick a few byte values to represent a few control commands like ‘end of message’ or ‘close this channel down’. This seemed like a great plan until you tested it out and noticed that some messages were getting truncated in weird ways and occasionally the whole channel went down. If you listened to the professor instead of just chalking it up to bit-gnomes, what you learned was that since you intermixed the CONTROL channel with the DATA channel, your data was inadvertently being interpreted as control commands whenever the corresponding byte values were present in the data being transferred. Hopefully, you then learned that to make the protocol reliable, you needed a mechanism to escape data containing values that would otherwise be interpreted as control codes. How’d you implement the fix? Well, you certainly didn’t try to trace the origin and content of every byte that might enter a message. What you did was augment the send_message() function with logic to zip through the pending message and escape anything that was a control code and then you’d do the normal stuff of writing it to the wire.
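
For anyone who never had that class, here is roughly what the fix looks like as a minimal Python sketch. The byte values for END, CLOSE, and ESC and the XOR trick are assumptions picked for illustration; the post does not say what the original protocol used.

```python
# Hypothetical control codes; the values are made up for this sketch.
END = 0x7E    # 'end of message'
CLOSE = 0x7F  # 'close this channel down'
ESC = 0x7D    # escape marker (must itself be escaped when it shows up in data)
CONTROL = {END, CLOSE, ESC}


def escape_payload(payload: bytes) -> bytes:
    """Escape every control byte so the receiver treats it as plain data."""
    out = bytearray()
    for b in payload:
        if b in CONTROL:
            out.append(ESC)
            out.append(b ^ 0x20)  # flip a bit so the raw control value never appears
        else:
            out.append(b)
    return bytes(out)


def unescape_payload(escaped: bytes) -> bytes:
    """Reverse escape_payload() on the receiving side (assumes well-formed input)."""
    out = bytearray()
    it = iter(escaped)
    for b in it:
        if b == ESC:
            out.append(next(it) ^ 0x20)  # the byte after ESC is the flipped original
        else:
            out.append(b)
    return bytes(out)


def send_message(payload: bytes, write) -> None:
    """Escape the data, then append the one genuine END control byte."""
    write(escape_payload(payload) + bytes([END]))
```

Notice that send_message() never asks where the payload came from. It only knows what is dangerous on this particular wire and neutralizes it on the way out.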

Let’s pop back to reality and our XSS problem. The problem is that we’re mixing user DATA with our application’s CONTROL and allowing the user data to be interpreted as control commands (in this case, HTML and/or javascript elements that run in the victim’s browser). I can understand how input validation might help for common, pathological cases of XSS vulns, but in no way is it a complete or adequate solution because what’s really important (and the correct chokepoint to fix the issue at) is the point of output. Yeah, we need to ensure that every time we put content into the user-bound HTML stream we first encode it to be safe in that control channel/language.
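
As a minimal sketch of encoding at the point of output, here is what that can look like with Python's standard html.escape(). The render_comment() helper is hypothetical, and this encoding is only right for text placed in an HTML element body or a quoted attribute; JavaScript or URL contexts need their own encoders.

```python
import html


def render_comment(comment: str) -> str:
    """Encode user data at the moment it enters the HTML stream."""
    # html.escape converts &, < and >, plus both quote characters by default.
    return "<p>" + html.escape(comment) + "</p>"
```

Whatever the user typed, the browser now sees it as DATA inside the paragraph and never as CONTROL.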

This is not limited to XSS. SQL injection (or really any injection attack) is about taking input data and passing it with unchecked contents to a DB command (or any API with a control language of its own, e.g. LDAP queries). Again, input validation can help in simple cases, but you’ve gotta know a priori all the ways in which data might be used by an app and choose some kind of mutually safe set of characters to let through. Either that or encode the potentially unsafe characters so they don’t cause trouble somewhere down the line. Not very extensible, usable, or maintainable in many circumstances since it’s overly restricting and fragile to change.

Take, for instance, a ‘Comments’ field in a web app. Many real-world applications really do need to allow users to use characters like ‘$’, ‘%’, ‘<’, or ‘>’ to represent things like money, percentages, and value comparisons, so I simply reject the idea of banning those characters because it’s unnecessary and indicative of a misunderstanding of the real problem. Some folks (who shall remain nameless) have said, ‘well, if you need those characters, just HTML/URL encode them as part of the input validation process and you’re all set’. Well, now you’ve added another problem: you’ve gotta decode and re-encode appropriately for every other output vector. To supporters of this strategy I ask, how many loading dock foremen and warehouse employees do you know that would correctly interpret a printout containing ‘Boxes &amp; packing must be &lt;50lbs’? How about ‘Boxes%20%26%20packing%20must%20be%20%3C50lbs’?
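
The alternative is to keep the comment exactly as typed and pick the encoding per destination. A small Python sketch using only the standard library (the variable names are mine):

```python
import html
from urllib.parse import quote

raw = "Boxes & packing must be <50lbs"  # stored exactly as the user typed it

as_html = html.escape(raw)     # for an HTML page
as_url = quote(raw, safe="")   # for a URL component
as_print = raw                 # for the loading dock printout: no encoding at all

print(as_html)   # Boxes &amp; packing must be &lt;50lbs
print(as_url)    # Boxes%20%26%20packing%20must%20be%20%3C50lbs
print(as_print)  # Boxes & packing must be <50lbs
```

Each channel gets the form that is safe for it, and the foreman gets something he can read.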

Now, I’m not saying you should not do input validation. It adds great usability features and might limit the impact of other programming mistakes. What I’m saying is that input validation alone isn’t enough. You’ve gotta have output encoding to truly solve it right. We need to ensure that architects & developers have a deeper understanding of what the problem really is in order for them to naturally build systems resistant to these types of attacks.

4 Responses to “Stop saying input validation”

  1. Thanks for putting a more academic word behind what I’ve been saying for a very long time. I’ve downplayed my stance on it some because it seems to be a losing battle – welcome to a very strange minority.

    You might want to keep in mind that for sites like MySpace, input validation, manipulation, and rejection are the only way to really deal with their stuff, but on how many websites do you really want the user to be in control of HTML? The cases where output filtering does a bad thing are a *really* small set, but the approach of many other security experts is to treat that case as the norm.

  2. I think there’s a slight bit of confusion between input “validation” and “encoding data” during input. Ideally speaking, input validation should determine that the input data is in a correct format: if you asked for a number, there shouldn’t be letters; if you asked for only alphanumeric characters, there shouldn’t be less-than or greater-than signs.
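
    To make that concrete, here is a minimal Python sketch of validation in this sense: it checks format and either accepts or rejects, without encoding anything. The function names and the 32-character limit are assumptions for illustration.

```python
import re


def is_valid_quantity(text: str) -> bool:
    """Accept only a non-negative integer written in ASCII digits."""
    return re.fullmatch(r"[0-9]+", text) is not None


def is_valid_username(text: str) -> bool:
    """Accept only ASCII letters and digits, 1 to 32 characters (arbitrary limit)."""
    return re.fullmatch(r"[A-Za-z0-9]{1,32}", text) is not None
```

    Note that neither function changes the input. That is the difference between validating and encoding.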

    When you see someone encoding data for a projected output format (not the one immediately relevant, such as the database), it’s usually an ill-advised optimization. The subconscious rationale is that if you encode it in-bound, you don’t have to worry about it when it’s out-bound. It can be a perfectly reasonable optimization, especially if encoding the data in a safe manner is particularly resource intensive (the canonical case is user-submitted HTML in HTML, for which I’ve written a library). But if you’re encoding willy-nilly without understanding the ramifications, you get problems. The perfect example is PHP’s magic_quotes. The functionality “encodes” your data so that it can be safely included in SQL statements. There are two problems with this approach:

    1. It assumes that the data is actually going to be used in an SQL context (which, many times, it’s not), and
    2. It performs the encoding improperly!

    In summary, the proper sequence of events is:

    1. Inbound: validate the data
    2. Store the pure, un-encoded/un-escaped data (to do this, you’ll probably need to perform SQL escaping on it to store in a database)
    3. Output: retrieve the data, and transform it into a form safe for the output format (usually HTML)

    If step three proves to be too costly to be done on each page request, a cache of the pre-encoded data can be added.

  3. I’m sorry but I think your logic is flawed. I do not disagree with the basic premise that output filtering is under-emphasized and input filtering is no panacea, but you must have both.

    You wrote:

    “augment the send_message() function with logic to zip through the pending message and escape anything that was a control code and then you’d do the normal stuff of writing it to the wire.”

    and

    “You’ve gotta have output encoding to truly solve it right.”

    Then you objected to “HTML/URL encod[ing]”, which I agree, in context, meant using a perfectly valid HTML or URL “output” encoding as an improper “output” encoding for the database.

    If your output is going to be sent to HTML, then HTML encoding is the correct output encoding, and if it is being sent as data on a URL, then URL encoding is correct. Those are, however, not the correct output encodings if it is being sent to a database.

    More to the point, is your function which “zip[s] through the pending message and escape[s] anything that was a control code [before] writing it to the wire” not in fact a priori? You must know the control characters, or how to recognize them, in advance in order to escape them for output to that specific channel of communication.

    At the point where input is received, you must encode or escape it with knowledge of where it is going so it is still a priori.

    Furthermore, as Edward Z. Yang correctly pointed out, it is ridiculous to accept characters outside the scope, range or domain of the expected input. To do nothing opens the opportunity for other attacks where perfectly valid data creates an undesired fault, such as a buffer overflow.

    You need both input and output operations.

    While I don’t disagree with the assertion that: “We need to ensure that architects & developers have a deeper understanding of what the problem really is in order for them to naturally build systems to these types of attacks.”

    It would be far better to modify the underlying architecture to make it impossible (or very difficult) to output to any particular “channel” potentially tainted data.
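
    One way such an architecture might look is a wrapper type that the output channel insists on. This is only a Python sketch under my own assumptions: SafeHtml and HtmlResponse are hypothetical names, and calling the bare SafeHtml() constructor is assumed to be reserved for the application's own literal markup.

```python
import html


class SafeHtml:
    """Marks a string as already encoded for the HTML channel."""

    def __init__(self, markup: str) -> None:
        # Assumed to be called directly only with the application's own literals.
        self.markup = markup

    @classmethod
    def from_text(cls, text: str) -> "SafeHtml":
        """The normal way in for untrusted data: encode it."""
        return cls(html.escape(text))


class HtmlResponse:
    """An output channel that refuses anything that is not SafeHtml."""

    def __init__(self) -> None:
        self._parts = []

    def write(self, fragment: SafeHtml) -> None:
        if not isinstance(fragment, SafeHtml):
            raise TypeError("write() accepts SafeHtml only; encode raw text first")
        self._parts.append(fragment.markup)

    def body(self) -> str:
        return "".join(self._parts)
```

    This makes the mistake difficult rather than impossible, since a developer can still wrap raw input with the bare constructor, but that becomes a small, auditable set of call sites.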

    But your argument does not lead me to conclude that validating input warrants the B.S. card. It is indeed the first, but not the only step towards robust security.

  4. Mostly, my post was to try to swing back the dogma that many security folks peddle a little closer to my view of the ‘right thing’. Let me address the comments that everyone has left:

    Sylvan: For the cases where you want users to be able to submit ‘rich content’, I generally recommend NOT allowing the users to do so using HTML. It’s far cleaner to allow them to input their content with some kind of markup language and then have the application parse and validate according to the markup language on input, and translate appropriately on output (to HTML, for instance, by HTML encoding everything that wasn’t a markup control sequence for, say, links, images, etc.)
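
    A minimal Python sketch of the idea, assuming a made-up mini-markup with just *bold* and [label](http://...) links. Everything that is not one of those two control sequences gets HTML encoded.

```python
import html
import re

# Hypothetical mini-markup: *bold* and [label](url), where url must be http(s).
TOKEN = re.compile(
    r"\*(?P<bold>[^*\n]+)\*"
    r"|\[(?P<label>[^\]\n]+)\]\((?P<url>https?://[^\s)]+)\)"
)


def markup_to_html(text: str) -> str:
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(html.escape(text[pos:m.start()]))  # plain text between tokens
        if m.group("bold") is not None:
            out.append("<b>" + html.escape(m.group("bold")) + "</b>")
        else:
            # html.escape also encodes quotes, so the URL cannot break out of href="..."
            out.append(
                '<a href="' + html.escape(m.group("url")) + '">'
                + html.escape(m.group("label")) + "</a>"
            )
        pos = m.end()
    out.append(html.escape(text[pos:]))  # trailing plain text
    return "".join(out)
```

    The only tags that can ever reach the page are the ones the application itself emits, and a link with any scheme other than http or https simply falls through as encoded text.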

    Edward: I agree with your three step process, with one minor exception… For the second step where you store the data in a database, don’t bother with manually implementing SQL escaping. Instead, use a properly formed parameterized query (prepared statement) and have the underlying framework take care of it for you. That way, there’s a smaller chance of omission or error, not to mention cleaner code that’s easy to audit for compliance.
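
    For illustration, a parameterized query in Python's built-in sqlite3 module looks like this. The table and the hostile string are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

hostile = "nice post'); DROP TABLE comments; --"

# The ? placeholder keeps the value in the DATA channel; it is never parsed as SQL.
conn.execute("INSERT INTO comments (body) VALUES (?)", (hostile,))

stored = conn.execute(
    "SELECT body FROM comments WHERE body = ?", (hostile,)
).fetchone()[0]
```

    The string comes back byte for byte as it went in, and there is no hand-written escaping to forget or get wrong.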

    Rod: I think you misunderstood my comments related to the HTML/URL encoding. I was talking about using those encoding mechanisms as a part of input validation, which is absolutely a misguided strategy. Generally, people do this to avoid having to think about the various types of outputs that their application makes (HTML responses, log messages, SQL queries, etc.), thus it’s indicative of having not thought through the problem.
    As for a priori knowledge of output control characters, what I meant was that in a scenario where you still decided to encode on input, you would need to know all the ways in which the app (and database) would eventually use the data and then pick an encoding scheme that was universally safe. This is very hard for any app that’s built to be extensible (web services) since it can’t be known a priori all the ways in which callers might use the data. It’s much simpler for all data consumers (including your own app) to treat all the data as potentially hazardous and always encode appropriately on output to ensure that, whatever it was, it will do no harm. If the data was already encoded before the callers get it, they’d either a) use it without further thought, potentially leading to vulns (obviously bad for security), or b) be forced to unencode and then re-encode appropriately for how they’d like to use it (bad for maintainability and simplicity). Using the validate-on-input/store-pure-data/encode-on-output paradigm, the only thing you need to know a priori is all the ways in which your application will output data, so that you can assign an encoding scheme for each. The contract with other callers is clean: don’t trust that the data is safe for any output channel.
    As for input validation itself, I did state in the last para (if a little cryptically) that people should do it because it is good for usability and can prevent impact of other programming mistakes. Perhaps it’s just a terminology difference, but I was using the term “input validation” for validation that occurs at the application scope (external data entrypoints and such). If an internal function in your program has a buffer that can be overflowed and it’s relying on entrypoint input validation to keep it safe, it’s a poorly designed function. Keeping a dependency tree of which fragile internal functions are protected by given entrypoint validation methods is error-prone and terrible for maintainability/extensibility. It’s better to design internal objects in such a way that they cannot, under any input/usage, lead to security problems. I suppose that’s a kind of input validation in itself, but like I said, it might be a terminology difference.

    As a final note, to everyone, thanks very much for taking the time to leave feedback. I think we’re all more in agreement than disagreement ;)
