<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.0.11" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Stop saying input validation</title>
	<link>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/</link>
	<description>The Cigital Software Security and Quality Blog</description>
	<pubDate>Wed, 20 Aug 2008 19:22:43 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.11</generator>

	<item>
		<title>by: Pravir Chandra</title>
		<link>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1991</link>
		<pubDate>Thu, 09 Aug 2007 00:18:10 +0000</pubDate>
		<guid>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1991</guid>
					<description>Mostly, my post was to try to swing back the dogma that many security folks peddle a little closer to my view of the 'right thing'. Let me address the comments that everyone has left:

Sylvan: For the cases where you want users to be able to submit 'rich content', I generally recommend NOT allowing the users to do so using HTML. It's far cleaner to allow them to input their content with some kind of markup language and then have the application parse and validate according to the markup language on input, and translate appropriately on output (to HTML, for instance, by HTML encoding everything that wasn't a markup control sequence for, say, links, images, etc.)

Edward: I agree with your three-step process, with one minor exception... For the second step where you store the data in a database, don't bother with manually implementing SQL escaping. Instead, use a properly formed parameterized query (prepared statement) and have the underlying framework take care of it for you. That way, there's a smaller chance of omission or error, not to mention cleaner code that's easy to audit for compliance.
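
A minimal sketch of the parameterized-query point, assuming Python's standard sqlite3 module; the comments table and the store_comment helper are hypothetical names used only for illustration:

```python
import sqlite3


def store_comment(conn, author, body):
    """Store the raw, un-escaped text; the driver binds the values."""
    # The ? placeholders keep data separate from the SQL text, so no manual
    # escaping is needed and quotes in the input cannot change the statement.
    conn.execute(
        "INSERT INTO comments (author, body) VALUES (?, ?)",
        (author, body),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")

hostile = "x'); DROP TABLE comments; --"
store_comment(conn, "mallory", hostile)

# The hostile string comes back unchanged, as plain data.
stored = conn.execute("SELECT body FROM comments").fetchone()[0]
print(stored == hostile)
```

The statement text never changes between calls, which is what makes this style easy to audit: any query built by string concatenation stands out.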

Rod: I think you misunderstood my comments related to the HTML/URL encoding. I was talking about using those encoding mechanisms as a part of input validation, which is absolutely a misguided strategy. Generally, people do this to avoid having to think about the various types of outputs that their application makes (HTML responses, log messages, SQL queries, etc.), thus it's indicative of having not thought through the problem.
As for a priori knowledge of output control characters, what I meant was that in a scenario where you still decided to encode on input, you would need to know all the ways in which the app (and database) would eventually use the data and then pick an encoding scheme that was universally safe. This is very hard for any app that's built to be extensible (web services) since it can't be known a priori all the ways in which callers might use the data. It's much simpler for all data consumers (including your own app) to treat all the data as potentially hazardous and always encode appropriately on output to ensure whatever it was, it will do no harm. If the data was already encoded before the caller gets it, they'd either a) use it without further thought, potentially leading to vulns (obviously bad for security), or b) be forced to unencode and then re-encode appropriately for how they'd like to use it (bad for maintainability and simplicity). Using the validate-on-input/store-pure-data/encode-on-output paradigm, the only thing you need to know a priori is all the ways in which your application will output data and assign an encoding scheme for each. The contract with other callers is clean: don't trust that the data is safe for any output channel.
As for input validation itself, I did state in the last para (if a little cryptically) that people should do it because it is good for usability and can prevent impact of other programming mistakes. Perhaps it's just a terminology difference, but I was using the term "input validation" for validation that occurs at the application scope (external data entrypoints and such). If an internal function in your program has a buffer that can be overflowed and it's relying on entrypoint input validation to keep it safe, it's a poorly designed function. Keeping a dependency tree of which fragile internal functions are protected by given entrypoint validation methods is error-prone and terrible for maintainability/extensibility.  It's better to design internal objects in such a way that they cannot, under any input/usage, lead to security problems. I suppose that's a kind of input validation in itself, but like I said, it might be a terminology difference.
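
A rough sketch of the encode-on-output idea, assuming a Python app with three output channels. The ENCODERS registry, the channel names, and the emit helper are hypothetical; the encoders themselves (html.escape, urllib.parse.quote, json.dumps) are standard-library calls:

```python
import html
import json
from urllib.parse import quote

# Hypothetical registry: one encoder per output channel the app knows about.
ENCODERS = {
    "html": html.escape,                  # text placed in an HTML page
    "url": lambda s: quote(s, safe=""),   # a value placed in a URL
    "log": json.dumps,                    # one quoted line, control chars escaped
}


def emit(channel, raw):
    """Encode raw stored data for one channel; unknown channels are refused."""
    if channel not in ENCODERS:
        raise ValueError("no encoder registered for channel: " + channel)
    return ENCODERS[channel](raw)


raw = 'say "hi"\nto/everyone'   # stored exactly as the user typed it
for channel in ENCODERS:
    print(channel, emit(channel, raw))
```

The stored value is never modified; each channel gets its own encoding at the moment of output, and a channel without a registered encoder fails loudly instead of emitting raw data.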

As a final note, to everyone, thanks very much for taking the time to leave feedback. I think we're all more in agreement than disagreement ;)</description>
		<content:encoded><![CDATA[<p>Mostly, my post was to try to swing back the dogma that many security folks peddle a little closer to my view of the &#8216;right thing&#8217;. Let me address the comments that everyone has left:</p>
<p>Sylvan: For the cases where you want users to be able to submit &#8216;rich content&#8217;, I generally recommend NOT allowing the users to do so using HTML. It&#8217;s far cleaner to allow them to input their content with some kind of markup language and then have the application parse and validate according to the markup language on input, and translate appropriately on output (to HTML, for instance, by HTML encoding everything that wasn&#8217;t a markup control sequence for, say, links, images, etc.)</p>
<p>Edward: I agree with your three-step process, with one minor exception&#8230; For the second step where you store the data in a database, don&#8217;t bother with manually implementing SQL escaping. Instead, use a properly formed parameterized query (prepared statement) and have the underlying framework take care of it for you. That way, there&#8217;s a smaller chance of omission or error, not to mention cleaner code that&#8217;s easy to audit for compliance.</p>
<p>Rod: I think you misunderstood my comments related to the HTML/URL encoding. I was talking about using those encoding mechanisms as a part of input validation, which is absolutely a misguided strategy. Generally, people do this to avoid having to think about the various types of outputs that their application makes (HTML responses, log messages, SQL queries, etc.), thus it&#8217;s indicative of having not thought through the problem.<br />
As for a priori knowledge of output control characters, what I meant was that in a scenario where you still decided to encode on input, you would need to know all the ways in which the app (and database) would eventually use the data and then pick an encoding scheme that was universally safe. This is very hard for any app that&#8217;s built to be extensible (web services) since it can&#8217;t be known a priori all the ways in which callers might use the data. It&#8217;s much simpler for all data consumers (including your own app) to treat all the data as potentially hazardous and always encode appropriately on output to ensure whatever it was, it will do no harm. If the data was already encoded before the caller gets it, they&#8217;d either a) use it without further thought, potentially leading to vulns (obviously bad for security), or b) be forced to unencode and then re-encode appropriately for how they&#8217;d like to use it (bad for maintainability and simplicity). Using the validate-on-input/store-pure-data/encode-on-output paradigm, the only thing you need to know a priori is all the ways in which your application will output data and assign an encoding scheme for each. The contract with other callers is clean: don&#8217;t trust that the data is safe for any output channel.<br />
As for input validation itself, I did state in the last para (if a little cryptically) that people should do it because it is good for usability and can prevent impact of other programming mistakes. Perhaps it&#8217;s just a terminology difference, but I was using the term &#8220;input validation&#8221; for validation that occurs at the application scope (external data entrypoints and such). If an internal function in your program has a buffer that can be overflowed and it&#8217;s relying on entrypoint input validation to keep it safe, it&#8217;s a poorly designed function. Keeping a dependency tree of which fragile internal functions are protected by given entrypoint validation methods is error-prone and terrible for maintainability/extensibility.  It&#8217;s better to design internal objects in such a way that they cannot, under any input/usage, lead to security problems. I suppose that&#8217;s a kind of input validation in itself, but like I said, it might be a terminology difference.</p>
<p>As a final note, to everyone, thanks very much for taking the time to leave feedback. I think we&#8217;re all more in agreement than disagreement ;)
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Roderick Divilbiss</title>
		<link>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1981</link>
		<pubDate>Tue, 07 Aug 2007 22:34:06 +0000</pubDate>
		<guid>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1981</guid>
					<description>I'm sorry but I think your logic is flawed.  I do not disagree with the basic premise that output filtering is under-emphasized and input filtering is no panacea, but you must have both.

You wrote:

"augment the send_message() function with logic to zip through the pending message and escape anything that was a control code and then you’d do the normal stuff of writing it to the wire."

and

"You’ve gotta have output encoding to truly solve it right."

then you objected to "HTML/URL encod[ing]" which I agree in context you meant using a perfectly valid HTML or URL "output" encoding as an improper "output" encoding for the database.

If your output is going to be sent to HTML, then HTML encoding is the correct output encoding, and if it is being sent as data on a URL, then URL encoding is correct.  Those are, however, not the correct output encodings if it is being sent to a database.

More to the point; is your function which "zip[s] through the pending message and escape[s] anything that was a control code [before] writing it to the wire;" not in fact a priori?  You must know the control characters or how to recognize them in advance of escaping them to perform the specific output to that channel of communication.

At the point where input is received, you must encode or escape it with knowledge of where it is going so it is still a priori.

Furthermore, as Edward Z. Yang correctly pointed out, it is ridiculous to accept characters outside the scope, range or domain of the expected input.  To do nothing opens the opportunity for other attacks where perfectly valid data creates an undesired fault, such as a buffer overflow.

You need both input and output operations.

While I don't disagree with the assertion that: "We need to ensure that architects &#38; developers have a deeper understanding of what the problem really is in order for them to naturally build systems to these types of attacks."

It would be far better to modify the underlying architecture to make it impossible (or very difficult) to output to any particular "channel" potentially tainted data.

But your argument does not lead me to conclude that validating input warrants the B.S. card.  It is indeed the first, but not the only step towards robust security.</description>
		<content:encoded><![CDATA[<p>I&#8217;m sorry but I think your logic is flawed.  I do not disagree with the basic premise that output filtering is under-emphasized and input filtering is no panacea, but you must have both.</p>
<p>You wrote:</p>
<p>&#8220;augment the send_message() function with logic to zip through the pending message and escape anything that was a control code and then you’d do the normal stuff of writing it to the wire.&#8221;</p>
<p>and</p>
<p>&#8220;You’ve gotta have output encoding to truly solve it right.&#8221;</p>
<p>then you objected to &#8220;HTML/URL encod[ing]&#8221; which I agree in context you meant using a perfectly valid HTML or URL &#8220;output&#8221; encoding as an improper &#8220;output&#8221; encoding for the database.</p>
<p>If your output is going to be sent to HTML, then HTML encoding is the correct output encoding, and if it is being sent as data on a URL, then URL encoding is correct.  Those are, however, not the correct output encodings if it is being sent to a database.</p>
<p>More to the point; is your function which &#8220;zip[s] through the pending message and escape[s] anything that was a control code [before] writing it to the wire;&#8221; not in fact a priori?  You must know the control characters or how to recognize them in advance of escaping them to perform the specific output to that channel of communication.</p>
<p>At the point where input is received, you must encode or escape it with knowledge of where it is going so it is still a priori.</p>
<p>Furthermore, as Edward Z. Yang correctly pointed out, it is ridiculous to accept characters outside the scope, range or domain of the expected input.  To do nothing opens the opportunity for other attacks where perfectly valid data creates an undesired fault, such as a buffer overflow.</p>
<p>You need both input and output operations.</p>
<p>While I don&#8217;t disagree with the assertion that: &#8220;We need to ensure that architects &amp; developers have a deeper understanding of what the problem really is in order for them to naturally build systems to these types of attacks.&#8221;</p>
<p>It would be far better to modify the underlying architecture to make it impossible (or very difficult) to output to any particular &#8220;channel&#8221; potentially tainted data.</p>
<p>But your argument does not lead me to conclude that validating input warrants the B.S. card.  It is indeed the first, but not the only step towards robust security.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Edward Z. Yang</title>
		<link>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1948</link>
		<pubDate>Sun, 05 Aug 2007 00:20:45 +0000</pubDate>
		<guid>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1948</guid>
					<description>I think there's a slight bit of confusion between input "validation" and "encoding data" during input. Ideally speaking, input validation should determine that the input data is in a correct format: if you asked for a number, there shouldn't be letters; if you asked for only alphanumeric characters, there shouldn't be less-than or greater-than signs.

When you see someone encoding data for a projected output format (not the one immediately relevant, such as the database), it's usually an ill-advised optimization. The subconscious rationale is that if you encode it in-bound, you don't have to worry about it when it's out-bound. It can be a perfectly reasonable optimization, especially if encoding the data in a safe manner is particularly resource intensive (the canonical case is user-submitted HTML in HTML, for which I've written a library). But if you're encoding willy-nilly without understanding the ramifications, you get problems. The perfect example is PHP's magic_quotes. The functionality "encodes" your data so that it can be safely included in SQL statements. There are two problems with this approach:

1. It assumes that the data is actually going to be used in an SQL context (which, many times, it's not), and
2. It performs the encoding improperly!

In summary, the proper sequence of events is:

1. Inbound: validate the data
2. Store the pure, un-encoded/un-escaped data (to do this, you'll probably need to perform SQL escaping on it to store in a database)
3. Output: retrieve the data, and transform it into a form safe for the output format (usually HTML)

If three proves to be too costly to be done on each page request, a cache of the pre-encoded data can be added.</description>
		<content:encoded><![CDATA[<p>I think there&#8217;s a slight bit of confusion between input &#8220;validation&#8221; and &#8220;encoding data&#8221; during input. Ideally speaking, input validation should determine that the input data is in a correct format: if you asked for a number, there shouldn&#8217;t be letters; if you asked for only alphanumeric characters, there shouldn&#8217;t be less-than or greater-than signs.</p>
<p>When you see someone encoding data for a projected output format (not the one immediately relevant, such as the database), it&#8217;s usually an ill-advised optimization. The subconscious rationale is that if you encode it in-bound, you don&#8217;t have to worry about it when it&#8217;s out-bound. It can be a perfectly reasonable optimization, especially if encoding the data in a safe manner is particularly resource intensive (the canonical case is user-submitted HTML in HTML, for which I&#8217;ve written a library). But if you&#8217;re encoding willy-nilly without understanding the ramifications, you get problems. The perfect example is PHP&#8217;s magic_quotes. The functionality &#8220;encodes&#8221; your data so that it can be safely included in SQL statements. There are two problems with this approach:</p>
<p>1. It assumes that the data is actually going to be used in an SQL context (which, many times, it&#8217;s not), and<br />
2. It performs the encoding improperly!</p>
<p>In summary, the proper sequence of events is:</p>
<p>1. Inbound: validate the data<br />
2. Store the pure, un-encoded/un-escaped data (to do this, you&#8217;ll probably need to perform SQL escaping on it to store in a database)<br />
3. Output: retrieve the data, and transform it into a form safe for the output format (usually HTML)</p>
<p>If three proves to be too costly to be done on each page request, a cache of the pre-encoded data can be added.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Sylvan von Stuppe</title>
		<link>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1946</link>
		<pubDate>Sat, 04 Aug 2007 14:53:18 +0000</pubDate>
		<guid>http://www.cigital.com/justiceleague/2007/08/03/stop-saying-input-validation/#comment-1946</guid>
					<description>Thanks for putting a more academic word behind what I've been saying for a very long time.  I've downplayed my stance on it some because it seems to be a losing battle - welcome to a very strange minority.

You might want to keep in mind that for sites like MySpace, input validation, manipulation, and rejection are the only way to really deal with their stuff, but on how many websites do you really want the user to be in control of HTML?  The cases where output filtering does a bad thing are a *really* small set, but the approach taken by many other security experts is to treat that case as the norm.</description>
		<content:encoded><![CDATA[<p>Thanks for putting a more academic word behind what I&#8217;ve been saying for a very long time.  I&#8217;ve downplayed my stance on it some because it seems to be a losing battle - welcome to a very strange minority.</p>
<p>You might want to keep in mind that for sites like MySpace, input validation, manipulation, and rejection are the only way to really deal with their stuff, but on how many websites do you really want the user to be in control of HTML?  The cases where output filtering does a bad thing are a *really* small set, but the approach taken by many other security experts is to treat that case as the norm.
</p>
]]></content:encoded>
				</item>
</channel>
</rss>
