Justice League Blog

Input Validation and Data Dictionaries

Our internal discussion board brought up the topic of input validation last week. The discussion was around the regex for validating an email address. The message was that what seems like a very simple input validation can get complicated if the full standard is supported. As I read the discussion I started thinking about Data Dictionaries and their failed attempt to standardize data definitions within an enterprise.

The history lesson here is that around the time that database management systems were becoming commercialized (1980s) there were products called data dictionaries or data repositories. The idea was that metadata should be kept outside of the DBMS and shared by all of the application code. In the data dictionary nirvana, there would be ONE definition of EmployeeID for the entire enterprise and all applications and database would share this one definition.

Fast forward to 2010. Where are these products today? Just about every client I walk into has a commercial DBMS (Oracle, Microsoft or IBM), but nowhere do we encounter data dictionaries. Why? Because they never got deployed. What enterprises found out is that no one could agree on what an EmployeeID (or any other piece of data) looked like. Sometime it was politics; sometimes it was acquisitions; sometimes it was I18n. Whatever the reason, the bottom line is the same – there are precious few common data definitions in any enterprise.

So, how in the world does one expect to do input validation against data when there are only a precious few data definition that accurately describe data? This seems like a path that will only lead to the same quagmire that data dictionaries fell into.

For the curious, the regex for email addresses is:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|”(?:[x01-x08x0bx0cx0e-x1fx21x23-x5bx5d-x7f]|\[x01-x09x0bx0cx0e-x7f])*”)@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[x01-x08x0bx0cx0e-x1fx21-x5ax53-x7f]|\[x01-x09x0bx0cx0e-x7f])+)])

(from http://www.regular-expressions.info/email.html)

Thanks to David Lindsay for the reference.

This entry was posted in Data Security. Bookmark the permalink.
« »
  • Tom Van Vleck

    I worked at Tandem in the 80s. They had their own SQL database implementation, and a company-wide data dictionary. Using this dictionary exposed several problems with data dictionaries.
    1. Administration. If there is bureaucracy involved in defining data elements, developers will bypass it.
    2. Merging. When independently developed data dictionaries must be merged, either you end up rewriting applications or you end up with multiple similar fields.
    3. Semantics. This is the killer. Simple example: we had an “airport” field in the corp data dictionary definition for a branch office. Some applications used it for the air freight calculation; others for expense report mileage calculation.. but some airports didn’t do air freight, and the apps conflicted in their use. Each app designer said, “airport, aha, that’s what I need” and didn’t realize there was a semantic problem. Getting precise semantic definitions of each data element required people to model the whole real world, not just the part their app touched.

    You might be able to model what a mail address looked like (wonderful regex!) but the DD also has to have semantics for how mail addresses are related to other entities and how they are used. My SQL address book has about 5 email address fields and i find myself punning all the time to account for idiosyncratic uses.