Anti-Pattern: Trusting User Input taken to the Nth
June 12, 2012
Most developers are well aware that we should not trust user input. In the 90s, many learned this when they hid pricing information in hidden fields on a form and clever users changed the price of an item to a penny and successfully bought it. Others have learned this through SQL injection hacks.
But it goes beyond information that users input via a form. Anything you have to store that affects behavior in your system should be validated prior to attempting to initiate the behavior.
Just yesterday, I found an interesting form of trust that is particularly insidious. The system has a service that accepts an object from an outside system. It also takes an ID that it uses like a primary key in our system. Rather than go on typing, let me illustrate.
The system uses a variety of queues and there are objects that need to be queued. I have faked the actual objects (to protect the innocent?), but the core concept is like this. There is a base object that indicates a queued object.
The object carries an Id, which is used to store the object in a queue, and a Description, so the consuming team can determine what an object was designed for. I am not worried about the Description, although I do consider it a bit superfluous in this application. The Id is the interesting part, and I will get back to it in a moment.
The classes representing the objects that are queued derive from the QueuedObjectBase class, like so:
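A minimal sketch of the base class and two derived types, written in Python for illustration. Only QueuedObjectBase, Id, and Description come from the description above; the derived class names are inferred from the variable names used later in this post, and the payload fields (category, sku) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueuedObjectBase:
    """Base for anything that gets placed on a queue."""
    id: str           # supplied by the client and used as the storage key
    description: str  # tells the consuming team what the object is for


@dataclass
class BrowseQueuedObject(QueuedObjectBase):
    category: str = ""  # hypothetical payload field


@dataclass
class CatalogQueuedObject(QueuedObjectBase):
    sku: str = ""       # hypothetical payload field


browse = BrowseQueuedObject(id="1", description="browse request", category="books")
catalog = CatalogQueuedObject(id="2", description="catalog request", sku="A-100")
```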
The actual object (as its derived type) is stored in a serialized format (not a paradigm I like, but “it is what it is”). Before it is stored, the Id and Description are pulled out. This is actually accomplished in a CMS (used like a database, which is another paradigm I dislike), but it is easier to illustrate with a database.
Suppose we have a table designed like this:
Visually, the table has three columns: Id (the primary key), Descript, and Serialized.
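One way to sketch such a table is with SQLite through Python's standard sqlite3 module. The table name Queue and the column types are assumptions; only the three column names and the primary key on Id follow from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Queue (
        Id         TEXT PRIMARY KEY,  -- supplied by the client
        Descript   TEXT NOT NULL,     -- copied from the object's Description
        Serialized TEXT NOT NULL      -- the whole object, serialized
    )
    """
)

# List the column names to confirm the layout.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Queue)")]
print(columns)
```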
When a Queue object enters the system, the following happens:
- The Id is pulled out as a variable to go in the Id field
- The Description is pulled out as a variable to go in the Descript field
- The entire object is serialized to go in the serialized field
The code would look something like this:
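The three steps above can be sketched as a small function that splits a queued object into the values to store. This is an illustration, not the system's actual code: to_row is a hypothetical helper, and JSON stands in for whatever serializer the real system uses. Note that the serialized form keeps no record of the object's type.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class QueuedObjectBase:
    id: str
    description: str


def to_row(obj: QueuedObjectBase) -> Tuple[str, str, str]:
    """Return (Id, Descript, Serialized) for a queued object."""
    serialized = json.dumps(asdict(obj))  # the entire object, serialized
    return obj.id, obj.description, serialized


row = to_row(QueuedObjectBase(id="1", description="browse request"))
print(row)
```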
Assume the repository performs something like the following:
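A minimal sketch of such a repository, again using an in-memory SQLite database. QueueRepository and its methods are hypothetical names, and the schema is the assumed three-column table described earlier.

```python
import sqlite3


class QueueRepository:
    """Hypothetical repository that stores one row per queued object."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE Queue (Id TEXT PRIMARY KEY, Descript TEXT, Serialized TEXT)"
        )

    def add(self, id_: str, descript: str, serialized: str) -> None:
        # Parameterized insert; the client-supplied Id becomes the primary key.
        self.conn.execute(
            "INSERT INTO Queue (Id, Descript, Serialized) VALUES (?, ?, ?)",
            (id_, descript, serialized),
        )
        self.conn.commit()

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM Queue").fetchone()[0]


repo = QueueRepository()
repo.add("1", "browse request", '{"id": "1", "description": "browse request"}')
print(repo.count())
```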
Seeing the problem yet? Let’s illustrate by setting up some code that is guaranteed to create a problem.
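A sketch of the failing scenario under the same assumptions (SQLite, a hypothetical Queue table): two different objects arrive with the same client-chosen Id, and the second insert violates the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Queue (Id TEXT PRIMARY KEY, Descript TEXT, Serialized TEXT)"
)

insert = "INSERT INTO Queue (Id, Descript, Serialized) VALUES (?, ?, ?)"
conn.execute(insert, ("1", "browse request", '{"category": "books"}'))

try:
    # A second object of a different type, sent with the same Id.
    conn.execute(insert, ("1", "catalog request", '{"sku": "A-100"}'))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    # The primary key constraint rejects the duplicate.
    duplicate_rejected = True

print(duplicate_rejected)
```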
Got it now? Yes, since we have given the client the opportunity to control our Id (the primary key in the database scenario), they can send us duplicate information.
As we currently have this set up, the second call throws an exception, so we can inform the client they cannot have that Id, as they have already used it in the queue. But what if we are using a different persistence mechanism that defaults to overwriting the browseQueuedObject with a catalogQueuedObject, since they have the same Id?
Here is the client code to pull a browse queued object from the queue.
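A sketch of that scenario, with a plain dictionary standing in for a persistence mechanism that overwrites on duplicate keys. Python has no casts, so the failure appears when the client tries to rebuild a BrowseQueuedObject from fields that belong to a CatalogQueuedObject. The payload fields and the enqueue helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BrowseQueuedObject:
    id: str
    description: str
    category: str  # hypothetical payload field


@dataclass
class CatalogQueuedObject:
    id: str
    description: str
    sku: str       # hypothetical payload field


store = {}  # stand-in for storage that overwrites on duplicate keys


def enqueue(obj) -> None:
    store[obj.id] = json.dumps(asdict(obj))  # no type information is kept


enqueue(BrowseQueuedObject("1", "browse request", "books"))
enqueue(CatalogQueuedObject("1", "catalog request", "A-100"))  # silently replaces it

# Client side: expects the browse object it queued under Id "1".
try:
    browse = BrowseQueuedObject(**json.loads(store["1"]))
    failed = False
except TypeError:
    # The stored fields belong to a CatalogQueuedObject.
    failed = True

print(failed)
```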
What happens now is a casting exception, as the actual object returned is a catalogQueuedObject instead of a browseQueuedObject.
Now, you probably are saying something like “we can protect the user by adding a type and then validating”, but the core problem is we have tightly coupled the client to our storage mechanism when we allow them to create the key. As this is our storage engine, we should be in control of the keys. We can pass the key back for the client to store so they can communicate (one option) or we can store their id for the object (or similar object on their side) along with a type. We can also make multiple queues. But, we should not have them deciding how we identify the object.
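One of the options above, sketched under stated assumptions: the service generates the key itself, stores the client's own id and a type name next to the payload, and hands the key back. The enqueue and dequeue functions, the counter-based key, and the in-memory store are all hypothetical.

```python
import itertools
import json
from typing import Dict, Tuple

_next_key = itertools.count(1)                # keys are generated by the service
_store: Dict[int, Tuple[str, str, str]] = {}  # key -> (type name, client id, payload)


def enqueue(type_name: str, client_id: str, payload: dict) -> int:
    """Store the payload under a service-generated key and return that key."""
    key = next(_next_key)
    _store[key] = (type_name, client_id, json.dumps(payload))
    return key


def dequeue(key: int, expected_type: str) -> dict:
    """Return the payload, refusing to hand back an object of the wrong type."""
    type_name, _client_id, payload = _store[key]
    if type_name != expected_type:
        raise TypeError(f"expected {expected_type}, found {type_name}")
    return json.loads(payload)


k1 = enqueue("browse", "1", {"category": "books"})
k2 = enqueue("catalog", "1", {"sku": "A-100"})  # same client id, no clash
print(dequeue(k1, "browse"))
```

Because the client's id is only data here, two clients (or one careless client) can reuse it without touching each other's entries, and the storage key can change shape without anyone outside the service noticing.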
When I mentioned this, the first argument I heard was “well this is an internal application and our developers would never do that kind of thing”. This is an argument I have heard numerous times for a variety of bad practices. The problem is we might eventually open this system to people outside of our organization. At first, it might be partners. This increases the likelihood of clashes, although it is not likely that someone will purposefully try to crash the system. But one day we might want to offer this service the way Facebook or Twitter do and open it to the world. At that point, you might have hackers trying to overwrite objects to see if they can crash the site. This might just be for fun, but they may also be looking for exception messages to determine how to hack deeper into our system.
The second argument I heard was “we are using globally unique identifiers”. This sounds well and good, but while GUIDs are designed to be statistically “impossible” to end up with the same ID, over time the “impossible” becomes more possible, especially with multiple people creating IDs. But this is not the end of the story. There are people who have created their own GUID algorithms. What if someone creates a bad algorithm and shares it? In addition, I have the power to manually create GUIDs like so:
Perhaps I see a GUID coming back and decide to use it? What would happen then?
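A short Python illustration using the standard uuid module: nothing forces a client to generate an identifier randomly, so the same value can be typed in (or copied from a response) as often as they like. The literal value is arbitrary.

```python
import uuid

# Build identifiers by hand instead of generating them.
first = uuid.UUID("12345678-1234-5678-1234-567812345678")
second = uuid.UUID("12345678-1234-5678-1234-567812345678")

print(first == second)  # True: two "unique" ids that collide
```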
Now, there are a lot of things wrong with storing serialized objects without some type of typing information, especially when the “data” is only useful when deserialized back to an object. It is sloppy and prone to errors. But it is further exacerbated when you allow the client to dictate the key for the object.
In this blog entry I focused on not trusting user input. Rather than hit the common scenarios, however, I focused on allowing a user to choose an identity for the object in our system. While this might not seem like a common enough scenario to blog about, I have seen numerous systems cobbled together in this manner lately.
There are a couple of reasons why this is a very bad practice. First, the user might duplicate the key information and either overwrite data or cause an error. Second, and more important in an internal application, the user is now tightly coupled to our storage mechanism, making it impossible to change our implementation of storage without consulting everyone who is using the system.
After I aired my concerns about allowing a user to specify the primary key for the item in our data store, the decision was made to check for a duplicate and send back an exception if it already existed. This may sound like a good compromise, but it puts a lot of extra work on each side: checking and throwing the exception on our end, and handling exceptions (possibly even transaction logic) on their end. We can avoid all of this extra work if we simply do it right the first time.
Peace and Grace,