Maybe that dastardly style sheet just won’t cascade elegantly on browser X. An incomplete comment chucks out some broken mark-up. Maybe you should have persisted those database connections after all. Hey, we all overlook things in the excitement of getting our first version running – but how many of these oversights can we happily stomach, and how many might just leave a bitter taste in our mouths and, more painfully, our client’s…
This article walks through the brainstorming stage of planning what is, in this instance, a hypothetical user-centric web application. Although you won’t be left with a complete project, nor a market-ready framework, my hope is that each of you, when faced with future workloads, may muse on the better practices described. So, without further ado… Are you sitting comfortably?
The Example
Our client has asked us to incorporate a book review system into an existing site. The site already has user accounts, and allows anonymous commentary.
After a quick chat with the client, we have the following specification to implement, and only twenty-four hours to do it:
Note: The client’s server is running PHP5, and MySQL – but these details are not critical to understanding the bugbears outlined in this article.
The Processes:
Our client has given us a PHP include to gain access to the database:
We don’t actually need the source to this file to use it. In fact, had the client merely told us where it lived we could have used it with an include statement and the $db variable.
On to authorisation… within the table schema we are concerned with the following columns:
- username, varchar(128) – stored as plain text.
- password, varchar(128) – stored as plain text.
Given that we’re working against the clock… let’s write, as quickly as we can, a reusable PHP function to authenticate our users:
$_REQUEST Variables
In the code above you will notice I’ve highlighted an area amber, and an area red.
Why did I highlight the not-so-dangerous $_REQUEST variables?
Although this doesn’t expose any real danger by itself, it does allow for a lax approach to client-side code. PHP has three arrays that most of us use to read data submitted by users ($_GET, $_POST and $_REQUEST), and more often than not we might be tempted to use $_REQUEST. This array conveniently gives our PHP access to both the POST and GET variables (and, depending on configuration, cookies), but herein lies a potential hang-up…
Consider the following scenario. You write your client-side code to use POST requests, but you hand the project over while you grab a break – and when you get back, your sidekick has written a couple of GET requests into the project. Everything runs okay – but it shouldn’t.
A little while later, an unsuspecting user types an external link into a comment box. Because the GET requests carry the credentials in the page URL, and browsers pass that URL along in the Referer header when a link is followed, that external site soon has a dozen username/password combinations in its referrer log.
By reading from $_POST instead of $_REQUEST, a stray GET request simply fails to log in, so the mistake shows up in testing and the risky request never gets published as working code.
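As a minimal sketch of that idea: the helper below only accepts credentials that arrived in the POST body. The function name read_credentials and the field names username and password are assumptions made for illustration, and the nullable return type assumes PHP 7.1 or later rather than the client's PHP5.

```php
<?php
// Hypothetical helper (not from the original listing): returns the submitted
// credentials, or null when they did not arrive as POST fields.
function read_credentials(array $post): ?array
{
    // Anything sent on the query string never reaches this array,
    // so a GET-based login attempt is rejected here.
    if (!isset($post['username'], $post['password'])) {
        return null;
    }
    // Reject array-style fields such as username[]=x.
    if (!is_string($post['username']) || !is_string($post['password'])) {
        return null;
    }
    return [
        'username' => $post['username'],
        'password' => $post['password'],
    ];
}

// In a page this would be called as: $credentials = read_credentials($_POST);
```

Passing the array in as a parameter, rather than reading $_POST inside the function, keeps the helper easy to exercise without a web server.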
The same principle applies to session identifiers. If you find you’re writing session variables into URLs, you’re either doing something wrong or you have a very good reason to do so.
SQL Injection
Referring again to the PHP code: the red highlighted line might have leaped out at some of you? For those who didn’t spot the problem, I’ll give you an example and from there see if something strikes you as risky…
This image makes clear the flaw in embedding variables directly into SQL statements. Although it can’t be said exactly how much control a malicious user could gain, it is guaranteed that if you string together an SQL statement this way, your server is barely protected. The example above is dangerous enough on a read-only account; the powers a read/write connection has are limited only by your imagination.
Protecting against SQL injection is actually quite easy. Let’s first look at the case of quote-enclosed string variables:
The quickest protection is to strip the enclosure characters or escape them. Since PHP 4.3.0 the function mysql_real_escape_string has been available to cleanse incoming strings (the mysql_* extension was later deprecated in PHP 5.5 and removed in PHP 7, where mysqli_real_escape_string is the equivalent). The function takes the raw string, plus an optional connection handle, and returns the string with the volatile characters escaped. However, mysql_real_escape_string doesn’t escape all the characters that act as control characters in SQL… the highlighted elements in the image below show the techniques I use to sanitise String, Number and Boolean values.
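Since the original image is not reproduced here, the following sketch shows the same kind of flaw. It is a hypothetical reconstruction rather than the article's listing: the function name build_login_query and the table name users are assumptions, and the code only builds the statement text without touching a database.

```php
<?php
// Hypothetical, deliberately unsafe query builder for illustration only.
function build_login_query(string $username, string $password): string
{
    // The flaw: user input is embedded directly into the SQL text.
    return "SELECT * FROM users WHERE username = '$username' AND password = '$password'";
}

// An honest login produces the statement we intended.
$honest = build_login_query('alice', 'secret');

// A hostile username closes the quote early and comments out the rest,
// so the password check never runs.
$hostile = build_login_query("alice' -- ", 'anything');

echo $honest, PHP_EOL;
echo $hostile, PHP_EOL;
```

In MySQL, everything after the "-- " sequence is treated as a comment, which is why the second statement no longer tests the password at all.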
The first highlight, the line that sets $string_b, uses a PHP function called addcslashes. This function has been part of PHP since version 4 and, used as written in the above example, is my preferred method for SQL string health and safety.
A wealth of information is available in the PHP documentation, but I’ll briefly explain what addcslashes does and how it differs from mysql_real_escape_string.
From the diagram above you can see that mysql_real_escape_string doesn’t add slashes to the percent (%) character.
The % character (like the underscore) is a wildcard in SQL LIKE patterns rather than a literal character, so it should be escaped with a preceding backslash wherever user input ends up in such a pattern. Note that MySQL only reads \% as an escaped percent sign in pattern-matching contexts; in an ordinary string literal the backslash is kept as part of the value.
The second parameter I pass to addcslashes, which is bold in the image, is the character group PHP will add slashes for. In most cases it will split the string you provide into characters, and then operate on each. It is worth noting that this character group can also be fed a range of characters, although that is beyond the scope of this article. In the scenarios we’re discussing, we can write alphanumeric characters literally, e.g. “abcd1234”, and all other characters either as their C-style literal, “\r\n\t”, or as their ASCII code, “\x0D\x0A\x09”.
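A minimal sketch of the addcslashes approach follows. The helper name escape_sql_string, the table and column names, and the exact character group (backslash, both quote styles, and the LIKE wildcards) are assumptions chosen for this example, not the article's original listing; on a current PHP version, parameterised queries through PDO or mysqli are the more usual defence.

```php
<?php
// Hypothetical helper: backslash-escapes the characters named in the group.
function escape_sql_string(string $raw): string
{
    // Character group: backslash, single quote, double quote, % and _.
    return addcslashes($raw, "\\'\"%_");
}

// The escaped value can then sit inside a quote-enclosed LIKE pattern.
$search = escape_sql_string("50% off 'today'");
$sql = "SELECT title FROM books WHERE title LIKE '%" . $search . "%'";

echo $sql, PHP_EOL;
```

The wildcards at either end of the pattern are ours and stay unescaped; only the user-supplied part in the middle is treated as literal text.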
The next highlight makes our number values safe for SQL statements.
This time we don’t want to escape anything; we just want nothing but a valid numerical value – be it an integer or floating point.
You might have noticed line 10, and perhaps wondered as to the purpose. A few years ago I worked on a call centre logging system that was using variable += 0; to ensure numerical values. Why this was done, I cannot honestly say… unless prior to PHP 4 that was how we did it?! Maybe somebody reading can shed some light on the subject. Other than that, if you, like I did, come across a line like that in the wild, you’ll know what it’s trying to do.
Moving forward then; lines 11 and 12 are all we need to prepare our numerical input values for SQL. I should say, had the input string $number_i contained any non-numerical characters in front of (that is, to the left of) the numerical ones… our values $number_a, $number_b and $number_c would all equal 0.
We’ll use floatval to clean our input numbers; PHP only prints decimal places when they exist in the input value – so printing them into an SQL statement won’t cause any errors if no decimal was in the input. As long as our server code is safe, we can leave the more finicky validating to our client side code.
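The number handling described above can be sketched as follows. The helper names clean_number and clean_integer, and the books table with its price column, are assumptions for illustration; floatval and intval are the standard PHP functions.

```php
<?php
// Hypothetical helpers: reduce raw input to a pure numeric value.
function clean_number(string $raw): float
{
    // Leading numeric characters are kept; anything after them is dropped.
    return floatval($raw);
}

function clean_integer(string $raw): int
{
    return intval($raw);
}

// An injection attempt is cut back to the number at the front of the string.
$sql = 'SELECT title FROM books WHERE price < ' . clean_number('12.50 OR 1=1');

echo $sql, PHP_EOL;
echo clean_number('abc12'), PHP_EOL; // non-numeric prefix, so the value is zero
```

Because the cleaned value is a float or an integer rather than a string, there is nothing left in it that could close a quote or start a new clause.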
Before we move on to a final listing for our PHP, we’ll glance at the final code highlight, the Boolean boxing.
Much like its C++ counterpart, a Boolean in PHP behaves as an integer in arithmetic: true + true equals 2. There are countless ways to translate an input string to a Boolean type, my personal favourite being: does the lower-case string contain the word “true”?
You may each have your own preferred methods: does the input string explicitly equal “true”, or is the input string “1”, etcetera… what is important is that the value coming in, whatever it might look like, is represented by a Boolean (or integer) before we use it.
My personal philosophy is simply, if X is true or false, then X is a Boolean. I’ll blissfully write all the code I might need to review later with Booleans and not short, int, tinyint or anything that isn’t Boolean. What happens on the metal isn’t my concern, so what it looks like to a human is far more important.
So, as with numbers and strings, our Booleans are guaranteed safe from the moment we pull them into our script. Moreover our hygienic code doesn’t need additional lines.
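The Boolean boxing described above can be sketched in a few lines. The function name to_bool and the approved column are assumptions for illustration; the test itself is the one named in the text, namely whether the lower-case input contains the word “true”.

```php
<?php
// Hypothetical helper: any input containing "true" (in any case) is true,
// everything else is false.
function to_bool(string $raw): bool
{
    return strpos(strtolower($raw), 'true') !== false;
}

// Cast to an integer when writing the flag into a statement, so that
// false becomes 0 rather than an empty string.
$flag = to_bool('TRUE');
$sql = 'UPDATE reviews SET approved = ' . (int) $flag;

echo $sql, PHP_EOL;
```

Whatever the user sends, the only values that can reach the statement are 0 and 1.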
Processing HTML
Now that we have protected our SQL from injections, and we’ve made certain only a POST login can work with our script, we are ready to implement our review submission feature.
Our client wants to allow review-enabled users to format their contributions as regular HTML. This would seem straightforward enough, but we also know that email addresses are ten a penny, and bookstore accounts are created programmatically – so in the better interests of everyone we’ll make sure only the tags we approve get through.
Deciding how we check the incoming review might seem daunting. The HTML specification has a rather wholesome array of tags, many of which we’re happy to allow.
As long-winded as the task might seem, I eagerly advise everyone: choose what to allow, and never what to deny. Browser and server mark-up languages all adhere to XML-like structuring, so we can base our code on the fundamental fact that executable code must be surrounded by, or be part of, angle-bracketed tags.
Granted, there are several ways we can achieve the same result. For this article I will describe one possible regular expression pipeline:
These regular expressions won’t produce a flawless output, but in the majority of cases – they should do a near elegant job.
Let’s take a look at the regular expressions we’ll be using in our PHP. You’ll notice two arrays have been declared, $safelist_review and $safelist_comment. This is so we can use the same functions to validate reviews and, later, comments:
…and here is the main function that we will call to sanitise the review and comment data:
The input parameters, I have highlighted red and blue. $input is the raw data as submitted by the user and $list is a reference to the expression array; $safelist_review or $safelist_comment depending of course on which type of submission we wish to validate.
The function returns the reformatted version of the submitted data: any tags that don’t pass any of the regular expressions in our chosen list are converted to their HTML-encoded equivalents. In the simplest terms, this turns < and > into &lt; and &gt;. Other characters are modified too, but none of these really pose a security threat to our client or the users.
Note: The functions: cleanWhitespace and getTags are included in the article’s source files.
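Because the original listings are not reproduced here, the following is a sketch of the allow-list idea under stated assumptions rather than the article's own function. The name sanitise_markup and the patterns in the two lists are hypothetical, and htmlspecialchars stands in for the article's cleanWhitespace and getTags helpers, which are not shown.

```php
<?php
// Hypothetical allow-lists: each entry is a pattern that one whole tag must match.
$safelist_comment = [
    '~^</?(b|i|em|strong)>$~i',
];
$safelist_review = array_merge($safelist_comment, [
    '~^</?(p|ul|ol|li|blockquote)>$~i',
    // Anchors: the href value must start with "/", "h" or "#".
    '~^<a href="[/h#][^"<>]*">$~i',
    '~^</a>$~i',
]);

function sanitise_markup(string $input, array $list): string
{
    // Split into text and tags; captured tags land on the odd indexes.
    $parts = preg_split('~(<[^<>]*>)~', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    $output = '';
    foreach ($parts as $index => $part) {
        $allowed = false;
        if ($index % 2 === 1) {
            foreach ($list as $pattern) {
                if (preg_match($pattern, $part) === 1) {
                    $allowed = true;
                    break;
                }
            }
        }
        // Approved tags pass untouched; everything else is HTML-encoded.
        $output .= $allowed ? $part : htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
    }
    return $output;
}

echo sanitise_markup('<b>Hi</b><script>x</script>', $safelist_comment), PHP_EOL;
```

Encoding everything that is not an approved tag, including stray angle brackets in the text, follows the advice to choose what to allow and never what to deny.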
You’d be correct to assume all we have really done is help preserve the aesthetics of our site’s pages, and not done everything to protect the user’s security. There still remains a rather enormous security hole even with the SQL safe, request spoofing cured and mark-up manipulated: JavaScript injection.
This particular flaw could be fixed by a few more regular expressions, and/or modifications to the ones we are already using. Our anchor regular expression only allows “/…”, “h…” and “#…” values as the href attribute – which is really only an example of a solution. Browsers across the board understand a huge variety of attributes that can run script, such as onClick, onLoad and so forth.
We have in essence created a thorny problem for ourselves: we wanted to allow HTML, but now we have a near-endless list of keywords to strip. There is, of course, a less than perfect but quite quickly written way to do this:
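One such quick and imperfect approach, sketched here as an assumption rather than the article's original listing, is to remove every attribute whose name begins with “on” from each tag. The function name strip_event_attributes is hypothetical.

```php
<?php
// Hypothetical helper: removes on* attributes (onclick, onload, ...) from tags.
function strip_event_attributes(string $html): string
{
    return preg_replace_callback(
        '~<[^<>]*>~',
        function (array $match): string {
            // Handles double-quoted, single-quoted and unquoted values.
            return preg_replace(
                '~\s+on[a-z]+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)~i',
                '',
                $match[0]
            );
        },
        $html
    );
}

echo strip_event_attributes('<a href="/x" onclick="alert(1)">x</a>'), PHP_EOL;
```

Text outside of tags is left alone, so a sentence such as “turn on = off” survives. The pattern is easily defeated by unusual mark-up, for example an attribute value that itself contains an angle bracket, which is exactly why it counts as less than perfect.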
On reflection you’d be absolutely justified in asking, “Why didn’t we just use BBCode or Textile or…?”
Myself, if I were dealing with mark-up processing, I might even go for XML walking. After all the incoming data should be valid XML.
However, this article is not meant to teach us how to regex, how to PHP or how to write anything in one particular language. The rationale behind it is simply: don’t leave any doors ajar.
So let’s finish off then, with a quick review of what we’ve looked at:
Although this article hasn’t equipped you with any off-the-shelf project, the primary purpose of my writing was not to scare away the designers who code, or nitpick the work of coders anywhere, but to encourage everyone to author robust code from the off. That said, I do plan to revisit certain elements of this article in more detail later.
Until then, safe coding!