Skip to content


Quick and dirty upgrading a SaaS (Software-as-a-Service) when then there is bad data

One of the under-appreciated, but very important, parts of running a SaaS (Software-as-a-Service) business is the upgrade path. This is particularly hard when you are trying to be a lean startup.

The pure SAAS (Software-as-a-Service) model means “no installed” software on the client machine. In the ideal case, a new server is pushed to production instantly and the users never notice any downtime. They just suddenly have a better experience.

The reality is not that easy. This is particularly true if the new release includes bug fixes to bad data or misplaced data problems. Now you have to make sure that the bug fix doesn’t cause new bugs.

For example, lets say that there was a bug when storing user email addresses. The bug was that email addresses where not stripped of whitespace before and after the email address. As a result, the database had email entries:

USER_EMAIL table:

ID EMAIL_ADDRESS (the problem)
1 ‘ konstantin@mailinator.com’ space at beginning
2 ‘konstantin@mailinator.com ‘ space at end
3 ‘konstantin@mailinator.com’

The EMAIL_ADDRESS column is also UNIQUE.

Furthermore, USER_EMAIL table ID maybe a foreign key in other tables, fixing the existing data maybe more trouble than it is worth.

For this post, lets just assume that there is no foreign key problem. The next problem is that updating the existing rows will result in violating the UNIQUE constraint.

Lastly, there is this nagging (or there should be) question in the back of your head: “Is my fix as complete as I think it is?” Some where was there a query that will be looking for the email address with the space at the end? An audit table perhaps? Some sales report that will now no longer run correctly “konstantin@mailinator.com “?

Important NOTE: give yourself an escape path:

  1. Backup the database
  2. Test your fixes on a backup

It is for these doubts of invincibility, my approach is to add an new column that is not UNIQUE but just indexed.

new USER_EMAIL table:

ID EMAIL_ADDRESS NORMALIZED_EMAIL
1 ‘ konstantin@mailinator.com’ konstantin@mailinator.com
2 ‘konstantin@mailinator.com ‘ konstantin@mailinator.com
3 ‘konstantin@mailinator.com’ konstantin@mailinator.com

I then modify the queries that search for email address to look at the NORMALIZED_EMAIL column.
For new entries, both columns are written with the same whitespace stripped values.

Usually the biggest problem are queries that are expecting a single row being returned now returning multiple rows ( after all EMAIL_ADDRESS was supposed to be unique )

After running local tests, then push to production. Let everything run for a while. Confirm that the code works as expected.

The next step is to rename the EMAIL_ADDRESS column to WHITE_SPACED_EMAIL_ADDRESS and see if there any queries that break. Fix the broken queries and repeat.

At this point, revisit and determine if the duplicate (if considering just the NORMALIZED_EMAIL column ) can be consolidated.

These steps are not the most complete and they are not the most bulletproof. But for a smaller startups, they are “good enough”.

Posted in technical.


0 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.



Some HTML is OK

or, reply to this post via trackback.