Quick and dirty upgrading a SaaS (Software-as-a-Service) when then there is bad data

One of the under-appreciated, but very important, parts of running a SaaS (Software-as-a-Service) business is the upgrade path. This is particularly hard when you are trying to be a lean startup.

The pure SAAS (Software-as-a-Service) model means “no installed” software on the client machine. In the ideal case, a new server is pushed to production instantly and the users never notice any downtime. They just suddenly have a better experience.

The reality is not that easy. This is particularly true if the new release includes bug fixes to bad data or misplaced data problems. Now you have to make sure that the bug fix doesn’t cause new bugs.

For example, lets say that there was a bug when storing user email addresses. The bug was that email addresses where not stripped of whitespace before and after the email address. As a result, the database had email entries:

USER_EMAIL table:

ID EMAIL_ADDRESS (the problem)
1 ‘ [email protected] space at beginning
2 [email protected] ‘ space at end
3 [email protected]

The EMAIL_ADDRESS column is also UNIQUE.

Furthermore, USER_EMAIL table ID maybe a foreign key in other tables, fixing the existing data maybe more trouble than it is worth.

For this post, lets just assume that there is no foreign key problem. The next problem is that updating the existing rows will result in violating the UNIQUE constraint.

Lastly, there is this nagging (or there should be) question in the back of your head: “Is my fix as complete as I think it is?” Some where was there a query that will be looking for the email address with the space at the end? An audit table perhaps? Some sales report that will now no longer run correctly “[email protected] “?

Important NOTE: give yourself an escape path:

  1. Backup the database
  2. Test your fixes on a backup

It is for these doubts of invincibility, my approach is to add an new column that is not UNIQUE but just indexed.

new USER_EMAIL table:

ID EMAIL_ADDRESS NORMALIZED_EMAIL
1 ‘ [email protected] [email protected]
2 [email protected] ‘ [email protected]
3 [email protected] [email protected]

I then modify the queries that search for email address to look at the NORMALIZED_EMAIL column.
For new entries, both columns are written with the same whitespace stripped values.

Usually the biggest problem are queries that are expecting a single row being returned now returning multiple rows ( after all EMAIL_ADDRESS was supposed to be unique )

After running local tests, then push to production. Let everything run for a while. Confirm that the code works as expected.

The next step is to rename the EMAIL_ADDRESS column to WHITE_SPACED_EMAIL_ADDRESS and see if there any queries that break. Fix the broken queries and repeat.

At this point, revisit and determine if the duplicate (if considering just the NORMALIZED_EMAIL column ) can be consolidated.

These steps are not the most complete and they are not the most bulletproof. But for a smaller startups, they are “good enough”.

This entry was posted in technical. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *