Yet another data migration problem

TL;DR Ensure data consistency while copying data between databases backed by the same RDBMS (PostgreSQL in this case).

The problem

Imagine you have two databases that shared the same parent in the past. As time goes by, some of the data might change in either of them. Now, you'd like to copy object A between the databases under the assumption that a copy is only created if there's no equal object in the destination database. The object might contain foreign keys, and such associations are also considered when checking equality.

Considerations

The easiest solution you'd think of is to dump the data you want and then restore it in the destination database. Such an approach, however, implies that you'd need a tool that takes only the data you want to copy: not the whole database or table, only object A with its associations. PostgreSQL provides pg_dump or COPY for data migrations, but neither of them lets you deal with associations easily. You'd then reach for some higher-level tool, e.g. any ORM you like, and handle the deep object copy yourself.
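For illustration, here's a minimal sketch of that limitation, assuming a hypothetical post id of 42: COPY exports rows of a single table (or query) at a time, so the associations need their own exports and manual reassembly.

COPY (SELECT * FROM posts WHERE id = 42) TO STDOUT WITH (FORMAT csv);
-- the associated tags need a separate COPY, and stitching the object graph
-- back together (including id conflicts) is left entirely to you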

To check for equality, you'd need some data to compare. The most obvious candidate would be the record id and its foreign keys. In this case, however, you're not guaranteed that a given id in database X and in database Y points to the same record. The ids may differ and comparing them would result in a mess.

Check for hash(database_X(A)) == hash(database_Y(A))

Another approach is to calculate a hash of the data you'd like to compare and then use hashes instead of ids. If the hashes match, you don't need to make a copy, and for further operations you can simply use the existing record's id.
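As a sketch, assuming both databases carry the footprint column described below, the copy decision boils down to a single lookup in the destination database (:source_footprint being a placeholder for the hash computed on the source side):

-- a row returned means an equal object already exists, so reuse its id;
-- no row means the object and its associations must be copied
SELECT id FROM posts WHERE footprint = :source_footprint;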

Build a hash of a record

To build such a hash, you'd add a trigger to your database with an appropriate function, e.g.:

CREATE OR REPLACE FUNCTION update_post_footprint_func()
RETURNS trigger AS $$
DECLARE raw_footprint text;
BEGIN

-- concatenate the columns that define the record's identity
raw_footprint := concat(NEW.title, NEW.content, NEW.owner_id);
-- and store their md5 hash on the row itself
NEW.footprint := md5(raw_footprint);

RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_post_footprint
BEFORE INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE PROCEDURE update_post_footprint_func();
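The trigger assumes the posts table already has a footprint column. A minimal setup might look as follows; the no-op update is there only to backfill footprints for rows that existed before the trigger was created:

ALTER TABLE posts ADD COLUMN IF NOT EXISTS footprint text;

-- fire the BEFORE UPDATE trigger once for every existing row
UPDATE posts SET title = title;

-- optional: speeds up equality lookups by footprint
CREATE INDEX IF NOT EXISTS posts_footprint_idx ON posts (footprint);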

The trigger function builds a new hash for the given record on each insert or update. As you may have noticed, this covers at most a 1:1 relationship and doesn't handle 1:N. For instance, a post record might have many tags. In that case you have two choices: either select the footprints of the dependencies (note that this implies every dependency has its own footprint), e.g.:

raw_footprint := concat(...,
(select array_to_string(array(select footprint from tags where post_id = NEW.id order by id ASC), '|')));

or build the parent footprint based on the dependency data, e.g.:

raw_footprint := concat(...,
  (select array_to_string(array(
     select name from tags
     inner join post_tags on tags.id = post_tags.tag_id
     where post_tags.post_id = NEW.id
     order by tags.id ASC), '|')));

The footprint build process is somewhat similar to the Russian doll caching pattern, except that the dependencies' footprints must be built before the record's footprint. This ordering, however, only matters when referring to dependency footprints directly.
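To keep the parent fresh when a dependency changes on its own, a companion trigger on the child table can help. The sketch below assumes the tags-with-post_id schema from the first variant, that tags maintain their own footprint via a similar trigger, and uses names of my own choosing:

CREATE OR REPLACE FUNCTION refresh_parent_post_footprint_func()
RETURNS trigger AS $$
BEGIN
  -- a no-op update on the parent fires update_post_footprint,
  -- which re-reads the already updated tag footprints
  IF TG_OP = 'DELETE' THEN
    UPDATE posts SET title = title WHERE id = OLD.post_id;
  ELSE
    UPDATE posts SET title = title WHERE id = NEW.post_id;
  END IF;
  RETURN NULL; -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER refresh_parent_post_footprint
AFTER INSERT OR UPDATE OR DELETE ON tags
FOR EACH ROW EXECUTE PROCEDURE refresh_parent_post_footprint_func();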

Possible issues

  1. Depending on the record's dependencies, you might need to build several triggers, each generating a sub-footprint that is finally assembled into the main footprint.
  2. Speed. Each trigger execution takes non-zero time, so whether it's worth it should be weighed against the use case. If footprints are rarely needed but inserts and updates are heavy, it might be better to compute the footprint on demand or within the app itself; see the sketch below.
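As a related alternative to the trigger approach, here's a sketch that computes the footprint only when a copy is about to happen, using the post_tags schema from the second variant (:post_id being a placeholder):

SELECT md5(concat(
  p.title, p.content, p.owner_id,
  (SELECT array_to_string(array(
     SELECT name FROM tags
     INNER JOIN post_tags ON tags.id = post_tags.tag_id
     WHERE post_tags.post_id = p.id
     ORDER BY tags.id ASC), '|'))
)) AS footprint
FROM posts p WHERE p.id = :post_id;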

Yet another Phoenix failure

Like many of you, some time ago I finished reading The Phoenix Project, and no, I won't write yet another review of how good or bad this book is. However, it seems there are two camps: one loves the novel, the other hates it. If you don't belong to either camp yet, come and join us. Perhaps you'll learn something, or just waste yet another several hours, not for the first time. Come and be a camper!

I won't write yet another review, but it seems there are Phoenix Projects everywhere, or at least projects that look like them. Today is Monday and I wanted to make a bank transfer. No chance, it didn't work. A crucial banking service has been inaccessible all day and they still haven't fixed it. Guess what: they migrated customers to a brand new platform with a completely new UI, perhaps even better than the previous one. There's just one thing: it doesn't work. So I tried to send a message through the system to report all the issues, but that failed too, again and again.

They probably spent thousands of hours working on the new platform, invested time and money, and when it came time to deliver, it simply failed. Of course they say they're aware of the issues and the whole IT department is working on them, but that's little comfort while everything is burning. This must never happen, especially at a bank, where money is involved.

We all want to be IT professionals, yet such things still happen, and I started pondering why. Is it simple math and probability: the internet has grown to a scale it has never reached before, and among thousands of online services some are bound to fail? Is it the pace of change in IT, so that no one can understand it all well? Is it the IT people, because they just don't care? Or, finally, is it management pressure, because whatever happens, the product must be delivered on time?

Such an app failure is not just another problem to solve. The point is that the whole migration process has failed, and from the customer's point of view the new product is completely unusable, no matter how it looks or how well it follows UX best practices. The business can't operate with such a product.

If you're familiar with this kind of situation, go waste those several hours and read The Phoenix Project.