Thursday, October 18, 2007

Discussing data representation

It's interesting how data representation discussions always teach me something. I've read countless papers about it, but it seems to take actually talking about it, arguing with different-minded people, for me to understand some of the concepts behind what exists.

So, today it was a discussion that actually opened my eyes to the "beauty" of RDF. I have to admit that it's sometimes hard to work with all of its verbosity. It's also strange to have to create intermediary "fake" elements just so that you can tie things together in a more coherent way. And the more elements you add, the more expensive calculations become, and so on.

Why is it good then? Because its very simple triple limitation provides you with what I'm starting to believe is one of the most powerful things in data representation: schema compatibility.

Schemas change all the time, and maintaining them is always going to be painful. However, if a change can't really break what is expected at a given section of your graph, everything becomes much more powerful. Let me try to explain with an example:

You are modeling a catalog of items and somebody tells you that you have a concept called "list price". The first thing that comes to your mind as a modeler is that list price has a single value, in the currency of a given marketplace. So that's how you model it and go on with your life, making good use of your new attribute.

Then comes a request to aggregate the data from an already existing, very similarly modeled catalog that has one important difference: it is better tailored to different marketplaces. When you start the import procedure, you suddenly realize that your assumption that "list price" is a single value for a single-currency marketplace is no longer valid. In some parts of the world, for products imported from neighboring countries, customers also like to see the list price in the original currency!

If your catalog is very database-driven, you are very likely looking at having to change your schema, maybe backfill all your items, and accept either some downtime or a period in which your catalog is in transition and therefore not completely consistent. It's a lot of pain that just keeps growing as your catalog grows.
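To make that pain a bit more concrete, here is a minimal sketch of what such a change could look like, assuming a relational catalog held in SQLite. The table names, column names, items and prices are all made up for illustration; a real catalog would be far larger and the backfill correspondingly slower.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original (hypothetical) model: exactly one list price per item, stored inline.
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, name TEXT, list_price REAL, currency TEXT)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [(1, "kettle", 25.0, "USD"), (2, "lamp", 40.0, "USD")],
)

# New requirement: an item may have several list prices,
# so prices have to move into a table of their own.
conn.execute(
    "CREATE TABLE list_prices ("
    "item_id INTEGER REFERENCES items(id), amount REAL, currency TEXT)"
)

# Backfill: every existing item has to be copied into the new table.
conn.execute(
    "INSERT INTO list_prices SELECT id, list_price, currency FROM items"
)

# Only now can the imported item's second price be stored.
conn.execute("INSERT INTO list_prices VALUES (2, 45.0, 'CAD')")
conn.commit()

lamp_prices = conn.execute(
    "SELECT amount, currency FROM list_prices "
    "WHERE item_id = 2 ORDER BY currency"
).fetchall()
backfilled_rows = conn.execute("SELECT COUNT(*) FROM list_prices").fetchone()[0]
```

Until every reader and writer has switched from the old columns to the new table, both copies of the price exist side by side, which is the transition period mentioned above.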

Now, if your underlying data representation were just this graph of RDF triples, adding a new edge of the same type wouldn't really cause any problems. You only have to fix the validation code and, maybe, add some filtering code, so that clients that can only consume the list price in the marketplace's currency keep seeing just the data they were used to seeing. And voilà, you are done. No backfill, no database cleanup, no database schema changes (but you do change your RDF schema to increase the cardinality of the connection restriction).
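Here is a rough sketch of the same change on a triple graph, using plain Python tuples rather than a real RDF library. Every identifier in it (`item:lamp`, `listPrice`, `inMarketplace`, the `_:p1` and `_:p2` nodes, the prices) is hypothetical. Note that the `_:p` nodes are exactly the intermediary "fake" elements mentioned earlier: they exist only to tie an amount to its currency.

```python
# Triples are plain (subject, predicate, object) tuples.
triples = {
    ("item:lamp", "inMarketplace", "market:us"),
    ("market:us", "currency", "USD"),
    ("item:lamp", "listPrice", "_:p1"),  # intermediary node for one price
    ("_:p1", "amount", 40.0),
    ("_:p1", "currency", "USD"),
}


def objects(subject, predicate):
    """Return every object reachable from subject through predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def list_prices(item):
    """Return all (amount, currency) pairs of an item, sorted by currency."""
    prices = []
    for node in objects(item, "listPrice"):
        prices.append((objects(node, "amount")[0], objects(node, "currency")[0]))
    return sorted(prices, key=lambda price: price[1])


def marketplace_list_prices(item):
    """Filter for older clients: keep only prices in the marketplace's currency."""
    market = objects(item, "inMarketplace")[0]
    local_currency = objects(market, "currency")[0]
    return [price for price in list_prices(item) if price[1] == local_currency]


# The import adds a second edge of the same type. Nothing existing is touched.
triples |= {
    ("item:lamp", "listPrice", "_:p2"),
    ("_:p2", "amount", 45.0),
    ("_:p2", "currency", "CAD"),
}

all_prices = list_prices("item:lamp")
legacy_view = marketplace_list_prices("item:lamp")
```

The existing triples are never rewritten. The only new code is the filter, which lets older clients keep seeing a single list price in the marketplace's currency.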

Oh, well, unfortunately sometimes you just have to live with other options due to optimization, so that you can handle millions of updates a day with a reasonably small fleet of machines. It's what engineers were bred to do, I guess.
