What should we call simulated data?

Data is not made. Data is born as a result of a measurement process. Taking measurements (in conjunction with a measurement theory) creates data. But then, what should we call, in contrast, the results of simulations, the output of theoretical models? Some might object that this is not an interesting question in the first place, just pointless (and rather nitpicky) semantics. I must disagree. The question is of critical importance, as “simulated data” is a contradiction in terms. Data concern the state of the real world, the world of things. The outputs of a simulation firmly belong to the realm of ideas. It is critical not to confuse the two, lest we make ill-informed statements about the real world based solely on observations drawn from models or simulations. This matters, as the financial crisis of 2008 impressively showed, when it became apparent that the if in “risk is only tamed if the real world risk fits the risk model” is indeed a big (and uncertain) if. Entire fields can be built on the confusion between models that work for idealized systems and the real world they are trying to account for, as Ricardo and the rise of economics show. That doesn’t mean one should attempt this.

Actual simulated data

With that in mind, what should we call “simulated data”? The term is self-contradictory, and it bugs me that the rise of data science makes me qualify actual data with the term “real”. “Predictions” would be needlessly imperialist, as many (if not most) models deal solely with postdiction (there is nothing inherently wrong with this). I tried to make “sata” happen, but that has not caught on (yet).

Any suggestions?


PS: Lest you think that I’m needlessly pedantic and that this is a distinction without a difference – it really does matter. Simulations, modeling and “simulated data” (data) all have a role in the scientific process. But they are no substitute for data. In an age of sophisticated modeling, it really does matter what one uses as a training (and test) set. Actual, real data – well recorded – is best. To wit: http://www.pnas.org/content/early/2016/06/27/1602413113.long

This entry was posted in Pet peeve, Philosophy, Science.

1 Response to What should we call simulated data?

  1. Erik says:

    Datum, in Latin, means “thing” or “given thing.” Let’s assume then that data are things that are given. They may be hard to find, but once you settle upon what the data are, they are given.

    Factura means “creation” or “performance” or “work.” It means something along the lines of “this thing that was made,” and can translate as “manufactured thing.”

    Simulations are indeed creations, can be a form of performance (see Markram), and they sure are work — these things don’t make themselves.

    I’m going with factura. Plural: facturae.
