Another Sentiers mainstay, Cory Doctorow, with a good piece on machine learning and how so many AI projects are flawed from the beginning: built on bad data, then compounded by a bad process. It’s not the first time I’ve included pieces treading similar ground; I’m including this one because of the mention of “thick descriptions,” a “thick understanding of context,” and “thick analysis.” A thick description is “a description of human social action that describes not just physical behaviours, but their context as interpreted by the actors as well, so that it can be better understood by an outsider.” (Wikipedia)
In much of ML, most data is very thin: its collection is often just a piled-on task for overworked and underpaid people, and the model is then built from it without any of that vital context.
ML practitioners don’t merely use poor quality data when good quality data isn’t available — they also use the poor quality data to assess the resulting models. When you train an ML model, you hold back some of the training data for assessment purposes. […]
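The point in that excerpt can be made concrete with a small sketch (hypothetical numbers, standard library only, not from Doctorow’s piece): if the labels are collected with a systematic error rate, the held-back evaluation slice carries the same errors, so even a model that learns the *true* rule looks only mediocre when scored against the noisy holdout.

```python
import random

random.seed(0)

# Hypothetical setup: the true rule is y = 1 when x > 0.5,
# but the annotator flips roughly 20% of labels.
NOISE_RATE = 0.2

def true_label(x):
    return 1 if x > 0.5 else 0

def noisy_label(x):
    y = true_label(x)
    return 1 - y if random.random() < NOISE_RATE else y

# Collect a "dataset" with the flawed labeling process.
xs = [random.random() for _ in range(1000)]
labeled = [(x, noisy_label(x)) for x in xs]

# Standard practice: hold back part of the labeled data for assessment.
train, holdout = labeled[:800], labeled[800:]

# Suppose the model perfectly learns the true rule anyway.
def model(x):
    return true_label(x)

# Scoring against the noisy holdout understates the model's real quality,
# because the evaluation data has the same flaws as the training data.
noisy_acc = sum(model(x) == y for x, y in holdout) / len(holdout)
true_acc = sum(model(x) == true_label(x) for x, y in holdout) / len(holdout)
print(f"accuracy vs. noisy holdout labels: {noisy_acc:.2f}")
print(f"accuracy vs. ground truth:         {true_acc:.2f}")
```

Against ground truth the model is perfect, but the noisy holdout reports roughly 80% accuracy; worse, a model that had instead memorized the annotator’s bias would score *higher* on that same holdout, which is exactly the evaluation trap the quote describes.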
Bad data makes bad models. Bad models instruct people to make ineffective or harmful interventions. Those bad interventions produce more bad data, which is fed into more bad models — it’s a “data-cascade.”