Overview of Audio Testing Methodologies
How Many Blind People Does it Take to Know that Your Speakers are the Best?
originally published October 10th, 2005
Double-blind, single-blind, ABX…all these terms and more are bandied about on the forums and elsewhere when discussing someone's subjective experience with a piece of audio equipment. This editorial is meant as a primer on the testing methodologies used to validly measure someone's subjective experiences. It is not a definitive guide, nor is it meant to be. It is also not going to be a technical paper - I will use as many examples, with as few numbers, as possible. I want this to be accessible to all.
History - What's the deal with all this anyway?
Many moons ago, there was only experience. You put your hand in a fire, it burned, you told your caveman buddies not to do that, and…they probably did it anyway. Whether cavemen debated which part of the dino was best may never be known, but we do know that only a battle royale to the death could "prove" one of them right. Why? Because it was all opinion…opinion without scientific fact to back it up.
Well, time marched on and lo and behold, the deductive method was invented by Euclid (still used today in mathematics). In simple terms, one creates simple true statements (axioms or postulates) then builds theorems based on them. Anyone flashing back to high school geometry? I am. The deductive method is great - for mathematics. But what if you don't have simple truths to build on?
Enter Sir Francis Bacon (named after my favorite pork product). Sir Bacon came up with the inductive method (commonly referred to as the Scientific Method). The Scientific Method is basically:
Observe and describe
Create a hypothesis (theory for why what you observed happened)
Make predictions based on your hypothesis
Experiment to see if your predictions hold true
Of course, step five is "repeat." At this point, an example is in order:
Example 1: You are walking down the street. You notice a pain in your foot. You look, and there is a tack embedded in your foot (observation). You suspect that the tack is the cause of your considerable discomfort (hypothesis). You think that if you remove the tack, the pain will subside (predict). You pull the tack out of your foot and the pain does reduce considerably, though some is still present (experiment). Now comes the dreaded step 5 - repeat. Ouch.
So, what has this got to do with audio, you say? Well, this is the deal: The deductive method works great for testing audio equipment. No lie. No BS. Absolute truth. You put a signal in, you measure it when it comes out, if it is the same, IT'S THE SAME. Unfortunately, not everyone buys into this little piece of common sense. Some believe that there are real quantifiable differences that can't be measured. Hey, I'm not an electrical or mechanical engineer so who am I to argue?
Statistics to the Rescue: Correlation does not equal causation
If you've ever taken a statistics course, you'll have heard this phrase a million times: Correlation does not equal causation. But what does it mean? Looking back at Example 1, the alleviation of the pain in your foot coincided (correlated) with the removal of the tack. So the tack caused the pain, right? Not to a statistician. The statistician only knows that the pain went away when the tack was removed. That could be a coincidence. No, the statistician is going to insist that you, all your friends, and a bunch of people of various ethnicities, income levels, geographic locations, and genders all stick yourselves multiple times in the foot with a tack before he'll buy that the tack was probably the cause of the pain. Idiotic, yes? No:
Example 2: A crime statistician discovers that with a high degree of correlation the incidence of domestic violence increases with the sales of ice cream. He, of course, runs home and throws out all the ice cream in the house to make sure he doesn't beat his wife. He immediately gets the crap kicked out of him by his very pregnant, extremely unreasonable better half who told him to thank his lucky stars he didn't throw out the pickles too.
Where did he go wrong? He forgot the first rule of statistics: Correlation does not equal causation. Just because domestic violence increases along with ice cream sales, does not mean that ice cream was the cause of the increase in domestic violence. It could be that the sugar in the ice cream was making people crazy or it could be that the beaten partners are buying more ice cream to console themselves. More likely, it is that ice cream is bought during the summer, and it is hot, and people are uncomfortable, and they tend to get out more, and Lord knows what else. There are a lot of reasons, but eating ice cream is probably not one of them.
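The ice cream story can be sketched as a quick simulation. This is a toy model of my own invention, not real crime data: a hidden "confounder" (temperature) drives both invented series, which never touch each other, yet they end up strongly correlated.

```python
import random

random.seed(42)

# Hypothetical monthly data: temperature is the hidden confounder that
# drives BOTH ice cream sales and (in this toy model) incident reports.
temps = [random.uniform(30, 95) for _ in range(120)]         # degrees F
ice_cream = [t * 10 + random.gauss(0, 40) for t in temps]    # sales
incidents = [t * 0.5 + random.gauss(0, 5) for t in temps]    # reports

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Ice cream never influences incidents in the model above, yet their
# shared dependence on temperature makes them correlate strongly.
print(round(pearson(ice_cream, incidents), 2))
```

The correlation comes out high even though, by construction, neither variable causes the other - exactly the trap the statistician's mantra warns about.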
Well, what does this have to do with audio…I'm getting to that, hold on and keep reading.
Case Study Methodology: What do you mean my own experiences don't count?
How many times have you been in an argument with someone over something where you had completely different experiences with the exact same thing?
"That movie was awesome!"
"No it wasn't, are you kidding me, a cappuccino enema couldn't keep me awake during that snoozefest."
"Man, you are on crack, that was the best thing I've ever seen!"
What's the problem here? Let's say these two friends sat in the same theater at the same time right next to each other (probably with an empty chair between them because…well, you know) so you are reasonably sure they had the same experience. How could they have such different opinions? We all know the answer to that; they see the movie through their own biases. One loves action flicks, one loves sappy dramas. Of course one is going to hate it. They were biased against it before they even walked in (and conversely you could say that the other was biased for it before they even walked in).
This type of methodology is used extensively in our everyday lives, by anthropologists and (interestingly enough) on NPR. It is called the Case Study Method. The Case Study Method generally looks at one (or a small group of) subject(s) and reports on their experiences. Most of us use this method daily when we ask a trusted friend which restaurant to go to, what movie to see, how they like their new car, etc. Anthropologists (not all, mind you) use this method when they embed themselves in an isolated tribe in the middle of nowhere then write a book about them. NPR uses it when they report on one person's experience about a particular issue (though one assumes that they are using that person's experience as corroborating evidence for the general state of things).
Now the audio link becomes clear, doesn't it? Someone tells you they have the best speakers ever, you can now point at them like the pod people at the end of "Invasion of the Body Snatchers" screaming, "Case Study, Case Study." Hold on, says I. There is nothing wrong with the case study method. Huh? Wh…Wh…What? Yep, perfectly reasonable. Nothing wrong with it… As long as you know the limitations. And you know what, we all do. You have a friend that has your same taste in action flicks, but not sci-fi. You know that you tend to agree with Ebert over Roeper (or vice versa). You make these mental modifications all the time without thinking about it. What you are doing is making adjustments to how much stock you put in their opinion based on your knowledge of their biases. See, we are all amateur statisticians!
Bias and Error: The whack-a-moles of the statistical world
The Case Study Method is fraught with bias - true. A good researcher will be trained to recognize their own biases and put them aside (for the most part) but bias is an ugly animal. It pops up when and where you least expect it. That is why professional reviewers tend to get more weight than lay people: They are not only experts in their field (presumably) but they have ways of controlling their own biases, with varying degrees of success. Well, the trained researcher is not satisfied taking someone else's word for it; they want something more scientific.
First, before we go any further, let's talk about a little creature called Error. Error comes in two flavors: Systematic and Random. Systematic bias (or systematic error) is something that affects the outcome of your experiment in ONE direction. Random error is something that is equally likely to affect the outcome of your experiment in either direction. And…an example is in order:
Example 3: Children at the local Junior College are taking their statistics finals. One of the questions on the test is, "Correlation does not equal __________." The instructor forgot to take down his favorite banner displaying that exact phrase before the test. Every student got that question right. Unfortunately, when the teacher was grading the tests, he was enjoying a glass (or two or three) of his favorite single malt and he tended to grade the later tests a bit more leniently than the first ones.
So, which is which? Well, since the banner affected all the tests in the same way (everyone got it right), it is Systematic Bias. Since the order of the tests was random (presumably), the effect of the professor's increasing inebriation was Random error. Statisticians try to control for (eliminate) as much systematic bias as possible and randomize the rest.
Whoa, whoa, whoa there cowboy, run that by me again.
Controlling for bias is trying, to the best of your ability, to eliminate the sources of error or bias. Random error is really impossible to control for and technically there is no need. If it is just as likely to affect the outcome one way or the other, it won't reliably change your results. Crap, there I go throwing around technical terms again assuming you all know what I mean:
Example 4: When you step on the scale at the doctor's office and each time it says you are fat, that is reliable. When you step on your scale at home and it says you are fat, you step off and step on again and it says you are slightly less fat, you step off and step on again, and it says you are slightly more fat, that is unreliable. Your weight hasn't changed in the 14 nanoseconds from 1st to 2nd to 3rd weighing, but the measure has, and if it's as likely to indicate that you're slightly fatter or slightly thinner, that's unreliability. If it consistently indicates that you're growing fatter with each successive weigh-in, even those that are only seconds apart, that's bias.
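The bathroom-scale example can be put in code. This is a minimal sketch with made-up numbers: one simulated scale whose readings jitter randomly in both directions, and one whose readings are all pushed in one direction. Averaging many readings washes out the random jitter but not the one-way push.

```python
import random
import statistics

random.seed(1)
true_weight = 180.0  # hypothetical true weight, in lbs

# Unreliable scale: random error jitters each reading in BOTH directions.
unreliable = [true_weight + random.gauss(0, 2) for _ in range(1000)]

# Biased scale: every reading is shoved in ONE direction (here, +3 lbs),
# with the same random jitter on top.
biased = [true_weight + 3 + random.gauss(0, 2) for _ in range(1000)]

# Random error averages away; systematic bias does not.
print(round(statistics.mean(unreliable), 1))  # close to 180
print(round(statistics.mean(biased), 1))      # close to 183
```

The unreliable scale's average lands near the true weight, while the biased scale stays stubbornly off by its built-in push - which is exactly why statisticians randomize what they can't control and hunt down what they can.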
Engineers have -ometers and -ographs that measure things reliably to the nano-whatever. They don't have to worry that much about reliability. Their tools are built to measure reliably. Statisticians are constantly worried about reliability because of the havoc it wreaks. An unreliable test or measure may be unreliable because of random error (which makes differences harder to detect) or systematic error (changing the scores all in one direction). Regardless, unreliability is bad.
But back to controlling and randomizing bias: When a source of systematic bias is identified, thousands of statisticians, like little cockroaches, scurry around trying to identify all the ways it can be eliminated from the experiment. In example 3 above, the source of systematic bias can be controlled by…yep, removing the banner. Not too hard, eh? But what about when the source of systematic bias is something else...? Something harder to control?
Example 5: You volunteer for your local congressperson who is up for reelection. You are tasked with finding out what the people want. A beautiful survey is developed that, because of time and cost, will be administered over the phone. Everything is going wonderfully until someone brings to your attention that there are large segments of the community that don't have ready access to or don't have time to use the phone…Elderly in nursing homes, the working poor, some youth of voting age, etc. You look over the demographics of your results and you realize, yep, the ages of the respondents are between 28 and 56. The congressperson is not going to be pleased.
You've got systematic bias going on - a bad case of it. How do you fix it? Well, you switch methodologies. You do face-to-face interviews door-to-door. You send out mailings. You hold town meetings. Finally, just when you've gathered opinions from nearly everyone in the area, that same helpful colleague mentions that your results might differ not because of the groups, but because people answer questions differently in a face-to-face interview, a telephone interview, a paper survey, or in front of a group at a community meeting. Ack!
What about random error? Well, this is much easier to deal with…ignore it. *Gasp* Yep, that's right, ignore it. Believe it or not, random error is your friend. Let's go with an example shall we?
Example 6: The classic random example is a coin toss. Half the time it comes up heads, the other half it comes up tails. Weigh yourself. Write that number down. Now flip a coin 10 times; each time it comes up heads, add 10lbs to your weight and write that number down. Each time it comes up tails, subtract 10lbs from your weight and write that number down. After all 10 flips, take the average of the numbers you wrote down. Statistically, it should be close to your original weight - and the more times you flip, the closer the average gets.