Overview of Audio Testing Methodologies
How Many Blind People Does it Take to Know that Your Speakers are the Best?
originally published October 10th, 2005
Double-blind, single-blind, ABX: all of these terms and more are bandied about on forums and elsewhere when discussing someone's subjective experience with a piece of audio equipment. This editorial is meant as a primer for the testing methodologies used to validly measure someone's subjective experiences. This is not a definitive guide, nor is it meant to be. It is also not going to be a technical paper; I will use as many examples with as few numbers as possible. I want this to be accessible to all.
History - What's the deal with all this anyway?
Many moons ago, there was only experience. You put your hand in a fire, it burned, you told your caveman buddies not to do that, and…they probably did it anyway. Whether or not cavemen debated which part of the dino was the best might never be known, but we do know that only a battle to the death could "prove" one of them right. Why? Because it was all opinion: opinion without scientific fact to back it up.
Well, time marched on, and lo and behold, the deductive method (still used today in mathematics) was invented by Euclid. In simple terms, one creates simple, true statements (axioms or postulates), then builds theorems based upon them. Anyone flashing back to high school geometry? I am. The deductive method is great - for mathematics. But, what if you don't have simple truths upon which to build?
Enter Sir Francis Bacon (named after my favorite pork product). Sir Bacon came up with the inductive method (commonly referred to as the Scientific Method). The Scientific Method is basically:
Observe and describe
Create a hypothesis (theory for why what you observed happened)
Make predictions based on your hypothesis
Experiment to see if your predictions hold true
Of course, step five is, "repeat." At this point, an example is in order:
Example 1: You are walking down the street. You notice a pain in your foot. You look, and there is a tack embedded in your foot (observation). You suspect that the tack is the cause of your considerable discomfort (hypothesis). You think that if you remove the tack, the pain will subside (predict). You pull the tack out of your foot, and there is a considerable reduction of the pain, though some is still present (experiment). Now comes the dreaded step 5 - repeat. Ouch.
So, what has this got to do with audio, you say? Well, this is the deal: the deductive method works well for testing audio equipment. No lie. No BS. Absolute truth. You put a signal in, you measure it when it comes out; if it is the same, it's the same. Unfortunately, not everyone buys into this little piece of common sense. Some believe that there are real, quantifiable differences that can't be measured. Hey, I'm not an electrical or mechanical engineer, so who am I to argue?
Statistics to the Rescue: Correlation does not equal causation
If you've ever taken a statistics course, you'll have heard this phrase a million times: correlation does not equal causation. But, what does it mean? Looking back at Example 1, the alleviation of the pain in your foot coincided (correlated) with the removal of the tack. So, the tack caused the pain, right? Not to a statistician. The statistician only knows that the pain went away when the tack was removed. That could be a coincidence. No, the statistician is going to insist that you, all your friends, and a bunch of people of various ethnicities, income levels, geographic locations, and genders all stick yourselves multiple times in the foot with a tack before he'll buy that the tack was probably the cause of the pain. Idiotic? No:
Example 2: A crime statistician discovers that, with a high degree of correlation, the number of incidents of domestic violence increases with increased sales of ice cream. He, of course, runs home, and throws out all the ice cream in the house to make sure he doesn't beat his wife. He immediately gets the crap kicked out of him by his very pregnant, extremely unreasonable better half who told him to thank his lucky stars he didn't throw out the pickles, too.
Where did he go wrong? He forgot the first rule of statistics: correlation does not equal causation. Just because domestic violence increases along with ice cream sales, that does not mean that ice cream was the cause of the increase in domestic violence. It could be that the sugar in the ice cream was making people crazy, or it could be that the beaten partners are buying more ice cream to console themselves. More likely, it is that ice cream is purchased more often during the summer, and it is hot, and people are uncomfortable, and they tend to get out more, and Lord knows what else. There are a lot of potential reasons, but eating ice cream is probably not one of them.
Well, what does this have to do with audio? I'm getting to that; hold on and keep reading.
Case Study Methodology: What do you mean my own experiences don't count?
How many times have you been in an argument with someone because the two of you had completely different experiences with the exact same thing?
"That movie was awesome!"
"No, it wasn't. Are you kidding me? A cappuccino enema couldn't have kept me awake during that snoozefest."
"Man, you are on crack. That was the best thing I've ever seen!"
What's the problem here? Let's say these two friends sat in the same theater, at the same time, right next to each other (probably with an empty chair between them because, well, you know); so, you are reasonably sure they had the same experience. How could they have such different opinions? We all know the answer to that: they see the movie through their own biases. One loves action flicks; one loves sappy dramas. Of course one is going to hate it. He was biased against it before they even walked in (and conversely, you could say that the other was biased for it before they even walked in).
This type of methodology is used extensively in our everyday lives, by anthropologists, and (interestingly enough) on NPR. It is called the Case Study Method. The Case Study Method generally looks at one (or a small group of) subject(s), and reports on their experiences. Most of us use this method daily when we ask a trusted friend which restaurant to go to, what movie to see, how they like their new car, etc. Anthropologists (not all, mind you) use this method when they embed themselves in an isolated tribe in the middle of nowhere, then write a book about them. NPR uses it when they report on one person's experience with a particular issue (though one assumes that they are using that person's experience as corroborating evidence for the general state of things).
Now, the audio link becomes clear, doesn't it? Someone tells you he has the best speakers ever. You can now point at him like the pod people at the end of "Invasion of the Body Snatchers" screaming, "Case Study, Case Study." Hold on, says I. There is nothing wrong with the case study method. Huh? Wh…Wh…What? Yep, perfectly reasonable. Nothing wrong with it - as long as you know the limitations. And you know what? We all do. You have a friend who has the same preference as you for action flicks, but not sci-fi. You know that you tend to agree with Ebert over Roeper (or vice versa). You make these mental modifications all the time without thinking about it. What you are doing is making adjustments to how much stock you put in their opinion based on your knowledge of their biases. See, we are all amateur statisticians!
Bias and Error: The whack-a-moles of the statistical world
The Case Study Method is fraught with bias - true. A good researcher will be trained to recognize his or her own biases, and put them aside (for the most part). But, bias is an ugly animal. It pops up when and where you least expect it. That is why professional reviewers tend to be given more weight than laypeople: they are not only experts in their field (presumably), but they have ways of controlling their own biases, with varying degrees of success. Well, the trained researcher is not satisfied taking someone else's word for it; he or she wants something more scientific.
First, before we go any further, let's talk about a little creature called error. Error comes in two flavors: systematic and random. Systematic bias (or systematic error) is something that affects the outcome of your experiment in one direction. Random error is something that is equally likely to affect the outcome of your experiment in either direction. An example is in order:
Example 3: Children at the local Junior College are taking their statistics finals. One of the questions on the test is, "Correlation does not equal __________." The instructor forgot to take his favorite banner displaying that exact phrase down before the test. Every student got that question right. Unfortunately, when the teacher was grading the tests, he was enjoying a glass (or two or three) of his favorite single malt, and he tended to grade the later tests a bit more leniently than the first ones.
So, which is which? Well, since the banner affected all the tests in the same way (everyone got it right), it is systematic bias. Since the order of the tests was random (presumably), the effect of the professor's increasing inebriation was random error. Statisticians try to control for (eliminate) as much systematic bias as possible, and randomize the rest.
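The distinction is easy to see in a quick simulation (a sketch only; the 75-point "true" score, the banner bonus, and the size of the grading wobble are all invented numbers). Systematic bias shifts every score in the same direction, while random error scatters scores both ways and tends to wash out on average:

```python
import random

random.seed(42)

true_scores = [75.0] * 1000  # every student's "true" exam score

# Systematic bias: the banner hands everyone the same question for free,
# pushing every score up by the same amount.
banner_bonus = 2.0
biased = [s + banner_bonus for s in true_scores]

# Random error: the increasingly tipsy grading nudges each score
# up or down unpredictably.
noisy = [s + random.uniform(-5, 5) for s in true_scores]

def avg(xs):
    return sum(xs) / len(xs)

print(avg(true_scores))  # 75.0
print(avg(biased))       # 77.0 - shifted one way, no matter how many tests
print(avg(noisy))        # roughly 75 - the ups and downs mostly cancel
```

Note that no amount of extra data fixes the banner: the biased average stays 2 points high forever, while the noisy average creeps back toward the truth as the class grows.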
Whoa, whoa, whoa there, cowboy. Run that by me again?
Controlling for bias is trying, to the best of your ability, to eliminate the sources of error or bias. Random error is really impossible to control for, and technically there is no need to do so. If it is just as likely to affect the outcome one way or the other, it won't reliably change your results. Crap, there I go throwing around technical terms again, assuming you all know what I mean:
Example 4: When you step on the scale at the doctor's office, and, every time, it says you are fat, that is reliable. When you step on your scale at home, and it says you are fat, you step off and step on again, and it says you are slightly less fat. You step off and step on again, and it says you are slightly more fat. Your weight hasn't changed in the 14 nanoseconds from 1st to 2nd to 3rd weighing, but the measurement has. So, that is unreliable. And if it's as likely to indicate that you're slightly fatter as it is to indicate that you're slightly thinner, that's random error. If it consistently indicates that you're growing fatter with each successive weigh-in, even those that are only seconds apart, that's systematic bias.
Engineers have -ometers and -ographs that measure things reliably to the nano-whatever. They don't have to worry that much about reliability. Their tools are built to measure reliably. Statisticians are constantly worried about unreliability because of the havoc it wreaks. An unreliable test or measure might be unreliable because of random error (which makes differences harder to detect) or systematic error (changing the scores all in one direction). Regardless, unreliability is bad.
But, back to controlling and randomizing bias: When a source of systematic bias is identified, thousands of statisticians, like little cockroaches, scurry around trying to identify all the ways it can be eliminated from the experiment. In Example 3 above, the source of systematic bias can be controlled by, yep, removing the banner. Not too hard, eh? But, what about when the source of systematic bias is something else? Something harder to control?
Example 5: You volunteer for your local congressperson who is up for reelection. You are tasked with finding out what the people want. A beautiful survey is developed that, because of time and cost, will be administered over the phone. Everything is going wonderfully until someone brings it to your attention that there are large segments of the community who don't have ready access, or don't have time, to use the phone: elderly people in nursing homes, the working poor, some youths who are of voting age, etc. You look over the demographics of your results, and you realize, yep, the ages of the respondents are between 28 and 56. The congressperson is not going to be pleased.
You've got systematic bias going on - a bad case of it. How do you fix it? Well, you switch methodologies. You do face-to-face interviews door-to-door. You send out mailings. You hold town meetings. Finally, you get opinions from most everyone in the area, until that very same helpful someone mentions that your results might differ, not because of the groups, but because people answer questions differently in a face-to-face interview, a telephone interview, a paper survey, or in front of a group at a community meeting. Ack!
What about random error? Well, this is much easier to deal with: ignore it. *Gasp* Yep, that's right, ignore it. Believe it or not, random error is your friend. Let's go with an example, shall we?
Example 6: The classic random example is a coin toss. Half the time, it comes up heads; the other half, it comes up tails. Weigh yourself. Write that number down. Now, flip a coin 20 times. Each time it comes up heads, add 10lbs to your weight, and write that number down. Each time it comes up tails, subtract 10lbs from your weight, and write that number down. After 20 flips, take the average of the numbers you wrote down. Statistically, it should be very close to your original weight, and the more flips you do, the closer it gets.
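If you don't feel like flipping a coin 20 times, a few lines of code will do it for you (the 180lb starting weight is just a made-up number):

```python
import random

weight = 180.0  # hypothetical "true" weight
readings = []
for _ in range(20):
    # heads adds 10lbs to the reading, tails subtracts 10lbs
    flip = random.choice(["heads", "tails"])
    readings.append(weight + 10 if flip == "heads" else weight - 10)

average = sum(readings) / len(readings)
print(average)  # every single reading is off by 10lbs, yet the average
                # usually lands near 180
```

Every individual reading is wrong by a full 10lbs, but because the errors are as likely to go up as down, they largely cancel in the average. Bump the 20 up to 2,000 and the average hugs the true weight even tighter.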
So, how does this make random error your friend? Say that you know that a specific question always gets vastly different responses based on a person's gender, ethnicity, and income. If you randomly select the people you ask, then it is reasonable to expect that half of your group will be predisposed (biased) to answer one way, while the other half will be predisposed to answer the other way, effectively canceling out the biases. See, you've taken a systematic error and randomized it! Brilliant!
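That canceling trick can be sketched too (all numbers invented): suppose the honest answer to a 0-100 survey question is 50, half the population is predisposed to answer 5 points high, and the other half 5 points low. A random sample mixes the two groups, so their opposing biases roughly cancel in the sample average:

```python
import random

random.seed(1)

true_opinion = 50.0
# Two predisposed groups: one answers high, one answers low.
group_high = [true_opinion + 5 for _ in range(10_000)]
group_low = [true_opinion - 5 for _ in range(10_000)]
population = group_high + group_low

# Random selection pulls from both groups in roughly equal numbers.
sample = random.sample(population, 1000)
sample_avg = sum(sample) / len(sample)
print(sample_avg)  # near 50: the opposing predispositions cancel out
```

Had you surveyed only one group (say, only the phone-owners from Example 5), you'd be stuck a full 5 points off with no way to know it. Random selection turns that one-directional, systematic error into two-directional, random error.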