Overview of Audio Testing Methodologies
How Many Blind People Does it Take to Know that Your Speakers are the Best?
originally published October 10th, 2005
Double-blind, single-blind, ABX…all these terms and more are bandied about on the forums and elsewhere when discussing someone's subjective experience with a piece of audio equipment. This editorial is meant as a primer on the testing methodologies used to validly measure someone's subjective experiences. It is not a definitive guide, nor is it meant to be. It is also not going to be a technical paper - I will use as many examples, with as few numbers, as possible. I want this to be accessible to all.
History - What's the deal with all this anyway?
Many moons ago, there was only experience. You put your hand in a fire, it burned, you told your caveman buddies not to do that, and…they probably did it anyway. Whether cavemen debated which part of the dino was best may never be known, but we do know that only a battle royale to the death could "prove" one of them right. Why? Because it was all opinion…opinion without scientific fact to back it up.
Well, time marched on and lo and behold, the deductive method was invented by Euclid (still used today in mathematics). In simple terms, one creates simple true statements (axioms or postulates) then builds theorems based on them. Anyone flashing back to high school geometry? I am. The deductive method is great - for mathematics. But what if you don't have simple truths to build on?
Enter Sir Francis Bacon (named after my favorite pork product). Sir Bacon came up with the inductive method (commonly referred to as the Scientific Method). The Scientific Method is basically:
Observe and describe
Create a hypothesis (theory for why what you observed happened)
Make predictions based on your hypothesis
Experiment to see if your predictions hold true
Of course, step five is "repeat." At this point, an example is in order:
Example 1: You are walking down the street. You notice a pain in your foot. You look, and there is a tack embedded in your foot (observation). You suspect that the tack is the cause of your considerable discomfort (hypothesis). You think that if you remove the tack, the pain will subside (predict). You pull the tack out of your foot and the pain does reduce considerably, though some is still present (experiment). Now comes the dreaded step 5 - repeat. Ouch.
So, what has this got to do with audio, you say? Well, this is the deal: The deductive method works great for testing audio equipment. No lie. No BS. Absolute truth. You put a signal in, you measure it when it comes out, if it is the same, IT'S THE SAME. Unfortunately, not everyone buys into this little piece of common sense. Some believe that there are real quantifiable differences that can't be measured. Hey, I'm not an electrical or mechanical engineer so who am I to argue?
Statistics to the Rescue: Correlation does not equal causation
If you've ever taken a statistics course, you'll have heard this phrase a million times: Correlation does not equal causation. But what does it mean? Looking back at Example 1, the alleviation of the pain in your foot coincided (correlated) with the removal of the tack. So the tack caused the pain, right? Not to a statistician. The statistician only knows that the pain went away when the tack was removed. That could be a coincidence. No, the statistician is going to insist that you, all your friends, and a bunch of people of various ethnicities, income levels, geographic locations, and genders all stick yourselves multiple times in the foot with a tack before he'll buy that the tack was probably the cause of the pain. Idiotic, yes? No:
Example 2: A crime statistician discovers that with a high degree of correlation the incidence of domestic violence increases with the sales of ice cream. He, of course, runs home and throws out all the ice cream in the house to make sure he doesn't beat his wife. He immediately gets the crap kicked out of him by his very pregnant, extremely unreasonable better half who told him to thank his lucky stars he didn't throw out the pickles too.
Where did he go wrong? He forgot the first rule of statistics: Correlation does not equal causation. Just because domestic violence increases along with ice cream sales, does not mean that ice cream was the cause of the increase in domestic violence. It could be that the sugar in the ice cream was making people crazy or it could be that the beaten partners are buying more ice cream to console themselves. More likely, it is that ice cream is bought during the summer, and it is hot, and people are uncomfortable, and they tend to get out more, and Lord knows what else. There are a lot of reasons, but eating ice cream is probably not one of them.
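The ice cream story can be sketched as a quick simulation. This is a toy model of my own invention, not real crime data: a hidden "confounder" (temperature) drives both invented series, which never touch each other, yet they end up strongly correlated.

```python
import random

random.seed(42)

# Hypothetical monthly data: temperature is the hidden confounder that
# drives BOTH ice cream sales and (in this toy model) incident reports.
temps = [random.uniform(30, 95) for _ in range(120)]         # degrees F
ice_cream = [t * 10 + random.gauss(0, 40) for t in temps]    # sales
incidents = [t * 0.5 + random.gauss(0, 5) for t in temps]    # reports

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Ice cream never influences incidents in the model above, yet their
# shared dependence on temperature makes them correlate strongly.
print(round(pearson(ice_cream, incidents), 2))
```

The correlation comes out high even though, by construction, neither variable causes the other - exactly the trap the statistician's mantra warns about.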
Well, what does this have to do with audio…I'm getting to that, hold on and keep reading.
Case Study Methodology: What do you mean my own experiences don't count?
How many times have you been in an argument with someone over something where you had completely different experiences with the exact same thing?
"That movie was awesome!"
"No it wasn't, are you kidding me, a cappuccino enema couldn't keep me awake during that snoozefest."
"Man, you are on crack, that was the best thing I've ever seen!"
What's the problem here? Let's say these two friends sat in the same theater at the same time right next to each other (probably with an empty chair between them because…well, you know) so you are reasonably sure they had the same experience. How could they have such different opinions? We all know the answer to that; they see the movie through their own biases. One loves action flicks, one loves sappy dramas. Of course one is going to hate it. They were biased against it before they even walked in (and conversely you could say that the other was biased for it before they even walked in).
This type of methodology is used extensively in our everyday lives, by anthropologists and (interestingly enough) on NPR. It is called the Case Study Method. The Case Study Method generally looks at one (or a small group of) subject(s) and reports on their experiences. Most of us use this method daily when we ask a trusted friend which restaurant to go to, what movie to see, how they like their new car, etc. Anthropologists (not all, mind you) use this method when they embed themselves in an isolated tribe in the middle of nowhere then write a book about them. NPR uses it when they report on one person's experience about a particular issue (though one assumes that they are using that person's experience as corroborating evidence for the general state of things).
Now the audio link becomes clear, doesn't it? Someone tells you they have the best speakers ever, you can now point at them like the pod people at the end of "Invasion of the Body Snatchers" screaming, "Case Study, Case Study." Hold on, says I. There is nothing wrong with the case study method. Huh? Wh…Wh…What? Yep, perfectly reasonable. Nothing wrong with it… As long as you know the limitations. And you know what, we all do. You have a friend that has your same taste in action flicks, but not sci-fi. You know that you tend to agree with Ebert over Roeper (or vice versa). You make these mental modifications all the time without thinking about it. What you are doing is making adjustments to how much stock you put in their opinion based on your knowledge of their biases. See, we are all amateur statisticians!
Bias and Error: The whack-a-moles of the statistical world
The Case Study Method is fraught with bias - true. A good researcher will be trained to recognize their own biases and put them aside (for the most part) but bias is an ugly animal. It pops up when and where you least expect it. That is why professional reviewers tend to get more weight than lay people: They are not only experts in their field (presumably) but they have ways of controlling their own biases, with varying degrees of success. Well, the trained researcher is not satisfied taking someone else's word for it; they want something more scientific.
First, before we go any further, let's talk about a little creature called Error. Error comes in two flavors: Systematic and Random. Systematic bias (or systematic error) is something that affects the outcome of your experiment in ONE direction. Random error is something that is equally likely to affect the outcome of your experiment in either direction. And…an example is in order:
Example 3: Children at the local Junior College are taking their statistics finals. One of the questions on the test is, "Correlation does not equal __________." The instructor forgot to take down his favorite banner displaying that exact phrase before the test. Every student got that question right. Unfortunately, when the teacher was grading the tests, he was enjoying a glass (or two or three) of his favorite single malt and he tended to grade the later tests a bit more leniently than the first ones.
So, which is which? Well, since the banner affected all the tests in the same way (everyone got it right), it is Systematic Bias. Since the order of the tests was random (presumably), the effect of the professor's increasing inebriation was Random error. Statisticians try to control for (eliminate) as much systematic bias as possible and randomize the rest.
Whoa, whoa, whoa there cowboy, run that by me again.
Controlling for bias is trying, to the best of your ability, to eliminate the sources of error or bias. Random error is really impossible to control for and technically there is no need. If it is just as likely to affect the outcome one way or the other, it won't reliably change your results. Crap, there I go throwing around technical terms again assuming you all know what I mean:
Example 4: When you step on the scale at the doctor's office and each time it says you are fat, that is reliable. When you step on your scale at home and it says you are fat, you step off and step on again and it says you are slightly less fat, you step off and step on again, and it says you are slightly more fat, that is unreliable. Your weight hasn't changed in the 14 nanoseconds from 1st to 2nd to 3rd weighing, but the measure has, and if it's as likely to indicate that you're slightly fatter or slightly thinner, that's unreliability. If it consistently indicates that you're growing fatter with each successive weigh-in, even those that are only seconds apart, that's bias.
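The bathroom-scale example can be put in code. This is a minimal sketch with made-up numbers: one simulated scale whose readings jitter randomly in both directions, and one whose readings are all pushed in one direction. Averaging many readings washes out the random jitter but not the one-way push.

```python
import random
import statistics

random.seed(1)
true_weight = 180.0  # hypothetical true weight, in lbs

# Unreliable scale: random error jitters each reading in BOTH directions.
unreliable = [true_weight + random.gauss(0, 2) for _ in range(1000)]

# Biased scale: every reading is shoved in ONE direction (here, +3 lbs),
# with the same random jitter on top.
biased = [true_weight + 3 + random.gauss(0, 2) for _ in range(1000)]

# Random error averages away; systematic bias does not.
print(round(statistics.mean(unreliable), 1))  # close to 180
print(round(statistics.mean(biased), 1))      # close to 183
```

The unreliable scale's average lands near the true weight, while the biased scale stays stubbornly off by its built-in push - which is exactly why statisticians randomize what they can't control and hunt down what they can.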
Engineers have -ometers and -ographs that measure things reliably to the nano-whatever. They don't have to worry that much about reliability. Their tools are built to measure reliably. Statisticians are constantly worried about reliability because of the havoc it wreaks. An unreliable test or measure may be unreliable because of random error (which makes differences harder to detect) or systematic error (changing the scores all in one direction). Regardless, unreliability is bad.
But back to controlling and randomizing bias: When a source of systematic bias is identified, thousands of statisticians, like little cockroaches, scurry around trying to identify all the ways it can be eliminated from the experiment. In example 3 above, the source of systematic bias can be controlled by…yep, removing the banner. Not too hard, eh? But what about when the source of systematic bias is something else...? Something harder to control?
Example 5: You volunteer for your local congressperson who is up for reelection. You are tasked with finding out what the people want. A beautiful survey is developed that, because of time and cost, will be administered over the phone. Everything is going wonderfully until someone brings to your attention that there are large segments of the community that don't have ready access to or don't have time to use the phone…Elderly in nursing homes, the working poor, some youth of voting age, etc. You look over the demographics of your results and you realize, yep, the ages of the respondents are between 28 and 56. The congressperson is not going to be pleased.
You've got systematic bias going on - a bad case of it. How do you fix it? Well, you switch methodologies. You do face-to-face interviews door-to-door. You send out mailings. You hold town meetings. Finally, just when you've gathered opinions from nearly everyone in the area, that same helpful colleague mentions that your results might differ not because of the groups, but because people answer questions differently in a face-to-face interview, a telephone interview, a paper survey, or in front of a group at a community meeting. Ack!
What about random error? Well, this is much easier to deal with…ignore it. *Gasp* Yep, that's right, ignore it. Believe it or not, random error is your friend. Let's go with an example shall we?
Example 6: The classic random example is a coin toss. Half the time it comes up heads, the other half it comes up tails. Weigh yourself. Write that number down. Now flip a coin 10 times; each time it comes up heads, add 10lbs to your weight and write that number down. Each time it comes up tails, subtract 10lbs from your weight and write that number down. After all 10 flips, take the average of the numbers you wrote down. Statistically, it should be close to your original weight - and the more times you flip, the closer the average gets.