Collection Primary Data Abroad Assignment

Learning Objective

  1. Describe the basic steps in the marketing research process and the purpose of each step.

The basic steps used to conduct marketing research are shown in Figure 10.6 “Steps in the Marketing Research Process”. Next, we discuss each step.

Figure 10.6 Steps in the Marketing Research Process

Step 1: Define the Problem (or Opportunity)

There’s a saying in marketing research that a problem half defined is a problem half solved. Defining the “problem” of the research sounds simple, doesn’t it? Suppose your product is tutoring other students in a subject you’re a whiz at. You have been tutoring for a while, and people have begun to realize you’re darned good at it. Then, suddenly, your business drops off. Or it explodes, and you can’t cope with the number of students you’re being asked to help. If the business has exploded, should you try to expand your services? Perhaps you should subcontract with some other “whiz” students. You would send them students to be tutored, and they would give you a cut of their pay for each student you referred to them.

Both of these scenarios would be a problem for you, wouldn’t they? They are problems insofar as they cause you headaches. But are they really the problem? Or are they the symptoms of something bigger? For example, maybe your business has dropped off because your school is experiencing financial trouble and has lowered the number of scholarships given to incoming freshmen. Consequently, there are fewer total students on campus who need your services. Conversely, if you’re swamped with people who want you to tutor them, perhaps your school awarded more scholarships than usual, so there is a greater number of students who need your services. Alternatively, perhaps you ran an ad in your school’s college newspaper, and that led to the influx of students wanting you to tutor them.

Businesses are in the same boat you are as a tutor. They take a look at symptoms and try to drill down to the potential causes. If you approach a marketing research company with either scenario—either too much or too little business—the firm will seek more information from you such as the following:

  • In what semester(s) did your tutoring revenues fall (or rise)?
  • In what subject areas did your tutoring revenues fall (or rise)?
  • In what sales channels did revenues fall (or rise): Were there fewer (or more) referrals from professors or other students? Did the ad you ran result in fewer (or more) referrals this month than in the past months?
  • Among what demographic groups did your revenues fall (or rise)—women or men, people with certain majors, or first-year, second-, third-, or fourth-year students?
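
The drill-down questions above amount to breaking a revenue change out by one dimension at a time. The following is a minimal sketch of that idea in Python. The records, field names, and helper functions are all invented for illustration and are not taken from any real tutoring business.

```python
from collections import defaultdict

# Hypothetical tutoring records; every value here is invented for illustration.
sessions = [
    {"semester": "fall", "subject": "math", "channel": "professor", "revenue": 300},
    {"semester": "fall", "subject": "chemistry", "channel": "ad", "revenue": 200},
    {"semester": "spring", "subject": "math", "channel": "professor", "revenue": 100},
    {"semester": "spring", "subject": "chemistry", "channel": "ad", "revenue": 220},
]

def revenue_by(records, dimension):
    """Total revenue for each (semester, value-of-dimension) pair."""
    totals = defaultdict(int)
    for r in records:
        totals[(r["semester"], r[dimension])] += r["revenue"]
    return dict(totals)

def change_by(records, dimension, before="fall", after="spring"):
    """Revenue change from one semester to the next, per value of the dimension."""
    totals = revenue_by(records, dimension)
    values = {value for (_, value) in totals}
    return {v: totals.get((after, v), 0) - totals.get((before, v), 0) for v in values}

subject_change = change_by(sessions, "subject")
channel_change = change_by(sessions, "channel")
print(subject_change)
print(channel_change)
```

In this made-up data the drop is concentrated in math sessions referred by professors, which is the kind of narrowing that leads to a focused research objective.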

The key is to look at all potential causes so as to narrow the parameters of the study to the information you actually need to make a good decision about how to fix your business if revenues have dropped or whether or not to expand it if your revenues have exploded.

The next task for the researcher is to put into writing the research objective. The research objective is the goal(s) the research is supposed to accomplish. The marketing research objective for your tutoring business might read as follows:

To survey college professors who teach 100- and 200-level math courses to determine why the number of students referred for tutoring dropped in the second semester.

This is admittedly a simple example designed to help you understand the basic concept. If you take a marketing research course, you will learn that research objectives get a lot more complicated than this. The following is an example:

“To gather information from a sample representative of the U.S. population among those who are ‘very likely’ to purchase an automobile within the next 6 months, which assesses preferences (measured on a 1–5 scale ranging from ‘very likely to buy’ to ‘not likely at all to buy’) for the model diesel at three different price levels. Such data would serve as input into a forecasting model that would forecast unit sales, by geographic regions of the country, for each combination of the model’s different prices and fuel configurations (Burns & Bush, 2010).”

Now do you understand why defining the problem is complicated and half the battle? Many a marketing research effort is doomed from the start because the problem was improperly defined. Coke’s ill-fated decision to change the formula of Coca-Cola in 1985 is a case in point: Pepsi had been creeping up on Coke in terms of market share over the years as well as running a successful promotional campaign called the “Pepsi Challenge,” in which consumers were encouraged to do a blind taste test to see if they agreed that Pepsi was better. Coke spent four years researching “the problem.” Indeed, people seemed to like the taste of Pepsi better in blind taste tests. Thus, the formula for Coke was changed. But the outcry among the public was so great that the new formula didn’t last long—a matter of months—before the old formula was reinstated. Some marketing experts believe Coke incorrectly defined the problem as “How can we beat Pepsi in taste tests?” instead of “How can we gain market share against Pepsi?” (Burns & Bush, 2010)

Video Clip

New Coke Is It! 1985

This video documents the Coca-Cola Company’s ill-fated launch of New Coke in 1985.

Video Clip

1985 Pepsi Commercial—“They Changed My Coke”

This video shows how Pepsi tried to capitalize on the blunder.

Step 2: Design the Research

The next step in the marketing research process is to do a research design. The research design is your “plan of attack.” It outlines what data you are going to gather and from whom, how and when you will collect the data, and how you will analyze it once it’s been obtained. Let’s look at the data you’re going to gather first.

There are two basic types of data you can gather. The first is primary data. Primary data is information you collect yourself, using hands-on tools such as interviews or surveys, specifically for the research project you’re conducting. Secondary data is data that has already been collected by someone else, or data you have already collected for another purpose. Collecting primary data is more time consuming, work intensive, and expensive than collecting secondary data. Consequently, you should always try to collect secondary data first to solve your research problem, if you can. A great deal of research on a wide variety of topics already exists. If this research contains the answer to your question, there is no need for you to replicate it. Why reinvent the wheel?

Sources of Secondary Data

Your company’s internal records are a source of secondary data. So are any data you collect as part of your marketing intelligence gathering efforts. You can also purchase syndicated research. Syndicated research is primary data that marketing research firms collect on a regular basis and sell to other companies. J.D. Power & Associates is a provider of syndicated research. The company conducts independent, unbiased surveys of customer satisfaction, product quality, and buyer behavior for various industries. The company is best known for its research in the automobile sector. One of the best-known sellers of syndicated research is the Nielsen Company, which produces the Nielsen ratings. The Nielsen ratings measure the size of television, radio, and newspaper audiences in various markets. You have probably read or heard about TV shows that get the highest (Nielsen) ratings. (Arbitron does the same thing for radio ratings.) Nielsen, along with its main competitor, Information Resources, Inc. (IRI), also sells businesses scanner-based research. Scanner-based research is information collected by scanners at checkout stands in stores. Each week Nielsen and IRI collect information on the millions of purchases made at stores. The companies then compile the information and sell it to firms in various industries that subscribe to their services. The Nielsen Company has also recently teamed up with Facebook to collect marketing research information. Via Facebook, users will see surveys in some of the spaces in which they used to see online ads (Rappeport & Gelles, 2009).

By contrast, a marketing research aggregator is a marketing research company that doesn’t conduct its own research and sell it. Instead, it buys research reports from other marketing research companies and then sells the reports in their entirety or in pieces to other firms. On an aggregator’s Web site, you will typically find a huge number of studies in every category imaginable that you can buy for relatively small amounts of money.

Your local library is a good place to gather free secondary data. It has searchable databases as well as handbooks, dictionaries, and books, some of which you can access online. Government agencies also collect and report information on demographics, economic and employment data, health information, and balance-of-trade statistics, among a lot of other information. The U.S. Census Bureau collects census data every ten years to gather information about who lives where. Basic demographic information about sex, age, race, and types of housing in which people live in each U.S. state, metropolitan area, and rural area is gathered so that population shifts can be tracked for various purposes, including determining the number of legislators each state should have in the U.S. House of Representatives. For the U.S. government, this is primary data. For marketing managers it is an important source of secondary data.

The Survey Research Center at the University of Michigan also conducts periodic surveys and publishes information about trends in the United States. One research study the center continually conducts is called the “Changing Lives of American Families.” This is important research data for marketing managers monitoring consumer trends in the marketplace. The World Bank and the United Nations are two international organizations that collect a great deal of information. Their Web sites contain many free research studies and data related to global markets. Table 10.1 “Examples of Primary Data Sources versus Secondary Data Sources” shows some examples of primary versus secondary data sources.

Table 10.1 Examples of Primary Data Sources versus Secondary Data Sources

Primary Data Sources    Secondary Data Sources
Interviews              Census data
Surveys                 Web sites
                        Trade associations
                        Syndicated research and market aggregators

Figure 10.7

Market research aggregators buy research reports from other marketing research companies and then resell them in part or in whole to other companies so they don’t have to gather primary data.

Gauging the Quality of Secondary Data

When you are gathering secondary information, it’s always good to be a little skeptical of it. Sometimes studies are commissioned to produce the result a client wants to hear, or wants the public to hear. For example, throughout the twentieth century, numerous studies found that smoking was good for people’s health. The problem was the studies were commissioned by the tobacco industry. Web research can also pose certain hazards. There are many biased sites that try to fool people into thinking they are providing good data. Often the data is favorable to the products they are trying to sell. Beware of product reviews as well. Unscrupulous sellers sometimes get online and create bogus ratings for products. See below for questions you can ask to help gauge the credibility of secondary information.

Gauging the Credibility of Secondary Data: Questions to Ask

  • Who gathered this information?
  • For what purpose?
  • What does the person or organization that gathered the information have to gain by doing so?
  • Was the information gathered and reported in a systematic manner?
  • Is the source of the information accepted as an authority by other experts in the field?
  • Does the article provide objective evidence to support the position presented?

Types of Research Design

Now let’s look specifically at the types of research designs that are utilized. By understanding different types of research designs, a researcher can solve a client’s problems more quickly and efficiently without jumping through more hoops than necessary. Research designs fall into one of the following three categories:

  1. Exploratory research design
  2. Descriptive research design
  3. Causal research design (experiments)

An exploratory research design is useful when you are initially investigating a problem but you haven’t defined it well enough to do an in-depth study of it. Perhaps via your regular market intelligence, you have spotted what appears to be a new opportunity in the marketplace. You would then do exploratory research to investigate it further and “get your feet wet,” as the saying goes. Exploratory research is less structured than other types of research, and secondary data is often utilized.

One form of exploratory research is qualitative research. Qualitative research is any form of research that includes gathering data that is not quantitative, and often involves exploring questions such as why as much as what or how much. Different forms, such as depth interviews and focus group interviews, are common in marketing research.

The depth interview—engaging in detailed, one-on-one, question-and-answer sessions with potential buyers—is an exploratory research technique. However, unlike surveys, the people being interviewed aren’t asked a series of standard questions. Instead the interviewer is armed with some general topics and asks questions that are open ended, meaning that they allow the interviewee to elaborate. “How did you feel about the product after you purchased it?” is an example of a question that might be asked. A depth interview also allows a researcher to ask logical follow-up questions such as “Can you tell me what you mean when you say you felt uncomfortable using the service?” or “Can you give me some examples?” to help dig further and shed additional light on the research problem. Depth interviews can be conducted in person or over the phone. The interviewer either takes notes or records the interview.

Focus groups and case studies are often utilized for exploratory research as well. A focus group is a group of potential buyers who are brought together to discuss a marketing research topic with one another. A moderator is used to focus the discussion, the sessions are recorded, and the main points of consensus are later summarized by the market researcher. Textbook publishers often gather groups of professors at educational conferences to participate in focus groups. However, focus groups can also be conducted on the telephone, in online chat rooms, or both, using meeting software like WebEx. The basic steps of conducting a focus group are outlined below.

The Basic Steps of Conducting a Focus Group

  1. Establish the objectives of the focus group. What is its purpose?
  2. Identify the people who will participate in the focus group. What makes them qualified to participate? How many of them will you need, and what will they be paid?
  3. Obtain contact information for the participants and send out invitations (usually e-mails are most efficient).
  4. Develop a list of questions.
  5. Choose a facilitator.
  6. Choose a location in which to hold the focus group and the method by which it will be recorded.
  7. Conduct the focus group. If the focus group is not conducted electronically, include name tags for the participants, pens and notepads, any materials the participants need to see, and refreshments. Record participants’ responses.
  8. Summarize the notes from the focus group and write a report for management.

A case study looks at how another company solved the problem that’s being researched. Sometimes multiple cases, or companies, are used in a study. Case studies have a mixed reputation, however. Some researchers believe it’s hard to generalize, or apply, the results of a case study to other companies. Nonetheless, collecting information about companies that encountered the same problems your firm is facing can give you a certain amount of insight about what direction you should take. In fact, one way to begin a research project is to carefully study a successful product or service.

Two other types of qualitative data used for exploratory research are ethnographies and projective techniques. In an ethnography, researchers interview, observe, and often videotape people while they work, live, shop, and play. The Walt Disney Company has recently begun using ethnographers to uncover the likes and dislikes of boys aged six to fourteen, a financially attractive market segment for Disney, but one in which the company has been losing market share. The ethnographers visit the homes of boys, observe the things they have in their rooms to get a sense of their hobbies, and accompany them and their mothers when they shop to see where they go, what the boys are interested in, and what they ultimately buy. (The children get seventy-five dollars out of the deal, incidentally.) (Barnes, 2009)

Projective techniques are used to reveal information research respondents might not reveal by being asked directly. Asking a person to complete sentences such as the following is one technique:

People who buy Coach handbags __________.

(Will he or she reply with “are cool,” “are affluent,” or “are pretentious,” for example?)

KFC’s grilled chicken is ______.

Or the person might be asked to finish a story that presents a certain scenario. Word associations are also used to discern people’s underlying attitudes toward goods and services. Using a word-association technique, a market researcher asks a person to say or write the first word that comes to his or her mind in response to another word. If the initial word is “fast food,” what word does the person associate it with or respond with? Is it “McDonald’s”? If many people reply that way, and you’re conducting research for Burger King, that could indicate Burger King has a problem. However, if the research is being conducted for Wendy’s, which recently began running an advertising campaign to the effect that Wendy’s offerings are “better than fast food,” it could indicate that the campaign is working.

Completing cartoons is yet another type of projective technique. It’s similar to finishing a sentence or story, only with pictures. People are asked to look at a cartoon such as the one shown in Figure 10.8 “Example of a Cartoon-Completion Projective Technique”. One of the characters in the picture will have made a statement, and the person is asked to fill in the empty cartoon “bubble” with how they think the second character will respond.

Figure 10.8 Example of a Cartoon-Completion Projective Technique

In some cases, your research might end with exploratory research. Perhaps you have discovered your organization lacks the resources needed to produce the product. In other cases, you might decide you need more in-depth, quantitative research such as descriptive research or causal research, which are discussed next. Most marketing research professionals advise using both types of research, if it’s feasible. On the one hand, the qualitative-type research used in exploratory research is often considered too “lightweight.” Remember earlier in the chapter when we discussed telephone answering machines and the hit TV sitcom Seinfeld? Both product ideas were initially rejected by focus groups. On the other hand, relying solely on quantitative information often results in market research that lacks ideas.

Video Clip

The Stone Wheel—What One Focus Group Said

Watch the video to see a funny spoof on the usefulness—or lack of usefulness—of focus groups.

Descriptive Research

Anything that can be observed and counted falls into the category of descriptive research design. A study using a descriptive research design involves gathering hard numbers, often via surveys, to describe or measure a phenomenon so as to answer the questions of who, what, where, when, and how. “On a scale of 1–5, how satisfied were you with your service?” is a question that illustrates the information a descriptive research design is supposed to capture.
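
To make the “hard numbers” concrete, here is a small Python sketch that summarizes answers to the 1–5 satisfaction question. The responses are invented, and the “top-two share” (the fraction answering 4 or 5) is just one common way to summarize such a scale, not something the chapter prescribes.

```python
from collections import Counter
from statistics import mean

# Invented responses to "On a scale of 1-5, how satisfied were you with your service?"
responses = [5, 4, 4, 3, 5, 2, 4, 1, 5, 4]

counts = Counter(responses)   # how many respondents chose each score
average = mean(responses)     # overall average satisfaction
# Share of respondents answering 4 or 5
top_two_share = sum(1 for r in responses if r >= 4) / len(responses)

print(sorted(counts.items()))
print(average, top_two_share)
```

A summary like this describes how satisfied customers are. It says nothing about why.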

Physiological measurements also fall into the category of descriptive design. Physiological measurements measure people’s involuntary physical responses to marketing stimuli, such as an advertisement. Elsewhere, we explained that researchers have gone so far as to scan the brains of consumers to see what they really think about products versus what they say about them. Eye tracking is another cutting-edge type of physiological measurement. It involves recording the movements of a person’s eyes when they look at some sort of stimulus, such as a banner ad or a Web page. The Walt Disney Company has a research facility in Austin, Texas, that it uses to take physical measurements of viewers when they see Disney programs and advertisements. The facility measures three types of responses: people’s heart rates, skin changes, and eye movements (eye tracking) (Spangler, 2009).

Figure 10.9

A woman shows off her headgear for an eye-tracking study. The gear’s not exactly a fashion statement but . . .

A strictly descriptive research design instrument—a survey, for example—can tell you how satisfied your customers are. It can’t, however, tell you why. Nor can an eye-tracking study tell you why people’s eyes tend to dwell on certain types of banner ads—only that they do. To answer “why” questions an exploratory research design or causal research design is needed (Wagner, 2007).

Causal Research

Causal research design examines cause-and-effect relationships. Using a causal research design allows researchers to answer “what if” types of questions. In other words, if a firm changes X (say, a product’s price, design, placement, or advertising), what will happen to Y (say, sales or customer loyalty)? To conduct causal research, the researcher designs an experiment that “controls,” or holds constant, all of a product’s marketing elements except one (or using advanced techniques of research, a few elements can be studied at the same time). The one variable is changed, and the effect is then measured. Sometimes the experiments are conducted in a laboratory using a simulated setting designed to replicate the conditions buyers would experience. Or the experiments may be conducted in a virtual computer setting.
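
The logic of a controlled experiment can be sketched in a few lines of Python. The sales figures and the “new package design” treatment below are invented. A real study would also need enough observations and a statistical test before drawing conclusions, which this sketch leaves out.

```python
from statistics import mean

# Invented weekly unit sales. Only one element differs between the two groups.
control_sales = [100, 96, 104, 100]     # existing package design
treatment_sales = [112, 108, 116, 112]  # new package design, everything else held constant

# Effect of changing X (design) on Y (sales), measured as the difference in averages
lift = mean(treatment_sales) - mean(control_sales)
lift_pct = lift / mean(control_sales) * 100

print(lift, lift_pct)
```

Because everything except the design is held constant, the difference in average sales can be attributed to the design change rather than to price, placement, or advertising.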

You might think setting up an experiment in a virtual world such as the online game Second Life would be a viable way to conduct controlled marketing research. Some companies have tried to use Second Life for this purpose, but the results have been somewhat mixed as to whether or not it is a good medium for marketing research. The German marketing research firm Komjuniti was one of the first “real-world” companies to set up an “island” in Second Life upon which it could conduct marketing research. However, with so many other attractive fantasy islands in which to play, the company found it difficult to get Second Life residents, or players, to voluntarily visit the island and stay long enough so meaningful research could be conducted. (Plus, the residents of Second Life have been known to protest corporations invading their world. When Komjuniti created its island, residents showed up waving signs and threatening to boycott it.) (Wagner, 2007)

Why is being able to control the setting so important? Let’s say you are an American flag manufacturer and you are working with Walmart to conduct an experiment to see where in its stores American flags should be placed so as to increase their sales. Then the terrorist attacks of 9/11 occur. In the days afterward, sales skyrocket; people buy flags no matter where they are displayed. Obviously, the terrorist attacks in the United States would skew the experiment’s data.

An experiment conducted in a natural setting such as a store is referred to as a field experiment. Companies sometimes do field experiments either because it is more convenient or because they want to see if buyers will behave the same way in the “real world” as in a laboratory or on a computer. The place the experiment is conducted or the demographic group of people the experiment is administered to is considered the test market. Before a large company rolls out a product to the entire marketplace, it will often place the offering in a test market to see how well it will be received. For example, to compete with MillerCoors’ sixty-four-calorie beer MGD 64, Anheuser-Busch recently began testing its Select 55 beer in certain cities around the country (McWilliams, 2009).

Figure 10.10

Select 55 beer: Coming soon to a test market near you? (If you’re on a diet, you have to hope so!)

Many companies use experiments to test all of their marketing communications. For example, one online discount retailer carefully tests all of its marketing offers and tracks the results of each one. One study the company conducted combined twenty-six different variables related to offers e-mailed to several thousand customers. The study resulted in a decision to send a group of e-mails to different segments. The company then tracked the results of the sales generated to see if they were in line with the earlier experiment it had conducted that led it to make the offer.

Step 3: Design the Data-Collection Forms

If the behavior of buyers is being formally observed, and a number of different researchers are conducting observations, the data obviously need to be recorded on a standardized data-collection form that’s either paper or electronic. Otherwise, the data collected will not be comparable. The items on the form could include a shopper’s sex; his or her approximate age; whether the person seemed hurried, moderately hurried, or unhurried; and whether or not he or she read the label on products, used coupons, and so forth.
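
One way to picture a standardized electronic form is as a record type that accepts only the agreed-upon values. The Python sketch below is an assumption-laden illustration: the class name, field names, and validation rules are hypothetical and simply mirror the items listed above.

```python
from dataclasses import dataclass

# The only pace values observers are allowed to record
PACE_OPTIONS = ("hurried", "moderately hurried", "unhurried")

@dataclass(frozen=True)
class ShopperObservation:
    """One row of a hypothetical standardized observation form."""
    observer: str
    sex: str
    approximate_age: int
    pace: str
    read_label: bool
    used_coupons: bool

    def __post_init__(self):
        # Reject entries that fall outside the standardized choices
        if self.pace not in PACE_OPTIONS:
            raise ValueError(f"pace must be one of {PACE_OPTIONS}, got {self.pace!r}")
        if not 0 <= self.approximate_age <= 120:
            raise ValueError("approximate_age is out of range")

obs = ShopperObservation("researcher_a", "female", 35, "unhurried", True, False)
print(obs)
```

Because every observer must choose from the same options, records from different researchers can be compared directly.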

The same is true when it comes to surveying people with questionnaires. Surveying people is one of the most commonly used techniques to collect quantitative data. Surveys are popular because they can be easily administered to large numbers of people fairly quickly. However, to produce the best results, the questionnaire for the survey needs to be carefully designed.

Questionnaire Design

Most questionnaires follow a similar format: They begin with an introduction describing what the study is for, followed by instructions for completing the questionnaire and, if necessary, returning it to the market researcher. The first few questions that appear on the questionnaire are usually basic, warm-up type of questions the respondent can readily answer, such as the respondent’s age, level of education, place of residence, and so forth. The warm-up questions are then followed by a logical progression of more detailed, in-depth questions that get to the heart of the question being researched. Lastly, the questionnaire wraps up with a statement that thanks the respondent for participating in the survey and explains when and how they will be paid for participating.

How the questions themselves are worded is extremely important. It’s human nature for respondents to want to provide the “correct” answers to the person administering the survey, so as to seem agreeable. Therefore, there is always a hazard that people will try to tell you what you want to hear on a survey. Consequently, care needs to be taken that the survey questions are written in an unbiased, neutral way. In other words, they shouldn’t lead a person taking the questionnaire to answer a question one way or another by virtue of the way you have worded it. The following is an example of a leading question.

Don’t you agree that teachers should be paid more?

The questions also need to be clear and unambiguous. Consider the following question:

Which brand of toothpaste do you use?

The question sounds clear enough, but is it really? What if the respondent recently switched brands? What if she uses Crest at home, but while away from home or traveling, she uses Colgate’s Wisp portable toothpaste-and-brush product? How will the respondent answer the question? Rewording the question as follows so it’s more specific will help make the question clearer:

Which brand of toothpaste have you used at home in the past six months? If you have used more than one brand, please list each of them.

Sensitive questions have to be asked carefully. For example, asking a respondent, “Do you consider yourself a light, moderate, or heavy drinker?” can be tricky. Few people want to admit to being heavy drinkers. You can “soften” the question by including a range of answers, as the following example shows:

How many alcoholic beverages do you consume in a week?

  • __0–5 alcoholic beverages
  • __6–10 alcoholic beverages
  • __11–15 alcoholic beverages

Many people don’t like to answer questions about their income levels. Asking them to specify income ranges rather than divulge their actual incomes can help.

Other research question “don’ts” include using jargon and acronyms that could confuse people. “How often do you IM?” is an example. Also, don’t muddy the waters by asking two questions in the same question, something researchers refer to as a double-barreled question. “Do you think parents should spend more time with their children and/or their teachers?” is an example of a double-barreled question.

Open-ended questions, or questions that ask respondents to elaborate, can be included. However, they are harder to tabulate than closed-ended questions, or questions that limit a respondent’s answers. Multiple-choice and yes-or-no questions are examples of closed-ended questions.
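
The difference in tabulation effort shows up quickly in a small Python sketch. All of the answers below are invented for illustration.

```python
from collections import Counter

# Invented answers to a closed-ended (yes-or-no) question
closed_answers = ["yes", "no", "yes", "yes", "no"]
tally = Counter(closed_answers)

# Invented answers to an open-ended question. Nearly every response is unique,
# so a simple count says little until someone reads and codes the answers.
open_answers = [
    "I liked the price",
    "Price was fine, service was slow",
    "the staff were friendly",
]
open_tally = Counter(open_answers)

print(tally)
print(len(open_tally), "distinct open-ended answers")
```

The closed-ended answers collapse into two counts right away, while the open-ended ones stay as three separate entries that a researcher still has to group by theme.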

Testing the Questionnaire

You have probably heard the phrase “garbage in, garbage out.” If the questions are bad, the information gathered will be bad, too. One way to make sure you don’t end up with garbage is to test the questionnaire before sending it out to find out if there are any problems with it. Is there enough space for people to elaborate on open-ended questions? Is the font readable? To test the questionnaire, marketing research professionals first administer it to a number of respondents face to face. This gives the respondents the chance to ask the researcher about questions or instructions that are unclear or don’t make sense to them. The researcher then administers the questionnaire to a small subset of respondents in the actual way the survey is going to be disseminated, whether it’s delivered via phone, in person, by mail, or online.

Getting people to participate and complete questionnaires can be difficult. If the questionnaire is too long or hard to read, many people won’t complete it. So, by all means, eliminate any questions that aren’t necessary. Of course, including some sort of monetary incentive for completing the survey can increase the number of completed questionnaires a market researcher will receive.

Step 4: Specify the Sample

Once you have created your questionnaire or other marketing study, how do you figure out who should participate in it? Obviously, you can’t survey or observe all potential buyers in the marketplace. Instead, you must choose a sample. A sample is a subset of potential buyers that are representative of your entire target market, or population being studied. Sometimes market researchers refer to the population as the universe to reflect the fact that it includes the entire target market, whether it consists of a million people, a hundred thousand, a few hundred, or a dozen. “All unmarried people over the age of eighteen who purchased Dirt Devil steam cleaners in the United States during 2011” is an example of a population that has been defined.

Obviously, the population has to be defined correctly. Otherwise, you will be studying the wrong group of people. Not defining the population correctly can result in flawed research, or sampling error. A sampling error is any type of marketing research mistake that results because a sample was utilized. One criticism of Internet surveys is that the people who take these surveys don’t really represent the overall population. On average, Internet survey takers tend to be more educated and tech savvy. Consequently, if they solely constitute your population, even if you screen them for certain criteria, the data you collect could end up being skewed.

The next step is to put together the sampling frame, which is the list from which the sample is drawn. The sampling frame can be put together using a directory, customer list, or membership roster (Wrenn et al., 2007). Keep in mind that the sampling frame won’t perfectly match the population. Some people will be included on the list who shouldn’t be. Other people who should be included will be inadvertently omitted. It’s no different than if you were to conduct a survey of, say, 25 percent of your friends, using friends’ names you have in your cell phone. Most of your friends’ names are likely to be programmed into your phone, but not all of them. As a result, a certain degree of sampling error always occurs.

There are two main categories of samples in terms of how they are drawn: probability samples and nonprobability samples. A probability sample is one in which each would-be participant has a known and equal chance of being selected. The chance is known because the total number of people in the sampling frame is known. For example, if every other person from the sampling frame were chosen, each person would have a 50 percent chance of being selected.
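To make the idea of a known selection chance concrete, here is a minimal sketch in Python. The sampling frame of customer IDs and the function names are hypothetical, made up for illustration. The first function draws a simple random sample; the second takes every second person after a random starting point, which is the "every other person" case described above.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n members without replacement; each has chance n / len(frame)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(frame, step, seed=None):
    """Pick a random starting point, then take every step-th member."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return frame[start::step]


# Hypothetical sampling frame: 1,000 customer IDs from a customer list.
sampling_frame = [f"customer_{i:04d}" for i in range(1000)]

srs = simple_random_sample(sampling_frame, 100, seed=1)
every_other = systematic_sample(sampling_frame, 2, seed=1)

print(len(srs), "drawn; selection chance =", 100 / len(sampling_frame))
print(len(every_other), "drawn; selection chance =", 1 / 2)
```

Because the size of the frame is known, the chance of selection can be stated in advance for both draws: 10 percent in the first case and 50 percent in the second.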

A nonprobability sample is any type of sample that’s not drawn in a systematic way. So the chances of each would-be participant being selected can’t be known. A convenience sample is one type of nonprobability sample. It is a sample a researcher draws because it’s readily available and convenient to do so. Surveying people on the street as they pass by is an example of a convenience sample. The question is, are these people representative of the target market?

For example, suppose a grocery store needed to quickly conduct some research on shoppers to get ready for an upcoming promotion. Now suppose that the researcher assigned to the project showed up between the hours of 10 a.m. and 12 p.m. on a weekday and surveyed as many shoppers as possible. The problem is that the shoppers wouldn’t be representative of the store’s entire target market. What about commuters who stop at the store before and after work? Their views wouldn’t be represented. Neither would people who work the night shift or shop at odd hours. As a result, there would be a lot of room for sampling error in this study. For this reason, studies that use nonprobability samples aren’t considered as accurate as studies that use probability samples. Nonprobability samples are more often used in exploratory research.

Lastly, the size of the sample has an effect on the amount of sampling error. Larger samples generally produce more accurate results. The larger your sample is, the more data you will have, which will give you a more complete picture of what you’re studying. However, the more people surveyed or studied, the more costly the research becomes.

Statistics can be used to determine a sample’s optimal size. If you take a marketing research or statistics class, you will learn more about how to determine the optimal size.
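As a rough illustration of how statistics can suggest a sample size, the sketch below uses one common textbook formula for estimating a proportion, n = z² · p(1 − p) / e², with an optional correction for small populations. The formula choice, the 95 percent confidence level (z = 1.96), and the function name are assumptions made for this example, not a method prescribed by this chapter.

```python
import math


def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5, population=None):
    """Estimate how many respondents are needed to measure a proportion.

    margin_of_error: desired precision, e.g. 0.05 for plus or minus 5 points.
    z: z-score for the confidence level (1.96 is roughly 95 percent).
    p: expected proportion; 0.5 is the most conservative guess.
    population: if given, apply the finite population correction.
    """
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


large_market = sample_size_for_proportion(0.05)
small_campus = sample_size_for_proportion(0.05, population=2000)

print("Large population:", large_market)
print("Population of 2,000:", small_campus)
```

The sketch also shows the cost trade-off mentioned above: cutting the margin of error in half roughly quadruples the number of people who must be surveyed.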

Of course, if you hire a marketing research company, much of this work will be taken care of for you. Many marketing research companies, like ResearchNow, maintain panels of prescreened people they draw upon for samples. In addition, the marketing research firm will be responsible for collecting the data or contracting with a company that specializes in data collection. Data collection is discussed next.

Step 5: Collect the Data

As we have explained, primary marketing research data can be gathered in a number of ways. Surveys, taking physical measurements, and observing people are just three of the ways we discussed. If you’re observing customers as part of gathering the data, keep in mind that if shoppers are aware of the fact, it can have an effect on their behavior. For example, if a customer shopping for feminine hygiene products in a supermarket aisle realizes she is being watched, she could become embarrassed and leave the aisle, which would adversely affect your data. To get around problems such as these, some companies set up cameras or two-way mirrors to observe customers. Organizations also hire mystery shoppers to work around the problem. A mystery shopper is someone who is paid to shop at a firm’s establishment or one of its competitors to observe the level of service, cleanliness of the facility, and so forth, and report his or her findings to the firm.

Survey data can be collected in many different ways and combinations of ways. The following are the basic methods used:

  • Face-to-face (can be computer aided)
  • Telephone (can be computer aided or completely automated)
  • Mail and hand delivery
  • E-mail and the Web

A face-to-face survey is, of course, administered by a person. The surveys are conducted in public places such as in shopping malls, on the street, or in people’s homes if they have agreed to it. In years past, it was common for researchers in the United States to knock on people’s doors to gather survey data. However, randomly collected door-to-door interviews are less common today, partly because people are afraid of crime and are reluctant to give information to strangers (McDaniel & Gates, 1998).

Nonetheless, “beating the streets” is still a legitimate way questionnaire data is collected. When the U.S. Census Bureau collects data on the nation’s population, it hand delivers questionnaires to rural households that do not have street-name and house-number addresses. And Census Bureau workers personally survey the homeless to collect information about their numbers. Face-to-face surveys are also commonly used in third world countries to collect information from people who cannot read or lack phones and computers.

A plus of face-to-face surveys is that they allow researchers to ask lengthier, more complex questions because the people being surveyed can see and read the questionnaires. The same is true when a computer is utilized. For example, the researcher might ask the respondent to look at a list of ten retail stores and rank the stores from best to worst. The same question wouldn’t work so well over the telephone because the person couldn’t see the list. The question would have to be rewritten. Another drawback with telephone surveys is that even though federal and state “do not call” laws generally don’t prohibit companies from gathering survey information over the phone, people often screen such calls using answering machines and caller ID.

Probably the biggest drawback of surveys administered by a person, whether face-to-face or over the phone, is that they are labor intensive and therefore costly. Mailing out questionnaires is costly, too, and the response rates can be rather low. Think about why that might be so: if you receive a questionnaire in the mail, it is easy to throw it in the trash; it’s harder to tell a market researcher who approaches you on the street that you don’t want to be interviewed.

By contrast, gathering survey data by computer, either over the telephone or on the Internet, can be very cost-effective and in some cases free. SurveyMonkey and Zoomerang are two Web sites that will allow you to create online questionnaires, e-mail them to up to one hundred people for free, and view the responses in real time as they come in. For larger surveys, you have to pay a subscription price of a few hundred dollars. But that still can be extremely cost-effective. The two Web sites also have a host of other features such as online-survey templates you can use to create your questionnaire, a way to set up automatic reminders sent to people who haven’t yet completed their surveys, and tools you can use to create graphics to put in your final research report.

Like a face-to-face survey, an Internet survey can enable you to show buyers different visuals such as ads, pictures, and videos of products and their packaging. Web surveys are also fast, which is a major plus. Whereas face-to-face and mailed surveys often take weeks to collect, you can conduct a Web survey in a matter of days or even hours. And, of course, because the information is electronically gathered it can be automatically tabulated. You can also potentially reach a broader geographic group than you could if you had to personally interview people. The Zoomerang Web site allows you to create surveys in forty different languages.

Another plus for Web and computer surveys (and electronic phone surveys) is that there is less room for human error because the surveys are administered electronically. For instance, there’s no risk that the interviewer will ask a question wrong or use a tone of voice that could mislead the respondents. Respondents are also likely to feel more comfortable inputting the information into a computer if a question is sensitive than they would divulging the information to another person face-to-face or over the phone. Given all of these advantages, it’s not surprising that the Internet is quickly becoming the top way to collect primary data. However, like mail surveys, surveys sent to people over the Internet are easy to ignore.

Lastly, before the data collection process begins, the surveyors and observers need to be trained to look for the same things, ask questions the same way, and so forth. If they are using rankings or rating scales, they need to be “on the same page,” so to speak, as to what constitutes a high ranking or a low ranking. As an analogy, you have probably had some teachers grade your college papers harder than others. The goal of training is to avoid a wide disparity between how different observers and interviewers record the data.

Figure 10.11

Training people so they know what constitutes different ratings when they are collecting data will improve the quality of the information gathered in a marketing research study.

For example, if an observation form asks the observers to describe whether a shopper’s behavior is hurried, moderately hurried, or unhurried, they should be given an idea of what defines each rating. Does it depend on how much time the person spends in the store or in the individual aisles? How fast they walk? In other words, the criteria and ratings need to be spelled out.

Collecting International Marketing Research Data

Gathering marketing research data in foreign countries poses special challenges. However, that doesn’t stop firms from doing so. Marketing research companies are located all across the globe, in fact. Eight of the ten largest marketing research companies in the world are headquartered in the United States. However, five of these eight firms earn more of their revenues abroad than they do in the United States. There’s a reason for this: many U.S. markets were saturated, or tapped out, long ago in terms of the amount that they can grow. Coke is an example. As you learned earlier in the book, most of the Coca-Cola Company’s revenues are earned in markets abroad. To be sure, the United States is still a huge market when it comes to the revenues marketing research firms generate by conducting research in the country: in terms of their spending, American consumers fuel the world’s economic engine. Still, emerging countries with growing middle classes, such as China, India, and Brazil, are hot new markets companies want to tap.

What kind of challenges do firms face when trying to conduct marketing research abroad? As we explained, face-to-face surveys are commonly used in third world countries to collect information from people who cannot read or lack phones and computers. However, face-to-face surveys are also common in Europe, despite the fact that phones and computers are readily available. In-home surveys are also common in parts of Europe. By contrast, in some countries, including many Asian countries, it’s considered taboo or rude to try to gather information from strangers either face-to-face or over the phone. In many Muslim countries, women are forbidden to talk to strangers.

And how do you figure out whom to research in foreign countries? That in itself is a problem. In the United States, researchers often ask if they can talk to the heads of households to conduct marketing research. But in countries in which domestic servants or employees are common, the heads of households aren’t necessarily the principal shoppers; their domestic employees are (Malhotra).

Translating surveys is also an issue. Have you ever watched the TV comedians Jay Leno and David Letterman make fun of the English translations found on ethnic menus and products? Research tools such as surveys can suffer from the same problem. Hiring someone who is bilingual to translate a survey into another language can be a disaster if the person isn’t a native speaker of the language to which the survey is being translated.

One way companies try to deal with translation problems is by using back translation. When back translation is used, a native speaker translates the survey into the foreign language and then translates it back again to the original language to determine if there were gaps in meaning—that is, if anything was lost in translation. And it’s not just the language that’s an issue. If the research involves any visual images, they, too, could be a point of confusion. Certain colors, shapes, and symbols can have negative connotations in other countries. For example, the color white represents purity in many Western cultures, but in China, it is the color of death and mourning (Zouhali-Worrall, 2008). Also, look back at the cartoon-completion exercise in Figure 10.8 “Example of a Cartoon-Completion Projective Technique”. What would women in Muslim countries who aren’t allowed to converse with male sellers think of it? Chances are, the cartoon wouldn’t provide you with the information you’re seeking if Muslim women in some countries were asked to complete it.

One way marketing research companies are dealing with the complexities of global research is by merging with or acquiring marketing research companies abroad. The Nielsen Company is the largest marketing research company in the world. The firm operates in more than a hundred countries and employs more than forty thousand people. Many of its expansions have been the result of acquisitions and mergers.

Step 6: Analyze the Data

Step 6 involves analyzing the data to ensure it’s as accurate as possible. If the data is collected by hand using a pen or pencil, it’s entered into a computer. Or respondents might have already entered the information directly into a computer. For example, when Toyota goes to an event such as a car show, the automaker’s marketing personnel ask would-be buyers to complete questionnaires directly on computers. Companies are also beginning to experiment with software that can be used to collect data using mobile phones.

Once all the data is collected, the researchers begin the data cleaning, which is the process of removing data that have accidentally been duplicated (entered twice into the computer) or correcting data that have obviously been recorded wrong. A program such as Microsoft Excel or a statistical program such as Predictive Analytics Software (PASW, which was formerly known as SPSS) is then used to tabulate, or calculate, the basic results of the research, such as the total number of participants and how collectively they answered various questions. The programs can also be used to calculate averages, such as the average age of respondents, their average satisfaction, and so forth. The same can be done for percentages, and other values you learned about, or will learn about, in a statistics course, such as the standard deviation, mean, and median for each question.
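The cleaning and tabulating steps can be sketched in a few lines of Python instead of a spreadsheet. Everything in the example is hypothetical: the records, the field layout, and the age limits used to spot an obviously wrong entry. Note one assumption in particular: a bad value is flagged as missing for follow-up rather than corrected, because the right value cannot be guessed from the data alone.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical raw survey records: (respondent_id, age, satisfaction 1-5).
raw = [
    (101, 34, 4),
    (102, 27, 5),
    (102, 27, 5),    # entered twice by accident
    (103, 430, 3),   # age obviously recorded wrong
    (104, 51, 2),
    (105, 45, 4),
]


def clean(records, min_age=18, max_age=110):
    """Drop duplicate respondents and blank out implausible ages."""
    seen = set()
    kept = []
    for respondent_id, age, score in records:
        if respondent_id in seen:
            continue  # duplicate entry, keep only the first copy
        seen.add(respondent_id)
        if not (min_age <= age <= max_age):
            age = None  # flag for follow-up rather than guess a value
        kept.append((respondent_id, age, score))
    return kept


data = clean(raw)
ages = [age for _, age, _ in data if age is not None]
scores = [score for _, _, score in data]

# Basic tabulation: counts, averages, spread, and percentages.
counts = Counter(scores)
shares = {rating: round(100 * n / len(scores)) for rating, n in sorted(counts.items())}

print("participants:", len(data))
print("mean age:", mean(ages))
print("median satisfaction:", median(scores))
print("satisfaction std dev:", round(stdev(scores), 2))
print("percent by rating:", shares)
```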

The information generated by the programs can be used to draw conclusions, such as what all customers might like or not like about an offering based on what the sample group liked or did not like. The information can also be used to spot differences among groups of people. For example, the research might show that people in one area of the country like the product better than people in another area. Trends to predict what might happen in the future can also be spotted.

If there are any open-ended questions respondents have elaborated upon—for example, “Explain why you like the current brand you use better than any other brand”—the answers to each are pasted together, one on top of another, so researchers can compare and summarize the information. As we have explained, qualitative information such as this can give you a fuller picture of the results of the research.

Part of analyzing the data is to see if it seems sound. Does the way in which the research was conducted seem sound? Was the sample size large enough? Are the conclusions that become apparent from it reasonable?

The two criteria most commonly used to test the soundness of a study are (1) validity and (2) reliability. A study is valid if it actually tested what it was designed to test. For example, did the experiment you ran in Second Life test what it was designed to test? Did it reflect what could really happen in the real world? If not, the research isn’t valid. If you were to repeat the study, and get the same results (or nearly the same results), the research is said to be reliable. If you get a drastically different result if you repeat the study, it’s not reliable. The data collected, or at least some of it, can also be compared to, or reconciled with, similar data from other sources either gathered by your firm or by another organization to see if the information seems on target.
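One common way to put a number on reliability is to run the same questions twice with the same respondents and correlate the two sets of answers. The sketch below assumes that approach (often called test-retest reliability) and uses made-up satisfaction ratings from two hypothetical survey waves. A value near 1 suggests the repeat study gave nearly the same results; how high is "high enough" is a judgment call this example does not settle.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical ratings (1-5) from the same eight respondents, surveyed twice.
first_wave = [4, 5, 3, 2, 4, 5, 1, 3]
second_wave = [4, 4, 3, 2, 5, 5, 1, 3]

r = pearson_r(first_wave, second_wave)
print("test-retest correlation:", round(r, 2))
```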

Step 7: Write the Research Report and Present Its Findings

If you end up becoming a marketing professional and conducting a research study after you graduate, hopefully you will do a great job putting the study together. You will have defined the problem correctly, chosen the right sample, collected the data accurately, and analyzed it, and your findings will be sound. At that point, you will be required to write the research report and perhaps present it to an audience of decision makers. You will do so via a written report and, in some cases, a slide or PowerPoint presentation based on your written report.

The six basic elements of a research report are as follows.

  1. Title Page. The title page explains what the report is about, when it was conducted and by whom, and who requested it.
  2. Table of Contents. The table of contents outlines the major parts of the report, as well as any graphs and charts, and the page numbers on which they can be found.
  3. Executive Summary. The executive summary summarizes all the details in the report in a very quick way. Many people who receive the report—both executives and nonexecutives—won’t have time to read the entire report. Instead, they will rely on the executive summary to quickly get an idea of the study’s results and what to do about those results.
  4. Methodology and Limitations. The methodology section of the report explains the technical details of how the research was designed and conducted. The section explains, for example, how the data was collected and by whom, the size of the sample, how it was chosen, and whom or what it consisted of (e.g., the number of women versus men or children versus adults). It also includes information about the statistical techniques used to analyze the data.

    Every study has errors—sampling errors, interviewer errors, and so forth. The methodology section should explain these details, so decision makers can consider their overall impact. The margin of error is the overall tendency of the study to be off kilter—that is, how far it could have gone wrong in either direction. Remember how newscasters present the presidential polls before an election? They always say, “This candidate is ahead 48 to 44 percent, plus or minus 2 percent.” That “plus or minus” is the margin of error. The larger the margin of error is, the less likely the results of the study are accurate. The margin of error needs to be included in the methodology section.

  5. Findings. The findings section is a longer, fleshed-out version of the executive summary that goes into more detail about the statistics uncovered by the research that bolster the study’s findings. If you have related research or secondary data on hand that back up the findings, it can be included to help show the study did what it was designed to do.
  6. Recommendations. The recommendations section should outline the course of action you think should be taken based on the findings of the research and the purpose of the project. For example, if you conducted a global market research study to identify new locations for stores, make a recommendation for the locations (Mersdorf, 2009).
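The "plus or minus" figure described under Methodology and Limitations can be approximated with a standard formula for a proportion from a simple random sample: z · √(p(1 − p) / n). The sketch below assumes a 95 percent confidence level (z = 1.96) and a hypothetical poll of 2,400 people; neither number comes from the chapter. Real polls often use more elaborate adjustments.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical poll: 48 percent support among 2,400 randomly chosen voters.
moe = margin_of_error(0.48, 2400)
smaller_poll = margin_of_error(0.48, 600)

print(f"n = 2400: plus or minus {moe * 100:.1f} points")
print(f"n = 600:  plus or minus {smaller_poll * 100:.1f} points")
```

With a quarter as many respondents, the margin of error doubles, which is why the sample size and the margin of error belong together in the methodology section.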

As we have said, these are the basic sections of a marketing research report. However, additional sections can be added as needed. For example, you might need to add a section on the competition and each firm’s market share. If you’re trying to decide on different supply chain options, you will need to include a section on that topic.

As you write the research report, keep your audience in mind. Don’t use technical jargon decision makers and other people reading the report won’t understand. If technical terms must be used, explain them. Also, proofread the document to ferret out any grammatical errors and typos, and ask a couple of other people to proofread behind you to catch any mistakes you might have missed. If your research report is riddled with errors, its credibility will be undermined, even if the findings and recommendations you make are extremely accurate.

Many research reports are presented via PowerPoint. If you’re asked to create a slideshow presentation from the report, don’t try to include every detail in the report on the slides. The information will be too long and tedious for people attending the presentation to read through. And if they do go to the trouble of reading all the information, they probably won’t be listening to the speaker who is making the presentation.

Instead of including all the information from the study in the slides, boil each section of the report down to key points and add some “talking points” only the presenter will see. After or during the presentation, you can give the attendees the longer, paper version of the report so they can read the details at a convenient time, if they choose to.

Key Takeaway

Step 1 in the marketing research process is to define the problem. Businesses take a look at what they believe are symptoms and try to drill down to the potential causes so as to precisely define the problem. The next task for the researcher is to put into writing the research objective, or goal, the research is supposed to accomplish. Step 2 in the process is to design the research. The research design is the “plan of attack.” It outlines what data you are going to gather, from whom, how, and when, and how you’re going to analyze it once it has been obtained. Step 3 is to design the data-collection forms, which need to be standardized so the information gathered on each is comparable. Surveys are a popular way to gather data because they can be easily administered to large numbers of people fairly quickly. However, to produce the best results, survey questionnaires need to be carefully designed and pretested before they are used. Step 4 is drawing the sample, or a subset of potential buyers who are representative of your entire target market. If the sample is not correctly selected, the research will be flawed. Step 5 is to actually collect the data, whether it’s collected by a person face-to-face, over the phone, or with the help of computers or the Internet. The data-collection process is often different in foreign countries. Step 6 is to analyze the data collected for any obvious errors, tabulate the data, and then draw conclusions from it based on the results. The last step in the process, Step 7, is writing the research report and presenting the findings to decision makers.

Review Questions

  1. Explain why it’s important to carefully define the problem or opportunity a marketing research study is designed to investigate.
  2. Describe the different types of problems that can occur when marketing research professionals develop questions for surveys.
  3. How does a probability sample differ from a nonprobability sample?
  4. What makes a marketing research study valid? What makes a marketing research study reliable?
  5. What sections should be included in a marketing research report? What is each section designed to do?

1. “Questionnaire Design,” QuickMBA (accessed December 14, 2009).

Barnes, B., “Disney Expert Uses Science to Draw Boy Viewers,” New York Times, April 15, 2009 (accessed December 14, 2009).

Burns, A. and Ronald Bush, Marketing Research, 6th ed. (Upper Saddle River, NJ: Prentice Hall, 2010), 85.

Malhotra, N., Marketing Research: An Applied Approach, 6th ed. (Upper Saddle River, NJ: Prentice Hall), 764.

McDaniel, C. D. and Roger H. Gates, Marketing Research Essentials, 2nd ed. (Cincinnati: South-Western College Publishing, 1998), 61.

McWilliams, J., “A-B Puts Super-Low-Calorie Beer in Ring with Miller,” St. Louis Post-Dispatch, August 16, 2009 (accessed April 13, 2012).

Mersdorf, S., “How to Organize Your Next Survey Report,” Cvent, August 24, 2009 (accessed December 14, 2009).

Rappeport, A. and David Gelles, “Facebook to Form Alliance with Nielsen,” Financial Times, September 23, 2009, 16.

Spangler, T., “Disney Lab Tracks Feelings,” Multichannel News 30, no. 30 (August 3, 2009): 26.

Wagner, J., “Marketing in Second Life Doesn’t Work…Here Is Why!” GigaOM, April 4, 2007 (accessed December 14, 2009).

Wrenn, B., Robert E. Stevens, and David L. Loudon, Marketing Research: Text and Cases, 2nd ed. (Binghamton, NY: Haworth Press, 2007), 180.

Zouhali-Worrall, M., “Found in Translation: Avoiding Multilingual Gaffes,” July 14, 2008 (accessed December 14, 2009).

This is a derivative of Principles of Marketing by a publisher who has requested that they and the original author not receive attribution, which was originally released and is used under CC BY-NC-SA. This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

As discussed in previous chapters, toxicogenomic studies face significant challenges in the areas of validation, data management, and data analysis. Foremost among these challenges is the ability to compile high-quality data in a format that can be freely accessed and reanalyzed. These challenges can best be addressed within the context of consortia that generate high-quality, standardized, and appropriately annotated compendia of data. This chapter discusses issues related to the role of private and public consortia in (1) sample acquisition, annotation, and storage; (2) data generation, annotation, and storage; (3) repositories for data standardization and curation; (4) integration of data from different toxicogenomic technologies with clinical, genetic, and exposure data; and (5) issues associated with data transparency and sharing. Although this chapter touches on how these issues relate to risk assessment and the associated ethical, legal, and social implications, other chapters of this report discuss these aspects of toxicogenomics in greater detail.


Because the potential of toxicogenomic studies is to improve human risk assessment, it is imperative that susceptible human populations be studied with toxicogenomics. Given the hundreds of millions of dollars already spent on clinical trials, environmental cohort studies, and measurements of human exposures to environmental chemicals, it is imperative that toxicogenomic studies make the best possible use of the human population and clinical resources already in hand.

This prospect raises many questions. How does the research community ensure that appropriate cohorts and sample repositories are available once the technologies have matured to the point of routine application? What are the limitations of using existing cohorts for toxicogenomic studies? Given the limitations of existing cohorts with respect to the informed consent, sample collection, and data formats, how should future population-based studies be designed? What is the ideal structure of a consortium that would assemble such a cohort, collect the necessary samples and data, and maintain repositories? Consideration of these questions begins with an examination of the current state of affairs.

Current Practices and Studies

Central to all population-based studies are identifying suitable cohorts, recruiting participants, obtaining informed consent, and collecting the appropriate biologic samples and associated clinical and epidemiologic data. Collecting, curating, annotating, processing, and storing biologic samples (for example, blood, DNA, urine, buccal swabs, histologic sections) and epidemiologic data for large population-based studies is a labor-intensive, expensive undertaking that requires a significant investment in future research by the initial study investigators. As a result, study investigators have reasonable concerns about making biologic samples available for general use by others in the scientific community. With the notable exception of immortalized lymphocyte cell lines that provide an inexhaustible resource from the blood of study participants, biologic specimens represent a limited resource and researchers often jealously guard unused samples for future use once the “ideal” assays become available. It is not unusual for institutions with sizeable epidemiology programs to have hundreds of thousands of blood, serum, lymphocyte, and tumor samples in storage that are connected to clinical data as well as information on demographics, lifestyle, diet, behavior, and occupational and environmental exposures.

In addition to cohorts and samples collected as part of investigator-initiated studies and programs, sample repositories have also been accrued by large consortia and cooperative groups sponsored by the public and private sectors. An example of publicly sponsored initiatives in the area of cancer research and national and international sample and data repositories is the National Cancer Institute (NCI)-sponsored Southwest Oncology Group (SWOG), one of the largest adult cancer clinical trial organizations in the world. The SWOG membership and network consists of almost 4,000 physicians at 283 institutions throughout the United States and Canada. Since its inception in 1956, SWOG has enrolled and collected samples from more than 150,000 patients in clinical trials.

Examples in the area of cancer prevention are the NCI-sponsored intervention trials such as CARET (the Beta-Carotene and Retinol Efficacy Trial) for prevention of lung cancer, and the Shanghai Breast Self Exam study conducted in factory workers in China. CARET was initiated in 1983 to test the hypothesis that antioxidants, by preventing DNA damage from free radicals present in tobacco smoke, might reduce the risk of lung cancer. CARET involved more than 18,000 current and former male and female smokers, as well as males exposed to asbestos, assigned to placebo and treatment groups with a combination of beta-carotene and vitamin A. Serum samples were collected before and during the intervention, and tumor samples continue to be accrued. The randomized trial to assess the efficacy of breast self-exam in Shanghai accrued 267,040 current and retired female employees associated with 520 factories in the Shanghai Textile Industry Bureau. Women were randomly assigned on the basis of factory to a self-examination instruction group (133,375 women) or a control group that received instruction in preventing back injury (133,665 women) (Thomas et al. 1997). A large number of familial cancer registries have also been established for breast, ovarian, and colon cancer. There are comparable cooperative groups to study noncancer disease end points, such as heart disease, stroke, and depression.

Other examples include several large prospective, population-based studies designed to assess multiple outcomes. Participants provide baseline samples and epidemiologic data and are then followed with periodic resampling and updating of information. The prototype for this type of study is the Nurses’ Health Study (NHS), established in 1976 with funding from the National Institutes of Health (NIH), with a second phase in 1989. The primary motivation for the NHS was to investigate the long-term consequences of the use of oral contraceptives. Nurses were selected because their education increased the accuracy of responses to technically worded questionnaires and because they were likely to be motivated to participate in a long-term study. More than 120,000 registered nurses were followed prospectively. Every 2 to 4 years, cohort members receive a follow-up questionnaire requesting information about diseases and health-related topics including smoking, hormone use and menopausal status, diet and nutrition, and quality of life. Biologic samples collected included toenails, blood samples, and urine.

Another example of an ongoing trial to assess multiple disease end points is the NIH-funded Women’s Health Initiative. Established in 1991, this study was designed to address the most common causes of death, disability, and poor quality of life in postmenopausal women, including cardiovascular disease, cancer, and osteoporosis. The study included a set of clinical trials and an observational study, which together involved 161,808 generally healthy postmenopausal women. The clinical trials were designed to test the effects of postmenopausal hormone therapy, diet modification, and calcium and vitamin D supplements as well as ovarian hormone therapies. The observational study had several goals, including estimating the extent to which known risk factors predict heart disease, cancers, and fractures; identifying risk factors for these and other diseases; and creating a future resource to identify biologic indicators of disease, especially substances found in blood and urine. The observational study enlisted 93,676 postmenopausal women whose health was tracked over an average of 8 years. Women who joined this study filled out periodic health forms and visited the research clinic 3 years after enrollment.

Sample/data repositories are also being assembled in the private sector, where pharmaceutical companies are increasingly eager to form cooperative groups or collaborations with academia and health care providers that provide access to patient cohorts for clinical trials and other population-based studies. In some cases, companies have assembled their own cohorts with samples purchased directly from medical providers. Large prospective cohorts have also been assembled, sampled, and observed by departments of health in countries with socialized medicine (for example, Sweden, Finland). There are multiple examples of established cohort studies with samples that could be used in studies that evaluate the impact of environmental exposures. For example, the Occupational and Environmental Epidemiology Branch of the NCI (NCI 2006a) conducts studies in the United States and abroad to identify and evaluate environmental and workplace exposures that may be associated with cancer risk. Another example is the Agricultural Health Study (AHS 2006), a prospective study of nearly 90,000 farmers and their spouses in Iowa and North Carolina, carried out in collaboration with the National Institute of Environmental Health Sciences and the Environmental Protection Agency (EPA). These cohort studies involve sophisticated exposure assessments and mechanistic evaluations and include intensive collaborations among epidemiologists, industrial hygienists, and molecular biologists. These studies often involve collecting biologic samples to assess biologic effects from exposure to agricultural (for example, pesticides and herbicides), industrial, and occupational chemicals.

Another valuable resource is the National Health and Nutrition Examination Survey (NHANES) conducted by the Environmental Health Laboratory of the Centers for Disease Control and Prevention (CDC) at the National Center for Environmental Health. The goal of NHANES is to identify environmental chemicals in participants, quantify their levels, and determine how these amounts relate to health outcomes. The survey design calls for collecting blood, urine, and extensive epidemiologic data (demographic, occupational, lifestyle, dietary, and medical information) from people of all ages and from all areas of the country. As such, NHANES provides representative exposure data for the entire U.S. population. Rather than estimating exposures, NHANES measures the amounts of hundreds of chemicals or their metabolites in participants’ blood and urine, using the most sensitive, state-of-the-art analytical techniques. The number of people sampled varies among compounds but is typically several hundred to thousands, which is sufficient for determining the range of exposures in the population; determining racial, age, and sex differences in exposure; detecting trends in exposures over time; and analyzing the efficacy of interventions designed to reduce exposures. In addition to the measurements of exposures, blood and urine are also used to measure biomarkers of nutrition and indicators of general health status. A report of the findings is published every 2 years.

In summary, academia, numerous government agencies, health care providers, and companies in the private sector have already invested tremendous effort and resources into accruing cohorts for population-based studies to design better drugs, to predict responses to drug therapy, to assess the efficacy of drugs, to improve disease classification, to find genetic markers of disease susceptibility, to understand gene-environment interactions, and to assess the health effects of human exposures to environmental chemicals. These resources present an opportunity for developing partnerships to apply toxicogenomic technologies to ongoing and future collaborative projects.

Limitations and Barriers

The most expensive and arduous components of any population-based study are the collection of epidemiologic data and samples in a form that can be used to answer the largest number of questions in a meaningful way. Ideally, the same cohorts and samples should be available for multiple studies and reanalyses. However, for a wide variety of reasons, ranging from the types of samples collected to issues of informed consent, this is rarely possible given the structure of existing cohorts.

Structure of Cohorts

Given the cost and logistics of assembling, obtaining consent, and following large cohorts, many studies are designed to address only a few specific questions. Typically, case-control or association studies are used to test a new hypothesis before testing is expanded and validated with robust and expensive population-based cross-sectional designs or prospective studies. As a result, many studies lack the statistical power to address additional hypotheses. For example, a case-control study designed to investigate the contribution of single nucleotide polymorphisms (SNPs) in a gene encoding a specific xenobiotic metabolizing enzyme in cancer will typically be underpowered to investigate the interactions of these SNPs with SNPs in other genes. Using larger cohorts that allow for a broader application would clearly be advantageous to future toxicogenomic studies.
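The loss of power described above can be illustrated with a rough calculation. The following Python sketch uses a standard normal approximation for a two-sided comparison of two proportions; the carrier frequencies and group sizes are hypothetical values chosen only for illustration, and comparing the frequency of joint carriers of two variants is a simplified stand-in for a formal interaction analysis.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_controls, p_cases, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    Normal approximation with equal group sizes; the negligible
    contribution of the opposite tail is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p_controls + p_cases) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt(
        (p_controls * (1 - p_controls) + p_cases * (1 - p_cases)) / n_per_group
    )
    diff = abs(p_cases - p_controls)
    return z.cdf((diff - z_crit * se_null) / se_alt)


# Hypothetical frequencies, not taken from any study.
# Carriers of a single variant: 30% of controls, 40% of cases.
single_variant_power = power_two_proportions(0.30, 0.40, n_per_group=500)

# Joint carriers of two variants are rarer: 6% of controls, 9% of cases.
joint_carrier_power = power_two_proportions(0.06, 0.09, n_per_group=500)

# The same joint-carrier comparison in a cohort four times larger.
joint_carrier_power_large = power_two_proportions(0.06, 0.09, n_per_group=2000)
```

Under these assumed values, a study sized comfortably for the single-variant question has substantially lower power for the rarer joint-carrier comparison, and the larger cohort recovers much of it.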

Heterogeneity of Disease or Response Classification

Another important aspect of any population-based study is accurate genotyping and precise phenotyping of diseases and responses. Whether a study is looking at genetic linkage, disease association, susceptibility, or responsiveness to environmental agents, any phenotypic or genotypic misclassification will weaken, or even obscure, the true association between genotype and phenotype. Some diseases, such as clinical depression, migraine headaches, and schizophrenia, are inherently difficult to diagnose and accurately classify at the clinical level. Likewise, an increasing number of molecular analyses of tumors indicate that significant heterogeneity exists even among tumors with a similar histopathology and that these molecular differences can influence the clinical course of the disease and the response to therapy or chemoprevention (Potter 2004). The resulting inaccuracies in genotypic and phenotypic stratification of disease in cases can limit the utility of cohorts and their associated samples. However, increased stratification of disease based on genotype and molecular phenotype can also have an adverse effect by reducing the statistical power of a cohort to detect disease association or linkage. As the capacity to define homogeneous genotypes and phenotypes increases, the size of cohorts will need to be increased well above those used in present studies.
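The weakening effect of misclassification can be shown with a small worked example. The Python sketch below assumes a simple model in which a fixed fraction of the individuals labeled as cases are in fact non-cases, so that the genotype frequency observed among labeled cases is a mixture of the true case and control frequencies. All frequencies are hypothetical.

```python
def odds_ratio(p_carrier_cases, p_carrier_controls):
    """Odds ratio for carrying a genotype, cases versus controls."""
    odds_cases = p_carrier_cases / (1 - p_carrier_cases)
    odds_controls = p_carrier_controls / (1 - p_carrier_controls)
    return odds_cases / odds_controls


def observed_case_frequency(p_cases, p_controls, frac_mislabeled):
    """Genotype frequency among labeled cases when a fraction of them
    are really non-cases (assumed to carry the control frequency)."""
    return (1 - frac_mislabeled) * p_cases + frac_mislabeled * p_controls


# Hypothetical carrier frequencies: 40% in true cases, 25% in controls.
true_or = odds_ratio(0.40, 0.25)

# Suppose 20% of the labeled cases are misdiagnosed non-cases.
p_observed = observed_case_frequency(0.40, 0.25, frac_mislabeled=0.20)
observed_or = odds_ratio(p_observed, 0.25)
```

With these assumed numbers the true odds ratio is 2.0, and the odds ratio computed from the mislabeled case group falls between 1 and 2, which is the attenuation toward the null described in the text.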

Sample Collection

A major impediment to the use of existing cohorts is that many studies have not collected appropriate types of specimens, or the available specimens are in a format that is not amenable to toxicogenomic studies. For example, the NCI funded the Cancer Genetics Network to identify and collect data on families at elevated risk for different forms of cancer. No provisions were made for collecting biologic specimens within the funded studies, and the samples that are collected are often inappropriate for genomic analysis. Traditionally, cancer cohort studies have collected formalin-fixed, paraffin-embedded samples of tumors and adjacent normal tissue. Although DNA for SNP-based genotyping can be extracted from such fixed specimens, albeit with some difficulty (e.g., Schubert et al. 2002), these samples usually do not yield representative mRNA appropriate for gene expression profiling.

Clinical trials and epidemiologic studies have not usually collected or analyzed DNA samples (Forand et al. 2001). On the other hand, most molecular epidemiology studies diligently collected blood or serum samples for biomarker analyses, but multiple challenges remain. The way samples are collected, handled, shipped, and preserved varies greatly among studies. Many samples are flash frozen and hence are adequate for DNA extraction and genotyping, but such samples are usually limited in size and this may not allow for comprehensive genotyping. To deal with this limitation, some studies have developed transformed cell lines from human lymphocytes as a way to create inexhaustible supplies of DNA for genomic studies (e.g., Shukla and Dolan 2005). These cell lines could provide valuable experimental material to determine the impact of interindividual genetic variation on the response to compounds in in vitro studies. However, whether the results from studies with cell lines can be translated to human health effects remains to be determined.

In the case of the primary blood lymphocytes from study participants, unless steps were taken to preserve mRNA at the time of collection, the samples may have limited utility in gene expression profiling, although serum samples obviously could be used for proteomic and metabonomic studies. The NCI is currently funding a large initiative in the application of serum proteomics for early detection of disease (NCI 2006b). However, whether the methods of preservation currently used will allow for accurate analyses after many years of storage and to what extent different methods of sample preparation and storage affect the applicability of samples to proteomic and metabonomic analyses remain to be determined. In summary, there are numerous impediments to using existing samples available through cohort studies. There is both a need and an opportunity to standardize methodologies in ongoing studies and to design future studies to accommodate the requirements of new toxicogenomic platforms.

Data Uniformity

Another impediment to using existing cohorts for toxicogenomic applications is the lack of uniformity in data collection standards among population-based studies. Although there will always be a need to collect data specific to a particular study or application, much can be done to standardize questionnaires that collect demographic, dietary, occupational, and lifestyle data common to all studies. Moreover, efforts should be launched to develop a standardized vocabulary for all data types, including clinical data, which can be recorded in a digitized electronic format that can be searched with text-mining algorithms. An example of such a standardized approach is the attempt to standardize the vocabulary for histopathology within the Mouse Models of Cancer Consortium (NCI 2006c). The NCI launched a related effort, the cancer Biomedical Informatics Grid (caBIG), to develop standardized tools, ontologies, and representations of data (NCI 2006d).

Sharing and Distributing Data

The biologic samples and data collected by clinical trials, epidemiologic studies, and human exposure studies funded by agencies such as the NIH, the CDC, and the EPA represent a significant public scientific resource. The full value of these studies for basic research and validation studies cannot be realized until appropriate data-sharing and data-distribution policies are formulated and supported by government, academia, and the private sector. Several NIH institutes such as the NCI and the National Heart, Lung, and Blood Institute (NHLBI) have drafted initial policies to address this need.

These policies have several key features that promote optimal use of the data resources, while emphasizing the protection of data on human subjects. For example, under the NHLBI policy, study coordinators retain information such as identifiers linked to personal information (date of birth, study sites, physical exam dates). However, “limited access data,” in which personal data have been removed or transformed into more generic forms (for example, age, instead of date of birth), can be distributed to the broader scientific community. Importantly, study participants who did not consent to their data being shared beyond the initial study investigators are not included in the dataset. In some cases, consent forms provide the option for participants to specify whether their data can be used for commercial purposes. In these cases, it is important to discriminate between commercial purpose datasets and non-commercial purpose datasets to protect the rights of the human subjects.
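The preparation of a limited access dataset along these lines can be sketched in a few lines of Python. The field names (`participant_name`, `study_site`, `exam_date`, `date_of_birth`, `consented_to_sharing`) are hypothetical and are not drawn from the NHLBI policy; the sketch only illustrates the three steps named above: excluding participants who did not consent to sharing, removing identifying fields, and replacing date of birth with age.

```python
from datetime import date

# Hypothetical field names for information retained by study coordinators.
RETAINED_BY_COORDINATORS = {
    "participant_name", "study_site", "exam_date", "date_of_birth",
}


def age_in_years(born, on):
    """Whole years elapsed between a birth date and a reference date."""
    had_birthday = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_birthday else 1)


def make_limited_access(records, reference_date):
    """Return de-identified copies of the records that may be shared."""
    released = []
    for rec in records:
        # Participants who did not consent to sharing are excluded.
        if not rec.get("consented_to_sharing", False):
            continue
        row = {
            key: value
            for key, value in rec.items()
            if key not in RETAINED_BY_COORDINATORS
            and key != "consented_to_sharing"
        }
        # Date of birth is replaced by the more generic age.
        row["age"] = age_in_years(rec["date_of_birth"], reference_date)
        released.append(row)
    return released


# Made-up example records.
records = [
    {"participant_name": "A", "study_site": "site 1",
     "exam_date": date(2005, 3, 1), "date_of_birth": date(1950, 6, 15),
     "systolic_bp": 128, "consented_to_sharing": True},
    {"participant_name": "B", "study_site": "site 2",
     "exam_date": date(2005, 4, 9), "date_of_birth": date(1962, 1, 2),
     "systolic_bp": 117, "consented_to_sharing": False},
]
limited = make_limited_access(records, reference_date=date(2006, 1, 1))
```

Removing these fields does not by itself guarantee that participants cannot be re-identified by combining the released data with other sources, which is why the policies also rely on data-distribution agreements.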

Because it may be possible to combine limited access data with other publicly available data to identify particular study participants, those who want to obtain these datasets must agree in advance to adhere to established data-distribution agreements through the Institute, with the understanding that violation of the agreement may lead to legal action by study participants, their families, or the U.S. government. These investigators must also have approval of the institutional review board before they distribute the data.

Under these policies, it is the responsibility of the initial study investigators to prepare the datasets in such a way as to enable new investigators who are not familiar with the data to fully use them. In addition, documentation of the study must include data collection forms, study procedures and protocols, data dictionaries (codebooks), descriptions of all variable recoding, and a list of major study publications. Currently, NHLBI datasets are requested in Statistical Analysis System (SAS) format, together with program code that generates a SAS file from the SAS export data file. The NCI cancer Biomedical Informatics Grid (caBIG) initiative is exploring similar requirements.
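The role of a data dictionary in making a dataset usable by new investigators can be illustrated with a minimal sketch. The Python example below is written under stated assumptions: the codebook layout and variable names are hypothetical and do not reproduce any NHLBI format. It checks that every variable in a data row is documented and that its value matches the documented type and allowed values.

```python
# Hypothetical codebook: one entry per documented variable.
CODEBOOK = {
    "age": {"label": "Age at enrollment", "type": int, "units": "years"},
    "smoker": {"label": "Current smoker", "type": str,
               "allowed": {"yes", "no"}},
}


def codebook_problems(row, codebook=CODEBOOK):
    """List the ways a data row disagrees with the codebook."""
    problems = []
    for name, value in row.items():
        entry = codebook.get(name)
        if entry is None:
            problems.append(f"{name}: not documented in codebook")
        elif not isinstance(value, entry["type"]):
            problems.append(f"{name}: expected {entry['type'].__name__}")
        elif "allowed" in entry and value not in entry["allowed"]:
            problems.append(f"{name}: value {value!r} not in allowed set")
    return problems


clean = codebook_problems({"age": 55, "smoker": "no"})
undocumented = codebook_problems({"age": 55, "bmi": 27.1})
bad_value = codebook_problems({"age": 55, "smoker": "sometimes"})
```

A check of this kind is one simple way for the original investigators to confirm, before release, that the documentation and the data agree.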

Timing the release of limited access data is another major issue. Currently, policies differ across institutes and across study types. For large epidemiologic studies that require years of data to accumulate before analysis can begin, a period of 2 to 3 years after the completion of the final patient visit is typically allowed before release of the data. The interval is intended to strike a balance between the rights of the original study investigators and the wider scientific community.

Commercialization of repository data is another issue that could impede or enhance future studies. In Iceland, DNA samples from the cohort composed of all residents who agree to participate are offered as a commercial product for use in genetic studies. Based in Reykjavik, deCODE’s product is genetic information linked, anonymously, to medical records for the country’s 275,000 inhabitants. Iceland’s population is geographically isolated and many of its members share the same ancestors. In addition, family and medical records have been thoroughly recorded since the inception of the National Health Service in 1915. Icelanders also provide a relatively simple genetic background to investigate the genetics of disease. The population resource has proved its worth in studies of conditions caused by single defective genes, such as rare hereditary conditions including forms of dwarfism, epilepsy, and eye disorders. deCODE has also initiated projects in 25 common diseases including multiple sclerosis, psoriasis, pre-eclampsia, inflammatory bowel disease, aortic aneurysm, alcoholism, and obesity. Clearly, future toxicogenomic studies will have to take into account possible use of such valuable, albeit commercial, resources.

Informed Consent

Perhaps the greatest barrier to using existing sample repositories and data collections for toxicogenomic studies is the issue of informed consent and its ethical, legal, and social implications (see Chapter 11). Government regulations increasingly preclude the use of patient data and samples in studies for which the patient did not specifically provide consent. Consequently, many potentially valuable resources cannot be used for applications in emerging fields such as toxicogenomics. Legislation to protect the public, such as the Health Insurance Portability and Accountability Act (HIPAA), which was enacted to protect patients against possible denial of health insurance as a result of intentional or unintentional distribution of patient health or health risk data, has unwittingly impaired population-based studies (DHHS 2006). Researchers now face tightening restrictions on the use of any data that can be connected with patient identity. Despite the significant barriers imposed by patient confidentiality and issues of informed consent, mechanisms exist for assembling large cohorts and collecting samples that can be used for multiple studies and applications to benefit public health. For example, the NHANES program under the aegis of the CDC collects blood samples for measurement of chemicals, nutritional biomarkers, and genotypes from large numbers of individuals using a consent mechanism that expressly authorizes multiple uses of biologic samples. However, because NHANES functions under the aegis of the CDC, it is exempt from many of the requirements of HIPAA. Nonetheless, solutions can be found that meet research needs while protecting the rights of individuals. An example of a successful approach is the BioBank in the United Kingdom (UK Biobank Limited 2007), a long-term project aimed at building a comprehensive resource for medical researchers by gathering information on the health and lifestyle of 500,000 volunteers between 40 and 69 years old. 
After providing consent, each participant donates a blood and urine sample, has standard clinical diagnostic measurements (such as blood pressure), and completes a confidential lifestyle questionnaire. Fully approved researchers can then use these resources to study the progression of illnesses such as cancer, heart disease, diabetes, and Alzheimer’s disease in these participants over the next 20 to 30 years to develop new and better ways to prevent, diagnose, and treat such problems. The BioBank ensures that data and samples are used for ethically and scientifically approved research. Issues such as informed consent, confidentiality, and security of the data are guided by an Ethics and Governance Framework overseen by an independent council.

In summary, the current state of sample repositories and their associated data is less than ideal, and there are numerous limitations and barriers to their immediate use in the emerging field of toxicogenomics. Future studies require that issues in the design of cohort studies, phenotype and genotype classification, standardization of sample and data collection, database structure and sharing, and informed patient consent be addressed and mitigated. However, even when all these issues have been addressed, toxicogenomics faces additional and significant challenges related to the complexity of data collected by the various toxicogenomic technologies.


The following section provides an overview of the current standards for toxicogenomic data and a brief overview of existing databases and repositories. The remaining needs that must be met to move forward are described in the Conclusions section of this chapter.

Standards for Toxicogenomic Data

Although each individual study conducted using toxicogenomic approaches may provide insight into particular responses produced by compounds, the real value of the data goes beyond the individual assays: it will be realized only when patterns of gene, protein, or metabolite expression are assembled and linked to a variety of other data resources. Consequently, there is a need for well-designed and well-maintained repositories for toxicogenomic data that facilitate toxicogenomic data use and allow further independent analysis. This is a daunting task. Genome sequencing projects have generated large quantities of data, but the ancillary information associated with those data is much less complex than the information necessary to interpret toxicogenomic data. Genome sequence does not vary significantly with the tissue analyzed, the age of the organism, its diet, the time of day, hormonal status, or exposure to a particular agent. Analyses of expression patterns, however, are dramatically affected by these and a host of other variables that are essential for proper interpretation of the data.

In 1999, a group representing the major DNA sequence databases, large-scale practitioners of microarray analysis, and a number of companies developing microarray hardware, reagents, and software tools began discussing these issues. These discussions resulted in creation of the Microarray Gene Expression Data Society (MGED). MGED took on the task of developing standards for describing microarray experiments with one simple question in mind: What is the minimum information necessary for an independent scientist to perform an independent analysis of the data? Based on feedback received through numerous discussions and a series of public meetings and workshops, a new set of guidelines called the MIAME standard (Minimum Information About a Microarray Experiment) was developed (MIAME 2002). The publication of this new standard was met with general enthusiasm from the scientific community and the standards have evolved with continued input from scientists actively conducting microarray studies. To facilitate usage of this new standard, brief guidelines and a “MIAME checklist” were developed and provided to scientific journals for their use when considering publication of manuscripts presenting microarray data (Brazma et al. 2001, MGED 2005).

Whereas MIAME can be thought of as a set of guidelines for describing an experiment, it is clear that these guidelines must be translated into protocols enabling the electronic exchange of data in a standard format. To meet that challenge, a collaborative effort by members of MGED and a group at Rosetta Inpharmatics led to the development of the microarray gene expression (MAGE) object model as well as its implementation as an XML-based markup language, MAGE-ML (Spellman et al. 2002). The adoption of MAGE by the Object Management Group has promoted MAGE to the status of an “official” industry standard. MAGE is now being built into a wide range of freely available microarray software, including BASE (Saal et al. 2002), BioConductor (Dudoit et al. 2003), and TM4 (Saeed et al. 2003). An increasing number of companies are also adopting MIAME and MAGE as essential components of their products. The guidelines in MIAME have continued to evolve, and both MIAME and MAGE are being prepared for a second release.

Efforts are also under way to develop an extended version of the MIAME data standard that allows for integration of other data specific to toxicologic experiments (for example, dose, exposure time, species, and toxicity end points). Termed MIAME/Tox, this standard would form the basis for entry of standardized data into toxicogenomic databases (MIAME 2003). Similar efforts have produced microarray standards for a variety of related applications, including a MIAME/Env for environmental exposures (Sansone et al. 2004) and MIAME/Nut for nutrigenomics (Garosi et al. 2005). Work is also ongoing to develop standards for proteomics (the Minimum Information About a Proteomics Experiment) (Orchard et al. 2004) and for metabolomic profiling (the Standard Metabolic Reporting Structure) (Lindon et al. 2005a).
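The idea of a minimum-information standard extended with toxicology-specific items can be illustrated with a simple completeness check. The Python sketch below is not the MIAME/Tox checklist; the field names are hypothetical and are based only on the examples given above (dose, exposure time, species, and toxicity end points), with a compound name and a dose unit added as assumptions.

```python
# Hypothetical required annotations; not the actual MIAME/Tox checklist.
REQUIRED_TOX_FIELDS = (
    "species",
    "compound",
    "dose",
    "dose_unit",
    "exposure_time_h",
    "toxicity_end_point",
)


def missing_fields(record, required=REQUIRED_TOX_FIELDS):
    """Return the required annotations that are absent or empty.

    A numeric value of 0 (for example, the dose of a vehicle control)
    counts as present.
    """
    return [
        name for name in required
        if name not in record or record[name] is None or record[name] == ""
    ]


# Made-up experiment annotations.
complete = {
    "species": "Rattus norvegicus",
    "compound": "compound_X",
    "dose": 0,
    "dose_unit": "mg/kg",
    "exposure_time_h": 24,
    "toxicity_end_point": "serum ALT",
}
incomplete = {"species": "Rattus norvegicus", "compound": "compound_X",
              "dose": 150, "dose_unit": ""}

complete_gaps = missing_fields(complete)
incomplete_gaps = missing_fields(incomplete)
```

A database that accepts submissions could run a check of this kind at entry, so that every stored experiment carries the annotations needed for independent reanalysis.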

Public Data Repositories

The ArrayExpress database of the European Bioinformatics Institute (EBI) (Brazma et al. 2003), as well as the National Center for Biotechnology Information’s Gene Expression Omnibus (Edgar et al. 2002) and the Center for Information Biology Gene Expression Database at the DNA Data Bank of Japan, have adopted and supported the MIAME standard. Following the same model used for sequence data, data exchange protocols are being developed to link expression data found in each of these three major repositories. Other large public databases such as the Stanford Microarray Database (SMD 2006) and caBIG (NCI 2006d) have been developed in accordance with the MIAME guidelines. However, these are essentially passive databases that apply standards to data structure but provide little or no curation for data quality, quality assurance, or annotation. Other small-scale, publicly available tools and resources that have been developed for sharing, storing, and mining toxicogenomic data include dbZach (MSU 2007) and EDGE2, the Environment, Drugs and Gene Expression database (UW-Madison 2006).

Private and Public Toxicogenomic Consortia

Databases developed within private or public consortia are often much more proactive in curating data on the basis of quality and annotation. The pharmaceutical industry, for example, has generated very large compendia of toxicogenomic data on both proprietary and public compounds. These data are high quality but in general are not accessible to the public. Datasets and data repositories developed for toxicogenomics within the public sector, or in partnership with the private sector, are also actively curated and annotated but are made available for public access after publication of findings. Examples of such cooperative ventures are described below.

The Environmental Genome Project

The Environmental Genome Project (EGP) is an initiative sponsored by the National Institute of Environmental Health Sciences (NIEHS) that was launched in 1998. Inextricably linked to the Human Genome Project, the underlying premise for the EGP was that a subset of genes exists that have a greater than average influence on human susceptibility to environmental agents. Therefore, it was reasoned that the identification of these environmentally responsive genes, and characterization of their SNPs, would lead to enhanced understanding of human susceptibility to diseases with an environmental etiology. The EGP provided funding for extramural research projects in multiple areas, including bioinformatics/statistics; DNA sequencing; functional analysis; population-based epidemiology; ethical, legal, and social issues; technology development; and mouse models of disease. A variety of mechanisms including centers such as the Comparative Mouse Genomics Centers Consortium, the Toxicogenomics Research Consortium, the SNPs Program, and the Chemical Effects in Biological Systems database, which are described below, provided research support.

The NIEHS scientific community selected the list of environmentally responsive genes to be analyzed, which currently numbers 554 genes, although this list is not all-inclusive and is subject to ongoing modification. Among the environmentally responsive genes are those involved in eight categories or ontologies including cell cycle, DNA repair, cell division, cell signaling, cell structure, gene expression, apoptosis, and metabolism. The goal of the EGP included resequencing these genes in 100 to 200 individuals representing the range of human diversity and then assessing the polymorphisms for their impact on gene function. However, the small size of the program precluded disease-association studies. The goal of the program was to provide a database of SNPs that could then be used in larger epidemiologic studies.

The Pharmacogenetics Research Network

The NIH-sponsored Pharmacogenetics Research Network is a nationwide collaborative research network that seeks to enhance understanding of how genetic variation among individuals contributes to differences in responses to drugs. Part of this network is the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB), a publicly available Internet research tool that curates genetic, genomic, and cellular phenotypes as well as clinical information obtained from participants in pharmacogenomic studies. The database includes, but is not limited to, information obtained from clinical, pharmacokinetic, and pharmacogenomic research on drugs targeting the cardiovascular system, the pulmonary system, and cancer as well as studies on biochemical pathways, metabolism, and transporter domains. PharmGKB encourages submission of primary data from any study on how genes and genetic variation affect drug responses and disease phenotypes.

The NIEHS Toxicogenomics Research Consortium

As a complement to its ongoing participation with the International Life Sciences Institute’s Health and Environmental Sciences Institute (ILSI-HESI) Consortium, the NIEHS, in conjunction with the National Toxicology Program (NTP), established the Toxicogenomics Research Consortium (TRC). The goals were to enhance research in the area of environmental stress responses using transcriptome profiling by (1) developing standards and “best practices” by evaluating and defining sources of variation across labs and platforms, and (2) contributing to the development of a robust relational database that combines toxicologic end points with changes in the expression patterns of genes, proteins, and metabolites.

The TRC included both consortium-wide collaborative studies and independent research projects at each consortium member site. On the whole, the TRC was successful and achieved many of its goals. The TRC published a landmark paper that was the first to not only define the various sources of error and variability in microarray experiments but also to quantify the relative contributions of each source (Bammler et al. 2005). The study confirmed the finding of the ILSI-HESI Consortium (described below), indicating that, despite the lack of concordance across platforms in the expression of individual genes, concordance was high when considering affected biochemical pathways. The TRC results were also concordant with work by other groups who arrived at similar conclusions (Bammler et al. 2005, Dobbin et al. 2005, Irizarry et al. 2005, Larkin et al. 2005). Although the amount of funding limited the scope of independent research projects funded within each center, this component was highly successful and has led to numerous outstanding publications in the field (e.g., TRC 2003).

The Chemical Effects in Biological Systems Knowledge Base

The Chemical Effects in Biological Systems (CEBS) (Waters et al. 2003b) knowledge base developed at the National Center for Toxicogenomics at the NIEHS represents a first attempt to create a database integrating various aspects of toxicogenomics with classic information on toxicology. The goal was to create a database containing fully documented experimental protocols searchable by compound, structure, toxicity end point, pathology end point, gene, gene group, SNP, pathway, and network as a function of dose, time, and the phenotype of the target tissue. These goals have not been realized.

ILSI-HESI Consortium

The ILSI-HESI Consortium on the Application of Genomics to Mechanism Based Risk Assessment is an example of a highly successful, international collaboration between the private and public sectors, involving 30 laboratories from industry, academia, and national laboratories. The ILSI-HESI Toxicogenomics Committee developed the MIAME implementation for the toxicogenomic community in collaboration with the European Bioinformatics Institute (a draft MIAME/Tox standard is included in Appendix B). In addition to helping develop data standards, the consortium was also the first large-scale experimental program to offer practical insights into genomic data exchange issues.

Using well-studied compounds with well-characterized organ-specific toxicities (for example, acetaminophen), the consortium conducted more than 1,000 microarray hybridizations on Affymetrix and cDNA platforms to determine whether known mechanisms of toxicity can be associated with characteristic gene expression profiles. These studies validated the utility of gene expression profiling in toxicology, led to the adoption of data standards, evaluated methodologies and standard operating procedures, performed cross-platform comparisons, and investigated dose and temporal responses on a large scale. The studies also demonstrated that, despite differences in gene-specific data among platforms, expression profiling on any platform revealed common pathways affected by a given exposure. Multiple landmark papers published the research findings of the consortium (Amin et al. 2004; Baker et al. 2004; Chu et al. 2004; Kramer et al. 2004; Mattes 2004; Mattes et al. 2004; Pennie et al. 2004; Thompson et al. 2004; Ulrich et al. 2004; Waring et al. 2004).

The Consortium for Metabonomic Toxicology

Another example of a highly successful consortium, the Consortium for Metabonomic Toxicology (COMET) was a collaborative venture between Imperial College London, six major pharmaceutical companies, and a nuclear magnetic resonance (NMR) instrumentation company. The main objectives of the consortium were similar to those of the ILSI-HESI Consortium but with proton NMR-based metabonomic profiling (Lindon et al. 2003). The group developed a database of 1H NMR spectra from rats and mice dosed with model toxins and an associated database of metadata on the studies and samples. The group also developed powerful informatic tools to detect toxic effects based on NMR spectral profiles.

Since its inception, COMET has defined sources of variation in data, established baseline levels and ranges of normal variation, and completed studies on approximately 150 model toxins (Lindon et al. 2005b). In addition, the analytical tools developed have been validated by demonstrating a high degree of accuracy in predicting the toxicity of toxins that were not included in the training dataset.

Commercial Databases: Iconix Pharmaceuticals and Gene Logic

Databases designed for use in predictive toxicology have also been developed in the private sector for commercial applications. An example is the DrugMatrix database developed at Iconix in 2001. Working with Incyte, MDS Pharma Services, and Amersham Biosciences, Iconix selected more than 1,500 compounds, all of which were used to generate in vitro expression profiles of primary rat hepatocytes and in vivo gene expression profiles of heart, kidney, and liver tissue from exposed rats. They then used these data to populate a database that included publicly available data on biologic pathways, toxicity associated with each compound, and pathology and in vitro pharmacology data. The database was then overlaid with software tools for rapid classification of new compounds by comparisons with the expression patterns of compounds in the database.
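The idea of classifying a new compound by comparing its expression pattern with those already in a reference database can be illustrated with a minimal sketch. This is not the method used by Iconix or any other vendor; it assumes a simple nearest-profile rule based on Pearson correlation, and the compound names, class labels, and log2 fold-change values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def classify(profile, reference):
    """Return (label, compound, correlation) for the best-matching reference entry."""
    best = max(reference, key=lambda entry: pearson(profile, entry["profile"]))
    return best["label"], best["compound"], pearson(profile, best["profile"])


# Hypothetical reference entries: log2 fold changes for the same four genes.
REFERENCE = [
    {"compound": "ref_A", "label": "hepatotoxic", "profile": [2.1, 1.8, -0.2, 0.1]},
    {"compound": "ref_B", "label": "hepatotoxic", "profile": [1.9, 2.2, 0.0, -0.3]},
    {"compound": "ref_C", "label": "nephrotoxic", "profile": [-0.1, 0.2, 1.7, 2.0]},
]

# Hypothetical profile of an uncharacterized compound.
new_compound = [0.1, -0.2, 1.5, 2.2]
label, match, r = classify(new_compound, REFERENCE)
print(label, match, round(r, 2))
```

A production system would work with thousands of genes, replicate measurements, and a validated classifier; the sketch only shows why the value of such a database grows with the number of well-annotated reference compounds it contains.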

A similar database for predictive, mechanistic, and investigational toxicology, named Tox-Express, was developed at Gene Logic. These databases are large by current standards and hence these companies occupy a market niche in predictive toxicology. The information in these databases far exceeds that available through publicly available sources.


Leveraging Existing Studies

Given the potential of toxicogenomic studies to improve human risk assessment, it is imperative to conduct toxicogenomic studies of human populations. Academic institutions, government agencies, health care providers, and the private sector have already invested tremendous effort and resources into accruing cohorts for population-based studies to predict responses to drug therapy, to improve disease classification, to find genetic markers of disease susceptibility, to understand gene-environment interactions, and to assess effects of human exposures to environmental chemicals. Given the hundreds of millions of dollars already spent on clinical trials, environmental cohort studies, and measurements of human exposures to environmental chemicals, it is imperative that toxicogenomic studies make the best possible use of the resources at hand. Ensuring access to the samples and data produced by existing studies will require consortia and other cooperative interactions involving health care providers, national laboratories, academia, and the private sector. Sharing of data and samples also raises important ethical, legal, and social issues, such as informed consent and confidentiality, that must be addressed.

Even though existing cohorts and ongoing studies should be leveraged when possible, the current state of sample repositories and their associated data is less than ideal, and there are numerous limitations and barriers to their immediate use in toxicogenomics. For example, there is little uniformity in data collection standards among population-based studies, and these studies for the most part were not designed with genomic research in mind. This lack of uniformity and the fact that samples and data are seldom collected in a format that can be assessed by genomic technologies present a major impediment to using existing cohorts for toxicogenomic applications. Although there will always be a need to collect data specific to a particular study or application, much can be done to standardize collection of samples and of demographic, dietary, occupational, and lifestyle data common to all studies.

Databases and Building a Toxicogenomic Database

Toxicogenomic technologies generate enormous amounts of data, on a scale even larger than sequencing efforts like the Human Genome Project. Current public databases are inadequate to manage the types or volumes of data expected to be generated by large-scale applications of toxicogenomic technologies or to facilitate the mining and interpretation of those data, which are just as important as their storage. In addition, because the predictive power of databases increases iteratively with the addition of new compounds, significant benefit could be realized from incorporating greater volumes of data. A large, publicly accessible database of quality data would strengthen the utility of toxicogenomic technologies, enabling more accurate prediction of health risks associated with existing and newly developed compounds, providing context to toxicogenomic data generated by drug and chemical manufacturers, informing risk assessments, and generally improving the understanding of toxicologic effects.

The type of database needed is similar in concept to the vision for the CEBS database (Waters et al. 2003b), which was envisioned to be more than a data repository and, more importantly, to provide tools for analyzing and using data. The vision also included elements of integration with other databases and multidimensionality in that the database would accommodate data from various toxicogenomic technologies beyond gene expression data. While CEBS (CEBS 2007) is not well populated with data, adding data will not solve its shortcomings; the original goals for the database were not implemented.

The lack of a sufficient database in the public sector represents a serious obstacle to progress for the academic research community. Development of the needed database could be approached by either critically analyzing and improving the structure of CEBS or starting a new database. Creating an effective database will require the close collaboration of diverse toxicogenomic communities (for example, toxicologists and bioinformatic scientists) who can work together to create “use cases” that help specify how the data will be collected and used, which will dictate the structure of the database that will ultimately be created. These are the first and second of the three steps required to create a database.

  1. Create the database, including not only the measurements that will be made in a particular toxicogenomic assay but also how to describe the treatments and other information that will assist in data interpretation.

  2. Create software tools to facilitate data entry, retrieval and presentation, and analysis of relationships.

  3. Develop a strategy to populate the database with data, including minimum quality standards for data.
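The three steps above can be sketched in miniature. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the table names, column names, the three-replicate minimum, and the gene names and values are all hypothetical rather than taken from CEBS or any existing system.

```python
import sqlite3

# Step 1: a structure that records the treatment context, not just the measurements.
SCHEMA = """
CREATE TABLE treatment (
    id INTEGER PRIMARY KEY,
    compound TEXT NOT NULL,
    dose_mg_per_kg REAL NOT NULL,
    duration_h REAL NOT NULL,
    species TEXT NOT NULL,
    tissue TEXT NOT NULL
);
CREATE TABLE measurement (
    treatment_id INTEGER NOT NULL REFERENCES treatment(id),
    gene TEXT NOT NULL,
    log2_fold_change REAL NOT NULL,
    replicates INTEGER NOT NULL
);
"""

MIN_REPLICATES = 3  # Step 3: hypothetical minimum quality standard.


def create_database():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


# Step 2: small tools for data entry and retrieval.
def add_treatment(conn, compound, dose, duration_h, species, tissue):
    cur = conn.execute(
        "INSERT INTO treatment (compound, dose_mg_per_kg, duration_h, species, tissue) "
        "VALUES (?, ?, ?, ?, ?)",
        (compound, dose, duration_h, species, tissue),
    )
    return cur.lastrowid


def add_measurement(conn, treatment_id, gene, log2_fc, replicates):
    """Insert a measurement; return False if it fails the quality standard."""
    if replicates < MIN_REPLICATES:
        return False
    conn.execute(
        "INSERT INTO measurement VALUES (?, ?, ?, ?)",
        (treatment_id, gene, log2_fc, replicates),
    )
    return True


def genes_changed(conn, compound, threshold=1.0):
    """Genes whose absolute log2 fold change meets the threshold for a compound."""
    rows = conn.execute(
        "SELECT m.gene, t.dose_mg_per_kg, m.log2_fold_change "
        "FROM measurement m JOIN treatment t ON t.id = m.treatment_id "
        "WHERE t.compound = ? AND ABS(m.log2_fold_change) >= ? "
        "ORDER BY m.gene, t.dose_mg_per_kg",
        (compound, threshold),
    )
    return rows.fetchall()


conn = create_database()
tid = add_treatment(conn, "acetaminophen", 150.0, 24.0, "rat", "liver")
add_measurement(conn, tid, "GeneX", 1.8, 4)
add_measurement(conn, tid, "GeneY", 0.3, 4)
rejected = not add_measurement(conn, tid, "GeneZ", 2.5, 1)  # too few replicates
print(genes_changed(conn, "acetaminophen"))
```

Keeping the treatment description in its own table is one possible design choice: it lets the same query interface answer questions by compound, dose, time, species, or tissue without duplicating that context on every measurement row.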

Where will the data for such a database come from? Ideally, this should be organized as an international project involving partnership among government, academia, and industrial organizations to generate the appropriate data; an example of such coordinated efforts is the SNP Consortium. However, developing such a database is important enough that it needs to be actively pursued and not delayed for an extensive time.

One potential source of toxicogenomic data is repositories maintained by companies, both commercial toxicogenomic database companies and drug or chemical companies that have developed their own datasets. Although industry groups have been leaders in publishing “demonstration” toxicogenomic studies, the data (e.g., Daston 2004; Moggs et al. 2004; Ulrich et al. 2004; Naciff et al. 2005a,b) published to date are thought to represent a small fraction of the data companies maintain in internal databases, which often focus on proprietary compounds. It is unlikely that many of these data will be available in the future without appropriate incentives and resolution of complex issues involving the economic and legal ramifications of releasing safety data on compounds studied in the private sector. Furthermore, extensive data collections even more comprehensive and systematically collected than those maintained by companies are necessary if the field is to advance.

The NTP conducts analyses of exposure, particularly chronic exposures, that would be difficult to replicate elsewhere and could serve as a ready source of biologic material for analysis. One possibility is to build the database on the existing infrastructure of the NTP at the NIEHS; at the least, the NTP should play a role in populating any database.

Although creating and populating a relevant database is a long-term project, work could begin immediately on constructing and populating an initial database to serve as the basis of future development. A first-generation dataset could be organized following the outline in Box 10-1.

BOX 10-1

Possible Steps for Developing an Initial Database

Select two or more classes of compounds targeting different organ systems. These compounds should be well-characterized compounds with well-understood modes of action. Define appropriate sample populations ...

This dataset could be analyzed for various applications, including classifying new compounds, identifying new classes of compounds, and searching for mechanistic insight. The preliminary database would also drive second-generation experiments, including a further study in populations larger than those used in the initial studies to assess the generalizability of any results obtained and second-generation experiments examining different doses, different exposure durations, different species and strains, and relevant human exposure, if possible.

A practical challenge of a toxicogenomic data project would be that much of the database development and data mining work is not hypothesis-driven science and generally is not compatible with promotion and career advancement for those in the academic sector. This is despite the fact that creating useful databases, software tools, and proven analysis approaches is essential to the success of any project. If the ability to analyze the data is limited to a small community of computational biologists able to manipulate abstract data formats or run command line programs, then the overall impact of such a project will be minimized. Thus, ways must be found to stimulate the participation of the academic sector in this important work.

Finally, although this may seem like an ambitious project, it should be emphasized that it would only be a start. Moving toxicogenomics from the research laboratory to more widespread use requires establishing useful data and standards that can be used both for validating the approaches and as a reference for future studies.


Short Term


Develop a public database to collect and facilitate the analysis of data from different toxicogenomic technologies and associated toxicity data.


Identify existing cohorts and ongoing population-based studies that are already collecting samples and data that could be used for toxicogenomic studies (for example, the CDC NHANES effort), and work with the relevant organizations to gain access to samples and data and address ethical and legal issues.¹


Develop standard operating procedures and formats for collecting samples and epidemiologic data that will be used for toxicogenomic studies. Specifically, there should be a standard approach to collecting and storing demographic, lifestyle, occupation, dietary, and clinical (including disease classification) data that can be included and queried with a public database.
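What a standard, queryable record format might look like can be sketched as follows. The field names, the controlled vocabulary for sex, the age range, and the placeholder disease code are all hypothetical; an actual standard would be defined by the participating organizations.

```python
from dataclasses import dataclass

ALLOWED_SEX = {"female", "male", "unknown"}  # hypothetical controlled vocabulary


@dataclass(frozen=True)
class SubjectRecord:
    """One subject's demographic, lifestyle, occupational, dietary, and clinical data."""
    subject_id: str
    age_years: int
    sex: str
    occupation: str
    smoker: bool
    diet_notes: str
    disease_code: str


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    if not record.subject_id:
        problems.append("subject_id is empty")
    if not 0 <= record.age_years <= 120:
        problems.append("age_years out of range")
    if record.sex not in ALLOWED_SEX:
        problems.append("sex not in controlled vocabulary")
    if not record.disease_code:
        problems.append("disease_code is empty")
    return problems


def query(records, **criteria):
    """Return records whose fields equal every given criterion."""
    return [r for r in records if all(getattr(r, k) == v for k, v in criteria.items())]


records = [
    SubjectRecord("S001", 45, "female", "painter", True, "no restrictions", "D-EXAMPLE-1"),
    SubjectRecord("S002", 52, "male", "teacher", False, "vegetarian", "D-EXAMPLE-1"),
    SubjectRecord("S003", 130, "F", "welder", True, "no restrictions", ""),
]

conforming = [r for r in records if not validate(r)]
print([r.subject_id for r in query(conforming, smoker=True)])
print(validate(records[2]))
```

Because every study would supply the same fields with the same vocabularies, records from different cohorts could be pooled and queried together, which is the practical point of standardizing collection.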



Develop approaches to populate a database with high-quality toxicogenomic data.

  1. Incorporate existing datasets from animal and human toxicogenomic studies, from both the private and public sectors, into the public data repository, provided that they meet the established criteria.

  2. Communicate with the private sector, government, and academia to formulate mutually acceptable and beneficial approaches and policies for appropriate data sharing and data distribution that encourage including data from the private and public sectors.


Develop additional analytical tools for integrating epidemiologic and clinical data with toxicogenomic data.


NIEHS, in conjunction with other institutions, should cooperate to create a centralized national biorepository for human clinical and epidemiologic samples, building on existing efforts.

Long Term


Work with public and private agencies to develop large, well-designed epidemiologic toxicogenomic studies of potentially informative populations—studies that include collecting samples over time and carefully assessing environmental exposures.


¹ Ethical and legal issues regarding human subject research are described in Chapter 11.