PROCEEDINGS OF THE SAWTOOTH SOFTWARE CONFERENCE October 2013
Copyright 2014. All rights reserved. No part of this volume may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from Sawtooth Software, Inc.
FOREWORD

These proceedings are a written report of the seventeenth Sawtooth Software Conference, held in Dana Point, California, October 16–18, 2013. Two hundred ten attendees participated. This conference included a separate Healthcare Applications in Conjoint Analysis track; however, these proceedings contain only the papers delivered at the Sawtooth Software Conference.

The focus of the Sawtooth Software Conference continues to be quantitative methods in marketing research. The authors were charged with delivering presentations of value to both the most sophisticated and least sophisticated attendees. Topics included choice/conjoint analysis, surveying on mobile platforms, Menu-Based Choice, MaxDiff, hierarchical Bayesian estimation, latent class procedures, optimization routines, cluster ensemble analysis, and random forests.

The papers and discussant comments are in the words of the authors and very little copy editing was performed. At the end of each of the papers, we’re pleased to display photographs of the authors and co-authors who attended the conference. We appreciate their cooperation in sitting for these portraits! It lends a personal touch and makes it easier for readers to recognize them at the next conference.

We are grateful to these authors for continuing to make this conference a valuable event and advancing our collective knowledge in this exciting field.

Sawtooth Software
June, 2014
CONTENTS

9 THINGS CLIENTS GET WRONG ABOUT CONJOINT ANALYSIS
Chris Chapman, Google

QUANTITATIVE MARKETING RESEARCH SOLUTIONS IN A TRADITIONAL MANUFACTURING FIRM: UPDATE AND CASE STUDY
Robert J. Goodwin, Lifetime Products, Inc.

CAN CONJOINT BE FUN?: IMPROVING RESPONDENT ENGAGEMENT IN CBC EXPERIMENTS
Jane Tang & Andrew Grenville, Vision Critical

MAKING CONJOINT MOBILE: ADAPTING CONJOINT TO THE MOBILE PHENOMENON
Chris Diener, Rajat Narang, Mohit Shant, Hem Chander & Mukul Goyal, AbsolutData

CHOICE EXPERIMENTS IN MOBILE WEB ENVIRONMENTS
Joseph White, Maritz Research

USING COMPLEX CHOICE MODELS TO DRIVE BUSINESS DECISIONS
Karen Fuller, HomeAway, Inc. & Karen Buros, Radius Global Market Research

AUGMENTING DISCRETE CHOICE DATA—A Q-SORT CASE STUDY
Brent Fuller, Matt Madden & Michael Smith, The Modellers

MAXDIFF AUGMENTATION: EFFORT VS. IMPACT
Urszula Jones, TNS & Jing Yeh, Millward Brown

WHEN U = βX IS NOT ENOUGH: MODELING DIMINISHING RETURNS AMONG CORRELATED CONJOINT ATTRIBUTES
Kevin Lattery, Maritz Research

RESPONDENT HETEROGENEITY, VERSION EFFECTS OR SCALE? A VARIANCE DECOMPOSITION OF HB UTILITIES
Keith Chrzan & Aaron Hill, Sawtooth Software

FUSING RESEARCH DATA WITH SOCIAL MEDIA MONITORING TO CREATE VALUE
Karlan Witt & Deb Ploskonka, Cambia Information Group

BRAND IMAGERY MEASUREMENT: ASSESSMENT OF CURRENT PRACTICE AND A NEW APPROACH
Paul Richard McCullough, MACRO Consulting, Inc.

ACBC REVISITED
Marco Hoogerbrugge, Jeroen Hardon & Christopher Fotenos, SKIM Group

RESEARCH SPACE AND REALISTIC PRICING IN SHELF LAYOUT CONJOINT (SLC)
Peter Kurz, TNS Infratest, Stefan Binner, bms marketing research + strategy & Leonhard Kehl, Premium Choice Research & Consulting

ATTRIBUTE NON-ATTENDANCE IN DISCRETE CHOICE EXPERIMENTS
Dan Yardley, Maritz Research

ANCHORED ADAPTIVE MAXDIFF: APPLICATION IN CONTINUOUS CONCEPT TEST
Rosanna Mau, Jane Tang, LeAnn Helmrich & Maggie Cournoyer, Vision Critical

HOW IMPORTANT ARE THE OBVIOUS COMPARISONS IN CBC? THE IMPACT OF REMOVING EASY CONJOINT TASKS
Paul Johnson & Weston Hadlock, SSI

SEGMENTING CHOICE AND NON-CHOICE DATA SIMULTANEOUSLY
Thomas C. Eagle, Eagle Analytics of California

EXTENDING CLUSTER ENSEMBLE ANALYSIS VIA SEMI-SUPERVISED LEARNING
Ewa Nowakowska, GfK Custom Research North America & Joseph Retzer, CMI Research, Inc.

THE SHAPLEY VALUE IN MARKETING RESEARCH: 15 YEARS AND COUNTING
Michael Conklin & Stan Lipovetsky, GfK

DEMONSTRATING THE NEED AND VALUE FOR A MULTI-OBJECTIVE PRODUCT SEARCH
Scott Ferguson & Garrett Foster, North Carolina State University

A SIMULATION BASED EVALUATION OF THE PROPERTIES OF ANCHORED MAXDIFF: STRENGTHS, LIMITATIONS AND RECOMMENDATIONS FOR PRACTICE
Jake Lee, Maritz Research & Jeffrey P. Dotson, Brigham Young University

BEST-WORST CBC CONJOINT APPLIED TO SCHOOL CHOICE: SEPARATING ASPIRATION FROM AVERSION
Angelyn Fairchild, RTI International, Namika Sagara & Joel Huber, Duke University

DOES THE ANALYSIS OF MAXDIFF DATA REQUIRE SEPARATE SCALING FACTORS?
Jack Horne & Bob Rayner, Market Strategies International

USING CONJOINT ANALYSIS TO DETERMINE THE MARKET VALUE OF PRODUCT FEATURES
Greg Allenby, Ohio State University, Jeff Brazell, The Modellers, John Howell, Penn State University & Peter Rossi, University of California Los Angeles

THE BALLAD OF BEST AND WORST
Tatiana Dyachenko, Rebecca Walker Naylor & Greg Allenby, Ohio State University
SUMMARY OF FINDINGS

The seventeenth Sawtooth Software Conference was held in Dana Point, California, October 16–18, 2013. The summaries below capture some of the main points of the presentations and provide a quick overview of the articles available within the 2013 Sawtooth Software Conference Proceedings.

9 Things Clients Get Wrong about Conjoint Analysis (Chris Chapman, Google): Conjoint analysis has been used with great success in industry, Chris explained, but this often leads to some clients having misguided expectations regarding the technique. As a prime example, many clients are hoping that conjoint analysis will predict the volume of demand. While conjoint can provide important input to a forecasting model, it usually cannot alone predict volume without other inputs such as awareness, promotion, channel effects, and competitive response. Chris cautioned about examining average part-worth utility scores only, without consideration for the distribution of preferences (heterogeneity), which often reveals profitable niche strategies. He also recommended fielding multiple studies with modest sample sizes that examine a business problem using different approaches rather than fielding one high-budget, large sample size survey. Finally, Chris stressed that leveraging insights from analytics (such as from conjoint) is better than relying solely on managerial instincts. It will generally increase the upside potential and reduce the downside risk for business decisions.

Quantitative Marketing Research Solutions in a Traditional Manufacturing Firm: Update and Case Study (Robert J. Goodwin, Lifetime Products, Inc.): Bob’s presentation highlighted the history of Lifetime’s use of conjoint methods to help it design and market its consumer-oriented product line. He also presented findings regarding a specific test involving Adaptive CBC (ACBC). Regarding practical lessons learned while executing numerous conjoint studies at Lifetime, Bob cited not overloading the number of attributes in the list just because the software can support them. Some attributes might be broken out and investigated using non-conjoint questions. Also, because many respondents do not care much about brand name in the retailing environment in which Lifetime engages, Bob has dropped brand from some of his conjoint studies. But, when he wants to measure brand equity, Bob uses a simulation method to estimate the value of a brand, in the context of competitive offerings and the “None” alternative. Finally, Bob conducted a split-sample test involving ACBC. He found that altering the questionnaire settings to focus the “near-neighbor design” either more tightly or less tightly around the respondent’s BYO-specified concept didn’t change results much. This, he argued, demonstrates the robustness of ACBC results to different questionnaire settings.

Can Conjoint Be Fun?: Improving Respondent Engagement in CBC Experiments (Jane Tang and Andrew Grenville, Vision Critical): Traditional CBC can be boring for respondents, as was mentioned in a recent Greenbook blog. Jane gave reasons why we should try to engage respondents in more interesting surveys, such as a) cleaner data often result, and b) happy respondents are happy panelists (and panelist cooperation is key). Ways to make surveys more fun include using adaptive tasks that seem to listen and respond to respondent preferences as well as feedback mechanisms that report something back to respondents based on their preferences. Jane and her co-author Andrew did a split-sample test to see if adding adaptive tasks and a feedback mechanism could improve CBC results. They employed tournament tasks, wherein concepts that win in earlier tasks are displayed again in later tasks. They also employed a simple level-counting mechanism to report back the respondent’s preferred product concept. Although their study design didn’t include good holdouts to examine predictive validity, there was at least modest evidence that the adaptive CBC design had lower error and performed better. Also, some qualitative evidence suggested that respondents preferred the adaptive survey. After accounting for scale differences (noise), they found very few differences in the utility parameters for respondents receiving “fun” versus standard CBC surveys. In sum, Jane suggested that if the utility results are essentially equivalent, why not let the respondent have more fun?

Making Conjoint Mobile: Adapting Conjoint to the Mobile Phenomenon (Chris Diener, Rajat Narang, Mohit Shant, Hem Chander, and Mukul Goyal, AbsolutData): Chris and his co-authors examined issues involving the use of mobile devices to complete complex conjoint studies. Each year more respondents are choosing to complete surveys using mobile devices, so this topic should interest conjoint analysis researchers. It has been argued that the small screen size for mobile devices may make it nearly impossible to conduct complex conjoint studies involving relatively large lists of attributes. The authors conducted a split-sample experiment involving the US and India using five different kinds of conjoint analysis variants (all sharing the same attribute list). The variants included standard CBC, partial-profile, and adaptive methods (ACBC). Chris found only small differences in the utilities or the predictive validity for PC-completed surveys versus mobile-completed surveys. Surprisingly, mobile respondents generally reported no readability issues (ability to read the questions and concepts on the screen) compared to PC respondents. The authors concluded that conjoint studies, even those involving nine attributes (as in their example), can be done effectively among those who elect to complete the surveys using their mobile devices (providing researchers keep the surveys short, use proper conjoint questionnaire settings, and emphasize good aesthetics).

Choice Experiments in Mobile Web Environments (Joseph White, Maritz Research): Joseph looked at the feasibility of conducting two separate eight-attribute conjoint analysis studies using PC, tablet, or mobile devices. He compared results based on the Swait-Louviere test, allowing him to examine results in terms of scale and parameter equivalence. He also examined internal and external fit criteria. His conjoint questionnaires included full-profile and partial-profile CBC. Across both conjoint studies, he concluded that respondents who choose to complete the studies via a mobile device show predictive validity at parity with, or better than, those who choose to complete the same study via PC or tablet. Furthermore, the response error for mobile-completed surveys is on par with PC. He summed it up by stating, “Consistency of results in both studies indicate that even more complicated discrete choice experiments can be readily completed in mobile computing environments.”

Using Complex Choice Models to Drive Business Decisions (Karen Fuller, HomeAway, Inc. and Karen Buros, Radius Global Market Research): Karen Fuller and Karen Buros jointly presented a case study involving a complex menu-based choice (MBC) experiment for Fuller’s company, HomeAway. HomeAway offers an online marketplace for vacation travelers to find rental properties. Vacation home owners and property managers list rental property on HomeAway’s website. The challenge for HomeAway was to design the pricing structure and listing options to better support the needs of owners and to create a better experience for travelers. Ideally, this would also increase revenues per listing. They developed an online questionnaire that looked exactly like HomeAway’s website, including three screens to fully select all options involving creating a listing. This process nearly replicated HomeAway’s existing enrollment process (so much so that some respondents got confused regarding whether they had completed a survey or done the real thing). Nearly 2,500 US-based respondents completed multiple listings (MBC tasks), where the options and pricing varied from task to task. Later, a similar study was conducted in Europe. CBC software was used to generate the experimental design, the questionnaire was custom-built, and the data were analyzed using MBC software. The results led to specific recommendations for management, including the use of a tiered pricing structure, additional options, and an increase in the base annual subscription price. After implementing many of the suggestions of the model, HomeAway has experienced greater revenues per listing and the highest renewal rates involving customers choosing the tiered pricing.

Augmenting Discrete Choice Data—A Q-sort Case Study (Brent Fuller, Matt Madden, and Michael Smith, The Modellers): Sometimes clients want to field CBC studies that have an attribute with an unusually large number of levels, such as messaging and promotion attributes. The problem with such attributes is obtaining enough precision to avoid illogical reversals while avoiding excessive respondent burden. Mike and Brent presented an approach to augmenting CBC data with Q-sort rankings for these attributes involving many levels. A Q-sort exercise asks respondents to sort items into a small number of buckets, but where the number of items assigned per bucket is fixed by the researcher. The information from the Q-sort can be appended to the data as a series of inequalities (e.g., level 12 is preferred to level 18) constructed as new choice tasks. Mike and Brent found that the CBC data without Q-sort augmentation had some illogical preference orderings for the 19-level attribute. With augmentation, the reversals disappeared. One problem with augmenting the data is that it can artificially inflate the importance of the augmented attribute relative to the non-augmented attributes. Solutions to this problem include scaling back the importances to the original importances (at the individual level) given from HB estimation of the CBC data prior to augmentation.

MaxDiff Augmentation: Effort vs. Impact (Urszula Jones, TNS and Jing Yeh, Millward Brown): Ula (Urszula) and Jing described the common challenge that clients want to use MaxDiff to test a large number of items. With standard rules of thumb (to obtain stable individual-level estimates), the number of choice tasks becomes very large per respondent. Previous solutions presented at the Sawtooth Software Conference include augmenting the data using Q-Sort (Augmented MaxDiff), Express MaxDiff (each respondent sees only a subset of the items), or Sparse MaxDiff (each respondent sees each item fewer than three times). Ula and Jing further investigated whether Augmented MaxDiff was worth the additional survey programming effort (as it is the most complicated) or whether the other approaches were sufficient. Although the authors didn’t implement holdout tasks that would have given a better read on predictive validity of the different approaches, they did draw some conclusions. They concluded that a) at the individual level, Sparse MaxDiff is not very precise, but in aggregate the results are accurate, b) if you have limited time, augmenting data using rankings of the top items is probably better than augmenting the bottom items, and c) augmenting on both top and bottom items is best if you need accurate individual-level results for TURF or clustering.
When U = βx Is Not Enough: Modeling Diminishing Returns among Correlated Conjoint Attributes (Kevin Lattery, Maritz Research): When conjoint analysis studies involve a large number of binary (on/off) features, standard conjoint models tend to over-predict interest in product concepts loaded up with nearly all the features and to under-predict interest in product concepts including very few of the features. This occurs because typical conjoint analysis (using main effects estimation) assumes all the attributes are independent. But, Kevin explained, there often are diminishing returns when bundling multiple binary attributes (though the problem isn’t limited to just binary attributes). Kevin reviewed some design principles involving binary attributes (avoid the situation in which the same number of “on” levels occurs in each product concept). Next, Kevin discussed different ways to account for the diminishing returns among a series of binary items. Interaction effects can partially solve the problem (and are a stable and practical solution if the number of binary items is about 3 or fewer). Another approach is to introduce a continuous variable for the number of “on” levels within a concept. But, Kevin proposed a more complete solution that borrows from nested logit. He demonstrated greater predictive validity to holdouts for the nested logit approach than the approach of including a term representing the number of “on” levels in the concept. One current drawback, he noted, is that his solution may be difficult to implement with HB estimation.

Respondent Heterogeneity, Version Effects or Scale? A Variance Decomposition of HB Utilities (Keith Chrzan and Aaron Hill, Sawtooth Software): When researchers use CBC or MaxDiff, they hope the utility scores are independent of which version (block) each respondent received. However, one of the authors, Keith Chrzan, saw a study a few years ago in which assignment to cluster membership was not independent of questionnaire version (>95% confidence). This led to further investigation, in which, for more than half of the examined datasets, the authors found a statistically significant version effect upon final estimated utilities (under methods such as HB). Aaron (who presented the research at the conference) described a regression model that they built which explained final utilities as a function of a) version effect, b) scale effect (response error), and c) other. The variance captured in the “other” category is assumed to be the heterogeneous preferences of respondents. Across multiple datasets, the average version effect accounted for less than 2% of the variance in final utilities. Scale accounted for about 11%, with the remainder attributable to substantive differences in preferences across respondents or other unmeasured sources. Further investigation using synthetic respondents led the authors to conclude that the version effect was psychological rather than algorithmic. They concluded that although the version effect is statistically significant, it isn’t strong enough to really worry about for practical applications.

Fusing Research Data with Social Media Monitoring to Create Value (Karlan Witt and Deb Ploskonka, Cambia Information Group): Karlan and Deb described the current business climate, in which social media provides an enormous volume of real-time feedback about brand health and customer engagement. They recommended that companies should fuse social media measurement with other research as they design appropriate strategies for marketing mix. The big question is how best to leverage the social media stream, especially to move beyond the data gathering and summary stages to actually using the data to create value. Karlan and Deb’s approach starts with research to identify the key issues of importance to different stakeholders within the organization. Next, they determine specific thresholds (from social media metrics) that would signal to each stakeholder (for whom the topic is important) that a significant event was occurring that required attention. Following the event, the organization can model the effect of the event on Key Performance Indicators (KPIs) by different customer groups. The authors presented a case study to illustrate the principles.

Brand Imagery Measurement: Assessment of Current Practice and a New Approach (Paul Richard McCullough, MACRO Consulting, Inc.): Dick (Richard) reviewed the weaknesses in current brand imagery measurement practices, specifically the weaknesses of the rating scale (lack of discrimination, scale use bias, halo). A new approach, brand-anchored MaxDiff, removes halo, avoids scale use bias, and is more discriminating. The process involves showing the respondent a brand directly above a MaxDiff question involving, say, 4 or 5 imagery items. Respondents indicate which of the items most describes the brand and which least describes the brand. Anchored scaling MaxDiff questions (to estimate a threshold anchor point) allow comparisons across brands and studies. But, anchored scaling re-introduces some scale use bias. Dick tried different approaches to reduce the scale use bias associated with anchored MaxDiff using an empirical study. Part of the empirical study involved different measures of brand preference. He found that MaxDiff provided better discrimination, better predictive validity (of brand preference), and greater reduction of brand halo and scale use bias than traditional ratings-based measures of brand imagery. Ratings provided no better predictive validity of brand preference in his model than random data. However, the new approach also took more respondent time and had higher abandonment rates.

ACBC Revisited (Marco Hoogerbrugge, Jeroen Hardon, and Christopher Fotenos, SKIM Group): Christopher and his co-authors reviewed the stages in an Adaptive CBC (ACBC) interview and provided their insights into why ACBC has been a successful conjoint analysis approach. They emphasized that ACBC has advantages with more complex attribute lists and markets. The main thrust of their paper was to test different ACBC interviewing options, including a dynamic form of CBC programmed by the SKIM Group. They conducted a split-sample study involving choices for televisions. They compared default CBC and ACBC questionnaires to modifications of ACBC and CBC. Specifically, they investigated whether dropping the “screener” section in ACBC would hurt results; using a smaller random shock within summed pricing; whether to include price or not in ACBC’s unacceptable questions; and the degree to which ACBC samples concepts directly around the BYO-selected concept. For the SKIM-developed dynamic CBC questionnaire, the first few choice tasks were exactly like a standard CBC task. The last few tasks displayed winning concepts chosen in the first few tasks. In terms of prediction of the holdout tasks, all ACBC variants did better than the CBC variants. None of the ACBC variations seemed to make much difference, suggesting that the ACBC procedure is quite robust even with simplifications such as removing the screening section.

Research Space and Realistic Pricing in Shelf Layout Conjoint (SLC) (Peter Kurz, TNS Infratest, Stefan Binner, bms marketing research + strategy, and Leonhard Kehl, Premium Choice Research & Consulting): In the early 1990s, the first CBC questionnaires only displayed a few product concepts on the screen, without the use of graphics. Later versions supported shelf-looking displays, complete with graphics and other interactive elements. Rather than using lots of attributes described in text, the graphics themselves portrayed different sizes, claims, and package design elements. However, even the most sophisticated computerized CBC surveys (including virtual reality) cannot reflect the real situation of a customer at the supermarket. The authors outlined many challenges involving shelf layout conjoint (SLC). Some of the strengths of SLC, they suggested, are in optimization of assortment (e.g., line extension problems, substitution) and price positioning/promotions. Certain research objectives are problematic for SLC, including volumetric predictions, positioning of products on the shelf, and new product development. The authors concluded by offering specific recommendations for improving results when applying SLC, including: use realistic pricing patterns and ranges within the tasks, use realistic tag displays, and reduce the number of parameters to estimate within HB models.

Attribute Non-Attendance in Discrete Choice Experiments (Dan Yardley, Maritz Research): When respondents ignore certain attributes when answering CBC tasks, this is called “attribute non-attendance.” Dan described how some researchers in the past have asked respondents directly which attributes they ignored (stated non-attendance) and have used that information to try to improve the models. To test different approaches to dealing with non-attendance, Dan conducted two CBC studies. The first involved approximately 1,300 respondents, using both full- and partial-profile CBC. The second involved about 2,000 respondents. He examined both aggregate and disaggregate (HB) models in terms of model fit and out-of-sample holdout prediction. Dan also investigated ways to try to ascertain from HB utilities that respondents were ignoring certain attributes (rather than rely on stated non-attendance). For attributes deemed to have been ignored by a respondent, the codes in the independent variable matrix were held constant at zero. He found that modeling stated non-attendance had little impact on the results, but usually slightly reduced the fit to holdouts. He experimented with different cutoff rates under HB modeling to deduce whether individual respondents had ignored attributes. For his two datasets, he was able to slightly improve prediction of holdouts using this approach.

Anchored Adaptive MaxDiff: Application in Continuous Concept Test (Rosanna Mau, Jane Tang, LeAnn Helmrich, and Maggie Cournoyer, Vision Critical): Rosanna and her co-authors investigated the feasibility of using an adaptive form of anchored MaxDiff within multi-wave concept tests as a replacement for traditional 5-point rating scales. Concept tests have traditionally been done with 5-point scales, with the accompanying lack of discrimination and scale use bias. Anchored MaxDiff has proven to have superior discrimination, but the stability of the anchor (the buy/no buy threshold) has been called into question in previous research presented at the Sawtooth Software conference. Specifically, the context of how many concepts are being evaluated within the direct anchoring approach can affect the absolute position of the anchor. This would be extremely problematic for using the anchored MaxDiff approach to compare the absolute desirability of concepts across multiple waves of research that involve differing numbers and quality of product concepts. To reduce the context effect for direct anchor questions, Rosanna and her co-authors used an Adaptive MaxDiff procedure to obtain a rough rank-ordering of items for each respondent. Then, in real time, they asked respondents binary purchase intent questions for six items ranging along the continuum of preference from the respondent’s best to the respondent’s worst items. They compared results across multiple waves of data collection involving different numbers of product concepts. They found good consistency across waves and that the MaxDiff approach led to greater discrimination among the top product concepts than the ratings questions.

How Important Are the Obvious Comparisons in CBC? The Impact of Removing Easy Conjoint Tasks (Paul Johnson and Weston Hadlock, SSI): One well-known complaint about CBC questionnaires is that they can often display obvious comparisons (dominated concepts) within a choice task. Obvious comparisons are those in which the respondent recognizes that one concept is logically inferior in every way to another concept. After encountering a conjoint analysis study where a full 60% of experimentally designed choice tasks included a logically dominated concept, Paul and Weston decided to experiment on the effect of removing dominated concepts. They fielded that same study among 500 respondents, where half the sample received the typically designed CBC tasks and the other half received CBC tasks wherein the authors removed any tasks including dominated concepts, replacing them with tasks without dominated concepts (by modifying the design file in Excel). They found little difference between the two groups in terms of predictability of holdout tasks or length of time to complete the CBC questionnaire. They asked some follow-up qualitative questions regarding the survey-taking experience and found no significant differences between the two groups of respondents. Paul and Weston concluded that if it requires extra effort on the part of the researcher to modify the experimental design to avoid dominated concepts, then it probably isn’t worth the extra effort in terms of quality of the results or respondent experience.

Segmenting Choice and Non-Choice Data Simultaneously (Thomas C. Eagle, Eagle Analytics of California): This presentation focused on how to leverage both choice data (such as CBC or MaxDiff) and non-choice data (other covariates, whether nominal or continuous) to develop effective segmentation solutions. Tom compared and contrasted two common approaches: a) first estimating individual-level utility scores using HB and then using those scores plus non-choice data as basis variables within cluster analysis, or b) simultaneous utility estimation leveraging choice and non-choice data using latent class procedures, specifically Latent Gold software. Tom expressed that he worries about the two-step procedure (HB followed by clustering), for at least two reasons: first, errors in the first stage are taken as given in the second stage; and second, HB involves prior assumptions of population normality, leading to at least some degree of Bayesian smoothing to the mean—which is at odds with the notion of forming distinct segments. Using simulated data sets with known segmentation structure, Tom compared the two approaches. Using the two-stage approach leads to the additional complication of needing to somehow normalize the scores for each respondent to try to remove the scale confound. Also, there are issues involving setting HB priors that affect the results, and it isn’t always clear to the researcher which settings to invoke. Tom also found that whether using the two-stage approach or the simultaneous one-stage approach, the BIC criterion often failed to point to the correct number of segments. He commented that if a clear segmentation exists (wide separation between groups and low response error), almost any approach will find it. But, any segmentation algorithm will find patterns in data even if meaningful patterns do not exist.

Extending Cluster Ensemble Analysis via Semi-Supervised Learning (Ewa Nowakowska, GfK Custom Research North America and Joseph Retzer, CMI Research, Inc.): Ewa and Joseph’s work focused on obtaining not only high quality segmentation results, but actionable ones, where actionable is defined as having particular managerial relevance (such as discriminating between intenders/non-intenders). They also reviewed the terminology of unsupervised vs. supervised learning. Unsupervised learning involves discovering latent segments in data using a series of basis variables (e.g., cluster algorithms). Supervised learning involves classifying respondents into specific target outcomes (e.g., purchasers and non-purchasers), using methods such as logistic regression, CART, Neural Nets, and Random Forests. Semi-supervised learning combines aspects of supervised and unsupervised learning to find segments that are of high quality (in terms of discrimination among basis variables) and actionable (in terms of classifying respondents into categories of managerial interest). Ewa and Joe’s main tools to do this were Random Forests (provided in R) and Sawtooth Software’s CCEA (Convergent Cluster & Ensemble Analysis). The authors used the multiple solutions provided by Random Forests to compute a respondent-by-respondent similarities matrix (based on how often respondents ended up within the same terminal node). They employed hierarchical clustering analysis to develop cluster solutions based on the similarities data. These cluster solutions were combined with standard unsupervised cluster solutions (developed on the basis variables) to create ensembles of segmentation solutions, which CCEA software in turn used to create a high quality and actionable consensus cluster solution. Ewa and Joe wrapped it up by showing a web-based simulator that assigns respondents into segments based on responses to basis variables.
The Shapley Value in Marketing Research: 15 Years and Counting (Michael Conklin and Stan Lipovetsky, GfK): Michael (supported by co-author Stan) explained that Shapley Value not only is an extension of standard TURF analysis, but it can be applied in numerous other marketing research problems. The Shapley Value derives from game theory. In simplest terms, one can think about the value that a hockey player provides to a team (in terms of goals scored by the team per minute) when this player is on the ice. For marketing research TURF problems, the Shapley Value is the unique value contributed by a flavor or brand within a lineup when considering all possible lineup combinations. As one possible extension, Michael described a Shapley Value model to predict share of choice for SKUs on a shelf. Respondents indicate which SKUs are in the consideration set and the Shapley Value is computed (across thousands of possible competitive sets) for each SKU. This value considers the likelihood that the SKU is in the consideration set and, importantly, the likelihood that the SKU is chosen within each set (equal to 1/n for each respondent, where n is the number of items in the consideration set). The benefits of this simple model of consumer behavior are that it can accommodate very large product categories (many SKUs) and it is very inexpensive to implement. The drawback is that each SKU must be a complete, fixed entity on its own (not involving varying attributes, such as prices). As yet another field for application of Shapley Value, Michael spoke of its use in drivers analysis (rather than OLS or other related techniques). However, Michael emphasized that he thought the greatest opportunities for Shapley Value in marketing research lie in the product line optimization problems.
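The consideration-set logic described above reduces, for each respondent, to crediting each considered SKU with 1/n of that respondent's choice and then aggregating across respondents. Below is a minimal sketch of that piece of the calculation in R, using hypothetical consideration sets; the full Shapley computation across thousands of competitive sets is more involved, and this is not GfK's implementation.

```r
# Each list element is one respondent's consideration set of SKUs
consideration <- list(c("A", "B", "C"),
                      c("B"),
                      c("A", "C", "D", "E"),
                      c("C", "D"))

all_skus <- sort(unique(unlist(consideration)))

# Within a set of size n, each considered SKU receives 1/n of that respondent's choice
expected_share <- sapply(all_skus, function(sku) {
  mean(sapply(consideration, function(set) if (sku %in% set) 1 / length(set) else 0))
})

round(expected_share, 3)   # expected share of choice per SKU; shares sum to 1 across SKUs
```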
Demonstrating the Need and Value for a Multi-objective Product Search (Scott Ferguson and Garrett Foster, North Carolina State University): Scott and Garrett reviewed the typical steps involved in optimization problems for conjoint analysis, including estimating respondent preferences via a conjoint survey and gathering product feature costs. Usually, such optimization tasks involve setting a single goal, such as optimization of share of preference, utility, revenue, or profit. However, there is a set of solutions on an efficient frontier that represent optimal mixes of multiple goals, such as profit and share of preference. For example, two solutions may be very similar in terms of profit, but the slightly lower profit solution may provide a large gain in terms of share of preference. A multi-objective search algorithm reports dozens or more results (product line configurations) to managers along the efficient frontier (among multiple objectives) for their consideration. Again, those near-optimal solutions represent different mixes of multiple objectives (such as profit and share of preference). Scott’s application involved genetic algorithms. More than two objectives might be considered, Scott elaborated, for instance profit, share of preference, and likelihood to be purchased by a specific respondent demographic. One of the keys to being able to explore the variety of near-optimal solutions, Scott emphasized, was the use of software visualization tools.

A Simulation Based Evaluation of the Properties of Anchored MaxDiff: Strengths, Limitations and Recommendations for Practice (Jake Lee, Maritz Research and Jeffrey P. Dotson, Brigham Young University): Jake and Jeff conducted a series of simulation studies to test the properties of three methods for anchored MaxDiff: direct binary, indirect (dual response), and the status quo method. The direct approach involves asking respondents if each item is preferred to (more important than) the anchor item (where the anchor item is typically a buy/no buy threshold or important/not important threshold). The indirect dual-response method involves asking (after each MaxDiff question) if all the items shown are important, all are not important, or some are important. The status quo approach involves adding a new item to the item list that indicates the status quo state (e.g., no change). Jake and Jeff’s simulation studies examined data situations in which respondents were more or less consistent, along with whether the anchor position was at the extreme of the scale or near the middle. They concluded that under most realistic conditions, all three methods work fine. However, they recommended that the status quo method be avoided if all items are above or below the threshold. They also reported that if respondent error was especially high, the direct method should be avoided (though this usually cannot be known ahead of time).

Best-Worst CBC Conjoint Applied to School Choice: Separating Aspiration from Aversion (Angelyn Fairchild, RTI International, Namika Sagara and Joel Huber, Duke University): Most CBC research today asks respondents to select the best concept within each set. Best-Worst CBC involves asking respondents to select both the best and worst concepts within sets of at least three concepts. School choice involves both positive and negative reactions to features, so it naturally would seem a good topic for employing best-worst CBC. Joel and his co-authors fielded a study among 150 parents with children entering grades 6–11. They used a gradual, systematic introduction of the attributes to respondents. Before beginning the CBC task, they asked respondents to select the level within each attribute that best applied to their current school; then, they used warm-up tradeoff questions that showed just a few attributes at a time (partial-profile). When they compared results from best-only versus worst-only choices, they consistently found smaller utility differences between the best two levels for best-only choices. They also employed a rapid OLS-based utility estimation for on-the-fly estimation of utilities (to provide real-time feedback to respondents). Although the simple method is not expected to provide as accurate results as an HB model run on the entire dataset, the individual-level results from the OLS estimation correlated quite strongly with HB results. They concluded that if the decision to be studied involves both attraction and avoidance, then a Best-Worst CBC approach is appropriate.

Does the Analysis of MaxDiff Data Require Separate Scaling Factors? (Jack Horne and Bob Rayner, Market Strategies International): The traditional method of estimating scores for MaxDiff experiments involves combining both best and worst choices and estimating a single multinomial logit model. Fundamental to this analysis is the assumption that the underlying utility or preference dimension is the same whether respondents are indicating which items are best or which are worst. It also assumes that response errors for selecting bests are equivalent to errors when selecting worsts. However, empirical evidence suggests that neither the utility scale nor the error variance is the same for best and worst choices in MaxDiff. Using simulated data, Jack and his co-author Bob investigated to what degree incompatibilities in scale between best and worst choices affect the final utility scores. They adjusted the scale of one set of choices relative to the other by multiplying the design matrix for either best or worst choices by a constant, prior to estimating final utility scores. The final utilities showed the same rank order before and after the correction, though the utilities did not lie perfectly on a 45-degree line when the two sets were XY scatter plotted. Next, the authors turned to real data. They first measured the scale of bests relative to worsts and next estimated a combined model with correction for scale differences. Correcting for scale or not resulted in essentially the same holdout hit rate for HB estimation. They concluded that although combining best and worst judgments without an error scale correction biases the utilities, the resulting rank order of the items remains unchanged and the bias is likely too small to change any business decisions. Thus, the extra work is probably not justified. As a side note, the authors suggested that comparing best-only and worst-only estimated utilities for each respondent is yet another way to identify and clean noisy respondents.
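The scale adjustment Jack and Bob describe amounts to multiplying the coded design matrix for one response type by a constant before the pooled estimation. Here is a minimal sketch of that coding step only; the matrices and the scale constant are hypothetical, the subsequent utility estimation (aggregate logit or HB) is omitted, and negating the worst-choice design is assumed here because that is the usual convention for combined best-worst coding.

```r
# Coded design matrices for best-choice and worst-choice tasks (hypothetical)
X_best  <- matrix(rnorm(40), nrow = 10)   # 10 best-choice alternatives, 4 coded columns
X_worst <- matrix(rnorm(40), nrow = 10)   # 10 worst-choice alternatives, 4 coded columns

lambda <- 0.8   # relative scale of worst vs. best choices, measured in a prior step

# Worst choices enter with reversed sign; lambda rescales them so that both
# response types share a single error scale in the combined model
X_pooled <- rbind(X_best, -lambda * X_worst)

dim(X_pooled)   # combined design matrix handed to the estimation routine
```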
Using Conjoint Analysis to Determine the Market Value of Product Features (Greg Allenby, Ohio State University, Jeff Brazell, The Modellers, John Howell, Penn State University, and Peter Rossi, University of California Los Angeles): The main thrust of this paper was to outline a more defensible approach for using conjoint analysis to attach economic value to specific features than is commonly used in many econometric applications and in intellectual property litigation. Peter (and his co-authors) described how conjoint analysis is often used in high profile lawsuits to assess damages. One of the most common approaches used by expert witnesses is to take the difference in utility (for each respondent) between having and not having the infringed-upon feature and dividing it by the price slope. This, Peter argued, is fraught with difficulties, including a) certain respondents projected to pay astronomically high amounts for features, and b) the approach ignores important competitive realities in the marketplace. Those wanting to present evidence for high damages are prone to use conjoint analysis in this way because it is relatively inexpensive to conduct and the difference in utility divided by price slope method to compute the economic value of features usually results in very large estimates for damages. Peter and his co-authors argued that to assess the economic value of a feature to a firm requires conducting market simulations (a share of preference analysis) involving a realistic set of competitors, including the outside good (the “None” category). Furthermore, it requires a game theoretic approach to compare the industry equilibrium prices with and without the alleged patent infringement. This involves allowing each competitor to respond to the others via price changes to maximize self-interest (typically, profit).

The Ballad of Best and Worst (Tatiana Dyachenko, Rebecca Walker Naylor, and Greg Allenby, Ohio State University): Greg presented research completed primarily by the lead author, Tatiana, at our conference (unfortunately, Tatiana was unable to attend). In that work, Tatiana outlines two different perspectives regarding MaxDiff. Most current economic models for MaxDiff assume that the utilities should be invariant to the elicitation procedure. However, psychological theories would expect different elicitation modes to produce different utilities. Tatiana conducted an empirical study (regarding concerns about hair health among 594 female respondents, aged 50+) to test whether best and worst responses lead to different parameters, to investigate elicitation order effects, and to build a comprehensive model that accounted for both utility differences between best and worst answers as well as order effects (best answered first or worst answered first). Using her model, she and her co-authors found significant terms associated with elicitation order effects and the difference between bests and worsts. In their data, respondents were more sure about the “worsts” than the “bests” (lower error variance around worsts). Furthermore, they found that the second decision made by respondents was less error prone. Tatiana and her co-authors recommend that researchers consider which mode of thinking is most appropriate for the business decision in question: maximizing best aspects or minimizing worst aspects. Since the utilities differ depending on the focus on bests or worsts, the two are not simply interchangeable. But, if researchers decide to ask both bests and worsts, they recommend analyzing the data using a model such as theirs that can account for differences between bests and worsts and also for elicitation order effects.
9 THINGS CLIENTS GET WRONG ABOUT CONJOINT ANALYSIS

CHRIS CHAPMAN¹
GOOGLE
ABSTRACT

This paper reflects on observations from over 100 conjoint analysis projects across the industry and multiple companies that I have observed, conducted, or informed. I suggest that clients often misunderstand the results of conjoint analysis (CA) and that the many successes of CA may have created unrealistic expectations about what it can deliver in a single study. I describe some common points of misunderstanding about preference share, feature assessment, average utilities, and pricing. Then I suggest how we might make better use of distribution information from Hierarchical Bayes (HB) estimation and how we might use multiple samples and studies to inform client needs.
INTRODUCTION

Decades of results from the marketing research community demonstrate that conjoint analysis (CA) is an effective tool to inform strategic and tactical marketing decisions. CA can be used to gauge consumer interest in products and to inform estimates of feature interest, brand equity, product demand, and price sensitivity. In many well-conducted studies, analysts have demonstrated success using CA to predict market share and to determine strategic product line needs.²

However, the successes of CA also raise clients’ expectations to levels that can be excessively optimistic. CA is widely taught in MBA courses, and a new marketer in industry is likely soon to encounter CA success stories and business questions where CA seems appropriate. This is great news . . . if CA is practiced appropriately. The apparent ease of designing, fielding, and analyzing a CA study presents many opportunities for analysts and clients to make mistakes.

In this paper, I describe some misunderstandings that I’ve observed in conducting and consulting on more than 100 CA projects. Some of these come from projects I’ve fielded while others have been observed in consultation with others; none is exemplary of any particular firm. Rather, the set of cases reflects my observations of the field. For each one I describe the problem and how I suggest rectifying it in clients’ understanding.

All data presented here are fictional. The data primarily concern an imaginary “designer USB drive” that comprises nominal attributes such as size (e.g., Nano, Full-length) and design style, ordinal attributes of capacity (e.g., 32 GB), and price. The data were derived by designing a choice-based conjoint analysis survey, having simulated respondents make choices, and estimating the utilities using Hierarchical Bayes multinomial logit estimation. For full details, refer to the source of the data: simulation and example code given in the R code “Rcbc” (Chapman, Alford, and Ellis, 2013; available from this author). The data here were not designed to illustrate problems; rather, they come from didactic R code. It just happens that those data—like data in most CA projects—are misinterpretable in all the common ways.

¹ [email protected]
² There are too many published successes for CA to list them comprehensively. For a start, see papers in this and other volumes of the Proceedings of the Sawtooth Software Conference. Published cases where this author contributed used CA to inform strategic analysis using game theory (Chapman & Love, 2012), to search for optimum product portfolios (Chapman & Alford, 2010), and to predict market share (Chapman, Alford, Johnson, Lahav, & Weidemann, 2009). This author also helped compile evidence of CA reliability and validity (Chapman, Alford, & Love, 2009).
MISTAKE #1: CONJOINT ANALYSIS DIRECTLY TELLS US HOW MANY PEOPLE WILL BUY THIS PRODUCT

A simple client misunderstanding is that CA directly estimates how many consumers will purchase a product. It is simple to use part-worth utilities to estimate preference share and interpret this as “market share.” Table 1 demonstrates this using the multinomial logit formula for aggregate share between two products. In practice, one might use individual-level utilities in a market simulator such as Sawtooth Software SMRT, but the result is conceptually the same.

Table 1: Example Preference Share Calculation

                      Product 1    Product 2    Total
  Sum of utilities         1.0          0.5       --
  Exponentiated           2.72         1.65     4.37
  Share of total           62%          38%
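To make the arithmetic in Table 1 concrete, here is a minimal sketch of the same aggregate logit share calculation in R; the utility totals are the hypothetical values from Table 1, not output from a real study.

```r
# Aggregate multinomial logit shares from the summed utilities in Table 1
total_utility <- c(Product1 = 1.0, Product2 = 0.5)

exp_utility <- exp(total_utility)                  # 2.72 and 1.65; total 4.37
preference_share <- exp_utility / sum(exp_utility)

round(preference_share, 2)
#> Product1 Product2
#>     0.62     0.38
```

Individual-level market simulators rest on the same calculation; they apply it respondent by respondent and then average the resulting shares.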
As most research practitioners know but many clients don’t (or forget), the problem is this: preference share is only partially indicative of real market results. Preference share is an important input to a marketing model, yet is only one input among many. Analysts and clients need to determine that the CA model is complete and appropriate (i.e., valid for the market) and that other influences are modeled, such as awareness, promotion, channel effects, competitive response, and perhaps most importantly, the impact of the outside good (in other words, that customers could choose none of the above and spend money elsewhere).

I suspect this misunderstanding arises from three sources. First, clients very much want CA to predict share! Second, CA is often given credit for predicting market share even when CA was in fact just one part of a more complex model that mapped CA preference to the market. Third, analysts’ standard practice is to talk about “market simulation” instead of “relative preference simulation.”

Instead of claiming to predict market share, I tell clients this: conjoint analysis assesses how many respondents prefer each product, relative to the tested alternatives. If we iterate studies, know that we’re assessing the right things, calibrate to the market, and include other effects, we will get progressively better estimates of the likely market response. CA is a fundamental part of that, yet only one part. Yes, we can predict market share (sometimes)! But an isolated, single-shot CA is not likely to do so very well.
MISTAKE #2: CA ASSESSES HOW GOOD OR BAD A FEATURE (OR PRODUCT) IS

The second misunderstanding is similar to the first: clients often believe that the highest part-worth indicates a good feature while negative part-worths indicate bad ones. Of course, all utilities really tell us is that, given the set of features and levels presented, this is the best fit to a set of observed choices. Utilities don’t indicate absolute worth; inclusion of different levels likely would change the utilities.

A related issue is that part-worths are relative within a single attribute. We can compare levels of an attribute to one another—for instance, to say that one memory size is preferable to another memory size—but should not directly compare the utilities of levels across attributes (for instance, to say that some memory size level is more or less preferred than some level of color or brand or processor). Ultimately, product preference involves full specification across multiple attributes and is tested in a market simulator (I say more about that below).

I tell clients this: CA assesses tradeoffs among features to be more or less preferred. It does not assess absolute worth or say anything about untested features.
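One way to see why part-worths carry no absolute meaning is that shifting every level of an attribute by a constant leaves all simulated choice shares unchanged; only differences within an attribute matter. A minimal sketch, using made-up two-attribute utilities rather than values from the USB-drive example:

```r
# Hypothetical part-worths for two products on two attributes
u_size  <- c(prodA =  0.40, prodB = -0.40)
u_color <- c(prodA = -0.85, prodB =  0.79)

logit_share <- function(total_u) exp(total_u) / sum(exp(total_u))

share_before <- logit_share(u_size + u_color)

# Add an arbitrary constant to every level of the color attribute
share_after  <- logit_share(u_size + (u_color + 5))

all.equal(share_before, share_after)   # TRUE: choice shares are unchanged
```

Because any constant can be absorbed this way, a level's raw sign or magnitude says nothing about absolute "goodness," only about its standing relative to the other levels of the same attribute.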
MISTAKE #3: CA DIRECTLY TELLS US WHERE TO SET PRICES

Clients and analysts commonly select CA as a way to assess pricing. What is the right price? How will such-and-such feature affect price? How price sensitive is our audience? All too often, I’ve seen clients inspect the average part-worths for price—often estimated without constraints and as piecewise utilities—and interpret them at face value.

Figure 1 shows three common patterns in price utilities; the dashed line shows scaling in exact inverse proportion to price, while the solid line plots the preference that we might observe from CA (assuming a linear function for patterns A and B, and a piecewise estimation in pattern C, although A and B could just as well be piecewise functions that are monotonically decreasing).

Figure 1: Common Patterns in Price Utilities (panel A: Inelastic demand; panel B: Elastic demand; panel C: Curved demand)

In pattern A, estimated preference share declines more slowly than price (or log price) increases. Clients love this: the implication is to price at the maximum (presumably not to infinity). Unfortunately, real markets rarely work that way; this pattern more likely reflects a method effect where CA underestimates price elasticity.
In pattern B, the implication is to price at the minimum. The problem here is that relative preference implies range dependency. This may simply reflect the price range tested, or reflect that respondents are using the survey for communication purposes (“price low!”) rather than to express product preferences.

Pattern C seems to say that some respondents like low prices while others prefer high prices. Clients love this, too! They often ask, “How do we reach the price-insensitive customers?” The problem is that there is no good theory as to why price should show such an effect. It is more likely that the CA task was poorly designed or confusing, or that respondents had different goals such as picking their favorite brand or heuristically simplifying the task in order to complete it quickly. Observation of a price reversal as we see here (i.e., preference going up as price goes up in some part of the curve) is more likely an indication of a problem than an observation about actual respondent preference! If pattern C truly does reflect a mixture of populations (elastic and inelastic respondents) then there are higher-order questions about the sample validity and the appropriateness of using pooled data to estimate a single model.

In short: pattern C is seductive! Don’t believe it unless you have assessed carefully and ruled out the confounds and the more theoretically sound constrained (declining) price utilities.

What I tell clients about price is: CA provides insight into stated price sensitivity, not exact price points or demand estimates without a lot more work and careful consideration of models, potentially including assessments that attempt more realistic incentives, such as incentive-aligned conjoint analysis (Ding, 2007). When assessing price, it’s advantageous to use multiple methods and/or studies to confirm that answers are consistent.
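A quick diagnostic before taking a pattern like C at face value is to check unconstrained piecewise price part-worths for reversals, that is, utility rising between adjacent price levels. A minimal sketch, assuming a matrix of HB point estimates with one column per price level ordered from lowest to highest price; the data below are simulated placeholders, not from any study.

```r
# price_utils: respondents x price levels (columns ordered from low to high price)
set.seed(1)
price_utils <- cbind(p_low  = rnorm(300,  1.0, 0.5),
                     p_mid1 = rnorm(300,  0.4, 0.5),
                     p_mid2 = rnorm(300, -0.3, 0.5),
                     p_high = rnorm(300, -1.1, 0.5))

# Flag respondents whose utilities ever increase as price increases
has_reversal <- apply(price_utils, 1, function(u) any(diff(u) > 0))

mean(has_reversal)        # share of respondents showing any reversal
colMeans(price_utils)     # aggregate pattern; should generally decline with price
```

A high reversal rate is more often a sign of task confusion, range problems, or noise than of genuinely price-insensitive segments, which is why constrained (monotonically declining) price estimation is usually the safer default.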
MISTAKE #4: THE AVERAGE UTILITY IS THE BEST MEASURE OF INTEREST I often see—and yes, sometimes even produce—client deliverables with tables or charts of "average utilities" by level. This unfortunately reinforces a common cognitive error: that the average is the best estimate. Mathematically, of course, the mean of a distribution minimizes some kinds of residuals—but that is rarely how a client interprets an average! Consider Table 2. Clients interpret this as saying that Black is a much better feature than Tie-dye. Sophisticated ones might ask whether it is statistically significant ("yes") or compute the preference share for Black (84%). None of that answers the real question: which is better for the decision at hand?
Table 2: Average Feature Utilities
Feature     Average Utility
Black        0.79
Tie-dye     -0.85
...          ...
Figure 3 is what I prefer to show clients and presents a very different picture. In examining Black vs. Tie-dye, we see that the individual-level estimates for Black have low variance while Tie-dye has high variance. Black is broadly acceptable, relative to other choices, while Tie-dye is polarizing. Is one better? That depends on the goal. If we can only make a single product, we might choose Black. If we want a diverse portfolio with differently appealing products, Tie-dye might fit. If we have a way to reach respondents directly, then Silver might be appealing because a few people strongly prefer it. Ultimately this decision should be made on the basis of market simulation (more on that below), yet understanding the preference structure more fully may help an analyst understand the market and generate hypotheses that otherwise might be overlooked.
Figure 3: Distribution of Individual-Level Utilities from HB Estimation
The client takeaway is this: CA (using HB) gives us a lot more information than just average utility. We should use that information to have a much better understanding of the distribution of preference.
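For readers who want to see what "more than the average" looks like in practice, here is a minimal sketch of summarizing individual-level HB utilities; the utilities below are simulated for illustration (they are not the study's data), and the variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical individual-level mean utilities, one value per respondent.
# Black: moderate mean, low variance; Tie-dye: negative mean, high variance.
black = rng.normal(loc=0.79, scale=0.4, size=500)
tiedye = rng.normal(loc=-0.85, scale=1.8, size=500)

print(f"Black  : mean={black.mean():+.2f}, sd={black.std():.2f}")
print(f"Tie-dye: mean={tiedye.mean():+.2f}, sd={tiedye.std():.2f}")

# Share of respondents for whom Black beats Tie-dye (the analogue of the
# preference share mentioned in the text, here on simulated utilities).
print(f"Share preferring Black over Tie-dye: {np.mean(black > tiedye):.0%}")

# Share for whom Tie-dye is strongly positive: the polarized segment that an
# average utility hides entirely.
print(f"Share with strongly positive Tie-dye utility: {np.mean(tiedye > 1.0):.0%}")
```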
MISTAKE #5: THERE IS A TRUE SCORE The issue about average utility (problem #4 above) also arises at the individual level. Consider Figure 4, which presents the mean betas for one respondent. This respondent has low utilities for features 6 and 10 (on the X axis) and high utilities for features 2, 5, and 9. It is appealing to think that we have a psychic X-ray of this respondent, that there is some “true score” underlying these preferences, as a social scientist might say. There are several problems with this view. One is that behavior is contextually dependent, so any respondent might very well behave differently at another time or in another context (such as a store instead of a survey). Yet even within the context of a CA study, there is another issue: we know much more about the respondent than the average utility! Figure 4: Average Utility by Feature, for One Respondent
Now compare Figure 5 with Figure 4. Figure 5 shows—for the same respondent—the within-respondent distribution of utility estimates across 100 draws of HB estimates (using Markov chain Monte Carlo, or MCMC, estimation). We see significant heterogeneity. An 80% or 95% credible interval on the estimates would find few "significant" differences for this respondent. This is a
more robust picture of the respondent, and inclines us away from thinking of him or her as a “type.” Figure 5: Distribution of HB Beta Estimates by Feature, for the Same Respondent
What I tell clients is this: understand respondents in terms of tendency rather than type. Customers behave differently in different contexts and there is uncertainty in CA assessment. The significance of that fact depends on our decisions, business goals, and ability to reach customers.
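A minimal sketch of the within-respondent view, assuming a hypothetical array of retained HB draws for one respondent (100 MCMC draws by 10 feature part-worths); the 80% and 95% credible intervals come directly from the percentiles of the draws:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_features = 100, 10

# Hypothetical HB draws for one respondent: each column is a feature part-worth,
# each row one retained MCMC draw. Real draws would come from the HB estimation run.
true_means = rng.normal(0, 1.0, size=n_features)
draws = true_means + rng.normal(0, 1.2, size=(n_draws, n_features))

for j in range(n_features):
    lo80, hi80 = np.percentile(draws[:, j], [10, 90])
    lo95, hi95 = np.percentile(draws[:, j], [2.5, 97.5])
    print(f"feature {j + 1:2d}: mean={draws[:, j].mean():+.2f}  "
          f"80% CI=({lo80:+.2f}, {hi80:+.2f})  95% CI=({lo95:+.2f}, {hi95:+.2f})")
```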
MISTAKE #6: CA TELLS US THE BEST PRODUCT TO MAKE (RATHER EASILY) Some clients and analysts realize that CA can be used not only to assess preference share and price sensitivity but also to inform a product portfolio. In other words, to answer "What should we make?" An almost certainly wrong answer would be to make the product with highest utility, because it is unlikely that the most desirable features would be paired with the best brand and lowest price. A more sophisticated answer searches for preference tradeoff vs. cost in the context of a competitive set. However, this method capitalizes on error and on the precise specification of the competitive set; it does not examine the sensitivity and generality of the result. Better results may come by searching for a large set of near-optimum products and examining their commonalities (Chapman and Alford, 2010; cf. Belloni et al., 2008). Another approach, depending on the business question, would be to examine likely competitive response to a decision using a strategic modeling approach (Chapman and Love, 2012). An analyst could combine the approaches: investigate a set of many potential near-optimal products, choose a set of products that is feasible, and then investigate how competition might respond to that line. Doing this is a complex process: it requires extraordinarily high confidence in one's data, and then one must address crucial model assumptions and adapt (or develop) custom code in R or some other language to estimate the models (Chapman and Alford, 2010; Chapman and Love, 2012). The results can be extremely informative—for instance, the model in Chapman and Alford (2010) identified a product fully 17 months in advance of its introduction to the market by a competitor—but arriving at such an outcome is a complex undertaking built on impeccable data (and perhaps luck).
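As a rough sketch of the "search many near-optimal products and examine their commonalities" idea, and not the actual method of Chapman and Alford (2010), the following uses hypothetical part-worths, a fixed competitive set, and a simple first-choice share rule:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical setup: 4 attributes with 3/3/2/4 levels and 300 respondents with
# individual-level part-worths (as one would get from HB estimation).
levels = [3, 3, 2, 4]
n_resp = 300
betas = [rng.normal(0, 1, size=(n_resp, k)) for k in levels]

# A fixed competitive set, each product specified as one level index per attribute.
competitors = [(0, 1, 0, 2), (2, 0, 1, 3)]

def utility(spec):
    """Per-respondent total utility of a product spec (tuple of level indices)."""
    return sum(betas[a][:, lvl] for a, lvl in enumerate(spec))

def first_choice_share(candidate):
    """Share of respondents whose highest-utility option is the candidate."""
    utils = np.column_stack([utility(candidate)] + [utility(c) for c in competitors])
    return np.mean(utils.argmax(axis=1) == 0)

# Score every feasible configuration, keep the top 5% ("near-optimal" set),
# then look at which levels those products have in common.
scored = sorted(((spec, first_choice_share(spec))
                 for spec in product(*(range(k) for k in levels))),
                key=lambda x: -x[1])
near_optimal = [spec for spec, _ in scored[: max(1, len(scored) // 20)]]
for a, k in enumerate(levels):
    counts = np.bincount([spec[a] for spec in near_optimal], minlength=k)
    print(f"attribute {a + 1}: level frequencies among near-optimal products = {counts.tolist()}")
```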
In short, when clients wish to find the “best product,” I explain: CA informs us about our line, but precise optimization requires more models, data, and expertise.
MISTAKE #7: GET AS MUCH STATISTICAL POWER (SAMPLE) AS POSSIBLE This issue is not specific to CA but to research in general. Too many clients (and analysts) are impressed with sample size and automatically assume that more sample is better. Figure 6 shows the schematic of a choice-based conjoint analysis (CBC) study I once observed. The analyst had a complex model with limited sample and wanted to obtain adequate power. Each CBC task presented 3 products and a None option . . . and respondents were asked to complete 60 such tasks! Figure 6: A Conjoint Analysis Study with Great “Power”
Power is directly related to confidence intervals, and the problem with confidence intervals (in classical statistics) is that they scale to the inverse square root of sample size. When you double the sample size, you only reduce the confidence interval by 30% (1-1/√2). To cut the confidence interval in half requires 4x the sample size. This has two problems: diminishing returns, and lack of robustness to sample misspecification. If your sample is a non-probability sample, as most are, then sampling more of it may not be the best approach. I prefer instead to approach sample size this way: determine the minimum sample needed to give an adequate business answer, and then split the available sampling resources into multiple chunks of that size, assessing each one with varying methods and/or sampling techniques. We can have much higher confidence when findings come from multiple samples using multiple methods. What I tell clients: instead of worrying about more and more statistical significance, we should maximize interpretative power and minimize risk. I sketch what such multiple assessments might look like. “Would you rather have: (1) Study A with N=10000, or (2) Study A with 1200, Study B with 300, Study C with 200, and Study D with 800?” Good clients understand immediately that despite having ¼ the sample, Plan 2 may be much more informative!
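A quick numeric illustration of the square-root relationship, using a generic 95% interval for a proportion (illustrative numbers only):

```python
import math

p, z = 0.5, 1.96  # worst-case proportion and 95% confidence multiplier

def ci_half_width(n):
    """Half-width of a simple 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (1000, 2000, 4000, 10000):
    print(f"n={n:6d}: margin of error = +/- {ci_half_width(n):.3f}")

# Doubling n shrinks the interval by only about 29% (1 - 1/sqrt(2));
# cutting it in half requires 4x the sample.
print(f"reduction from doubling n: {1 - 1 / math.sqrt(2):.1%}")
```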
MISTAKE #8: MAKE CA FIT WHAT YOU WANT TO KNOW To address tough business questions, it's a good idea to collect customer data with a method like CA. Unfortunately, this may yield surveys that are more meaningful to the client than the respondent. I find this often occurs with complex technical features (that customers may not understand) and messaging statements (that may not influence CA survey behavior). Figure 7 presents a fictional CBC task about wine preferences. It was inspired by a poorly designed survey I once took about home improvement products; I selected wine as the example because it makes the issue particularly obvious.
Figure 7: A CBC about Wine
Imagine you are selecting a bottle of wine for a special celebration dinner at home. If the following wines were your only available choices, which would you purchase?
Blend: 75% Cabernet Sauvignon, 20% Merlot, 4% Cabernet Franc, 1% Malbec | 75% Cabernet Sauvignon, 15% Merlot, 10% Cabernet Franc
Winery type: Custom crush | Negotiant
Bottle size: 700ml | 750ml
Cork type: Grade 2 | Double disk (1+1)
Fining agent: (None, unfined) | Potassium caseinate
Bottling line type: Mobile | On premises
Origin of bottle glass: Mexico | China
Our fictional marketing manager is hoping to answer questions like these: should we fine our wines (cause them to precipitate sediment before bottling)? Can we consider cheaper bottle sources? Should we invest in an in-house bottling line (instead of a truck that moves between facilities)? Can we increase the Cabernet Franc in our blend (for various possible reasons)? And so forth. Those are all important questions, but posing such technical features to customers results in a survey that only a winemaker could answer! A better survey would map the business considerations to features that a consumer can address, such as taste, appearance, aging potential, cost, and critics' scores. (I leave the question of how to design that survey about wine as an exercise for the reader.) This example is extreme, yet how often do we commit similar mistakes in areas where we are too close to the business? How often do we test something "just to see if it has an effect?" How often do we describe something the way that R&D wants? Or include a message that has little if any real information? And then, when we see a null effect, are we sure that it is because customers don't care, or could it be because the task was bad? (A similar question may be asked in the case of significant effects.) And, perhaps most dangerously, how often do we field a CA without doing a small-sample pretest?
The implication is obvious: design CA tasks to match what respondents can answer reliably and validly. And before fielding, pretest the attributes, levels, and tasks to make sure!
(NON!-) MISTAKE #9: IT’S BETTER THAN USING OUR INSTINCTS Clients, stakeholders, managers, and sometimes even analysts are known to say, “Those results are interesting but I just don’t believe them!” Then an opinion is substituted for the data. Of course CA is not perfect—all of the above points demonstrate ways in which it may go wrong, and there are many more—but I would wager this: a well-designed, well-fielded CA is almost always better than expert opinion. Opinions of those close to a product are often dramatically incorrect (cf. Gourville, 2004). Unless you have better and more reliable data that contradicts a CA, go with the CA. If we consider this question in terms of expected payoff, I propose that the situation resembles Figure 8. If we use data, our estimates are likely to be closer to the truth than if we don’t. Sometimes they will be wrong, but will not be as wrong on average as opinion would be. Figure 8: Expected Payoffs with and without Data
                     Use data                        Use instinct
Decision correct     High precision (high gain)      Low precision (modest gain)
Decision incorrect   Low inaccuracy (modest loss)    High inaccuracy (large loss)
Net expectation      Positive                        Negative
When we get a decision right with data, the relative payoff is much larger. Opinion is sometimes right, but likely to be imprecise; when it is wrong, expert opinion may be disastrously wrong. On the other hand, I have yet to observe a case where consumer data has been terribly misleading; the worst case I’ve seen is when it signals a need to learn more. When opinion and data disagree, explore more. Do a different study, with a different method and different sampling. What I tell clients: it’s very risky to bet against what your customers are telling you! An occasional success—or an excessively successful single opiner—does not disprove the value of data.
MISTAKE #10 AND COUNTING Keith Chrzan (2013) commented on this paper after presentation at the Sawtooth Software Conference and noted that attribute importance is another area where there is widespread confusion. Clients often want to know “Which attributes are most important?” but CA can only answer this with regard to the relative utilities of the attributes and features tested. Including (or omitting) a very popular or unpopular level on one attribute will alter the “importance” of every other attribute!
CONCLUSION Conjoint analysis is a powerful tool but its power and success also create conditions where client expectations may be too high. We've seen that some of the simplest ways to view CA
results such as average utilities may be misleading, and that despite client enthusiasm they may distract from answering more precise business questions. The best way to meet high expectations is to meet them! This may require all of us to be more careful in our communications, analyses, and presentations. The issues here are not principally technical in nature; rather they are about how conjoint analysis is positioned and how expectations are set and upheld through effective study design, analysis, and interpretation. I hope the paper inspires you—and even better, inspires and informs clients.
Chris Chapman
ACKNOWLEDGEMENTS I'd like to thank Bryan Orme, who provided careful, thoughtful, and very helpful feedback at several points to improve both this paper and the conference presentation. If this paper is useful to the reader, that is in large part due to Bryan's suggestions (and if it's not useful, that's due to the author!). Keith Chrzan also provided thoughtful observation and reflections during the conference. Finally, I'd like to thank all my colleagues over the years, who are reflected in the reference list. They spurred the reflections more than anything I did.
REFERENCES
Belloni, A., Freund, R.M., Selove, M., and Simester, D. (2008). Optimal product line design: efficient methods and comparisons. Management Science 54(9), September 2008, pp. 1544–1552.
Chapman, C.N., Alford, J.L., and Ellis, S. (2013). Rcbc: marketing research tools for choice-based conjoint analysis, version 0.201. [R code]
Chapman, C.N., and Love, E. (2012). Game theory and conjoint analysis: using choice data for strategic decisions. Proceedings of the 2012 Sawtooth Software Conference, Orlando, FL, March 2012.
Chapman, C.N., and Alford, J.L. (2010). Product portfolio evaluation using choice modeling and genetic algorithms. Proceedings of the 2010 Sawtooth Software Conference, Newport Beach, CA, October 2010.
Chapman, C.N., Alford, J.L., Johnson, C., Lahav, M., and Weidemann, R. (2009). Comparing results of CBC and ACBC with real product selection. Proceedings of the 2009 Sawtooth Software Conference, Delray Beach, FL, March 2009.
Chapman, C.N., Alford, J.L., and Love, E. (2009). Exploring the reliability and validity of conjoint analysis studies. Presented at the Advanced Research Techniques Forum (A/R/T Forum), Whistler, BC, June 2009.
Chrzan, K. (2013). Remarks on "9 things clients get wrong about conjoint analysis." Discussion at the 2013 Sawtooth Software Conference, Dana Point, CA, October 2013.
Ding, M. (2007). An incentive-aligned mechanism for conjoint analysis. Journal of Marketing Research, 2007, pp. 214–223.
Gourville, J. (2004). Why customers don't buy: the psychology of new product adoption. Case study series, paper 9-504-056. Harvard Business School, Boston, MA.
QUANTITATIVE MARKETING RESEARCH SOLUTIONS IN A TRADITIONAL MANUFACTURING FIRM: UPDATE AND CASE STUDY ROBERT J. GOODWIN LIFETIME PRODUCTS, INC.
ABSTRACT Lifetime Products, Inc., a manufacturer of folding furniture and other consumer hard goods, provides a progress report on its quest for more effective analytic methods and offers an insightful new ACBC case study. This demonstration of a typical adaptive choice study, enhanced by an experiment with conjoint analysis design parameters, is intended to be of interest to new practitioners and experienced users alike.
INTRODUCTION Lifetime Products, Inc. is a privately held, vertically integrated manufacturing company headquartered in Clearfield, Utah. The company manufactures consumer hard goods typically constructed of blow-molded polyethylene resin and powder-coated steel. Its products are sold to consumers and businesses worldwide, primarily through a wide range of discount and department stores, home improvement centers, warehouse clubs, sporting goods stores, and other retail and online outlets. Over the past seven years, the Lifetime Marketing Research Department has adopted progressively more sophisticated conjoint analysis and other quantitative marketing research tools to better inform product development and marketing decision-making. The company’s experiences in adopting and cost-effectively utilizing these sophisticated analytic methods— culminating in its current use of Sawtooth Software’s Adaptive Choice-Based Conjoint (ACBC) software—were documented in papers presented at previous Sawtooth Software Conferences (Goodwin 2009, and Goodwin 2010). In this paper, we first provide an update on what Lifetime Products has learned about conjoint analysis and potential best practices thereof over the past three years. Then, for demonstration purposes, we present a new Adaptive CBC case on outdoor storage sheds. The paper concludes with a discussion of our experimentation with ACBC design parameters in this shed study.
I. WHAT WE’VE LEARNED ABOUT CONJOINT ANALYSIS This section provides some practical advice, intended primarily for new and corporate practitioners of conjoint analysis and other quantitative marketing tools. This is based on our experience at Lifetime Products as a “formerly new” corporate practitioner of conjoint analysis.
#1. Use Prudence in Conjoint Analysis Design
One of the things that helped drive our adoption of Sawtooth Software's Adaptive Choice-Based Conjoint program was its ability to administer conjoint analysis designs with large numbers of attributes, without overburdening respondents. The Concept Screening phase of the ACBC protocol allows each panelist to create a "short list" of potentially acceptable concepts using whatever decision-simplification techniques s/he wishes, electing to downplay or even ignore attributes they considered less essential to the purchase decision. Further, we could allow them to select a subset of the most important attributes for inclusion (or, alternatively, the least important attributes for exclusion) for the rest of the conjoint experiment. Figure 1 shows an example page from our first ACBC study on Storage Sheds in 2008. Note that, in the responses entered, the respondent has selected eight attributes to include—along with price and materials of construction, which were crucial elements in the experiment—while implicitly excluding the other six attributes from further consideration. Constructed lists could then be used to bring forward only the Top-10 attributes from an original pool of 16 attributes, making the exercise much more manageable for the respondent. While the part-worths for an excluded attribute would be zero for that observation, we would still capture the relevant utility of that attribute for another panelist who retained it for further consideration in the purchase decision. Figure 1 Example of Large-scale ACBC Design
We utilized this "winnowing" feature of ACBC for several other complex-design studies in the year or two following its adoption at Lifetime. During the presentation of those studies to our internal clients, we noticed a few interesting behaviors. One was the virtual fixation of a few clients on a "pet" feature that (to their dismay) registered minimal decisional importance
following Hierarchical Bayes (HB) estimation. While paying very little attention to the most important attributes in the experiment, they spent considerable time trying to modify the attribute to improve its role in the purchase decision. In essence, this diverted their attention from what mattered most in the consumers' decision to what mattered least. The more common client behavior was what could be called a "reality-check" effect. Once the clients realized (and accepted) the minimal impact of such an attribute on purchase decisions, they immediately began to concentrate on the more important array of attributes. Therefore, when it came time to do another (similar) conjoint study, they were less eager to load up the design with every conceivable attribute that might affect purchase likelihood. Since that time, we have tended not to load up a conjoint study with large numbers of attributes and levels, just because "it's possible." Instead, we have sought designs that are more parsimonious by eliminating attributes and levels that we already know to be less important in consumers' decision-making. As a result, most of our recent studies have gravitated around designs of 8–10 attributes and 20–30 levels. Occasionally, we have found it useful to assess less-important attributes—or those that might be more difficult to measure in a conjoint instrument—by testing them in regular questions following the end of the conjoint experiment. For example, Figure 2 shows a follow-up question in our 2013 Shed conjoint analysis survey to gauge consumers' preference for a shed that emphasized ease of assembly (at the expense of strength) vis-à-vis a shed that emphasized strength (at the expense of longer assembly times). (This issue is relevant to Lifetime Products, since our sheds have relatively large quantities of screws—making for longer assembly times—but are stronger than most competitors' sheds.) Figure 2 Example of Post-Conjoint Preference Question
#2. Spend Time to Refine Conjoint Analysis Instruments
Given the importance of the respondent being able to understand product features and attributes, we have found it useful to spend extra time on the front end to ensure that the survey instrument and conjoint analysis design will yield high-quality results. In a previous paper (Goodwin, 2009), we reported the value of involving clients in instrument testing and debugging. In a more general sense, we continue to review our conjoint analysis instruments and designs with multiple iterations of client critique and feedback.
As we do so, we look out for several potential issues that could degrade the quality of conjoint analysis results. First, we wordsmith attribute and level descriptions to maximize clarity. For example, with some of our categories, we have found a general lack of understanding in the marketplace regarding some attributes (such as basketball height-adjustment mechanisms and backboard materials; shed wall, roof and floor materials; etc.). Attributes such as these necessitate great care to employ verbiage that is understandable to consumers. Another area we look out for involves close substitutes among levels of a given attribute, where differences might be difficult for consumers to perceive, even in the actual retail environment. For example, most mid-range basketball goals have backboard widths between 48 and 54 inches, in 2-inch increments. While most consumers can differentiate well between backboards at opposite ends of this range, they frequently have difficulty deciding—or even differentiating—among backboard sizes with 2-inch size differences. Recent qualitative research with basketball system owners has shown that, even while looking at 50-inch and 52-inch models side-by-side in a store, it is sometimes difficult for them (without looking at the product labeling) to tell which one is larger than the other. While our effort is not to force product discrimination in a survey where it may not exist that strongly in the marketplace itself, we want to ensure that panelists are given a realistic set of options to choose from (i.e., so the survey instrument is not the “problem”). Frequently, this means adding pictures or large labels showing product size or feature call-outs to mimic in-store shopping as much as possible. #3. More Judicious with Brand Names
Lifetime Products is not a household name like Coke, Ford, Apple, and McDonald’s. As a brand sold primarily through big-box retailers, Lifetime Products is well known among the category buyers who put our product on the shelf, but less so among consumers who take it off the shelf. In many of our categories (such as tables & chairs, basketball, and sheds), the assortment of brands in a given store is limited. Consequently, consumers tend to trust the retailer to be the brand “gatekeeper” and to carry only the best and most reliable brands. In doing so, they often rely less on their own brand perceptions and experiences. Lifetime Products’ brand image is also confounded by misconceptions regarding other entities such as the Lifetime Movie Channel, Lifetime Fitness Equipment, and even “lifetime” warranty. There are also perceptual anomalies among competitor brands. For example, Samsonite (folding tables) gets a boost from their well-known luggage line, Cosco (folding chairs) is sometimes mistaken for the Costco store brand, and Rubbermaid (storage sheds) has a halo effect from the wide array of Rubbermaid household products. Further, Lifetime kayaks participate in a market that is highly fragmented with more than two dozen small brands, few of which have significant brand awareness. As a result, many conjoint analysis studies we have done produce flat brand utility profiles, accompanied by low average-attribute-importance scores. This is especially the case when we include large numbers of brand names in the exercise. Many of these brands end up with utility scores lower than the “no brand” option, despite being well regarded by retail chain store buyers. Because of these somewhat-unique circumstances in our business, Lifetime often uses heavily abridged brand lists in its conjoint studies, or in some cases drops the brand attribute altogether. In addition, in our most recent kayak industry study (with its plethora of unknown
brands), we had to resort to surrogate descriptions such as “a brand I do not know,” “a brand I know to be good,” and so forth. #4. Use Simulations to Estimate Brand Equity
Despite the foregoing, there are exceptions (most notably in the Tables & Chairs category) where the Lifetime brand is relatively well known and has a long sales history among a few key retailers. In this case, our brand conjoint results are more realistic, and we often find good perceptual differentiation among key brand names, including Lifetime. Lifetime sales managers often experience price resistance from retail buyers, particularly in the face of new, lower-price competition from virtually unknown brands (in essence, “no brand”). In instances like these, it is often beneficial to arm these sales managers with statistical evidence of the value of the Lifetime brand as part of its overall product offering. Recently, we generated such a brand equity analysis for folding utility tables using a reliable conjoint study conducted a few years ago. In this context, we defined “per-unit brand equity” as: The price reduction a “no-name” brand would have to use in order to replace the Lifetime brand and maintain Lifetime’s market penetration. The procedure we used for this brand equity estimation was as follows: 1. Generate a standard share of preference simulation, with the Lifetime table at its manufacturer’s suggested retail price (MSRP), two competitor offerings at their respective MSRPs, and the “None” option. (See left pie chart in Figure 3.) 2. Re-run the simulation using “no brand name” in place of the Lifetime brand, with no other changes in product specifications (i.e., an exact duplicate offering except for the brand name). The resulting share of preference for the “no-name” offering (which otherwise duplicated the Lifetime attributes) decreased from the base-case share. (Note that much of that preference degradation went to the “None” option, not to the other competitors, suggesting possible strength of the Lifetime brand over the existing competitors as well.) 3. Gradually decrease the price of the “no-name” offering until its share of preference matched the original base case for the Lifetime offering. In this case, the price differential was about -6%, which represents a reasonable estimate of the value of the Lifetime brand, ceteris paribus. In other words, a no-name competitor with the same specification as the Lifetime table would have to reduce its price 6% in order to maintain the same share of preference as the Lifetime table. (See right pie chart in Figure 3.)
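A minimal sketch of that three-step procedure, using a logit share-of-preference rule over hypothetical respondent utilities rather than the study's actual HB utilities and simulator settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n_resp = 500

# Hypothetical per-respondent utilities: brand part-worths, a price slope, and a "None" threshold.
u_lifetime = rng.normal(0.8, 0.5, n_resp)            # Lifetime brand part-worth
u_noname = rng.normal(0.0, 0.5, n_resp)              # unbranded ("no-name") part-worth
u_comp = rng.normal(0.4, 0.5, (n_resp, 2))           # two competitor brands
price_beta = -abs(rng.normal(0.15, 0.05, n_resp))    # utility per dollar (negative)
u_none = rng.normal(0.5, 1.0, n_resp)                # "None" option

msrp = np.array([60.0, 55.0, 65.0])  # our table and two competitor tables (hypothetical MSRPs)

def shares(our_brand_u, our_price):
    """Logit share of preference for [our product, competitor X, competitor Y, None]."""
    utils = np.column_stack([
        our_brand_u + price_beta * our_price,
        u_comp[:, 0] + price_beta * msrp[1],
        u_comp[:, 1] + price_beta * msrp[2],
        u_none,
    ])
    p = np.exp(utils - utils.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)).mean(axis=0)

base_share = shares(u_lifetime, msrp[0])[0]          # step 1: branded base case
noname_same_price = shares(u_noname, msrp[0])[0]     # step 2: swap in "no brand" at the same price

# Step 3: lower the no-name price until its share matches the branded base case.
price = msrp[0]
while shares(u_noname, price)[0] < base_share and price > 0:
    price -= 0.25
print(f"base share {base_share:.1%}; no-name at the same price {noname_same_price:.1%}")
print(f"price cut needed to match the branded share: {(msrp[0] - price) / msrp[0]:.1%}")
```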
Figure 3 Method to Estimate Lifetime Brand Value
(The figure shows two share-of-preference pie charts: on the left, the Lifetime table at its retail price alongside Brand X, Brand Y, and "Would Not Buy Any of These"; on the right, the same simulation with the Lifetime table replaced by a no-brand table at a 6% lower price, the reduction needed to garner the same share as Lifetime.)
The 6% brand value may not seem high, compared with the perceived value of more well-known consumer brands. Nevertheless, in the case of Lifetime tables, this information was very helpful for our sales managers responding to queries from retail accounts to help justify higher wholesale prices than the competition.
#5. Improved Our Simulation Techniques
Over the past half-dozen years of using conjoint analysis, Lifetime has improved its use of simulation techniques to help inform management decisions and sales approaches. In these simulations, we generally have found it most useful to use the "None" option in order to capture buy/no-buy behavior and to place relatively greater emphasis on share of preference among likely buyers. Most importantly, this approach allows us to measure the possible expansion (or contraction) of the market due to the introduction of a new product (or the deletion of an existing product). We have found this approach particularly useful when simulating the behavior of likely customers of our key retail partners and the change in their "market" size. Recently, we conducted several simulation analyses to test the impact of pricing strategies for our retail partners. We offer two of them here. In both cases, the procedure was to generate a baseline simulation, not only of shares of preference (i.e., number of units), but also of revenue and (where available) retail margin. We then conducted experimental "what-if" simulations to compare with the baseline scenario. Because both situations involved multiple products—and the potential for cross-cannibalization—we measured performance for the entire product line at the retailer.
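A minimal sketch of how such a baseline-versus-what-if comparison might be indexed; the shares, prices, and costs below are hypothetical placeholders for values that would come from the market simulator and the retailer:

```python
# Hypothetical line of three tables: shares of preference, retail prices, and unit costs.
base_shares = {"Table Q": 0.30, "Table R": 0.25, "Table S": 0.20}    # remainder chose "None"
whatif_shares = {"Table Q": 0.36, "Table R": 0.22, "Table S": 0.19}  # after a Table Q price cut
prices = {"Table Q": 49.0, "Table R": 59.0, "Table S": 79.0}
costs = {"Table Q": 39.0, "Table R": 45.0, "Table S": 60.0}
whatif_prices = dict(prices, **{"Table Q": 44.0})

def line_metrics(shares, price_points):
    """Total units, revenue, and retail margin for the whole line (per simulated buyer)."""
    units = sum(shares.values())
    revenue = sum(shares[k] * price_points[k] for k in shares)
    margin = sum(shares[k] * (price_points[k] - costs[k]) for k in shares)
    return units, revenue, margin

base = line_metrics(base_shares, prices)
test = line_metrics(whatif_shares, whatif_prices)
for name, b, t in zip(("units", "revenue", "margin"), base, test):
    print(f"{name:8s} index vs. baseline: {t / b:.2f}")
```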
The first example involved a lineup of folding tables at a relatively large retail account (See Figure 4). In question was the price of a key table model in that lineup, identified as Table Q in the graphic. The lines in the graphic represent changes in overall table lineup units, revenue, and retail margin, indexed against the current price-point scenario for Table Q (Index = 1.00). We ran a number of experimental simulations based on adjustments to the Table Q price point and ran the share of preference changes through pricing and margin calculations for the entire lineup. Figure 4 Using Simulations to Measure Unit, Revenue & Margin Changes
As might be expected, decreasing the price of Table Q (holding all other prices and options constant) would result in moderate increases in overall numbers of units sold (solid line), smaller increases in revenue (due to the lower weighted-average price with the Table Q price cut; dashed line), and decreases in retail margin (dotted line). Note that these margin decreases would rapidly become severe, since the absolute value of a price decrease is applied to a much smaller margin base. (See curve configurations to the left of the crossover point in Figure 4.) On the other hand, if the price of Table Q were to be increased, the effects would go in the opposite direction in each case: margin would increase, and revenue and units would decrease. (See curve configurations to the right of the crossover point in Figure 4.) This Figure 4 graphic, along with the precise estimates of units, revenue, and margin changes with various Table Q price adjustments, provided the account with various options to be considered, in light of their need to balance unit, revenue, and margin objectives. The second example involves the prospective introduction of a new and innovative version of a furniture product (non-Lifetime) already sold by a given retailer. Three variants of the new product were tested: Product A at high and moderate price levels and Product B (an inferior
version of Product A) at a relatively low price. Each of these product-price scenarios was compared with the base case for the current product continuing to be sold by itself. And, in contrast to the previous table example, only units (share) and revenue were measured in this experiment (retail margin data for the existing product were not available). (See Figure 5) Figure 5 Using Simulations to Inform the Introduction of a New Product
(Figure 5 callouts: note the retailer's increase in total unit volume with introduction of the new product, and the "Sweet Spot" product/pricing options.)
The first result to note (especially from the retailer's perspective) is the overall expansion of unit sales under all three new-product-introduction scenarios. This would make sense, since in each case there would be two options for consumers to consider, thus reducing the proportion of retailers' consumers who would not buy either option at this store (light gray bar portions at top). The second finding of note (especially from Lifetime's point of view) was that unit sales of the new concept (black bars at the bottom) would be maximized by introducing Product A at the moderate price. Of course, this would also result in the smallest net unit sales of the existing product (dark gray bars in the middle). Finally, the matter of revenue is considered. As seen in the revenue index numbers at the bottom margin of the graphic (with current case indexed at 1.00), overall retail revenue for this category would be maximized by introducing Product A at the high price (an increase of 26% over current retail revenue). However, it also should be noted that introducing Product A at the moderate price would also result in a sizable increase in revenue over the current case (+21%, only slightly lower than that of the high price). Thus Lifetime and retailer were presented with some interesting options (see the "Sweet Spot" callout in Figure 5), depending on how unit and revenue objectives for both current and new products were enforced. And, of course, they also could consider introduction of Product B
at its low price point, which would result in the greatest penetration among the retailer's customers, but at the expense of almost zero growth in overall revenue.
#6. Maintain a Standard of Academic Rigor
Let’s face it: in a corporate-practitioner setting (especially as the sole conjoint analysis “expert” in the company), it’s sometimes easy to become lackadaisical about doing conjoint analysis right! It’s easy to consider taking short cuts. It’s easy to take the easy route with a shorter questionnaire instead of the recommended number of ACBC Screening steps. And, it’s easy to exclude holdout tasks in order to keep the survey length down and simplify the analysis. Over the past year or so, we have concluded that, in order to prevent becoming too complacent, a corporate practitioner of conjoint analysis may need to proactively maintain a standard of academic rigor in his/her work. It is important to stay immersed in conjoint analysis principles and methodology through seminars, conferences (even if the technical/mathematical details are a bit of a comprehension “stretch”), and the literature. And, in the final analysis, there’s nothing like doing a paper for one of those conferences to re-institute best practices! Ideally a paper such as this should include some element of “research on research” (experiment with methods, settings, etc.) to stretch one’s capabilities even further.
II. 2013 STORAGE SHED ACBC CASE STUDY It had been nearly five years since Lifetime's last U.S. storage shed conjoint study when the Company's Lawn & Garden management team requested an update to the earlier study. In seeking a new study, their objective was to better inform current strategic planning, tactical decision-making, and sales presentations for this important category. At the same time, this shed study "refresh" presented itself as an ideal vehicle for a case study in the current Sawtooth Software conference paper to illustrate the Company's recent progress in its conjoint analysis practices. The specific objectives for including this case study in this current paper are shown below.
Demonstrate a typical conjoint analysis done by a private practitioner in an industrial setting.
Validate the new conjoint model by comparing market simulations with in-sample holdout preference tasks (test-retest format).
Include a “research on research” split-sample test on the effects of three different ACBC research design settings.
An overview of the 2013 U.S. Storage Shed ACBC study is included in the table below:
ACBC Instrument Example Screenshots
As many users and students of conjoint analysis know, Sawtooth Software’s Adaptive Choice-based conjoint protocol begins with a Build-Your-Own (BYO) exercise to establish the respondent’s preference positioning within the array of all possible configurations of the product in question. (Figure 12, to be introduced later, illustrates this positioning visually.) Figure 6 shows a screenshot of the BYO exercise for the current storage shed conjoint study.
Figure 6 Build-Your-Own (BYO) Shed Exercise
The design with 9 non-price attributes and 25 total levels results in a total of 7,776 possible product configurations. In addition, the range of summed prices from $199 to $1,474, amplified by a random price variation factor of ±30 percent (in the Screening phase of the protocol, to follow) provides a virtually infinite array of product-price possibilities. Note the use of conditional graphics (shed rendering at upper right) to help illustrate three key attributes that drive most shed purchase decisions (square footage, roof height, and materials of construction). Following creation of the panelist’s preferred (BYO) shed design, the survey protocol asks him/her to consider a series of “near-neighbor” concepts and to designate whether or not each one is a possibility for purchase consideration. (See Figure 7 and, later, Figure 12.) In essence, the subject is asked to build a consideration set of possible product configurations from which s/he will ultimately select a new favorite design. This screening exercise also captures any non-
compensatory selection behaviors, as the respondent can designate some attribute levels as ones s/he must have—or wants to exclude—regardless of price. Figure 7 Shed Concept Screening Exercise
Note again the use of conditional graphics, which help guide the respondents by mimicking a side-by-side visual comparison common to many retail shed displays. Following the screening exercise and the creation of an individualized short list of possible configurations, these concepts are arrayed in a multi-round “tournament” setting where the panelist ultimately designates the “best” product-price option. Conditional graphics again help facilitate these tournament choices. (See Figure 8)
Figure 8 Shed Concept “Tournament” Exercise
The essence of the conjoint exercise is not to derive the "best" configuration, however. Rather, it is to discover empirically how the panelist makes the simulated purchase decision, including the
- relative importance of the attributes in that decision,
- levels of each attribute that are preferred,
- interaction of these preferences across all attributes and levels, and
- implicit price sensitivity for these features, individually and collectively.
As we like to tell our clients trying to understand the workings and uses of conjoint analysis, “It’s the journey—not the destination—that’s most important with conjoint analysis.” ACBC Example Diagnostic Graphics
Notwithstanding the ultimate best use of conjoint analysis as a tool for market simulations, there are a few diagnostic reports and graphics that help clients understand what the program is doing for their respective study. First among these is the average attribute-importance distribution, in this case derived through Hierarchical Bayes estimation of the individual part-worths from the Multinomial Logit procedure. (See Figure 9)
Figure 9 Relative Importance of Attributes from HB Estimation
It should be noted that these are only average importance scores, and that the simulations ultimately will take into account each individual respondent's preferences, especially if those preferences are far different from the average. Nevertheless, our clients (especially sales managers who are reporting these findings to retail chain buyers) can relate to interpretations such as "20 percent of a typical shed purchase decision involves—or is influenced by—the size of the shed." Note in this graphic that price occupies well over one-third of the decision space for this array of shed products. This is due in large part to the wide range of prices ($199 minus 30 percent, up to $1,474 plus 30 percent) necessary to cover the range of sheds from a 25-square-foot sheet metal model with few add-on features up to a 100-square-foot wooden model with multiple add-ons. Within a defined sub-range of shed possibilities that most consumers would consider (say, plastic sheds in the 50-to-75-square-foot range, with several feature add-ons), the relative importance of price would diminish markedly and the importance of other attributes would increase. A companion set of diagnostics to the importance pie chart above involves line graphs of the relative conjoint utility scores (usually zero-centered), showing the relative preferences for levels within each attribute. Again, recognizing that these are only averages, they provide a quick snapshot of the overall preference profile for attributes and levels. They also provide a good diagnostic to see if there are any reversals (e.g., ordinal-scale levels that do not follow a consistent progression of increasing or decreasing utility scores). (See Figure 10)
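For readers less familiar with where importance scores such as those in Figure 9 come from, here is a minimal sketch of the usual range-based calculation; the part-worths are hypothetical, not the study's estimates:

```python
# Hypothetical zero-centered part-worths for one respondent, grouped by attribute.
partworths = {
    "Construction": [-55.0, 20.0, 35.0],
    "Square footage": [-60.0, -10.0, 25.0, 45.0],
    "Roof height": [-15.0, 15.0],
    "Price (piecewise endpoints)": [180.0, -220.0],
}

# Importance of an attribute = its part-worth range as a share of the sum of all ranges.
ranges = {attr: max(vals) - min(vals) for attr, vals in partworths.items()}
total = sum(ranges.values())
for attr, r in ranges.items():
    print(f"{attr:28s} importance = {r / total:.1%}")

# In practice the calculation is done respondent by respondent and then averaged;
# averaging part-worths first can understate attributes with heterogeneous preferences.
```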
Figure 10 Average Conjoint Utility Scores from HB Estimation
(The figure plots zero-centered average conjoint utilities for selected attributes: construction (sheet metal, steel-reinforced resin, treated wood), square footage (25, 50, 75, 100 SF), roof height (6 feet, 8 feet), wall style (plain, siding-style, brick-style), flooring (not included, plastic, plywood), and shelving (not included, 2 shelves). Lifetime Shed Conjoint Study 2013; Survey Sampling Inc., nationwide, n=643.)
The final diagnostic graphic we offer is the Price Utility Curve. (See Figure 11) It is akin to the Average Level Utility Scores, just shown, except that (a) in contrast to most feature-based attributes, its curve has a negative slope, and (b) it can have multiple, independently sloped curve segments (eight in this case), using ACBC’s Piecewise Pricing estimation option. Our clients can also relate to this as a surrogate representation for a demand curve, with varying slopes (price sensitivity).
Figure 11 Price Utility Curve Using Piecewise Method
(The figure plots price utility scores (higher = more preferred) against retail prices from $0 to $2,000, estimated with a negative price constraint. Callouts mark the relevant price range for plastic sheds (25–100 SF), the overlapping ranges for sheet metal and wooden sheds, and a possible perceptual breakpoint at $999. Lifetime Shed Conjoint Study 2013; Survey Sampling Inc., nationwide, n=571 net.)
There are a few items of particular interest in this graphic. First, the differences in price ranges among the three shed types are called out. Although the differentiation between sheet metal and plastic sheds is fairly clear-cut, there is quite a bit of overlap between plastic and wooden sheds. Second, the price cut points have been set at $200 increments to represent key perceptual price barriers (especially $1,000, where there appears to be a possible perceptual barrier in the minds of consumers).
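A minimal sketch of how a piecewise price utility function of this kind can be evaluated between its cut points; the cut points and utilities below are hypothetical, with a monotonically declining shape and a steeper drop near $999:

```python
import numpy as np

# Hypothetical piecewise price function: a utility value at each cut point, constrained to decline.
cut_points = np.array([199, 400, 600, 800, 999, 1200, 1400, 1600, 1900], dtype=float)
utilities = np.array([180, 120, 70, 30, -40, -90, -130, -170, -230], dtype=float)

def price_utility(price):
    """Linear interpolation between cut points; flat extrapolation outside the tested range."""
    return np.interp(price, cut_points, utilities)

for p in (250, 550, 950, 1050, 1500):
    print(f"${p:>5}: utility = {price_utility(p):+.0f}")
```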
III. ACBC EXPERIMENTAL DESIGN AND TEST RESULTS This section describes the experimental design of the 2013 Storage Shed ACBC study, and the “research-on-research” question we attempt to answer. It also discusses the holdout task specification and the measures used to determine the precision of the conjoint model in its various test situations. Finally, the results of this experimental test are presented. Split-Sample Format to Test ACBC Designs
The Shed study used a split-sample format with three Adaptive Choice design variants based on incrementally relaxed definitions of the “near-neighbor” concept in the Screener section of the ACBC protocol. We have characterized those three design variants as Version 1—Conservative departure from the respondent’s BYO concept, Version 2—Moderate departure, and Version 3— Aggressive departure. (See Figure 12)
Figure 12 Conservative to Aggressive ACBC Design Strategies
ACBC Design Strategy: Near-Neighbors instead of "Full Factorial"
Total multivariate attribute space (9 attributes, nearly 7,800 unique product combinations, plus a virtually infinite number of prices) surrounding the BYO Shed Configuration (the respondent's "ideal" shed):
- Version 1 / "Conservative": Vary 2-3 attributes from the BYO concept per task (n=205)
- Version 2 / "Moderate": Vary 3-4 attributes from the BYO concept per task (n=210)
- Version 3 / "Aggressive": Vary 4-5 attributes from the BYO concept per task (n=228)
Adapted from Orme's 2008 ACBC Beta Test Instructional Materials
Each qualified panelist was assigned randomly to one of the three questionnaire versions. As a matter of course, we verified that the demographic and product/purchase profiles of each of the three survey samples were similar (i.e., gender, age, home ownership, shed ownership, shed purchase likelihood, type of shed owned or likely to purchase, and preferred store for a shed purchase). Going into this experiment, we had several expectations regarding the outcome. First, we recognized that Version 1—Conservative would be the least-efficient experimental design, because it defined "near neighbor" very closely, and therefore the conjoint choice tasks would include only product configurations very close to the BYO starting point (only 2 to 3 attributes were varied from the BYO-selected concept to generate additional product concepts). At the other end of the spectrum, Version 3—Aggressive would have the widest array of product configurations (varying from 4 to 5 of the attributes from the respondent's BYO-selected concept to generate additional concepts), resulting in a more efficient design. This is borne out by D-efficiency calculations provided by Bryan Orme of Sawtooth Software using the results of this study. As shown in Figure 13, the design of the Version 3 conjoint experiment was 27% more efficient than that of the Version 1 experiment.
Figure 13 Calculated D-efficiency of Design Versions
                                    D-Efficiency    Index
Version 1 (2-3 changes from BYO)    0.44            100
Version 2 (3-4 changes from BYO)    0.52            118
Version 3 (4-5 changes from BYO)    0.56            127
(D-efficiency increases from Version 1 to Version 3.)
Calculations courtesy of Bryan Orme, using Sawtooth CVA and ACBC Version 8.2.
Despite the statistical efficiency advantage of Version 3, we fully expected Version 1 to provide the most accurate results. In thinking about the flow of the interview, we felt Version 1 would be the most user-friendly for a respondent, since most of the product configurations shown in the Screening section would be very close to his/her BYO specification. The respondent would feel that the virtual interviewer ("Kylie" in this case) is paying attention to his/her preferences, and therefore would remain more engaged in the interview process and (presumably) be more consistent in answering the holdout choices. In contrast, the wider array of product configurations from the more aggressive Version 3 approach might be so far afield that the panelist would feel the interviewer is not paying as much attention to previous answers. As a result, s/he might become frustrated and uninvolved in the process, thereby registering less-reliable utility scores. One of the by-products of this test was the expectation that those participating in the Version 1 questionnaire would see relatively more configurations they liked and would therefore bring forward more options ("possibilities") into the Tournament section. Others answering the Version 3 questionnaire would see fewer options they liked and therefore would bring fewer options forward into the Tournament. As shown in Figure 14, this did indeed happen, with a slightly (but significantly) larger average number of conjoint concepts being brought forward in Version 1 than in Version 3.
Figure 14 Distribution of Concepts Judged to be "a Possibility"
(The figure shows the cumulative distribution of the number of shed concepts, out of a maximum of 28, judged to be "a possibility" in the Screening section, by questionnaire version. Version 1 / Conservative yielded larger numbers of "possibilities" (mean = 15.7) and Version 3 / Aggressive smaller numbers (mean = 14.3), with Version 2 at 15.0 and the overall mean at 15.0. Differences in the mean number of "possibilities" among the three ACBC versions are significant (P=.034). Lifetime Shed Conjoint Study 2013; Survey Sampling Inc., nationwide, n=643.)
Validation Using In-sample Holdout Questions
To validate the results of this survey experiment we used four in-sample holdout questions containing three product concepts each. The same set of holdouts was administered to all respondents in each of the three versions of the survey instrument. We also used a test-retest verification procedure, where the same set of four questions was repeated immediately, but with the order of presentation of concepts in each question scrambled. A summary of this Holdout Test-Retest procedure is included in the table below:
In order to make the holdouts as realistic as possible, we generated the product configuration scenarios using real-life product options and prices (including extensive level overlap, in some cases). Of the 12 concepts shown, five were plastic sheds (three based on current Lifetime models and the other two based on typical competitor offerings), four were wooden sheds (competitors), and three were sheet metal sheds (competitors). In an effort to test realistic level overlap, the first holdout task (repeated in scrambled order in the fifth task) contained a Lifetime plastic shed and a smaller wooden shed, both with the same retail price. Likewise, in the second
(and sixth) task Lifetime and one of its key plastic competitors were placed head-to-head. In keeping with marketplace realities, both of these models were very similar, with the Lifetime offering having a $100 price premium to account for three relatively minor product feature upgrades. (As will be seen shortly, this scenario made for a difficult decision for many respondents, and they were not always consistent during the test-retest reality check.) To illustrate these holdout task-design considerations, two examples are shown below: Figure 15 (which compares Holdout Tasks #1 and #5) and Figure 16 (which compares Holdout Tasks #2 and #6). Figure 15 In-Sample Holdout Tasks #1 & #5
In-Sample Holdout Tasks #1 & #5 (Lifetime 8x10 Shed)
Task #1 (First Phase)
            Version 1   Version 2   Version 3   TOTAL
Concept 1   19.5%       16.2%       11.4%       15.6%
Concept 2   22.9%       20.0%       23.2%       22.1%
Concept 3   57.6%       63.8%       65.4%       62.4%
Task #5 (Second Phase/Scrambled)
            Version 1   Version 2   Version 3   TOTAL
Concept 1   21.5%       20.0%       20.6%       20.7%
Concept 2   61.0%       64.8%       64.0%       63.3%
Concept 3   17.6%       15.2%       15.4%       16.0%
This set of holdout concepts nominally had the most accurate test-retest results of the four—and provided the best predictive ability for the conjoint model as well. Note that the shares of preference for the Lifetime 8x10 Shed are within one percentage point of each other. Also, note that the Lifetime shed was heavily preferred over the comparably priced (but smaller and more sparsely featured) wooden shed.
Figure 16 In-Sample Holdout Tasks #2 & #6
In-Sample Holdout Tasks #2 & #6 (Lifetime 7x7 Shed vs. a close competitor to the Lifetime 7x7)
Task #2 (First Phase)
            Version 1   Version 2   Version 3   TOTAL
Concept 1   19.5%       25.7%       27.6%       24.4%
Concept 2   38.0%       35.7%       32.9%       35.5%
Concept 3   42.4%       38.6%       39.5%       40.1%
Task #6 (Second Phase/Scrambled)
            Version 1   Version 2   Version 3   TOTAL
Concept 1   36.1%       31.9%       32.5%       33.4%
Concept 2   30.2%       31.9%       31.1%       31.1%
Concept 3   33.7%       36.2%       36.4%       35.5%
Holdouts 3 & 7 and 4 & 8 had moderately reliable replication rates. This set of holdout concepts (Tasks #2 & #6) nominally had the least accurate test-retest results of the four and provided the worst predictive ability for the conjoint model. Note that the shares of preference for the Lifetime 7x7 Shed and those of the competitor 7x7 shed varied more between the Test and Retest phases than the task in Figure 15. It is also interesting to note that the competitor shed appeared to pick up substantial share of preference in the Retest phase when both products are placed side-by-side in the choice task. This suggests that, when the differences are well understood, consumers may not evaluate the $100 price premium for the more fully featured Lifetime shed very favorably.
Test Settings and Conditions
Here are the key settings and conditions of the Shed conjoint experiment and validation:
Randomized First Choice simulation method:
o Scale factor (Exponent within the simulator) adjusted within version to minimize errors of prediction (Version 1 = 0.55, Version 2 = 0.25, Version 3 = 0.35)
o Root Mean Square Error (rather than Mean Absolute Error) used in order to penalize extreme errors
Piecewise price function with eight segments, with the price function constrained to be negative
Hit Rates used all eight holdout concepts (both Test and Retest phases together)
Deleted 72 bad cases prior to final validation:
o "Speeders" (less than four minutes) and "Sleepers" (more than one hour)
o Poor Holdout Test-Retest consistency (fewer than two out of four tasks consistent)
o Discrimination/straight-line concerns (panelist designated no concepts or all 28 concepts as "possibilities" in the Screening section)
Note that the cumulative impact of all these adjustments on Hit Rates was about +4 to +5 percentage points (i.e., moved from the low 60%'s to the mid 60%'s, as shown below). The validation results for each survey version—and each holdout set—are provided in Figure 17. In each case, the Root Mean Square Error and Hit Rate are reported, along with the associated Test-Retest Rate (TRR).
Figure 17 Summary of Validation Results
                                     Root MSE   Hit Rate   Test-Retest Rate
Across All 4 Sets of Holdouts:
  Version 1 / Conservative           4.4        64%        82%
  Version 2 / Moderate               6.5        65%        81%
  Version 3 / Aggressive             6.5        69%        85%
Across All 3 Questionnaire Versions:
  Holdouts 1 & 5                     5.2        75%        87%
  Holdouts 2 & 6                     4.9        54%        73%
  Holdouts 3 & 7                     3.8        68%        85%
  Holdouts 4 & 8                     8.6        66%        85%
OVERALL                              5.9        66%        83%
Here are some observations regarding the results on this table:
Overall, the Relative Hit Rate (RHR) was about as expected (66% HR / 83% TRR = 80% RHR).
The nominally increasing hit rate from Version 1 to Version 3 was not expected (we had expected it to decrease).
There was a general lack of consistency between Root MSEs and Hit Rates, suggesting a lack of discernible impact of ACBC design (as tested) on precision of estimation.
Holdouts #2 & #6 were the most difficult for respondents to deal with, with a substantially lower Hit Rate and Test-Retest Rate (but, interestingly, not the highest RMSE!).
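Before turning to the regression model, here is a minimal sketch of how metrics of this kind (hit rate, test-retest rate, and share-of-preference RMSE) might be computed; the choice arrays are simulated placeholders, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(3)
n_resp, n_tasks, n_concepts = 571, 4, 3

# Hypothetical arrays: the concept each respondent chose in the Test and Retest
# phases of each holdout task, and the concept the simulator predicts for them.
test_choice = rng.integers(0, n_concepts, size=(n_resp, n_tasks))
retest_choice = np.where(rng.random((n_resp, n_tasks)) < 0.8, test_choice,
                         rng.integers(0, n_concepts, size=(n_resp, n_tasks)))
predicted = np.where(rng.random((n_resp, n_tasks)) < 0.6, test_choice,
                     rng.integers(0, n_concepts, size=(n_resp, n_tasks)))

hit_rate = np.mean(predicted == test_choice)
trr = np.mean(retest_choice == test_choice)
print(f"hit rate {hit_rate:.0%}, test-retest rate {trr:.0%}, relative hit rate {hit_rate / trr:.0%}")

# Aggregate-level error for one task: RMSE between simulated and observed shares (in points).
observed = np.bincount(test_choice[:, 0], minlength=n_concepts) / n_resp * 100
simulated = np.bincount(predicted[:, 0], minlength=n_concepts) / n_resp * 100
print(f"share-of-preference RMSE (task 1): {np.sqrt(np.mean((simulated - observed) ** 2)):.1f}")
```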
In order to determine statistically whether adjustments in the ACBC design had a significant impact on our ability to predict respondents' choices, we generated the following regression model, wherein we tested for significance of the two categorical (dummy) variables representing incremental departures from the near-neighbor ACBC base case. We also controlled for overall error effects, as measured by the holdout Test-Retest Rate. (See Figure 18.)

Figure 18. Shed ACBC Hit Rate Model

Hit Rate = f(ACBC Version, Test-Retest Rate)

Empirical Regression Model:
HR {0-1} = .34 + .015 (V2 {0,1}) + .040 (V3 {0,1}) + .36 (TRR {0-1})

Note the implied constants: V1 (Conservative) = .34; V2 (Moderate) = .355; V3 (Aggressive) = .38

Model Significance Overall (F)     P=.000    Adjusted R2 = 0.070
V2 (Dummy) Coefficient (t)         P=.568
V3 (Dummy) Coefficient (t)         P=.114
Test-Retest Coefficient (t)        P=.000
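As a hedged illustration only, a respondent-level regression of this form might be fit as in the sketch below. The data are simulated placeholders (not the study's data), and plain least squares with t-tests stands in for whatever package was actually used.

```python
import numpy as np
from scipy import stats

# Hypothetical respondent-level data: hit rate (0-1), version, test-retest rate (0-1).
rng = np.random.default_rng(1)
n = 571
version = rng.integers(1, 4, size=n)                       # 1, 2, or 3
trr = rng.uniform(0.25, 1.0, size=n)
hit_rate = np.clip(0.34 + 0.015*(version == 2) + 0.040*(version == 3)
                   + 0.36*trr + rng.normal(0, 0.3, size=n), 0, 1)

# Design matrix: intercept, V2 dummy, V3 dummy, TRR.
X = np.column_stack([np.ones(n), (version == 2).astype(float),
                     (version == 3).astype(float), trr])
y = hit_rate

# Ordinary least squares with t-tests on each coefficient.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df = n - X.shape[1]
sigma2 = resid @ resid / df
cov = sigma2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df)

for name, b, p in zip(["Intercept", "V2", "V3", "TRR"], beta, p_values):
    print(f"{name:9s} b = {b:6.3f}  p = {p:.3f}")
```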
Here are our observations on this regression model:
Hit Rates increased only 1.5% and 4.0% from Version 1 (base case) to Versions 2 and 3, respectively. Neither one of these coefficients was significant at the .05 level.
The error-controlling variable (Test-Retest Rate) was significant—with a positive coefficient—suggesting that hit rates go up as respondents pay closer attention to the quality and consistency of their responses.
Of course, the overall model is significant, but only because of the Test-Retest controlling variable.
Questionnaire version (i.e., aggressiveness of ACBC design) does NOT have a significant impact on Hit Rates.
KEY TAKEAWAYS Validation procedures verify that the overall predictive value of the 2013 Storage Shed ACBC Study is reasonable. The overall Relative Hit Rate was .80 (or .66 / .83), implying that model predictions are 80% as good as the Test-Retest rate. Root MSEs were about 6.0—with corresponding MAEs in the 4–5 range—which is generally similar to other studies of this nature.
The evidence does not support the notion that Version 1 (the current, conservative ACBC design) provides the most valid results. Even controlling for test-retest error rates, there was no statistical difference in hit rates among the three ACBC design approaches. (In fact, if there were an indication of one design being more accurate than the others, one might argue that it could be in favor of Version 3, the most aggressive approach, with its nominally positive coefficient.) While this apparent lack of differentiation in results among the three ACBC designs could be disappointing from a theoretical point of view, there are some positive implications:

For Lifetime Products: Since differences in predictive ability among the three test versions were not significant, we can combine data sets (n=571) for better statistical precision for client simulation applications.

For Sawtooth Software—and the Research Community in general: The conclusion that there are no differences in predictive ability, despite using a variety of conjoint design settings, could be a good "story to tell" about the robustness of ACBC procedures in different design settings. This is especially encouraging, given the prospect of even more-improved design efficiencies with the upcoming Version 8.3 of ACBC.

Lifetime Products' experiences and learnings over the past few years suggest several key takeaways, particularly for new practitioners of conjoint analysis and other quantitative marketing tools.
Continue to explore and experiment with conjoint capabilities and design options.
Look for new applications for conjoint-driven market simulations.
Continuously improve your conjoint capabilities.
Don’t let it get too routine! Treat your conjoint work with academic rigor.
Robert J. Goodwin
Special acknowledgement and thanks to: Bryan Orme (Sawtooth Software Inc.); Paul Johnson, Tim Smith & Gordon Bishop (Survey Sampling International); Chris Chapman (Google Inc.); Clint Morris & Vince Rhoton (Lifetime Products, Inc.)
CAN CONJOINT BE FUN?: IMPROVING RESPONDENT ENGAGEMENT IN CBC EXPERIMENTS JANE TANG ANDREW GRENVILLE VISION CRITICAL
SUMMARY Tang and Grenville (2010) examined the tradeoff between the number of choice tasks and the number of respondents for Choice Based Conjoint (CBC) studies in the era of on-line panels. The results showed that respondents become less engaged in later tasks. Increasing the number of choice tasks brought limited improvement in the model's ability to predict respondents' behavior, and actually decreased model sensitivity and consistency. In 2012, we looked at how shortening CBC exercises impacts the individual-level precision of HB models, with a focus on the development of market segmentation. We found that using a slightly smaller number of tasks was not harmful to the segmentation process. In fact, under most conditions, a choice experiment using only 10 tasks was sufficient for segmentation purposes. However, a CBC exercise with only 8 to 10 tasks is still considered boring by many respondents.

In this paper, we looked at two ideas that may be useful in improving respondents' enjoyment level:
1. Augmenting the conjoint exercise using adaptive/tournament based choices.
2. Sharing the results of the conjoint exercise.

Both of these interventions turn out to be effective, but in different ways. The adaptive/tournament tasks make the conjoint exercise less repetitive, and at the same time provide a better model fit and more sensitivity. Sharing results has no impact on the performance of the model, but respondents did find the study more "fun" and more enjoyable to complete.

We encourage our fellow practitioners to review conjoint exercises from the respondent's point of view. There are many simple things we can do to make the exercise appealing, and perhaps even add some "fun." While these new approaches may not yield better models, simply giving the respondent a more enjoyable experience, and by extension making him a happier panelist (and one who is less likely to quit the panel), would be a goal worth aiming for.
1. INTRODUCTION In the early days of CBC, respondents were often recruited into “labs” to complete questionnaires, either on paper or via a CAPI device. They were expected to take up to an hour to complete the experiment and were rewarded accordingly. The CBC tasks, while more difficult to complete than other questions (e.g., endless rating scale questions), were considered interesting by the respondents. Within the captive environment of the lab, respondents paid attention to the attributes listed and considered tradeoffs among the alternatives. Fatigue still crept in, but not until after 20 or 30 such tasks.
Johnson & Orme (1996) was the earliest paper the authors are aware of to address the suitable length of a CBC experiment. The authors determined that respondents could answer at least 20 choice tasks without degradation in data quality. Hoogerbrugge & van der Wagt (2006) was another paper to address this issue. It focused on holdout task choice prediction. They found that 10–15 tasks are generally sufficient for the majority of studies. The increase in hit rates beyond that number was minimal. Today, most CBC studies are conducted online using panelists as respondents. CBC exercises are considered a chore. In the verbatim feedback from our panelists, we see repeated complaints about the length and repetitiveness of choice tasks. Tang and Grenville (2010) examined the tradeoff between the number of choice tasks and the number of respondents in the era of on-line panels. The results showed that respondents became less engaged in later tasks. Therefore, increasing the number of choice tasks brought limited improvement in the model’s ability to predict respondents’ behavior, and actually decreased model sensitivity and consistency. In 2012, we looked at how shortening CBC exercises affected the individual-level precision of HB models, with a focus on the development of market segmentation. We found that using a slightly smaller number of tasks was not harmful to the segmentation process. In fact, under most conditions, a choice experiment using only 10 tasks was sufficient for segmentation purposes. However, a CBC exercise with only 8 to 10 tasks is still considered boring by many respondents. The GreenBook blog noted this, citing CBC tasks as number four in a list of the top ten things respondents hate about market research studies.
http://www.greenbookblog.org/2013/01/28/10-things-i-hate-about-you-by-mr-r-e-spondent/
2. WHY "FUN" MATTERS An enjoyable respondent survey experience matters in two ways: Firstly, when respondents are engaged they give better answers that show more sensitivity, less noise and more consistency. In Suresh & Conklin (2010), the authors observed that, faced with the same CBC exercise, those respondents who received the more complex brand attribute section chose "none" more often and had more price order violations. In Tang & Grenville (2010), we observed that later choice tasks result in more "none" selections. When exposed to a long choice task, the respondents' choices contained more noise, resulting in less model sensitivity and less consistency (more order violations). Secondly, today a respondent is often a panelist. A happier respondent is more likely to respond to future invites from that panel. From a panelist retention point of view, it is important to ensure a good survey experience. We at Vision Critical are in a unique position to observe this dynamic. Vision Critical's Sparq software enables brands to build insight communities (a.k.a. brand panels). Our clients not only use our software, but often sign up for our service in recruiting and maintaining the panels. From a meta-analysis of 393 panel satisfaction surveys we
conducted for our clients, we found that "Survey Quality" is the Number Two driver of panelist satisfaction, just behind "your input is valued" and ahead of the incentives offered.

Relative importance of panel service attributes:
  The input you provide is valued                      16%
  The quality of the studies you receive               15%
  The study topics                                     15%
  The incentives offered by the panel                  13%
  The newsletters / communications that you receive    12%
  The look and feel of studies                          9%
  The length of each study                              8%
  The frequency of the studies                          8%
  The amount of time given to respond to studies        6%
There are many aspects to survey quality, not the least of which is producing a coherent and logical survey instrument/questionnaire and having it properly programmed on a webpage. A “fun” and enjoyable survey experience also helps to convey the impression of quality.
3. OUR IDEAS There are many ways a researcher can create a “fun” and enjoyable survey. Engaging question types that make use of rich media tools can improve the look and feel of a webpage on which the question is presented, and make it easier for the respondent to answer those questions. Examples of that can be found in Reid et al. (2007).
Aside from improving the look, feel and functionality of the webpages, we can also change how we structure the questions we ask to make the experience more enjoyable. Puleson & Sleep’s (2011) award winning ESOMAR congress paper gives us two ideas.
The first is introducing a game-playing element into our questioning. In the context of conjoint experiments, we consider how adaptive choice tasks could be used to achieve this. We can structure conjoint tasks to resemble a typical game, so the tasks become harder as one progresses through the levels. Orme (2006) showed how this could be accomplished in an adaptive MaxDiff experiment. A MaxDiff experiment is one in which a respondent is shown a small set of options, each described by a short description, and asked to choose the option he prefers most as well as the option he prefers least. In a traditional MaxDiff, this task is followed by many more sets of options, with all the sets having the same number of options. In an Adaptive MaxDiff experiment, this series of questioning is done in stages. While the respondents see the traditional MaxDiff tasks in the first stage, those options chosen as preferred "least" in stage 1 are dropped off in stage 2. The options chosen as preferred "least" in stage 2 are dropped off in stage 3, etc. The numbers of options used in the comparison in each stage get progressively smaller, so there are changes in the pace of the questions. Respondents can also see how their choices result in progressively more difficult comparisons. At the end, only the favorites are left to be pitted against each other. Orme (2006) showed that respondents thought this experience was more enjoyable.

This type of adaptive approach is also at work in Sawtooth Software's Adaptive CBC (ACBC) product. The third step in an ACBC experiment is a choice tournament based on all the product configurations in a respondent's consideration set. Tournament Augmented Conjoint (TAC) has been tried before by Chrzan & Yardley (2009). In their paper, the authors added a series of tournament tasks to the existing CBC tasks. However, as the CBC section was already quite lengthy, with accurate HB estimates, the authors concluded that the additional TAC tasks provided only modest and non-significant improvements, which did not justify the extra time it took to complete the questionnaire. However, we hypothesize that if we have a very short CBC exercise and make the tournament tasks quite easy (i.e., pairs), the tournament tasks may bring more benefits, or at least be more enjoyable for the panelists.

Our second idea comes from Puleson & Sleep (2011), who offered respondents a "two-way conversation." From Vision Critical's panel satisfaction research, we know that people join panels to provide their input. While respondents feel good about the feedback they provide, they want to know that they have been heard. Sharing the results of studies they have completed is a tangible demonstration that their input is valued. Most panel operators already do this, providing feedback on the survey results via newsletters and other engagement tools. However, we can go further. News media websites often have quick polls where they pose a simple question to anyone visiting the website. As soon as a visitor provides her answer, she can see the results from all the respondents thus far. That is an idea we want to borrow.

Dahan (2012) showed an example of personalized learning from a conjoint experiment. A medical patient completed a series of conjoint tasks. Once he finished, he received the results outlining his most important outcome criterion. This helped the patient to communicate his needs and concerns to his doctors. It could also help him make future treatment decisions.
Something like this could be useful for us as well.
4. FIELD EXPERIMENT We chose a topic that is of lasting interest to the general population: dating. We formulated a conjoint experiment to determine what women were looking for in a man.

Cells

The experiment was fielded in May 2013 in Canada, the US, the UK and Australia. We had a sample size of n=600 women in each country. In each country, respondents were randomly assigned to one of four experimental cells:

  CBC (8 Tasks, Triples)                                               n=609
  CBC (8 Tasks, Triples) + Shareback                                   n=623
  CBC (5 Tasks, Triples) + Tournament (4 Tasks, Pairs)                 n=613
  CBC (5 Tasks, Triples) + Tournament (4 Tasks, Pairs) + Shareback     n=618

While the CBC-only cells received 8 choice tasks, all of them were triples. The Tournament cells had 9 tasks: 5 triples and 4 pairs. The amount of information, based on the number of alternatives seen by each respondent, was approximately the same in all the cells. We informed the respondents in the Shareback cells at the start of the interview that they would receive the results from the conjoint experiment after it was completed.

Questionnaire
The questionnaire was structured as follows:
1. Intro/Interest in topic
2. All about you: Demos/Personality/Preferred activity on dates
3. What do you look for in a man?
   o Personality
   o BYO: your ideal "man"
4. Conjoint Exercise per cell assignment
5. Share Back per cell assignment
6. Evaluation of the study experience

A Build-Your-Own (BYO) task in which we asked the respondents to tell us about their ideal "man" was used to educate the respondents on the factors and levels used in the experiment. Tang & Grenville (2009) showed that a BYO task was effective in preparing respondents for making choice decisions. Vision Critical's standard study experience module was used to collect the respondents' evaluation data. This consisted of 4 attribute ratings measured on a 5-point agreement scale, and any volunteered open-ended verbatim comments on the study topic and survey experience. The four attribute ratings were:
  Overall, this survey was easy to complete
  I enjoyed filling out this survey
  I would fill out a survey like this again
  The time it took to complete the survey was reasonable
Factors & Levels
The following factors were included in our experiment. Note that body type images were used in the BYO task only, not in the conjoint tasks.

  Age: Much older than me / A bit older than me / About the same age / A bit younger than me / Much younger than me
  Height: Much taller than me / A little taller than me / Same height as me / Shorter than me
  Body Type (images shown at the BYO question only): Big & Cuddly / Big & Muscly / Athletic & Sporty / Lean & Fit
  Career: Driven to succeed and make money / Works hard, but with a good work/life balance / Has a job, but it's only to pay the bills / Prefers to find work when he needs it
  Activity: Exercise fanatic / Active, but doesn't overdo it / Prefers day to day life over exercise
  Attitude towards Family/Kids: Happy as a couple / Wants a few kids / Wants a large family
  Personality: Reliable & Practical / Funny & Playful / Sensitive & Empathetic / Serious & Determined / Passionate & Spontaneous
  Flower Scale: Flowers, even when you are not expecting / Flowers for the important occasions / Flowers only when he's saying sorry / "What are flowers?"
  Yearly Income: Pretty low / Low middle / Middle / High middle / Really high, shown in local currency:
    Australia: Under $50,000 / $50,000 to $79,999 / $80,000 to $119,999 / $120,000 to $159,999 / $160,000 or more
    US/Canada: Under $30,000 / $30,000 to $49,999 / $50,000 to $99,999 / $100,000 to $149,999 / $150,000 or more
    UK: Under £15,000 / £15,000 – £39,999 / £40,000 – £59,999 / £60,000 – £99,999 / £100,000 or more
Screen Shots
A CBC task was presented to the respondent as follows:
The adaptive/tournament tasks were formulated as follows. The 5 winners from the CBC tasks were randomly ordered and labeled item 1 to item 5:

  Set 1: Item 1 vs. Item 2 (drop the loser from Set 1)
  Set 2: Item 3 vs. Item 4 (drop the loser from Set 2)
  Set 3: Item 5 vs. Winner from Set 1 (drop the loser from Set 3)
  Set 4: Winner from Set 2 vs. Winner from Set 3
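A minimal sketch of that pairing logic follows. The item labels and the simulated choice function are hypothetical; in the real study the respondent, not a random draw, picks the winner of each pair.

```python
import random

def run_tournament(cbc_winners, rng=random.Random(42)):
    """Schedule the 4 tournament pairs from the 5 CBC winners, as described above:
    Set 1: item1 vs item2; Set 2: item3 vs item4; Set 3: item5 vs winner(Set 1);
    Set 4: winner(Set 2) vs winner(Set 3). Losers are dropped after each set."""
    items = list(cbc_winners)
    rng.shuffle(items)                 # randomly order the 5 winners
    i1, i2, i3, i4, i5 = items

    def ask(a, b):
        # Placeholder for showing the pair to the respondent; here the choice is simulated.
        return a if rng.random() < 0.5 else b

    w1 = ask(i1, i2)                   # Set 1
    w2 = ask(i3, i4)                   # Set 2
    w3 = ask(i5, w1)                   # Set 3
    return ask(w2, w3)                 # Set 4: the overall favourite

print(run_tournament(["concept A", "concept B", "concept C", "concept D", "concept E"]))
```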
The tournament task was shown as:
The personalized learning page was shown as follows:
Personalized learning was based on frequency counts only. For each factor, we counted how often each level was presented to that respondent and how often it was chosen when presented. The most frequently chosen level was presented back to the respondent. These results were presented mostly for fun—the counting analysis was not the best way to provide this kind of individual feedback. The actual profiles presented to each individual respondent in her CBC tasks were not perfectly balanced; the Tournament cells, where the winners were presented to each respondent, would also have added bias to the counting analysis. If we wanted to focus on getting accurate individual results, something like an
individual level logit model would be preferred. However, here we felt the simple counting method would be sufficient and it was easy for our programmers to implement. Each respondent in the Shareback cells was also shown the aggregate results from her fellow countrywomen who had completed the survey thus far in her experiment cell.
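A rough sketch of that counting analysis is shown below. It assumes a per-respondent record of which levels were shown in each alternative and which alternative was chosen; the data structure and example values are hypothetical.

```python
from collections import defaultdict

def most_chosen_levels(tasks):
    """tasks: list of (alternatives, chosen_index), where each alternative is a
    dict mapping factor -> level. Returns, per factor, the level with the highest
    chosen/shown ratio for this respondent (the simple counting analysis)."""
    shown = defaultdict(lambda: defaultdict(int))
    chosen = defaultdict(lambda: defaultdict(int))
    for alternatives, pick in tasks:
        for i, alt in enumerate(alternatives):
            for factor, level in alt.items():
                shown[factor][level] += 1
                if i == pick:
                    chosen[factor][level] += 1
    return {f: max(levels, key=lambda lv: chosen[f][lv] / levels[lv])
            for f, levels in shown.items()}

# Hypothetical two-task example for one respondent.
tasks = [
    ([{"Age": "About the same age", "Personality": "Funny & Playful"},
      {"Age": "Much older than me", "Personality": "Reliable & Practical"}], 0),
    ([{"Age": "About the same age", "Personality": "Serious & Determined"},
      {"Age": "A bit younger than me", "Personality": "Funny & Playful"}], 1),
]
print(most_chosen_levels(tasks))
```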
5. RESULTS We built HB models for each of the 4 experimental cells separately. Part-worth utilities were estimated for all the factors. Sawtooth Software’s CBC/HB product was used for the estimation. Model Fit/Hit Rates
We deliberately did not design holdout tasks for this study. We wanted to measure the results of making a study engaging, and using holdout tasks makes the study take longer to complete, which tends to have the opposite effect. Instead of purposefully designed holdout tasks, we randomly held out one of the CBC tasks to measure model fit. Since respondents tend to spend a much longer time on their first choice task, we decided to exclude the 1st task for this purpose. For each respondent, one of her 2nd, 3rd, 4th and 5th CBC tasks was randomly selected as the holdout task. The hit rates for the Tournament cells (63%) were much higher than those for the CBC cells (54%). That result was surprising at first, since we would expect no significant improvement in model performance for the Tournament cells. However, while the randomly selected holdout task was held out from the model, the winner from that task was still included in the tournament tasks, which may explain the increased performance. In order to avoid any influence of the random holdout task, we reran the models for the tournament cells again, holding out information from the holdout task itself, and any tournament tasks related to its winner. The new hit rates (52%) are now comparable to those of the CBC cells.
However, by holding out not only the selected random holdout task, but also at least one and potentially as many as three out of the four tournament tasks, we might have gone too far in withholding information from the modeling. Had full information been used in the modeling, we expect the tournament cells would have a better model fit and be better able to predict respondent’s choice behavior. Respondents seem to agree with this. Those who participated in the tournament thought we did a better job of presenting them their personalized learning information. While this information is based on a crude counting analysis and has potential bias issues, it is still comforting to see this result.
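For reference, the hit-rate calculation used above is simple. The sketch below assumes individual part-worths and a coded design matrix for each respondent's held-out task; all names and values are hypothetical.

```python
import numpy as np

def hit_rate(partworths, holdout_designs, holdout_choices):
    """partworths: (respondents x parameters); holdout_designs: (respondents x
    alternatives x parameters), the coded design of each respondent's held-out task;
    holdout_choices: observed chosen alternative index per respondent."""
    hits = 0
    for u, X, choice in zip(partworths, holdout_designs, holdout_choices):
        predicted = int(np.argmax(X @ u))   # alternative with the highest total utility
        hits += (predicted == choice)
    return hits / len(holdout_choices)

# Hypothetical example: 3 respondents, triples, 5 coded parameters.
rng = np.random.default_rng(0)
pw = rng.normal(size=(3, 5))
designs = rng.integers(0, 2, size=(3, 3, 5)).astype(float)
choices = [int(np.argmax(X @ u)) for u, X in zip(pw, designs)]  # perfect hits here
print(hit_rate(pw, designs, choices))   # -> 1.0
```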
The improvement in model fit is also reflected in a higher scale parameter in the model, with the tournament cells showing stronger preferences, i.e., less noise and higher sensitivity. The graph below shows the simulated preference shares for a "man" with each factor set to the designated level, one level at a time (holding all other factors at neutral). The shares are rescaled so that they average to 0 across the levels within each factor.
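How such one-level-at-a-time shares might be computed and centered is sketched below. This is an assumption-laden illustration, not the authors' simulator: it treats each level's share as a binary logit share against a neutral (zero-utility) reference and then centers the averaged shares within the factor.

```python
import numpy as np

def level_shares(level_utils):
    """level_utils: (respondents x levels) part-worths for one factor.
    For each level, simulate the share of preference for a product carrying that level
    against a neutral reference (utility 0), average over respondents, then center
    the shares within the factor so they average to zero."""
    shares = 1.0 / (1.0 + np.exp(-level_utils))   # binary logit vs. neutral reference
    mean_shares = shares.mean(axis=0)             # average across respondents
    return mean_shares - mean_shares.mean()       # center within the factor

# Hypothetical HB part-worth draws for the 5 Age levels, 2,400 respondents.
rng = np.random.default_rng(3)
age_utils = rng.normal(size=(2400, 5))
print(level_shares(age_utils))                    # centered shares, averaging to ~0
```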
“Fun” & Enjoyment
Respondents had a lot of fun during this study. The top-box ratings for all 4 items track much higher than the ratings for the congressional politics CBC study used in our 2010 experiment. Disappointingly, there are not any differences across the 4 experimental cells among these ratings. We suspect this is due to the high interest in the topic of dating and the fact that we went out of our way to make the experience a good one for all the cells. Had we tested these interventions in a less interesting setting (e.g., smartphone), we think we would have seen larger effects. Interestingly, we saw significant differences in the volunteered open-ended verbatim answers from respondents. Many of these verbatim answers are about how they enjoyed the study experience and had fun completing the survey. Respondents in the Shareback cells volunteered more comments and more "fun"/enjoyment comments than the non-shareback cells.
While an increase of 6.7% to 9.0% appears to be only a small improvement, given that only 13% of the respondents volunteered any comments at all across the 4 cells, this reflects a sizeable change.
6. CONCLUSIONS & RECOMMENDATION Both of these interventions are effective, but in different ways. The adaptive/Tournament tasks make the conjoint exercise less repetitive and less tedious, and at the same time provide better model fit and more sensitivity in the results. While sharing results has no impact on the performance of the model, the respondents find the study more fun and more enjoyable to complete. Should we worry about introducing bias with these methods? The answer is no. Adaptive methods have been shown to give results consistent with the traditional approaches in many different settings, both for Adaptive MaxDiff (Orme 2006) and numerous papers related to ACBC. Aside from the scale difference, our results from the Tournament cells are also consistent with that from the traditional CBC cells. Advising respondents that we would share back the results of the findings also had no impact on their choice behaviors.
We encourage fellow practitioners to review conjoint exercises from the respondent’s point of view. There are many simple things we can do to make the exercise appealing, and perhaps even add “fun.” While these new approaches may not yield better models, simply giving the respondent a more enjoyable experience, and by extension making him a happier panelist, would be a goal worth aiming for. In the words of a famous philosopher:
While conjoint experiments may not be enjoyable by nature, there is no reason respondents cannot have a bit of fun in the process.
Jane Tang
REFERENCES Chrzan, K. & Yardley, D. (2009), “Tournament-Augmented Choice-Based Conjoint” Sawtooth Software Conference Proceedings. Dahan, E. (2012), “Adaptive Best-Worst Conjoint (ABC) Analysis” Sawtooth Software Conference Proceedings. Hoogerbrugge, M. and van der Wagt, K. (2006), “How Many Choice Tasks Should We Ask?” Sawtooth Software Conference Proceedings. Johnson, R. and Orme, B. (1996), “How Many Questions Should You Ask In Choice-Based Conjoint Studies?” ART Forum Proceedings. Orme, B. (2006), “Adaptive Maximum Difference Scaling” Sawtooth Software Technical Paper Library Puleson, J. & Sleep, D. (2011), “The Game Experiments: Researching how gaming techniques can be used to improve the quality of feedback from on-line research” ESOMAR Congress 2011 Proceedings
Reid, J., Morden, M. & Reid, A. (2007), "Maximizing Respondent Engagement: The Use of Rich Media" ESOMAR Congress. Full paper can be downloaded from http://vcu.visioncritical.com/wp-content/uploads/2012/02/2007_ESOMAR_MaximizingRespondentEngagement_ORIGINAL-1.pdf
Suresh, N. and Conklin, M. (2010), "Quantifying the Impact of Survey Design Parameters on Respondent Engagement and Data Quality" CASRO Panel Conference.
Tang, J. and Grenville, A. (2009), "Influencing Feature Price Tradeoff Decisions in CBC Experiments," Sawtooth Software Conference Proceedings.
Tang, J. & Grenville, A. (2010), "How Many Questions Should You Ask in CBC Studies?—Revisited Again" Sawtooth Software Conference Proceedings.
Tang, J. & Grenville, A. (2012), "How Low Can You Go?: Toward a better understanding of the number of choice tasks required for reliable input to market segmentation" Sawtooth Software Conference Proceedings.
MAKING CONJOINT MOBILE: ADAPTING CONJOINT TO THE MOBILE PHENOMENON CHRIS DIENER RAJAT NARANG MOHIT SHANT HEM CHANDER MUKUL GOYAL ABSOLUTDATA
(Chris Diener, Senior Vice President; Rajat Narang, Senior Expert; Mohit Shant, Team Lead; Hem Chander, Senior Analyst; Mukul Goyal, Senior Programmer; AbsolutData Intelligent Analytics)
INTRODUCTION: THE SMART AGE With "smart" devices like smartphones and tablets integrating the "smartness" of personal computers, mobiles and other viewing media, a monumental shift has been observed in the usage of smart devices for information access. The sales of smart devices have been estimated to cross the billion mark in 2013. The widespread usage of these devices has impacted the research world too. A study found that 64% of survey respondents preferred smartphone surveys, 79% of them preferring to do so due to the "on-the-go" nature of it (Research Now, 2012). Multiple research companies have already started administering surveys for mobile devices, predominantly designing quick-hit mobile surveys to understand the reactions and feedback of consumers, on-the-go.
Prior research (“Mobile research risk: What happens to data quality when respondents use a mobile device for a survey designed for a PC,” Burke Inc, 2013) has suggested that when comparing the results of surveys adapted for mobile devices to those on personal computers, respondent experience is poorer and data quality is comparable for surveys on mobile and personal computers.
This prior research also discourages the use of complex research techniques like conjoint on the mobile platform. This comes as no surprise, as conjoint has long been viewed as a complex and slightly monotonous exercise from the respondent’s perspective. Mobile platform’s small viewer interface and internet speed can act as potential barriers for using conjoint.
ADAPTING CONJOINT TO THE MOBILE PLATFORM Recognizing the need to reach respondents who are using mobile devices, research companies have introduced three different ways of conducting surveys on mobile platforms—web browser based, app based and SMS based. Of these three, web browser is the most widely used, primarily due to the limited customization required to host the surveys simultaneously on mobile platforms and personal computers. The primary focus of mobile-platform-based surveys
is short and simple surveys like customer satisfaction, initial product reaction, and attitude and usage studies. However, currently the research industry is hesitant to conduct conjoint studies on mobile platform due to concerns with:
  Complexity—conjoint is known to be a complex and intimidating exercise due to the number of tasks and the level of detail shown in the concepts
  Inadequate representation on the small screen—having a large number of long concepts on the screen can affect readability
  Short attention span of mobile users
  Possibility of a conjoint study with a large number of attributes and tasks—if a large number of attributes are used, the entire concept may not be shown on a single screen, requiring the user to scroll
  Penetration of smartphones in a region
In this paper, we hypothesize that all these can be countered (with the exception of smart phone penetration) by focusing on improving the aesthetics and simplifying the conjoint tasks, as illustrated in Figure 1. Our changes include:
  Improving aesthetics
    o Coding the task layout to optimally use the entire screen space
    o Minimum scrolling to view the tasks
    o Reduction of the number of concepts shown on a screen
  Simplifying conjoint tasks
    o Reduction in the number of tasks
    o Reduction of the number of attributes on a screen
    o Using simplified conjoint methodologies
Figure 1
We improve aesthetics using programming techniques. To simplify conjoint tasks and make them more readable to the respondents, we customize several currently used techniques to adapt them to the mobile platform and compare their performance.
CUSTOMIZING CURRENT TECHNIQUES Shortened ACBC
Similar to ACBC, this method uses pre-screening to identify the most important attributes for each respondent. The attributes selected in the pre-screening stage qualify through to the Build Your Own (BYO) section and the Near Neighbor section. We omitted the choice tournament in order to reduce the number of tasks being evaluated. ACBC is known to present simpler tasks and achieve better respondent engagement by focusing on high-priority attributes. We further simplified the ACBC tasks by truncating the list of attributes and hence reducing the length of the concepts on the screen. Also, the number of concepts per screen was reduced to 2 to simplify the tasks for respondents. An example is shown in Figure 2.
Figure 2 Shortened ACBC Screenshot
Pairwise Comparison Rating (PCR)
We customize an approach similar to the Conjoint Value Analysis (CVA) method. Similar to CVA, this method shows two concepts on the screen. We show respondents a 9-point scale and ask them to indicate their preference—4 points on the left/right indicating preference for the product on the left/right respectively. The rating point in the middle implies that neither of the products is preferred. For the purpose of estimation, we convert the rating responses to:
  Discrete Choice—If respondents mark either half of the scale, the data file (formatted as a CHO file for utility estimation with Sawtooth Software) reports that concept as being selected. If they mark the middle rating point, it implies that they have chosen the None option.
  Chip Allocation—We convert the rating given by the respondents into volumetric shares in the CHO file. In the case of a partial liking of a concept (wherein the respondent marked 2/3/4 or 6/7/8), we allocate the rest of the share to the "none" option. So, for example, a rating of 3 would indicate "Somewhat prefer left," so 50 points would go to the left concept and 50 to none. Similarly, a rating of 4 would indicate 25:75 in favor of none.
We include chip allocation to understand the impact of a reduced complexity approach on the results. Also, in estimating results using chip allocation methods, the extent of likeability of the product can also be taken into account (as opposed to single select in traditional CBC). We use two concepts per screen to simplify the tasks for respondents, as shown in Figure 3.
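A small sketch of the rating-to-choice conversion described above follows. The discrete-choice coding and the chip allocations for ratings 3, 4 and 5 come from the text; the remaining ratings are assumed to continue the same 25-point steps, and the actual CHO-file formatting for Sawtooth Software estimation is omitted.

```python
def pcr_to_discrete(rating):
    """Discrete-choice coding: left (0) if rating < 5, right (1) if rating > 5,
    'none' (2) for the middle point of the 9-point scale."""
    if rating < 5:
        return 0
    if rating > 5:
        return 1
    return 2

def pcr_to_allocation(rating):
    """Chip-allocation coding as (left, right, none) points summing to 100.
    From the text: rating 3 -> 50 left / 50 none, rating 4 -> 25 left / 75 none;
    the other ratings are assumed to follow the same 25-point steps."""
    if rating == 5:
        return (0, 0, 100)
    strength = 25 * (5 - rating) if rating < 5 else 25 * (rating - 5)
    left = strength if rating < 5 else 0
    right = strength if rating > 5 else 0
    return (left, right, 100 - strength)

for r in (1, 3, 4, 5, 6, 9):
    print(r, pcr_to_discrete(r), pcr_to_allocation(r))
```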
Figure 3 Pairwise Comparison Rating Screenshot
CBC (3 concepts per screen)
The 3-concept per screen CBC we employ is identical to CBC conducted on personal computers (Figure 4). We do this to compare the data quality across platforms for an identical method. Figure 4 CBC Mobile (3 concepts) Screenshot
CBC (2 concepts per screen)
Similar to traditional CBC, we also include a CBC with only 2 concepts shown per screen. This allows us to understand the result of direct CBC simplification, an example of which is shown in Figure 5.
Figure 5 CBC mobile (2 concepts) Screenshot
Partial Profile
This method is similar to traditional Partial Profile CBC, where we fix the primary attributes on the screen and then rotate a set of secondary attributes. We further simplify the tasks by reducing the length of the concept—showing a truncated list of attributes—and by reducing the number of concepts shown per screen (2 concepts per screen), as is shown in Figure 6. Figure 6 Partial Profile Screenshot
RESEARCH DETAILS The body of data collected to test our hypotheses and address our objectives is taken from quantitative surveys we conducted in the US and India. In each country, we surveyed 1200
respondents (thanks to uSamp and IndiaSpeaks for providing sample in US and India respectively). Each of our tested techniques was evaluated by 200 distinct respondents. Results were also gathered for traditional CBC (3 concepts per screen) administered on personal computer to compare as a baseline of data quality. The topic of the surveys was the evaluation of various brands of tablet devices (a total of 9 attributes were evaluated). In addition to including the conjoint tasks, the surveys also explored respondent reaction to the survey, as well as some basic demographics. The end of survey questions included:
  The respondent's experience in taking the survey on mobile platforms
  Validity of results from the researcher's perspective
  Evaluation of the merits and demerits of each technique
  Evaluation of whether the efficacy of the techniques differs according to the online survey maturity of the region (fewer online surveys are conducted in India than in the US)
RESULTS After removing the speeders and straightliners, we evaluate the effectiveness of the different techniques from two perspectives—the researcher perspective (technical) and respondent perspective (experiential). We compare the results for both countries side-by-side to highlight key differences.
RESEARCHER’S PERSPECTIVE Correlation analysis
Pearson correlations compare the utilities of each of the mobile methods with utilities of CBC on personal computer. The results of the correlations are found in Table 1. Most of the methods with the exception of PCR (estimated using chip allocation) show very high correlation and thus appear to mimic the results displayed by personal computers. This supports the notion that we are getting similar and valid results between personal computer and mobile platform surveys.
Table 1. Correlation analysis with utilities of CBC PC
Holdout accuracy
We placed fixed choice tasks in the middle and at the end of the exercise for each method. Due to the varying nature of the methods (varying numbers of concepts and attributes shown per task), the fixed tasks were not uniform across methods, i.e., each fixed task was altered as per the technique in question. For example, partial profile had only 6 attributes for 2 concepts versus full profile CBC, which had 9 attributes for 3 concepts, and hence the fixed tasks were designed accordingly. As displayed in Table 2, holdout task prediction rates are strong and in the typical expected range. All the methods customized for mobile platforms do better than CBC on personal computers (with the exception of PCR—Chip Allocation). CBC with 3 concepts, either on Mobile or on PC, did equally well. Table 2. Hit Rate Analysis Arrows indicate statistically significant difference from CBC PC
When we adjust these hit rates for the number of concepts presented, discounting the hit rates for tasks with fewer concepts1, the relative accuracy of the 2 versus the 3 concept tasks shifts. The adjusted hit rates are shown in Table 3. With adjusted hit rates, the 3 concept task gains the advantage over 2 concepts. We interpret these unadjusted and adjusted hit rates together to indicate that, by and large, the 2 and 3 concept tasks generate similar hit rates. Also in the larger context of comparisons, except for PCR, all of the other techniques are very comparable to CBC PC and do well. Table 3. Adjusted Hit Rate Analysis
MAE
MAE scores displayed in Table 4 tell a similar story with simplified methods like CBC (2 concepts) and Partial Profile doing better than CBC on personal computer. Table 4. MAE Analysis
1 We divided the hit rates by the probability of selecting each concept on a screen. E.g., for a task with two concepts and a none option, the random probability of selection is 33.33%. Therefore, all hit rates obtained for fixed tasks with two concepts were divided by 33.33% to get the index score.
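The adjustment in this footnote amounts to indexing each hit rate by its chance level, as in the following sketch (the function name is hypothetical).

```python
def adjusted_hit_rate(raw_hit_rate, n_concepts, none_option=True):
    """Index the raw hit rate by the random-choice probability for the task format,
    as described in the footnote (e.g., 2 concepts + none -> 1/3 chance level)."""
    chance = 1.0 / (n_concepts + (1 if none_option else 0))
    return raw_hit_rate / chance

print(adjusted_hit_rate(0.60, 2))   # 2 concepts + none: 0.60 / 0.3333 = 1.8
print(adjusted_hit_rate(0.60, 3))   # 3 concepts + none: 0.60 / 0.25   = 2.4
```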
RESPONDENT’S PERSPECTIVE Average Time
Largely, respondents took more time to evaluate conjoint techniques on the mobile platforms. As displayed in Chart 1, shortened ACBC takes more time for respondents to evaluate, particularly for the respondents from India. This is expected due to the rigorous nature of the method. PCR also took a lot of time to evaluate, especially for the respondents from India. This might indicate that a certain level of maturity is required from respondents for evaluation of complex conjoint techniques. This is reflective of the fact that online survey conduction is still at a nascent stage in India. Respondents took the least amount of time to evaluate CBC Mobile (2 concepts) indicating that respondents can comprehend simpler tasks quicker. Chart 1. Average time taken (in mins)
Readability
Respondents largely find the tasks to be legible on mobile. This might be attributed to the reduced list of attributes being shown on the screen. Surprisingly, as seen in Chart 2, CBC Mobile (3 concepts) also does well on this front, which means that optimizing screen space on mobile can go a long way toward providing readability.
Chart 2. Readability of methods on PC/Mobile screens Arrows indicate statistically significant difference from CBC PC
Ease of understanding
Respondents found the concepts presented on the mobile platform easy to understand, and the degree of understanding is comparable to conjoint on personal computers. Thus, conjoint research can easily be conducted on mobile platforms too. Chart 3. Ease of understanding of methods on PC/Mobile screen Arrows indicate statistically significant difference from CBC PC
Enjoyability
US respondents found the survey to be significantly less enjoyable than their Indian counterparts, as displayed in Chart 4. This might be due to the fact that the online survey market in the US is quite saturated compared to the Indian market, which is still nascent. Therefore, respondent exposure to online surveys might be significantly higher in the US, contributing to the lower enjoyability.
Chart 4. Enjoyability of methods on PC/Mobile screen Arrows indicate statistically significant difference from CBC PC
Encouragement to give honest opinion
Respondents find that all the methods encouraged honest opinions in the survey. Chart 5. Encouragement to give honest opinions Arrows indicate statistically significant difference from CBC PC
Realism of tablet configuration
Respondents believe that the tablet configurations are realistic. As seen in Chart 6, all the methods are more or less at par with CBC on personal computers. This gives us confidence in the results because the same tablet configuration was used in all techniques.
Chart 6. Realism of tablet configuration Arrows indicate statistically significant difference from CBC PC
SUMMARY OF RESULTS On the whole, all of the methods we customized for the mobile platform did very well in providing good respondent engagement and robust data quality. Although conjoint exercises with 3 concepts perform well on data accuracy parameters, they do not fare as well on respondent experience. However, the negative effect on respondent experience can be mitigated by optimal use of screen space and appealing aesthetics. Our findings indicate that a conjoint exercise with 2 concepts is the best of the alternative methods we tested in terms of enriching data quality as well as user experience. CBC with 2 concepts performs exceptionally well, providing richer data quality and respondent engagement than CBC on personal computers. The time taken to complete the exercise is also at par with that of CBC on PC. PCR (discrete estimation) does fairly well too. However, its practical application might be debated, with other methods being as robust as it, if not more so, and easier to implement. One may consider lowering the number of attributes shown on the screen in conjunction with reducing the number of concepts through the use of partial profile and shortened ACBC exercises. Although these methods score high on data accuracy parameters, respondents find them slightly harder to understand because the full profile of the products being offered is not present. However, once respondents cross the barrier of understanding, these methods prove extremely enjoyable and encourage them to give honest responses. They also take a longer time to evaluate. Therefore, these should be used in studies where the sole component of the survey design is the conjoint exercise.
CONCLUSION This paper shows that researchers can confidently conduct conjoint in mobile surveys. Respondents enjoy taking conjoint surveys on their mobile, probably due to its "on-the-go" nature.
Researchers might want to adopt simple techniques like screen space optimization and simplification of tasks in order to conduct conjoint exercises on mobile platforms. This research also indicates that the data obtained from conjoint on mobile platforms is robust and mirrors data from personal computers to a certain extent (shown by high correlation numbers). This research supports the idea that researchers can probably safely group responses from mobile platforms and personal computers and analyze them without the risk of error.
Chris Diener
CHOICE EXPERIMENTS IN MOBILE WEB ENVIRONMENTS JOSEPH WHITE MARITZ RESEARCH
BACKGROUND Recent years have witnessed the rapid adoption of increasingly mobile computing devices that can be used to access the internet, such as smartphones and tablets. Along with this increased adoption we see an increasing proportion of respondents complete our web-based surveys in mobile environments. Previous research we have conducted suggests that these mobile responders behave similarly to PC and tablet responders. However, these tests have been limited to traditional surveys primarily using rating scales and open-ended responses. Discrete choice experiments may present a limitation for this increasingly mobile respondent base as the added complexity and visual requirements of such studies may make them infeasible or unreliable for completion on a smartphone. The current paper explores this question through two large case studies involving more complicated choice experiments. Both of our case studies include design spaces based on 8 attributes. In one we present partial and full profile sets of 3 alternatives, and in the second we push respondents even further by presenting sets of 5 alternatives each. For both case studies we seek to understand the potential impact of conducting choice experiments in mobile web environments by investigating differences in parameter estimates, respondent error, and predictive validity by form factor of survey completion.
CASE STUDY 1: TABLET Research Design
Our first case study is a web-based survey among tablet owners and intenders. The study was conducted in May of 2012 and consists of six cells defined by design strategy and form factor. The design strategies are partial and full profile, and the form factors are PC, tablet, and mobile. The table below shows the breakdown of completes by cell.
                     PC    Tablet   Mobile
  Partial Profile    201    183      163
  Full Profile       202     91      164
Partial profile respondents were exposed to 16 choice sets with 3 alternatives, and full profile respondents completed 18 choice sets with 3 alternatives. The design space consisted of 7 attributes with 3 levels and one attribute with 2 levels. Partial profile tasks presented 4 of the 8 attributes in each task. All respondents were given the same set of 6 full profile holdout tasks after the estimation sets, each with 3 alternatives. Below is a typical full profile choice task.
Please indicate which of the following tablets you would be most likely to purchase.
(Each row lists the three alternatives shown left to right.)

  Operating System: Apple / Windows / Android
  Memory: 8 GB / 64 GB / 16 GB
  Included Cloud Storage (additional at extra cost): 5 GB / 50 GB / None
  Price: $199 / $799 / $499
  Screen Resolution: High definition display (200 pixels per inch) / Extra-high definition display (300 pixels per inch) / High definition display (200 pixels per inch)
  Camera Picture Quality: 5 Megapixels / 0.3 Megapixels / 2 Megapixels
  Warranty: 1 Year / 3 Months / 3 Years
  Screen Size: 7˝ / 10˝ / 5˝
Analysis
By way of analysis, the Swait-Louviere test (Swait & Louviere, 1993) is used for parameter and scale equivalence tests on aggregate MNL models estimated in SAS. Sawtooth Software’s CBC/HB is used for predictive accuracy and error analysis, estimated at the cell level, i.e., design strategy by form factor. In order to account for demographic differences by device type, data are weighted by age, education, and gender, with the overall combined distribution being used as the target to minimize any distortions introduced through weighting. Partial Profile Results
Figure: Pairwise Swait-Louviere comparisons of the partial profile cells (PC, Tablet, Mobile). Parameter agreement R2 values of 0.81, 0.89 and 0.94; relative scale parameters (m) between 0.94 and 1.05; and λA statistics of 36, 62 and 156, each leading to rejection of H1A.

The Swait-Louviere parameter equivalence test is a sequential test of the joint null hypothesis that the scale and parameter vectors of two models are equivalent. In the first step we test for equivalence of parameter estimates allowing scale to vary. If we fail to reject the null hypothesis for this first step then we move to step 2, where we test for scale equivalence.
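A minimal sketch of the stage-1 statistic is shown below, given log-likelihoods from the two separate MNL models and from a pooled model estimated with a relative scale parameter; the log-likelihood values and parameter count are placeholders, not figures from this study.

```python
from scipy.stats import chi2

def swait_louviere_stage1(ll_pooled, ll_group1, ll_group2, k_params, alpha=0.05):
    """Stage-1 likelihood-ratio statistic for parameter equivalence (allowing scale
    to differ): lambda_A = -2 * [LL_pooled(mu_hat) - (LL_1 + LL_2)].
    With a relative scale estimated in the pooled model, df = k_params - 1."""
    lam = -2.0 * (ll_pooled - (ll_group1 + ll_group2))
    df = k_params - 1
    critical = chi2.ppf(1 - alpha, df)
    return lam, critical, lam > critical   # True -> reject H1A (parameters differ)

# Placeholder log-likelihoods and parameter count.
lam, crit, reject = swait_louviere_stage1(-4110.0, -2020.0, -2049.0, k_params=17)
print(f"lambda_A = {lam:.1f}, chi2 critical = {crit:.1f}, reject = {reject}")
```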
In each pairwise test we readily reject the null hypothesis that parameters do not differ beyond scale, indicating that we see significant differences in preferences by device type. Because we reject the null hypothesis in the first stage of the test we are unable to test for
significant differences in scale. However, the relative scale parameters suggest we are not seeing dramatic differences in scale by device type. The chart below shows all three parameter estimates side-by-side. When we look at the parameter detail, the big differences we see are with brand, price, camera, and screen size. Not surprisingly, brand is most important for tablet responders with Apple being by far the most preferred option. It is also not too surprising that mobile responders are the least sensitive to screen size. This suggests we are capturing differences one would expect to see, which would result in a more holistic view of the market when the data are combined.

Chart: Aggregate part-worth estimates by form factor (PP PC, PP Tablet, PP Mobile) across Brand, Hard Drive, Cloud Storage, Price, Resolution, Camera (Mpx), Warranty, and Screen Size.
We used mean absolute error (MAE) and hit rates as measures of predictive accuracy. The HB parameters were tuned to optimize MAE with respect to the six holdout choice tasks by form factor and design strategy. The results for the partial profile design strategy are shown in the table below.

              PP PC    PP Tablet   PP Mobile
  Base          201       183         163
  MAE         0.052      0.058       0.052
  Hit Rate     0.60       0.62        0.63
In terms of hit rate both mobile and tablet responders are marginally better than PC responders, although not significantly. While tablet responders have a higher MAE, mobile responders are right in line with PC. At least in terms of in-sample predictive accuracy it appears that mobile responders are at par with their PC counterparts. We next present out-of-sample results in the table below.

                          Prediction Utilities
  Holdouts    Random      PC      Tablet    Mobile
  PC          0.143      0.052    0.082     0.053
  Tablet      0.160      0.103    0.058     0.074
  Mobile      0.142      0.060    0.077     0.052
  Average                0.082    0.080     0.056
All respondents were presented with the same set of 6 holdout tasks. These tasks were used for out-of-sample predictive accuracy measures by looking at the MAE of PC responders predicting tablet responders holdouts, as an example. In the table above the random column shows the mean absolute deviation from random of the choices to provide a basis from which to judge relative improvement of the model. The remainder of the table presents MAE when utilities estimated for the column form factor were used to predict the holdout tasks for the row form factor. Thus the diagonal is the in-sample MAE and off diagonal is out-of-sample. Finally, the average row is the average MAE for cross-form factor. For example, the average PC MAE of 0.082 is the average MAE using PC based utilities to predict tablet and mobile holdouts individually. Mobile outperforms both tablet and PC in every pairwise out-of-sample comparison. In other words, mobile is better at predicting PC than tablet and better than PC at predicting tablet holdouts. This can be seen at both the detail and average level. In fact, Mobile is almost as good at predicting PC holdouts as PC responders. Wrapping up the analysis of our partial profile cells we compared the distribution of RLH statistics output from Sawtooth Software’s CBC/HB to see if there are differences in respondent error by device type. Note that the range of the RLH statistic is from 0 to 1,000 (three implied decimal places), with 1,000 representing no respondent error. That is, when RLH is 1,000 choices are completely deterministic and the model explains the respondent’s behavior perfectly. In the case of triples, a RLH of roughly 333 is what one would expect with completely random choices where the model adds nothing to explain observed choices. Below we chart the cumulative RLH distributions for each form factor.
Chart: Cumulative RLH distributions (cumulative percent of respondents by RLH, 0 to 1,000) for the partial profile cells: PP PC, PP Tablet and PP Mobile.
Just as with a probability distribution function, the cumulative distribution function (CDF) allows us to visually inspect and compare the first few moments of the underlying distribution to understand any differences in location, scale (variance), or skew. Additionally, the CDF allows us to directly observe percentiles, thereby quantifying where excess mass may be and how much mass that is.
The CDF plots above indicate the partial profile design strategy results in virtually identically distributed respondent error by form factor. As this represents three independent CBC/HB runs (models were estimated separately by form factor) we are assured that this is indeed not an aggregate result. Full Profile Results
Partial profile results suggest mobile web responders are able to reliably complete smaller tasks on either their tablet or smartphone. As we extend this to the full profile strategy the limitations of form factor screen size, especially among smartphone responders, may begin to impact quality of results. We take the same approach to analyzing the full profile strategy as we did with partial profile, first considering parameter and scale equivalence tests.

Figure: Pairwise Swait-Louviere comparisons of the full profile cells (PC, Tablet, Mobile). Parameter agreement R2 values of 0.55, 0.72 and 0.81; relative scale parameters (m) between 0.94 and 1.05; and λA statistics of 61, 99 and 138, each leading to rejection of H1A.
As with the partial profile results, we again see preferences differing significantly beyond scale. The pairwise tests above show even greater differences in aggregate utility estimates than with partial profile results, as noted by the sharp decline in parameter agreement as measured by the R2 fit statistic. However, while we are unable to statistically test for differences in scale we again see relative scale parameter estimates near 1 suggesting similar levels of error.

Chart: Aggregate part-worth estimates by form factor (FP PC, FP Tablet, FP Mobile) across Brand, Hard Drive, Cloud Storage, Price, Resolution, Camera (Mpx), Warranty, and Screen Size.
Studying the parameter estimate detail in the chart above, we see a similar story as before with preferences really differing on brand, price, camera, and screen size. And again, the differences are consistent with what we would expect given the market for tablets, PCs, and
smartphones. For all three device types Apple is the preferred tablet brand which is consistent with the market leader position they enjoy. Android, not having a real presence (if any) in the PC market is the least preferred for PC and tablet responders, which is again consistent with market realities. Android showing a strong second position among mobile responders is again consistent with market realities as Android is a strong player in the smartphone market. Tablet responders also being apparently less sensitive to price is also what we would expect given the price premium of Apple’s iPad. In-sample predictive accuracy is presented in the table below and we again see hit rates for tablet and PC responders on par with one another. However, under the full profile design strategy mobile responders outperform PC responders in terms of both MAE and hit rate, with the latter being significant with 90% confidence for a one tail test. In terms of MAE tablet responders outperform both mobile and PC responders. In-sample predictive accuracy of tablet responders in terms of MAE is most likely the result of brand being such a dominant attribute.
              FP PC    FP Tablet   FP Mobile
  Base          202        91         164
  MAE         0.044      0.028       0.033
  Hit Rate     0.70       0.70        0.76
Looking at out-of-sample predictive accuracy in the table below we see some interesting results across device type. First off, the random choice MAE is consistent across device types making the direct comparisons easier in the sense of how well the model improves prediction over the no information (random) case. The average cross-platform MAE is essentially the same for mobile and PC responders, again suggesting that mobile responders provide results that are on par with PC, at least in terms of predictive validity. Interestingly, and somewhat surprisingly, utilities derived from mobile responders are actually better at predicting PC holdouts than those derived from PC responders.

                          Prediction Utilities
  Holdouts    Random      PC      Tablet    Mobile
  PC          0.163      0.044    0.101     0.038
  Tablet      0.163      0.056    0.028     0.074
  Mobile      0.166      0.059    0.102     0.033
  Average                0.058    0.102     0.056
While PC and mobile responders result in similar predictive ability, tablet responders are much worse at predicting out-of-sample holdouts. On average tablet responders show almost twice the out-of-sample error as mobile and PC responders, and compared to in-sample accuracy tablet out-of-sample has nearly 4 times the amount of error. This is again consistent with a dominant attribute or screening among tablet responders. If tablet responder choices are being determined by a dominant attribute then we would expect to see the mass of the RLH CDF shifted to the right. The cumulative RLH distributions are shown below for each of the three form factors.
[Chart: RLH cumulative distributions (cumulative percent vs. RLH, 0 to 1,000) for FP PC, FP Tablet, and FP Mobile]
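A minimal sketch of how such a comparison can be produced, assuming hypothetical arrays of HB part-worths and observed choices (this is not the authors' code; CBC/HB reports RLH on a 0 to 1,000 scale, so the 0 to 1 values below would be multiplied by 1,000):

```python
import numpy as np
import matplotlib.pyplot as plt

def respondent_rlh(utils, designs, choices):
    """RLH = geometric mean of the probabilities of the chosen alternatives.
    utils: (K,); designs: (T, n_alts, K); choices: (T,) chosen alternative index."""
    probs = []
    for t in range(len(choices)):
        v = designs[t] @ utils
        p = np.exp(v - v.max())
        p /= p.sum()
        probs.append(p[choices[t]])
    return np.exp(np.mean(np.log(probs)))   # geometric mean of chosen probabilities

def plot_rlh_cdf(rlh_by_group):
    """rlh_by_group: dict of group name -> array of RLH values."""
    for name, rlh in rlh_by_group.items():
        x = np.sort(rlh)
        y = np.arange(1, len(x) + 1) / len(x)   # cumulative percent
        plt.step(x, y, label=name)
    plt.xlabel("RLH"); plt.ylabel("Cumulative percent"); plt.legend()
    plt.show()
```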
As with partial profile, the full profile PC and mobile groups show virtually identical respondent error. However, we do see noticeably greater error variance among tablet responders, with greater mass close to the 1,000 RLH mark as well as more around the 300 range. This suggests that we do indeed see more tablet responders making choices consistent with a single dominating attribute. In order to explore the question of dominated choice behavior further, we calculated the percent of each group who chose in a manner consistent with a non-compensatory or dominant preference structure. A respondent was classified as choosing according to dominating preferences if he/she always chose the option carrying the most preferred level of a specific attribute. For example, if a respondent always chose the Apple alternative, we would say their choices were determined by the brand Apple. This should be a rare occurrence if people are making trade-offs as assumed by the model. The results from this analysis are in the tables below.

Dominating Preference
            Partial Profile               Full Profile
            PC      Tablet    Mobile      PC      Tablet    Mobile
Base        201     183       163         202     91        164
Number      92      44        47          48      33        29
Percent     0.23    0.24      0.29        0.24    0.36      0.18

P-Values*
            Partial Profile               Full Profile
            Tablet    Mobile              Tablet    Mobile
PC          0.770     0.141               0.027     0.156
Tablet                0.312                         0.001
* P-Values based on two-tail tests
These results are for dominating preference as described above. In the upper table are the summary statistics by form factor and design strategy. The “Base” row indicates the number of respondents in that cell, “Number” those responding in a manner consistent with dominated preferences, and “Percent” what percent that represents. For example, 23% of PC responders in the partial profile strategy made choices consistent with dominated preferences. The lower table presents p-values associated with the pairwise tests of significance between incidences of dominated choices. Looking at the partial profile responders, for example, the 24% for Tablet versus the 23% for PC responders has an associated p-value of 77%, indicating the two are not significantly different from one another. It should not be surprising that we see no significant differences between form factors for the partial profile in terms of dominated choices, given the nature of the design strategy. In half of the partial profile exercises a dominating attribute will not be shown, forcing respondents to make trade-offs based on attributes lower in their preference structure. However, when we look at the full profile we do see significant differences by form factor. Tablet responders are much more likely to exhibit choices consistent with dominated preferences or screening strategies than either mobile or PC responders, who are on par with one another. Both differences are significant with 95% confidence, as indicated by the p-values of 0.027 and 0.001 in the lower table.
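A minimal sketch of the classification rule just described, using hypothetical data structures (the shown levels of each attribute per task and the chosen alternative); the helper names are illustrative only:

```python
import numpy as np

def dominated_by_level(shown_levels, chosen_alt, preferred_level):
    """True if, in every task where preferred_level was available, the respondent
    chose an alternative carrying that level.
    shown_levels: (T, n_alts) level codes shown (-1 if the attribute was not shown);
    chosen_alt: (T,) index of the chosen alternative."""
    for t in range(len(chosen_alt)):
        levels = shown_levels[t]
        if preferred_level in levels:                  # the level was offered this task
            if levels[chosen_alt[t]] != preferred_level:
                return False                           # chose something else at least once
    return True

def classify_respondent(shown_levels_by_attr, chosen_alt):
    """A respondent is 'dominated' if some single attribute level fully determines their choices."""
    for attr, shown in shown_levels_by_attr.items():
        for level in np.unique(shown[shown >= 0]):
            if dominated_by_level(shown, chosen_alt, level):
                return True
    return False
```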
CASE STUDY II: SIGNIFICANT OTHER
Research Design
The second case study was also a web-based survey, this time among people who were either in, or interested in being in, a long-term relationship. We again have an eight-attribute design, although we increase the complexity of the experiment by presenting sets of five alternatives per choice task. A typical task is shown below; it asks, “Which of these five significant others do you think is the best for you?”
[Example choice task: five significant-other profiles described on Attractiveness, Romantic/Passionate, Honesty/Loyalty, Funny, Intelligence, Political Views, Religious Views, and Annual Income]
All attributes other than annual income are 3-level attributes. Annual income is a 5-level attribute ranging from $15,000 to $200,000. There were two cells in this study, one for estimation and one to serve as a holdout sample. The estimation design consists of 5 blocks of 12 tasks each. The holdout sample also completed a block of 12 tasks, and both estimation and holdout samples completed the same three holdout choice tasks. The holdout sample consists of only PC and tablet responders. Given the amount of overlap with 5 alternatives in each task, combined with the amount of space required to present the task, we expect mobile responders to be pushed even harder than with the tablet study.
Analysis
We again employ aggregate MNL for parameter equivalence tests and Hierarchical Bayes via Sawtooth Software’s CBC/HB for analysis of respondent error and predictive accuracy. However, in contrast to the tablet study we did not set individual quotas for device type, so the resulting lack of balance led us to take a matching approach to analysis rather than simply weighting by demographic profiles. The matching dimensions are listed in the table below.

Matching Dimension     Cells
Design Block           1–5
Age                    18–34, 35+
Gender                 Male/Female
Children in House      Yes/No
Income
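A minimal sketch of such a matching step, assuming hypothetical pandas data frames with one row per respondent and illustrative column names for the matching variables; `evaluate` stands in for whatever MAE or hit rate computation is run on each matched sample:

```python
import pandas as pd

MATCH_VARS = ["design_block", "age_group", "gender", "kids_in_house", "income_group"]

def draw_matched_pc_sample(pc_df, target_df, match_vars=MATCH_VARS, seed=0):
    """For each cell of the matching variables, sample (without replacement) as many
    PC respondents as there are target-device (tablet or mobile) respondents in that cell.
    Assumes every cell contains at least as many PC completes as target completes."""
    pieces = []
    for _, grp in target_df.groupby(match_vars):
        cell = pc_df.merge(grp[match_vars].drop_duplicates(), on=match_vars)
        pieces.append(cell.sample(n=len(grp), random_state=seed, replace=False))
    return pd.concat(pieces, ignore_index=True)

# Repeat the draw, e.g., 100 times, recomputing the accuracy measure each time and averaging:
# results = [evaluate(draw_matched_pc_sample(pc_df, tablet_df, seed=s)) for s in range(100)]
```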
[Chart: MNL parameter comparison, Mobile vs. average PC (R2 = 0.95 overall; R2 = 0.84 for the inner attributes); distribution of the relative Mobile scale parameter; S&L test results over 1,000 iterations: fail to reject H1A 19.5%, fail to reject H1B 76.9%]
In the first panel the average PC parameter estimates are plotted against the mobile parameter estimates. We see a high degree of alignment overall, with an R2 value of 0.95. However, the two outliers point to the presence of a dominating attribute, so we also present the fit for the inner set of less important attributes, where we still see strong agreement with an R2 value of 0.84, which is a correlation of over 0.9. The middle panel shows the distribution of the relative scale parameter estimated in the first step of the Swait-Louviere test, with mobile showing slightly higher scale about 72% of the time. The right panel summarizes the test results. Note that if PC and mobile responders were to produce significantly different preferences or scale, we would expect to fail to reject the null hypothesis no more than 5% of the time for tests at the 95% level of confidence. For both H1A (parameters) and H1B (scale) we fail to reject the null hypothesis well in excess of 5% of the time, indicating that we do not see significant differences in preferences or scale between PC and mobile responders. Looking at the detailed parameter estimates in the chart below further reinforces the similarity of the data after controlling for demographics.
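For context, once the separate, scale-adjusted pooled, and fully pooled MNL models have been estimated (in any MNL package), the two Swait-Louviere hypotheses reduce to likelihood-ratio statistics; a minimal sketch with the log-likelihoods as hypothetical inputs, using the degrees of freedom from Swait and Louviere (1993):

```python
from scipy.stats import chi2

def swait_louviere_tests(ll_a, ll_b, ll_pooled_scaled, ll_pooled, n_params):
    """ll_a, ll_b: log-likelihoods of the two separately estimated MNL models (K params each).
    ll_pooled_scaled: pooled model with common betas plus one relative scale parameter (K+1).
    ll_pooled: fully pooled model with common betas and scale fixed at 1 (K params).
    Returns p-values for H1A (equal parameters, allowing scale to differ) and
    H1B (equal scale, given equal parameters)."""
    # H1A: -2 * [LL(pooled with scale) - (LL_a + LL_b)], df = K - 1
    lr_a = -2.0 * (ll_pooled_scaled - (ll_a + ll_b))
    p_a = chi2.sf(lr_a, df=n_params - 1)
    # H1B: -2 * [LL(fully pooled) - LL(pooled with scale)], df = 1
    lr_b = -2.0 * (ll_pooled - ll_pooled_scaled)
    p_b = chi2.sf(lr_b, df=1)
    return p_a, p_b
```

Large p-values correspond to failing to reject the corresponding hypothesis, which is what the resampled iterations above are counting.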
[Chart: aggregate parameter estimates, Avg PC vs. Mobile, across Attractiveness (Very, Somewhat, Not Very), Romantic/Passionate (Very, Somewhat, Not Very), Honesty/Loyalty (Completely, Mostly, Can't Trust), Funny (Very, Sometimes, Not Funny), Intelligence (Brilliant, Pretty Smart, Not Very), Political Views (Strong Democrat, Swing Voter, Strong Repub), Religious Views (Christian, Non-Christian, No Religion), and Annual Income]
Comparing tablet and PC parameters and scale, we see an even more consistent story. The results are summarized in the charts below. Even when we look at the consistency between parameter estimates on the less important inner attributes, we have an R2 fit statistic of 0.94, which is a correlation of almost 0.97. Over 90% of the time the relative tablet scale parameter is greater than 1, suggesting that we may be seeing slightly less respondent error among those completing the survey on a tablet. However, as the test results to the right indicate, neither parameter estimates nor scales differ significantly.
[Chart: MNL parameter comparison, Tablet vs. PC (R2 = 0.98 overall; R2 = 0.94 for the inner attributes); distribution of the relative Tablet scale parameter (90.7% > 1); S&L test results over 1,000 iterations: fail to reject H1A 75.1%, fail to reject H1B 51.8%]
Test results for mobile versus tablet also showed no significant differences in preferences or scale. As noted earlier, those results are not presented because of the available sample sizes and because the story is sufficiently similar that it would not add meaningfully to the discussion. Turning to in-sample predictive accuracy, holdout MAE and hit rates are presented in the table below.
            Base    MAE      Hit Rate    PC Matched MAE*    PC Matched Hit Rate*
Tablet      73      0.050    0.53        0.039              0.53
Mobile      88      0.037    0.53        0.034              0.54
* Mean after 100 iterations
In the table, the first MAE and Hit Rate columns refer to the results for the row form factor responders. For example, among tablet responders the in-sample holdout MAE is 0.050 and the hit rate is 53%. The PC Matched MAE and Hit Rate refer to the average MAE and hit rate over the iterations matching PC completes to the row form factor responders. In this case, the average MAE for PC responders matched to tablet is 0.039, with a mean hit rate of 53%. Controlling for demographic composition and sample sizes brings all three very much in line with one another in terms of in-sample predictive accuracy, although tablet responders appear to be the least consistent internally with respect to MAE. Out-of-sample predictive accuracy shows a similar story for mobile compared to PC responders. Once we control for sample size differences and demographic distributions, mobile and PC responders have virtually the same out-of-sample MAE. PC responders when matched to tablet did show a slightly higher out-of-sample MAE than actual tablet responders, although we do not conclude this to be a substantial strike against the form factor. Out-of-sample results are summarized below.

            MAE      PC Matched MAE*
Tablet      0.050    0.039
Mobile      0.037    0.034
* Mean after 100 iterations
The results thus far indicate that mobile and tablet responders provide data that is at least on par with PC responders in terms of preferences, scale, and predictive accuracy. To wrap up the results of our significant other case study we look at respondent error as demonstrated with the RLH cumulative distributions in the chart below.
[Chart: RLH cumulative distributions (cumulative percent vs. RLH, 0 to 1,000) for Mobile, Tablet, PC - Mobile matched, and PC - Tablet matched]
We again see highly similar distributions of RLH by form factor. The PC matched cumulative distribution curves are based on data from all 100 iterations, which explains the relatively smooth shape of those distributions. There is possibly a slight indication of less error among the mobile and tablet responders, although we do not view the difference as substantial. The slight shift of mass to the right is consistent with the relative scale estimates in our parameter equivalence tests, which were not statistically significant.
CONCLUSION
In both our tablet and significant other studies we see similar results regardless of which form factor the respondent chose to complete the survey. However, the tablet study does indicate the potential for capturing differing preferences by the device type used for survey completion. Given the context of that study, this finding is not at all surprising, and in fact is encouraging in that we are capturing more of the heterogeneity in preferences we would expect to see in the marketplace. It would be odd if tablet owners did not exhibit different preferences than non-owners, given their usage experience. On the other hand, we observe the same preferences regardless of form factor in the significant other study, which is what we would expect for a non-technical topic unrelated to survey device type. More important than preference structures, which we should not expect to converge a priori, both of our studies indicate that the quality of data collected via smartphone is on par with, or even slightly better than, that collected from PC responders. In terms of predictive accuracy, both in- and out-of-sample, and respondent error, we can be every bit as confident in choice experiments completed in a mobile environment as in a traditional PC environment. Responders who choose to complete surveys in a mobile environment are able to do so reliably, and we should therefore not exclude them from choice experiments based on assumptions to the contrary. In light of the potential for capturing different segments in terms of preferences, we should actually welcome the increased diversity offered by presenting choice experiments in different web environments.
Joseph White
REFERENCES
Swait, J., & Louviere, J. (1993). The Role of the Scale Parameter in the Estimation and Comparison of Multinomial Logit Models. Journal of Marketing Research, 30(3), 305–314.
USING COMPLEX MODELS TO DRIVE BUSINESS DECISIONS
KAREN FULLER
HOMEAWAY, INC.
KAREN BUROS
RADIUS GLOBAL MARKET RESEARCH
ABSTRACT
HomeAway offers an online marketplace for vacation travelers to find rental properties. Vacation home owners and property managers list rental property on one or more of HomeAway’s websites. The challenge for HomeAway was to design the pricing structure and listing options to better support the needs of owners and to create a better experience for travelers. Ideally, this would also increase revenues per listing. They developed an online questionnaire that looked exactly like the three pages vacation homeowners use to choose the options for their listing(s). This process nearly replicated HomeAway’s existing enrollment process (so much so that some respondents got confused regarding whether they had completed a survey or done the real thing). Nearly 2,500 US-based respondents completed multiple listings (MBC tasks), where the options and pricing varied from task to task. Later, a similar study was conducted in Europe. CBC software was used to generate the experimental design, the questionnaire was custom-built, and the data were analyzed using MBC (Menu-Based Choice) software. The results led to specific recommendations for management, including the use of a tiered pricing structure, additional options, and an increase in the base annual subscription price. After implementing many of the suggestions of the model, HomeAway has experienced greater revenues per listing, with the highest renewal rates among customers choosing the tiered pricing.
THE BUSINESS ISSUES
HomeAway Inc., located in Austin, Texas, is the world’s largest marketplace for vacation home rentals. HomeAway sites represent over 775,000 paid listings for vacation rental homes in 171 countries. Many of these sites recently merged under the HomeAway corporate name. For this reason, subscription configurations could differ markedly from site to site. Vacation home owners and property managers list their rental properties on one or more HomeAway sites for an annual fee. The listing typically includes details about the size and location of the property, photos of the home, a map, an availability calendar and occasionally a video. Travelers desiring to rent a home scan the listings in their desired area, choose a home, contact and rent directly from the owner, and do not pay a fee to HomeAway. HomeAway’s revenues are derived solely from owner subscriptions. Owners and property managers have a desire to enhance their “search position,” ranking higher in the available listings, to attract greater rental income. HomeAway desired to create a more uniform approach for listing properties across its websites, enhance the value and ease of listing for the owner, and encourage owners to provide high quality listings while creating additional revenue.
THE BUSINESS ISSUE
The initial study was undertaken in the US for sites under the names HomeAway.com and VRBO.com. The HomeAway.com annual subscription included a thumbnail photo next to the property listing, 12 photos of the property, a map, availability calendar and a video. Owners could upload additional photos if desired. The search position within the listings was determined by an algorithm rating the “quality” of the listing. The VRBO.com annual subscription included four photos. Owners could pay an additional fee to show more photos which would move their property up in the search results. With the purchase of additional photos came enhancements such as a thumbnail photo, map and video. The business decision entailed evaluating an alternative tiered pricing system tied to the position on the search results (e.g., Bronze, Silver and Gold) versus alternative tiered systems based on numbers of photos.
THE STUDY DESIGN
The designed study required 15 attributes, arrayed as follows using an alternative-specific design through Sawtooth Software’s CBC design module:

Five alternative “Basic Listing” options:
  o Current offer based on photos
  o Basic offer includes fewer photos with the ability to pay for extra photos and obtain “freebies” (e.g., thumbnail photo) and improve search position
  o Basic offer includes fewer photos and includes “freebies” (e.g., thumbnail photo). The owner can “buy up” additional photos to improve search position
  o Basic offer includes many photos but no “freebies.” Pay directly for specific search position and obtain “freebies.”
  o Basic offer includes many photos and “freebies.” Pay directly for specific search position.
Pricing for Basic Offers—five alternatives specific to the “Basic Listing”
“Buy Up” Tiers offered specific to Basic Listing offer—3, 7 and 11 tiers
Tier prices—3 levels under each approach
Options to list on additional HomeAway sites (US only, Worldwide, US/Europe/Worldwide options)
Prices to list on additional sites—3 price levels specific to option
Other listing options (Directories and others)
THE EXERCISE
Owners desiring to list a home on the HomeAway site select options they wish to purchase on a series of three screens. For this study the screens were replicated to closely resemble the
look and functionality of the sign-up procedure on the website. These screens are shown in the Appendix. Additionally, the search position shown under the alternative offers was customized to the specific market where the rental home was located. In smaller markets “buying up” might put the home in the 10th position out of 50 listings; in other larger markets the same price might only list the home 100th out of 500 listings. As the respondent moved from one screen to the next the “total spend” was shown. The respondent always had the option to return to a prior screen and change the response until the full sequence of three screens was complete. Respondents completed eight tasks of three screens.
THE INTERVIEW AND SAMPLE
The study was conducted through an online interview in the US in late 2010/early 2011 among current and potential subscribers to the HomeAway service. The full interview ran an average of 25 minutes.
903 current HomeAway.com subscribers
970 current VRBO.com subscribers
500 prospective subscribers who rent or intend to rent a home to vacationers and do not list on a HomeAway site
Prospective subscribers were recruited from an online panel.
THE DATA Most critical to the usefulness of the results is an assurance that the responses are realistic, that respondents were not overly fatigued and were engaged in the process. To this end, the median and average “spend” per task are examined and shown in the table below:
These results resemble closely the actual spend among current subscribers. Additionally, spend by task did not differ markedly. A large increase/decrease in spend in the early/later tasks might indicate a problem in understanding or responding to the tasks. The utility values for each of the attribute levels were estimated using Sawtooth Software’s HB estimation. Meaningful interactions were embedded into the design. Additional cross-effects were evaluated. 85
To further evaluate data integrity HB results were run for all eight tasks in total, the first six tasks in total and the last six tasks in total. The results of this exercise indicate that using all eight tasks was viable. Results did not differ in a meaningful way when beginning or ending tasks were dropped from the data runs. The results for several of the attributes are shown in the following charts.
THE DECISION CRITERIA
Two key measures were generated for use by HomeAway in their financial models to implement a pricing strategy: a revenue index and a score representing the appeal of the offer to homeowners. These measures were generated in calculations in an Excel-based simulator, an example of which is shown below:
[Example simulator output]

                           Total    Small    Medium    Large    Extra Large
Base (N)                   1,470    829      310       147      184
Appeal Score of Option     84.1     86.5     83.1      78.5     79.5
Revenue Index              115.0    111.7    116.8     124.0    119.7

Basic Listing Type (share choosing each photo option)
                 Price         Total    Small    Medium    Large    Extra Large
5 Photos         Base Price    68.2%    73.2%    66.5%     51.2%    62.4%
6 Photos         $30           3.0%     2.9%     3.3%      1.6%     3.9%
12 Photos        $210          12.8%    10.4%    14.6%     22.2%    13.0%
16 Photos        $330          16.1%    13.7%    15.6%     25.6%    20.6%
(the remaining photo options, priced in $30 steps up to $330, drew 0.0% share in all groups)

[Additional simulator detail: listings on additional sites (Basic US chosen by 64.3% of Total, 70.7% Small, 60.8% Medium, 45.1% Large, 56.9% Extra Large); Featured Listing options of 1, 3, 6 and 12 months at plus $49/$99/$149/$199; Golf and Ski Directories at $59 for 12 months; and a Special Offer feature at $20 per week, each with take rates by subgroup]
In this simulator, the user can specify the availability of options for the homeowner, pricing specific to each option and the group of respondents to be studied. The appeal measures are indicative of the interest level for that option among homeowners in comparison to the current offer. The revenue index is a relative measure indicating the degree to which the option studied might generate revenue beyond the “current” offer (Index = 100). The ideal offer would generate the highest appeal while maximizing revenue. The results in the simulator were weighted to reflect the proportion of single and dual site subscribers and potential prospects (new listings) according to current property counts.
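The exact formulas behind the simulator are not published here, but a minimal sketch of how a revenue index and an appeal score of this general kind could be computed from simulated take rates and respondent utilities might look as follows; all inputs and both formulas are assumptions for illustration, not HomeAway's or Radius's implementation:

```python
import numpy as np

def revenue_index(option_prices, option_shares, baseline_prices, baseline_shares):
    """Expected spend per listing under a test configuration, indexed to the current offer (=100).
    *_prices: price of each menu option; *_shares: predicted take rate of each option.
    Illustrative only; the authors' exact revenue calculation is not published."""
    test_rev = float(np.dot(option_prices, option_shares))
    base_rev = float(np.dot(baseline_prices, baseline_shares))
    return 100.0 * test_rev / base_rev

def appeal_score(utility_test, utility_current):
    """Share-of-preference style appeal of the test offer relative to the current offer,
    averaged over respondents (per-respondent total utilities are hypothetical inputs)."""
    u = np.column_stack([utility_test, utility_current])
    p = np.exp(u - u.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return 100.0 * p[:, 0].mean()
```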
BUSINESS RECOMMENDATIONS
The decision whether to move away from pricing based on the purchase of photos in a listing to an approach based on direct purchase of a listing “tier” was critical for the sites studied as well as other HomeAway sites. Based on these results, HomeAway chose to move to a tiered pricing approach. Both approaches held appeal to homeowners, but the tiered approach generated the greater upside revenue potential. Additional research also indicated that the tiered system provided a better traveler experience in navigating the site. While the study evaluated three, seven and eleven tier approaches, HomeAway chose a five tier approach (Classic, Bronze, Silver, Gold and Platinum). Fewer tiers generally outperformed higher tier offers. The choice of five tiers offered HomeAway greater flexibility in its offer. Price tiers were implemented in market at $349 (Classic); $449 (Bronze); $599 (Silver); $749 (Gold) and $999 (Platinum). Each contained “value-added” features bundled in the offer to allow for greater price flexibility. These represent a substantive increase in the base annual subscription prices. HomeAway continued to offer cross-sell options and additional listing offers (feature directories, feature listings and other special offers) to generate additional revenue beyond the base listing.
SOME LESSONS LEARNED FOR FUTURE MENU-BASED STUDIES
Substantial “research” learning was also generated through this early foray into menu-based choice models.
We believe that one of the keys to the success of this study was the “strive for realism” in the presentation of the options to respondents. (The task was sufficiently realistic that HomeAway received numerous phone calls from its subscribers asking why their “choices” in the task had not appeared in their listings.) Realism was implemented not only in the “look” of the pages but also in the explanations of the listing positions, which were calculated based on respondents’ own listed homes.
Also critical to the success of any menu-based study is the need to strike a “happy medium” in terms of the number of variables studied and the overall sample size.
  o While the flexibility of the approach makes it tempting for the researcher to “include everything,” parsimony pays. Both the task and the analysis can be overwhelming when variables that are not critical to the decision are included. Estimation of cross-effects is also challenging: too many cross-effects quickly result in model over-specification, with cross-cancellations of the needed estimations.
  o Sufficient sample size is likewise critical, but too much sample is likewise detrimental. In the study design keep in mind that many “sub-models” are estimated and sample must be sufficient to allow stable estimation at the individual level. Too much sample, however, presents major challenges to computing power and ultimately simulation.
In simulation it is important to measure from a baseline estimate. This is a research exercise, with awareness and other marketing measures not adequately represented. Measurement from a baseline levels the playing field for these factors, which have little known effect, providing confidence in the business decisions made. This is still survey research and we expect a degree of over-statement by respondents. Using a “baseline” provides consistency to the over-statement.
IN-MARKET EXPERIENCE HomeAway implemented the recommendations in market for its HomeAway and VRBO businesses. Subsequent to the effort, the study was repeated, in a modified form, for HomeAway European sites. In market adoption of the tiered system exceeded the model predictions. Average revenue per listing increased by roughly 15% over the prior year. Additionally, HomeAway experienced the highest renewal rates among subscribers adopting the tiered system. Brian Sharples, Co-founder and Chief Executive Officer noted: “The tiered pricing research allowed HomeAway to confidently launch tiered pricing to hundreds of thousands of customers in the US and European markets. Our experience in market has been remarkably close to what the research predicted, which was that there would be strong demand for tiered pricing among customers. Not only were we able to provide extra value for our customers but we also generated substantial additional revenue for our business.”
Karen Fuller
Karen Buros
APPENDIX—SCREEN SHOTS FROM THE SURVEY
AUGMENTING DISCRETE CHOICE DATA—A Q-SORT CASE STUDY
BRENT FULLER
MATT MADDEN
MICHAEL SMITH
THE MODELLERS
ABSTRACT
There are many ways to handle conjoint attributes with many levels, including progressive build tasks, partial profile tasks, tournament tasks and adaptive approaches. When only one attribute has many levels, an additional method is to augment choice data with data from other parts of the survey. We show how this can be accomplished with a standard discrete choice task and a Q-Sort exercise.
PROBLEM DEFINITION AND PROPOSED SOLUTION
Often, clients come to us with an attribute grid that has one attribute with a large number of levels. Common examples include promotions or messaging attributes. Having too many levels in an attribute can lead to excessive respondent burden, insufficient level exposure and nonintuitive results or reversals. To help solve the issue we can augment discrete choice data with other survey data focused on that attribute. Sources for augmentation could include MaxDiff exercises, Q-Sort and other ranking exercises, rating batteries and other stated preference questions. Modeling both sets of data together allows us to get more information and better estimates for the levels of the large attribute. We hope to find the following in our augmented discrete choice studies:
1st priority—Best estimates of true preference
2nd priority—Better fit with an external comparison
3rd priority—Better holdout hit rates and lower holdout MAEs
Approaches like this are fairly well documented. In 2007 Hendrix and Drucker showed how data augmentation can be used on a MaxDiff exercise with a large number of items. Rankings data from a Q-Sort task were added to the MaxDiff information and used to improve the final model estimates. In another paper, in 2009, Lattery showed how incorporating stated preference data as synthetic scenarios in a conjoint study can improve estimation of individual utilities; higher hit rates and more consistent utilities resulted. Our augmenting approach is similar, and we present two separate case studies below. Both augment discrete choice data with synthetic scenarios. The first case study augments a single attribute that has a large number of levels using data from a Q-Sort exercise about that attribute. The second case study augments several binary (included/excluded) attributes with data from a separate scale rating battery of questions.
CASE STUDY 1 STRUCTURE
We conducted a telecom study with a discrete choice task trading off attributes such as service offering, monthly price, additional fees and contract type. The problematic attribute listed promotion gifts that purchasers would receive for free when signing up for the service. This attribute had 19 levels that the client wanted to test. We were concerned that the experimental design would not give sufficient coverage to all the levels of the promotion attribute and that the discrete choice model would yield nonsensical results. We know from experience that ten levels for one attribute is about the limit of what a respondent can realistically handle in a discrete choice exercise. The final augmented list is shown in Table 1.

Table 1. Case Study 1 Augmented Promotion Attribute Levels
Augmentation List:
$100 Gift Card
$300 Gift Card
$500 Gift Card
E-reader
Gaming Console 1
Tablet 1
Mini Tablet
Tablet 2
Medium Screen TV
Small Screen TV
Gaming Console 2
HD Headphones
Headphones
Home Theatre Speakers
3D Blu-Ray Player
12 month Gaming Subscription

We built a standard choice task (four alternatives and 12 scenarios) with all attributes. Later in the survey respondents were asked a Q-Sort exercise with levels from the free promotion gift attribute. Our Q-Sort exercise included the questions below to obtain a multi-step ranking. These ranking questions took one and a half to two minutes for respondents to complete.
1) Which of the following gifts is most appealing to you?
2) Of the remaining gifts, please select the next 3 which are most appealing to you. (Select 3 gifts)
3) Of the remaining gifts, which is the least appealing to you? (Select one)
4) Finally, of the remaining gifts, please select the 3 which are least appealing to you. (Select 3 gifts)
In this way we were able to obtain promotion gift ranks for each respondent. We coded the Q-Sort choices into a discrete choice data framework as a series of separate choices and appended these as extra scenarios within the standard discrete choice data. Based on ranking comparisons, the item chosen as the top rank was coded as chosen compared to all others. The items chosen as “next top three” each beat all the remaining items. The bottom three each beat the last-ranked item in pairwise scenarios. We estimated two models to begin with, one standard discrete choice model and one discrete choice model with the additional Q-Sort scenarios.
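A minimal sketch of turning the Q-Sort selections into appended pairwise scenarios, under one reading of the coding rules above; the function and data layout are hypothetical, not The Modellers' production code:

```python
def qsort_pairwise_scenarios(top1, next3, bottom1, next_bottom3, all_items):
    """Return (winner, loser) pairs implied by the Q-Sort:
    - the top-ranked item beats every other item,
    - each of the 'next top three' beats every remaining item,
    - each of the 'next least three' beats only the last-ranked item."""
    pairs = [(top1, j) for j in all_items if j != top1]
    remaining = [i for i in all_items if i != top1 and i not in next3]
    for w in next3:
        pairs += [(w, j) for j in remaining]
    for w in next_bottom3:
        pairs.append((w, bottom1))
    return pairs
```

Each (winner, loser) pair is then appended to the choice data as an extra two-alternative scenario with the winner coded as chosen.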
CASE STUDY 1 RESULTS
As expected, the standard discrete choice model without the Q-Sort augmentation yielded nonsensical results for the promotion gift attribute. Some of the promotions we tested included prepaid gift cards. As seen in Table 2, before integrating the Q-Sort data we saw odd reversals; for example, the $100 and $300 prepaid cards were preferred over the $500 card in many of the individual-level estimates. When the Q-Sort augment was applied to the model the reversals disappeared almost completely. The prepaid card ordering was logical (the $500 card was most preferred) and the rank ordering made sense for the other items in the list.

Table 2. Case Study 1 Summary of Individual Level Reversals
Individual Reversals    DCM Only    DCM + Q-Sort
$100 > $300             59.8%       0.0%
$300 > $500             60.8%       0.8%
$100 > $500             82.3%       0.0%
As a second validation, we assigned approximate MSRP figures to the promotions, figuring they would line up fairly well with preferences. As seen in Figure 1, when plotting the DCM utilities against the MSRP values, the original model had a 29% r-square. After integrating the Q-Sort data, we saw the r-square increase to 58%. Most ill-fitting results were due to premium offerings, where respondents likely would not see MSRP as a good indicator of value.
Figure 1. Case Study 1 Comparison of Average Utilities and MSRP
Priorities one and two mentioned above seem to be met in this case study. The augmented model gave us estimates which we believe are closer to true preference, and our augmented model better matches the external check against MSRP. The third priority of getting improved hit rates and MAEs with the augmented model proved more elusive with this data set. The augmented model did not significantly improve holdout hit rates or MAE (see Table 4). This was a somewhat puzzling result. One explanation is that shares were very low in this study, in the 1% to 5% range. The promotion attribute does not add any additional predictive value because the hit rate is already very high and has very little room for improvement. As a validation of this theory
we estimated a model without the promotion attribute, completely dropping it from the model. We were still able to obtain 93% holdout hit rates in this model, confirming that the promotion attribute did not add any predictive power to our model. In a discrete choice model like this, people’s choices might be largely driven by the top few attributes, yet when the market offerings tie on those attributes, mid-level attributes (like promotions in this study) will matter more. Our main goal in this study was to get a more sensible and stable read on the promotion attribute, not explicitly to improve the model hit rates. We are also not disappointed that the hit rate and MAE were not improved, because both showed a high degree of accuracy in both instances.

Table 3. Case Study 1 Promotion Rank Orderings, Models, MSRP, Q-Sort
Promotion                       MSRP Rank    DCM Only Rank    DCM + Q-Sort Rank    Q-Sort Rank
$500 Gift Card                  1            5                1                    1
Tablet 1                        2            2                3                    3
Home Theatre Speakers           3            7                7                    7
Mini Tablet                     4            1                5                    5
$300 Gift Card                  5            4                2                    2
Medium Screen TV                6            9                4                    4
Gaming Console 2                7            15               14                   12
E-reader                        8            12               9                    8
Gaming Console 1                8            13               12                   11
Tablet 2                        8            6                8                    9
Small Screen TV                 8            8                13                   13
HD Headphones                   8            14               15                   15
3D Blu-Ray Player               8            11               10                   14
Headphones                      14           10               11                   10
$100 Gift Card                  15           3                6                    6
12 month Gaming Subscription    16           16               16                   16
Table 4. Case Study 1 Comparison of Hit Rates, MAE, and Importances
                      DCM Only    DCM + Q-Sort
Holdout hit rate      93.4%       93.3%
MAE                   0.0243      0.0253
Average Importance    7.3%        14.5%

One problem with augmenting discrete choice data is that it often artificially inflates the importance of the augmented attribute relative to the non-augmented attributes. Our solution to this problem was to scale back the importances to the original un-augmented model importance at the individual level. We feel that there are also some other possible solutions that could be
investigated further. For example, we could apply a scaling parameter to the augmented attribute chosen to minimize the MAE or maximize the hit rate. An alternative to augmenting, if the goal is to completely remove all reversals, is to constrain at the respondent level using MSRP information. Our constrained model maintained the 93% holdout hit rates and comparable levels of MAE. The constrained model also deflated the average importance of the promotion attribute to 4.8%. We thought it was better to augment in this case study since there are additional trade-offs to be considered besides MSRP. For example, a respondent might value a tablet at $300 but might prefer a product with a lower MSRP because they already own a tablet.
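The paper does not spell out the exact adjustment, but one way to implement the individual-level rescaling described above is sketched below; the array layout and attribute slices are hypothetical:

```python
import numpy as np

def rescale_augmented_attribute(aug_utils, unaug_utils, attr_slices, aug_attr):
    """aug_utils, unaug_utils: (N, K) individual part-worths from the augmented and
    un-augmented models; attr_slices: dict attribute -> slice of its columns;
    aug_attr: name of the augmented attribute. Returns adjusted augmented utilities
    whose augmented-attribute importance matches the un-augmented model."""
    adjusted = aug_utils.copy()
    ranges_aug = {a: np.ptp(aug_utils[:, s], axis=1) for a, s in attr_slices.items()}
    ranges_un  = {a: np.ptp(unaug_utils[:, s], axis=1) for a, s in attr_slices.items()}
    r_aug = ranges_aug[aug_attr]
    other = sum(ranges_aug.values()) - r_aug                       # range of all other attributes
    target = ranges_un[aug_attr] / sum(ranges_un.values())         # target importance per respondent
    denom = r_aug * (1.0 - target)
    factor = np.divide(target * other, denom,                      # shrink factor per respondent
                       out=np.ones_like(denom), where=denom > 0)
    s = attr_slices[aug_attr]
    adjusted[:, s] = aug_utils[:, s] * factor[:, None]
    return adjusted
```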
CASE STUDY 2 STRUCTURE
We conducted a second case study which was also a discrete choice model in the telecom space. The attributes included 16 distinct features that had binary (included/excluded) levels. Other attributes in the study also included annual and monthly fees. After the choice task, respondents were asked about their interest in using each feature (separately) on a 1–10 rating scale. This external task took up to 2 minutes for respondents to complete. If respondents answered 9 or 10 then we added extra scenarios to the regular choice scenarios. Each of these extra scenarios was a binary comparison, each item vs. a “none” alternative.
CASE STUDY 2 RESULTS
As expected, the rank ordering from the augmented task aligned slightly better with what we would intuitively expect (rank orders shown in Table 5). As far as comparing to an external source, this case study was a little more difficult than the previous one because we could not assign something as straightforward as MSRP to the features. We looked at the top-rated security software products (published by a security software review company), counted the number of times the features were included in each, and ranked them. Figure 2 shows this comparison. Here we feel the need to emphasize key reasons not to constrain to an external source. First, often it is very difficult to find external sources. Second, if an external source is found, it can be difficult to validate and have confidence in. Last, even if there is a valid external source such as MSRP, it still might not make sense to constrain given that there could be other value tradeoffs to consider. Similar to the first case study, we did not see improved hit rates or MAEs in the second case study. Holdout hit rates came out at 98% for both augmented and un-augmented models, and MAEs were not statistically different from each other. We are not concerned with this non-improvement because of the high degree of accuracy of both the augmented and non-augmented models.
Table 5. Case Study 2 Rank Orders
Feature        Stated Rank Order    DCM Only Rank Order    DCM + Stated Rank Order
Feature 1      1                    1                      1
Feature 2      2                    3                      3
Feature 3      3                    2                      2
Feature 4      4                    5                      4
Feature 5      5                    4                      5
Feature 6      6                    9                      7
Feature 7      7                    11                     9
Feature 8      8                    15                     10
Feature 9      9                    6                      6
Feature 10     10                   8                      8
Feature 11     11                   13                     14
Feature 12     12                   10                     13
Feature 13     13                   12                     12
Feature 14     14                   7                      11
Feature 15     15                   14                     15
Feature 16     16                   16                     16
Figure 2. Case Study 2 External Comparison
DISCUSSION AND CONCLUSIONS
Augmenting a choice model with a Q-Sort or ratings battery can improve the model in the following ways. First, the utility values are more logical and fit better with the respondents’ true
values for the attribute levels. Second, the utility values have a better fit with external sources of value. It is not a given that holdout hit rates and MAE are improved with augmentation, although we would hope that they would be in most conditions. We feel that our hit rates and MAE did not improve in these cases because of the low likelihood of choice in the products we studied and the already high pre-augmentation hit rates. There are tradeoffs to consider when deciding to augment or constrain models. First, there is added respondent burden in asking the additional Q-Sort or other exercise used for augmentation. In our cases the extra information was collected in less than two additional minutes. Second, there is additional modeling and analysis time spent to integrate the augmentation. In our cases the augmented HB models took 15% longer to converge. Third, there is a tendency for the attribute that is augmented to have inflated importances or sensitivities and we suggest scaling the importances by either minimizing MAE or using the un-augmented importances. Lastly, one should consider reliability of external sources to check the augmentation against or to use for constraining.
Brent Fuller
Michael Smith
APPENDIX
Figure 3 shows an example of the coding matrix before augmentation, and Figure 4 shows the same matrix with an appended augmented scenario from the first case study. In scenario 13, item 15 was chosen as the highest-ranking item from the Q-Sort exercise. All other attributes for the augmented tasks are coded as 0.
[Figure 3. Example of un-augmented coding matrix: choice scenarios 1–12, four alternatives each, with a choice indicator y and coded columns tv_2 through tv_6 and promo_1 through promo_19]
[Figure 4. Example of augmented coding matrix: the same scenarios 1–12 plus an appended Q-Sort scenario 13, whose alternatives each carry a single promotion item; the alternative carrying item 15 (the top-ranked Q-Sort item) is coded as chosen (y = 1)]
REFERENCES
Hendrix, Phil and Drucker, Stuart (2007), “Alternative Approaches to MaxDiff with Large Sets of Disparate Items—Augmented and Tailored MaxDiff,” 2007 Sawtooth Software Conference Proceedings, 169–187.
Lattery, Kevin (2009), “Coupling Stated Preferences with Conjoint Tasks to Better Estimate Individual-Level Utilities,” 2009 Sawtooth Software Conference Proceedings, 171–184.
MAXDIFF AUGMENTATION: EFFORT VS. IMPACT
URSZULA JONES
TNS
JING YEH
MILLWARD BROWN
BACKGROUND
In recent years MaxDiff has become a household name in marketing research, as it is more and more commonly used to assess the relative performance of various statements, products, or messages. As MaxDiff grows in popularity, it is often called upon to test a large number of items, requiring lengthier surveys in the form of more choice tasks per respondent in order to maintain predictive accuracy. Oftentimes MaxDiff scores are used as inputs to additional analyses (e.g., TURF or segmentation), therefore a high level of accuracy for both best/top and worst/bottom attributes is a must. Based on standard rules of thumb for obtaining stable individual-level estimates, the number of choice tasks per respondent becomes very large as the number of items to be tested increases. For example, 40 items requires 24–30 choice tasks per respondent (assuming 4–5 items per task). Yet at the same time our industry, and society in general, is moving at a faster pace with decreasing attention spans, therefore necessitating shorter surveys to maintain respondent engagement. Data quality suffers at 10 to 15 CBC choice tasks per respondent (Tang and Grenville 2010). Researchers therefore find themselves being pulled by two opposing demands—the desire to test larger sets of items and the desire for shorter surveys—and faced with the consequent challenge of balancing predictive accuracy and respondent fatigue. To accommodate such situations, researchers have developed various analytic options, some of which were evaluated in Dr. Ralph Wirth and Anette Wolfrath’s award-winning paper “Using MaxDiff for Evaluating Very Large Sets of Items: Introduction and Simulation-Based Analysis of a New Approach.” In Express MaxDiff each respondent only evaluates a subset of the larger list of items based on a blocked design and the analysis leverages HB modeling (Wirth and Wolfrath 2012). In Sparse MaxDiff each respondent sees each item less than the rule of thumb of 3 times (Wirth and Wolfrath 2012). In Augmented MaxDiff, MaxDiff is supplemented by Q-Sort informed phantom MaxDiff tasks (Hendrix and Drucker 2007). Augmented MaxDiff was shown to have the best predictive power, but comes at the price of significantly longer questionnaires and complex programming requirements (Wirth and Wolfrath 2012). Thus, these questions still remained:
1. Given the complex programming and additional questionnaire time, is Augmented MaxDiff worth doing or is Sparse MaxDiff doing a sufficient job?
2. If augmentation is valuable, how much is needed?
3. Could augmentation be done using only “best” items or should “worst” items also be included?
CASE STUDY AND AUGMENTATION PROCESS
To answer these questions regarding Augmented MaxDiff, we used a study of N=676 consumers with chronic pain. The study objectives were to determine the most motivating messages as well as the combination of messages that had the most reach. Augmented MaxDiff marries MaxDiff with Q-Sort. Respondents first go through the MaxDiff section per usual, completing the choice tasks as determined by an experimental design. Afterwards, respondents complete the Q-Sort questions. The Q-Sort questions allow researchers to ascertain additional inferred rankings on the tested items. The Q-Sort inferred rankings are used to create phantom MaxDiff tasks: MaxDiff tasks that weren’t actually asked of respondents, but for which researchers can infer from other data what the respondents would have selected. The phantom MaxDiff tasks are used to supplement the original MaxDiff tasks and thus create a super-charged CHO file (or response file) for utility estimation. See Figure 1 for an overview of the process.
Figure 1: MaxDiff Augmentation Process Overview
In our case study, there were 46 total messages tested via Sparse MaxDiff using 13 MaxDiff questions, 4 items per screen, and 26 blocks. Following the MaxDiff section, respondents completed a Q-Sort exercise. Q-Sort can be done in a variety of ways. In this case, MaxDiff responses for “most” and “least” were tracked via programming logic and entered into the two Q-Sort sections—one for “most” items and one for “least” items. The first question in the Q-Sort section for “most” items showed respondents the statements they selected as “most” throughout the MaxDiff screens and asked them to choose their top four. The second question in the Q-Sort section for “most” items asked respondents to choose the top one from the top four. The Q-Sort section for “least” items mirrored the Q-Sort section for “most” items. The first question in the Q-sort section for “least” items showed respondents the statements they selected
as “least” throughout the MaxDiff screens and asked them to choose their bottom four. The second question in the Q-Sort section for “least” items asked respondents to choose the bottom one from the bottom four. See Figure 2 for a summary of the Q-Sort questions.
Figure 2: Summary of Q-Sort Questions
From the Q-Sort section on “most” items, for each respondent researchers know: the best item from Q-Sort, the second best items from Q-Sort (of which there are three), and the third tier of best items from Q-Sort (the remaining “most” items not selected in Q-Sort section for “most” items). And from the Q-Sort section on “least” items, for each respondent researchers also know: the worst item from Q-Sort, the second to worst items from Q-Sort (of which there are three), and the third tier of worst items from Q-Sort (the remaining “least” items not selected in Q-Sort section for “least” items). The inferred rankings from this data are custom for each respondent, but at a high level we know:
The best item from Q-Sort (1 item) > All other items
The second best items from Q-Sort (3 items) > All other items except the best item from Q-Sort
The worst item from Q-Sort (1 item) < All other items
The second to worst items from Q-Sort (3 items) < All other items except the least item from Q-Sort
Using these inferred rankings, supplemental phantom MaxDiff tasks are created. Although respondents were not asked these questions, their answers can be inferred, assuming that respondents would have answered the new questions consistently with the observed questions.
Since the MaxDiff selections vary by respondent, the Q-Sort questions, the inferred rankings from Q-Sort, and finally the phantom MaxDiff tasks are also customized to each respondent. From respondents’ Q-Sort answers, we created 18 supplemental phantom MaxDiff tasks as well as the inferred “most” (noted with “M”) and “least” (noted with “L”) responses. See Figure 3 for the phantom MaxDiff tasks we created.
Figure 3: Supplemental phantom MaxDiff tasks using both the Q-Sort section on “most” items and the Q-Sort section on “least” items
Respondents’ Q-Sort answers are matched to the supplemental Q-Sort-Based MaxDiff tasks (i.e., the phantom MaxDiff tasks) to produce a new design file that merges both the original MaxDiff design and the Q-Sort supplements. Figure 4 illustrates the process in detail.
Figure 4: Generating a new design file that merges original MaxDiff with Q-Sort supplements (i.e., phantom MaxDiff tasks)
Likewise respondents’ Q-Sort answers are used to infer their responses to the phantom MaxDiff tasks to produce a new response file that merges both responses to the original MaxDiff and the Q-Sort supplements. The new merged design and response files are combined in a supercharged CHO file and used for utility estimation. Figure 5 provides an illustration.
Figure 5: Generating a new response file that merges original MaxDiff responses with responses to Q-Sort supplements (i.e., phantom MaxDiff tasks)
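The actual 18 item combinations in Figure 3 are fixed by the authors and are not reproduced here, but a minimal sketch of generating per-respondent phantom tasks from the Q-Sort tiers, with "most" and "least" inferred from the tier ordering, could look like this (hypothetical data layout):

```python
import itertools

def phantom_tasks(tiers, items_per_task=4, max_tasks=18):
    """tiers: dict item -> tier rank for one respondent (1 = best Q-Sort tier, larger = worse).
    Builds phantom MaxDiff tasks only where the inferred 'most' (unique lowest tier) and
    'least' (unique highest tier) can be read unambiguously from the tier ordering."""
    tasks = []
    for combo in itertools.combinations(tiers, items_per_task):
        ranks = [tiers[i] for i in combo]
        best, worst = min(ranks), max(ranks)
        if ranks.count(best) == 1 and ranks.count(worst) == 1:   # unambiguous most and least
            tasks.append({"items": combo,
                          "most": combo[ranks.index(best)],
                          "least": combo[ranks.index(worst)]})
        if len(tasks) == max_tasks:
            break
    return tasks
```

For Best-Only augmentation, only the tiers derived from the "most" Q-Sort section would feed the tier dictionary.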
EXPERIMENT
Our experiment sought to answer these questions:
1. Given the complex programming and additional questionnaire time, is Augmented MaxDiff worth doing or is Sparse MaxDiff doing a sufficient job?
2. If augmentation is valuable, how much is needed?
3. Could augmentation be done using only “best” items or should “worst” items also be included?
To answer these questions, the authors compared the model fit of Sparse MaxDiff with Augmented MaxDiff when two types of Q-Sort augmentations are done:
Augmentation of best (or top) items only.
Augmentation of both best and worst (or top and bottom) items.
Recall that we generated supplemental phantom MaxDiff tasks using both the Q-Sort section for the “most” items and the Q-Sort section for the “least” items (see Figure 3). When testing augmentation including only “best” items we created supplemental phantom MaxDiff tasks using only the Q-Sort section for the “most” items as shown in Figure 6. Again, responses for “most” (noted with “M”) and “least” (noted with “L”) for each phantom task can be inferred.
Figure 6: Supplemental phantom MaxDiff tasks using only the Q-Sort section on “most” items.
We also evaluated the impact of degree of augmentation on model fit by examining MaxDiff Augmentation including 3, 5, 7, 9, and 18 supplemental phantom MaxDiff tasks.
FINDINGS
As expected, heavier augmentation improves fit. Heavier augmentation appends more Q-Sort data, and Q-Sort data is presumably consistent with MaxDiff data. Thus heavier augmentation appends more consistent data, and we therefore expected overall respondent consistency measurements to increase. Percent Certainty and RLH are both higher for heavy augmentation compared to Sparse MaxDiff (i.e., no augmentation) and lighter augmentation, as shown in Figure 7.
Figure 7: Findings from our estimation experiment
Surprisingly, Best-Only Augmentation outperforms Best-Worst Augmentation even though less information is used with Best-Only Augmentation (Best-Only % Cert = 0.85 and RLH = 0.81; Best-Worst % Cert = 0.80 and RLH = 0.76). To further understand this unexpected finding, we did a three-way comparison of (1) Sparse MaxDiff without Augmentation (“No Q-Sort”) versus (2) Sparse MaxDiff with Best-Only Augmentation (“Q-Sort Best Only”) versus (3) Sparse MaxDiff with Best-Worst Augmentation (“Q-Sort Best & Worst”). The results showed that at the aggregate level, the story is the same regardless of whether augmentation is used. Spearman’s Rank Order correlation showed strong, positive, and statistically significant correlations between the three indexed MaxDiff scores, with rs(46) >= 0.986, p = .000 (see Figure 8). Note that for the two cases when augmentation was employed for this test, we used 18 supplemental phantom MaxDiff tasks.
Figure 8: Spearman’s Rank Order Correlation for indexed MaxDiff scores from (1) Sparse MaxDiff without Augmentation (“No Q-Sort”) versus (2) Sparse MaxDiff with Best-Only Augmentation (“Q-Sort Best Only”) versus (3) Sparse MaxDiff with Best-Worst Augmentation (“Q-Sort Best & Worst”)
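The aggregate-level comparison in Figure 8 uses a standard rank correlation; a minimal sketch with the indexed score arrays as hypothetical inputs:

```python
from scipy.stats import spearmanr

def compare_score_sets(no_qsort, best_only, best_worst):
    """Spearman rank correlations between indexed MaxDiff scores (one value per message)
    from the three estimation approaches."""
    return {
        "no_qsort_vs_best_only":   spearmanr(no_qsort, best_only),
        "no_qsort_vs_best_worst":  spearmanr(no_qsort, best_worst),
        "best_only_vs_best_worst": spearmanr(best_only, best_worst),
    }
```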
We further compared matches between (1) the top and bottom items based on MaxDiff scores versus (2) top and bottom items based on Q-Sort selections. An individual-level comparison was used to show the percent of times there is a match between Q-Sort top four items and MaxDiff top four items as well as between Q-Sort bottom four items and MaxDiff bottom four items. We found that at the respondent level the model is imprecise without augmentation. In particular, the model is less precise at the “best” end compared to the “worst” end (45% match on “best” items vs. 51% match on “worst”). In other words, researchers can be more sure about MaxDiff items that come out at the bottom as compared to MaxDiff items that come out at the top. This finding was consistent with results from a recent ART forum paper by Dyachenko, Naylor, & Allenby (2013). The implication for MaxDiff Augmentation is that augmenting on best items is critical due to the lower precision around those items.
CONCLUSIONS
As expected, at the respondent level Sparse MaxDiff is less precise than Sparse MaxDiff with augmentation. However, at the aggregate level the results of Sparse MaxDiff are similar to results with augmentation. Therefore, for studies where aggregate-level results are sufficient, Sparse MaxDiff is suitable. But for studies where stability around individual-level estimates is needed, augmenting Sparse MaxDiff is recommended. By augmenting on “best” items only, researchers can get a better return with a shorter questionnaire and less complex programming compared to augmenting on both “best” and “worst” items. MaxDiff results were shown to be less accurate at the “best” end, and augmentation on “best” items improved fit. Best-only augmentation requires a shorter questionnaire compared to augmentation on both “best” and “worst” items. Heavy augmenting, whether on “best” only or on both “best” and “worst,” is critical when other analyses (e.g., TURF, clustering) are required. The accuracy of utilities estimated
from heavy augmentation was superior to the accuracy of utilities estimated from lighter augmentation. Finally, if questionnaire real estate allows, obtain additional information from which the augmentation can benefit. For example, in the Q-Sort exercise, instead of asking respondents to select the top/bottom four and then the top/bottom one, ask respondents to rank the top/bottom four. This additional ranking information allows more flexibility in creating the item combinations for the supplemental phantom MaxDiff tasks and, we therefore hypothesize, better utility estimates.
Urszula Jones
Jing Yeh
REFERENCES
Dyachenko, Tatiana; Naylor, Rebecca; & Allenby, Greg (2013), “Models of Sequential Evaluation in Best-Worst Choice Tasks,” 2013 Sawtooth Software Conference Proceedings.
Tang, Jane; Grenville, Andrew (2010), “How Many Questions Should You Ask in CBC Studies?—Revisited Again,” 2010 Sawtooth Software Conference Proceedings, 217–232.
Wirth, Ralph; Wolfrath, Anette (2012), “Using MaxDiff for Evaluating Very Large Sets of Items: Introduction and Simulation-Based Analysis of a New Approach,” 2012 Sawtooth Software Conference Proceedings, 99–109.
WHEN U = βX IS NOT ENOUGH: MODELING DIMINISHING RETURNS AMONG CORRELATED CONJOINT ATTRIBUTES
KEVIN LATTERY
MARITZ RESEARCH
1. INTRODUCTION
The utility of a conjoint alternative is typically assumed to be a simple sum of the betas for each attribute. Formally, we define utility U = βx. This assumption is very reasonable and robust. But there are cases where U = βx is not enough; it is just too simple. One example of this, which is the focus of this paper, is when the utilities of attributes in a conjoint study are correlated. Correlated attributes can arise in many ways, but one of the most prevalent is with binary yes/no attributes. An example of binary attributes appears below, where an attribute marked with an x means that the benefit is being offered, and a blank means it is not.
[Example table: three program concepts, each described by two non-binary attributes and a subset of five binary benefits: discounts on equipment purchases; access to online equipment reviews by other members; early access to new equipment (information, trial, purchase, etc.); custom fittings; and members-only logo balls, tees, tags, etc. Program 1 shows Non-Binary Attribute 1 at Level 1 and Non-Binary Attribute 2 at Level 2; Program 2 shows Levels 3 and 2; Program 3 shows Levels 2 and 3. An x in a cell indicates the benefit is offered; Program 1 offers most of the binary benefits, while Program 3 offers very few.]
If the binary attributes are correlated, then adding more benefits does not give a consistent or steady lift to utility. The result is that the standard U = βx model tends to over-predict interest when there are more binary features (Program 1 above) and under-predict product concepts that have very few of the features (Program 3 above). This can be a critical issue when running simulations. One common marketing issue is whether a lot of smaller cheaper benefits can compensate for more significant deficiencies. Clients will simulate a baseline first product, and then a second cheaper product that cuts back on substantial features. They will then see if they can compensate for the second product’s shortcomings by adding lots of smaller benefits. We have seen many cases where simulations using the standard model suggest a client can improve a baseline product by offering a very inferior product with a bunch of smaller benefits. And even though this is good news to the client who would love this to be true, it seems highly dubious even to them.
The chart below shows the results of one research study. The x-axis is the relative number of binary benefits shown in an alternative versus the other alternative in its task. This conjoint study showed two alternatives, each with 1 to 3 binary benefits, so we simply take the difference in the number of benefits shown: alternative A with 1 benefit against alternative B with 3 benefits gives (1 - 3) = -2 for A and (3 - 1) = +2 for B. The vertical axis is the corresponding error (Predicted Share minus Observed Share) for the in-sample tasks.
This kind of systematic bias, where we overstate the share as we add more benefits, is extremely problematic when it occurs. While this paper focuses on binary attributes, the problem occurs in other contexts, and the new model we propose can be used in those as well. This paper also discusses some of the issues involved when we change the standard conjoint model to something other than U = βx. In particular, we discuss how it is not enough to simply change the functional form when running HB. Changing the utility function also requires other significant changes in how we estimate parameters.
2. LEARNING FROM CORRELATED ALTERNATIVES
There is not much discussion in the conjoint literature about correlated attributes, but there is a large body of work on correlated alternatives. Formally, we talk about the "Independence of Irrelevant Alternatives" (IIA) assumption, where the basic logit model assumes the alternatives are not correlated. The familiar blue bus/red bus case is an example of correlated alternatives. Introducing a new bus that is the same as the old bus with a different color creates correlation among the two alternatives: red bus and blue bus are correlated. The result is parallel to what we described above for correlated attributes: the model over-predicts how many people ride the bus. In fact, if we keep adding the same bus with different colors—yellow, green, fuchsia, etc.—the basic logit model will eventually predict that nearly everyone rides the bus.
One of the classic solutions to correlated alternatives is Nested Logit. With Nested Logit, alternatives within a nest are correlated with one another. In the case of buses, one would group the various colors of buses together into one nest of correlated alternatives. The degree of correlation is specified by an additional λ parameter. When λ = 1, there is no correlation among the alternatives in a nest, and as λ goes to 0, the correlation increases. One solves for λ, given the specific data. In the case of red bus/blue bus, we expect λ near 0.

The general model we propose follows the mathematical structure of nested logit, but instead of applying that structure to alternatives, we apply it to attributes. This means we think of the attributes as being in nests, grouping together those attributes that are correlated, and measuring the degree of that correlation by introducing an additional λ parameter. In our case, we are working at the respondent level, and each respondent will have their own λ value. For some respondents, a nest of attributes may have little correlation, while for others the attributes may be highly correlated.

Recall that the general formulation for a nested logit, where the nest is composed of alternatives 1 through n with non-exponentiated utilities U1 . . . Un, is:

e^U = [(e^U1)^(1/λ) + (e^U2)^(1/λ) + ... + (e^Un)^(1/λ)]^λ, where 0 < λ ≤ 1.

Applying that structure to a nest of attributes 1 through n, with betas B1 . . . Bn and binary indicators x1 . . . xn, our proposed utility for the nest is:

U = [(x1B1)^(1/λ) + (x2B2)^(1/λ) + ... + (xnBn)^(1/λ)]^λ, where λ ∈ (0,1] and Bi ≥ 0.

We are talking about diminishing returns, so we are assuming each beta is positive, but the return may diminish when it appears with other attributes in the same nest. (If the betas were not all positive, the idea of diminishing returns would not make any sense.) This new formulation has several excellent properties, which it shares with nested logit:

1) When λ = 1, the formulation reduces to the standard utility. This is also its maximum value.
2) As λ shrinks towards 0, the total utility also shrinks. At the limit of λ = 0, the total utility is just the utility of the attribute in the nest with the greatest value.
3) The range of utility values runs from the utility of the single best attribute to the simple sum of the attributes. We are shrinking the total utility from the simple sum down to an amount that is at least the single highest attribute.
4) Adding a new attribute (changing xi from 0 to 1) will always add some amount of utility (though it could be very close to 0). It is not possible to have reversals, where adding an item shrinks the total utility below what it was before adding the item.
5) The amount of shrinkage depends upon the size of the betas and their relative differences. Betas that are half as large will shrink half as much, and three nearly equal betas will shrink differently than three betas where one is much smaller.
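To make these properties concrete, here is a small numerical sketch (illustrative Python, not code from the paper; the betas are assumed values):

```python
import numpy as np

def nest_utility(betas, x, lam):
    """U = [sum_i (x_i * B_i)^(1/lam)]^lam for one nest of binary attributes."""
    terms = (np.asarray(x, dtype=float) * np.asarray(betas, dtype=float)) ** (1.0 / lam)
    return terms.sum() ** lam

betas = [2.0, 0.5, 0.2]   # assumed positive betas for three binary benefits in one nest
x = [1, 1, 1]             # all three benefits offered

for lam in (1.0, 0.5, 0.1, 0.01):
    print(lam, round(nest_utility(betas, x, lam), 3))
# lam = 1.0 reproduces the simple sum (2.7, the maximum); as lam shrinks toward 0,
# the total utility falls toward the single largest beta in the nest (2.0).
```

Property 4 follows from the same arithmetic: each term added to the bracketed sum is non-negative, so switching an attribute on can never lower the total.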
Before turning to the results using this nested attribute formulation, I want to briefly describe and evaluate some other methods that have been proposed to deal with the problem of diminishing returns among correlated attributes.

3. ALTERNATIVE SOLUTIONS TO DIMINISHING RETURNS
There are three alternative methods I will discuss here, which other practitioners have shared with me in their efforts to solve the problem of diminishing returns. Of course, there are likely other methods as well. The advantage of all the alternative methods discussed here is that they do not require one to change the underlying functional form, so they can be easily implemented, with some caveats.

3.1 Recoding into One Attribute with Many Levels
One method for modeling a set of binary attributes is to code them into one attribute with many levels. For example, 3 binary attributes A, B, C can be coded into one 7-level attribute: A, B, C, AB, AC, BC, ABC (or an 8-level attribute, if "none" is included as a level). This gives complete control over how much value each combination of benefits has. To implement it, you would also need to add six constraints:
– AB > A
– AB > B
– AC > A
– AC > C
– ABC > AB
– ABC > AC
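As an illustration only (not the author's implementation; the attribute names and data layout are assumed), recoding three binary benefits into a single combined attribute might look like this in Python:

```python
from itertools import combinations

# Hypothetical binary benefit columns in a conjoint design matrix (assumed names).
BINARY_ATTRS = ["A", "B", "C"]

# Enumerate every combination of the binary benefits, including "none",
# to form the levels of a single 8-level attribute (7 levels if "none" is dropped).
levels = [()]  # level 0 = none of the benefits
for r in range(1, len(BINARY_ATTRS) + 1):
    levels.extend(combinations(BINARY_ATTRS, r))
level_of = {frozenset(combo): idx for idx, combo in enumerate(levels)}

def recode(row):
    """Map one alternative's binary indicators (dict of 0/1) to the combined level."""
    present = frozenset(a for a in BINARY_ATTRS if row.get(a, 0) == 1)
    return level_of[present]

# Example: an alternative offering benefits A and C.
print(recode({"A": 1, "B": 0, "C": 1}))
```

The monotonicity constraints listed above would still need to be imposed on the part-worths of the combined attribute during estimation.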
With only 2–3 attributes, this method works fine. It adds quite a few more parameters to estimate, but is very doable. When we get to 4 binary attributes, we need to make a 15- or 16-level attribute with 28 constraints. That is more levels and constraints than most will feel comfortable including in the model. With 5 attributes, things are definitely out of hand, requiring an attribute with 31 or 32 levels and many constraints. While this method gives one complete control over every possible combination, it is not very parsimonious, even with 4 binary attributes. My recommendation is to use this approach only when you have 2–3 binary attributes.

3.2 Adding Interaction Effects
A second approach is to add interaction terms. For example, if one is combining attributes A and B, the resulting utility is A + B - AB, where the last term is an interaction value that represents the diminishing returns. When AB is 0, there is no diminishing.
It is not clear how this works with more attributes. With 3 binary attributes A, B, C, the total utility might be A + B + C - (AB + AC + BC + ABC). But notice that we have lots of interactions to subtract. When we get to 4 attributes, it is even more problematic, as we have 11 interactions to subtract out. If we use all possible interactions, we wind up with one degree of freedom per level, just as in the many-leveled attribute approach, so this method is like that of 3.1 in terms of the number of parameters. One might reduce the number of parameters by only subtracting the pairwise interactions, imposing a little more structure on the problem. Even pairs alone can still be quite a few interactions, however. The bigger problem with this method is how to make the constraints work. We must be careful not to subtract out too much for any specific combination; otherwise, adding a benefit could actually reduce the utility, a "reversal." This means trying to set up very complex constraints, such as (A + B + C) > (AB + AC + BC + ABC). Each interaction term will be involved in many such constraints. Such a complex constraint structure will most likely have a negative impact on estimation. I see no value in this approach, and recommend the recoding option in 3.1 as much easier to implement.

3.3 Adding One or More Variables for Number of Items
The basic idea of this approach is like that of subtracting interactions, but using a general term for the interactions based on the number of binary items, rather than estimating each interaction separately. So for two attributes A and B, the total utility would be A + B - k, and for 3 attributes it might be A + B + C - 2k. In this case, k represents the amount we are subtracting for each additional benefit. With 5 benefits, we would be subtracting 4k. The modeling involves adding an additional variable to the design code. This new variable counts the number of binary attributes in the alternative. So an alternative with 5 binary attributes has a new variable with a value of 4 (we use the number of attributes - 1). The value k can be estimated just like any other beta in the model, constrained for the proper sign. One can even generalize the approach by making it non-linear, perhaps adding a squared term for instance. Or we can make the amount subtracted vary for each number of benefits. So for two attributes: A + B - m, and for 3 attributes it might be A + B + C - n. In this very general case, we model a specific value for each number of binary attributes, with constraints applied so that we subtract out more as the number of attributes increases (so n > m above). The number of additional parameters to estimate is the range of the number of attributes in an alternative (minus 1). So if the design shows alternatives ranging from 1 binary benefit to 9, we would estimate 8 additional parameters, ordinally constrained. I think this approach is very clever, and in some cases it may work. The general problem I have with it is that it can create reversals. Take the simple case of two binary attributes, A + B - k. In that case, k should be smaller than A or B. The same applies to different pairs of attributes. In the case of C + D - k, we want k to also be less than C or D. In general we want k to be less than the minimum value of all the binary attributes.
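A minimal sketch of this count-of-items coding and of the reversal risk just described (illustrative Python; the betas and the value of k are assumed for the example):

```python
import numpy as np

# Hypothetical estimated betas for five binary benefits and an assumed penalty k.
binary_betas = np.array([2.0, 0.5, 0.2, 1.2, 0.8])
k = 0.6

def design_row(binary_flags):
    """Append the count-of-items variable (number of binary benefits minus 1)."""
    flags = np.asarray(binary_flags, dtype=float)
    count_var = max(flags.sum() - 1.0, 0.0)
    return np.concatenate([flags, [count_var]])

def utility(binary_flags):
    row = design_row(binary_flags)
    return row[:-1] @ binary_betas - k * row[-1]

# Reversal check: adding a benefit should never lower utility, but it can
# whenever k exceeds that benefit's beta (here, the benefits with betas 0.5 and 0.2).
base = utility([1, 0, 0, 0, 0])
with_small_item = utility([1, 0, 1, 0, 0])
print(with_small_item - base)  # negative => a reversal
```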
Things get even more complicated when we turn to 3 attributes. In the end, I'm not certain these constraints can be resolved. One may just have to accept reversals. In the results section, I will show the degree to which reversals occurred using this method in my data sets. This is not to say that the method would never be an improvement over the base model, but one should be forewarned about the possibility of reversals. Another disadvantage of this method is that shrinkage is purely a function of the number of items. Given 3 items, one will always subtract out the same value regardless of what those items are. In contrast, the nested formulation shrinks the total utility based on the actual betas: 3 items with low utility are shrunk less than 3 items with high utility, the shrinkage is proportional to their scale and depends on their relative sizes, and it cannot exceed the original effects. Basing shrinkage on the actual betas is most likely better than shrinkage based solely on the number of items. That said, making shrinkage a function of the betas does impose more difficulty in estimation, a topic we turn to in the next section.
4. ESTIMATION OF NESTED ATTRIBUTE FORMULATION (LATENT CLASS ENSEMBLES AND HB)
Recall that the proposed formula for a nest of attributes 1 through n, with corresponding betas Bi and indicators xi, is:

U = [(x1B1)^(1/λ) + (x2B2)^(1/λ) + ... + (xnBn)^(1/λ)]^λ, where λ ∈ (0,1] and Bi ≥ 0.

In the results section, we estimated this model using Latent Class Ensembles. This method was developed by Kevin Lattery and presented in detail in an AMA ART Forum 2013 paper, "Improving Latent Class Conjoint Predictions With Nearly Optimal Cluster Ensembles." As we mention there, the algorithm is run using custom SAS IML code. We give a brief description of the method below.

Latent Class Ensembles is an extension of Latent Class. But rather than one Latent Class solution, we develop many Latent Class solutions. This is easy to do because Latent Class is subject to local optima. By using many different starting points and relaxing the convergence criteria (max LL/min LL > .999 for the last 10 iterations), we create many different latent class solutions. These different solutions form an ensemble. Each member of the ensemble (i.e., each specific latent class solution) gives a prediction for each respondent. We average over those predictions (across the ensemble) to get the final prediction for each respondent. So if we generate 30 different Latent Class solutions, we get 30 different predictions for a specific respondent, and we average across those 30 predictions. This method significantly improves the predictive power versus a single Latent Class solution.

The primary reason for using Latent Class Ensembles is that the model was easier to estimate. One of the problems with nested logit is that it is subject to local maxima. So it is common to estimate nested logits in three steps: 1) assume λ = 1 and estimate the betas; 2) keep the betas fixed from 1) and estimate λ; 3) use the betas and λ from 2) as starting points to estimate both simultaneously. We employed this same three-step process in the latent class ensemble approach. So each iteration of the latent class algorithm required three regressions for one segment, rather than one. But other than that, we simply needed to change the function and do logistic regression.
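To show the mechanics, here is a conceptual sketch of the three-step, sequential-then-simultaneous estimation for a single segment or respondent group (illustrative Python on simulated data, not the custom SAS IML code used for the paper; it estimates 1/λ directly, bounded to [1, 5], following the reparameterization described next):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_tasks, n_alts, n_attrs = 200, 3, 5

def nest_utility(X, betas, inv_lam):
    # U = [sum_i (x_i * B_i)^(1/lambda)]^lambda, written with inv_lam = 1/lambda
    return ((X * betas) ** inv_lam).sum(axis=-1) ** (1.0 / inv_lam)

# Simulated choice data: X[t, j, a] = binary indicator, y[t] = chosen alternative.
X = rng.integers(0, 2, size=(n_tasks, n_alts, n_attrs)).astype(float)
true_betas = np.array([2.0, 0.5, 0.2, 1.2, 0.8])
y = np.argmax(nest_utility(X, true_betas, 2.5) + rng.gumbel(size=(n_tasks, n_alts)), axis=1)

def neg_loglik(betas, inv_lam):
    U = nest_utility(X, betas, inv_lam)
    U = U - U.max(axis=1, keepdims=True)           # numerical stability
    logp = U - np.log(np.exp(U).sum(axis=1, keepdims=True))
    return -logp[np.arange(n_tasks), y].sum()

bounds_b = [(1e-6, None)] * n_attrs                # betas constrained non-negative

# Step 1: assume lambda = 1 (inv_lam = 1) and estimate the betas.
step1 = minimize(lambda b: neg_loglik(b, 1.0), np.full(n_attrs, 0.5),
                 bounds=bounds_b, method="L-BFGS-B")

# Step 2: hold those betas fixed and estimate 1/lambda on [1, 5].
step2 = minimize(lambda v: neg_loglik(step1.x, v[0]), [1.0],
                 bounds=[(1.0, 5.0)], method="L-BFGS-B")

# Step 3: estimate betas and 1/lambda simultaneously from those starting values.
theta0 = np.concatenate([step1.x, step2.x])
step3 = minimize(lambda t: neg_loglik(t[:-1], t[-1]), theta0,
                 bounds=bounds_b + [(1.0, 5.0)], method="L-BFGS-B")
print("betas:", step3.x[:-1].round(2), "lambda:", round(1.0 / step3.x[-1], 2))
```

In the ensemble approach described above, a fit along these lines (weighted by class membership) would replace the single logistic regression inside each latent class iteration.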
One additional wrinkle is that we estimated 1/λ rather than λ. We then set the constraint on 1/λ to range from 1 to 5. Using the inverse broadens the range of what is being estimated and makes it easier to estimate than the equivalent λ ranging from .2 to 1. In some cases I have capped 1/λ at 10 rather than 5. If one allows 1/λ to get too large, it can cause calculation overflow errors. At .2, the level of shrinkage is quite close to that one would find at the limit of 0.

Our intent was to also use the nested formulation within HB. That, however, proved to be more difficult. Simply changing the functional form and constraining 1/λ did not work at all. HB did not converge, and any result we tried performed more poorly than the traditional non-nested formula. We tried estimation first using our Maritz-specific HB package in SAS and then also using the R package ChoiceModelR. Both of these had to be modified for the custom function. One problem is that λ is not a typical parameter like the betas. Instead, λ is something we apply to the betas in exponential form. So within HB, λ should not be included in the covariance matrix of betas. In addition, we should draw from a separate distribution for λ, including a separate jumping factor in the Gibbs sampler. This then raises the question of order of estimation, especially given the local optima that nested functions often have. My recommendation is as follows:

1) Estimate the betas without λ (assume λ = 1)
2) Estimate 1/λ using its own 1-attribute variance matrix, assuming fixed betas from above
3) Estimate betas and 1/λ starting with the covariance matrix of betas in 1) and the 1/λ covariance matrix in 2)

We have not actually programmed this estimation method, but offer it as a suggestion for those who would like to use HB. It is clear to us that one needs, at the very least, separate draws and covariance matrices for betas and 1/λ. The three steps above recognize that and adopt the conservative null hypothesis that λ will be 1, moving away from 1 only if the data support that. The three steps parallel the three steps taken in the Latent Class ensembles, and the procedure of nested logit more generally: sequential, then simultaneous. My recommendation would be to first do step 1 across many MCMC iterations until convergence, estimating betas only for several thousand iterations. Then do step 2 across many MCMC iterations, estimating λ. Finally, do step 3, which differs from the first two steps in that each iteration estimates new betas and then new λs.
5. EMPIRICAL RESULTS
Earlier we showed the results for the standard utility HB model. When we apply the nested attribute utility function (estimated with Latent Class ensembles), we get the following result. We show the new nested method just to the right of the standard method that was shown in the original figure:
[Figure: in-sample error by relative number of binary benefits, for the Nested Model and the Standard Utility model]
The nested formulation not only shows a reduction in error, but removes most of the number-of-items bias. The error is now almost evenly distributed across the relative number of items. The data set above is our strongest case study of the presence of item bias. We believe there are a few reasons for this, which make this data set somewhat unique. First, this study only measured binary attributes. Most conjoint studies have a mixture of binary and non-binary attributes. When there is error in the non-binary attributes, the number-of-attributes bias is less apparent. A second reason is that this study showed exactly two alternatives. This makes it easy to plot the difference in the number of items in each task. If alternative A showed 4 binary attributes and alternative B showed 3 binary attributes, the difference is +1 for A and -1 for B. In most of our studies we have more than 2 alternatives, so the relative number of binary attributes is more complicated. The next case study had a more complex design, with 3 alternatives and both binary and non-binary attributes. Here is a sample task:
[Sample task: three loyalty programs compared on Points Earned (100 / 200 / 400 pts per $100), Room Upgrades (None / 2 per year / Unlimited), Food & Beverage Discount (None / 10% Off / 15% Off), Frequent Flyer Miles (1,500 miles per stay / 500 per stay / 1,000 miles per stay), and a bulleted list of Additional Benefits. Loyalty Program 1 lists five additional benefits (including early check-in and extended check-out, and turndown service), Loyalty Program 2 lists two (including 2 one-time guest passes to an airport lounge), and Loyalty Program 3 lists one (a complimentary welcome gift/snack). Other benefits shown across the programs include a gift shop discount, priority check-in and express check-out, complimentary breakfast, and no blackout dates for reward nights.]
In this case, we estimated the relative number of binary attributes by subtracting from the mean. The mean number of binary attributes in the example above is (5+2+1)/3 = 2.67. We then subtract the mean from the number of binary attributes in each alternative. So alternative 1 has 5 - 2.67 = 2.33 more binary attributes than average. Alternative 2 has 2 - 2.67 = -.67, and alternative 3 has 1 - 2.67 = -1.67. Note that these calculations are not used in the modeling; we are only doing them to facilitate plotting the bias in the results. The chart below shows the results for each in-sample task. The correlation is .33, much higher than it should be; ideally, there should be no relationship here.
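For example, the bookkeeping for the sample task above is just (illustrative Python; used only for plotting, not in the model):

```python
counts = [5, 2, 1]                       # binary benefits shown in each alternative
mean_count = sum(counts) / len(counts)   # 2.67
relative = [round(c - mean_count, 2) for c in counts]
print(relative)                          # [2.33, -0.67, -1.67]
```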
The slope of the line is 2.4%, which means that each time we add an attribute we are overstating share by an average of 2.4%. This assumes a 3-alternative scenario. With two alternatives that slope would likely be higher. Clearly there is systematic bias. It is not as clean as
the first example, in part because we have noise from other attributes as well. Moreover, we should note that this is in-sample share. Ideally, one would like out-of-sample tasks, which we expect will show even more bias than above. If you see the bias in-sample it is even more likely to be found out-of-sample. Using the new model, the slope of the line relating relative number of attributes to error in share is only 0.9%. This is much closer to the desired value of 0. We also substantially reduced the mean absolute error, from 5.8% to 2.6%.
The total sample size for this case study was 1,300. 24.7% of those respondents had a λ of 1, meaning no diminishing returns for them. 34.7% had a λ of .2, the smallest used in this model. The remaining 40.7% had a median of .41, so the λ values skewed lower. The chart below shows the cumulative percentage of λ values, flipped at the median. Clearly there were many λ values significantly different from 1.
[Figure: cumulative distribution of estimated λ values (flipped at the median, so cumulative from each end to the middle)]
We also fit a model using the method discussed in section 3.3, adding a variable equal to the number of binary attributes minus 1. This offered a much smaller improvement: the slope of the bias line was reduced from 2.4% to 1.8%. So if one is looking for a simple approach that may offer some help in reducing systematic error, the section 3.3 approach may be worth considering. However, we remain skeptical of it because of the problem of reversals, which we discuss below. One drawback to the number-of-binary-attributes method is reversals. These occur when the constant we are subtracting is greater than the beta for an attribute. Properly fixing these reversals is extremely difficult. One attribute, Gift Shop Discount, showed a reversal at the aggregate level: on its own it added benefit, but when you added it to any other existing binary attribute the predicted result was lower. Clearly, this is counterintuitive and would need to be fixed in such a model. It turns out that every one of the 24 binary attributes had a reversal for some respondents, using the method in 3.3. In addition to "Gift Shop Discount," two other attributes had reversals for over 40% of the respondents. This is clearly an undesirable level of reversals, and could show up as reversals in aggregate calculations for some subgroups. For this reason, we remain skeptical of using this method. The nested logit formulation never produces reversals.
6. NOTES ON THE DESIGN OF BINARY ATTRIBUTES
While the focus of this paper has been on the modeling of binary attributes, there are a few crucial comments I need to make about the experimental design with sets of binary attributes. After presenting initial versions of this paper, one person exclaimed that, given a set of say 8 binary attributes, they like to show the same number of binary attributes in each alternative and task. For example, the respondent will always see 3 binary attributes in each alternative and task. This makes the screen look perfectly square and gives the respondent a consistent experience. Of course, it also means you won't notice any number-of-attributes bias. But it doesn't mean the bias is not there; we are just in denial. In addition to avoiding (but not solving) the issue, showing the same number of binary attributes is typically an absolutely terrible idea. If one always shows 3 binary attributes, then
your model can only predict what happens if each alternative has the same number of binary attributes. The design is inadequate to make any other predictions. You really don't know anything about 2 binary vs. 3 binary vs. 4 binary, etc. To explain this point, consider the following 3-alternative task, with the betas shown for the attributes present in each alternative (each alternative carries exactly 3 of the 5 binary attributes):

Program 1: Non-Binary1 = 1.0, Non-Binary2 = 1.5, Binary 1 = 2.0, Binary 2 = 0.5, Binary 3 = 0.2; Utility = 5.2; Exponentiated U = 181.3; Probability = 81.1%
Program 2: Non-Binary1 = -3.0, Non-Binary2 = 1.5, Binary 1 = 2.0, Binary 4 = 1.2, Binary 5 = 0.8; Utility = 2.5; Exponentiated U = 12.2; Probability = 5.5%
Program 3: Non-Binary1 = 2.0, Non-Binary2 = -0.5, Binary 2 = 0.5, Binary 3 = 0.2, Binary 4 = 1.2; Utility = 3.4; Exponentiated U = 30.0; Probability = 13.4%
Now what happens if we add 2 to each binary attribute? We get the values below:

Program 1: Non-Binary1 = 1.0, Non-Binary2 = 1.5, Binary 1 = 4.0, Binary 2 = 2.5, Binary 3 = 2.2; Utility = 11.2; Exponentiated U = 73,130; Probability = 81.1%
Program 2: Non-Binary1 = -3.0, Non-Binary2 = 1.5, Binary 1 = 4.0, Binary 4 = 3.2, Binary 5 = 2.8; Utility = 8.5; Exponentiated U = 4,915; Probability = 5.5%
Program 3: Non-Binary1 = 2.0, Non-Binary2 = -0.5, Binary 2 = 2.5, Binary 3 = 2.2, Binary 4 = 3.2; Utility = 9.4; Exponentiated U = 12,088; Probability = 13.4%
The final predictions are identical! In fact, we can add any constant value to each binary attribute and get the same prediction. So given our design, each binary attribute is really Beta + k, where k can be anything. In general this is really bad, because the choice of k makes a big difference when you vary the number of attributes. Comparing a 2-item alternative with a 5-item alternative results in a 3k difference in utilities, which leads to a big difference in estimated shares or probabilities. The value of k matters in the simulations, but the design won't let us estimate it! The only time this design is not problematic is when you keep the number of binary attributes the same in every alternative for your simulations as well as in the design. To do any modeling on how a mixed number of binary attributes works, the design must also have a mixed number of binary attributes. This is true entirely independently of whether we're trying to account for diminishing returns.
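A quick numerical check of this identification problem (illustrative Python, not from the paper), using the utilities from the task above, where every alternative carries exactly three binary benefits:

```python
import numpy as np

# Total utilities from the first table: non-binary terms plus three binary betas each.
base_utils = np.array([
    1.0 + 1.5 + (2.0 + 0.5 + 0.2),   # Program 1
    -3.0 + 1.5 + (2.0 + 1.2 + 0.8),  # Program 2
    2.0 - 0.5 + (0.5 + 0.2 + 1.2),   # Program 3
])

def shares(utilities):
    expu = np.exp(utilities)
    return expu / expu.sum()

for k in (0.0, 2.0, 10.0):
    # Adding k to every binary beta adds the same constant (3k) to every alternative,
    # because each alternative carries exactly 3 binary benefits.
    print(k, shares(base_utils + 3 * k).round(3))
# Every k gives the same shares (about 0.811, 0.055, 0.134): k cannot be identified.
```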
In general, we recommend you have as much variability in your design as you want to simulate. Sometimes this is not reasonable, but it should always be the goal. One of the more common mistakes conjoint designers make is to let each binary attribute be randomly present or absent. So if one has 8 binary attributes, most of the alternatives have about 4 binary attributes. Very few tasks, if any, will show 1 binary attribute vs. 8 binary attributes. But it is important that the design show this range of 1 vs. 8 if we are to accurately model this extreme difference. My recommendation is to create designs where the difference in the number of binary attributes ranges from extreme to small, and to distribute tasks evenly across this range. For example, each respondent might see 3 tasks with extreme differences, 3 with somewhat extreme differences, 3 with moderate differences, and 3 with minimal differences. The key is to get different levels of contrast in the relative number of binary attributes if one wants to detect and model those kinds of differences.
7. CONCLUSION
The correlation of effects among attributes in a conjoint study means that our standard U = βx may not be adequate. One way to deal with that correlation is to borrow the formulation of nested logit, which was meant to deal with correlated alternatives. More specifically, the utility for a nest of attributes 1 . . . n was defined as:

U = [(x1B1)^(1/λ) + (x2B2)^(1/λ) + ... + (xnBn)^(1/λ)]^λ

Employing that formulation in HB has its challenges, as one should specify separate draws and covariance matrices for the betas and λ. We recommended a three-stage approach for HB estimation:

1) Estimate betas without λ (assume λ = 1)
2) Estimate 1/λ using its own 1-attribute variance matrix, assuming fixed betas from above
3) Estimate betas and 1/λ starting with the covariance matrix of betas in 1) and the 1/λ covariance matrix in 2)

While we did not validate the nested logit formulation in HB, we did test a similar three-step approach using the methodology of latent class ensembles: sequential estimation of betas, followed by estimation of λ, and then a simultaneous estimation. The nested attribute model estimated this way significantly reduced the overstatement of share that happens with the standard model when adding correlated binary attributes. In this paper we have only discussed grouping all of the binary attributes together in one nest. But of course, that assumption is most likely too simplistic. In truth, some of the binary attributes may be correlated with each other, while others are not. Our next steps are to work on ways to determine which attributes belong together, and whether that might vary by respondent. There are several possible ways one might group attributes together. Judgment is one method, based on one's conceptual understanding of which attributes belong together. In some cases, our judgment may even structure the survey design itself.
Another possibility is empirical testing of different nesting models, much like the way one tests different path diagrams in PLS or SEM. We also plan to test rating scales. By asking respondents to rate the desirability of attributes we can use correlation matrices and variable clustering to determine which attributes should be put into a nest with one another. As noted in the paper, one might even create hierarchies of nests, as one does with nested logits. We have just begun to develop the possibilities of nested correlated attributes, and welcome further exploration.
Kevin Lattery
RESPONDENT HETEROGENEITY, VERSION EFFECTS OR SCALE? A VARIANCE DECOMPOSITION OF HB UTILITIES
KEITH CHRZAN, AARON HILL
SAWTOOTH SOFTWARE
INTRODUCTION
Common practice among applied marketing researchers is to analyze discrete choice experiments using Hierarchical Bayesian multinomial logit (HB-MNL). HB-MNL analysis produces a set of respondent-specific part-worth utilities which researchers hope reflect heterogeneity of preferences among their samples of respondents. Unfortunately, two other potential sources of heterogeneity, version effects and utility magnitude, could create preference-irrelevant differences in part-worth utilities among respondents. Using data from nine commercial choice experiments (provided by six generous colleagues) and from a carefully constructed data set using artificial respondents, we seek to quantify the relative contribution of version effects and utility magnitude to heterogeneity of part-worth utilities. Any heterogeneity left unexplained by these two extraneous sources may represent real differences in preferences among respondents.
BACKGROUND
Anecdotal evidence and group discussions at previous Sawtooth Software conferences have identified version effects and differences in utility magnitudes as sources of preference-irrelevant heterogeneity among respondents. Version effects occur when respondents who receive different versions or blocks of choice questions end up with different utilities. One of the authors recalls stumbling upon this by accident, using HB utilities in a segmentation study only to find that the needs-based segment a respondent joined depended to a statistically significant extent on the version of the conjoint experiment she received. We repeated these analyses on the first five data sets made available to us. First we ran cluster analysis on the HB utilities using the ensembles analysis in CCEA. Cross-tabulating the resulting cluster assignments by version numbers, we found significant χ² statistics in three of the five data sets. Moreover, of the 89 utilities estimated in the five models, analysis of variance identified significant F statistics for 31 of them, meaning they differed by version. Respondents may have larger or smaller utilities as they answer more or less consistently or as their individual utility model fits their response data more or less well. The first data set made available to us produced a fairly typical finding: the respondent with the most extreme utilities (as measured by their standard deviation) had part-worths 4.07 times larger than those for the respondent with the least extreme utilities. As measures of respondent consistency, differences in utility magnitude are one (but not the only) manifestation of the logit scale parameter (Ben-Akiva and Lerman 1985). Recognizing that the response error quantified by the scale parameter can
affect utilities in a couple of different ways, we will refer to this effect as one of utility magnitude rather than of scale. With evidence of both version effects and utility magnitude effects, we undertook this research to quantify how much of between-respondent differences in utilities they explain. If it turns out that these explain a large portion of the differences we see in utilities across respondents we will have to question how useful it is to have respondent-level utilities:
If much of the heterogeneity we observe owes to version effects, perhaps we should keep our experiments small enough (or our questionnaires long enough) to have just a single version of our questionnaire, to prevent version effects; of course, this raises the question of which one version would be the best one to use.
If differences in respondent consistency explain much of our heterogeneity, then perhaps we should avoid HB models and their illusory view of preference heterogeneity.
If these two factors explain very small portions of observed utility heterogeneity, however, then we can be more confident that our HB analyses are measuring preference heterogeneity.
VARIANCE DECOMPOSITION OF COMMERCIAL DATA SETS
In order to decompose respondent heterogeneity into its components we needed data sets with more than a single version, but with enough respondents per version to give us statistical power to detect version effects. Kevin Lattery (Maritz Research), Jane Tang (Vision Critical), Dick McCullough (MACRO Consulting), Andrew Elder (Illuminas), and two anonymous contributors generously provided disguised data sets that fit our requirements. The nine studies differed in terms of their designs, the number of versions, and the number of respondents they contained:

Study  Design                   Number of versions  Total sample size
1      5 x 4^3 x 3^2            10                  810
2      23 items/14 quints       8                   1,000
3      15 items/10 quads        6                   2,002
4      12 items/9 quads         8                   1,624
5      47 items/24 sextuples    4                   450
6      90 items/46 sextuples    6                   527
7      8 x 4^2 x 3^4 x 2        2                   5,701
8      5^2 x 4^2 x 3^2          10                  148
9      4^2 x 3^5 x 2 + NONE     6                   1,454
For each study we ran the HB-MNL using CBC/HB software and using settings we expected practitioners would use in commercial applications. For example, we ran a single HB analysis across all versions, not a separate HB model for each version; estimating a separate covariance matrix for each version could create a lot of additional noise in the utilities. Or again, we ran the CBC/HB model without using version as a covariate: using the covariate could (and did) exaggerate the importance of version effects in a way a typical commercial user would not. With the utilities and version numbers in hand we next needed a measure of utility magnitude. We tried four different measures and decided to use the standard deviation of a
respondent's part-worth utilities. This measure had the highest correlations with the other measures; all were highly correlated with one another, however, so the results reported below are not sensitive to this decision. For variance decomposition we considered a variety of methods, but they all came back with very similar results. In the end we opted to use analysis of covariance (ANCOVA). ANCOVA quantifies the contribution of categorical (version) and continuous (utility magnitude) predictors on a dependent variable (in this case a given part-worth utility). We ran the ANCOVA for all utilities in a given study and then report the average contribution of version and utility magnitude across the utilities as the result for the given study. The variance unexplained by either version effect or utility magnitude may owe to respondent heterogeneity of preferences or to other sources of heterogeneity not included in the analysis. Because of an overlap in the explained variance from the two sources, we ran an averaging-over-orderings ANCOVA to distribute the overlapping variance. So what happened? The following table relates results for the nine studies and for the average across them.

Study  Design                   Versions  Sample size  Variance (magnitude) %  Variance (version) %  Variance (residual) %
1      5 x 4^3 x 3^2            10        810          6.2                     4.9                   88.9
2      23 items/14 quints       8         1,000        8.0                     1.1                   90.9
3      15 items/10 quads        6         2,002        6.4                     0.1                   93.5
4      12 items/9 quads         8         1,624        10.3                    1.1                   88.6
5      47 items/24 sextuples    4         450          7.9                     0.7                   91.4
6      90 items/46 sextuples    6         527          17.0                    1.2                   81.8
7      8 x 4^2 x 3^4 x 2        2         5,701        7.3                     0                     92.7
8      5^2 x 4^2 x 3^2          10        148          27.8                    3.3                   68.9
9      4^2 x 3^5 x 2 + NONE     6         1,454        6.7                     1.1                   92.2
Mean                                                   10.8                    1.4                   87.8
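As an illustration of the per-utility decomposition described above (assumed, simplified Python on simulated data, not the authors' code; the averaging-over-orderings step that distributes overlapping variance is omitted):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical respondent-level data: one part-worth utility, the questionnaire
# version each respondent received, and utility magnitude (SD of their part-worths).
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "version": rng.integers(1, 11, n),
    "magnitude": rng.gamma(2.0, 0.5, n),
})
df["utility"] = 0.3 * df["magnitude"] + rng.normal(0, 1, n)

# ANCOVA: categorical version effect plus continuous magnitude effect.
model = smf.ols("utility ~ C(version) + magnitude", data=df).fit()
aov = anova_lm(model, typ=2)

# Rough share of the sum of squares attributed to each source for this one utility;
# in the paper this is averaged across all the utilities in a study.
total_ss = aov["sum_sq"].sum()
print((aov["sum_sq"] / total_ss * 100).round(1))
```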
On average, the bulk of heterogeneity owes neither to version effects nor to utility magnitude effects. In other words, up to almost 90% of measured heterogeneity may reflect preference heterogeneity among respondents. Version has a very small effect on utilities—statistically significant, perhaps, but small. Version effects account for just under 2% of utility heterogeneity on average and never more than 5% in any study. Heterogeneity owing to utility magnitudes explains more—about 11% on average and as high as 27.8% in one of the studies (the subject of that study should have been highly engaging to respondents—they were all in the market for a newsy technology product; the same level of engagement should also have been present in study 7, however, based on a very high response rate in that study and the importance respondents would have placed on the topic). In other words, in some studies a quarter or more of observed heterogeneity may owe to magnitude effects that reflect only differences in respondent consistency. Clearly magnitude effects merit attention, and we should consider removing them when appropriate (e.g., for reporting of utilities but not in simulators) through scaling like Sawtooth Software's zero-centered diffs.
IS THERE A MECHANICAL SOURCE OF THE VERSION EFFECT?
It could be that the version effect is mechanical—something about the separate versions itself causes the effect. To test this we looked at whether the version effect occurs among artificial respondents with additive, main-effects utilities: if it does, then the effect is mechanical and not psychological. Rather than start from scratch and create artificial respondents who might have patterns of preference heterogeneity unlike those of humans, we started with utility data from human respondents. For a worst-case analysis, we used the study with the highest contribution from the version effect, study 1 above. We massaged these utilities gently, standardizing each respondent's utilities so that all respondents had the same magnitude of utilities (the same standard deviation across utilities) as the average of the human respondents in study 1. In doing this we retain a realistic human-generated pattern of heterogeneity, while at the same time removing utility magnitude as an explanation for any heterogeneity. Then we added logit choice rule-consistent, independently and identically distributed (i.i.d.) errors from a Gumbel distribution to the total utility of each alternative in each choice set for each respondent. We had our artificial respondents choose the highest-utility alternative in each choice set, and the choice sets constituted the same versions the human respondents received in Study 1. Finally, we ran HB-MNL to generate utilities. When we ran the decomposition described above on the utilities, the version effects virtually disappeared, with the variance explained by version effects falling from 4.9% of observed heterogeneity for human respondents to 0.04% among the artificial respondents. Thus a mechanical source does not explain the version effect.
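A minimal sketch of the artificial-respondent choice mechanism just described (illustrative Python with assumed dimensions; in the study this was done for every task in every version and the resulting choices were re-estimated with HB-MNL):

```python
import numpy as np

rng = np.random.default_rng(7)

# One artificial respondent: start from human part-worths (simulated here) and
# standardize them so every respondent has the same utility magnitude.
betas = rng.normal(0, 1, 8)
betas = betas / betas.std()          # same SD for everyone (the paper used the
                                     # average SD of the human respondents)

# One choice set from the respondent's assigned version (rows = alternatives).
task = rng.integers(0, 2, size=(3, 8)).astype(float)

# Add i.i.d. Gumbel errors to each alternative's total utility and choose the maximum,
# which is equivalent to choosing according to the logit choice rule.
total_utility = task @ betas + rng.gumbel(0.0, 1.0, size=3)
print(int(np.argmax(total_utility)))  # index of the chosen alternative
```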
DO CONTEXT EFFECTS EXPLAIN THE VERSION EFFECT?
Perhaps context effects could explain the version effect. Perhaps some respondents see particular levels or level combinations early in their survey and then answer the remainder of the survey differently than respondents who saw different initial questions. At the conference the suggestion came up that we could investigate this by checking whether the version effect differs in studies wherein choice tasks appear in random orders versus those in which choice tasks occur in a fixed order. We went back to our nine studies and found that three of them showed tasks in a random order while six had their tasks asked in a fixed order within each version. It turned out that studies with randomized task orders had slightly smaller version effects (explaining 0.7% of observed heterogeneity) than those with tasks asked in a constant order in each version (explaining 1.9% of observed heterogeneity). So the order in which respondents see choice sets may explain part of the small version effect we observe. Knowing this, we can randomize question order within versions to obscure, but not remove, the version effect.
CONCLUSIONS/DISCUSSION
The effect of version on utilities is often significant but invariably small. It accounts for an average of under 2% of total observed heterogeneity in our nine empirical studies and in no case as much as 5%. A much larger source of variance in utilities owes to magnitude differences, an average of almost 11% in our studies and nearly 28% in one of them. The two effects together
account for about 12% of total observed heterogeneity in the nine data sets we investigated and thus do not by themselves explain more than a fraction of the total heterogeneity we observe. We would like to say that this means that the unexplained 88% of variance in utilities constitutes true preference heterogeneity among respondents but we conclude more cautiously that true preference heterogeneity or as yet unidentified sources of variance explain the remaining 88%. Some of our simulation work points to another possible culprit: differences in respondent consistency (reflected in the logit scale factor) may have both a fixed and a random component to their effect on respondent-level utilities. Our analysis, using utility magnitude as a proxy, attempts to pull out the fixed component of the effect respondent consistency has on utilities but it does not capture the random effect. This turns out to be a large enough topic in its own right to take us beyond the scope of this paper. We think it best left as an investigation for another day.
Keith Chrzan
Aaron Hill
REFERENCES
Ben-Akiva, M. and S.R. Lerman (1985), Discrete Choice Analysis: Theory and Application to Travel Demand. Cambridge, MA: MIT Press.
FUSING RESEARCH DATA WITH SOCIAL MEDIA MONITORING TO CREATE VALUE
KARLAN WITT & DEB PLOSKONKA
CAMBIA INFORMATION GROUP
OVERVIEW
Many organizations collect data across multiple stakeholders (e.g., customers, shareholders, investors) and sources (e.g., media, research, investors), intending to use this information to become more nimble and efficient, able to keep pace with rapidly changing markets while proactively managing business risks. In reality, companies often find themselves drowning in data, struggling to uncover relevant insights or common themes that can inform their business and media strategy. The proliferation of social media, and the accompanying volume of news and information, has added a layer of complexity to the already overwhelming repository of data that organizations are mining. So the data is "big" long before the promise of "Big Data" leverage and insights is seen. Attempts to analyze data streams individually often fail to uncover the most relevant insights, as the resulting information typically remains in its original silo rather than informing the broader organization. When consuming social media, companies are best served by fusing social media data with data from other sources to uncover cohesive themes and outcomes that can be used to drive value across the organization.
USING DATA TO UNLOCK VALUE IN AN ORGANIZATION
Companies typically move through three distinct stages of data integration as they try to use data to unlock value in their organization:

[Figure: Stages of Data Integration]
Stage 1: Instill Appetite for Data
This early stage often begins with executives saying, “We need to track everything important to our business.” In Stage 1, companies develop an appetite for data, but face a number of challenges putting the data to work. The richness and immediacy of social media feedback provides opportunity for organizations to quickly identify risks to brand health, opportunities for customer engagement, and other sources of value creation, making the incorporation of social
media data an important component of business intelligence. However, these explosively rich digital channels often leave companies drowning in data. With over 300 million Facebook users, 200 million bloggers, and 25 million Twitter users, companies can no longer control the flood of information circulating about their firms. Translating data into metrics can provide some insights, but organizations often struggle with:

Interacting with the tools set up for accessing these new data streams;
Gaining a clear understanding of what the metrics mean, what a strong or weak performance on each metric would look like, and what impact they might represent for the business; and
Managing the volume of incoming information to identify what should be escalated for potential action.

To address these challenges, data streams are often driven into silos of expertise within an organization. In Stage 1 firms, the silos, such as Social Media, Traditional Media, Paid Media Data, and Research, seldom work closely together to bring all the organization's data to full Big Data potential. Instead, the periodic metrics published by a silo such as social media (number of blogs, tweets, news mentions, etc.) lack richness and recommended action, leading organizations to realize they don't need data, they need information.

Stage 2: Translate Data into Information
In Stage 2, companies often engage external expertise, particularly in social media, in an effort to understand how best to digest and prioritize information. In outsourcing the data consumption and analysis, organizations may see reports with norms and additional contextual data, but still no "answers." In addition, though an organization may have a more experienced person monitoring and condensing information, the data has effectively become even more siloed, with research and media expertise typically having limited interaction. Most organizations remain in this stage of data use, utilizing dashboards and publishing reports across the organization, but never truly learning how to intelligently consume the influx of data to drive value creation.

Stage 3: Create Value
Organizations that move to Stage 3 fuse social media metrics with other research data streams to fully inform their media strategy and other actions they might take. The following approach outlines four steps a company can take to help turn their data into actionable information.

1. Identify Key Drivers of Desired Business Outcomes Using Market Research: In today's globally competitive environment, company perception is impacted by a broad variety of stakeholders, each with a different focus on what matters. In this first step, organizations can identify the Key Performance Indicators (attributes, value propositions, core brand values, etc.) that are most important to each stakeholder audience. From this, derived importance scores are calculated to quantify the importance of each KPI. As seen below, research indicates different topics are more or less important to different stakeholders. This becomes important as we merge this data into the social media
monitoring effort. As spikes occur in social media for various topics, a differential response by stakeholder can occur, informed by the derived importance from research.

[Example figure: Topic Importance by Stakeholder]
2. Identify Thresholds: Once the importance scores for KPI's are established, organizations must identify thresholds for each audience segment.

Alternative Approaches for Setting "Alert Thresholds"
1. Set a predefined threshold, e.g., 10,000 mentions warrants attention.
2. Compare to past data. Set a multiplier of the absolute number of mentions over a given time period, e.g., 2x, 3x; alternatively, use the Poisson test to identify a statistical departure from the expected.
3. Compare to past data. If the distribution is normal, set alerts for a certain number of standard deviations from the mean; if not, use the 90th or 95th percentile of historical data for similar time periods.
4. Model historical data with time series analyses to account for trends and/or seasonality in the data.

From this step, organizations should also determine the sensitivity of alerts based on individual client preferences.
“High” Sensitivity Alerts are for any time there is a spike in volume that may or may not have significant business impacts.
“Low” Sensitivity Alerts are only for extreme situations which will likely have business implications.
3. Media Monitoring: Spikes in media coverage are often event-driven, and by coding incoming media by topic or theme, clients can be cued to which topics may be spiking at any given moment. In the example below, digital media streams from Forums, Facebook, YouTube, News, Twitter, and Blogs are monitored, with a specific focus on two key topics of interest.

[Example figure: Media Monitoring by Topic]
Similarly, information can be monitored by topic on a dashboard, with topic volume spikes triggering alerts delivered according to their importance, as in the example below:
Although media monitoring in this instance is set up by topic, fusing research data with social media data allows the importance to each stakeholder group to be identified, as in the example below:
4. Model Impact on KPI’s: Following an event, organizations can model the impact on the Key Performance Indicators by audience. Analyzing pre- and post-event measures across KPIs will help to determine the magnitude of any impact, and will also help uncover if a specific sub-group within an audience was impacted more severely than others. Identifying the key attributes most impacted within the most affected sub-group would suggest a course of action that enables an organization to prioritize its resources.
CASE STUDY
A 2012 study conducted by Cambia Information Group investigated the potential impact of social media across an array of KPIs among several stakeholders of interest for a major retailer. Cambia had been conducting primary research for this client for a number of years among these audiences and was able to supplement this research with media performance data across the social media spectrum.

Step 1: Identify Key Drivers of Desired Business Outcomes Using Market Research
This first step looks only at the research data. The research data used is from an on-going tracking study that incorporates both perceptual and experiential-type attributes about this company and top competitors. From the research data, we find that different topics are more or less important to different stakeholders. The data shown in this case study is real, although the company and attribute labels have been abstracted to ensure the learnings the company gained through this analysis remain a competitive advantage. The chart below shows the beta values for the top KPIs across all stakeholder groups. If you are designing the research as a benchmark specifically to inform this type of intelligence system, the attributes should be those things that can impact a firm's success. For example, if you are designing attributes for a conjoint study about cars, and all competitors in the study have equivalent safety records, safety may not make the list of attributes to be included. Choice of color or 2- vs. 4-doors might be included.
However, when you examine the automotive industry over time, with various recalls and elements that have caused significant brand damage among car manufacturers, safety would be a key element. We recommend that safety be included as a variable in a study informing a company about the media and what topics have the ability to impact the firm (especially negatively). Car color and 2- vs. 4-doors would not be included in this type of study. Just looking at the red-to-green conditional formatting of the betas in the table below, it is immediately clear that the importance values vary within and between key stakeholder groups.
Step 2: Identify Thresholds for Topics
The second step moves from working with the research data to working with the social media data sources. The goal of this step is to develop a quantitative measure of social media activity and set a threshold that, once reached, can trigger a notification to the client. This is a multi-step process. For this data set, we used the following methodology:
1. Relevant topics are identified to parallel those chosen for the research. Social media monitoring tools have the ability to listen to the various channels and tag the pieces that fall under each topic.
2. Distributions of volume by day by topic were studied for their characteristics. While some topics maintained a normal distribution, others did not. Given the lack of normality, an approach involving mean and standard deviation was discarded.
3. Setting a pre-defined threshold was discarded as too difficult to support in a client context. Additionally, the threshold would need to take into account the increasing volume of social media over time.
4. Time series analyses would have been intriguing and are an extension to be considered down the road, although they require specialized software and a client who is comfortable with advanced modeling.
5. Distributions by day evidence a "heartbeat" pattern—low on the weekends, higher on the weekdays. Thresholds need to account for this differential. Individuals clearly engage in
social media behavior while at work—or, more generously, perhaps as part of their role at work.
6. For an approach that a client could readily explain to others, we settled on referencing the median of the non-normal distribution and, from this point forward, taking the 90th or 95th percentile and flagging it for alerts. Given that some clients may wish to be notified more often, an 85th percentile is also offered. Cuts were identified for both weekday and weekend, and, to account for the rise in social media volume over time, they reference no more than the past 6 months of data. A small sketch of this threshold logic appears below.
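The sketch uses simulated daily counts (illustrative Python; the monitoring-tool data feed, topic tagging, and alerting plumbing are assumptions and not shown):

```python
import numpy as np
import pandas as pd

# Hypothetical daily mention counts for one topic over the past 6 months.
rng = np.random.default_rng(3)
dates = pd.date_range("2013-04-01", "2013-09-30", freq="D")
counts = pd.Series(rng.poisson(120, len(dates)) * np.where(dates.dayofweek < 5, 1.6, 1.0),
                   index=dates)

# Separate weekday and weekend distributions (the "heartbeat" pattern), then take
# percentile cuts: 85th for high-sensitivity alerts, 95th for low-sensitivity alerts.
is_weekday = counts.index.dayofweek < 5
thresholds = {
    segment: {"high_sensitivity": float(np.percentile(values, 85)),
              "low_sensitivity": float(np.percentile(values, 95))}
    for segment, values in (("weekday", counts[is_weekday]), ("weekend", counts[~is_weekday]))
}

today_count, today_is_weekday = 310, True
cut = thresholds["weekday" if today_is_weekday else "weekend"]
print(thresholds)
print("alert senior staff:", today_count > cut["low_sensitivity"])
```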
So the thresholds were set up with high (85th percentile) and low (95th percentile) sensitivity levels. For our client, a manager-level team member received all high-sensitivity notifications (high sensitivity, as described earlier, means it detects every small movement and sends a notice). Senior staff received only low-sensitivity notices. Since these were only the top 5% of all events, they were hypothesized to carry potential business implications.

Step 3: Media Monitoring
This step is where the research analytics and the social media analytics come together. The attributes measured in the research and for which we have importance values by attribute by stakeholder group are aligned with the topics that have been set up in the social media monitoring tools. Because we were dealing with multiple topics across multiple stakeholder groups, we chose to extend the online reporting for the survey data to provide a way to view the status of all topics, in addition to the email alerts.
Looking at this by topic (topics are equivalent to attributes, like "safety") shows the current status of each, such as the view below:
These views can be incorporated in many different ways, depending on the systems available to you.

Step 4: Model Impact on KPI's
Although media monitoring is set up by topic, the application of the research data allows the importance to each stakeholder group to be identified. As an example, a spike in social media about low employee wages or other fair labor practice violations might have a negative impact on the employee audience, a moderate impact on customers, and no impact on the other stakeholder groups.
Using this data, our client was able to respond very rapidly to an unexpected event that popped up in the media. The company was immediately able to identify which audiences would potentially be impacted by the media coverage, and to focus available resources on messaging to the right audiences.
CONCLUSION
This engagement enabled the firm to take three key steps:
1. Identify problem areas or issues, and directly engage key stakeholder groups, in this case the Voting Public and their Investors;
2. Understand the window of opportunity (time lag) between negative coverage and its impact on the organization's brand health;
3. Predict the brand health impact from social media channels, and affect that impact through messaging of their own.

Potential extensions for inclusion of alert or risk ratings include:
1. Provide "share of conversation" alerts,
2. Develop alert ratings within segments,
3. Incorporate potential media exposure to calculate a risk ratio for each stakeholder group for any particular published item,
4. Expand the model which includes the impact of each source,
5. A firm's overall strategy.
Karlan Witt
BRAND IMAGERY MEASUREMENT: ASSESSMENT OF CURRENT PRACTICE AND A NEW APPROACH1
PAUL RICHARD MCCULLOUGH
MACRO CONSULTING, INC.
EXECUTIVE SUMMARY Brand imagery research is an important and common component of market research programs. Traditional approaches, e.g., ratings scales, have serious limitations and may even sometimes be misleading. MaxDiff scaling adequately addresses the major problems associated with traditional scaling methods, but historically has had, within the context of brand imagery measurement, at least two serious limitations of its own. Until recently, MaxDiff scores were comparable only to items within the MaxDiff exercise. Traditional MaxDiff scores are relative, not absolute. Dual Response (anchored)MaxDiff has substantially reduced this first problem but may have done so at the price of reintroducing scale usage bias. The second problem remains: MaxDiff exercises that span a reasonable number of brands and brand imagery statements often take too long to complete. The purpose of this paper is to review the practice and limitations of traditional brand measurement techniques and to suggest a novel application of Dual Response MaxDiff that provides a superior brand imagery measurement methodology that increases inter-item discrimination and predictive validity and eliminates both brand halo and scale usage bias.
INTRODUCTION Brand imagery research is an important and common component of most market research programs. Understanding the strengths and weaknesses of a brand, as well as its competitors, is fundamental to any marketing strategy. Ideally, any brand imagery analysis would not only include a brand profile, providing an accurate comparison across brands, attributes and respondents, but also an understanding of brand drivers or hot buttons. Any brand imagery measurement methodology should, at a minimum, provide the following:
- Discrimination between attributes, for a given brand (inter-attribute comparisons)
- Discrimination between respondents or segments, for a given brand and attribute (inter-respondent comparisons)
- A good-fitting choice or purchase interest model to identify brand drivers (predictive validity)
With traditional approaches to brand imagery measurement, there are typically three interdependent issues to address:
- Minimal variance across items, i.e., flat responses
- Brand halo
- Scale usage bias
1 The author wishes to thank Survey Sampling International for generously donating a portion of the sample used in this paper.
Resulting data are typically non-discriminating, highly correlated and potentially misleading. With high collinearity, regression coefficients may actually have reversed signs, leading to absurd conclusions, e.g., lower quality increases purchase interest. While scale usage bias may theoretically be removed via modeling, there is reason to suspect any analytic attempt to remove brand halo since brand halo and real brand perceptions are typically confounded. That is, it is difficult to know whether a respondent’s high rating of Brand A on perceived quality, for example, is due to brand halo, scale usage bias or actual perception. Thus, the ideal brand imagery measurement technique will exclude brand halo at the data collection stage rather than attempt to correct for it at the analytic stage. Similarly, the ideal brand imagery measurement technique will eliminate scale usage bias at the data collection stage as well. While the problems with traditional measurement techniques are well known, they continue to be widely used in practice. Familiarity and simplicity are, no doubt, appealing benefits of these techniques. Among the various methods used historically, the literature suggests that comparative scales may be slightly superior. An example of a comparative scale is below:
Some alternative techniques have also garnered attention: MaxDiff scaling, method of paired comparisons (MPC) and Q-sort. With the exception of Dual Response MaxDiff (DR MD), these techniques all involve relative measures rather than absolute. MaxDiff scaling, MPC and Q-sort all are scale-free (no scale usage bias), potentially have no brand halo2 and demonstrate more discriminating power than more traditional measuring techniques. MPC is a special case of MaxDiff; as it has been shown to be slightly less effective it will not be further discussed separately. With MaxDiff scaling, the respondent is shown a random subset of items and asked to pick which he/she most agrees with and which he/she least agrees with. The respondent is then shown several more subsets of items. A typical MaxDiff question is shown below:
2 These techniques do not contain brand halo effects if and only if the brand imagery measures are collected for each brand separately rather than pooled.
Traditional MaxDiff3
With Q-sorting, the respondent is asked to place a set of items, or brand image attributes, into a series of "buckets," from best describes the brand to least describes the brand. The number of items in each bucket roughly approximates a normal distribution. Thus, for 25 items, the number of items per bucket might be:
First bucket: 1 item
Second bucket: 2 items
Third bucket: 5 items
Fourth bucket: 9 items
Fifth bucket: 5 items
Sixth bucket: 2 items
Seventh bucket: 1 item
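For concreteness, a small sketch of how Q-sort placements could be turned into item scores, assuming the bucket sizes quoted above and items already ordered from best describes to least describes (all names here are ours):

```python
BUCKET_SIZES = [1, 2, 5, 9, 5, 2, 1]   # roughly normal allocation for 25 items

def qsort_scores(ordered_items):
    """Map items, ordered best-to-least descriptive, to bucket scores (7 = best, 1 = least)."""
    scores, start = {}, 0
    for bucket_index, size in enumerate(BUCKET_SIZES):
        for item in ordered_items[start:start + size]:
            scores[item] = len(BUCKET_SIZES) - bucket_index
        start += size
    return scores
```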
MaxDiff and Q-sorting adequately address two of the major issues surrounding monadic scales, inter-attribute comparisons and predictive validity, but due to their relative structure do not allow inter-brand comparisons. That is, MaxDiff and Q-sorting will determine which brand imagery statements have higher or lower scores than other brand imagery statements for a given brand, but can't determine which brand has a higher score than any other brand on any given statement. Some would argue that MaxDiff scaling also does not allow inter-respondent comparisons due to the scale factor. Additionally, as a practical matter, both techniques currently accommodate fewer brands and/or attributes than traditional techniques. Both MaxDiff scaling and Q-sorting take much longer to field than other data collection techniques and are not comparable across studies with different brand and/or attribute sets. Q-sorting takes less time to complete than MaxDiff and is somewhat less discriminating. As mentioned earlier, MaxDiff can be made comparable across studies by incorporating the Dual Response version of MaxDiff, which allows the estimation of an absolute reference point. This reference point may come at a price: the inclusion of an anchor point in MaxDiff exercises may reintroduce scale usage bias into the data set. However, for Q-sorting, there is currently no known approach to establish an absolute reference point. For that reason, Q-sorting, for the purposes of this paper, is eliminated as a potential solution to the brand measurement problem.
Also, for both MaxDiff and Q-sorting the issue of data collection would need to be addressed. As noted earlier, to remove brand halo from either a MaxDiff-based or Q-sort-based brand measurement exercise, it will be necessary to collect brand imagery data on each brand separately, referred to here as Brand-Anchored MaxDiff. If the brands are pooled in the exercise, brand halo would remain. Thus, there is the very real challenge of designing the survey in such a way as to collect an adequate amount of information to accurately assess brand imagery at the disaggregate level without overburdening the respondent. Although one could estimate an aggregate level choice model to estimate brand ratings, that approach is not considered viable here because disaggregate brand ratings data are the current standard. Aggregate estimates would yield neither familiar nor practical data. Specifically, without disaggregate data, common cross tabs of brand ratings would be impossible, as would the more advanced predictive model-based analyses.
3 The form of MaxDiff scaling used in brand imagery measurement is referred to as Brand-Anchored MaxDiff (BA MD).
A NEW APPROACH Brand-Anchored MaxDiff, with the exception of being too lengthy to be practical, appears to solve, or at least substantially mitigate, most of the major issues with traditional methods of brand imagery measurement. The approach outlined below attempts to minimize the survey length of Brand-Anchored MaxDiff by increasing the efficiency of two separate components of the research process:
- Survey instrument design
- Utility estimation
Survey Instrument
A new MaxDiff question format, referred to here as Modified Brand-Anchored MaxDiff, accommodates more brands and attributes than the standard design. The format of the Modified Brand-Anchored MaxDiff used in Image MD is illustrated below:
To accommodate the Dual Response form of MaxDiff, a Direct Binary Response question is asked prior to the MBA MD task set4:
To address the potential scale usage bias of MaxDiff exercises with Direct Binary Response, a negative Direct Binary Response question, e.g., For each brand listed below, please check all the attributes that you feel strongly do not describe the brand, is also included.5 As an additional attempt to mitigate scale usage bias, the negative Direct Binary Response was asked in a slightly different way for half the sample. Half the sample were asked the negative Direct Binary Response question as above. The other half were asked a similar question except that respondents were required to check as many negative items as they had checked positive. The first approach is referred to here as unconstrained negative Direct Binary Response and the second is referred to as constrained negative Direct Binary Response. In summary, Image MD consists of an innovative MaxDiff exercise and two direct binary response questions, as shown below:
4 This approach to Anchored MaxDiff was demonstrated to be faster to execute than the traditional Dual Response format (Lattery 2010).
5 Johnson and Fuller (2012) note that Direct Binary Response yields a different threshold than traditional Dual Response. By collecting both positive and negative Direct Binary Response data, we will explore ways to mitigate this effect.
It is possible, in an online survey, to further increase data collection efficiency with the use of some imaginative programming. We have developed an animated way to display Image MD tasks which can be viewed at www.macroinc.com (Research Techniques tab, MaxDiff Item Scaling). Thus, the final form of the Image MD brand measurement technique can be described as Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative Direct Binary Response.
Utility Estimation
Further, an exploration was conducted to reduce the number of tasks seen by any one respondent and still retain sufficiently accurate disaggregate brand measurement data. MaxDiff utilities were estimated using a Latent Class Choice Model (LCCM) and using a Hierarchical Bayes model (HB). By pooling data across similarly behaving respondents (in the LCCM), we hoped to substantially reduce the number of MaxDiff tasks per respondent. This approach may be further enhanced by the careful use of covariates. Another approach that may require fewer MaxDiff tasks per person is to incorporate covariates in the upper model of an HB model or running separate HB models for segments defined by some covariate. To summarize, the proposed approach consists of:
- Animated Modified Brand-Anchored MaxDiff exercise with Direct Binary Responses (both positive and negative)
- Analytic-derived parsimony:
  o Latent Class Choice Model: estimate disaggregate MaxDiff utilities; use of covariates to enhance LCCM accuracy
  o Hierarchical Bayes: HB with covariates in the upper model; separate HB runs for covariate-defined segments; adjusted priors6
RESEARCH OBJECTIVE The objectives, then, of this paper are:
- To compare this new data collection approach, Animated Modified Brand-Anchored MaxDiff with Direct Binary Response, to a traditional approach using monadic rating scales
- To compare the positive Direct Binary Response and the combined positive and negative Direct Binary Response
- To confirm that Animated Modified Brand-Anchored MaxDiff with Direct Binary Response eliminates brand halo
- To explore ways to include an anchor point without reintroducing scale usage bias
- To explore utility estimation accuracy of LCCM and HB using a reduced set of MaxDiff tasks
- To explore the efficacy of various potential covariates in LCCM and HB
STUDY DESIGN A two cell design was employed: Traditional brand ratings scales in one cell and the new MaxDiff approach in the other. Both cells were identical except in the method that brand imagery data were collected:
- Traditional brand ratings scales
  o Three brands, each respondent seeing all three brands
  o 12 brand imagery statements
- Animated Modified Brand-Anchored MaxDiff with Direct Binary Response
  o Three brands, each respondent seeing all three brands
  o 12 brand imagery statements
  o Positive and negative Direct Binary Response questions
Cell sizes were:
- Monadic ratings cell: n = 436
- Modified MaxDiff: n = 2,605
  o Unconstrained negative DBR: n = 1,324
  o Constrained negative DBR: n = 1,281
The larger sample size for the second cell was intended so that attempts to reduce the minimum number of choice tasks via LCCM and/or HB could be fully explored.
6 McCullough (2009) demonstrates that tuning HB model priors can improve hit rates in sparse data sets.
Both cells contained:
- Brand imagery measurement (ratings or MaxDiff)
- Brand affinity measures
- Demographics
- Holdout attribute rankings data
RESULTS
Brand Halo
We check for brand halo using confirmatory factor analysis, building a latent factor to capture any brand halo effect. If brand halo exists, the brand halo latent factor will positively influence scores on all items. We observed a clear brand halo effect in the ratings scale data, as expected. The unanchored MaxDiff data showed no evidence of the effect, also as expected. The positive direct binary response reintroduced the brand halo effect into the MaxDiff scores at least as strongly as in the ratings scale data. This was not expected. However, the effect appears to be completely eliminated with the inclusion of either the constrained or unconstrained negative direct binary question.
Brand Halo Confirmatory Factor Analytic Structure
Brand Halo Latent (each cell: Std Beta, Prob)

          Ratings        No DBR         Positive DBR   Unconstrained Neg. DBR   Constrained Neg. DBR
Item 1    0.85   ***     -0.14  ***     0.90   ***     0.44   ***               0.27   ***
Item 2    0.84   ***     -0.38  ***     0.78   ***     -0.56  ***               -0.72  ***
Item 3    0.90   ***     -0.20  ***     0.95   ***     0.42   ***               0.32   ***
Item 4    0.86   ***     0.10   ***     0.90   ***     0.30   ***               0.16   ***
Item 5    0.77   ***     -0.68  ***     0.88   ***     0.03   0.25              0.01   0.78
Item 6    0.85   ***     -0.82  ***     0.87   ***     -0.21  ***               -0.24  ***
Item 7    0.83   ***     0.69   ***     0.83   ***     0.42   ***               0.20   ***
Item 8    0.82   ***     0.24   ***     0.75   ***     0.01   0.87              -0.23  ***
Item 9    0.88   ***     0.58   ***     0.90   ***     0.77   ***               0.62   ***
Item 10   0.87   ***     0.42   ***     0.94   ***     0.86   ***               0.90   ***
Item 11   0.77   ***     -0.05  0.02    0.85   ***     0.07   0.02              -0.12  ***
Item 12   0.88   na      0.26   na      0.91   na      0.69   na                0.53   na
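As an illustration of the check described above, the following is a minimal one-factor CFA sketch. It assumes the semopy package and its lavaan-style model syntax, plus hypothetical column names item1 through item12; it is not the authors' exact model specification.

```python
import pandas as pd
import semopy  # assumption: the semopy SEM package is available

def halo_cfa(responses: pd.DataFrame) -> pd.DataFrame:
    """Fit a single 'halo' latent factor on 12 image items and return parameter estimates."""
    items = [f"item{i}" for i in range(1, 13)]            # hypothetical column names
    description = "halo =~ " + " + ".join(items)          # one latent loading on every item
    model = semopy.Model(description)
    model.fit(responses[items])
    return model.inspect(std_est=True)                    # standardized loadings and p-values
```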
Scale Usage
As with our examination of brand halo, we use confirmatory factor analysis to check for the presence of a scale usage factor. We build in latent factors to capture brand halo per brand, and build another latent factor to capture a scale usage bias independent of brand. If a scale usage bias exists, the scale latent factor should load positively on all items for all brands.
Scale Usage Bias and Brand Halo Confirmatory Factor Analytic Structure
We observe an obvious scale usage effect with the ratings data, where the scale usage latent loads positively on all 36 items. Again, the MaxDiff with only positive direct binary response shows some indication of scale usage bias, even with all three brand halo latents simultaneously accounting for a great deal of collinearity. Traditional MaxDiff and the two versions including positive and negative direct binary responses all show no evidence of a scale usage effect.

Scale Usage Latent                               Ratings   No DBR   Positive DBR   Unconstrained Neg. DBR   Constrained Neg. DBR
Number of negative loadings                      0         14       5              10                       15
Number of statistically significant loadings     36        31       29             33                       30
Predictive Validity
In the study design we included a holdout task which asked respondents to rank their top three item choices per brand, giving us a way to test the accuracy of the various ratings/utilities we collected. In the case of all MaxDiff data we compared the top three scoring items to the top three ranked holdout items per person, and computed the hit rate. This approach could not be directly applied to scale ratings data due to the frequency of flat responses (e.g., it is impossible to identify a top three if all items were rated the same). For the ratings data we estimated the hit rate using this approach: if the highest-ranked holdout item received the highest rating, and that rating was shared by n items, we added 1/n to the hit rate. Similarly, the second and third highest ranked holdout items received an adjusted hit point if those items were among the top 3 rated items. We observe that each of the MaxDiff data sets vastly outperformed the ratings scale data, which performed roughly the same as randomly guessing the top three ranked items (a sketch of this computation appears after the table).
Hit Rates
                    Random Numbers   Ratings   No DBR   Positive DBR   Unconstrained Neg. DBR   Constrained Neg. DBR
1 of 1              8%               14%       27%      28%            27%                      26%
(1 or 2) of 2       32%              30%       62%      64%            62%                      65%
(1, 2 or 3) of 3    61%              51%       86%      87%            86%                      88%
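A sketch of the hit-rate logic described above, under two assumptions of ours: "k of k" counts a hit when the top-k predicted items and the top-k ranked holdout items share at least one item (with "1 of 1" an exact match of the top item), and tied ratings receive fractional 1/n credit as described.

```python
from typing import Sequence

def topk_hit(pred_scores: Sequence[float], holdout_top3: Sequence[int], k: int) -> float:
    """pred_scores: score per item; holdout_top3: item indices ranked 1st, 2nd, 3rd."""
    order = sorted(range(len(pred_scores)), key=lambda i: pred_scores[i], reverse=True)
    return 1.0 if set(order[:k]) & set(holdout_top3[:k]) else 0.0

def ratings_hit_top1(ratings: Sequence[float], holdout_top3: Sequence[int]) -> float:
    """Fractional credit when the top holdout item shares the maximum rating with n items."""
    top = max(ratings)
    tied = [i for i, r in enumerate(ratings) if r == top]
    return 1.0 / len(tied) if holdout_top3[0] in tied else 0.0
```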
Inter-item discrimination
A visual inspection of the resulting item scores shows that each of the MaxDiff versions yields greater inter-item discrimination and, among those, both negative direct binary versions bring the lower-performing brand closer to the other two brands.
Ratings Scales
MaxDiff with Positive DBR
MaxDiff with Positive DBR & Constrained Negative DBR
MaxDiff with Positive DBR & Unconstrained Negative DBR
To confirm this, we counted how many statistically significant differences between statements could be observed within each brand for each data collection method. The ratings scale data yielded the fewest statistically significant differences across items, while the MaxDiff with positive and unconstrained negative direct binary responses yielded the most. Traditional MaxDiff and MaxDiff with positive and constrained negative direct binary responses also performed very well, while the MaxDiff with only positive direct binary response performed much better than the ratings scale data, but clearly not as well as the remaining three MaxDiff methods.
Average number of statistically significant differences across 12 items
             Ratings   No DBR   Positive DBR   Unconstrained Neg. DBR   Constrained Neg. DBR
Brand#1      1.75      4.46     3.9            4.3                      4.68
New Brand    0         4.28     3.16           4.25                     4.5
Brand#2      1         4.69     3.78           4.48                     4.7
Completion Metrics
Using a more sophisticated data collection method comes with some costs in respondent burden. It took respondents much longer to complete any of the MaxDiff exercises than it took them to complete the simple ratings scales. The dropout rate during the brand imagery section of the survey (measured as the percentage of respondents who began that section but failed to finish it) was also much higher among the MaxDiff versions. On the plus side for the MaxDiff versions, when preparing the data for analysis we were forced to drop far fewer respondents due to flat-lining.
                                          Ratings   All MaxDiff Versions
Brand Image Measurement Time (Minutes)    1.7       6
Incompletion Rate                         9%        31%
Post-field Drop Rate                      32%       4%
Exploration to Reduce Number of Tasks Necessary
We find these results to be generally encouraging, but would like to explore if anything can be done to reduce the increased respondent burden and dropout rates. Can we reduce the number of tasks each respondent is shown, without compromising the predictive validity of the estimated utilities? To find out, we estimated disaggregate utilities using two different estimation methods (Latent Class and Hierarchical Bayes), varying the numbers of tasks, and using certain additional tools to bolster the quality of the data (using covariates, or adjusting priors, etc.). We continued only with the two MaxDiff methods with both positive and negative direct binary responses, as those two methods proved best in our analysis. All estimation routines were run for both the unconstrained and constrained versions, allowing us to further compare these two methods. Our chosen covariates included home ownership (rent vs. own), gender, purchase likelihood for the brand we were researching, and a few others. Including these covariates when estimating utilities in HB should yield better individual results by allowing the software to make more educated estimates based on respondents’ like peers. With covariates in place, utilities were estimated using data from 8 (full sample), 4, and 2 MaxDiff tasks, and hit rates were computed for each run. We were surprised to discover that using only 2 tasks yielded only slightly less accuracy than using all 8 tasks. And in all cases, hit rates seem to be mostly maintained despite decreased data. Using Latent Class the utilities were estimated again using these same 6 data sub-sets. As with HB, reducing the number of tasks used to estimate the utilities had minimal effect on the hit rates. It is worth noting here, that when using all 8 MaxDiff tasks latent class noticeably underperforms Hierarchical Bayes, but this disparity decreases as tasks are dropped.
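The task-reduction runs can be reproduced by simply truncating the choice data before estimation. A minimal sketch, assuming long-format choice data with hypothetical columns resp_id and task:

```python
import pandas as pd

def keep_first_k_tasks(choices: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep only each respondent's first k MaxDiff tasks before re-estimating utilities."""
    task_order = choices.groupby("resp_id")["task"].rank(method="dense")
    return choices[task_order <= k]
```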
Various Task Hit Rates

                          Unconstrained Negative DBR       Constrained Negative DBR
                          8 Tasks   4 Tasks   2 Tasks      8 Tasks   4 Tasks   2 Tasks
HB   1 of 1               27%       21%       20%          26%       24%       22%
     (1 or 2) of 2        62%       59%       58%          65%       61%       59%
     (1, 2 or 3) of 3     86%       85%       82%          88%       86%       85%
LC   1 of 1               19%       20%       19%          21%       21%       22%
     (1 or 2) of 2        54%       57%       56%          61%       59%       56%
     (1, 2 or 3) of 3     81%       82%       83%          84%       84%       82%
In estimating utilities in Hierarchical Bayes, it is possible to adjust the Prior degrees of freedom and the Prior variance. Generally speaking, adjusting these values allows the researcher to change the emphasis placed on the upper level model. In dealing with sparse data sets, adjusting these values may lead to more robust individual utility estimates. Utilities were estimated with data from 4 tasks, and with Prior degrees of freedom from 2 to 1000 (default is 5), and Prior variance from 0.5 to 10 (default is 2). Hit rates were examined at various points on these ranges, and compared to the default settings. After considering dozens of non-default configurations we observed essentially zero change in hit rates. At this point it seemed that there was nothing that could diminish the quality of these utilities, which was a suspicious finding. In searching for a possible explanation, we hypothesized that these data simply have very little heterogeneity. The category of product being researched is not emotionally engaging (light bulbs), and the brands being studied are not very differentiated. To test this hypothesis, an additional utility estimation was performed, using only data from 2 tasks, and with a drastically reduced sample size of 105. Hit rates were computed for the low sample run both at the disaggregate level, that is using unique individual utilities, and then again with each respondents utilities equal to the average of the sample (constant utilities).
                    Random Choices   Unconstrained Negative DBR
                                     HB, 8 Tasks    HB, 2 Tasks    HB, 2 Tasks, Constant Utils
                                     N=1,324        N=105          N=105
1 of 1              8%               27%            22%            25%
(1 or 2) of 2       32%              62%            59%            61%
(1, 2 or 3) of 3    61%              86%            82%            82%
These results seem to suggest that there is very little heterogeneity for our models to capture in this particular data set, which explains why even the low-task utility estimates yield fairly high hit rates. Unfortunately, this means that we cannot say whether the survey length of this new approach can be reduced by cutting the number of tasks needed for estimation.
Summary of Results
                                     Ratings    No DBR   Positive DBR   Unconstrained Neg. DBR   Constrained Neg. DBR
Provides Absolute Reference Point    No         No       Yes            Yes                      Yes
Brand Halo                           Yes        No       Yes            No                       No
Scale Usage Bias                     Yes        No       Yes            No                       No
Inter-Item Discrimination            Very Low   High     Fairly High    High                     High
Predictive Validity                  Very Low   High     High           High                     High
Complete Time                        Fast       Slow     Slow           Slow                     Slow
Dropout Rate                         Low        High     High           High                     High
Post-Field Drop Rate                 High       Low      Low            Low                      Low
CONCLUSIONS
The form of MaxDiff referred to here as Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative Direct Binary Response is superior to rating scales for measuring brand imagery:
- Better inter-item discrimination
- Better predictive validity
- Elimination of brand halo
- Elimination of scale usage bias
- Fewer invalid completes
Using positive DBR alone to estimate MaxDiff utilities reintroduces brand halo and possibly scale usage bias. Positive DBR combined with some form of negative DBR to estimate MaxDiff utilities eliminates both brand halo and scale usage bias. Utilities estimated with Positive DBR have slightly weaker inter-item discrimination than utilities estimated with Negative DBR. The implication to these findings regarding DBR is that perhaps MaxDiff, if anchored, should always incorporate both positive and negative DBR since positive DBR alone produces highly correlated MaxDiff utilities with less inter-item discrimination. Another, more direct implication, is that Brand-Anchored MaxDiff with both positive and negative DBR is superior to Brand-Anchored MaxDiff with only positive DBR for measuring brand imagery. Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative Direct Binary Response takes longer to administer and has higher incompletion rates, however, and further work needs to be done to make the data collection and utility estimation procedures more efficient.
Paul Richard McCullough
REFERENCES
Bacon, L., Lenk, P., Seryakova, K., and Veccia, E. (2007), "Making MaxDiff More Informative: Statistical Data Fusion by Way of Latent Variable Modeling," 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA.
Bacon, L., Lenk, P., Seryakova, K., and Veccia, E. (2008), "Comparing Apples to Oranges," Marketing Research Magazine, Spring 2008.
Bochenholt, U. (2004), "Comparative Judgements as an Alternative to Ratings: Identifying the Scale Origin," Psychological Methods.
Chrzan, Keith and Natalia Golovashkina (2006), "An Empirical Test of Six Stated Importance Measures," International Journal of Marketing Research.
Chrzan, Keith and Doug Malcom (2007), "An Empirical Test of Alternative Brand Systems," 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA.
Chrzan, Keith and Jeremy Griffiths (2005), "An Empirical Test of Brand-Anchored Maximum Difference Scaling," 2005 Design and Innovations Conference, Berlin.
Cohen, Steven H. (2003), "Maximum Difference Scaling: Improved Measures of Importance and Preference for Segmentation," Sawtooth Software Research Paper Series.
Dillon, William R., Thomas J. Madden, Amna Kirmani and Soumen Mukherjee (2001), "Understanding What's in a Brand Rating: A Model for Assessing Brand and Attribute Effects and Their Relationship to Brand Equity," JMR.
Hendrix, Phil, and Drucker, Stuart (2007), "Alternative Approaches to MaxDiff with Large Sets of Disparate Items: Augmented and Tailored MaxDiff," 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA.
Horne, Jack, Bob Rayner, Reg Baker and Silvo Lenart (2012), "Continued Investigation Into the Role of the 'Anchor' in MaxDiff and Related Tradeoff Exercises," 2012 Sawtooth Software Conference Proceedings, Orlando, FL.
Johnson, Paul, and Brent Fuller (2012), "Optimizing Pricing of Mobile Apps with Multiple Thresholds in Anchored MaxDiff," 2012 Sawtooth Software Conference Proceedings, Orlando, FL.
Lattery, Kevin (2010), "Anchoring Maximum Difference Scaling Against a Threshold: Dual Response and Direct Binary Responses," 2010 Sawtooth Software Conference Proceedings, Newport Beach, CA.
Louviere, J.J., Marley, A.A.J., Flynn, T., and Pihlens, D. (2009), "Best-Worst Scaling: Theory, Methods and Applications," CenSoc: forthcoming.
Magidson, J., and Vermunt, J.K. (2007a), "Use of a Random Intercept in Latent Class Regression Models to Remove Response Level Effects in Ratings Data," Bulletin of the International Statistical Institute, 56th Session, paper #1604, 1–4, ISI 2007: Lisboa, Portugal.
Magidson, J., and Vermunt, J.K. (2007b), "Removing the Scale Factor Confound in Multinomial Logit Choice Models to Obtain Better Estimates of Preference," 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA.
McCullough, Paul Richard (2009), "Comparing Hierarchical Bayes and Latent Class Choice: Practical Issues for Sparse Data Sets," 2009 Sawtooth Software Conference Proceedings, Delray Beach, FL.
Vermunt, J.K., and Magidson, J. (2008), LG Syntax User's Guide: Manual for Latent GOLD Choice 4.5 Syntax Module.
Wirth, Ralph, and Wolfrath, Annette (2012), "Using MaxDiff for Evaluating Very Large Sets of Items," 2012 Sawtooth Software Conference Proceedings, Orlando, FL.
ACBC REVISITED
MARCO HOOGERBRUGGE
JEROEN HARDON
CHRISTOPHER FOTENOS
SKIM GROUP
ABSTRACT Adaptive Choice-Based Conjoint (ACBC) was developed by Sawtooth Software in 2009 as an alternative to their classic CBC software in order to obtain better respondent data in complex choice situations. Similar to Adaptive Conjoint Analysis (ACA) many years ago, this alternative adapts the design of the choice experiment to the specific preferences of each respondent. Despite its strengths, ACBC has not garnered the popularity that ACA did as CBC has maintained the dominant position in the discrete choice modeling market. There are several possibilities concerning the way ACBC is assessed and its various features that may explain why this has happened. In this paper, we compare ACBC to several other methodologies and variations of ACBC itself in order to assess its performance and look into potential ways of improving it. What we show is that ACBC does indeed perform very well for modeling choice behavior in complex markets and it is robust enough to allow for simplifications without harming results. We also present a new discrete choice methodology called Dynamic CBC, which combines features of ACBC and CBC to provide a strong alternative to CBC in situations where running an ACBC study may not be applicable. Though this paper will touch on some of the details of the standard ACBC procedure, for a more in-depth overview and introduction to the methodology please refer to the ACBC Technical Paper published in 2009 from the Sawtooth Software Technical Paper Series.
BACKGROUND Our methodological hypotheses as to why ACBC has not yet caught up to the popularity of CBC are related to the way in which the methodology has been critiqued as well as how respondents take the exercise: 1. Our first point relates to the way in which comparison tests between ACBC and CBC have been performed in the past. We believe that ACBC will primarily be successful in markets for which the philosophy behind ACBC is truly applicable. This means that the market should consist of dozens of products so that consumers need to simplify their choices upfront by consciously or subconsciously creating an evoked set of products from which to choose. This evoked set can be different for each consumer: some consumers may restrict themselves to one or more specific brands while other consumers may only shop within specific price tiers. This aligns with the non-compensatory decision making behavior that Sawtooth Software (2009) was aiming to model when designing the ACBC platform. For example, shopping for technology products (laptops, tablets, etc.), subscriptions (mobile, insurance), or cars may be very well suited for modeling via ACBC.
If we are studying such a market, the holdout task(s), which are the basis for comparison between methodologies, should also reflect the complexity of the market! Simply put, the holdout tasks should be similar to the scenario that we wish to simulate. Whereas in a typical ACBC or CBC choice exercise we may limit respondents to three to five concepts to ensure that they assess all concepts, holdout tasks for complex markets can be more elaborate as they are used for model assessment rather than deriving preference behavior. 2. For many respondents in past ACBC studies we have seen that they may not have a very ‘rich’ choice tournament because they reject too many concepts in the screening section of the exercise. Some of them do not even get to the choice tournament or see just one choice tournament task. In an attempt to curb this kind of behavior we have tried to encourage respondents to allow more products through the screening section by moderately rephrasing the question text and/or the text of the two answer options (would / would not consider). We have seen that this helps a bit though not enough in expanding the choice tournament. The simulation results are mostly based on the choices in the tournament, so fewer choice tasks could potentially lead to a less accurate prediction. If ACBC may be adjusted such that the choice tournament provides ‘richer’ data, the quality of the simulations (and for the holdout predictions) may be improved. 3. Related to our first point on the realism of simulated scenarios, for many respondents we often see that the choice tournament converges to a winning concept that has a price way below what is available in the market (and below what is offered in the holdout task). With the default settings of ACBC (-30% to +30% of the summed component price), we have seen that approximately half of respondents end up with a winning concept that has a price 15– 30% below the market average. Because of this, we may not learn very much about what these respondents would choose in a market with realistic prices. A way to avoid this is to follow Sawtooth Software’s early recommendation to have an asymmetric price range (e.g., from 80% to 140% of market average) and possibly also to use a narrower price range if doing so is more in line with market reality. 4. A completely different kind of hypothesis refers to the design algorithm of ACBC. In ACBC, near-orthogonal designs are generated consisting of concepts that are “nearneighbors” to the respondent’s BYO task configuration while still including the full range of levels across all attributes. The researcher has some input into this process in that they can specify the total number of concepts to generate, the minimum and maximum number of levels to vary from the BYO concept, and the percent deviation from the total summed price as mentioned above (Sawtooth Software 2009). It may be the case that ACBC (at its default settings) is too extreme in its design of concepts at the respondent level and the Hierarchical Bayes estimation procedure may not be able to fully compensate for this in its borrowing process across respondents. For this we propose two potential solutions: a. A modest improvement for keeping the design as D-efficient as possible is to maximize the number of attributes to be varied from the BYO task. b. A solution for this problem may be found outside ACBC as well. An adaptive method that starts with a standard CBC and preserves more of CBC’s D-
efficiency throughout the exercise may lead to a better balance between D-efficiency and interactivity than is currently available in other methodologies.
THE MARKET
In order to test the hypotheses mentioned above, we needed to run our experiment in a market that is complex enough to evoke the non-compensatory choice behavior of respondents that is captured by the ACBC exercise. We therefore decided to use the television market in the United States. As the technology that goes into televisions has become more advanced (think smart TVs, 3D, etc.) and the number of brands available has expanded, the choice of a television has grown more elaborate, as has the pricing structure of the category. Many features play a role in the price of a television, and the importance of these features can vary greatly between respondents. Based on research of televisions widely available to consumers in the US, we ran our study with the following attributes and levels:

Attribute                Levels
Brand                    Sony, Samsung, LG, Panasonic, Vizio
Screen Size              22", 32", 42", 52", 57", 62", 67", 72"
Screen Type              LED, LCD, LED-LCD, Plasma
Resolution               720p, 1080p
Wi-Fi Capability         Yes/No
3D Capability            Yes/No
Number of HDMI Inputs    0-3 connections
Total Price              Summed attribute price ranged from $120 to $3,500
In order to qualify for the study, respondents had to be between the ages of 22 and 65, live independently (i.e., not supported by their parents), and be in the market for a new television sometime within the next 2 years. Respondent recruitment came through uSamp’s online panel.
THE METHODS
In total we looked into eight different methodologies in order to assess the performance of standard ACBC and explore ways of improving it as per our previously mentioned hypotheses. These methods included variations of ACBC and CBC as well as our own SKIM-developed alternative called Dynamic CBC, which combines key features of both methods. It is important to note that the same attributes and levels were used across all methods and that the designs contained no prohibitions within or between attributes. Additionally, all methods showed respondents a total summed component price for each television, consistent with a standard ACBC exercise, and each method was taken by a sample of approximately 300 respondents (2,422 in total). The methods are as follows:

A. Standard ACBC: When referring to standard ACBC, we mean that respondents completed all portions of the ACBC exercise (BYO, screening section, choice tournament) and that the settings were generally in line with those recommended by Sawtooth Software, including having between two and four attributes varied from the BYO exercise in the generation of concepts, six screening tasks with four concepts each, unacceptable questions after the third and fourth screening task, one must-have question after the fifth screening task, and a price range varying between 80-140% of the total summed component price. Additionally, for this particular leg price was excluded as an attribute in the unacceptable questions asked during the screening section.

B. ACBC with price included in the unacceptable questions: This leg had the same settings as the standard ACBC leg; however, price was included as an attribute in the unacceptable questions of the screening section. The idea behind this is to increase the number of concepts that make it through the screening section in order to create a richer choice tournament for respondents, corresponding with our hypothesis that a longer, more varied choice tournament could potentially result in better data. If respondents confirm that a very high price range is unacceptable to them, the concepts they evaluate no longer contain these higher price points, so we could expect respondents to accept more concepts in the further evaluation. This of course also creates the risk that the respondent is too conservative in screening based on price, which leads to a much shorter choice tournament.

C. ACBC without a screening section: Again going back to the hypothesis of the "richness" of the choice tournament for respondents, this leg followed the same settings as our baseline ACBC exercise but skipped the entire screening section of the methodology. Skipping the screening section ensured that respondents would see a full set of choice tasks (in this case 12). The end result is still respondent-specific, as designs are built with the "near-neighbor" concept in mind, but respondents are prevented from further customizing their consideration set for the choice tournament. We thereby collected more data from the choice tournament for all respondents, providing more information for utility estimation. Additionally, skipping the screening section may lead to increased respondent engagement by way of shortening the overall length of interview.

D. ACBC with a narrower tested price range: To test our hypothesis on the realism of the winning concept price for respondents, we included one test leg that had the equivalent settings of the standard ACBC leg with the exception of using a narrower tested price deviation from the total summed component price. For this cell a range of 90% to 120% of the total summed component price was used, as opposed to the 80% to 140% used by the other legs. Although this does not fully test our hypothesis, as we are not comparing to the default 70% to 130% range, we feel that the 80% to 140% range is already an accepted better alternative to the default range and we can learn more by seeing if an even narrower range improves results.

E. ACBC with 4 attributes varied from the BYO concept: We tested our last hypothesis concerning the design algorithm of ACBC by including a leg which forced four attributes (out of seven possible) to be varied from the BYO concept across all concepts generated for each respondent, whereas all other ACBC legs had between two and four attributes varied, as previously mentioned. The logic behind this is to ensure that the non-BYO levels show up more often in the design and therefore push it closer to being level-balanced and more statistically efficient.

F. Standard CBC: The main purpose of this leg was to serve as a comparison to standard ACBC and the other methods.

G. Price balanced CBC: A CBC exercise that shows concepts of similar price on each screen. This has been operationalized by means of between-concept prohibitions on price. This is similar to the way ACBC concepts within a task are relatively utility balanced, since they are all built off variations of the respondent's BYO concept. Although this is not directly tied into one of our hypotheses concerning ACBC, it is helpful in understanding if applying a moderate utility balance within the CBC methodology could help improve results.

H. Dynamic CBC: As previously mentioned, this method is designed by SKIM and includes features of both CBC and ACBC. Just like a standard CBC exercise, Dynamic CBC starts out with an orthogonal design space for all respondents, as there is no BYO or screening exercise prior to the start of the choice tasks. However, like ACBC, the method is adaptive in the way in which it displays the later tasks of the exercise. At several points throughout the course of the exercise, one of the attributes in the pre-drawn orthogonal design has its levels replaced with a level from a concept previously chosen by the respondent. Between these replacements, respondents were asked which particular attribute (in this case television feature) they focused on the most when making their selections in previous tasks. The selection of the attribute to replace was done randomly, though the attribute that the respondent stated they focused on the most was given a higher probability of being drawn in this process. This idea is relatively similar to the "near-neighbors" concept of ACBC in the sense that we are fixing an attribute to a level which we are confident the respondent prefers (like their BYO preferred level) and thus force them to make trade-offs between other attributes to gain more insights into their preferences in those areas. In this particular study, this adaptive procedure occurred at three points in the exercise (a sketch of this replacement step follows the list).
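A minimal sketch of the Dynamic CBC level-replacement step described in item H. All names are hypothetical; SKIM's actual implementation is not published here.

```python
import random

def adapt_design(remaining_tasks, chosen_concept, attributes, focused_attribute, focus_weight=3.0):
    """Fix one attribute in the remaining tasks to a level from a previously chosen concept.

    remaining_tasks: list of tasks, each a list of concepts (dict attribute -> level)
    chosen_concept:  dict attribute -> level for a concept the respondent chose earlier
    focused_attribute: the attribute the respondent said they focused on the most
    """
    # The stated attribute gets a higher probability of being the one that is fixed.
    weights = [focus_weight if a == focused_attribute else 1.0 for a in attributes]
    fixed_attr = random.choices(attributes, weights=weights, k=1)[0]
    fixed_level = chosen_concept[fixed_attr]
    for task in remaining_tasks:
        for concept in task:
            concept[fixed_attr] = fixed_level   # level replacement, akin to ACBC "near neighbors"
    return fixed_attr
```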
THE COMPARISON At this point we have not yet fully covered how we went about testing our first hypothesis concerning the way in which comparison tests have been performed between ACBC and other methodologies. Aside from applying all methodologies to a complex market, in this case the television market in the United States, the holdout tasks had to be representative of the typical choice that a consumer might have to make in reality. After all, a more relevant simulation from such a study is one that tries to mimic reality as best as possible. In order to better represent the market, respondents completed a set of three holdout tasks consisting of twenty concepts each (not including a none option). Within each holdout task, concepts were relatively similar to one another in terms of their total price so as not to create “easy” decisions for respondents that always went for a particular price tier or feature. We aimed to cover a different tier of the market with each task so as not to alienate any respondents because of their preferred price tier. An example holdout task can be seen below:
While quite often holdout tasks will be similar to the tasks of the choice exercise, in the case of complex markets it would be very difficult to have many screening or choice tournament tasks that show as many as twenty concepts on the screen and still expect reliable data from respondents. In terms of the reality of the exercise, it is important to distinguish between the reality of choice behavior and the decision in the end. Whereas the entire ACBC exercise can help us better understand the non-compensatory choice behavior of respondents while only showing three to four concepts at a time, we still must assess its predictive power in the context of an "actual" complex purchase that we aim to model in the simulator. As a result of having so many concepts available in the holdout tasks, the probability of a successful holdout prediction is much lower than what one may be used to seeing in similar research. For example, if we were to use hit rates as a means of assessing each methodology, by random chance we would be correct just 5% of the time. Holdout hit rates also provide no information about the proximity of the prediction, therefore giving no credit for when a method comes close to predicting the correct concept. Take the following respondent simulation for example:

Concept   Share of Preference   Concept   Share of Preference
1         0.64%                 11        1.48%
2         0.03%                 12        0.88%
3         0.65%                 13        29.78%
4         0.33%                 14        0.95%
5         4.68%                 15        27.31%
6         0.18%                 16        12.99%
7         4.67%                 17        4.67%
8         2.40%                 18        0.17%
9         1.75%                 19        4.28%
10        0.07%                 20        2.11%
If, for example, in this particular scenario the respondent chose concept 15, which comes in at a close second to concept 13, the hit-rate approach would simply count the methodology as being incorrect despite how close it came to being correct. By definition, the share of preference model is also telling us that there is roughly a 70.22% chance that the respondent would not choose concept 13, so it is not necessarily incorrect in suggesting that the respondent might not choose concept 13 either. Because of this potential for hit rates to understate the accuracy of a method, particularly one with complex holdout tasks, we decided to use an alternative success metric: the mean of the shares of preference of the concepts that respondents chose in the three holdout tasks. In the example above, if the respondent chose concept 15, their contribution to the mean share of preference metric for the methodology that they took would be a score of 27.31%. As a benchmark, this can be compared to the predicted share of preference of a random choice, which in this case would be 5%. Please note that all mean shares of preference reported for the remainder of this paper used a default exponent of one in the modeling process. While testing values of the exponent, we found that any exponent value between 0.7 and 1 yielded similar optimal results. In addition, in the tables we have also added the "traditional" hit rate, comparing first choice simulations with the actual answers to holdout tasks. A methodological note on the two measures of performance has been added at the end of this paper.
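The mean-share-of-preference metric can be sketched as follows: the logit share (with a tunable exponent) that the model assigns to the concept a respondent actually chose in a holdout task, averaged over respondents. Function and argument names are ours.

```python
import numpy as np

def mean_chosen_sop(utilities: np.ndarray, chosen: np.ndarray, exponent: float = 1.0) -> float:
    """utilities: respondents x concepts total utilities; chosen: index of each chosen concept."""
    scaled = exponent * utilities
    scaled -= scaled.max(axis=1, keepdims=True)           # numerical stability
    expu = np.exp(scaled)
    shares = expu / expu.sum(axis=1, keepdims=True)       # share of preference per concept
    return float(shares[np.arange(len(chosen)), chosen].mean())
```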
RESPONDENTS ENJOYED BOTH CBC AND ACBC EXERCISES As a first result, it is important to note that there were no significant differences (using an alpha of 0.05) in respondent enjoyment or ease of understanding across the methodologies. At the end of the survey respondents were asked whether they enjoyed taking the survey and whether they found it difficult to fill in the questionnaire. Please note though that these questions were asked in reference to the entire survey, not necessarily just the discrete choice portion of the study. A summary of the results from these questions can be found below:
A key takeaway from this is that there is little need to worry about whether respondents will have a difficult time taking a CBC study or enjoy it any less than an ACBC study despite it being less interactive. Though this is a purely directional result, it is interesting to note that the methodology rated least difficult to fill in was ACBC without a screening section, even less so than a standard CBC exercise that does not include a BYO task.
MAIN RESULTS
Diving into the more important results now, what we see is that all variations of ACBC significantly outperformed all variations of CBC, though there is no significant difference (using an alpha of 0.05) between the different ACBC legs. Interestingly, though, Dynamic CBC was able to outperform both standard and price balanced CBC. A summary of the results can be found in the chart below. Notably, all of the above conclusions hold for both metrics.
Some comments on specific legs:
- ACBC with a price range of 90%-120% behaved as expected in the sense that the concepts in the choice tournament had more realistic price levels. However, this did not result in more accurate predictions.
- ACBC with price in the unacceptable questions behaved as expected in the sense that no less than 40% of respondents rejected concepts above a certain price point. However, this did not result in a significantly richer choice tournament and consequently did not lead to better predictions.
- ACBC without a screening section behaved as expected in the sense that there was a much richer choice tournament (by definition, because all near-neighbor concepts were in the tournament). This did not lead to an increase in prediction performance, however. But perhaps more interesting is the fact that the prediction performance did not decline either in comparison with standard ACBC. Apparently a rich tournament compensates for dropping the screening section, so it seems like the screening section can be skipped altogether, saving a substantial amount of interview time (except for the most disengaged respondents who formerly largely skipped the tournament).
INCREASING THE GRANULARITY OF THE PRICE ATTRIBUTE As we mentioned earlier, each leg showed respondents a summed total price for each concept and this price was allowed some variation according to the ranges that we specified in the setup. This is done by taking the summed component price of the concept and multiplying it by a random number in line with the specified variation range. Because of this process, price is a continuous variable rather than discrete as with the other attributes in the study. Therefore in order to estimate the price utilities, we chose to apply a piecewise utility estimation for the attribute as is generally used in ACBC studies involving price. In the current ACBC software, users are allowed to specify up to 12 cut-points in the tested price range for which price slopes will be estimated between consecutive points. Using this method, the estimated slope is consistent between each of the specified cut-points so it is important to choose these cut-points in such a way that it best reflects the varying consumer price sensitivities across the range (i.e., identifying price thresholds or periods of relative inelasticity). Given the wide range of prices tested in our study (as mentioned earlier, the summed component price ranged from $120 to $3,500) we felt that our price range could be better modeled using more cut-points than the software currently allows. Since the concepts shown to a respondent within ACBC are near-neighbors of their BYO concept, it is quite possible that through this and the screening exercise we only collected data for these respondents on a subset of the total price range. This would therefore make much of the specified cut-points irrelevant to this respondent and lead to a poorer reflection of their price sensitivity across the price range that was most relevant to them. We saw this happening especially with respondents with a BYO choice in the lowest price range. Based on this and the relative count frequencies (percent of times chosen out of times shown) in the price data, we decided to increase the number of cut-points to 28 and run the estimation using CBC HB. As you can see in the data below, ACBC benefits much more from increasing the number of cut-points than CBC (i.e., the predicted SoP in the ACBC legs increases, while the predicted SoP in the CBC legs remains stable; the predicted traditional hit rate in the ACBC legs remains nearly stable while for CBC it decreases). A reasonable explanation for this difference between the ACBC and CBC legs is the fact that in ACBC legs the variation in prices in each individual interview is generally a lot smaller than in a CBC study. So in ACBC studies it does not harm to have that many cut-points and when you would look at the mean SoP metric, it seems actually better to do so.
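One common way to set up piecewise-linear price coding with many cut-points is sketched below. This is a generic design-matrix construction under our own assumptions, not the Sawtooth CBC/HB implementation.

```python
import numpy as np

def piecewise_price_design(prices: np.ndarray, cutpoints: np.ndarray) -> np.ndarray:
    """One column per segment between consecutive cut-points; each column holds the part of
    the price falling inside that segment, so a separate slope can be estimated per segment."""
    lo, hi = cutpoints[:-1], cutpoints[1:]
    return np.clip(prices[:, None], lo, hi) - lo

# Example: 28 cut-points spread over the tested $120-$3,500 summed-price range (27 slopes).
cuts = np.linspace(120, 3500, 28)
X_price = piecewise_price_design(np.array([499.0, 1299.0, 2999.0]), cuts)
```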
AN ALTERNATIVE MODEL COMPARISON
In order to make sure that our results were not purely an artifact of our mean share of preference metric for comparing the methodologies, we also looked into the mean squared error (MSE) of the predicted shares of preference across respondents. Using this metric, we could see how much the simulated preference shares deviated from the actual holdout task choices. This was calculated as:

MSE = mean over all holdout concepts of (aggregate concept share from the holdout task - aggregate mean predicted concept SoP)^2

where the aggregate concept share from the holdout task is the percent of respondents that selected the concept in the holdout task, and the aggregate mean predicted concept SoP is the mean share of preference for the concept across all respondents. As displayed in the table below, using mean squared error as a success metric also confirms our previous results based on the mean share of preference metric:
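A short sketch of this aggregate comparison, with names of ours:

```python
import numpy as np

def aggregate_mse(actual_shares: np.ndarray, predicted_sop: np.ndarray) -> float:
    """actual_shares: per-concept percent of respondents choosing it in the holdout task;
    predicted_sop: respondents x concepts predicted shares of preference."""
    mean_predicted = predicted_sop.mean(axis=0)        # aggregate predicted share per concept
    return float(np.mean((actual_shares - mean_predicted) ** 2))
```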
When looking into these results, it was a bit concerning that the square root of the mean squared error was close to the average aggregate share of preference for a concept (5% = 1/20 concepts), particularly for the CBC exercises. Upon further investigation, we noticed that there was one particular concept in one of the holdout tasks that generates a large amount of the squared error in the table above. This concept was the cheapest priced concept in the holdout task that contained products of a higher price tier than the other two holdout tasks. At an aggregate level, this particular concept had an actual holdout share of 43% but the predicted share of preference was significantly lower: the ACBC modules had an average share of preference of 27% for this concept whereas the standard CBC leg (8%) and price-balanced CBC leg (17%) performed much worse. By removing this one concept from the consideration set, it helped to relieve the mean squared error metric quite a bit:
Although "only" one concept seems to primarily distort the results, it is nevertheless meaningful to dive further into it. After all, we do not want to get an entirely wrong share from the simulator for just one product, even if it is just one product. In the section below we have looked at it in more depth. Should you wish to skip that section, we can already share that we find brand sensitivity seems to be overestimated by CBC-HB and ACBC-HB while price sensitivity seems to be underestimated. This applies to all CBC and ACBC legs (although to the ACBC legs slightly less so). The one concept with a completely wrong MSE contribution is exactly the cheapest of the 20 concepts in that task, hence its predicted share is way lower than its stated actual share.
CLUSTER ANALYSIS ON PREDICTED SHARES OF PREFERENCE FOR HOLDOUT TASKS
A common way to cluster respondents based on conjoint data is by means of Latent Class analysis. In this way we get an understanding of differences between respondents regarding all attribute levels in the study. A somewhat different approach is to cluster respondents based on predicted shares of preference in the simulator. In this way we get an understanding of differences between respondents regarding their preferences for actual products in the market. While Latent Class is based on the entire spectrum of attribute levels, clustering on predicted shares narrows the clustering down to what is relevant in the current market.
We took the latter approach and applied CCEA for clustering on the predicted shares of preference for the concepts in the three holdout tasks (as mentioned earlier, the holdout tasks were meant to be representative of a simulated complex market). We combined all eight study legs together to get a robust sample for this type of analysis. This was possible because the structure of the data and the scale of the data (shares of preference) are the same across all eight legs. In retrospect we checked whether there were significant differences in cluster membership between the legs (there were some significant differences, but these differences were not particularly big). The cluster analysis resulted in the following 2 clusters, each with 5 sub-clusters, for a total of 10 clusters:
1. One cluster of respondents preferring a low end TV (up to some $600), about 1/3 of the sample. This cluster consists of 5 sub-clusters based on brand; the 5 sub-clusters largely coincide with a strong preference for just one specific brand out of the 5 brands tested, although in some sub-clusters there was a preference for two brands jointly.
2. The other cluster of respondents preferring a mid/high end TV (starting at some $600), about 2/3 of the sample. This cluster also consists of 5 sub-clusters based on brand, with the same remark as before for the low end TV cluster.
The next step was to redo the MSE comparison by cluster and sub-cluster. Please note that the shares of preference within a cluster will by definition reflect the low end or mid/high end category and a certain brand preference, completely in line with the general description of the cluster. After all, the clustering was based on these shares of preference. The interesting thing is to see how the actual shares (based on actual answers in holdout tasks) behave by sub-cluster. The following is merely an example of one sub-cluster, namely the "mid/high end Vizio cluster," yet the same phenomenon applies in all of the other 9 sub-clusters. The following graph makes the comparison at a brand level (shares of the SKUs of each brand together) and compares predicted shares of preference in the cluster with the actual shares in the cluster, for each of the three choice tasks.
Since this is the "mid/high end Vizio cluster," the predicted shares of preference for Vizio (in the first three rows) are by definition much higher than for the other brands and are also nearly equal for the three holdout tasks. However, the actual shares of Vizio in the holdout tasks (in the last three rows) are very similar to the shares of Samsung and Sony, while varying a lot more across the three holdout tasks than the predicted shares do. So the predicted strong brand preference for Vizio is not reflected at all in the actual holdout tasks! The fact that the actual brand shares are quite different across the three tasks very likely has its cause in the particular specifications of the holdout tasks:

1. In the first holdout task, Panasonic had the cheapest product in the whole task, followed by Samsung. Both of these brands have a much higher share than predicted.

2. In the second holdout task, both Vizio and LG had the cheapest product in the whole task. Indeed Vizio has a higher actual share than in the other tasks (but not as overwhelmingly as in the prediction) while LG still has a low actual share (which at least confirms that people in this cluster dislike LG).

3. In the third holdout task, Samsung had the cheapest product in the whole task and has a much higher share than predicted. It is exactly this one concept (the cheapest one of Samsung) that was distorting the whole aggregate MSE analysis that we discussed in the previous section.

Less clear in the second holdout task, but very clear in the first and third holdout tasks, we seem to have a case here where the importance of brand is being overestimated while the importance of price is being underestimated. The actual choices are much more based on price than was predicted and much less based on brand. The earlier MSE analysis in which we found one concept with a big discrepancy between actual and predicted aggregate share fits perfectly into this. Please note though that this is evidence, not proof (we found a black swan, but not all swans are black).
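For readers who want to reproduce this kind of segmentation, a rough stand-in for the clustering step is sketched below. The paper used CCEA (Sawtooth Software's cluster ensemble analysis tool); this sketch instead uses a two-stage k-means from scikit-learn on random placeholder share data to mimic the reported two-clusters-by-five-sub-clusters structure, so it illustrates the idea only and is not the authors' procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one row per respondent, columns are the predicted shares
# of preference for all concepts in the three holdout tasks (placeholder data).
n_respondents = 1200                               # not the study's sample size
rng = np.random.default_rng(7)
shares = rng.dirichlet(np.ones(20), size=(n_respondents, 3)).reshape(n_respondents, -1)

# Two-stage clustering: first a price-tier split (2 clusters), then brand-based
# sub-clusters (5 each) within every tier.
tiers = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(shares)
sub_clusters = np.empty(n_respondents, dtype=int)
for t in (0, 1):
    mask = tiers == t
    sub_clusters[mask] = KMeans(n_clusters=5, n_init=10,
                                random_state=0).fit_predict(shares[mask])
print(np.bincount(tiers), np.bincount(sub_clusters[tiers == 0]))
```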
DISCUSSION AFTER THE CONFERENCE: WHICH METRIC TO USE? A discussion continued after the conference concerning how to evaluate and compare different conjoint methods: should it be hit rates (as has been customary in the conjoint world over the years) or the predicted share of preference of the chosen concept? Hit rates have the clear advantage of not being subject to scale issues, whereas shares of preference can be manipulated through a lower or higher scale factor in the utilities (though one may well ask why anyone would manipulate shares of preference with scale factors in the first place). Predicted shares of preference, on the other hand, have the clear advantage of being a closer representation of the underlying logistic probability model. Speaking of this model, if the multinomial logit specification were correct (i.e., if answers to holdout tasks could be regarded as random draws from these utility-based individual models), we should not see a significant difference between the hit rate percentage and the mean share of preference. Reversing this reasoning, since these two measures deviate so much, apparently either the model specification or the estimation procedure contains some element of bias. It may be just the scale factor that is biased, or the problem may be bigger than that. We would definitely recommend further academic investigation of this issue.
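For clarity, the two metrics under discussion can be computed as follows. This is a minimal sketch assuming a hypothetical matrix of simulated shares of preference for one holdout task and the concept each respondent actually chose; the data are placeholders, not the study's.

```python
import numpy as np

def holdout_metrics(pred_shares: np.ndarray, chosen: np.ndarray):
    """pred_shares: (n_respondents, n_concepts) simulated shares of preference
    for one holdout task; chosen: index of the concept each respondent picked."""
    hits = pred_shares.argmax(axis=1) == chosen
    hit_rate = hits.mean()                                   # classic hit rate
    share_of_chosen = pred_shares[np.arange(len(chosen)), chosen].mean()
    return hit_rate, share_of_chosen

# Hypothetical example with 3 respondents and 20 concepts.
rng = np.random.default_rng(1)
pred = rng.dirichlet(np.ones(20), size=3)
picked = rng.integers(0, 20, size=3)
print(holdout_metrics(pred, picked))
```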
CONCLUSIONS

In general we were thrilled by some of these findings but less than enthusiastic about others. On a positive note, it was interesting to see that removing the screening section of ACBC does not harm its predictive strength. This can be important for studies in which shortening the length of interview is necessary (e.g., because other methods that also require the respondent's engagement are being used, or for cost purposes). While for this test we had a customized version of ACBC available, Sawtooth Software announced during the conference that skipping the screening section will be available as an option to the "general public" in the next SSI Web version (v8.3). On a similar note, using a narrower price range for ACBC did not hurt either (though nor did it beat standard ACBC), despite the fact that doing so creates more multicollinearity in the design since the total price shown to respondents is more in line with the sum of the concept's feature prices. Across all methods, it was encouraging to find that all models beat random choice by a factor of 3–4, meaning that it is very well possible to predict respondent choice in complex choice situations such as purchasing one television out of twenty alternatives. One more double-edged learning is that while none of the ACBC alternatives outperformed standard ACBC, this just goes to show that ACBC already performs well as it is! In addition, since many of the alternative legs were simplifications of the standard methodology, it shows that the method is robust enough to support simplifications and still yield similar results. This may very well hint that much of the respondent's non-compensatory choice behavior can be inferred from their choices between near-neighbors of their BYO concept. After all, the BYO concept is for all intents and purposes the "ideal" concept for the respondent. We had also hoped that price-balanced CBC would help to improve standard CBC since it would force respondents to make trade-offs in areas other than just price; however, this did not turn out to be the case. From our own internal perspective, it was quite promising to see that Dynamic CBC outperformed both CBC alternatives, though again disappointing that it did not quite match the
predictive power of the ACBC legs tested. Despite not performing as well as ACBC, Dynamic CBC could still be a viable methodology to use in cases where something like a BYO exercise or screening section might not make sense in the context of the market being analyzed. In addition, further refinement of the method could possibly lead to results as well if not better than ACBC. Finally we were surprised to see that all of the 8 tested legs were very poor in predicting one of the concepts in one of the holdout tasks (although the ACBC legs did somewhat less poorly than the CBC legs). The underlying phenomenon seems to be that brand sensitivity is overestimated in all legs while price sensitivity is underestimated. This is something definitely to dig into further, and—who knows—may eventually lead to an entirely new conjoint approach beyond ACBC or CBC.
NEXT STEPS As mentioned in the results section, there were some discrepancies between the actual holdout responses and the conjoint predictions that we would like to further investigate. Given the promise shown by our initial run of Dynamic CBC, we would like to further test more variants of it and in other markets as well. As a means of further testing our results, we would also like to double-check our findings concerning the increased number of cut-points on any past data that we may have from ACBC studies that included holdout tasks.
Marco Hoogerbrugge
Jeroen Hardon
Christopher Fotenos
REFERENCES Sawtooth Software (2009), “ACBC Technical Paper,” Sawtooth Software Technical Paper Series, Sequim, WA
RESEARCH SPACE AND REALISTIC PRICING IN SHELF LAYOUT CONJOINT (SLC) PETER KURZ1 TNS INFRATEST STEFAN BINNER BMS MARKETING RESEARCH + STRATEGY LEONHARD KEHL PREMIUM CHOICE RESEARCH & CONSULTING
WHY TALK ABOUT SHELVES? For consumers, times changed long ago. Rather than being served by a shop assistant, consumers now shop in super- and hypermarkets, which have changed the way we buy products, especially fast-moving consumer goods (FMCGs). In most developed countries there is an overwhelming number of products for consumers to select from:
[Images: "Traditional Trade" vs. "Modern Trade"]
As marketers became aware of the importance of packaging design, assortment and positioning of their products on these huge shelves, researchers developed methods to test these new marketing mix elements. One example is a “shelf test” where respondents are interviewed in front of a real shelf about their reaction to the offered products. (In FMCG work, the products are often referred to as “stock keeping units” or “SKUs,” a term that emphasizes that each variation of flavor or package size is treated as a different product.) For a long time, conjoint analysis was not very good at mimicking such shelves in choice tasks: early versions of CBC were limited to a small number of concepts to be shown. Furthermore the philosophical approach for conjoint analysis, let’s call it the traditional conjoint approach, was driven by taking products apart into attributes and levels. However, this traditional approach missed some key elements in consumers’ choice situation in front of a modern FMCG shelf, e.g.:
How does the packaging design of an SKU communicate the benefits (attribute levels) of a product?
1 Correspondence: Peter Kurz, Head of Research & Development, TNS Infratest ([email protected]); Stefan Binner, Managing Director, bms marketing research + strategy ([email protected]); Leonhard Kehl, Managing Director, Premium Choice Research & Consulting ([email protected]).
How does an SKU perform in the complex competition with the other SKUs on the shelf?
As it became easy for researchers to create shelf-like choice tasks (in 2013, among Sawtooth Software users who use CBC, 11% of their CBC projects employed shelf display) a new conjoint approach developed: “Shelf Layout Conjoint” or “SLC.”
HOW IS SHELF LAYOUT CONJOINT DIFFERENT? The main differences between Traditional and Shelf Layout Conjoint are summarized below:

TRADITIONAL CONJOINT
- Products or concepts usually consist of defined attribute levels
- More rational or textual concept descriptions (compared to packaging pictures)
- Almost no impact of package design
- Usually not too many concepts per task
- Many attributes—few concepts

SHELF LAYOUT CONJOINT
- Communication of "attributes" through non-varying package design (instead of levels)
- Visibility of all concepts at once
- Includes the impact of assortment
- Includes the impact of shelf position and number of facings
- Information overflow
- Few visible attributes (mainly product and price—picture only)—many concepts
Many approaches are used to represent a shelf in a conjoint task, ranging from very simple grids of product images to quite sophisticated, near-photorealistic shelf visualizations.
However, even the most sophisticated computerized visualization does not reflect the real situation of a consumer in a supermarket (Kurz 2008). In that paper, comparisons between a simple grid of products from which consumers make their choices and attempts to make the choice exercise more realistic by showing a store shelf in 3D showed no significant differences in the resulting preference share models.
THE CHALLENGES OF SHELF LAYOUT CONJOINT Besides differences in the visualization of the shelves, there are different objectives SLCs can address, including:
- pricing
- product optimization
- portfolio optimization
- positioning
- layout
- promotion
SLCs also differ in the complexity of their models and experimental designs, ranging from simple main effects models up to complex Discrete Choice Models (DCMs) with lots of attributes and parameters to be estimated. Researchers often run into very complex models, with one attribute with a large number of levels (the SKUs) and, tied to each of these levels, one attribute (often price) with a certain number of levels. Such designs can easily end up with several hundred parameters to be estimated. Furthermore, for complex experimental designs, layouts have to be generated in a special way in order to retain realistic relationships between SKUs and realistic results. So-called "alternative-specific designs" are often used in SLC, but that does not necessarily mean that it is always a good idea to estimate price effects as being alternative-specific. In terms of estimating utility values (under the assumption you estimate interaction effects, which lead to alternative-specific price effects), many different coding schemes can be prepared which are mathematically identical. But the experimental design behind the shelves is slightly different. Different design strategies affect how much level overlap occurs and therefore how efficient the
estimation of interactions can be. Good strategies to reduce this complexity in the estimation stage are crucial.

With Shelf Layout Conjoint now easily available to every CBC user, we would like to encourage researchers to use this powerful tool. However, there are at least five critical questions in the design of Shelf Layout Conjoints which need to be addressed:

- Are the research objectives suitable for Shelf Layout Conjoint?
- What is the correct target group and SKU space (the "research space")?
- Are the planned shelf layout choice tasks meaningful for respondents, and will they provide the desired information from their choices?
- Can we assure realistic inputs and results with regard to pricing?
- How can we build simulation models that provide reliable and meaningful results?

As this list suggests, there are many topics, problems and possible solutions with Shelf Layout Conjoint. However, this paper focuses on only three of these very important issues. We take a practitioner's, rather than an academic's, point of view. The three key areas we will address are:

1. Which are suitable research objectives?
2. How to define the research space?
3. How to handle pricing in a realistic way?
SUITABLE RESEARCH OBJECTIVES FOR SHELF LAYOUT CONJOINT Evaluating suitable objectives requires that researchers be aware of all the limitations and obstacles Shelf Layout Conjoint has. So, we begin by introducing three of those key limitations. 1. Visualization of the test shelf. Real shelves always look different than test shelves.
Furthermore, there is naturally a difference between a 21˝ screen and a 10-meter shelf in a real supermarket. The SKUs are shown much smaller than in reality, and one cannot touch and feel them. 3D models and other approaches might help, but the basic issue still remains.
2. Realistic choices for consumers. Shelf Layout Conjoint creates an artificial distribution and awareness: all products are on the shelf, and respondents are usually asked to look at or consider all of them. In addition, we usually simplify the market with our test shelf. In reality every distribution chain has different shelf offerings, which might further vary with the size of the individual store. In the real world, consumers often leave store A and go to store B if they do not like the offering (shop hopping). Sometimes products are out of stock, forcing consumers to buy something different.

3. Market predictions from Shelf Layout Conjoint. Shelf Layout Conjoint provides results from a single purchase simulation. We gain no insights about repurchase (did they like the product at all?) or future purchase frequency. In reality, promotions play a big role, not only in the shelf display but in other ways, for example, with second facings. It is very challenging to measure volumetric promotion effects such as "stocking up" purchases, but those play a big role in some product categories (Eagle 2010; Pandey, Wagner 2012). The complexity of "razor and blade" products, where manufacturers make their profit on the refill or consumable rather than on the basic product or tool, is another example of the difficult obstacles researchers can be faced with.
SUITABLE OBJECTIVES

Despite these limitations and obstacles Shelf Layout Conjoint can provide powerful knowledge. It is just a matter of using it to address the right objectives; if you use it appropriately, it works very well! Usually suitable objectives for Shelf Layout Conjoint fall in the areas of either optimization of assortment or pricing.

The optimization of assortment most often refers to such issues as:

Line extensions with additional SKUs
- What is the impact (share of choice) of the new product?
- Where does this share of choice come from (customer migration and/or cannibalization)?
- Which possible line extension has the highest consumer preference or leads to the best overall result for the total brand assortment or product line?

Re-launch or substitution of existing SKUs
- What is the impact (share of choice) of the re-launch?
- Which possible re-launch alternative has the highest consumer preference?
- Which SKU should be substituted?
- How does the result of the re-launch compare to a line extension?

Branding
- What is the effect of re-branding a line of products?
- What is the effect of the market entry of new brands or competitors?

The optimization of pricing most often involves questions like:

Price positioning
- What is the impact of different prices on share of choice and profit?
- How should the different SKUs within an assortment be priced?
- How will the market react to a competitor's price changes?

Promotions
- What is the impact of (sensitivity to) promotions?
- Which SKUs have the highest promotion effect?
- How much price reduction is necessary to create a promotion effect?

Indirect pricing
- What is the impact of different contents (i.e., package sizes) on share of choice and profit?
- How should the contents of different SKUs within an assortment be defined?
- How will the market react to competitors' changes in contents?

On the other hand there are research objectives which are problematic, or at least challenging, for Shelf Layout Conjoint. Some examples include:
- Market size forecasts for a sales period
- Volumetric promotion effects
- Multi-category purchase/TURF-like goals
- Positioning of products on the shelf
- Development of package design
- Evaluation of new product concepts
- New product development
Not all of the above research objectives are impossible, but they at least require very tailored or cautious approaches.
DEFINITION OF THE CORRECT MARKET SCOPE

By the terminology "market scope" we mean the research space of the Shelf Layout Conjoint. Market scope can be defined by three questions, which are somewhat related to each other:

- What SKUs do we show on the shelf? => SKU Space
- What consumers do we interview? => Target Group
- What do we actually ask them to do? => Context
SKU Space
Possible solutions to this basic question (which SKUs to include) depend heavily on the specific market and product category. Two main types of solutions are:

1. Focusing the SKU space on or by market segments. Such focus could be achieved by narrowing the SKU space to just a part of the market, such as:
- distribution channel (no shop has all SKUs)
- product subcategories (e.g., product types such as solid vs. liquid)
- market segments (e.g., premium or value for money)
This will provide more meaningful results for the targeted market segment. However, it may miss migration effects from (or to) other segments. Furthermore, such segments might be quite artificial from the consumer's point of view. Alternatively one could focus on the most relevant products only (80:20 rule).

2. Strategies to cope with too many SKUs.
When there are more SKUs than can be shown on the screen or understood by the respondents, strategies might include:
- Prior SKU selection (consideration sets)
- Multiple models for subcategories
- Partial shelves

Further considerations for the SKU space:
- Private labels (which SKUs could represent this segment?)
- Out-of-stock effects (whether and how to include them?)

Target Group
The definition of the target group must align with the SKU space. If there is a market segment focus, then obviously the target group should only include customers in that segment. Conversely, if there are strategies for huge numbers of SKUs, all relevant customers should be included. There are still other questions about the target group which should also be addressed, including:
- Current buyers only, or also non-buyers?
- Quotas on recently used brands or SKUs?
- Quotas on distribution channel?
- Quotas on the purchase occasion?

Context
Once the SKU space and the target group are defined, the final element of "market scope" is to create realistic, meaningful choice tasks:

1. Setting the scene:
- Introduction of new SKUs
- Advertising simulation

2. Visualization of the shelf:
- Shelf layout (brand blocks, multiple facings)
- Line pricing/promotions
- Possibility to enlarge or "examine" products

3. The conjoint/choice question:
- What exactly is the task?
- Setting a purchase scenario or not?
- Single choice or volumetric measurement?
PRICING

Pricing is one of the most important topics in Shelf Layout Conjoint, if not the most important. In nearly all SLCs some kind of pricing issue is included as an objective. But "pricing" does not mean just one common approach. Research questions with regard to pricing differ greatly between studies. They start with easy questions about the "right price" of a single SKU. They often include the pricing of whole product portfolios, including different pack sizes, flavors and
variants and may extend to complicated objectives like determining the best promotion price and the impact of different price tags. Before designing a Shelf Layout Conjoint researchers must therefore have a clear answer to the question: “How can we obtain realistic input on pricing?” Realistic pricing does not simply mean that one needs to design around the correct regular sales price. It also requires a clear understanding of whether the following topics play a role in the research context. Topic 1: Market Relevant Pricing
The main issue of this topic is to investigate the context in which the pricing scenario takes place. Usually such an investigation starts with the determination of actual sales prices. At first glance, this seems very easy and not worth a lot of time. However, most products are not priced with a single regular sales price. For example, there are different prices in different sales channels or store brands. Most products have many different actual sales prices. Therefore one must start with a closer look at scanner data or list prices of the products in the SKU space. As a next step, one has to get a clear understanding of the environment in which the product is sold. Are there different channels like hypermarkets, supermarkets, traders, etc. that have to be taken into account? In the real world, prices are often too different across channels to be used in only one Shelf Layout Conjoint design. So we often end up with different conjoint models for the different channels. Furthermore, the different store brands may play a role. Store brand A might have a different pricing strategy because it competes with a different set of SKUs than store brand B. How relevant are the different private labels or white-label/generic products in the researched market? In consequence one often ends up with more than one Shelf Layout Conjoint model (perhaps even dozens of models) for one "simple" pricing context. In such a situation, researchers have to decide whether to simulate each model independently or to build up a more complex simulator. This will allow pricing simulations on an overall market level, tying together the large number of choice models, to construct a realistic "playground" for market simulations.
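Where several channel-level models are needed, one way to tie them into a market-level simulator is to weight each channel's shares of preference by that channel's volume. The sketch below is a minimal illustration with made-up channel names, weights, and shares; it is not the authors' simulator and it ignores complications such as differing SKU availability per channel.

```python
# Hypothetical aggregation of separate channel-level SLC simulators into one
# market-level view. Channel weights and shares are placeholders.
channel_weights = {"hypermarket": 0.45, "supermarket": 0.35, "discounter": 0.20}

# Shares of preference per SKU taken from each channel's own choice model/simulator.
channel_shares = {
    "hypermarket": {"SKU_A": 0.30, "SKU_B": 0.50, "SKU_C": 0.20},
    "supermarket": {"SKU_A": 0.25, "SKU_B": 0.45, "SKU_C": 0.30},
    "discounter":  {"SKU_A": 0.10, "SKU_B": 0.60, "SKU_C": 0.30},
}

market_share = {}
for channel, weight in channel_weights.items():
    for sku, share in channel_shares[channel].items():
        market_share[sku] = market_share.get(sku, 0.0) + weight * share

print(market_share)   # volume-weighted, market-level shares of preference
```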
Topic 2: Initial Price Position of New SKUs

With the simulated launch of new products one has to make prior assumptions about their pricing before the survey is fielded. Thus, one of the important tasks for the researcher and her client is to define reasonable price points for the new products in the model. The price range must be as wide as necessary, but as narrow as possible. Topic 3: Definition of Price Range Widths and Steps
Shelf Layout Conjoint should cover all possible pricing scenarios that would be interesting for the client. However, respondents should not be confronted with unrealistically high or low prices. Such extremes might artificially influence the choices of the respondent and might have an influence on the measured price elasticity. Unrealistically high price elasticity is usually caused by too wide a price range, with extremely cheap or extremely expensive prices. One should be aware that the price range over which an SKU is studied has a direct impact on its elasticity results! This is not only true for new products, where respondents have no real price
knowledge, but also for existing products. Furthermore unrealistically low or high price points can result in less attention from respondents and more fatigue in the answers of respondents, than realistic price changes would have caused. Topic 4: Assortment Pricing (Line Pricing)
Many clients have not just a single product in the market, but a complete line of competing products on the same shelf. In such cases it is often important to price the products in relation to each other. A specific issue in this regard is line pricing: several products of one supplier share the same price, but differ in their contents (package sizes) or other characteristics. Many researchers measure the utility of prices independently for each SKU and create line pricing only in the simulation stage. However, in this situation, it is essential to use line-priced choice tasks in the interview: respondents' preference structure can be very different when seeing the same prices for all products of one manufacturer rather than seeing different prices, which often results in choosing the least expensive product. This leads to overestimation of preference shares for cheaper products. A similar effect can be observed if the relative price separations of products are not respected in the choice tasks. For example: if one always sells orange juice for 50 cents more than water, this relative price distance is known or learned by consumers and taken into account when they state their preference in the choice tasks. Special pricing designs such as line pricing can be constructed by exporting the standard design created with Sawtooth Software's CBC into a CSV format and reworking it in Excel. However, one must test the manipulated designs afterwards in order to ensure the prior design criteria are still met. This is done by re-importing the modified design into Sawtooth Software's CBC and running efficiency tests.
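As a rough sketch of that export-rework-reimport step, the snippet below forces line pricing within each task of an exported design. The column names (Version, Task, Concept, SKU, PriceLevel), the file names, and the brand lookup are assumptions for illustration; the actual layout of a CBC design export differs and must be checked, and the reworked file must still be re-imported and tested for efficiency as described above.

```python
import pandas as pd

# Hypothetical exported design: one row per concept shown, with a price level
# per SKU. Column names and the brand lookup are assumptions for illustration.
design = pd.read_csv("cbc_design_export.csv")    # Version, Task, Concept, SKU, PriceLevel
brand_of = {"SKU_A": "BrandX", "SKU_B": "BrandX", "SKU_C": "BrandY"}
design["Brand"] = design["SKU"].map(brand_of)

# Line pricing: within each task, force all SKUs of a brand to share one price
# level (here: the level drawn for the brand's first SKU in that task).
design["PriceLevel"] = (design.groupby(["Version", "Task", "Brand"])["PriceLevel"]
                        .transform("first"))

design.drop(columns="Brand").to_csv("cbc_design_line_priced.csv", index=False)
# Re-import this file into CBC and re-run the design efficiency tests.
```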
Topic 5: Indirect Pricing

In markets where most brands offer line pricing, the real individual price positioning of SKUs is often achieved through variation in their package content sizes. Content size can be varied and modeled in the same way and with the same limitations as monetary prices. However, one must ensure that the content information is sufficiently visible to the consumers (e.g., written on price tags or on the product images).
Topic 6: Price Tags
Traditionally, prices in conjoint are treated like an attribute level and are simply displayed beneath each concept. Therefore in many Shelf Layout Conjoint projects the price tag is simply the representation of the actual market price and the selected range around it. However, in reality consumers see the product name, content size, number of applications, price per application or
standard unit in addition to the purchase price. (In the European Community, such information is mandatory by law; in many other places, it is at least customary if not required.) In choice tasks, many respondents therefore search not only for the purchase price, but also for additional information about the SKUs in their relevant choice set. Oversimplification of price tags in Shelf Layout Conjoint does not sufficiently reflect the real decision process. Therefore, it is essential to include the usual additional information to ensure realistic choice tasks for respondents.
Topic 7: Promotions
The subject of promotions in Shelf Layout Conjoint is much discussed and controversial. In our opinion, only some effects of promotions can be measured and modeled in Shelf Layout Conjoint. SLC provides a one-point-in-time measurement of consumer preference. Thus, promotion effects which require information about consumer choices over a time period cannot be measured with SLC. It is essential to keep in mind that we can neither answer the question of whether a promotion campaign results in higher sales volume for the client nor make assumptions about market expansion—we simply do not know anything about purchase behavior (what and how much) in the future period. However, SLC can simulate customers' reactions to different promotion activities. This includes the simulation of the price discount necessary to achieve a promotion effect, comparison of the effectiveness of different promotion types (e.g., buy two, get one free), as well as simulation of competitive reactions, but only at a single point in time. In order to analyze such promotion effects with high accuracy, we recommend applying different attributes and levels for the promotional offers from those for the usual market prices. SLCs including promotion effects therefore often have two sets of price parameters, one for the regular market price and one for the promotional price. Topic 8: Price Elasticity
Price Elasticity is a coefficient which tells us how sales volume changes when prices are changed. However, one cannot predict sales figures from SLC. What we get is “share of preference” or “share of choice” and we know whether more or fewer people are probably purchasing when prices change. In categories with single-unit purchase cycles, this is not much of a problem, but in the case of Fast Moving Consumer Goods (FMCG) with large shelves where consumers often buy more than one unit—especially under promotion—it is very critical to be precise when talking about price elasticity. We recommend speaking carefully of a “price to share of preference coefficient” unless sales figures are used in addition to derive true price elasticity.
The number of SKUs included in the research has a strong impact on the "price to share of preference coefficient." The fewer SKUs one includes in the model, the higher the ratio; many researchers see high "ratios" that are due only to the research design. Such ratios are certainly wrong if the client wants to know the true "coefficient of elasticity" based on sales figures.
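For reference, a "price to share of preference coefficient" can be computed from two simulator runs as an arc-style ratio of percentage changes; the sketch below uses hypothetical numbers and is explicitly not a sales-based elasticity.

```python
def share_price_coefficient(share_base, share_new, price_base, price_new):
    """Arc-style coefficient: % change in share of preference per % change in price.
    Shares come from simulator runs at the two price points (hypothetical inputs)."""
    pct_share = (share_new - share_base) / ((share_new + share_base) / 2)
    pct_price = (price_new - price_base) / ((price_new + price_base) / 2)
    return pct_share / pct_price

# Example: simulated share drops from 12% to 9% when price moves from 1.99 to 2.19.
print(share_price_coefficient(0.12, 0.09, 1.99, 2.19))   # roughly -3
```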
Topic 9: Complexity of the Model

SLC models are normally far more complex than the usual CBC/DCM models. The basic structure of an SLC is usually one many-leveled SKU attribute and, for each (or most) of its levels, one price attribute. Sometimes there are additional attributes for each SKU such as promotion or content. As a consequence there are often too many parameters to be estimated in HB; statistically, we have "over-parameterization" of the model. However there are approaches to reduce the number of estimated parameters (a rough parameter count comparison is sketched after the list below), e.g.:
- Do we need part-worth estimates for each price point? Could we use a linear price function?
- Do we really need price variation for all SKUs? Could we use fixed price points for some competitors' SKUs?
- Could we model different price effects by price tiers (such as low, mid, high) instead of one price attribute per SKU?
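To illustrate the scale of the over-parameterization problem, the sketch below counts (effects-coded) parameters for a hypothetical 40-SKU shelf under several of the simplifications listed above. The numbers are illustrative assumptions, not figures from the paper.

```python
# Hypothetical SLC sizing: 40 SKUs, 5 price points per SKU (effects coding uses
# levels - 1 parameters per attribute). Numbers are illustrative only.
n_sku, n_price_levels = 40, 5

sku_params = n_sku - 1
full_partworth = sku_params + n_sku * (n_price_levels - 1)   # SKU-specific price part-worths
linear_price = sku_params + n_sku                            # one linear price slope per SKU
tiered_price = sku_params + 3 * (n_price_levels - 1)         # low/mid/high price tiers
fixed_competitors = sku_params + 10 * (n_price_levels - 1)   # vary price for 10 own SKUs only

for name, k in [("SKU-specific part-worth price", full_partworth),
                ("SKU-specific linear price", linear_price),
                ("price tiers (3)", tiered_price),
                ("price varied for 10 SKUs only", fixed_competitors)]:
    print(f"{name:32s}: {k:4d} parameters")
```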
Depending on the quantity of information one can obtain from a single respondent, it may be better to use aggregate models than HB models. The question is, how many tasks could one really ask of a single respondent before reaching her individual choice task threshold, and how many concepts could be displayed on one screen (Kurz, Binner 2012)? If it’s not possible to show respondents a large enough number of choice tasks to get good individual information, relative to the large number of parameters in an SLC model, HB utility estimates will fail to capture much heterogeneity anyway.
TOPICS BEYOND THIS PAPER How can researchers further ensure that Shelf Layout Conjoint provides reliable and meaningful results? Here are some additional topics:
- Sample size and number of tasks
- Block designs
- Static SKUs
- Maximum number of SKUs on the shelf
- Choice task thresholds
- Bridged models
- Usage of different (more informative) priors in HB to obtain better estimates
EIGHT KEY TAKE-AWAYS FOR SLC

1. Be aware of its limitations when considering Shelf Layout Conjoint as a methodology for your customers' objectives. One cannot address every research question.
2. Try hard to ensure that your pricing accurately reflects the market reality. If one model is not possible, use multi-model simulations or single market segments.
3. Be aware that the price range definition for an SKU has a direct impact on its elasticity results, and define realistic price ranges with care.
4. Adapt your design to the market reality (e.g., line pricing), starting with the choice tasks and not only in your simulations.
5. Do not oversimplify price tags in Shelf Layout Conjoint; be sure to sufficiently reflect the real decision environment.
6. SLC provides just a one-point-in-time measurement of consumer preference. Promotion effects that require information about consumer choices over a time period cannot be measured.
7. Price elasticity derived from SLC is better called a "price to share of preference coefficient."
8. SLC often suffers from "over-parameterization" within the model. One should evaluate different approaches to reduce the number of estimated parameters.
Peter Kurz
Stefan Binner
Leonhard Kehl
REFERENCES

Eagle, Tom (2010), "Modeling Demand Using Simple Methods: Joint Discrete/Continuous Modeling," 2010 Sawtooth Software Conference Proceedings.

Kehl, Leonhard; Foscht, Thomas; Schloffer, Judith (2010), "Conjoint Design and the Impact on Price Elasticity and Validity," Sawtooth Europe Conference, Cologne.

Kurz, Peter; Binner, Stefan (2012), "'The Individual Choice Task Threshold': Need for Variable Number of Choice Tasks," 2012 Sawtooth Software Conference Proceedings.

Kurz, Peter (2008), "A Comparison between Discrete Choice Models Based on Virtual Shelves and Flat Shelf Layouts," SKIM Working Towards Symbioses Conference, Barcelona.

Orme, Bryan (2003), "Special Features of CBC Software for Packaged Goods and Beverage Research," Sawtooth Software, via website.

Pandey, Rohit; Wagner, John; Knappenberger, Robyn (2012), "Building Expandable Consumption into a Share-Only MNL Model," 2012 Sawtooth Software Conference Proceedings.
ATTRIBUTE NON-ATTENDANCE IN DISCRETE CHOICE EXPERIMENTS DAN YARDLEY MARITZ RESEARCH
EXECUTIVE SUMMARY Some respondents ignore certain attributes in choice experiments to help them choose between competing alternatives. By asking the respondents which attributes they ignored and accounting for this attribute non-attendance we hope to improve our preference models. We test different ways of asking stated non-attendance and the impact of non-attendance on partial profile designs. We also explore an empirical method of identifying and accounting for attribute non-attendance. We found that accounting for stated attribute non-attendance does not improve our models. Identifying and accounting for non-attendance empirically can result in better predictions on holdouts.
BACKGROUND

Recent research literature has included discussions of "attribute non-attendance." In the case of stated non-attendance (SNA) we ask respondents, after they answer their choice questions, which attributes, if any, they ignored when they made their choices. This additional information can then be used to zero out the effect of the ignored attributes. Taking SNA into account theoretically improves model fit. We seek to summarize this literature, to replicate findings using data from two recent choice experiments, and to test whether taking SNA into account improves predictions of in-sample and out-of-sample holdout choices. We also explore different methods of incorporating SNA into our preference model and different ways of asking SNA questions. In addition to asking different stated non-attendance questions, we will also compare SNA to stated attribute level ratings (two different scales tested) and self-explicated importance allocations. Though it's controversial, some researchers have used latent class analysis to identify non-attendance analytically (Hensher and Greene, 2010). We will attempt to identify non-attendance by using HB analysis and other methods. We will determine if accounting for "derived non-attendance" improves aggregate model fit and holdout predictions. We will compare "derived non-attendance" to the different methods of asking SNA and stated importance. It's likely that non-attendance is more of an issue for full profile than for partial profile experiments, another hypothesis our research design will allow us to test. By varying the attributes shown, we would expect respondents to pay closer attention to which attributes are presented, and thus ignore fewer attributes.
STUDY 1 We first look at the data from a tablet computer study conducted in May 2012. Respondents were tablet computer owners or intenders. 502 respondents saw 18 full profile choice tasks with 3 alternatives and 8 attributes. The attributes and levels are as follows:
Attribute | Level 1 | Level 2 | Level 3
Operating System | Apple | Android | Windows
Memory | 8 GB | 16 GB | 64 GB
Included Cloud Storage | None | 5 GB | 50 GB
Price | $199 | $499 | $799
Camera Picture Quality | 0.3 Megapixels | 2 Megapixels | 5 Megapixels
Warranty | 3 Months | 1 Year | 3 Years
Screen Size | 5" | 7" | 10"
Screen Resolution | High definition display (200 pixels per inch) | Extra-high definition display (300 pixels per inch) | (no third level)
Another group of 501 respondents saw 16 partial profile tasks based upon the same attributes and levels as above. They saw 4 attributes at a time. 300 more respondents completed a differently formatted set of partial profile tasks and were designated for use as an out-of-sample holdout. All respondents saw the same set of 6 full profile holdout tasks and a stated non-attendance question. The stated non-attendance question and responses were: Please indicate which of the attributes, if any, you ignored when you made your choices in the preceding questions:
Attribute | % Ignored, Full Profile | % Ignored, Partial Profile | Rank, Full Profile | Rank, Partial Profile
Memory | 17.5% | 14.2% | 8 | 7
Operating System | 20.9% | 18.6% | 7 | 4
Screen Resolution | 22.7% | 14.0% | 6 | 8
Price | 26.9% | 16.4% | 5 | 6
Screen Size | 27.3% | 17.2% | 4 | 5
Warranty | 27.9% | 24.2% | 3 | 2
Camera Picture Quality | 30.1% | 21.8% | 2 | 3
Included Cloud Storage | 36.7% | 28.9% | 1 | 1
I did not ignore any of these | 26.1% | 36.3% | |
Average # Ignored | 2.10 | 1.55 | |
We can see that the respondents who saw partial profile choice tasks answered the stated non-attendance question differently from those who saw full profile. The attribute rankings are different for partial and full profile, and 6 of the 9 attribute frequencies are significantly different. Significantly more partial profile respondents stated that they did not ignore any of the attributes. From these differences we conclude that partial profile respondents pay closer attention to which attributes are showing. This is due to the fact that the attributes shown vary from task to task. We now compare aggregate (pooled) multinomial logistic models created using data augmented with the stated non-attendance (SNA) question and models with no SNA. The way we account for the SNA question is: if the respondent said they ignored an attribute, we zero out that attribute in the design, effectively removing the attribute for that respondent.
Accounting for SNA in this manner yields mixed results.

Holdout Task Hit Rates | Full Profile | Partial
No SNA | 56.7% | 53.1%
With SNA | 56.9% | 51.6%

Likelihood Ratio | Full Profile | Partial
No SNA | 1768 | 2259
With SNA | 2080 | 2197

Out-of-Sample Hit Rates | Full Profile | Partial
No SNA | 47.7% | 55.4%
With SNA | 52.0% | 55.4%
For the full profile respondents, we see slight improvement in Holdout Task hit rates from 56.7% to 56.9% (applying the aggregate logit parameters to predict individual holdout choices). Out-of-sample hit rates (applying the aggregate logit parameters from the training set to each out-of-sample respondent's choices) and Likelihood Ratio (2 times the difference in the model LL compared to the null LL) also show improvement. For partial profile respondents, accounting for the SNA did not lead to better results. It should be noted that partial profile performed better on out-of-sample hit rates. This, in part, is due to the fact that the out-of-sample tasks are partial profile. Similarly, the full profile respondents performed better on holdout task hit rates due to the holdout tasks being full profile tasks. Looking at the resulting parameter estimates we see little to no differences between models that account for SNA (dashed lines in the figure) and those that don't. We do, however, see slight differences between partial profile and full profile.

[Figure: aggregate logit parameter estimates by attribute level (operating system, memory, cloud storage, price, screen resolution, camera megapixels, warranty, screen size) for the FP, FP SNA, PP, and PP SNA models]
Shifting from aggregate models to Hierarchical Bayes, again we see that accounting for SNA does not lead to improved models. In addition to accounting for SNA by zeroing out the ignored attributes in the design (Pre), we also look at zeroing out the resulting HB parameter estimates
(Post). Zeroing out HB parameters (Post) has a negative impact on holdout tasks. Not accounting for SNA resulted in better holdout task hit rates and root likelihood (RLH).

Holdout Task Hit Rates | Full Profile | Partial
No SNA | 72.8% | 63.2%
Pre SNA | 71.1% | 61.1%
Post SNA | 66.6% | 59.4%

RLH | Full Profile | Partial
No SNA | 0.701 | 0.622
With SNA | 0.633 | 0.565
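For concreteness, the "Pre" and "Post" treatments of SNA can be sketched as follows: "Pre" zeroes the ignored attribute's columns in the coded design before estimation, while "Post" zeroes the corresponding HB part-worths afterwards. The attribute-to-column map and the flags below are hypothetical; this illustrates the idea, not the authors' code.

```python
import numpy as np

# Hypothetical coded design for one respondent: (n_rows, n_params), plus a map
# from each attribute to its parameter columns, and that respondent's SNA flags.
attr_cols = {"OS": [0, 1], "Memory": [2, 3], "Price": [4, 5]}   # illustrative
ignored = {"OS": False, "Memory": True, "Price": False}          # from the SNA question

def zero_pre(design_rows: np.ndarray) -> np.ndarray:
    """'Pre': remove ignored attributes from the coded design before estimation."""
    out = design_rows.copy()
    for attr, cols in attr_cols.items():
        if ignored[attr]:
            out[:, cols] = 0.0
    return out

def zero_post(utilities: np.ndarray) -> np.ndarray:
    """'Post': zero the estimated HB part-worths of ignored attributes."""
    out = utilities.copy()
    for attr, cols in attr_cols.items():
        if ignored[attr]:
            out[cols] = 0.0
    return out

# Example with placeholder arrays.
print(zero_pre(np.ones((3, 6))))
print(zero_post(np.arange(6, dtype=float)))
```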
Lots of information is obtained by asking respondents multiple choice tasks. With all this information at our fingertips, is it really necessary to ask additional questions about attribute attendance? Perhaps we can derive attribute attendance empirically, and improve our models as well. One simple method of calculating attribute attendance that we tested is to compare each attribute's utility range from the Hierarchical Bayes models to the attribute with the highest utility range. Below are the utility ranges for the first three full profile respondents from our data.

Utility Ranges | Operating System | Memory | Cloud Storage | Price | Screen Resolution | Camera Megapixels | Warranty | Screen Size
ID 6 | 2.73 | 1.89 | 1.52 | 1.82 | 1.01 | 2.65 | 0.84 | 1.13
ID 8 | 0.14 | 0.64 | 0.04 | 9.87 | 0.02 | 0.68 | 0.38 | 0.51
ID 12 | 3.07 | 0.64 | 0.16 | 1.36 | 0.12 | 1.60 | 0.91 | 0.94
With a utility range for Price of 9.87 and all other attribute ranges of less than 1, we can safely say that the respondent with ID 8 did not ignore Price. The question becomes, at what point are attributes being ignored? We analyze the data at various cut points. For each cut point we find the attribute with the largest range for each individual and then assume everything below the cut point is ignored by the individual. For example, if the utility range of Price is the largest at 9.87, a 10% cut point drops all attributes with a utility range of 0.98 and smaller. Below we have identified for our three respondents the attributes they empirically ignored at the 10% cut point (marked with *; shaded in the original). At this cut point we would say that respondent ID 6 ignored none of the attributes, ID 8 ignored 7, and ID 12 ignored 2 of the 8 attributes.

Utility Ranges at the 10% Cut Point (* = empirically ignored) | Operating System | Memory | Cloud Storage | Price | Screen Resolution | Camera Megapixels | Warranty | Screen Size
ID 6 | 2.73 | 1.89 | 1.52 | 1.82 | 1.01 | 2.65 | 0.84 | 1.13
ID 8 | 0.14* | 0.64* | 0.04* | 9.87 | 0.02* | 0.68* | 0.38* | 0.51*
ID 12 | 3.07 | 0.64 | 0.16* | 1.36 | 0.12* | 1.60 | 0.91 | 0.94
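The cut-point rule illustrated above can be written compactly. The sketch below applies it to the utility ranges of respondent ID 12 from the table; the function itself is a minimal illustration, not production code.

```python
import numpy as np

def empirically_ignored(utility_ranges: np.ndarray, cut: float = 0.10) -> np.ndarray:
    """Flag attributes whose utility range falls below `cut` times the
    respondent's largest attribute range (the 10% cut point by default)."""
    return utility_ranges < cut * utility_ranges.max()

# Respondent ID 12 from the table above (ranges in the same attribute order).
ranges_id12 = np.array([3.07, 0.64, 0.16, 1.36, 0.12, 1.60, 0.91, 0.94])
print(empirically_ignored(ranges_id12))            # Cloud Storage and Screen Resolution flagged
print(empirically_ignored(ranges_id12, cut=0.45))  # the more aggressive 45% cut point
```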
We now analyze different cut points to see if we can improve the fit and predictive ability of our models, and to find an optimal cut point.
[Figure: "Empirical Attribute Non-Attendance" — average hit rate (internal and holdout, full profile and partial profile) by cut point below the maximum utility range, from none to 50%]
We see that as the cut point increases so does the hit rate on the holdout tasks. An optimal cut point is not obvious. We will use a 10% cut point as a conservative look and 45% as an aggressive cut point. Looking further at the empirical 10% and 45% cut points in the table below, we see that for full profile we get an improved mean absolute error (MAE). For partial profile the MAE stays about the same. Empirically accounting for attribute non-attendance improved our models while accounting for stated non-attendance did not.

MAE | No SNA | SNA | Emp 10 | Emp 45
Full Profile | 0.570 | 0.686 | 0.568 | 0.534
Partial | 1.082 | 1.109 | 1.082 | 1.089
From this first study we see that respondents pay more attention to partial profile type designs, and we conclude that accounting for stated non-attendance does not improve our models. We wanted to explore these findings further and test our empirical methods, so we conducted a second study.
STUDY 2

The second study we conducted was a web-based survey fielded in March 2013. The topic of this study was "significant others," and the sample included respondents who were interested in finding a spouse, partner, or significant other. We asked 2,000 respondents 12 choice tasks with 8 attributes and 5 alternatives. We also asked all respondents the same 3 holdout tasks and a 100-point attribute importance allocation question. Respondents were randomly assigned to 1 of 4 cells of a 2x2 design. The 2x2 design consisted of 2 different stated non-attendance questions and 2 desirability scale rating questions. The following table shows the attributes and levels considered for this study.
Attribute | Levels
Attractiveness | Not Very Attractive; Somewhat Attractive; Very Attractive
Romantic/Passionate | Not Very Romantic/Passionate; Somewhat Romantic/Passionate; Very Romantic/Passionate
Honesty/Loyalty | Can't Trust; Mostly Trust; Completely Trust
Funny | Not Funny; Sometimes Funny; Very Funny
Intelligence | Not Very Smart; Pretty Smart; Brilliant
Political Views | Strong Republican; Swing Voter; Strong Democrat
Religious Views | Christian; Religious - Not Christian; No Religion/Secular
Annual Income | $15,000; $40,000; $65,000; $100,000; $200,000
We wanted to test different ways of asking attribute non-attendance. Like our first study, we asked half the respondents which attributes they ignored during the choice tasks: “Which of these attributes, if any, did you ignore in making your choice about a significant other?” The other half of respondents were asked which attributes they used: “Which of these attributes did you use in making your choice about a significant other?” For each attribute in the choice study, we asked each respondent to rate all levels of the attribute on desirability. Half of the respondents were asked a 1–5 Scale the other half a 0–10 Scale. Below are examples of these rating questions for the Attractiveness attribute. 1–5 Scale Still thinking about a possible significant other, how desirable is each of the following levels of their Attractiveness?
(Grid rows: Not Very Attractive, Somewhat Attractive, Very Attractive. Scale points: Completely Unacceptable (1), Not Very Desirable (2), Desirable (3), Highly Desirable (4), Extremely Desirable (5), plus No Opinion/Not Relevant.)
0–10 Scale Still thinking about a possible significant other, how desirable is each of the following levels of their Attractiveness?
(Grid rows: Not Very Attractive, Somewhat Attractive, Very Attractive. Scale points: Extremely Undesirable (0) through Extremely Desirable (10), plus No Opinion/Not Relevant.)
As previously mentioned, all respondents were asked an attribute importance allocation question. This question was asked of all respondents so that we could use it as a base measure for comparisons. The question asked: "For each of the attributes shown, how important is it for your significant other to have the more desirable versus less desirable level? Please allocate 100 points among the attributes, with the most important attributes having the most points. If an attribute has no importance to you at all, it should be assigned an importance of 0." We now compare the results for the different non-attendance methods. All methods easily identify Honesty/Loyalty as the most influential attribute. 86.8% of respondents asked if they used Honesty/Loyalty in making their choice marked they did. Annual Income is the least influential attribute for the stated questions, and second-to-least for the allocation question. Since
Choice, 1–5 Scale and 0–10 Scale are derived, they don't show this same social bias, and Annual Income ranks differently. Choice looks at the average range of the attribute's HB coefficients. Similarly, the 1–5 and 0–10 Scales look at the average range of the rated levels. For the 0–10 Scale, the average range between "Can't Trust" and "Completely Trust" is 8.7.

Attribute | Stated Used | Stated Ignored | 1-5 Scale | 0-10 Scale | Allocation | Choice
Honesty/Loyalty | 86.8% | 8.1% | 3.44 | 8.70 | 31.3% | 5.93
Funny | 53.7% | 17.4% | 2.11 | 5.42 | 10.0% | 1.20
Intelligence | 52.1% | 17.0% | 1.81 | 4.75 | 10.6% | 1.24
Romantic/Passionate | 48.7% | 15.4% | 2.06 | 5.12 | 12.5% | 0.96
Attractiveness | 44.7% | 25.6% | 1.81 | 4.18 | 13.8% | 1.44
Religious Views | 39.3% | 26.7% | 0.88 | 2.12 | 9.7% | 0.82
Political Views | 28.0% | 35.9% | 0.33 | 1.11 | 5.3% | 0.50
Annual Income | 20.1% | 44.2% | 1.88 | 4.22 | 6.8% | 1.63

Attribute Rankings
Attribute | Stated Used | Stated Ignored | 1-5 Scale | 0-10 Scale | Allocation | Choice
Honesty/Loyalty | 1 | 1 | 1 | 1 | 1 | 1
Funny | 2 | 4 | 2 | 2 | 5 | 5
Intelligence | 3 | 3 | 5 | 4 | 4 | 4
Romantic/Passionate | 4 | 2 | 3 | 3 | 3 | 6
Attractiveness | 5 | 5 | 6 | 6 | 2 | 3
Religious Views | 6 | 6 | 7 | 7 | 6 | 7
Political Views | 7 | 7 | 8 | 8 | 8 | 8
Annual Income | 8 | 8 | 4 | 5 | 7 | 2
In addition to social bias, another deficiency with the Stated Used and Stated Ignored questions is that some respondents don't check all the boxes they should, thus understating usage of attributes for Stated Used, and overstating usage for Stated Ignored. The respondents who were asked which attributes they used selected on average 3.72 attributes. The average number of attributes that respondents said they ignored is 1.92, implying they used 6.08 of the 8 attributes. If respondents carefully considered these questions, and truthfully marked all that applied, the difference between Stated Used and Stated Ignored would be smaller. For each type of stated non-attendance question, we analyzed each attribute for all respondents and determined if the attribute was ignored or not. The following table shows the differences across the methods of identifying attribute non-attendance. For the allocation question, if the respondent allocated 0 of the 100 points, the attribute was considered ignored. For the 1–5 Scale and 0–10 Scale, if the range of the levels of the attribute was 0, the attribute was considered ignored. The table below shows percent discordance between methods. For example, the Allocate method identified 35.1% incorrectly from the Stated Used method. Overall, we see
the methods differ substantially from one another. The most diverse methods are the Empirical 10% and the Stated Used, with 47.6% discordance.

% Discordance | Stated Used | Stated Ignore | 1-5 Scale | 0-10 Scale | Emp 10%
Allocate | 35.1% | 24.9% | 22.3% | 21.3% | 28.8%
Stated Used | | | 39.3% | 41.6% | 47.6%
Stated Ignore | | | 23.2% | 22.2% | 27.9%
1-5 Scale | | | | | 23.6%
0-10 Scale | | | | | 22.4%
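For reference, the discordance statistic can be computed as the share of (respondent, attribute) cells on which two methods' ignore flags disagree; the boolean arrays below are hypothetical placeholders, not the study's classifications.

```python
import numpy as np

def pct_discordance(flags_a: np.ndarray, flags_b: np.ndarray) -> float:
    """flags_*: boolean arrays of shape (n_respondents, n_attributes), True where
    the method classifies the attribute as ignored. Returns the share of cells
    on which the two methods disagree."""
    return float((flags_a != flags_b).mean())

# Hypothetical example with 4 respondents and 8 attributes.
rng = np.random.default_rng(3)
allocate_flags = rng.random((4, 8)) < 0.2     # e.g., 0 of 100 points allocated
empirical_flags = rng.random((4, 8)) < 0.3    # e.g., below the 10% utility-range cut
print(f"{pct_discordance(allocate_flags, empirical_flags):.1%}")
```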
Accounting for these diverse methods of attribute non-attendance in our models, we can see the impact on holdout hit rates is small, and only the 1–5 Scale shows improvement over the model without stated non-attendance.

Hit Rates | No SNA | Allocate | Stated Used | Stated Ignore | 1-5 Scale | 0-10 Scale
Holdout 1 | 55.6% | 55.2% | 51.4% | 56.4% | 58.8% | 52.8%
Holdout 2 | 57.1% | 58.1% | 54.1% | 54.9% | 57.0% | 56.5%
Holdout 3 | 61.8% | 61.3% | 57.6% | 62.1% | 61.5% | 61.2%
Average | 58.2% | 58.2% | 54.3% | 57.8% | 59.1% | 56.8%
When we account for attribute non-attendance empirically, using the method previously described, we see improvement in the holdouts. Again we see, in the table below, that as we increase the cut point, zeroing out more attributes, the holdout hit rates increase. Instead of accounting for attribute non-attendance by asking additional questions we can efficiently do so empirically using the choice data.

Hit Rates by Cut Point | None | 10% | 20% | 30% | 40% | 50%
Holdout | 58.2% | 58.7% | 59.0% | 62.4% | 64.8% | 67.0%
Internal | 91.0% | 90.3% | 86.5% | 81.6% | 76.8% | 73.5%
CONCLUSIONS

When asked "Please indicate which of the attributes, if any, you ignored when you made your choices in the preceding questions," respondents who had answered partial profile discrete choice questions indicated they ignored fewer attributes than those who had answered full profile choice questions. Partial profile type designs elicit more attentive responses than full profile designs. This, we believe, is because the attributes shown change from task to task, demanding more of the respondent's attention. Aggregate and Hierarchical Bayes models typically do not perform better when we account for stated attribute non-attendance. Accounting for questions directly asking which attributes were ignored performs better than asking which attributes were used. Derived methods eliminate the social bias of direct questions and, when accounted for in the models, tend to perform better than the direct questions. Respondents use a different thought process when answering stated
attribute non-attendance questions than choice tasks. Combining different question types pollutes the interpretation of the models and is discouraged. A simple empirical way to account for attribute non-attendance is to look at the HB utility ranges and zero out attributes with relatively small ranges. Models where we identify non-attendance empirically perform better on holdouts and benefit from not needing additional questions.
Dan Yardley
PREVIOUS LITERATURE

Alemu, M.H., M.R. Morkbak, S.B. Olsen, and C.L. Jensen (2011), "Attending to the reasons for attribute non-attendance in choice experiments," FOI working paper, University of Copenhagen.

Balcombe, K., M. Burton and D. Rigby (2011), "Skew and attribute non-attendance within the Bayesian mixed logit model," paper presented at the International Choice Modeling Conference.

Cameron, T.A. and J.R. DeShazo (2010), "Differential attention to attributes in utility-theoretic choice models," Journal of Choice Modeling, 3: 73–115.

Campbell, D., D.A. Hensher and R. Scarpa (2011), "Non-attendance to attributes in environmental choice analysis: a latent class specification," Journal of Environmental Planning and Management, 54, 1061–1076.

Hensher, D.A. and A.T. Collins (2011), "Interrogation of responses to stated choice experiments: is there sense in what respondents tell us?" Journal of Choice Modeling, 4: 62–89.

Hensher, D.A. and W.H. Greene (2010), "Non-attendance and dual processing of common-metric attributes in choice analysis: a latent class specification," Empirical Economics, 39, 413–426.

Hensher, D.A. and J. Rose (2009), "Simplifying choice through attribute preservation or non-attendance: implications for willingness to pay," Transportation Research E, 45, 583–590.

Hensher, D.A., J. Rose and W.H. Greene (2005), "The implications on willingness to pay of respondents ignoring specific attributes," Transportation, 32, 203–222.

Scarpa, R., T.J. Gilbride, D. Campbell and D.A. Hensher (2009), "Modelling attribute non-attendance in choice experiments for rural landscape valuation," European Review of Agricultural Economics, 36, 151–174.

Scarpa, R., R. Raffaelli, S. Notaro and J. Louviere (2011), "Modelling the effects of stated attribute non-attendance on its inference: an application to visitors benefits from the alpine grazing commons," paper presented at the International Choice Modeling Conference.
ANCHORED ADAPTIVE MAXDIFF: APPLICATION IN CONTINUOUS CONCEPT TEST ROSANNA MAU JANE TANG LEANN HELMRICH MAGGIE COURNOYER VISION CRITICAL
SUMMARY Innovative firms with a large number of potential new products often set up continuous programs to test these concepts in waves as they are developed. The test program usually assesses these concepts using monadic or sequential monadic ratings. It is important that the results be comparable not just within each wave but across waves as well. The results of all the testing are used to build a normative database and select the best ideas for implementation. MaxDiff is superior to ratings, but is not well suited for tracking across the multiple waves of a continuous testing program. This can be addressed by using an Anchored Adaptive MaxDiff approach. The use of anchoring transforms relative preferences into an absolute scale, which is comparable across waves. Our results show that while there are strong consistencies between the two methodologies, the concepts are more clearly differentiated through their anchored MaxDiff scores. Concepts that were later proven to be successes also seemed to be more clearly identified using the Anchored approach. It is time to bring MaxDiff into the area of continuous concept testing.
1. INTRODUCTION Concept testing is one of the most commonly used tools for new product development. Firms with a large number of ideas to test usually have one or more continuous concept test programs. Rather than testing a large number of concepts in one go, a small group of concepts are developed and tested at regular intervals. Traditional monadic or sequential monadic concept testing methodology—based on rating scales—is well suited for this type of program. Respondents rate each concept one at a time on an absolute scale, and the results can be compared within each wave and across waves as well. Over time, the testing program builds up a normative database that is used to identify the best candidates for the next stage of development. To ensure that results are truly comparable across waves, the approach used in all of the waves must be consistent. Some of the most important components that must be monitored include: Study design—A sequential monadic set up is often used for this type of program. Each respondent should be exposed to a fixed number of concepts in each wave. The number of respondents seeing each concept should also be approximately the same. Sample design and qualification—The sample specification and qualifying criteria should be consistent between waves. The source of sample should also remain stable. River samples and router samples suffer from lack of control over sample composition and therefore are not suitable
for this purpose. Network samples, where the sample is made up of several different panel suppliers, need to be controlled so that similar proportions come from each panel supplier each time. Number and format of the concept tested—The number of concepts tested should be similar between waves. If a really large number of items need to be tested, new waves should be added. The concepts should be at about the same stage of concept development. The format of the concepts, for example, image with a text description, should also remain consistent. Questionnaire design and reporting—In a sequential monadic set-up, respondents are randomly assigned to evaluate a fixed number of concepts and they see one concept at a time. The order in which the concepts are presented is randomized. Respondents are asked to evaluate the concept using a scale to rate their “interest” or “likelihood of purchase.” In the reporting stage, the key reporting statistic used to assess the preference for concepts is determined. This could be, for example, the top 2 box ratings on a 5-point Likert scale of purchase interest. This reporting statistic, once determined, should remain consistent across waves. In this type of sequential monadic approach, each concept is tested independently. Over time, the key reporting statistics are compiled for all of the concepts tested. This allows for the establishment of norms and action standards—the determination of “how good is good?” Such standards are essential for identifying the best ideas for promotion to the next stage of product development. However, it is at this stage that difficulties often appear. A commonly encountered problem is rating scale statistics offer only a very small differentiation between concepts. If the difference between the second quartile and the top decile is 6%—while the margin of error is around 8%—how can we really tell if a concept is merely good, or if it is excellent? Traditional rating scales do not always allow us to clearly identify which concepts are truly preferred by potential customers. Another method is needed to establish respondent preference. MaxDiff seems to be an obvious choice.
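As a small illustration of the reporting statistic and the margin-of-error comparison just described, here is a minimal sketch with hypothetical ratings data (the numbers and function names are ours, not part of any actual testing program):

```python
# Minimal sketch: top-2-box (T2B) share and its margin of error (hypothetical data).
import math

ratings = [5, 4, 2, 3, 5, 1, 4, 4, 2, 5]     # hypothetical 5-point purchase-intent ratings

def top2box(scores):
    """Share of ratings that are 4 or 5 on a 5-point scale."""
    return sum(s >= 4 for s in scores) / len(scores)

p = top2box(ratings)
n = 150                                      # typical exposures per concept in this program
moe = 1.96 * math.sqrt(p * (1 - p) / n)      # approximate 95% margin of error for a proportion
print(f"T2B = {p:.2f}, margin of error ~ +/- {moe:.1%}")
```

With roughly 150 exposures per concept, the margin of error of a T2B share near 50% is around 8 percentage points, which is exactly why small differences between concepts are hard to separate.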
2. MAXDIFF
MaxDiff methodology has shown itself to be superior to ratings (Cohen 2003; Chrzan & Golovashkina 2006). Instead of using a scale, respondents are shown a subset of concepts and asked to choose the one they like best and the one they like least. The task is then repeated several times according to a statistical design. There is no scale-use bias. The tradeoffs respondents make during the MaxDiff exercise not only reveal their preference heterogeneity, but also provide much better differentiation between the concepts.
However, MaxDiff results typically reveal only relative preferences—how well a concept performs relative to other concepts presented at the same time. The best-rated concept from one wave of testing might not, in fact, be any better than a concept that was rated as mediocre in a different wave of testing. How can we assure the marketing and product development teams that the best concept is truly a great concept rather than simply "the tallest pygmy"?
Anchoring
Converting the relative preferences from MaxDiff studies into absolute preference requires some form of anchoring. Several attempts have been made at achieving this. Orme (2009) tested
and reported the results from dual-response anchoring proposed by Jordan Louviere. After each MaxDiff question, an additional question was posed to the respondent:
Considering just these four features . . .
• All four are important
• None of these four are important
• Some are important, some are not
Because this additional question was posed after each MaxDiff task, it added significantly more time to the MaxDiff questionnaire.
Lattery (2010) introduced the direct binary response (DBR) method. After all the MaxDiff tasks have been completed, respondents were presented with all the items in one list and asked to check all the items that appealed to them. This proved to be an effective way of revealing absolute preference—it was simpler and only required one additional question. Horne et al. (2012) confirmed these advantages of the direct binary method, but also demonstrated that it was subject to context effects; specifically, the number of items effect. If the number of items is different in two tests, for example, one has 20 items and another has 30 items, can we still compare the results?
Making It Adaptive
Putting aside the problem of anchoring for now, consider the MaxDiff experiment itself. MaxDiff exercises are repetitive. All the tasks follow the same layout, for example, picking the best and the worst amongst 4 options. The same task is repeated over and over again. The exercise can take a long time when a large number of items need to be tested. For example, if there are 20 items to test, it can take 12 to 15 tasks per respondent to collect enough information for modeling. As each item gets the same number of exposures, the “bad” items get just as much attention as the “good” items. Respondents are presented with subsets of items that seem random to them. They cannot see where the study is going (especially if they see something they have already rejected popping up again and again) and can become disengaged. The adaptive approach, first proposed by Bryan Orme in 2006, is one way to tackle this issue. Here is an example of a 20-item Adaptive MaxDiff exercise:
The process works much like an athletic tournament. In stage 1, we start with 4 MaxDiff tasks with 5 items per task. The losers in stage 1 are discarded, leaving 16 items in stage 2. The losers are dropped out after each stage. By stage 4, 8 items are left, which are evaluated in 4 pairs. In the final stage, we ask the respondents to rank the 4 surviving items. Respondents can see where the exercise is going. The tasks are different, and so less tedious; and respondents find the experience more enjoyable overall (Orme 2006).
At the end of the Adaptive MaxDiff exercise, the analyst has individual-level responses showing which items are, and which are not, preferred. Because the better items are retained into the later stages and therefore have more exposures, the better items are measured more precisely than the less preferred items. This makes sense—often we want to focus on the winners. Finally, the results from Adaptive MaxDiff are consistent with the results of traditional MaxDiff (Orme 2006).
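To make the stage structure concrete, here is a minimal simulation sketch of the 20-item tournament described above. It is our own illustration with hypothetical preferences, not the authors' implementation; in a real survey the best/worst picks come from the respondent, not from a scoring function.

```python
# Minimal sketch of a 20-item Adaptive MaxDiff tournament (hypothetical preferences).
import random

def run_adaptive_maxdiff(items, preference):
    """Drop the least-preferred item from each task, stage by stage, then rank the final 4."""
    discards = {}                        # stage -> items eliminated at that stage
    survivors = list(items)
    stage_layout = [5, 4, 3, 2]          # items per task in stages 1-4 (4 tasks per stage)
    for stage, per_task in enumerate(stage_layout, start=1):
        random.shuffle(survivors)
        tasks = [survivors[i:i + per_task] for i in range(0, len(survivors), per_task)]
        dropped = [min(task, key=preference) for task in tasks]   # the "worst" in each task
        discards[stage] = dropped
        survivors = [item for item in survivors if item not in dropped]
    ranking = sorted(survivors, key=preference, reverse=True)     # stage 5: rank the final 4
    return ranking, discards

items = [f"concept_{i:02d}" for i in range(1, 21)]
true_value = {item: random.random() for item in items}            # hypothetical preferences
ranking, discards = run_adaptive_maxdiff(items, true_value.get)
print("Final ranking:", ranking)
```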
Anchored Adaptive MaxDiff
Anchored Adaptive MaxDiff combines the adaptive MaxDiff process with the direct binary anchoring approach. However, since the direct binary approach is sensitive to the number of items included in the anchoring question, we want to fix that number. For example, regardless of the number of concepts tested, the anchoring question may always display 6 items. Since this number is fixed, all waves must have at least that number of concepts for testing; most waves will probably have more. Which leads to this question: which items should be included in the anchoring question? To provide an anchor that is consistent between waves, we want to include items that span the entire spectrum from most preferred to least preferred. The Adaptive MaxDiff process gives us that. In the 20-item example shown previously, these items could be used in the anchoring question (a selection sketch follows this list):
• The item ranked best in stage 5
• The item ranked worst in stage 5
• One randomly selected discard from stage 4
• One randomly selected discard from stage 3
• One randomly selected discard from stage 2
• One randomly selected discard from stage 1
• None of the above
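Continuing the sketch above, the fixed six-item anchoring question can be assembled from the tournament output (again, an illustration only):

```python
# Build the six anchoring items from the tournament output, plus a "none" option.
anchor_items = [ranking[0], ranking[-1]] + [random.choice(discards[s]) for s in (4, 3, 2, 1)]
anchor_options = anchor_items + ["None of the above"]
print("Anchoring question items:", anchor_options)
```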
The Anchored Adaptive MaxDiff approach has the following benefits:
• A more enjoyable respondent experience.
• More precise estimates for the more preferred items—which are the most important estimates.
• No "number of items" effect through the "controlled" anchoring question.
• Anchored (absolute) preference.
The question remains whether the results from multiple Anchored Adaptive MaxDiff experiments are truly comparable to each other.
3. OUR EXPERIMENT
While the actual concepts included in continuous testing programs vary, the format and the number of concepts tested are relatively stable. This allows us to set up the Adaptive MaxDiff exercise and the corresponding binary response anchoring question and use a nearly identical structure for each wave of testing.
Ideally, the anchored MaxDiff methodology would be tested by comparing its results to an independent dataset of results obtained from scale ratings. That becomes very costly, especially given the multiple-wave nature of the process we are interested in. To save money, we collected both sets of data at the same time. An Anchored Adaptive MaxDiff exercise was piggybacked onto the usual sequential monadic concept testing, which contained the key purchase intent question. We tested a total of 126 concepts in 5 waves, approximately once every 6 months. The number of concepts tested in each wave ranged from 20 to 30. Respondents completed the Anchored Adaptive MaxDiff exercise first. They were then randomly shown 3 concepts for rating in a sequential monadic fashion. Each concept was rated on purchase intent, uniqueness, etc. Respondents were also asked to provide qualitative feedback on each concept. The overall sample size was set to ensure that each concept received 150 exposures.

Wave    Number of concepts   Field Dates    Sample Size   Notes
1       25                   Spring 2011    1,200
2       20                   Fall 2011      1,000
3       21                   Spring 2012    1,000
4       30                   Fall 2012      1,500         Including 5 known "star" concepts
5       30                   Spring 2013    1,500         With 2 "star" concepts repeated
Total   126                                 6,200
In wave 4, we included five known “star” concepts. These were concepts based on existing products with good sales history—they should test well. Two of the star concepts were then repeated in wave 5. Again, they should test well. More importantly, they should receive consistent results in both waves. The flow of the survey is shown in the following diagram:
The concept tests were always done sequentially, with MaxDiff first and the sequential monadic concept testing questions afterwards. Thus, the concept test results were never "pure and uncontaminated." It is possible the MaxDiff exercise might have influenced the concept test results. However, as MaxDiff exposed all the concepts to all the respondents, we introduced no systematic biases for any given concept tested. The results of this experiment are shown here:
The numbers on each line identify the concepts tested in that wave. Number 20 in wave 1 is different from number 20 in wave 2—they are different concepts with the same id number. The T2B lines are the top-2-box Purchase Intent, i.e., the proportion of respondents who rated the concept as "Extremely likely" or "Somewhat likely" to purchase, and are based on approximately n=150 per concept for the rating scale evaluation. The AA-MD lines are the Anchored Adaptive MaxDiff (AA-MD) results. We used Sawtooth Software CBC/HB in the MaxDiff estimation and the pairwise coding for the anchoring question as outlined in Horne et al. (2012). We plot the simulated probability of purchase, i.e., the exponentiated beta for a concept divided by the sum of the exponentiated beta of that concept and the exponentiated beta of the anchor threshold. The numbers are the average across respondents. This can be interpreted as the average likelihood of purchase for each concept.
While the Anchored Adaptive MaxDiff results are generally slightly lower than the T2B ratings, there are many similarities. The AA-MD results have better differentiation than the purchase intent ratings. Visually, there is less bunching together, especially among the better concepts tested. Note that while we used the T2B score here, we also did a comparison using the average weighted purchase intent. With a 5-point purchase intent scale, we used weighting factors of 0.7/0.3/0.1/0/0. The results were virtually identical. The correlation between the T2B score and weighted purchase intent was 0.98; the weighted purchase intent numbers were generally lower.
Below is a scatter plot of AA-MD scores versus T2B ratings. There is excellent consistency between the two sets of numbers, suggesting that they are essentially measuring the same construct. Since T2B ratings can be used in aggregating results across waves, MaxDiff scores can be used for that as well.
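In symbols (our notation, following the verbal description above), respondent i's anchored score for concept c and the reported AA-MD score are

\[ P_i(c) \;=\; \frac{\exp(\beta_{ic})}{\exp(\beta_{ic}) + \exp(\beta_{i,\mathrm{anchor}})}, \qquad \text{AA-MD}(c) \;=\; \frac{1}{N}\sum_{i=1}^{N} P_i(c), \]

where \(\beta_{ic}\) is respondent i's estimated utility for concept c and \(\beta_{i,\mathrm{anchor}}\) is that respondent's anchor-threshold utility.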
We should note the overlap of the concepts (in terms of preferences by either measure) among the 5 waves. All the waves have some good/preferred concepts and some bad/less preferred concepts. No wave stands out as being particularly good or bad. This is not surprising in a continuous concept testing environment, as the ideas are simply tested as they come along without any prior manipulation. We can say that there is naturally occurring randomization in the process that divides the concepts into waves. We would not expect to see one wave with only good concepts, and another with just bad ones. This may be one of the reasons why AA-MD works here. If the waves contain similarly worthy concepts, the importance of anchoring is diminished. To examine the role of the anchor, we reran the results without the anchoring question in the MaxDiff estimation. Anchoring mainly helps us to interpret the MaxDiff results in an "absolute" sense, so that Anchored Adaptive MaxDiff scores represent the average simulated probability of purchase. Anchoring also helps marginally in improving consistency with the T2B purchase intent measure across waves.
Alpha with T2B (%) Purchase Intent (n=126 concepts)
AA-MD scores with anchoring       0.94
AA-MD scores without anchoring    0.91
We combined the results from all 126 concepts. The T2B line in the chart below is the top-2-box Purchase Intent and the other is the AA-MD scores. The table to the right shows the distribution of the T2B ratings and the AA-MD scores.
These numbers are used to set “norms” or action standards—hundreds of concepts are tested; only a few go forward. There are many similarities between the results. With T2B ratings, there is only a six percentage point difference between something that is good and something truly outstanding, i.e., 80th to 95th percentile. The spread is much wider (a 12 point difference) with AA-MD scores. The adaptive nature of the MaxDiff exercise means that the worse performing concepts are less differentiated, but that is of little concern. The five “star” concepts all performed well in AA-MD, consistently showing up in the top third. The results were mixed in T2B purchase intent ratings, with one star concept (#4) slipping below the top one-third and one star concept (#5) falling into the bottom half. When the same two star concepts (#3 and #4) were repeated in wave 5, the AA-MD results were consistent between the two waves for both concepts, while the T2B ratings were more varied. Star concept #4’s T2B rating jumped 9 points in the percent rank from wave 4 to wave 5. While these results are not conclusive given that we only have 2 concepts, they are consistent with the type of results we expect from these approaches.
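For reference, action standards of this kind are simply percentiles of the accumulated scores; a minimal sketch with hypothetical AA-MD scores (ours, for illustration) is:

```python
# Minimal sketch: percentile-based action standards from accumulated scores (hypothetical data).
import random

scores = [random.gauss(0.30, 0.08) for _ in range(126)]   # e.g., AA-MD scores for 126 concepts

def percentile(data, pct):
    """Percentile with simple linear interpolation."""
    data = sorted(data)
    k = (len(data) - 1) * pct / 100
    lo = int(k)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (k - lo)

good = percentile(scores, 80)            # "good" threshold
outstanding = percentile(scores, 95)     # "truly outstanding" threshold
print(f"80th pct = {good:.2f}, 95th pct = {outstanding:.2f}, spread = {outstanding - good:.2f}")
```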
4. DISCUSSION
As previously noted, concepts are usually tested as they come along, without prioritization. This naturally occurring randomization process may be one of the reasons the Anchored Adaptive MaxDiff methodology appears to be free from the context effect brought on by testing the concepts at different times, i.e., in different waves.
Wirth & Wolfrath (2012) proposed Express MaxDiff to deal with large numbers of items. Express MaxDiff employs a controlled block design and utilizes HB's borrowing-strength mechanism to infer full individual parameter vectors. In an Express MaxDiff setting, respondents are randomly assigned to questionnaire versions, each of which deals with only a subset of items (allocated using an experimental design). Through simulation studies, the authors were able to recover aggregate estimates almost perfectly and satisfactorily predicted choices in holdout tasks. They also advised users to increase the number of prior degrees of freedom, which significantly improved parameter recovery.
Another key finding from Wirth & Wolfrath (2012) is that the number of blocks used to create the subsets has little impact on parameter recovery. However, increasing the number of items per block improves recovery of individual parameters. Taking this idea to the extreme, we can create an experiment where each item is shown in one and only one of the blocks (as few blocks as possible), and there is a fairly large number of items per block. This very much resembles our current experiment with 126 concepts divided into 5 blocks, each with 20–30 concepts.
If we pretend our data came from an Express MD experiment (excluding data from anchoring), we can create an HB run with all 126 items together using all 6,200 respondents. Using a fairly strong prior (d.f. = 2,000) to allow for borrowing across samples, the results correlate almost perfectly with what we obtained from each individual dataset. This again demonstrates the lack of any context effect due to the "wave" design. This also explains why we see only a marginal decline in consistency between AA-MD scores and T2B ratings when anchoring is excluded from the model.
5. “ABSOLUTE” COMPARISON & ANCHORING While we are satisfied that Anchored Adaptive MaxDiff used in the continuous concept testing environment is indeed free from this context effect, can this methodology be used in other settings where such naturally occurring randomization does not apply? Several conference attendees asked us if Anchored Adaptive MaxDiff would work if all the concepts tested were known to be “good” concepts. While we have no definitive answer to this question and believe further research is needed, Orme (2009b) offers some insights into this problem. Orme (2009b) looked at MaxDiff anchoring using, among other things, a 5-point rating scale. There were two waves of the experiment. In wave 1, 30 items were tested through a traditional MaxDiff experiment. Once that was completed, the 30 items were analyzed and ordered in a list from the most to the least preferred. The list was then divided in two halves with best 15 items in one and the worst 15 items in the other. In wave 2 of the study, respondents were randomly assigned into an Adaptive MaxDiff experiment using either the best 15 list or the worst 15 list. The author also made use of the natural order of a respondent’s individual level preference expressed through his Adaptive MaxDiff results. In particular, after the MaxDiff tasks, each respondent was asked to rate, on a 5-point rating scale, the desirability of 5 items which consisted of the following: Item1: Item2: Item3: Item4: Item5: 214
Item winning Adaptive MaxDiff tournament Item not eliminated until 4th Adaptive MaxDiff round Item not eliminated until 3rd Adaptive MaxDiff round Item not eliminated until 2nd Adaptive MaxDiff round Item eliminated in 1st Adaptive MaxDiff round
Since those respondents with the worst 15 list saw only items less preferred, one would expect the average ratings they gave to be lower than those given by respondents who saw the best 15 list. However, that was not what happened. Indeed, the mean ratings for the winning item were essentially tied between the two groups.
                                Worst15 (N=115)   Best15 (N=96)
Winning MaxDiff Item                 3.99             3.98
Item Eliminated in 1st Round         2.30             3.16
(Table 2—Orme 2009b) This suggested that respondents were using the 5-pt rating scale in a “relative” manner, adjusting their ratings within the context of items seen in the questionnaire. This made it difficult to use the ratings data as an absolute measuring stick to calibrate the Wave 2 Adaptive MaxDiff scores and recover the pattern of scores seen from Wave 1. A further “shift” factor (quantity A below) is needed to align the best 15 and the worst 15 items.
(Figure 2—Orme 2009b)
In another cell of the experiment, respondents were asked to volunteer (in an open-ended question) items that they would consider the best and worst in the context of the decision. Respondents were then asked to rate the same 5 items selected from their Adaptive MaxDiff exercise, along with these two additional items, on the 5-point desirability scale.
                                Worst15 (N=96)    Best15 (N=86)
Winning MaxDiff Item                 3.64             3.94
Item Eliminated in 1st Round         2.13             2.95
(Table 3—Orme 2009b)
(Figure 3—Orme 2009b)
Interestingly, this time the anchor rating questions (for the 5 items based on preferences expressed in the Adaptive MaxDiff) yielded more differentiation between those who received the best 15 list and those with the worst 15 list. It seems that asking respondents to rate their own absolute best/worst provided a good frame of reference, so that the 5-point rating scale could be used in a more "absolute" sense. This effect was carried through into the modeling. While there was still a shift in the MaxDiff scores from the two lists, the effects were much less pronounced.
We are encouraged by this result. It makes sense that the success of anchoring is directly tied to how the respondents use the anchor mechanism. If respondents were using the anchoring mechanism in a more "absolute" sense, the anchored MaxDiff score would be more suitable as an "absolute" measure of preferences, and vice versa.
Coincidentally, McCullough (2013) contained some data showing that respondents do not react strictly in a "relative" fashion to direct binary anchoring. The author asked respondents to select all the brand image items that would describe a brand after a MaxDiff exercise about those brand image items and the brands. Respondents selected many more items for the two existing (and well-known) brands than for the new entry brand.
Average Number of Brand Image Items Selected
Brand #1    Brand #2    New Brand
  4.9         4.2         2.8
Why are respondents selecting more items for the existing brands? Two issues could be at work here:
1. There is a brand-halo effect. Respondents identify more items for the existing brands simply because the brands themselves are well known.
2. A well-known brand would be known for more brand image items, and respondents are indeed making an "absolute" judgment when it comes to what is related to that brand and are making more selections because of it.
McCullough (2013) also created a Negative DBR cell where respondents were asked the direct binary response question for all the items that would not describe the brands. The author found that the addition of this negative information helped to minimize the scale-usage bias and removed the brand-halo effect. We see clearly now that existing brands are indeed associated with more brand image items.
(McCullough 2013)
While we cannot estimate the exact size of the effects due to brand-halo, we can conclude that the differences we observed in the number of items selected in the direct binary response question are at least partly due to more associations with existing brands. That is, respondents are making (at least partly) an "absolute" judgment in terms of what brand image items are associated with each of the brands. We are further encouraged by this result.
To test our hypothesis that respondents make use of direct binary response questions in an "absolute" sense, we set out to collect additional data. We asked our colleagues from Angus Reid Global (a firm specializing in public opinion research) to come up with two lists: list A included 6 items that were known to be important to Canadians today (fall of 2013) and list B included 6 items that were known to be less important.
List A                      List B
Economy                     Aboriginal Affairs
Ethics / Accountability     Arctic Sovereignty
Health Care                 Foreign Aid
Unemployment                Immigration
Environment                 National Unity
Tax relief                  Promoting Canadian Tourism Abroad
We then showed a random sample of Angus Reid Forum panelists one of these lists (randomly assigned) and asked them to select the issues they felt were important. Respondents who saw the list with known important items clearly selected more items than those respondents who saw the list with known less important items.
Number of Items Identified as Important (Out of 6 Items)

                                     List A                   List B
                                     Most Important Items     Less Important Items
n                                    505                      507
Mean                                 3.0                      1.8
Standard Deviation                   1.7                      1.4
p-value on the difference in means
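The gap between the two lists can be checked directly from the reported means, standard deviations and sample sizes; a minimal Welch-type calculation (our illustration, not the study's own test) is:

```python
# Welch's t statistic computed from the reported summary statistics (illustration only).
import math

m_a, s_a, n_a = 3.0, 1.7, 505    # List A: most important items
m_b, s_b, n_b = 1.8, 1.4, 507    # List B: less important items

se = math.sqrt(s_a**2 / n_a + s_b**2 / n_b)
t = (m_a - m_b) / se
print(f"difference in means = {m_a - m_b:.1f}, t = {t:.1f}")   # a t this large is far in the tail
```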
Respondents with individual worst/best scale ratios > 0.4 (n=863) have average hit rates of 82.7%; those with individual worst/best scale ratios < 0.4 (n=55) have average hit rates of 51.8%.
Figure 5. Combined best-worst utilities, estimated with and without corrections for scale ratios among response types; left panel: all respondents, n=918; right panel: respondents with worst/best scale ratios > 0.4 (n=863).
These small hit rates result because the combined utilities for this group of respondents are virtually inestimable when correcting for best/worst scale ratios. This can be seen in Figure 5. Utilities for those respondents with small or negative worst/best scale ratios tend to be estimated around zero when the correction is made, and add little but noise to the overall estimates (Figure 5, left panel). Removing those respondents and re-estimating utilities cleaned things up considerably (Figure 5, right panel). The same effect can be seen in the range of estimated utilities. Removing the "noisy" respondents led to increased utility ranges, as we might expect to be the case (Table 3).
Table 3. Combined best-worst utilities, estimated with scale ratio correction for all respondents in data set 2 (n=918), and for respondents with individual worst/best scale ratios > 0.4 (n=863).

Data set 2        All respondents    Small scale ratios removed
min                    -1.76                 -2.05
1st quartile           -0.27                 -0.30
3rd quartile           +0.47                 +0.54
max                    +1.09                 +1.25
The individual root likelihood (RLH) statistic has been suggested as one criterion for cleaning “noisy” respondents from data (Orme, 2013). We compared individual worst/best scale ratios to individual RLH statistics in one of the data sets and found no relationship (r=-0.01, t[H0: r=0]=-0.185, p=0.853). It seems then, at least from examination of a single data set, that using best/worst scale ratios provides a measure of respondent “noisiness” that can be used independently from (and in concert with) RLH.
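As a pointer to how such a check can be run, here is a minimal sketch of the scale-ratio versus RLH comparison (our illustration with simulated placeholder values, not the study's data):

```python
# Minimal sketch: correlate individual worst/best scale ratios with individual RLH values.
import math
import random

n = 918
scale_ratios = [random.gauss(0.8, 0.3) for _ in range(n)]   # placeholder values
rlh = [random.uniform(0.3, 0.9) for _ in range(n)]          # placeholder values

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(scale_ratios, rlh)
t = r * math.sqrt((n - 2) / (1 - r ** 2))   # t statistic for H0: r = 0
print(f"r = {r:.3f}, t = {t:.3f}")
```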
FINAL THOUGHTS Best and worst responses from MaxDiff tasks do tend to be scaled differently. Whatever the reason for this, not taking those differences into account when estimating combined best-worst utilities appears to add some systematic bias, albeit a small amount, to those utilities. The presence of this bias raises the question of whether or not to adjust for the scale differences during analysis. Adjustment requires additional effort, and that effort may not be seen as justified given the lack of strong findings here and elsewhere. Practitioners, rightly or not, may feel that if there is only a little bias and that bias will not change managerial decisions about the data, why not continue with the status quo? It is a reasonable opinion to hold. But, fundamentally, this bias results from having two kinds of responses—best and worst— combined into a single analysis. Perhaps the better solution is to tailor the response used to the desired objective. If the objective is to identify “best” alternatives, ask for only best responses; likewise, if the objective is to identify “worst” alternatives, ask for only worst responses. Doing so might better focus respondents’ attention on the business objective; we could include more choice tasks since we would be asking for one fewer response in each task; and we would not have to worry about the effect of different scales between response types. The difficulty with this approach for practitioners would be in identifying, managerially, what the objective should be (finding “bests” or eliminating “worsts”) in an age where we’ve become used to doing both in the same exercise. This idea remains the focus of future research.
Jack Horne
REFERENCES Ben-Akiva, M. and Lerman, S. R. (1985). Discrete Choice Analysis: Theory and Application to Travel Demand. MIT Press, Cambridge, MA. Cohen S. and Orme, B. (2004). What’s your preference? Marketing Research, 16, pp. 32–37. Dyachenko, T. L., Naylor, R. W. and Allenby, G. M. (2013a). Models of sequential evaluation in best-worst choice tasks. Advanced Research Techniques (ART) Forum. Chicago. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2072496 Dyachenko, T. L., Naylor, R. W., and Allenby, G. M. (2013b). Ballad of the best and worst. 2013 Sawtooth Software Conference Proceedings. Dana Point, CA. Louviere, J. J. (1991). Best-worst scaling: A model for the largest difference judgments. Working Paper. University of Alberta. McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior, in P. Zarembka (ed.), Frontiers in Econometrics. Academic Press. New York. pp. 105–142. Orme, B. (2013). MaxDiff/Web v.8 Technical Paper. http://www.sawtoothsoftware.com/education/techpap.shtml Rayner, B. K. and Horne, J. (forthcoming). Scaled MaxDiff. Marketing Letters (submitted). Swait, J. and Louviere, J. J. (1993). The role of the scale parameter in estimation and comparison of multinomial logit models. Journal of Marketing Research, 30, pp. 305–314. Train, K. E. (2007). Discrete Choice Methods with Simulation. Cambridge University Press. New York. pp. 38–79.
USING CONJOINT ANALYSIS TO DETERMINE THE MARKET VALUE OF PRODUCT FEATURES GREG ALLENBY OHIO STATE UNIVERSITY JEFF BRAZELL THE MODELLERS JOHN HOWELL PENN STATE UNIVERSITY PETER ROSSI UNIVERSITY OF CALIFORNIA LOS ANGELES
ABSTRACT
In this paper we propose an approach for using conjoint analysis to attach economic value to specific product features, a task that arises in many econometric applications and in intellectual property litigation. A common approach to this task involves taking the difference in utility levels and dividing by the price coefficient. This is fraught with difficulties, including a) certain respondents being projected to pay astronomically high amounts for features, and b) the approach ignoring important competitive realities in the marketplace. We argue that assessing the economic value of a feature to a firm requires conducting market simulations (a share of preference analysis) involving a realistic set of competitors, including the outside good (the "None" category). Furthermore, it requires a game-theoretic approach to compare the industry equilibrium prices with and without the focal product feature.
1. INTRODUCTION
Valuation of product features is a critical part of the development and marketing of products and services. Firms are continuously involved in the improvement of existing products by adding new features, and many "new products" are essentially old products which have been enhanced with features previously unavailable. For example, consider the smartphone category of products. As new generations of smartphones are produced and marketed, existing features such as screen resolution/size or cellular network speed are enhanced to new higher levels. In addition, features are added to enhance the usability of the smartphone. These new features might include integration of social networking functions into the camera application of the smartphone. A classic example, which was involved in litigation between Apple and Samsung, is the use of icons with rounded edges.
New and enhanced features often involve substantial development costs and sometimes also require new components, which drive up the marginal cost of production. The decision to develop new features is a strategic decision involving not only the cost of adding the feature but also the possible competitive response. The development and marketing costs of feature enhancement must be weighed against the expected increase in profits which will accrue if the product feature is added or enhanced. Expected profits in a world with the new feature must be compared to expected profits in a world without the feature. Computing this change in expected profits involves predicting not only demand for the feature but also assessing
the new industry equilibrium that will prevail with a new set of products and competitive offerings. In a litigation context, product features are often at the core of patent disputes. In this paper, we will not consider the legal questions of whether or not the patent is valid and whether or not the defendant has infringed the patent(s) in dispute. We will focus on the economic value of the features enabled by the patent. The market value of the patent is determined both by the value of the features enabled as well as by the probability that the patent will be deemed to be a valid patent and the costs of defending the patent’s validity and enforcement. The practical content of both apparatus and method patents can be viewed as the enabling of product features. The potential value of the product feature(s) enabled by patent is what gives the patent value. That is, patents are valuable only to the extent that they enable product features not obtainable via other (so-called “non-infringing”) means. In both commercial and litigation realms, therefore, valuation of product features is critical to decision making and damages analysis. Conjoint Analysis (see, for example, Orme 2009 and Gustafsson et al. 2000) is designed to measure and simulate demand in situations where products can be assumed to be comprised of bundles of features. While conjoint analysis has been used for many years in product design (see the classic example in (Green and Wind, 1989)), the use of conjoint in patent litigation has only developed recently. Both uses of conjoint stem from the need to predict demand in the future (after the new product has been released) or in a counterfactual world in which the accused infringing products are withdrawn from the market. However, the literature has struggled, thus far, to precisely define meaning of “value” as applied to product features. The current practice is to compute what many authors call a Willingness to Pay (hereafter, WTP) or a Willingness To Buy (hereafter, WTB). WTP for a product feature enhancement is defined as the monetary amount which would be sufficient to compensate a consumer for the loss of the product feature or for a reduction to the non-enhanced state. WTB is defined as the change in sales or market share that would occur as the feature is added or enhanced. The problem with both the WTP and WTB measures is that they are not equilibrium outcomes. WTP measures only a shift in the demand curve and not what the change in equilibrium price will be as the feature is added or enhanced. WTB holds prices fixed and does not account for the fact that as a product becomes more valuable equilibrium prices will typically go up. We advocate using equilibrium outcomes (both price and shares) to determine the incremental economic profits that would accrue to a firm as a product is enhanced. In general, the WTP measure will overstate the change in equilibrium price and profits and the WTB measure will overstate the change in equilibrium market share. We illustrate this using a conjoint survey for digital cameras and the addition of a swivel screen display as the object of the valuation exercise. Standard WTP measures are shown to greatly overstate the value of the product feature. To compute equilibrium outcomes, we will have to make assumptions about cost and the nature of competition and the set of competitive offers. Conjoint studies will have to be designed with this in mind. 
In particular, greater care must be exercised to include an appropriate set of competitive brands, to handle the outside option appropriately, and to estimate price sensitivity precisely.
2. PSEUDO-WTP, TRUE WTP AND WTB
In the context of conjoint studies, feature valuation is achieved by using various measures that relate only to the demand for the products and features and not to the supply. In particular, it is common to produce estimates of what some call Willingness To Pay and Willingness To Buy. Both WTP and WTB depend only on the parameters of the demand system. As such, the WTP and WTB measures cannot be measures of the market value of a product feature, as they do not directly relate to what incremental profits a firm can earn on the basis of the product feature. In this section, we review the WTP and WTB measures and explain the likely biases in these measures in feature valuation. We also explain why the WTP measures used in practice are not true WTP measures and provide the correct definition of WTP.
2.1 The Standard Choice Model for Differentiated Product Demand
Valuation of product features depends on a model for product demand. In most marketing and litigation contexts, a model of demand for differentiated products is appropriate. We briefly review the standard choice model for differentiated product demand. In many contexts, any one customer purchases at most one unit of the product. While it is straightforward to extend our framework to consider products with variable quantity purchases, we limit attention to the unit demand situation and begin by developing the model for a single respondent; extensions needed for multiple respondents are straightforward (see Rossi et al., 2005). The demand system then becomes a choice problem in which customers have J choice alternatives, each with a characteristics vector, xj, and price, pj. The standard random utility model (McFadden, 1981) postulates that the utility for the jth alternative consists of a deterministic portion (driven by x and p) and an unobservable portion which is modeled, for convenience, as a Type I extreme value distribution.
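In standard notation (ours; β is the vector of attribute part-worths and β_p the price coefficient), the model just described can be written as

\[ U_j = x_j'\beta + \beta_p\, p_j + \varepsilon_j, \qquad \varepsilon_j \sim \text{i.i.d. Type I extreme value}, \]

which implies the familiar logit choice probabilities

\[ \Pr(\text{choose } j) = \frac{\exp(x_j'\beta + \beta_p\, p_j)}{\sum_{k=1}^{J} \exp(x_k'\beta + \beta_p\, p_k)}. \]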
xj is a k x 1 vector of attributes of the product, including the feature that requires valuation. xf denotes the focal feature. Feature enhancement is modeled as alternative levels of the focal feature, xf (one element of the vector x), while the addition of a feature would simply have xf as a dummy or indicator variable. Three assumptions regarding the model above are important for feature valuation:
1. This is a compensatory model with a linear utility.
2. We enter price linearly into the model instead of using the more common dummy variable coding used in the conjoint literature. That is, if price takes on K values, p1, . . . , pK, we include one price coefficient instead of the usual K-1 dummy variables to represent the different levels. In equilibrium calculations, we will want to consider prices at any value in some relevant range in order to use first order conditions, which assume a continuum.
3. There is a random utility error that theoretically can take on any number on the real line.
The random utility error, εj, represents the unobservable (to the investigator) part of utility. This means that the actual utility received from any given choice alternative depends not only on the observed product attributes, x, and price but also on realizations from the error distribution. In the standard random utility model, there is the possibility of receiving up to infinite utility from
the choice alternative. This means that in evaluating the option to make choices from a set of products, we must consider the contribution not only of the observed or deterministic portion of utility but also the distribution of the utility errors. The possibilities for realization from the error distribution provide a source of utility for each choice alternative. In the conjoint literature, the β coefficients are called part-worths. It should be noted that the part-worths are expressed in a utility scale which has an arbitrary origin (as defined by the base alternative) and an equally arbitrary scaling (somewhat like the temperature scale). This means that we cannot compare elements of the β vector in ratio terms or utilizing percentages. In addition, if different consumers have different utility functions (which is almost a truism of marketing) then we cannot compare part-worths across individuals. For example, suppose that one respondent gets twice as much utility from feature A as feature B, while another respondent gets three times as much utility from feature B as A. All we can say is that the first respondent ranks A over B and the second ranks B over A; no statements can be made regarding the relative “liking” of the various features. 2.2 Pseudo WTP
The arbitrary scaling of the logit choice parameters presents a challenge to interpretation. For this reason, there has been a lot of interest in various ways to convert part-worths into quantities such as market share or dollars which are defined on ratio scales. What is called “WTP” in the conjoint literature is one attempt to convert the part-worth of the focal feature, βf, to the dollar scale. Using a standard dummy variable coding, we can view the part-worth of the feature as representing the increase in deterministic utility that occurs when the feature is turned on. For feature enhancement, a dummy coding approach would require that we use the difference in partworths associated with the enhancement in the “WTP” calculation. If the feature part-worth is divided by the price coefficient, then we have converted to the ratio dollar scale. We will call this “pseudo-WTP” as it is not a true WTP measure as we explain below.
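In symbols (our notation; β_f is the part-worth of the focal feature, or the difference in part-worths for an enhancement, and β_p is the price coefficient), the computation just described is

\[ \text{p-WTP} \;=\; \frac{\beta_f}{\lvert \beta_p \rvert}. \]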
This p-WTP measure is often justified by appeal to the simple argument that this is the amount by which price could be raised and still leave the "utility" for choice alternative j the same when the product feature is turned on. Others define this as a "willingness to accept" by giving the completely symmetric definition as the amount by which price would have to be lowered to yield the same utility in a product with the feature turned off as with a product with the feature turned on. Given the assumption of a linear utility model and a linear price term, both definitions are identical. In practice, reference price effects often make WTA differ from WTP (see Viscusi and Huber, 2011), but, in the standard economic model, these are equivalent. In the literature (Orme, 2001), p-WTP is sometimes defined as the amount by which the price of the feature-enhanced product can be increased and still leave its market share unchanged. In a homogeneous logit model, this is identical to the expression above.
Inspection of the p-WTP formula reveals at least two reasons why the p-WTP formula cannot be true WTP. First, the change in WTP should depend on which product is being augmented with the feature. The conventional p-WTP formula is independent of which product variant is being augmented due to the additivity of the deterministic portion of the utility function. Second, true
WTP must be derived ex ante—before a product is chosen. That is, adding the feature to one of the J products in the marketplace enhances the possibilities for attaining high utility. Removing the feature reduces levels of utility by diminishing the opportunities in the choice set. This is all related to the assumption that on each choice occasion a separate set of choice errors is drawn. Thus, the actual realization of the random utility errors is not known prior to the choice, and must be factored into the calculations to estimate the true WTP.
2.3 True WTP
WTP is an economic measure of social welfare derived from the principle of compensating variation. That is, WTP for a product is the amount of income that will compensate for the loss of utility obtained from the product; in other words, a consumer should be indifferent between having the product or not having the product with an additional income equal to the WTP. Indifference means the same level of utility. For choice sets, we must consider the amount of income (called the compensating variation) that I must pay a consumer faced with a diminished choice set (either an alternative is missing or diminished by omission of a feature) so that consumer attains the same level of utility as a consumer facing a better choice set (with the alternative restored or with the feature added). Consumers evaluate choices a priori or before choices are made. Features are valuable to the extent to which they enhance the attainable utility of choice. Consumers do not know the realization of the random utility errors until they are confronted with a choice task. Addition of the feature shifts the deterministic portion of utility or the mean of the random utility. Variation around the mean due to the random utility errors is equally important as a source of value. The random utility model was designed for application to revealed preference or actual choice in the marketplace. The random errors are thought to represent information unobservable to the researcher. This unobservable information could be omitted characteristics that make particular alternatives more attractive than others. In a time series context, the omitted variables could be inventory which affects the marginal utility of consumption. In a conjoint survey exercise, respondents are explicitly asked to make choices solely on the basis of attributes and levels presented and to assume that all other omitted characteristics are to be assumed to be the same. It might be argued, then, that the role of random utility errors is different in the conjoint context. Random utility errors might be more the result of measurement error rather than omitted variables that influence the marginal utility of each alternative. However, even in conjoint setting, we believe it is still possible to interpret the random utility errors as representing a source of unobservable utility. For example, conjoint studies often include brand names as attributes. In these situations, respondents may infer that other characteristics correlated with the brand name are present even though the survey instructions tell them not to make these attributions. One can also interpret the random utility errors as arising from functional form mis-specification. That is, we know that the assumption of a linear utility model (no curvature and no interactions between attributes) is a simplification at best. We can also take the point of view that a consumer is evaluating a choice set prior to the realization of the random utility errors that occur during the purchase period. For example, I consider the value of choice in the smartphone category at some point prior to a purchase decision. At the point, I know the distribution of random utility errors that will depend on features I have not yet discovered or from demand for features which is not yet realized (i.e., I will realize that I will get
a great deal of benefit from a better browser). When I go to purchase a smartphone, I will know the realization of these random utility errors. To evaluate the utility afforded by a choice set, we must consider the distribution of the maximum utility obtained across all choice alternatives. This maximum has a distribution because of the random utility errors. For example, suppose we add the feature to a product configuration that is far from utility maximizing. It may still be that, even with the feature, the maximum deterministic utility is provided by a choice alternative without the feature. This does not mean that feature has no value simply because the product it is being added to is dominated by other alternatives in terms of deterministic utility. The alternative with the feature added can be chosen after realization of the random utility errors if the realization of the random utility error is very high for the alternative that is enhanced by addition of the feature. The evaluation of true WTP involves the change in the expected maximum utility for a set of offerings with and without the enhanced product feature. We refer the reader to the more technical paper by Allenby et al. (2013) for its derivation, and simply show the formula below to illustrate its difference from the p-WTP formula described above:
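One way to write that comparison in the notation used here (our rendering rather than the authors' exact expression; V_k denotes the deterministic portion of utility for alternative k, and only the focal product's feature level changes between the two choice sets) is the log-sum form

\[ \mathrm{WTP}(a \to a^{*}) \;=\; \frac{1}{\lvert \beta_p \rvert}\left[\,\ln\!\sum_{k=1}^{J}\exp\bigl(V_k(a^{*})\bigr) \;-\; \ln\!\sum_{k=1}^{J}\exp\bigl(V_k(a)\bigr)\right], \]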
where a* is the enhanced level of the attribute. In this formulation, the value of an enhanced level of an attribute is greater when the choice alternative has higher initial value.
2.4 WTB
In some analyses, product features are valued using a “Willingness To Buy” concept. WTB is the change in market share that will occur if the feature is added to a specific product.
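Holding the price vector p fixed, this can be written (our notation) as

\[ \mathrm{WTB}_j \;=\; MS_j\bigl(p,\, x_j(a^{*}),\, x_{-j}\bigr) \;-\; MS_j\bigl(p,\, x_j(a),\, x_{-j}\bigr), \]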
where MS(j) is the market share equation for product j. The market share depends on the entire price vector and the configuration of the choice set. This equation holds prices fixed as the feature is enhanced or added. The market share equations are obtained by summing up the logit probabilities over possibly heterogeneous (in terms of taste parameters) customers. The WTB measure does depend on which product the feature is added to (even in a world with identical or homogeneous customers) and thereby remedies one of the defects of the pseudo-WTP measure. However, WTB assumes that firms will not alter prices in response to a change in the set of products in the marketplace as the feature is added or enhanced. In most competitive situations, if a firm enhances its product and the other competing products remain unchanged, we would expect the focal firm to be able to command a somewhat higher price, while the other firms' offerings would decline in demand and, therefore, the competing firms would reduce their prices or add other features.
2.5 Why p-WTP, WTP and WTB are Inadequate
Pseudo-WTP, WTP and WTB do not take into account equilibrium adjustments in the market as one of the products is enhanced by addition of a feature. For this reason, we cannot view either pseudo-WTP or WTP as what a firm can charge for a feature-enhanced product, nor can we view WTB as the market share that can be gained by feature enhancement. Computation of changes in the market equilibrium due to feature enhancement of one product will be required to develop a measure of the economic value of the feature. WTP will overstate the price premium afforded by feature enhancement and WTB will also overstate the impact of feature enhancement on market share.
Equilibrium computations in differentiated product cases are difficult to illustrate by simple graphical means. In this section, we will use the standard demand and supply graphs to provide an informal intuition as to why p-WTP and WTB will tend to overstate the benefits of feature enhancement. Figure 1 shows a standard industry supply and demand set-up. The demand curve is represented by the blue downward sloping lines. "D" denotes demand without the feature and "D*" denotes demand with the feature. The vertical difference between the two demand curves is the change in WTP as the feature is added. We assume that addition of the feature may increase the marginal cost of production (note: for some features, such as those created purely via software, the marginal cost will not change). It is easy to see that, in this case, the change in WTP exceeds the change in equilibrium price. A similar argument can be made to illustrate that WTB will exceed the actual change in demand in a competitive market.
Figure 1. Difficulties with WTP
3. ECONOMIC VALUATION OF FEATURES
The goal of feature enhancement is to improve the profitability of the firm introducing the enhanced product into an existing market. Similarly, the value of a patent is ultimately derived from the profits that accrue to firms who practice the patent by developing products that utilize the patented technology. In fact, the standard economic argument for allowing patent holders to sell their patents is that, in this way, patents will eventually find their way into the hands of those firms who can best utilize the technology to maximize demand and profits. For these reasons, we believe that the appropriate measure of the economic value of feature enhancement is the incremental profits that the feature enhancement will generate.
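One way to formalize this (our notation, under the assumptions listed in Section 3.1 below) is

\[ \pi_j(x) \;=\; M \cdot MS_j\bigl(p^{*}(x),\, x\bigr)\,\bigl(p_j^{*}(x) - c_j\bigr), \qquad \Delta\pi_j \;=\; \pi_j\bigl(x(a^{*})\bigr) - \pi_j\bigl(x(a)\bigr), \]

where M is the market size, c_j is product j's marginal cost, and p*(x) is the vector of equilibrium prices given the product configuration x.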
Profit, π, is associated with the industry equilibrium prices and shares given a particular set of competing products, which is represented by the choice set defined by the attribute matrix. This definition allows for both price and share adjustment as a result of feature enhancement, removing some of the objections to the p-WTP, WTP and WTB concepts. Incremental profit is closer in spirit, though not identical, to the definition of true WTP in the sense that profits depend on the entire choice set and the incremental profits may depend on which product is subject to feature enhancement. However, social WTP does not include cost considerations and does not address how the social surplus is divided between the firm and the customers. In the abstract, our definition of the economic value of feature enhancement seems to be the appropriate measure for the firm that seeks to enhance a feature. All funds have an opportunity cost, and the incremental profits calculation is fundamental to deploying product development resources optimally.
In fairness, industry practitioners of conjoint analysis also appreciate some of the benefits of an incremental profits orientation. Often marketing research firms construct "market simulators" that simulate market shares given a specific set of products in the market. Some even go further and attempt to compute the "optimal" price by simulating different market shares corresponding to different "pricing scenarios." In these exercises, practitioners fix competing prices at a set of prices that may include their informal estimate of competitor response. This is not the same as computing a market equilibrium but moves in that direction.
3.1 Assumptions
Once the principle of incremental profits is adopted, the problem becomes to define the nature of competition and the competitive set and to choose an equilibrium concept. These assumptions must be added to the assumptions of a specific parametric demand system (we will use a heterogeneous logit demand system, which is flexible but still parametric), a linear utility function over attributes, and the assumption (implicit in all conjoint analysis) that products can be well described by bundles of attributes. Added to these assumptions, our valuation method will also require cost information. Specifically, we will assume:
1. Demand Specification: A standard heterogeneous logit demand that is linear in the attributes (including price).
2. Cost Specification: Constant marginal cost.
3. Single-product firms.
4. Feature Exclusivity: The feature can only be added to one product.
5. No Exit: Firms cannot exit or enter the market after product enhancement takes place.
6. Static Nash Price Competition: There is a set of prices from which each individual firm would be worse off if it deviated from the equilibrium.
Assumptions 2, 3 and 4 can be easily relaxed. Assumption 1 can be replaced by any valid demand system. Assumptions 5 and 6 cannot be relaxed without imparting considerable complexity to the equilibrium computations.
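As an illustration of what such an equilibrium computation looks like, here is a minimal sketch under the simplest version of these assumptions (homogeneous rather than heterogeneous logit demand, single-product firms, constant marginal cost, an outside good). All utilities, costs and the price coefficient below are hypothetical; in practice the non-price utilities would come from the estimated part-worths.

```python
# Minimal sketch: static Nash price equilibrium for logit demand with single-product
# firms, constant marginal cost, and an outside good (all numbers hypothetical).
import math

alpha = 0.02                     # price sensitivity |beta_p|
cost = [220.0, 200.0, 180.0]     # marginal costs
delta = [4.0, 3.5, 3.0]          # non-price utility of each product (feature included)

def shares(prices):
    """Logit shares; the outside good's utility is normalized to zero."""
    expu = [math.exp(d - alpha * p) for d, p in zip(delta, prices)]
    denom = 1.0 + sum(expu)
    return [e / denom for e in expu]

def nash_prices(tol=1e-10, max_iter=10000):
    """Iterate the first-order conditions p_j = c_j + 1 / (alpha * (1 - s_j))."""
    prices = list(cost)
    for _ in range(max_iter):
        s = shares(prices)
        updated = [c + 1.0 / (alpha * (1.0 - sj)) for c, sj in zip(cost, s)]
        if max(abs(a - b) for a, b in zip(updated, prices)) < tol:
            return updated
        prices = updated
    return prices

p_star = nash_prices()
print("equilibrium prices:", [round(p, 2) for p in p_star])
print("equilibrium shares:", [round(s, 3) for s in shares(p_star)])
# To value a feature: raise the focal product's delta by its part-worth, recompute the
# equilibrium, and compare equilibrium profits, share * (price - cost), with and without it.
```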
4. USING CONJOINT ANALYSIS FOR EQUILIBRIUM CALCULATIONS
Economic valuation of feature enhancement requires a valid and realistic demand system as well as cost information and assumptions about the set of competitive products. If conjoint studies are to be used to calibrate the demand system, then particular care must be taken to design a realistic conjoint exercise. The low cost of fielding and analyzing a conjoint design makes this method particularly appealing in a litigation context. In addition, with Internet panels, conjoint studies can be fielded and analyzed in a matter of days, a time frame also attractive in the tight schedules of patent litigation.
However, there is no substitute for careful conjoint design. Many designs fielded today are not useful for economic valuation of feature enhancement. For example, in recent litigation, conjoint studies were used in which there was no outside option, only one brand, and only patented features. A study with any of these limitations is of questionable value for true economic valuation. Careful practitioners of conjoint have long been aware that conjoint is appealing because of its simplicity and low cost, but that careful studies make all the difference between realistic predictions of demand and useless results. We will not repeat the many prescriptions for careful survey analysis, which include thoroughly crafting questionnaires with terminology that is meaningful to respondents, thorough and documented pre-testing, and representative (projectable) samples. Furthermore, many of the prescriptions for conjoint design, including well-specified and meaningful attributes and levels, are extremely important. Instead, we will focus on the areas we feel are especially important for economic valuation and not considered carefully enough.
4.1 Set of Competing Products
The guiding principle in conjoint design for economic valuation of feature enhancement is that the conjoint survey must closely approximate the marketplace confronting consumers. In industry applications, the feature enhancement has typically not yet been introduced into the marketplace (hence the appeal of a conjoint study), while in patent litigation the survey is being used to approximate demand conditions at some point in the past at which patent infringement is alleged to have occurred. Most practitioners of conjoint are aware that, for realistic market simulations, the major competing products must be used. This means that the product attributes in the study should include not only functional attributes such as screen size, memory, etc., but also the major brands. This point is articulated well in Orme (2001). However, in many litigation contexts, the view is that only the products and brands accused of patent infringement should be included in the study. The idea is that only a certain brand's products are accused of infringement and, therefore, that the only relevant feature enhancement for the purposes of computing patent damages is feature enhancement in the accused products.
For example, in recent litigation, Samsung has accused Apple iOS devices of infringing certain patents owned by Samsung. The view of the litigators is that a certain feature (for example, a certain type of video capture and transmission) infringes a Samsung patent. Therefore, the only relevant feature enhancement is the addition or deletion of this feature on iOS devices such as the iPhone, iPad and iPod touch. This is correct, but only in a narrow sense. The hypothetical situation relevant to damages in that case is only the addition of the feature to relevant Apple products. However, the economic value of that enhancement depends on the other competing products in the marketplace. Thus, a conjoint survey which only uses Apple products in developing conjoint profiles cannot be used for economic valuation. The value of a feature in the marketplace is determined by the set of alternative products. For example, in a highly competitive product category with many highly substitutable products, the economic value or incremental profits that could accrue to any one competitor would typically be very small. However, in an isolated part of the product space (that is, a part of the attribute space that is not densely filled in with competing products), a firm may capture more of the value to consumers of a feature enhancement. For example, if a certain feature is added to an Android device, this may cause greater harm to Samsung in terms of lost sales/profits because smart devices in the Android market segment (of which Samsung is a part) are more inter-substitutable. It is possible that addition of the same feature to the iOS segment may be more valuable, as Apple iOS products may be viewed as less substitutable with Android products than Android products are with one another. We emphasize that these examples are simply conjectures to illustrate the point that a full set of competing products must be used in the conjoint study. We do not think it necessary to have all possible product variants or competitors in the conjoint study and subsequent equilibrium computations. In many product categories, this would require a massive set of possible products with many features. Our view is that it is important to design the study to consider the major competing products both in terms of brands and the attributes used in the conjoint design. It is not required that the conjoint study exactly mirror the complete set of products and brands that are in the marketplace, but the main exemplars of competing brands and product positions must be included.

4.2 Outside Option
There is considerable debate as to the merits of including an outside option in conjoint studies. Many practitioners use a "forced-choice" conjoint design in which respondents are forced to choose one from the set of product profiles in each conjoint choice task. The view is that forced choice will elicit more information from the respondents about the tradeoffs between product attributes. If the "outside" or "none of the above" option is included, advocates of forced choice argue that respondents may shy away from the cognitively more demanding task of assessing tradeoffs and select the "none" option to reduce cognitive effort. On the opposite side, other practitioners advocate inclusion of the outside option in order to assess whether or not the product profiles used in the conjoint study are realistic in the sense of attracting considerable demand. The idea is that if respondents select the "none of the above" option too frequently, then the conjoint design has offered very unattractive hypothetical products. Still others (see, for example, Brazell et al., 2006) argue the opposite side of the argument for forced choice. They argue that there is a "demand" effect in which respondents select at least one product to "please" the investigator. There is also a large literature on how to implement the outside option.
Whether or not the outside option is included depends on the ultimate use of the conjoint study. Clearly, it is possible to measure how respondents trade off different product attributes against each other without inclusion of the outside option. For example, it is possible to estimate the price coefficient in a conjoint study which does not include the outside option. Under the assumption that respondents are not budget constrained, the price coefficient should theoretically measure the trade-offs between other attributes and price. The fact that respondents might select a lower price and pass on some features means that they have an implicit valuation of the dollar savings involved in this trade-off. If all respondents are standard economic agents in the sense that they engage in constrained utility maximization, then this valuation of dollar savings is a valid estimate of the marginal utility of income. This means that a conjoint study without the outside option can be used to compute the p-WTP measure, which only requires a valid price coefficient. We have argued that p-WTP is not a measure of the economic value to the firm of feature enhancement. This requires a complete demand system (including the outside good) as well as the competitive and cost conditions. In order to compute valid equilibrium prices, we need to explicitly consider substitution from and to other goods, including the outside good. For example, suppose we enhance a product with a very valuable new feature. We would expect to capture sales from other products in the category as well as to expand category sales; the introduction of the Apple iPad dramatically grew the tablet category due, in part, to the features incorporated in the iPad. Chintagunta and Nair (2011) make a related observation that price elasticities will be biased if the outside option is not included. We conclude that an outside option is essential for economic valuation of feature enhancement, as the only way to incorporate substitution in and out of the category is by the addition of the outside option. At this point, it is possible to take the view that if respondents are pure economic actors, they should select the outside option according to their true preferences and their choices will properly reflect the marginal utility of income. However, there is a growing literature which suggests that different ways of expressing or allowing for the outside option will change the frequency with which it is selected. In particular, the so-called "dual response" way of allowing for the outside option (see Uldry et al., 2002 and Brazell et al., 2006) has been found to increase the frequency of selection of the outside option. The dual-response method asks the respondent first to indicate which of the product profiles (without the outside option) is most preferred and then asks whether the respondent would actually buy that product at the price posted in the conjoint design. Our own experience confirms that this mode of including the outside option greatly increases the selection of the outside option. Our experience has also been that the traditional method of including the outside option often elicits a very low rate of selection, which we view as unrealistic. The advocates of the dual-response method argue that it helps to reduce a conjoint survey bias toward higher purchase rates than in the actual marketplace.
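For readers who have not worked with dual-response data, the sketch below shows one simple way the two answers can be recoded into a single discrete choice with an explicit "none" alternative before estimation. The layout and names are illustrative assumptions, not a prescribed data format.

```python
import numpy as np

def recode_dual_response(best_idx, would_buy, n_alts):
    """Map dual-response answers to one choice among n_alts profiles plus 'none'.

    best_idx:  chosen profile indices (0..n_alts-1) from the forced first stage
    would_buy: booleans from the second-stage purchase question
    Returns choice indices where the value n_alts encodes the outside option.
    """
    best_idx = np.asarray(best_idx)
    would_buy = np.asarray(would_buy, dtype=bool)
    return np.where(would_buy, best_idx, n_alts)

# Example: a respondent preferred profile 2 but would not buy it at the stated price.
# recode_dual_response([2], [False], n_alts=4)  ->  array([4])   (the 'none' alternative)
```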
Another way of reducing bias toward higher purchase rates is to design a conjoint using an "incentive-compatible" scheme in which the conjoint responses have real monetary consequences. There are a number of ways to do this (see, for example, Ding et al., 2005) but most suggestions (an interesting exception is Dong et al., 2010) use some sort of actual product and a monetary allotment. If the products in the study are actual products in the marketplace, then the respondent might actually receive the product chosen (or, perhaps, be eligible for a lottery which would award the product with some probability). If the respondent selects the outside option, they would receive a cash transfer (or equivalent lottery eligibility).
4.3 Estimating Price Sensitivity
Both WTP and equilibrium prices are sensitive to inferences regarding the price coefficient. If the distribution of price coefficients puts any mass at all on positive values, then there does not exist a finite equilibrium price. All firms will raise prices without bound, effectively "firing" all consumers with negative price sensitivity and making infinite profits on the segment with positive price sensitivity. Most investigators regard positive price coefficients as inconsistent with rational behavior. However, it will be very difficult for a normal model to drive the mass on the positive half line for price sensitivity to a negligible quantity if there is mass near zero on the negative side. We must distinguish uncertainty in posterior inference from irrational behavior. If a number of respondents have posteriors for price coefficients that put most mass on positive values, this suggests a design error in the conjoint study; perhaps respondents are using price as a proxy for the quality of omitted features and ignoring the "all other things equal" survey instructions. In this case, the conjoint data should be discarded and the study re-designed. On the other hand, considerable mass on positive values may arise simply because of the normal assumption and the fact that we have very little information about each respondent. In these situations, we have found it helpful to change the prior or random effect distribution to impose a sign constraint on the price coefficient. In many conjoint studies, the goal is to simulate market shares for some set of products. Market shares can be relatively insensitive to the distribution of the price coefficients when prices are fixed to values typically encountered in the marketplace. It is only when one considers relative prices that are unusual, or prices that are relatively high or low, that the implications of a distribution of price sensitivity will be felt. By definition, price optimization will stress-test the conjoint exercise by considering prices outside the small range usually considered in market simulators. For this reason, the quality standards for design and analysis of conjoint data have to be much higher when the data are used for economic valuation than for many of the typical uses of conjoint. Unless the distribution of price sensitivity puts little mass near zero, the conjoint data will not be useful for economic valuation using either our equilibrium approach or the more traditional and flawed p-WTP methods.
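One common way to impose such a sign constraint, consistent in spirit with the random-effect change described above though not necessarily identical to the authors' implementation, is to model the price coefficient as minus the exponential of a normally distributed random effect, so that every respondent's coefficient is strictly negative. A minimal sketch (the names are ours):

```python
import numpy as np

def draw_price_coefficients(mu, tau, n_respondents, seed=None):
    """Sign-constrained heterogeneous price coefficients via a log-normal reparameterization.

    The random effect is theta_i ~ Normal(mu, tau^2); the price coefficient is
    alpha_i = -exp(theta_i), which is strictly negative for every respondent.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, tau, size=n_respondents)
    return -np.exp(theta)

# In a hierarchical Bayes sampler the same transform is applied inside the likelihood,
# so the normal prior on theta_i induces a (negative) log-normal prior on alpha_i.
```

A similar log-normal treatment of the price coefficient appears in the willingness-to-pay literature (see Sonnier, Ainslie and Otter, 2007, in the references).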
5. ILLUSTRATION

To illustrate our proposed method for economic valuation and to contrast our method with standard p-WTP methods, we consider the example of the digital camera market. We designed a conjoint survey to estimate the demand for features in the point and shoot submarket. We considered the following seven features with associated levels:

1. Brand: Canon, Sony, Nikon, Panasonic
2. Pixels: 10, 16 mega-pixels
3. Zoom: 4x, 10x optical
4. Video: HD (720p); Full HD (1080p) and mic
5. Swivel Screen: No, Yes
6. WiFi: No, Yes
7. Price: $79–$279
We focused on evaluating the economic value of the swivel screen feature which is illustrated in Figure 2. The conjoint design was a standard fractional factorial design in which each
respondent viewed sixteen choice sets, each of which featured four hypothetical products. A dual response mode was used to incorporate the outside option. Respondents were first asked which of the four profiles presented in each choice task was most preferred. Then the respondent was asked if they would buy the preferred profile at the stated price. If no, then this response is recorded as the "outside option" or "none of the above." Respondents were screened to only those who owned a point and shoot digital camera and who considered themselves to be a major contributor to the decision to purchase this camera.

Figure 2. Swivel Screen Attribute
Details of the study, its sampling frame, number of respondents and details of estimation are provided in Allenby et al. (2014). We focus here on some of the important summary findings:

1. The p-WTP measure of the swivel screen attribute is $63.
2. The WTP measure of the swivel screen attribute is $13.
3. The equilibrium change in profits is estimated to be $25.

We find that the p-WTP measure dramatically overstates the economic value of a product feature, and that the more economic-based measures are more reasonable.
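The $63 figure in finding 1 is the kind of number produced by the usual pseudo-WTP calculation, dividing a feature's part-worth by the magnitude of the price coefficient. A minimal sketch follows; averaging respondent-level ratios is our own illustrative aggregation choice, and other conventions (such as the ratio of averages) give different numbers.

```python
import numpy as np

def pseudo_wtp(feature_partworths, price_coefs):
    """Common pseudo-WTP calculation: part-worth over the magnitude of the price coefficient.

    feature_partworths: per-respondent part-worth of the feature (utility units)
    price_coefs:        per-respondent price coefficients (negative, utility per dollar)
    Returns the average respondent-level ratio, in dollars.
    """
    ratios = np.asarray(feature_partworths) / -np.asarray(price_coefs)
    return ratios.mean()
```

As the paper argues, this number reflects only a shift in demand; it ignores both the competitive price response and substitution with the outside good, which is why it exceeds the equilibrium profit change.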
6. CONCLUSION

Valuation of product features is an important part of the development and marketing of new products as well as the valuation of patents that are related to feature enhancement. We take the position that the most sensible measure of the economic value of a feature enhancement (either the addition of a completely new feature or the enhancement of an existing feature) is incremental profits. That is, we compare the equilibrium outcomes in a marketplace in which one of the products (corresponding to the focal firm) is feature enhanced with the equilibrium profits in the same marketplace but where the focal firm's product is not feature enhanced. This measure of economic value can be used to make decisions about the development of new features or to choose between a set of features that could be enhanced. In the patent litigation setting, the value of the patent as well as the damages that may have occurred due to patent infringement should be based on an incremental profits concept. Conjoint studies can play a vital role in feature valuation provided that they are properly designed, analyzed, and supplemented by information on the competitive and cost structure of
the marketplace in which the feature-enhanced product is introduced. Conjoint methods can be used to develop a demand system but require careful attention to the inclusion of the outside option and inclusion of the relevant competing brands. Proper negativity constraints must be used to restrict the price coefficients to negative values. In addition, the Nash equilibrium prices computed on the basis of the conjoint-constructed demand system are sensitive to the precision of inference with respect to price sensitivity. This may mean larger and more informative samples than typically used in conjoint applications today. We explain why the current practice of using a change in "WTP" as a way of valuing a feature is not a valid measure of economic value. In particular, the calculations done today involving dividing the part-worths by the price coefficient are not even proper measures of WTP. Current pseudo-WTP measures have a tendency to overstate the economic value of feature enhancement as they are only measures of shifts in demand and do not take into account the competitive response to the feature enhancement. In general, firms competing against the focal feature-enhanced product will adjust their prices downward in response to the more formidable competition afforded by the feature-enhanced product. In addition, WTB analyses will also overstate the effects of feature enhancement on market share or sales, as these analyses also do not take into account the fact that a new equilibrium will prevail in the market after feature enhancement takes place. We illustrate our method by an application in the point and shoot digital camera market. We consider the addition of a swivel screen display to a point and shoot digital camera product. We designed and fielded a conjoint survey with all of the major brands and other major product features. Our equilibrium computations show that the economic value of the swivel screen is substantial and discernible from zero, but about one half of the pseudo-WTP measure commonly employed.
Greg Allenby
Peter Rossi
REFERENCES

Allenby, G. M., J. D. Brazell, J. R. Howell and P. E. Rossi (2014): "Economic Valuation of Product Features," working paper, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2359003

Berry, S., J. Levinsohn, and A. Pakes (1995): "Automobile Prices in Market Equilibrium," Econometrica, 63(4), 841–890.
Brazell, J., C. Diener, E. Karniouchina, W. Moore, V. Severin, and P.-F. Uldry (2006): "The No-Choice Option and Dual Response Choice Designs," Marketing Letters, 17(4), 255–268.

Chintagunta, P. K., and H. Nair (2011): "Discrete-Choice Models of Consumer Demand in Marketing," Marketing Science, 30(6), 977–996.

Ding, M., R. Grewal, and J. Liechty (2005): "Incentive-Aligned Conjoint," Journal of Marketing Research, 42(1), 67–82.

Dong, S., M. Ding, and J. Huber (2010): "A Simple Mechanism to Incentive-Align Conjoint Experiments," International Journal of Research in Marketing, 27(1), 25–32.

McFadden, D. L. (1981): "Econometric Models of Probabilistic Choice," in Structural Analysis of Discrete Choice, ed. by M. Intriligator and Z. Griliches, pp. 1395–1457. North-Holland.

Ofek, E., and V. Srinivasan (2002): "How Much Does the Market Value an Improvement in a Product Attribute," Marketing Science, 21(4), 398–411.

Orme, B. K. (2001): "Assessing the Monetary Value of Attribute Levels with Conjoint Analysis," Discussion paper, Sawtooth Software, Inc.

Petrin, A. (2002): "Quantifying the Benefits of New Products: The Case of the Minivan," Journal of Political Economy, 110(4), 705–729.

Rossi, P. E., G. M. Allenby, and R. E. McCulloch (2005): Bayesian Statistics and Marketing. John Wiley & Sons.

Sonnier, G., A. Ainslie, and T. Otter (2007): "Heterogeneity Distributions of Willingness-to-Pay in Choice Models," Quantitative Marketing and Economics, 5, 313–331.

Trajtenberg, M. (1989): "The Welfare Analysis of Product Innovations, with an Application to Computed Tomography Scanners," Journal of Political Economy, 97(2), 444–479.

Uldry, P., V. Severin, and C. Diener (2002): "Using a Dual Response Framework in Choice Modeling," AMA Advanced Research Techniques Forum.
THE BALLAD OF BEST AND WORST TATIANA DYACHENKO REBECCA WALKER NAYLOR GREG ALLENBY OHIO STATE UNIVERSITY “Best is best and worst is worst, and never the twain shall meet Till both are brought to one accord in a model that’s hard to beat.” (with apologies to Rudyard Kipling)
In this paper, we investigate the psychological processes underlying the Best-Worst choice procedure. We find evidence for sequential evaluation in Best-Worst tasks that is accompanied by sequence scaling and question framing scaling effects. We propose a model that accounts for these effects and show the superiority of our model over currently used models of single evaluation.
INTRODUCTION

Researchers in marketing are constantly developing tools and methodologies to improve the quality of inferences about consumer preferences. Examples include the use of hierarchical models of marketplace data and the development of novel ways of collecting survey data. An important aspect of this research involves validating and testing models that have been proposed. In this paper, we take a look at one of these relatively new methods, called "Maximum Difference Scaling," also known as Best-Worst choice tasks. This method was proposed (Finn & Louviere, 1992) to address the concern that insufficient information is collected for each individual respondent in discrete choice experiments. In Best-Worst choice tasks, each respondent is asked to make two selections: select the best, or most preferred, alternative and the worst, or least preferred, alternative from a list of items. Thus, the tool allows the researcher to collect twice as many responses on the same number of choice tasks from the same respondent. This tool has been extensively studied and compared to other tools available to marketing researchers (Bacon et al., 2007; Wirth, 2010; Marley et al., 2005; Marley et al., 2008). MaxDiff became popular in practice due to its superior performance (Wirth, 2010) compared to traditional choice-based tasks in which only one response, the best alternative, is collected. While we applaud the development of a tool that addresses the need for better inferences, we believe that marketing researchers need to think deeply about the analysis of the data coming from the tool. It is important to understand the assumptions that are built into the models that estimate parameters related to consumer preferences. The main assumption that underlies current analysis of MaxDiff data is the assumption of equivalency of the "select-the-best" and "select-the-worst" responses, meaning that the two response subsets contain the same quality of information that can be extracted to make inferences about consumer preferences. We can test this assumption of equivalency of information by performing the following analysis. We can split the Best-Worst
data into two sub-datasets—"best" only responses and "worst" only responses. If the assumption that respondents rank items from top to bottom is true, then we can run the same model to obtain inferences on the preference parameters in both data subsets. If the assumption of one-time ranking is true, the parameters recovered from these data sets should be almost the same, or at least close to each other. Figure 1 shows the findings from performing this analysis on an actual dataset generated by the Best-Worst task. We plotted the means of estimated preference parameters from the "select-the-best" only responses on the horizontal axis and from the "select-the-worst" only responses on the vertical axis. If the two subsets from the Best-Worst data contained the same quality of information about preference parameters, then all points would lie on or close to the 45-degree line. We see that there are two types of systematic non-equivalence between the two datasets. First, the datasets differ in the size of the range that parameters span for Best and Worst responses. This result is interesting, as it would indicate more consistency in Best responses than in Worst. Second, there seems to be a possible relationship between the Best and Worst parameters. These two factors indicate that the current model's assumption of single, or one-time, evaluation in Best-Worst tasks should be re-evaluated, as the actual data do not support this assumption. (A minimal sketch of this split-sample check appears after Figure 1.)

Figure 1. Means of preference parameters estimated from the "select-the-best" subset (horizontal axis) and from the "select-the-worst" subset (vertical axis)
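The sketch below illustrates the split-sample diagnostic just described, assuming pooled (non-hierarchical) multinomial logit fits in which the "worst" responses are modeled with negated utilities over the full item set. The paper's actual estimation is hierarchical Bayes, so this is only an illustrative approximation; the function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X_sets, choices, sign=1.0):
    """Pooled MNL negative log-likelihood.

    X_sets:  (T, J, K) attributes for T tasks with J items each
    choices: (T,) index of the item selected in each task
    sign:    +1.0 for 'select-the-best', -1.0 for 'select-the-worst' (negated utilities)
    """
    v = sign * (X_sets @ beta)                               # (T, J) utilities
    v = v - v.max(axis=1, keepdims=True)                     # numerical stability
    logp = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(choices)), choices].sum()

def fit_split(X_sets, best_idx, worst_idx, k):
    """Estimate beta separately from best-only and worst-only responses."""
    b_best = minimize(neg_loglik, np.zeros(k), args=(X_sets, np.asarray(best_idx), +1.0)).x
    b_worst = minimize(neg_loglik, np.zeros(k), args=(X_sets, np.asarray(worst_idx), -1.0)).x
    return b_best, b_worst    # plot one against the other; equivalence => 45-degree line
```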
PROPOSED APPROACH

To understand the results presented in Figure 1, we take a deeper look at how the decisions in the Best-Worst choice tasks are made, that is, we consider possible data-generating mechanisms in these tasks. To think about these processes, we turned to the psychology literature that presents
a vast amount of evidence that would point to the fact that we should not expect the information collected from the "select-the-best" and "select-the-worst" decisions to be equivalent. This literature also provides a source of multiple theories that can help drive how we think about and analyze the data from Best-Worst choice tasks. In this paper, we present an approach that takes advantage of theories in the psychology literature. We allow these psychological theories to drive the development of the model specification for the Best-Worst choice tasks. We incorporate several elements into the mathematical expression of the model, the presence of which is driven by specific theories. The first component is related to sequential answering of the questions in the Best-Worst choice tasks. Sequential evaluation is one of the simplifying mechanisms that we believe is used by respondents in these tasks. This mechanism allows for two possible decision sequences: selecting the best first and then moving to selecting the worst alternative, or answering the "worst" question first and then choosing the "best" alternative. This is in contrast to the assumptions of the two current models developed for these tasks: single ranking, as described above, and a pairwise comparison of all items presented in the choice task, where people are assumed to maximize the distance between the two items. The sequential decision making that we assume in our model generates two different conditions under which respondents provide their answers because there are different numbers of items that people evaluate in the first and the second decision. In the first response, there is a full list of items from which to make a choice, while the second choice involves a subset of items with one fewer alternative because the first selected item is excluded from the subsequent decision making. This possibly changes the difficulty of the task as respondents move from the first to the second choice, making the second decision easier with respect to the number of items that need to be processed. This change in the task in the second decision is represented in our model through the parameter ψ (order effect), which is expected to be greater than one to reflect that the second decision is less error prone because it is easier. Another component of our model deals with the nature of the "select-the-best" and "select-the-worst" questions. We believe that there is another effect that can be accounted for in our sequential evaluation model for these choice tasks that cannot be included in any of the single-evaluation models. As discussed above, "sequential evaluation" means that there are two questions in Best-Worst tasks, "select-the-best" and "select-the-worst," that are answered sequentially. But these two questions require switching between the mindsets that drive the responses to them. To select the best alternative, a person retrieves experiences and memories that are congruent with the question at hand—"select-the-best." The other question, "select-the-worst," is framed such that another, possibly overlapping, set of memories and associations is retrieved that is more congruent with that question. The process of such biased memory retrieval is described in the psychology literature exploring hypothesis testing theory and the confirmation bias (Snyder, 1981; Hoch and Ha, 1986). This literature suggests that people are more likely to attend to information that is consistent with the hypothesis at hand.
In the Best-Worst choice tasks, the temporary hypothesis in the “Best” question is “find the best alternative,” so that people are likely to attend mostly to memories related to the best or most important things that happened to them. The “Worst” question would generate another hypothesis “select the worst” that people would be trying to
confirm. This would create a different mental frame, making people think about other, possibly bad or less important, experiences to answer that question. The subsets of memories from the two questions might be different or overlap partially. We believe that there is an overlap and, thus, that the differences in preference parameters between the two questions can be represented by a change in scale. This is the scale parameter λ (question framing effect) in our model. However, if the retrievals in the two questions are independent and generate different samples of memories, then a model where we allow for independent preference β parameters would perform better than the model that only adjusts the scale parameter. The third component of the model is related to the error term distribution in models for Best-Worst choice tasks. Traditionally, a logit specification that is based on the maximum extreme value assumption of the error term is used. This is mostly due to the mathematical and computational convenience of these models: the probability expressions have closed forms and, hence, the model can be estimated relatively fast. We, however, want to give the error term distributional assumption serious consideration by thinking about more appropriate reasons for the use of extreme value (asymmetric) versus normal (symmetric) distributional assumptions. The question is: can we use the psychology literature to help us justify the use of one specification versus another? As an example of how it can be done, we use the theory of episodic versus semantic memory retrieval and processing (Tulving, 1972). When answering Best-Worst questions, people need to summarize the subsets of information that were just retrieved from memory. If the memories and associations are aggregated by averaging (or summing) over the episodes and experiences (which would be consistent with a semantic information processing and retrieval mechanism), then that would be consistent with the use of a normally distributed (symmetric) error term, due to the Central Limit Theorem. However, if respondents pay attention to specific episodes within these samples of information, looking for the most representative episodes to answer the question at hand (which would be consistent with an episodic memory processing mechanism), then the extreme value error term assumption would be justified. This is due to extreme value theory, which says that the maximum (or minimum) of a large number of random draws is approximately distributed as a max (or min) extreme value random variable. Thus, in the "select-the-best" decision it is appropriate to use the maximum extreme value error term, and in the "select-the-worst" question, the minimum extreme value distribution is justified. Equation 1 is the model for one Best-Worst decision task. This equation shows the model based on episodic memory processing, or extreme value error terms. It includes the two possible sequences, indexed by the parameter θ, the order scale parameter ψ in the second decision, exclusion of the first choice from the set in the second decision, and our question framing scaling parameter λ. The model with the normal error term assumption has the same conceptual structure, but the choice probabilities have different expressions.
Equation 1. Sequential Evaluation Model (logit specification)
This is a generalized model that includes some existing models as special cases. For example, if we use a probability weight instead of the sequence indicator θ, then under specific values of that parameter our model includes the traditional MaxDiff model. The concordant model by Marley et al. (2005) would also be a special case of our modified model.
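Equation 1 appears as an image in the original proceedings. The sketch below gives one plausible reading of the logit specification for the best-then-worst sequence described above, with the best item excluded from the second stage, ψ scaling the second decision, and λ scaling the negated utilities in the "worst" question. The exact parameterization in the paper (for example, precisely where ψ and λ enter, and how the two sequences are mixed via θ) may differ.

```python
import numpy as np

def bw_prob_best_first(beta, x, best, worst, psi, lam):
    """One plausible reading of the sequential logit for a best-then-worst sequence.

    x: (J, K) item attributes for a single task; best/worst: chosen item indices.
    psi scales the second (easier) decision; lam is the question-framing scale
    applied when the respondent switches to the 'select-the-worst' mindset.
    """
    v = x @ beta                                         # (J,) item utilities
    p_best = np.exp(v[best]) / np.exp(v).sum()           # first choice over all J items
    keep = [j for j in range(len(v)) if j != best]       # best item dropped from stage two
    w = -psi * lam * v                                   # negated, rescaled utilities
    p_worst = np.exp(w[worst]) / np.exp(w[keep]).sum()
    return p_best * p_worst

# The full model mixes this term with the analogous worst-then-best term via the
# latent sequence indicator theta; the normal-error version replaces the logit
# probabilities with probit-style integrals.
```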
EMPIRICAL APPLICATION AND RESULTS

We applied our model to data collected from an SSI panel. Respondents went through 15 choice tasks with five items each, as shown in Figure 2. The items came from a list of 15 hair care concerns and issues. We analyzed responses from 594 female respondents over 50 years old. This segment of the population is known for high involvement with the hair care category; in our sample, 65% of respondents expressed some level of involvement with the category.

Figure 2. Best-Worst task
We estimated our proposed models with and without the proposed effects. We used hierarchical Bayesian estimation in which the preference parameters β, the order effect ψ and the context effect λ are heterogeneous. To ensure empirical identification, the latent sequence parameter θ is estimated as an indicator parameter from a Bernoulli distribution and is assumed to be the same for all respondents. We use standard priors for the parameters of interest. Table 1 shows the improvement in model fit (log marginal density, Newton-Raftery estimator) as a result of the presence of each effect, that is, the marginal effect of each model element. Table 2 shows in-sample and holdout hit probabilities for the Best-Worst pair (random chance is 0.05).
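The log marginal densities reported in Table 1 use the Newton-Raftery estimator, which is the harmonic mean of the likelihood over posterior draws. A minimal, numerically stable sketch (the function name is ours; it takes log-likelihood values evaluated at posterior draws from any sampler):

```python
import numpy as np
from scipy.special import logsumexp

def lmd_newton_raftery(loglik_draws):
    """Newton-Raftery (harmonic mean) estimate of the log marginal density.

    loglik_draws: log-likelihood of the data evaluated at each posterior draw.
    log m_hat = log S - logsumexp(-loglik_draws), with S the number of draws.
    """
    loglik_draws = np.asarray(loglik_draws)
    return np.log(len(loglik_draws)) - logsumexp(-loglik_draws)
```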
Table 1. Model Fit

Model                                   LMD (NR)
Exploded logit (single evaluation)      -13,040
Context effect only                     -12,455
Order effect only                       -11,755
Context and order effects together      -11,051
These tables show significant improvement in fit from each of the components of the model. The strongest improvement comes from the order effect, indicating that the sequential mechanisms we assumed are more plausible given the data than the model with the assumption of single evaluation. The context effect improves the fit as well, indicating that it is likely that the two questions, "select-the-best" and "select-the-worst," are processed differently by respondents. The model with both effects included is the best model not just with respect to in-sample fit, but also in terms of holdout performance.

Table 2. Improvement in Model Fit

Model                                   In-sample Hit Prob.  In-sample Impr.*  Holdout Hit Prob.  Holdout Impr.*
Exploded logit (single evaluation)      0.3062               -                 0.2173             -
Context effect only                     0.3168               3.5%              0.2226             2.4%
Order effect only                       0.3443               12.4%             0.2356             8.4%
Context and order effects together      0.3789               23.7%             0.2499             15.0%
* Improvements are calculated over the metric in the first row, which comes from the model that assumes single evaluation (ranking) in Best-Worst tasks.

We found that both error term assumptions (symmetric and asymmetric) are plausible, as the fits of the models are very similar. Based on that finding, we can recommend using our sequential logit model, as it has computational advantages over the sequential probit model. The remaining results we present are based on the sequential logit model. We also found that the presence of dependent preference parameters between the "Best" and "Worst" questions (the question framing scale effect λ) is a better fitting assumption than the assumption of independence of the β's from the two questions. From a managerial standpoint, we want to show why it is important to use our sequential evaluation model instead of single evaluation models. We compared individual preference parameters from two models: our best performing model and the exploded logit specification (single-evaluation ranking model). Table 3 shows the proportion of respondents for whom a subset of top items is the same between these two models (a minimal sketch of this comparison appears after Table 3). For example, for the top 3 items related to hair care concerns and issues, the two models agree for only 61% of respondents. If we take into account the order within these subsets, then the matching proportion drops to 46%. This means that for more than half of the respondents in our study, the findings and recommendations will differ between the two models. Given the fact that our model of
sequential evaluation is a better fitting model, we suggest that the results from single evaluation models can be misleading for managerial implications and that the results from our sequential evaluation model should be used.

Table 3. Proportion of respondents matched on top n items of importance between sequential and single evaluation (exploded logit) models

Top n items   Proportion of respondents        Proportion of respondents
              (Order does not matter)          (Order does matter)
1             83.7%                            83.7%
2             72.1%                            65.0%
3             61.1%                            46.5%
4             53.2%                            29.1%
5             47.0%                            18.4%
6             37.7%                            10.4%
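The agreement proportions in Table 3 can be computed directly from two sets of respondent-level estimates; the sketch below shows one way to do it (the function and argument names are ours, and respondent-level posterior means are assumed as inputs).

```python
import numpy as np

def top_n_agreement(betas_a, betas_b, n, ordered=False):
    """Share of respondents whose top-n items agree between two sets of estimates.

    betas_a, betas_b: (R, K) respondent-level preference parameters from two models.
    ordered=False compares the top-n sets; ordered=True also requires the same order.
    """
    rank_a = np.argsort(-betas_a, axis=1)[:, :n]          # top-n item indices, model A
    rank_b = np.argsort(-betas_b, axis=1)[:, :n]          # top-n item indices, model B
    if ordered:
        match = np.all(rank_a == rank_b, axis=1)
    else:
        match = np.array([set(a) == set(b) for a, b in zip(rank_a, rank_b)])
    return match.mean()
```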
Our sequential evaluation model also provides additional insights about the processes that are present in Best-Worst choice tasks. First, we found that in these tasks respondents are more likely to eliminate the worst alternative from the list and then select the best one. This is consistent with literature that suggests that people, when presented with multiple alternatives, are more likely to simplify the task by eliminating, or screening out, some options (Ordonez, 1999; Beach and Potter, 1992). Given the nature of the task in our application, where respondents had to select the most important and least important items, it is not surprising that eliminating what is not important first would be the most likely strategy. This finding, however, is in contrast with the click data that was collected in these tasks. We found that about 68% of clicks were best-then-worst. To understand this discrepancy, we added the observed sequence information into our model by substituting the indicator of latent sequence θ with the decision order that we observed. Table 4 shows the fit of these models. The data on observed sequence make the fit of the model worse. This suggests that researchers need to be careful when assuming that click data are a good representation of the latent processes driving consumer decisions in Best-Worst choice tasks.

Table 4. Fit of the models with latent and observed sequence of decisions

Model                 LMD (NR)    In-sample Hit Probability    Holdout Hit Probability
Latent sequence       -11,051     0.3789                       0.2499
Observed sequence     -12,392     0.3210                       0.2247
To investigate order further, we manipulated the order of the decisions by collecting responses from two groups. One group was forced to select the best alternative first and then select the worst, and the second group was forced to select in the opposite order. We found that the fit of our model with the indicator of latent sequence is the same as for the group that was required to select the worst alternative first. This analysis gives us more confidence in our
finding. Understanding why the click data seem to be inconsistent with the underlying decision-making processes in respondents' minds is outside the scope of this paper, but it is an important topic for future research. Our model also gives us an opportunity to learn about other effects present in Best-Worst choice tasks and to account for those effects. For instance, there is a difference in the certainty level between the first and the second decisions. As expected, the second decision is less error prone than the first. The mean of the posterior distribution of the order effect ψ is greater than one for almost all respondents. This finding is consistent with our expectation that the decrease in the difficulty of the task in the second choice will impact the certainty level. While we have not directly tested the impact of the number of items on the list on the certainty level, our finding is expected. Another effect that we have included in our model is the scale effect of question framing λ, which represents the level of certainty in the parameters that a researcher obtains between the best and worst selections as the result of the response elicitation procedure—"best" versus "worst." We found that the sample average for this parameter is 1.17, which is greater than one. This means that, on average, respondents in our sample are more consistent in their "worst" choices. However, we found significant heterogeneity in this parameter among respondents. To understand what can explain the heterogeneity in this parameter, we performed a post-estimation analysis of the context scale parameter as it relates to an individual's expertise level, which was also collected in the survey. We found a negative correlation of -0.16 between the means of the context effect parameter and the level of expertise, meaning that experts are likely to be more consistent about what is important to them and non-experts are more consistent about what is not important to them. We also found a significant negative correlation (-0.20) between the direct measure of the difficulty of the "select-the-worst" items and the context effect parameter, indicating that if it was easier to respond to the "select-the-worst" questions, λ was larger, which is consistent with our proposition and expectations.
CONCLUSIONS

In this paper, we proposed a model to analyze data from Best-Worst choice tasks. We showed how the development of the model specification can be driven by theories from the psychology literature. We took a deep look at how we can think about the possible processes that underlie decisions in these tasks and how to reflect them in the mathematical representation of the data-generating mechanism. We found that our proposed model of sequential evaluation is a better fitting model than the currently used models of single evaluation. We showed that adding the sequential nature to the model specification allows other effects to be taken into consideration. We found that the second decision is more certain than the first decision, and that the "worst" decision is, on average, more certain. Finally, we demonstrated the managerial implications of the proposed model. Our model, which takes into account psychological processes within Best-Worst choice tasks, gives different results about what is most important to specific respondents. This finding has direct implications for
new product development initiatives and understanding the underlying needs and concerns of customers.
Greg Allenby
REFERENCES

Bacon, L., Lenk, P., Seryakova, K., & Veccia, E. (2007) "Making MaxDiff More Informative: Statistical Data Fusion by Way of Latent Variable Modeling," Sawtooth Software Conference Proceedings, 327–343.

Beach, L. R., & Potter, R. E. (1992) "The pre-choice screening of options," Acta Psychologica, 81(2), 115–126.

Finn, A., & Louviere, J. (1992) "Determining the Appropriate Response to Evidence of Public Concern: The Case of Food Safety," Journal of Public Policy & Marketing, 11(2), 12–25.

Hoch, S. J., & Ha, Y.-W. (1986) "Consumer Learning: Advertising and the Ambiguity of Product Experience," Journal of Consumer Research, 13, 221–233.

Marley, A. A. J., & Louviere, J. J. (2005) "Some Probabilistic Models of Best, Worst, and Best-Worst Choices," Journal of Mathematical Psychology, 49, 464–480.

Marley, A. A. J., Flynn, T. N., & Louviere, J. J. (2008) "Probabilistic Models of Set-Dependent and Attribute-Level Best-Worst Choice," Journal of Mathematical Psychology, 52, 281–296.

Ordóñez, L. D., Benson, L., III, & Beach, L. R. (1999) "Testing the Compatibility Test: How Instructions, Accountability, and Anticipated Regret Affect Prechoice Screening of Options," Organizational Behavior and Human Decision Processes, 78, 63–80.

Snyder, M. (1981) "Seek and ye shall find: Testing hypotheses about other people," in C. Heiman, E. Higgins and M. Zanna, eds., Social Cognition: The Ontario Symposium on Personality and Social Psychology, Hillsdale, NJ: Erlbaum, 277–303.

Tulving, E. (1972) "Episodic and Semantic Memory," in E. Tulving and W. Donaldson, eds., Organization of Memory, New York and London: Academic Press, 381–402.
Wirth, R. (2010) “HB-CBC, HB-Best-Worst_CBC or NO HB at All,” Sawtooth Software Conference Proceedings, 321–356.