The birth of a new measurement scale

I've been working on the development of two new self-report scales over the past 4 or 5 years, and thought it might be interesting for those considering doing the same to get an appreciation for how rigorous the process of scale development *should* be.  The scale I'm going to describe here is meant to provide an indication of the level of recovery following a musculoskeletal injury or nonspecific pain.  I've been focusing on people with whiplash (we are cWhIP after all) but I believe it will hold relevance for many different conditions.  So without further ado, here is a brief history of the new Satisfaction, Interference and Recovery Index, or SIRI (that's right Apple, I came up with the name 3 years ago).

Step 1: Identify a gap

This process started back, way back, in my first year of PhD studies at the University of Western Ontario (now called Western University), which would have been 2006.  As is customary in doctoral studies, the first step was a literature review to develop an understanding of the state of knowledge in a particular area.  My interest being in prognosis following acute WAD, I undertook a systematic review of prognostic factors, which then morphed into a meta-analysis of the topic.  What I didn't anticipate when undertaking that work was how inconsistent the thing previous authors were trying to predict would be.  So much so, in fact, that part of my comprehensive (aka 'candidacy' in some places) exam was a review of the various ways that 'recovery' had been defined in the literature to that point.  This variation made any meaningful statistical pooling very difficult, and in fact I had to distill the literature down quite a bit to end up with a pool of studies whose outcomes were at least similar enough to allow something meaningful to come out of it.  So, there was a gap identified - there was (and still is) almost no consistency in the way 'recovery' is defined in prognostic studies.

Step 2: Develop a theoretical framework

So with the idea in mind that I would try to create some kind of meaningful indicator of recovery, I began digging into literature that I would never otherwise have touched.  I learned about Maslow's Hierarchy of Needs, Higgins' Self-Discrepancy Theory, and Ryan and Deci's musings on happiness, health and eudaimonism, to name a few.  I took graduate level courses in Health Psychology, and in Philosophy and Epistemology.  And eventually, after about 2 years' worth of a flurry of knowledge absorption and no fewer than 4 complete re-writes from scratch, I managed, with the wisdom of epidemiologist Dr. Joy MacDermid and psychologist Dr. Warren Nielson, to come up with a coherent manuscript that served as a foundational theoretical framework from which to start building my new scale.  But this was of course only 3 people's opinions.  So...

Step 3: Determine what key stakeholders have to say

The next step was to ask the people whose opinions actually mattered: people in pain, and the people who care for them.  So that's what we did.  A series of focus groups, followed by an independent series of one-on-one interviews, ensued that included over 20 people with neck pain and 15 rehabilitation professionals.  The question we posed to the people with neck pain was: "How will you know when you're recovered?  Or in other words, how will the recovered version of you differ from the current version of you?"  The question we posed to the clinicians was "How do you know when your patients are recovered?"  The focus sessions followed a nominal group technique and the one-on-one interviews used a structured interview schedule.  The results of these, especially the patient sessions, were fascinating to my mind.  The clinician sessions were good too of course, but given that I am myself a clinician, I had a bit of an idea of what they might say.  A manuscript with these results has been written but needs a bit of revision to add in the newest information.  Here's a graphic of the results from the patient focus groups that informed the next stage of scale development:

What you're seeing here in the main chart are the six domains that came out of a thematic analysis of all the different recovery indicators offered by the patients.  The smaller graphs represent sub-domains of their same-coloured pie slice.  Manageable symptoms and functional capacity are nothing new, but things like autonomy & spontaneity, satisfying relationships and (most interesting to me) a positive future outlook are a bit more intriguing.  As one of the informants put it (and I'm paraphrasing here):

'Look, they could give me a pill tomorrow that might take all of my pain away, but would I feel recovered?  No.  It's not about what I may or may not be able to do tomorrow, it's what I think I may or may not be able to accomplish in the future that's important to me.  If I knew that things would never get worse than they are right now, then I could probably live with this and then I would say, ya, maybe I am recovered.'

Brilliant.  Every scale we've currently got focuses on the present: how much lifting, reading, driving, sitting, etc. can you do right now?  I am not aware of any scale that actually asks whether people are confident in their future.

So now we've got a sound framework built on both theory and empirical knowledge, and an understanding of what key stakeholders believe.  On to the next step.

Step 4: Generating items

This is arguably one of the most difficult but important tasks of scale development.  If the items on a scale aren't good, then all of the testing in the world isn't going to save it.  As is standard practice, we developed far more items than we would need.  For this I recruited a group of 6 experienced clinicians, showed them the data we had so far, and asked each of them to independently generate items that tapped into each domain.  I also did the same.  When we pooled our items together, we had generated over 300!  Through a rather lengthy process of consensus-building, we whittled that number down to a manageable 15.

Part of this process, which is often overlooked but is so vital, was deciding on what type of response options we wanted to offer.  This is no easy process, and I'm not sure I'll delve fully into it here, but it requires consideration of things like: respondent burden, ease of scoring and interpretation, the intended use of the scale, the types of statistical analyses you would ultimately like to perform, and the general conceptual framework of what you're trying to collect (i.e. opinions, frequencies, intensities, etc.).  Because the goal was a scale that would be stable over time when no change had occurred but sensitive to important change, would provide an easily interpretable score, and could be logically considered continuous-level data for statistical analysis (that last one is always a bit of a balancing act), we ultimately settled on a 0-10 scale, guided by our belief that most people are at least familiar with the general function of the traditional 0-10 scale.

Of course the next step was to decide what to measure on that 0-10 scale.  Again, after a long bout of considerations that I won't bore you with, we decided on measuring satisfaction in the various life domains we were capturing.  This was an intentional departure from traditional 'disability' scales, which by definition have a negative valence to the wording of their items.  A satisfaction scale, by contrast, has a positive valence in its wording, and frankly I believe that people in pain probably get tired of rating how 'disabled' they are; rating their satisfaction would be a welcome change.

There was one more consideration, and I'm still waffling a little on this one but I think it's warranted.  Many disability scales assume equal weighting of all items for all people.  So, for example, an item pertaining to driving is given the same weight as an item pertaining to headaches.  This is problematic in my mind, especially when the intention is to establish the state of the person as either recovered or not.  Someone who doesn't drive may still rate high driving difficulty, making it an arguably irrelevant item.  Reading a newspaper would be another good example - many people don't even get a daily newspaper, so it's irrelevant.  So too on our scale: despite the fact that we believed each item was measuring something important to people in pain, we weren't prepared to assume that all items were equally important.  As such, we added a second set of responses to every item: a 0-10 scale of 'importance'.

A final consideration was to separate the symptom items from the more general satisfaction items, mostly because they were conceptually distinct - it's possible to be satisfied in the presence of pain, just as it's possible to be dissatisfied despite the absence of pain.  We now had two subscales on our index: a symptoms subscale and a satisfaction/importance subscale.

Step 5: What do the experts think?

Armed with our prototype 15-item scale that by this point had taken the better part of 2 years to construct, the next step was to establish some content validity.  We already had some evidence of that based on the steps we had taken to this point, but we now wanted to get some academic input.  The prototype tool, along with a brief explanation of the thinking behind each item, was sent to several (I forget the number now, but in the area of 15) prominent experts in various fields - medicine, psychology, epidemiology, pain science, and the like.  We also sent it to some people involved in policy decisions.  This step is a bit hard on the nerves, especially considering the time it's taken to this point, but it's necessary and very worthwhile.  The feedback from these experts was extremely valuable, and we used it to refine the tool into a second version.  That second version was then sent to a professional technical editor to ensure good layout and readability, and her feedback was also excellent.  The third version of the tool followed.

Step 6: Pilot testing

Before implementing widespread testing, we piloted the tool with patients in a few relevant clinical practices.  They were invited to complete the scale, and afterwards I sat down with them and we went through each item individually.  I asked them what they thought the item was asking, and why they chose the response they did.  This process, known widely as 'cognitive debriefing', ensures that people are reading your items in the way you think you've written them.  This, of course, again led to some more refinements.

Step 7: Let's try to establish some validity

By this point, the scale had already been through several revisions and had been pared down to 13 items based on the feedback we had received.  This is where some of you who follow me on Twitter will have come in.  We received ethics approval through our institutional review board to run an online data collection protocol.  Here we used Twitter, Facebook Ads and Google AdWords to try to recruit people with neck or low back pain to complete the survey.  We gave each respondent a set of items intended to measure recovery from different angles (work status, medication usage, compensation/litigation involvement, treatment received, general recovery status), along with an existing and well-established region-specific disability measure, either the Neck Disability Index or the Roland Morris Low Back Disability Questionnaire, and also the prototype SIRI tool.  At the end of the survey we also asked some general questions about the usefulness of the tools, the readability and layout, and the degree to which people thought they really measured recovery.  The order of the established measure and the SIRI was randomized to prevent any order effect from completing the scales.  I also impishly included a validation question halfway through the survey, which was simply 'please choose number 4 (or 6) in this line', just to make sure people were actually paying attention.
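For anyone curious about the mechanics of those two quality-control features, here's a minimal sketch of how they might be implemented.  To be clear, we used an online survey platform rather than hand-rolled code; the function names and layout below are hypothetical illustrations of the general logic, not our actual protocol.

```python
import random

def assign_scale_order() -> list[str]:
    """Randomize the presentation order of the established disability
    measure and the prototype SIRI, to prevent order effects."""
    order = ["established_measure", "SIRI"]
    random.shuffle(order)
    return order

def passed_validation(response: int, instructed_value: int) -> bool:
    """The embedded validation item instructs the respondent to choose
    a specific number (e.g. 4 or 6); any other answer suggests the
    respondent wasn't actually reading the items."""
    return response == instructed_value
```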

We were shooting for 100 respondents at this stage, but only got 75.  Unfortunately, 14 of those were either incomplete or failed the validation question, which left us with 61 responses upon which to run some preliminary analyses.  Through a process that included inter-item and item-total correlations, correlations with the existing measures, and mean group differences on each item between the levels of our omnibus recovery indicators (described in the parentheses in the paragraph above), the satisfaction items were reduced to 10, each of which appears to provide different but valuable information on the recovery status of the individual.  The satisfaction subscale and symptoms subscale are interpreted separately, and when compared to the existing region-specific disability measures, they function at least as well, if not better in some cases, depending on the recovery indicator being studied.
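For those who like to see the mechanics, here's a minimal sketch of the item-reduction statistics, assuming a pandas DataFrame with one row per respondent; the file name and 'sat_' column prefix are my own invention, not the actual dataset.  The corrected item-total correlation (each item against the sum of the remaining items) is the standard workhorse here; items that correlate poorly with the rest of the scale become candidates for removal.

```python
import pandas as pd

# One row per respondent, one column per satisfaction item.
# The file name and column prefix are hypothetical.
df = pd.read_csv("siri_pilot_responses.csv")
items = [c for c in df.columns if c.startswith("sat_")]

# Corrected item-total correlation: correlate each item with the sum
# of the OTHER items, so the item can't inflate its own correlation.
for item in items:
    rest_total = df[items].drop(columns=item).sum(axis=1)
    print(f"{item}: corrected item-total r = {df[item].corr(rest_total):.2f}")

# Inter-item correlation matrix, useful for spotting redundant pairs.
print(df[items].corr().round(2))
```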

So as it stands right now, as of Feb. 6th, 2012, the SIRI is composed of a symptoms subscale that lists 11 symptoms and captures the frequency and intensity of each, and a satisfaction subscale, the scores of which are weighted against the rated importance of each item.  The algorithm isn't difficult but does require a few calculations, for which I will be providing a calculator on this website and (hopefully) a smartphone application shortly.  I don't think it's quite ready for primetime yet though.  I still need to look at stability and sensitivity to change (related concepts), and also whether this scale provides any additional information on recovery status over and above the existing disability scales.  A quick regression analysis suggests that it does, but I'll save a more formal analysis for a larger sample.
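Until the calculator is up, here's a rough sketch of how importance weighting generally works in scoring.  Note that this is a generic importance-weighted mean, offered as an illustration of the approach rather than the actual SIRI scoring rules (which I'm not posting here), and the example items are invented.

```python
def weighted_satisfaction_score(satisfaction: list[float],
                                importance: list[float]) -> float:
    """Generic importance-weighted mean: items a respondent rates as
    unimportant (weight near 0) contribute little to the score, which
    is the rationale for the second 0-10 'importance' rating."""
    assert len(satisfaction) == len(importance)
    total_weight = sum(importance)
    if total_weight == 0:
        return 0.0  # respondent rated no item as important
    return sum(s * w for s, w in zip(satisfaction, importance)) / total_weight

# Hypothetical example: the 'driving' item is rated unimportant (0),
# so low satisfaction there doesn't drag the overall score down.
satisfaction = [8.0, 7.0, 2.0]   # e.g. relationships, outlook, driving
importance   = [9.0, 8.0, 0.0]
print(round(weighted_satisfaction_score(satisfaction, importance), 1))  # 7.5
```

The practical upshot of a weighting scheme like this is that the same satisfaction rating can contribute very differently to two respondents' scores, which is exactly the point of collecting importance alongside satisfaction.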

While I won't post the scale here for fear of losing my intellectual property, I would like to see it used in clinical practice and am willing to share it with clinicians who would like to try it out.  I can provide some information on scoring and important cut-scores for interpreting the results.  For anyone interested, contact me at dwalton5@uwo.ca.