Measuring Pain Catastrophizing in the clinical setting

This post concerns our recent publication in the Clinical Journal of Pain, in which Drs. Tim Wideman, Michael Sullivan and I performed a Rasch analysis on the commonly used Pain Catastrophizing Scale. For those interested in the true nitty-gritty of the methods, analyses and results, I’ll refer you to the paper itself. However, a cursory summary of the findings is warranted, after which I’ll offer up some opinions regarding the clinical or real-world implications.

We used a database of 235 subjects with work-related injuries (93% low back pain) from Quebec for this analysis. Rasch analysis is considered one of the ‘new’ techniques for evaluating the properties of a scale (although it’s been around since the early ‘60s). Some readers may also be familiar with the term Item Response Theory (IRT), another ‘new’ approach to scale evaluation that has likewise been around since the ‘60s. These are in contrast to approaches drawn from Classical Test Theory (CTT), with which most readers will likely be familiar; CTT gives us such characteristics as reliability, minimum detectable change, and construct validity, to name a few. Rasch and IRT purists will argue that CTT is a ‘weak’ approach to scale evaluation, in that it necessarily involves the non-confirmable notion of random error, which can never be either accepted or refuted. Rasch and IRT do not include random error in their mathematical models, and hence are largely considered more conceptually sound (although, as with anything in statistics, alternative opinions abound).

Anyhoo, the value of Rasch analysis is that it allows for some very deep exploration of the properties of a scale, right down to the level of the individual item and the individual response options. The end result of a scale that ‘fits’ the Rasch model is that the scale can be confidently considered an interval-level measurement tool. Why is this important? For starters, most statistical tests (e.g. t-tests, ANOVA, Pearson correlation, linear regression) are only strictly appropriate if the data are normally distributed and interval-level. In fact, most mathematical procedures require interval-level data, which in a simple nutshell means that the conceptual difference between a 1 and a 2 is the same as the distance between a 10 and an 11. If you want to perform any kind of mathematical operation (addition, subtraction, multiplication or division), then this property has to be true in order for the result to make any sense. This cannot be assumed for most ordinal-level scales, which form the vast majority of scale types in most health-related fields. For example, we can’t assume that the distance between ‘strongly disagree’ and ‘disagree’ is the same as the distance between ‘disagree’ and ‘agree’: the latter requires a conceptual transition across the threshold from general disagreement to general agreement, while the former simply spans different levels of disagreement. So basically, in order for almost all of our current knowledge on pain catastrophizing and its effects on things like treatment effectiveness or long-term outcomes to be valid, the Pain Catastrophizing Scale (PCS) had better act like an interval-level measure when tested. To keep this story at least somewhat short, the results of our analysis indicate that it does, but with some potentially important caveats.
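For the programmatically inclined, here’s a toy sketch (entirely made-up numbers, nothing to do with our actual analysis) of why this matters. With ordinal responses, only the order of the options is known, so any numeric coding that preserves that order is equally defensible, and the choice of coding can change the apparent result:

```python
# Toy illustration (not from the paper): why ordinal codes can't simply be
# treated as interval-level numbers.

# Coding A assumes the response options are evenly spaced.
coding_a = {"strongly disagree": 0, "disagree": 1, "agree": 2, "strongly agree": 3}

# Coding B assumes the disagree -> agree threshold is a bigger conceptual jump.
# For purely ordinal data this is just as defensible, since only the order is known.
coding_b = {"strongly disagree": 0, "disagree": 1, "agree": 4, "strongly agree": 5}

# Two hypothetical respondents, before and after some treatment.
before = ["disagree", "disagree"]
after = ["agree", "strongly disagree"]

for name, coding in [("evenly spaced", coding_a), ("unevenly spaced", coding_b)]:
    change = sum(coding[a] - coding[b] for a, b in zip(after, before))
    print(f"Total change under the {name} coding: {change:+d}")

# Prints +0 under one coding and +2 under the other, even though the raw ordinal
# responses are identical -- which is why interval-level behaviour has to be
# demonstrated (e.g. via Rasch analysis) rather than assumed.
```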

The first is that the response options for two items (items 8 and 12) were somewhat disordered; that is, respondents appeared to have some difficulty in reliably answering those questions with the response options given. This is not particularly surprising, and in fact I’ll admit some surprise that there weren’t more disordered responses identified. The reason I say this is that a critical scrutiny of the response options on the PCS reveals some ambiguity in terms of what they’re actually measuring. The first 4 options (not at all, to a slight degree, to a moderate degree, to a great degree) appear to be measuring the magnitude of the individual’s experience with each item. On the other hand, the 5th and final option, ‘all the time’, is clearly measuring the frequency of the same experience. In other words, one might logically think it’s possible for someone to feel slightly or moderately afraid that the pain will get worse (item 6) but feel that all the time. Similarly, a number of items include frequency-based words directly within them (item 1 actually includes the phrase ‘all the time’). So it becomes a little difficult to interpret a response when someone, for example, indicates that they worry all the time about whether the pain will end to a slight degree, but doesn’t choose the ‘all the time’ option. This doesn’t mean the PCS is in any way a poor scale; in fact, we’ve shown rather convincingly (in my opinion) that it actually functions quite well, so arguably these concerns are irrelevant. But I highlight them because I believe it’s important for clinicians to look closely at not just the statistical properties of a scale as reported in the scientific literature, but also at its qualitative properties, to help them truly interpret what the scale can and cannot tell them. The end result is that we’ve suggested a rescoring of items 8 and 12 so that they’re both now out of 3 rather than 4, and the overall scale is out of 50 rather than 52, the nice round number being a pleasant side effect.
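For those who like to see things concretely, here’s a minimal sketch of what that rescoring might look like in practice. The exact recoding of the response categories for items 8 and 12 is described in the paper; the collapse shown below (merging the top two options) is just a hypothetical placeholder for illustration:

```python
# Sketch of the suggested 0-50 rescoring. The specific collapse shown for
# items 8 and 12 (merging the top two response options) is a hypothetical
# placeholder -- the actual recoding is described in the paper.

def rescore_pcs(responses):
    """Rescore a 13-item PCS response list (each item originally 0-4) out of 50.

    `responses` is ordered item 1 through item 13.
    """
    if len(responses) != 13:
        raise ValueError("Expected 13 item responses")
    total = 0
    for item_number, score in enumerate(responses, start=1):
        if item_number in (8, 12):
            # Hypothetical collapse: 0,1,2,3,4 -> 0,1,2,3,3 so the item is out of 3.
            score = min(score, 3)
        total += score
    return total  # maximum possible total is now 50 rather than 52

# Example: someone who endorses the highest option on every item.
print(rescore_pcs([4] * 13))  # -> 50
```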

Another consideration is the notion of dimensionality in a measurement tool. It is a very basic axiom of quantitative measurement that any scale meant to be subject to mathematical manipulation should only measure a single construct. To draw an example, let’s say I want to create a scale to measure how ‘big’ you are. I may decide to measure your height in centimetres, your weight in kilos, and, let’s say, your shoe size. While all 3 of these would give me some indication of how big you are, it would make little sense to sum them all together and create an overall scale of ‘bigness’. Using this example, there would be no way I could say that a change from 350 to 345 is the same as a change from 450 to 445; even though both differ by 5 points, I have no way of knowing what changed. To put it another way, I wouldn’t be able to confidently say that someone who scores 700 on my combined scale is twice as big as someone who scores 350. On the other hand, had I captured just one of those subscales, weight for example, I could say that someone who weighs 80 kg is twice as heavy as someone who weighs 40 kg. So unidimensionality is an important consideration in measurement science, and I will put forth an opinion here that most common scales in use in rehabilitation have not been adequately evaluated for this important trait, which renders their summative scores highly susceptible to inaccuracy or bias. The Rasch model also allows for this type of analysis, and to keep what is already a very long story at least somewhat short, the PCS was adequately unidimensional for use as a summative score. There’s a pile of deep philosophical stuff we could talk about here, the very nature of catastrophizing being one example, but for now we’ll just leave it at that.
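To make the ‘bigness’ point concrete, here’s a quick numeric sketch (all values invented) showing how two identical 5-point changes on the composite can mean completely different things:

```python
# Quick numeric sketch of the 'bigness' problem (all values invented).

def bigness(height_cm, weight_kg, shoe_size):
    # Naive composite: just sum three different subscales together.
    return height_cm + weight_kg + shoe_size

# Person A loses 5 kg; Person B's height measurement drops by 5 cm.
change_a = bigness(180, 80, 11) - bigness(180, 75, 11)
change_b = bigness(180, 80, 11) - bigness(175, 80, 11)

print(change_a, change_b)  # 5 5
# Both composites change by exactly 5 points, yet the underlying changes are
# not comparable -- the summed score hides *what* changed.
```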

The last thing I’ll talk about here is how Rasch analysis gives us the ability to see how a scale performs across the range of possible scores. It does this in a couple of different ways, one of them visual (a histogram) and another more statistical. The statistical method allows us to make use of a transformation matrix, which tells us how to convert raw ordinal-level scores out of 50 to interval-level scores, again out of 50. As has been the case with every such analysis I’ve seen so far, the results of this transformation suggest that change at the extreme ends of the scale is more meaningful than change in the middle of it. As an example, a raw score change from 1 to 2 on the PCS corresponds to a 5-point change on the interval scale. On the other hand, a raw score change from 23 to 24 corresponds to a 0.5-point change on the interval scale. So a 1-point difference at the extreme bottom of the scale represents a 10-fold greater interval-level change than the same 1-point difference in the middle of the scale. This probably has highly important implications for establishing things such as minimum detectable change or minimum clinically important difference, but given that the latter are drawn from classical test theory, there currently isn’t a good analog that can be drawn from the newer Rasch or IRT approaches.
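To give a sense of how such a conversion gets used in practice, here’s a small sketch. The handful of table values below are illustrative placeholders chosen only to reproduce the two examples above; the real raw-to-interval values come from the transformation table published with the paper:

```python
# Sketch of applying a raw-to-interval conversion. The table entries here are
# illustrative placeholders chosen to reproduce the two examples above; the
# real values come from the transformation table published with the paper.

raw_to_interval = {
    1: 3.0,    # placeholder
    2: 8.0,    # placeholder: a 1-point raw change near the floor is 'worth' ~5 points
    23: 24.5,  # placeholder
    24: 25.0,  # placeholder: the same 1-point raw change mid-scale is 'worth' ~0.5 points
}

def interval_change(raw_before, raw_after, table=raw_to_interval):
    """Change on the interval-level metric implied by two raw PCS scores."""
    return table[raw_after] - table[raw_before]

print(interval_change(1, 2))    # 5.0
print(interval_change(23, 24))  # 0.5
```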

So in the end, the results of our analysis indicate that the PCS is by and large a reasonably good scale from a mathematical perspective, especially when scored out of 50 rather than 52. It doesn’t, of course, tell you when or how you should use it in clinical practice, or what to do about a high score if you encounter one. For that type of information, readers are directed to http://sullivan-painresearch.mcgill.ca/pcs.php.