A Discussion of "p" Values

The p value is perhaps the most familiar statistic in modern scientific discourse. New graduate students and lay readers often mistakenly use it to interpret the entirety of the empirical data for a given study. What began as a useful tool for decision making in hypothesis testing became a one-trick litmus test for whether results were significant and, by unfortunate extension, publishable.

Therefore, we would like to provide a brief, accurate portrayal of the p value and the manner in which it should be used and interpreted. This article will serve authors as a helpful update on the current status of the p value as a tool in the scientific community. Be aware that if a misunderstanding of p values is evident in your manuscript, then you can reasonably expect an outright rejection by the reviewer.

The p value was originally calculated as a statistic that would describe a given set of data based on an assumed null hypothesis. Pierre-Simon Laplace, who also provided the mathematical description of surface tension, calculated early p values in an attempt to determine whether the observed excess of male over female births was “real.” Thus, the notion originated that the p value could detect whether differences were real or, alternatively, due to chance. The utility of the p value was that it would establish a common, standardized decision-making process for rejecting or accepting hypotheses based on empirical data. As proposed by Ronald Fisher, the threshold for rejecting the null hypothesis would be set at p < 0.05. Importantly, this was a completely arbitrary value designated and used by the scientist, not the statistician.

So, given the utility of the p value, what exactly is being calculated?

The p value is a description of the data; it is not a description of the hypothesis. The value indicates the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. This is a valuable tool for deciding whether to reject the null hypothesis. As a community, scientists have agreed on a threshold for rejecting the null. This threshold fixes the probability of falsely rejecting a true null (type I error); the probability of failing to reject a false null (type II error) is not set by the threshold alone and also depends on the sample size and the true effect size. The p value therefore seems to provide an intuitive indication of the likelihood that apparent differences are “real.”
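
To make this definition concrete, the following minimal Python sketch computes an exact two-sided p value for a hypothetical coin-flip experiment: 60 heads in 100 flips, with a null hypothesis that the coin is fair. The function name and the data are invented for illustration. The sketch also assumes one particular reading of "at least as extreme," namely every outcome that is no more probable under the null than the observed one; other conventions exist.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """Exact two-sided p value for a binomial experiment.

    Sums the null-hypothesis probability of every outcome that is
    no more likely than the observed one.
    """
    def pmf(k: int) -> float:
        # Probability of exactly k successes if the null hypothesis is true.
        return comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)

    observed = pmf(successes)
    # A small relative tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical data: 60 heads in 100 flips, null hypothesis of a fair coin.
p = binomial_p_value(60, 100)
print(f"p = {p:.4f}")
```

Under these assumptions the result is about 0.057. That number describes how surprising the data would be if the coin were fair; it is not the probability that the coin is fair.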

Unfortunately, this intuitiveness has steadily led to widespread misuse of the p value, and recent developments, such as the reproducibility crisis, have shifted attitudes concerning the use and reporting of p values. Understanding these changes is now critical to achieving success in the publication process.

Recently, due to the uproar over p values, the American Statistical Association felt compelled to release a statement on the use of the p value.

From The ASA's Statement on p-Values: Context, Process, and Purpose (2016) by Wasserstein and Lazar:

“P values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”

and

“Scientific conclusions and business or policy decisions should not be based only on whether a p value passes a specific threshold.”

and

“A p value, or statistical significance, does not measure the size of an effect or the importance of a result.”

These three judgments are critical to the modern use of p values. We will now provide some guidance for using the p statistic in your manuscript based on the preceding information.

P values are increasingly being reported as exact values rather than relative to the threshold (p < 0.05). Editorial boards and reviewers increasingly recognize the arbitrary designation of the threshold, and they want to see the p value as a continuous scale that represents the strength of the data against the null hypothesis. The threshold is still acceptable for guiding statements of significance, but the exact value of test statistics should be reported unless specified otherwise.
The p value cannot be reported in isolation. Given that the p value provides no evidence of the effect size, generalizability, or importance of a result, reviewers expect to see complementary statistical evidence. This can include effect sizes, confidence intervals, and standard errors. For the same reasons, it is inadvisable to use phrases such as “highly significant.”
Do not attempt to circumvent statistical significance. With the widespread recognition that p values are being erroneously used as a catch-all to determine if a study was a success or a failure, reviewers are unforgiving with authors attempting to subvert their p values. Manipulating analyses or selectively reporting results until the threshold is crossed is commonly referred to as “p-hacking.” Do not state that a difference was found and subsequently qualify it with “but this difference did not reach statistical significance.” Likewise, phrases such as “marginally significant” should never be used.
The hypothesis statement should be as specific as possible. P values operate on reductio ad absurdum logic: the alternative or experimental hypothesis is accepted because the null has been deemed improbable. Maintaining this logical structure is crucial; the null and alternative hypotheses must be the only two possible explanations. Therefore, the null is typically a hypothesis of no effect. Make sure that your alternative hypothesis is a suitable counterclaim.
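
The first two recommendations above (report the exact p value, and accompany it with an effect size and a confidence interval) can be sketched in Python as follows. This is only one possible approach, using the standard library alone: a permutation test for the p value, Cohen's d for the effect size, and a percentile bootstrap for the confidence interval. The function names, the resampling settings, and the two data sets are all hypothetical and chosen purely for illustration; your field or target journal may expect different methods.

```python
import random
from statistics import mean, stdev


def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation p value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:
            hits += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, level=0.95, n_resamples=5000, seed=0):
    """Percentile bootstrap confidence interval for the difference in means."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lower = diffs[int((1 - level) / 2 * n_resamples)]
    upper = diffs[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, invented purely for illustration.
treated = [5.1, 5.8, 6.2, 5.9, 6.4, 5.5, 6.0, 6.3]
control = [4.9, 5.2, 5.0, 5.6, 5.3, 4.8, 5.4, 5.1]

p = permutation_p_value(treated, control)
d = cohens_d(treated, control)
low, high = bootstrap_ci(treated, control)
print(f"mean difference = {mean(treated) - mean(control):.2f}, "
      f"95% CI [{low:.2f}, {high:.2f}], d = {d:.2f}, p = {p:.4f}")
```

Reporting all three quantities together lets a reader judge the size and precision of the difference, not merely whether the p value fell below a threshold.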

Ronald L. Wasserstein & Nicole A. Lazar (2016) The ASA's Statement on p-Values: Context, Process, and Purpose, The American Statistician, 70:2, 129-133, DOI: 10.1080/00031305.2016.1154108

(Please retain the reference in reprint: https://www.letpub.com/author_education_p_values)
