|
p-ter
i think the AIC (or BIC, or whatever) and p-values are both a bit more similar and a bit more different than you suggest.
1. similarities: both involve estimating the likelihoods of different models. if the models are nested, one can treat twice the difference in log-likelihood as coming from a chi-squared distribution, and calculate a p-value. this is what all standard stats software does when it reports a p-value.
2. differences: often, however, you want to compare models that are not nested, and need some way to judge which model fits the data better. This is done using one of these information criteria, which give a heuristic penalized likelihood based on the number of parameters. So the AIC is not used in formal hypothesis testing like a p-value, but rather as a tool in model choice. (i.e., there's no theoretical result that i'm aware of suggesting that a difference in AIC of X will happen with probability Y under some null, like there is with the likelihood ratio.)
As long as people keep coming up with nested models, the p-value isn't going anywhere (not to say that it's not often misused).
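A minimal sketch of the nested-model case described above, using the usual form of the test statistic (twice the difference in log-likelihood). The coin-flip data, the function names, and the restriction to one extra parameter (so the chi-squared tail can be computed with the standard library alone) are all assumptions made for illustration.

```python
import math

def binom_log_lik(heads, flips, p):
    """Log-likelihood of `heads` successes in `flips` trials at success rate p."""
    return (math.log(math.comb(flips, heads))
            + heads * math.log(p)
            + (flips - heads) * math.log(1 - p))

def lrt_pvalue_df1(log_lik_null, log_lik_alt):
    """p-value for a likelihood ratio test with ONE extra free parameter.

    For 1 degree of freedom, the chi-squared upper tail is erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (log_lik_alt - log_lik_null)
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Made-up data: 60 heads in 100 flips.
heads, flips = 60, 100
ll_null = binom_log_lik(heads, flips, 0.5)           # nested model: p fixed at 0.5
ll_alt = binom_log_lik(heads, flips, heads / flips)  # larger model: p estimated
lr_stat = 2.0 * (ll_alt - ll_null)
p_value = lrt_pvalue_df1(ll_null, ll_alt)
print(f"LR statistic = {lr_stat:.3f}, p = {p_value:.4f}")
```

The chi-squared approximation is only asymptotic, so with small samples the p-value from this sketch should be read as rough.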
Email | Homepage | 03.05.09 - 7:15 am | #
|
bbartlog
1) Your Y-axis in the first graphic is sort of mislabeled. If those are 'percent', then the figures should be in single digits, not .01, .02 and so on.
2) I'm not sure that just looking for 'p value' will give you a full reporting of the use of this method. I'm pretty sure I've seen papers where all they do is report p values in parentheses (p=.04, etc.), without even explicitly discussing the methodology.
3) I would tend to assume that the trendlines for the two methods are vaguely sigmoidal, with a steep slope during the period when they become accepted and popular, and a leveling off once they're in use most everywhere that they're appropriate. Under that assumption, the second graphic doesn't say what you think it does; it's what you'd expect from a younger methodology compared to an older one.
Having said that, I do think the new method is a good one. It does force us to define 'model space' or a model population somehow, but if researchers are forced to do that explicitly it should provide better insight into the research. Currently (I'm looking at you, pharma research...) the 'null hypothesis' can be selected after the fact to lead to a better-looking conclusion.
Email | Homepage | 03.05.09 - 1:44 pm | #
|
agnostic
A p-value isn't a likelihood because it's the probability of observing the data and all unobserved data that would be more extreme, under a null. A likelihood only counts the observed data.
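A small sketch of that distinction, assuming a made-up coin-flip experiment (8 heads in 10 flips) and a fair-coin null. The likelihood uses only the observed count, while the one-sided p-value also adds up the more extreme counts that were not observed.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials at success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, observed, p_null = 10, 8, 0.5  # hypothetical data and null value

# Likelihood of the null: probability of the observed count only.
likelihood = binom_pmf(observed, n, p_null)

# One-sided p-value: observed count plus every more extreme count (9 and 10).
p_value = sum(binom_pmf(k, n, p_null) for k in range(observed, n + 1))

print(f"likelihood = {likelihood:.4f}, p-value = {p_value:.4f}")
```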
I'm using "hypothesis" pretty loosely -- I consider it the same as a model. I.e., "I have a hypothesis that reproductive success is a Michaelis-Menten function of the number of willing females that a male has access to."
In the decline in the use of p-value relative to an information criterion, I think we see more of a focus on fitting and judging models, and less of an emphasis on testing whether some correlation is different from 0 or whether two means are exactly the same.
Email | Homepage | 03.05.09 - 1:48 pm | #
|
agnostic
I always have that problem in mislabeling percents. Ah well, people can tell what it means.
Sure, "p value" will catch fewer instances than "p," but using "p" will overestimate its use even more so.
The curves will be sigmoidal if they both become endemic, like a piece of technology that becomes fixed. But that's pretty hard to predict -- epicycles, the coefficient of variation, protoplasm, and plenty of other concepts or tools in science showed curves that were more like epidemic diseases, spreading and then burning out.
Even if p-values turn out to be here to stay, the ratio graph still shows that at least tools used in model comparison are catching up.
Email | Homepage | 03.05.09 - 1:55 pm | #
|
Statsquatch
If the hypothesis concerns a single parameter, the best replacement for the p-value is a plot of the profile likelihood or, if you are a Bayesian, the posterior density. With these, or a confidence interval, you can evaluate hypotheses other than the null. The problem is with hypotheses concerning multiple parameters simultaneously, which is what you need to look at if you are fitting a complicated model. Comparing AICs might be better than comparing likelihoods, since they are penalized for model complexity, but neither works that well in my opinion.
bbartlog,
Pharma only changes the null hypotheses for the general public. The FDA will not let you get away with it.
Email | Homepage | 03.05.09 - 8:30 pm | #
|
Larry, San Francisco
As a professional statistician/econometrician in the real world I actually have developed pretty strong beliefs about statistical testing. The acid test of any model is how it predicts out of sample (i.e., on data that was not used to develop the model). Using this criterion I have found that p-values are practically worthless. Many models have all of their coefficients' p-values 'statistically significant' and then have out of sample predictions worse than just using the mean value of the Y variable out of sample. The AIC and related criteria (there is something called the BIC) for standard regressions and the Gini coefficient/Entropy measures for discrimination models (i.e., trying to predict one type or another) are pretty good if they are applied to out of sample data. A model developed under these circumstances is usually very robust.
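A rough sketch of that acid test, assuming a synthetic data set, a plain least-squares line, and a baseline that predicts the training-sample mean of Y for every held-out point. All names and numbers here are illustrative assumptions, not anything from a real analysis.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(predictions, actuals):
    """Mean squared prediction error."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

# Synthetic data: a linear trend plus a deterministic wiggle standing in for noise.
xs = list(range(20))
ys = [2.0 * x + 1.0 + math.sin(x) for x in xs]

# Hold out every fourth point; the model never sees these during fitting.
train = [(x, y) for x, y in zip(xs, ys) if x % 4 != 0]
test = [(x, y) for x, y in zip(xs, ys) if x % 4 == 0]

intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
train_mean_y = sum(y for _, y in train) / len(train)

test_y = [y for _, y in test]
mse_model = mse([intercept + slope * x for x, _ in test], test_y)
mse_baseline = mse([train_mean_y] * len(test), test_y)

# Below zero means the model predicts worse than the mean, out of sample.
out_of_sample_r2 = 1.0 - mse_model / mse_baseline
print(f"model MSE = {mse_model:.3f}, baseline MSE = {mse_baseline:.3f}")
```

Here the signal is strong, so the fitted line beats the baseline; the failure described above shows up when `out_of_sample_r2` comes out negative despite significant in-sample coefficients.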
Email | Homepage | 03.05.09 - 10:22 pm | #
|
Steve Sailer
Thanks. I'd never heard of this topic before, but it sounds promising.
I'd be interested in seeing more about information criteria, especially with a real world example.
Is this treating modeling kind of like data compression, where you are trying to find the optimum trade off of compression and accuracy?
Email | Homepage | 03.06.09 - 12:18 am | #
|
agnostic
Rather than go through an actual example, consider something we know would happen. You have data on the population size of the US from 1790 to 2000, and when you plot it, you can tell it's accelerating. A bunch of functions could do that -- a quadratic, cubic, or higher degree polynomial. Or an exponential.
Let's say the quadratic and the exponential fit the points equally well. The exponential wins because it only has 2 parameters to estimate -- initial population size, and the growth rate. The quadratic has 3: the coefficients of the x^2 and x terms, and the constant at the end.
This is why it doesn't matter that you can use Excel to fit the data nearly perfectly, if it requires a 30th degree polynomial -- all of those parameters (the constant term plus a coefficient in front of every degree of x term) will get penalized in the judging.
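The arithmetic behind that penalty can be sketched with the standard AIC formula, 2k - 2 ln(L), where k is the number of estimated parameters. The log-likelihood value below is a made-up placeholder, and the parameter counts follow the ones given above (if you also count an error variance, both counts go up by one and the differences are unchanged).

```python
def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

# Hypothetical: both curves fit the census points equally well.
log_lik = -50.0

aic_exponential = aic(log_lik, 2)  # initial size, growth rate
aic_quadratic = aic(log_lik, 3)    # x^2 coefficient, x coefficient, constant

# A 30th degree polynomial has 31 parameters. To merely tie the exponential,
# its log-likelihood must be higher by the number of extra parameters.
extra_params = 31 - 2
break_even_gain = float(extra_params)
aic_poly_at_break_even = aic(log_lik + break_even_gain, 31)

print(aic_exponential, aic_quadratic, aic_poly_at_break_even)
```

With equal fit, each extra parameter costs 2 AIC units, which is why the exponential comes out ahead of the quadratic here.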
That's a good analogy about compression.
Email | Homepage | 03.06.09 - 3:31 am | #
|
Rich Lawler
The key reference, which is a bear to read, is by Burnham and Anderson, "Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach." A really NICE! discussion of this issue, from an evolutionary approach, is in Chapter 1 of Elliott Sober's latest (frickin' brilliant) book "Evidence and Evolution."
Email | Homepage | 03.06.09 - 9:33 am | #
|
bioIgnoramus
"As a professional statistician/econometrician in the real world ...": I sometimes think that "the real world" is a pretty implausible model.
Email | Homepage | 03.06.09 - 9:51 pm | #
|
bill r
Actually, the p-value is pretty simple. It's the probability that the statistic in question, calculated on comparable data from a hypothesized state (null or not), would meet or beat what has already been observed.
Why meet or beat? Statistics don't have to be monotone decreasing in their distance from the hypothesized center/value. (Think of a beta.) The survival function is, by definition. It doesn't require any of that B.S. about ensembles of experiments that some of the more rabid p-value critics like to fling about.
Email | Homepage | 03.16.09 - 9:31 pm | #
|