Showing posts with label Florida. Show all posts

Tuesday, January 24, 2012

Solid Demographics + Most Recent Poll = Best Prediction

Reviewing the SC primary results, the poll closest to the actual voting date that had good demographics was the most accurate. That poll was PPP.

Virtually every primary has at least 10% of the vote being decided on the day of the election. Voters who decided on the day of the election or within a few days make up at least 20% and in SC it was a whopping 55%!

So trying to predict a vote within a couple percentage points, which is important when you have 4 to 7 viable contestants, seems to require some sort of mysticism to be accurate.

But you can get close if you simply poll in those last few days and model your demographics on past demographics, taking into account any obvious reasons for a significant change from the most recent election or two (e.g. voters can reasonably easily switch from one primary/caucus to another, even in a so-called "closed" primary, or a prior contest was moot because the party's nominee was clear by the time that prior year's contest was held, etc.).

Anyway, looking at the just released PPP poll for Florida and reviewing the exit poll data, I see no reason to consider any major change in demographics from the 2008 exit poll and for the most part, the PPP poll deviates very little from that.

But here is one outlier: it says 40% of voters describe their political philosophy as very conservative. That is WAY off the mark. In 2008 it was 27%. In 2000 and 1996 the Florida primary was preceded by two dozen contests, so voter interest was low. How that affects political philosophy is not yet clear to me. However, the very conservative % in those two contests was even lower at 20% and 21%. Even though Florida is a state with higher growth than most other states, a statewide demographic shift of that size seems virtually impossible.

So how would this affect the results?
Let us assume that the conditional probability crosstab reported by PPP is not significantly affected by any other demographic anomalies (one I noted was the much lower Hispanic sample of 7% vs. 12% in the '08 primary).

So the conditional probability of choosing a candidate based on your political philosophy is:
Political Philosophy     Newt   Romney   Santorum   Paul   Other
Very Liberal             42%    34%      9%         5%     10%
Somewhat Liberal         44%    23%      9%         20%    5%
Moderate                 30%    39%      6%         17%    9%
Somewhat Conservative    35%    42%      9%         9%     6%
Very Conservative        44%    23%      20%        8%     5%

If one assumes the exit polling in the 2008 primary is reflective of the 2012 exit polling, then the % of voters for each political philosophy is 2% very liberal, 8% somewhat liberal, 28% moderate, 34% somewhat conservative, and 27% very conservative.

For Newt Gingrich, 2%x42% + 8%x44% + 28%x30% etc... you get 36.5%.
The numbers for Mitt Romney calculate to 33.9%, which cuts Newt's 5 pt lead roughly in half.
The numbers for Santorum and Ron Paul reverse their positions, from a 3 pt lead for Santorum to a 0.7 pt lead for Ron Paul.
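
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the reweighting described in this post, not anything PPP publishes; the function name `reweight` and the dictionary names are my own. Note that the 2008 exit poll shares add up to 99% because of rounding, and the sketch leaves them unnormalized, as the hand calculation does.

```python
# Conditional vote shares by political philosophy, from the PPP crosstab (percent).
CROSSTAB = {
    "Very Liberal":          {"Newt": 42, "Romney": 34, "Santorum": 9,  "Paul": 5,  "Other": 10},
    "Somewhat Liberal":      {"Newt": 44, "Romney": 23, "Santorum": 9,  "Paul": 20, "Other": 5},
    "Moderate":              {"Newt": 30, "Romney": 39, "Santorum": 6,  "Paul": 17, "Other": 9},
    "Somewhat Conservative": {"Newt": 35, "Romney": 42, "Santorum": 9,  "Paul": 9,  "Other": 6},
    "Very Conservative":     {"Newt": 44, "Romney": 23, "Santorum": 20, "Paul": 8,  "Other": 5},
}

# Share of the electorate in each group, from the 2008 exit poll (percent).
# These add up to 99, not 100, because the published figures are rounded.
EXIT_POLL_2008 = {
    "Very Liberal": 2,
    "Somewhat Liberal": 8,
    "Moderate": 28,
    "Somewhat Conservative": 34,
    "Very Conservative": 27,
}


def reweight(crosstab, group_shares):
    """Sum (group share x candidate share within that group) for each candidate."""
    totals = {}
    for group, share in group_shares.items():
        for candidate, pct in crosstab[group].items():
            totals[candidate] = totals.get(candidate, 0.0) + (share / 100.0) * pct
    return totals


estimates = reweight(CROSSTAB, EXIT_POLL_2008)
for candidate, pct in estimates.items():
    print(f"{candidate}: {pct:.1f}%")
```

Run as written, this reproduces the 36.5% for Newt and 33.9% for Romney. Swapping in a different turnout assumption only means changing the second dictionary.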

Now consider what the media narrative is when the pollsters say someone is leading by 2.5% vs. the 8% lead reported by Insider Advantage (which gives ZERO demographics, ZERO polling methodology).

The best scientific research is open and relatively easily reviewable by anyone with the time and inclination to do so. Mistakes are found, and corrected. This is in stark contrast with the majority of media polls put out every week during election season.

Sunday, January 15, 2012

GIGO - Garbage In Garbage Out

This is not my usual tone for this blog, but I grow weary of bad polls and bad explanations of the GOP race.

From Wikipedia:
Garbage in garbage out "was coined as a teaching mantra by George Fuechsel, an IBM 305 RAMAC technician/instructor in New York. Early programmers were required to test virtually each program step and cautioned not to expect that the resulting program would 'do the right thing' when given imperfect input. The underlying principle was noted by the inventor of the first programmable computing device design:

On two occasions I have been asked,—'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."


In my years of working with health insurance data, we spent far more time cleaning up garbage data, redefining vague information into something more useful and coming up with models that rationally explain how healthcare services were used before we got to using much more than basic arithmetic. We rarely used t-tests, R-squared values or more sophisticated statistics the way actuaries or our less successful competitors did. You can't put lipstick on a pig.

As I look at the internals of various primary polls for Iowa, New Hampshire and now South Carolina, I am struck by the number of times a poll with seriously flawed input data is touted by the media and used to build a false narrative. The worst case has to be the late December CNN Iowa poll, which surveyed ZERO voters who weren't registered Republicans, despite the undisputed significant voting by non-Republicans in Iowa GOP caucuses. The right % should have been 25%, the same % demonstrated during the last non-competitive Iowa Democratic caucus (1996) and the same number that would be validated in the 2012 exit polls. Since the media either was ignorant of the flaw or gave so little warning about it, and there were no other polls for two days, the 24-hour media trumpeted a nonexistent surge by Santorum due to this seriously flawed poll. So today, I read an article about how the evangelical leaders have coalesced around supporting Santorum... someone who would have likely stayed in 6th or maybe reached 5th place in Iowa and then returned home if not for a single flawed, heavily promoted poll.

Today I see an even worse poll, from some organization I've never heard of, that somehow believes that less than 5% of the voters in South Carolina's primary will be younger than 40 and 55% will be at least 65 years old. For reference, in 2008 35% were older than 60 according to the exit poll. The 2000 exit poll showed that 25% were older than 60.

This poll is inexplicably given a weight rating of 4 bars out of 5 on Nate Silver's otherwise credible forecasting model. Aggregators favor combining as many polls as possible, no matter their quality, hoping that with enough garbage, the various garbage factors will cancel each other out. Some skilled analysts, like Nate Silver, attempt to quantify and somewhat discount the garbage by using theoretical formulas about sampling error on a bell curve, applying some likely useful heuristics like the age of the poll, and employing the somewhat controversial, though still likely useful, strategy of rating a pollster by how close its polls come to predicting the actual result.
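
To make the "sampling error on a bell curve" idea concrete, here is a minimal sketch of the textbook normal-approximation margin of error for a simple random sample. It is not Nate Silver's actual model, and the sample size of 600 is a made-up figure used only for illustration.

```python
import math


def margin_of_error(share_pct, sample_size, z=1.96):
    """95% margin of error, in percentage points, for one candidate's share.

    Assumes a simple random sample; z=1.96 is the usual 95% normal quantile.
    """
    p = share_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / sample_size)


# Hypothetical poll: a candidate at 36.5% among 600 respondents.
moe = margin_of_error(36.5, 600)
print(f"+/- {moe:.1f} points")
```

The formula only captures random sampling noise. A sample with the wrong age or party mix can miss by far more than this figure, and nothing in the formula will show it.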

I guess this is better than nothing, but it is not something we did in the healthcare data analysis industry, nor did any of our competitors. Admittedly, we were spending millions of dollars on these tasks, while the polling aggregators do this with at least one or two fewer zeros in their budget.

Still, I can't believe that spending a little effort on trying to adjust the data or at least dump a poll that has such problems as today's horrendously bad SC poll isn't low hanging fruit for these small organizations.

Nate Silver's latest article also repeats a widely spread meme that the South Carolina GOP is home to 60% evangelicals/born agains and, as evidence, cites a 2008 exit poll that doesn't even ask this question. Worse yet, the 2000 GOP exit poll shows 34% belong to the religious right (which is not the same as born again/evangelical, but it is the closest I could find). Perhaps some non-exit poll has this 60%, but before I cite that, I want to take a serious look at the internals to figure out if other anomalies exist.

Another false notion:

Too often I hear that South Carolina isn't like New Hampshire, where independents are such a factor. Wrong. It is an open primary. In 2000, when Gore still had some modest competition from Senator Bradley, 39% of the voters in the GOP primary identified themselves as Independent or Democratic. Compare this to Iowa's caucus in 2012, where there was no competitive race for the Democrats and 25% of the GOP voters were non-Republicans.

Polls with a bad age distribution and independent voter % can heavily penalize Ron Paul - someone whom the establishment wants to discredit, but I'll leave that explanation for another post.

As I look at suspect polling internals, I wonder if the accuracy of the aggregations in Iowa and New Hampshire wasn't the beneficiary of a certain amount of luck, and whether a repeat of the NH Dem 08 primary is waiting.

So for the next week I'm going dumpster diving into SC and FL voting/polling data, sifting through the garbage, hoping to bring out some good.

Or for my dyslexic friends: IGOG. Into the Garbage, Out comes Good.