The Boost
The statistics of counter-terrorism
September 12, 2011 statistics

Is it worth more to save a person from a terrorist's bomb than, say, a car crash? John Mueller and Mark Stewart grapple with the spending choices we make in their new book about homeland security. The trouble, of course, is that much of what matters to us is unmeasurable. In the United States, we lose some 34,000 people a year in highway accidents. If we lost even a thousand a year--twenty per week--in terrorist attacks, the country would likely be paralyzed by fear (and I can only imagine where that would take our politics).

Yet somehow we're able to pack the minivan, load the kids, jack up the radio, and head out onto deadly highways. Driving is a risk we're willing to accept. We feel safe even when we're not. And feeling safe, it can be argued, matters more within a society than numbers. Even if researchers made a convincing case that we could reduce highway deaths by 10,000 or 15,000 a year, I bet most people wouldn't support increased spending for it. That's because they (we) don't feel threatened--even if we should.

Still, we spend lavishly on counterterrorism. And in the post-9/11 fervor, which was revisited on its 10th anniversary, much of the spending has gone unquestioned. That's going to change. I don't think we'll ever look at the numbers as coldly as Mueller and Stewart, who use the logic of a triage ward. I'm not convinced that we should. But their analysis provides a good starting point.

The Numerati in health care: Targeting the barnacles
January 19, 2011 statistics

In the "Shopper" chapter of The Numerati, I wrote about "barnacles." These are the shoppers who go from store to store looking for bargains. They never spend full price. They actually cost stores money. And now merchants are coming up with analytical tools to find the barnacles among their shoppers--and perhaps to treat them differently. They can start by removing their names from promotional mailings.

In a New Yorker article, Atul Gawande details this same type of analysis in health care. Jeffrey Brenner, a dataminer in Camden, NJ, has identified the one percent of the population that accounts for 30 percent of health-care costs. These people are to Camden health care what barnacles are to Target and Best Buy. They're a problem. But they're also an opportunity. Because once the population is identified, it's possible to test different approaches to making these people healthier--and save buckets of money.
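To get a feel for how concentrated those costs are, here is a back-of-the-envelope calculation in Python. It uses only the two figures quoted above (one percent of the people, 30 percent of the costs). The function name is made up for this sketch, and this is not a description of Brenner's own method.

```python
def cost_concentration(pop_share, cost_share):
    """Return per-person cost multiples relative to the overall average:
    (high-cost group, everyone else, ratio between the two)."""
    high = cost_share / pop_share              # high-cost group vs. average
    rest = (1 - cost_share) / (1 - pop_share)  # everyone else vs. average
    return high, rest, high / rest

high, rest, ratio = cost_concentration(pop_share=0.01, cost_share=0.30)
print(f"High-cost patients: {high:.1f}x the average per-person cost")
print(f"Everyone else:      {rest:.2f}x the average")
print(f"Ratio between them: {ratio:.1f}x")
```

By this arithmetic, each person in the high-cost group costs about 30 times the average, and roughly 42 times as much as a typical person outside the group.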

This type of statistical analysis is the great hope for improving health care, and keeping the system from going broke. Unfortunately, this process leads to fears, some of them justified. What if the algorithms are focused first on driving up profits? It could happen. The best antidote is for the dataminers to tell their stories, as Brenner has.

Statisticians zero in on Euro crooner
May 31, 2010 statistics

Kaggle, a British-based platform for prediction contests, has sponsored an unusual statistics competition, inviting teams from all over Europe to predict the winner of the annual Eurovision Song Contest. According to Kaggle's blog, the statisticians outperformed the betting markets. This apparently indicates that smart stats fiends, at least in this case, outduel the wisdom of betting crowds.

The statisticians, according to the blog, took into account several variables that the betting public ignored. The most interesting point, to me, was this:

[T]he betting-market data itself might have an impact on the outcome. This year, pre-contest favourites seemed unwilling to allocate votes to each other. Azerbaijan awarded Germany just one vote when other countries awarded Germany an average of 6.5. Germans returned the favour by not sparing a single vote for Azerbaijan.

As a special offering to fans of Eurovision contests, here's a look at Raphael singing, in 1966, Yo Soy Aquel.

Happy North Dakotans: How stats deceive
April 7, 2010 statistics

My son is choosing between universities in Pennsylvania and Washington (state). So I thought I'd compare the states on a variety of metrics (even though, as I posted recently, I don't think the choice will make much difference).

What I found was a Gallup poll showing that for satisfaction with standard of living, North Dakotans rank first. My question: Does that tell us more about North Dakota or the people who choose to live there? I think it's pretty clear that if you gave 300-some million Americans a choice to start over and live anywhere they wanted, a fairly small number would end up in North Dakota. In fact, it might come pretty close to the 646,000 who currently live there.

The poll fails to take into account the winnowing process that's hit the Great Plains. Populations in states like North Dakota have fallen in recent decades, as the young and the old have looked for opportunities and sunshine to the south. North Dakota's population has ebbed from 680,000 in 1930. If you consider that the national population in those 80 years has grown from 122 million to 308 million, clearly a shrinking minority elects to remain in North Dakota.
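The "shrinking minority" is easy to put a number on. This short Python sketch uses only the population figures quoted in this post (rounded as given) and works out North Dakota's share of the national population in 1930 and today.

```python
# Figures quoted in the post (population counts, rounded)
nd_1930, us_1930 = 680_000, 122_000_000
nd_2010, us_2010 = 646_000, 308_000_000

share_1930 = nd_1930 / us_1930
share_2010 = nd_2010 / us_2010
relative_drop = 1 - share_2010 / share_1930

print(f"1930 share of U.S. population: {share_1930:.2%}")
print(f"2010 share of U.S. population: {share_2010:.2%}")
print(f"Relative decline in share:     {relative_drop:.0%}")
```

On these numbers the state's share falls from roughly 0.56 percent of Americans to about 0.21 percent, a relative drop of a little over 60 percent.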

So if a poll-taker calls those sturdy and stubborn Dakotans and asks them if they're satisfied with their choice, aren't a lot of them going to answer in the affirmative? Contrast that with where many of the North Dakotans have moved: Fast-growing Florida is the sixth unhappiest state. (Sunny Nevada heads the list.)

This brings us to one other variable, which is related. Nevada and Florida suffered greatly from the collapse of the housing bubble. North Dakota, with its shrinking population, was untouched. According to a Pew study, only one in 165 homeowners was likely to face foreclosure there, the lowest rate in the country. That compares to one in 11 in Nevada and one in 26 in Florida.
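"One in N" odds are hard to compare at a glance, so here is a small Python sketch that converts the three Pew figures quoted above into percentages and into multiples of North Dakota's rate. Nothing here goes beyond those three numbers.

```python
# "One in N" foreclosure odds quoted from the Pew study
one_in = {"North Dakota": 165, "Florida": 26, "Nevada": 11}

rates = {state: 1 / n for state, n in one_in.items()}
baseline = rates["North Dakota"]

for state, rate in rates.items():
    print(f"{state:<13} {rate:6.2%}  ({rate / baseline:.1f}x North Dakota)")
```

Put that way, a Nevada homeowner was about 15 times as likely to face foreclosure as one in North Dakota, and a Florida homeowner a bit more than six times as likely.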

Following that Dakota detour, I return to the chart and find that Pennsylvania and Washington both score in the middle of the pack in satisfaction. Of course, that kind of (boring) economic satisfaction is virtually the last thing college students have on their minds...

Stamen Design: Illustrating the physics of information
February 4, 2010 statistics

Ben Cerveny of Stamen Design was speaking before me at the Webtrends Engage conference yesterday. Stamen, an eight-person shop in San Francisco, produces fascinating and provocative visuals from big data sets. He showed data of everything from real estate to news as squiggling, morphing blobs and lines. Sometimes it looked like cell biology, but Cerveny pointed to another science. He said Stamen was looking for rule sets for the "physics of information."

That idea has been batted around for a while. Last year I read The User Illusion: Cutting Consciousness Down to Size, by Tor Norretranders. He wrote at great length about information and the second law of thermodynamics. The idea is that information, like heat, tends naturally toward entropy. It loses its structure and disperses. It starts to look more like the general stream on Twitter and less like The New York Times. And the job of journalists or algorithm writers is to use intelligence to bring order to the information. In that sense, we do the work of Maxwell's Demon. That's a fictitious character, thought up by James Clerk Maxwell, who has the intelligence (and dexterity) to separate fast- and slow-moving molecules, and thus create "free" energy (and counteract entropy). The question, of course, is whether the energy gained by separating the molecules would compensate for the energy spent in separating them. Somewhere in there is the value of information.

Anyway, Stamen does cool work. The photo above is from Trulia, a company that has a vast database of real estate transactions through U.S. history. You can click on a neighborhood and see the development patterns. I like the one of Miami Beach. They also do lots of work with Digg, the crowdsourced news aggregator. Check out Digg Labs. You can even turn the swarm, which shows the sprouting and clustering of news items in their community, into a desktop screen saver. (I'd be tempted, but I wonder, thinking back to Maxwell's Demon, how much energy it would gobble.)

This reminds me. One morning I was breakfasting in Palo Alto and wearing a shirt I picked up in Madison: Wisconsin Physics. One smart aleck stopped by my table, pointed to my shirt, and said: Wisconsin physics? Do they have different laws of thermodynamics there?

Predicting murders in Philadelphia
January 19, 2010 statistics

What bits of data correlate to murder, or could help predict such crime in a major American city? On Michael Trick's Operations Research blog, I found a link to this challenge to predict murders in Philadelphia. (That population number is wrong, by the way. Philly's population is around 1.45 million, not 5.8 million.)

The Analytics X Competition recalls the much-ballyhooed Netflix prize. But the stakes are a tad lower ($100 instead of $1 million). And as Trick points out, participants have a much freer range to find their data. While the Netflix competition was based exclusively on anonymous user data, Philadelphia sleuths can incorporate any data they want. Could movie rental patterns correlate to murder? They can plug it in. Income? Graduation rates, home values? Anything goes. One blogger at Live at the Witch Trials tinkers with data on crowded houses.

The sad thing, of course, is that the winner will benefit, in a small way, from the tragedies that he or she predicts. But maybe the wisdom of a data-crunching crowd can help cities like Philadelphia dedicate more resources to high-risk areas and keep the grisly predictions from coming true.

Pie-chart innumeracy
January 15, 2010 statistics

With the spread of data, we're going to be consuming ever more data graphics. Many, no doubt, will maul and tangle statistics. The Fox chart above, which appears to describe 193% of Republican voters, is just one example. This comes from an EagerEyes post on Understanding Pie Charts.
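A pie chart only makes sense when the slices are mutually exclusive and add up to a whole. That is simple to check before drawing anything. The Python sketch below is a minimal sanity check; the function name, the tolerance, and both sets of numbers are hypothetical (one set is chosen only so that it totals 193), not values read off the chart.

```python
def check_pie(slices, tolerance=0.5):
    """Return (total, ok): ok is True when the shares sum to about 100%."""
    total = sum(slices.values())
    return total, abs(total - 100) <= tolerance

# Hypothetical shares, chosen only so that they total 193
overlapping = {"Candidate A": 70, "Candidate B": 63, "Candidate C": 60}
# Hypothetical shares from a pick-exactly-one question
exclusive = {"Candidate A": 40, "Candidate B": 35, "Candidate C": 25}

for name, data in [("overlapping", overlapping), ("exclusive", exclusive)]:
    total, ok = check_pie(data)
    verdict = "fine for a pie chart" if ok else "use a bar chart instead"
    print(f"{name}: total = {total}% -> {verdict}")
```

When respondents can back more than one option, the shares overlap and the total runs past 100 percent. A bar chart handles that case honestly; a pie cannot.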

Words to use for large Twitter followings
December 7, 2009 statistics

Here are two lists of words. The words in one list correlate with Twitter users who have large groups of followers. The words in the other pop up more frequently in Twitter accounts with few followers. See if you can guess which is which:

A: Top, Online, Send, List, Web, Media, Join

B: Sleep, Hate, Damn, Feeling, Homework, Class, Boring, Stuck

According to a study by Thomas Kalafatis, the top list (group A) is associated with high follower numbers, group B with low ones. There's a good discussion on the post about the value of these words as predictors. I would guess, based on what I know about Twitter, that the correlation doesn't point to causality.

One reason: Marketers who round up Twitter followings (sometimes on paid services) use all of those words in group A. I wrote about one of them, Steve Lafavore, in a recent post. He uses most of the words in group A, and few of the ones in B. He's up to 10,406 followers (though most of them pay precious little attention to each other, at least in my experience).

As another commenter points out on the Social Media Today site, the first group of words has to do with an offer; the second group is derived mostly from humdrum living. Those are people describing their lives to friends--and not necessarily engaged in creating legions of followers.

Time-warping: How can it help predict baseball?
December 2, 2009 statistics

A couple of months ago, I was talking to Anne Milley, director of analytical intelligence strategy at SAS. She was telling me about time-warping. That's a method for assigning greater significance to events that happen at certain times.

The most common approach is to give more weight to the most recent events. The book I looked for yesterday is probably a better predictor of my interest tomorrow than one I searched for in 2004. But how much more relevant is it? Statisticians can study patterns across large populations and come up with time-warping formulas. I would imagine that they vary from sector to sector. A three-year-old search for hospice treatment probably has close to zero predictive power at this point. But if you were looking for Bob Dylan songs back then, you're probably still interested.
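One simple way to weight recent events more heavily is exponential decay with a half-life. The Python sketch below assumes that form purely for illustration; it is not the formula Milley described or anything SAS is known to use. The function name and the half-life values are hypothetical guesses, picked to mirror the hospice and Bob Dylan examples above.

```python
def recency_weight(age_days, half_life_days):
    """Weight of an event that happened `age_days` ago.
    The weight is 1.0 today and halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)

# Hypothetical half-lives per sector (illustrative guesses, not measured)
HALF_LIFE_DAYS = {"hospice care": 60, "Bob Dylan songs": 3650}

three_years = 3 * 365
for topic, half_life in HALF_LIFE_DAYS.items():
    w = recency_weight(three_years, half_life)
    print(f"{topic:<16} weight after three years: {w:.3f}")
```

With a short half-life, a three-year-old search carries almost no weight. With a long one, it keeps most of its predictive value. In practice the half-life for each sector would be fitted to data across large populations, not guessed.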

This type of analysis is going to become ever more pervasive as we generate more time-stamped data with our smart phones. Of course, the trick then will be to warp for both time and place. The variations are endless.

(Photo: Adrian Beltre)

I would imagine that Nate Silver, the baseball and political statistician I interviewed last spring at South by Southwest, has sophisticated time-warping models for baseball players. Since the Phillies are in the market for a third baseman, I've been thinking recently about Adrian Beltre, who had one great year at the hot corner for the Dodgers. As a 25-year-old, he hit 48 home runs in 2004--but hasn't hit more than 26 in a season since then. I would think that time-warping would almost discount that one season as a near meaningless blip. Now that I think about it, there's a chance it's not meaningless at all: After 2004, baseball started testing much more vigorously for steroids.

That raises another challenge for statisticians: Drug warp.

Spanking and low IQs--causation or correlation?
September 24, 2009 statistics

What do you think of this article on the apparent correlation between spanking and low IQs? I'm not sold. The researcher says he adjusted for socioeconomic factors. So here's my question: How do the IQs of spanking parents compare to those of non-spanking parents? I wonder if they tested that...
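To see why that question matters, here is a toy Python example of confounding. Every number in it is invented for illustration; none comes from the study. The assumption baked in is that child IQ tracks parent IQ and that spanking is simply more common among one group of parents.

```python
from statistics import mean

# Synthetic records: (parent_iq_group, child_was_spanked, child_iq).
# All numbers are invented to illustrate confounding; none come from the study.
records = [
    ("lower", True, 94), ("lower", True, 95), ("lower", True, 96),
    ("lower", False, 95),
    ("higher", True, 105),
    ("higher", False, 104), ("higher", False, 105), ("higher", False, 106),
]

def iq_gap(rows):
    """Mean IQ of non-spanked children minus mean IQ of spanked children."""
    spanked = [iq for _, s, iq in rows if s]
    not_spanked = [iq for _, s, iq in rows if not s]
    return mean(not_spanked) - mean(spanked)

overall_gap = iq_gap(records)
gaps_by_group = {
    group: iq_gap([r for r in records if r[0] == group])
    for group in ("lower", "higher")
}

print("Overall gap:", overall_gap)
print("Gap within each parent group:", gaps_by_group)
```

In this made-up data, spanked children score five points lower overall, yet within each parent group the gap is zero. The whole difference comes from which parents spank. Whether anything like this holds in the real data is exactly what a comparison of parents' IQs would reveal.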

I'm immediately skeptical of practically anything that has to do with IQ. The idea of measuring intelligence, which I see as a universe, on a linear scale seems kind of dumb to me. (I argue about it from time to time with my son, who's getting an advanced degree in psychology.)

