Stephen Baker



Home - posts tagged as statistics

Statisticians zero in on Euro crooner  posted on May 31, 2010

statistics

Kaggle, a British-based platform for prediction contests, has sponsored an unusual statistics competition, inviting teams from all over Europe to predict the winner of the annual Eurovision Song Contest. According to Kaggle's blog, the statisticians outperformed the betting markets. This apparently indicates that smart stats fiends, at least in this case, can outduel the wisdom of betting crowds.

The statisticians, according to the blog, took into account several variables that the betting public ignored. The most interesting point, to me, was this:

[T]he betting-market data itself might have an impact on the outcome. This year, pre-contest favourites seemed unwilling to allocate votes to each other. Azerbaijan awarded Germany just one vote when other countries awarded Germany an average of 6.5. Germans returned the favour by not sparing a single vote for Azerbaijan.


As a special offering to fans of Eurovision contests, here's a look at Raphael singing Yo Soy Aquel in 1966.



Happy North Dakotans: How stats deceive  posted on April 7, 2010

statistics

My son is choosing between universities in Pennsylvania and Washington (state). So I thought I'd compare the states on a variety of metrics (even though, as I posted recently, I don't think the choice will make much difference).

What I found was a Gallup poll showing that for satisfaction with standard of living, North Dakotans rank first. My question: Does that tell us more about North Dakota or the people who choose to live there? I think it's pretty clear that if you gave 300-some million Americans a choice to start over and live anywhere they wanted, a fairly small number would end up in North Dakota. In fact, it might come pretty close to the 646,000 who currently live there.

The poll fails to take into account the winnowing process that's hit the Great Plains. Populations in states like North Dakota have fallen in recent decades, as the young and the old have looked for opportunities and sunshine to the south. North Dakota's population has ebbed from 680,000 in 1930 to 646,000 today. If you consider that the national population in those 80 years has grown from 122 million to 308 million, clearly a shrinking minority elects to remain in North Dakota.
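To put a number on that shrinking minority, here is a back-of-the-envelope sketch in Python. It uses only the rounded figures quoted in this post; the function and variable names are mine, purely for illustration.

```python
# Rounded population figures as quoted in the post.
ND_1930, ND_2010 = 680_000, 646_000
US_1930, US_2010 = 122_000_000, 308_000_000


def share(part, whole):
    """Return part as a fraction of whole."""
    return part / whole


share_1930 = share(ND_1930, US_1930)
share_2010 = share(ND_2010, US_2010)

print(f"North Dakota share of U.S. population, 1930: {share_1930:.2%}")
print(f"North Dakota share of U.S. population, 2010: {share_2010:.2%}")
print(f"2010 share as a fraction of the 1930 share: {share_2010 / share_1930:.2f}")
```

On these figures the state's share of the national population drops from a bit over half a percent to roughly a fifth of a percent.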

So if a poll-taker calls those sturdy and stubborn Dakotans and asks them if they're satisfied with their choice, aren't a lot of them going to answer in the affirmative? Contrast that with where many of the North Dakotans have moved: Fast-growing Florida is the sixth unhappiest state. (Sunny Nevada heads the list.)

This brings us to one other variable, which is related. Nevada and Florida suffered greatly from the collapse of the housing bubble. North Dakota, with its shrinking population, was untouched. According to a Pew study, only one in 165 homeowners there was likely to face foreclosure, the lowest rate in the country. That compares to one in 11 in Nevada and one in 26 in Florida.

Following that Dakota detour, I return to the chart and find that Pennsylvania and Washington both score in the middle of the pack in satisfaction. Of course, that kind of (boring) economic satisfaction is virtually the last thing college students have on their mind...


Stamen Design: Illustrating the physics of information  posted on February 4, 2010

statistics


Ben Cerveny of Stamen Design spoke just before me at the Webtrends Engage conference yesterday. Stamen, an eight-person shop in San Francisco, produces fascinating and provocative visuals from big data sets. He showed data on everything from real estate to news as squiggling, morphing blobs and lines. Sometimes it looked like cell biology, but Cerveny pointed to another science. He said Stamen was looking for rule sets for the "physics of information."

That idea has been batted around for a while. Last year I read The User Illusion: Cutting Consciousness Down to Size, by Tor Norretranders. He went on at great length about information and the second law of thermodynamics. The idea is that information, like heat, tends naturally toward entropy. It loses its structure and disperses. It starts to look more like the general stream on Twitter and less like The New York Times. And the job of journalists or algorithm writers is to use intelligence to bring order to the information. In that sense, we do the work of Maxwell's Demon. That's a fictitious character, thought up by James Clerk Maxwell, who has the intelligence (and dexterity) to separate fast- and slow-moving molecules, and thus create "free" energy (and counteract entropy). The question, of course, is whether the energy gained by separating the molecules would compensate for the energy spent in separating them. Somewhere in there is the value of information.

Anyway, Stamen does cool work. The photo above is from Trulia, a company that has a vast database of real estate transactions through U.S. history. You can click on a neighborhood and see the development patterns. I like the one of Miami Beach. They also do lots of work with Digg, the crowdsourced news aggregator. Check out Digg Labs. You can even use the swarm, which shows the sprouting and clustering of news items in their community, as a desktop screen saver. (I'd be tempted, but I wonder, thinking back to Maxwell's Demon, how much energy it would gobble.)

This reminds me. One morning I was breakfasting in Palo Alto and wearing a shirt I picked up in Madison: Wisconsin Physics. One smart aleck stopped by my table, pointed to my shirt, and said: Wisconsin physics? Do they have different laws of thermodynamics there?


Predicting murders in Philadelphia  posted on January 19, 2010

statistics


What bits of data correlate to murder, or could help predict such crime in a major American city? On Michael Trick's Operations Research blog, I found a link to this challenge to predict murders in Philadelphia. (The population number given there is wrong, by the way. Philly's population is around 1.45 million, not 5.8 million.)

The Analytics X Competition recalls the much-ballyhooed Netflix prize. But the stakes are a tad lower ($100 instead of $1 million). And as Trick points out, participants have a much freer range to find their data. While the Netflix competition was based exclusively on anonymous user data, Philadelphia sleuths can incorporate any data they want. Could movie rental patterns correlate to murder? They can plug them in. Income? Graduation rates? Home values? Anything goes. One blogger at Live at the Witch Trials tinkers with data on crowded houses.

The sad thing, of course, is that the winner will benefit, in a small way, from the tragedies that he or she predicts. But maybe the wisdom of a data-crunching crowd can help cities like Philadelphia dedicate more resources to high-risk areas and keep the grisly predictions from coming true.


Pie-chart innumeracy  posted on January 15, 2010

statistics


With the spread of data, we're going to be consuming ever more data graphics. Many, no doubt, will maul and tangle statistics. The Fox chart above, which appears to describe 193% of Republican voters, is just one example. This comes from an EagerEyes post on Understanding Pie Charts.
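The underlying rule is simple enough to check by machine: the slices of a pie have to add up to 100%. Here is a minimal sketch of that sanity check. The slice labels and values are made up for illustration (chosen only so that they total 193) and are not read off the chart; the function name and tolerance are my own assumptions.

```python
def is_valid_pie(slices, tolerance=0.5):
    """Return True if the slice percentages sum to 100, within a rounding tolerance."""
    total = sum(slices.values())
    return abs(total - 100) <= tolerance


# Hypothetical slices that overlap (each respondent could pick several options),
# so they total far more than 100 and do not belong in a pie chart.
overlapping = {"Option A": 70, "Option B": 63, "Option C": 60}

# Hypothetical slices that partition the whole.
exclusive = {"Option A": 50, "Option B": 30, "Option C": 20}

for name, slices in [("overlapping", overlapping), ("exclusive", exclusive)]:
    verdict = "fine for a pie" if is_valid_pie(slices) else "not a pie"
    print(f"{name}: total {sum(slices.values())}% -> {verdict}")
```

When the answers overlap like this, a bar chart is the safer choice, since each bar stands on its own and nothing implies a share of one whole.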


Words to use for large Twitter followings  posted on December 7, 2009

statistics

Here are two lists of words. One of them correlates to Twitter users with large groups of followers. The words in the other pop up more frequently in Twitter accounts with few followers. See if you can guess which is which:

A: Top, Online, Send, List, Web, Media, Join

B: Sleep, Hate, Damn, Feeling, Homework, Class, Boring, Stuck

According to a study by Thomas Kalafatis, group A is associated with high follower numbers, group B with low ones. There's a good discussion on the post about the value of these words as predictors. I would guess, based on what I know about Twitter, that the correlation doesn't point to causality.

One reason: Marketers who round up Twitter followings (sometimes on paid services) use all of those words in group A. I wrote about one of them, Steve Lafavore, in a recent post. He uses most of the words in Group A, and few of the ones in B. He's up to 10,406 followers (though most of them pay precious little attention to each other, at least according to experience).

As another commenter points out on the Social Media Today site, the first group of words has to do with an offer, the second group is derived mostly from humdrum living. Those are people describing their lives to friends--and not necessarily engaged in creating legions of followers.


Time-warping: How can it help predict baseball?  posted on December 2, 2009

statistics

A couple of months ago, I was talking to Anne Milley, director of analytical intelligence strategy at SAS. She was telling me about time-warping. That's a method for assigning greater significance to events that happen at certain times.

The most common approach is to give more weight to the most recent events. The book I looked for yesterday is probably a better predictor of my interest tomorrow than one I searched for in 2004. But how much more relevant is it? Statisticians can study patterns across large populations and come up with time-warping formulas. I would imagine that they vary from sector to sector. A three-year-old search for hospice treatment probably has close to zero predictive power at this point. But if you were looking for Bob Dylan songs back then, you're probably still interested.
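For the curious, here is one way such a formula might look. This is only a sketch under my own assumptions: I'm using exponential decay with a per-topic half-life, which is a common recency-weighting scheme but not necessarily what SAS or anyone else actually uses, and the half-life values are invented for illustration.

```python
def recency_weight(age_days, half_life_days):
    """Weight for an event of the given age: 1.0 today, 0.5 after one half-life."""
    if age_days < 0 or half_life_days <= 0:
        raise ValueError("age must be >= 0 and half-life must be > 0")
    return 0.5 ** (age_days / half_life_days)


# Hypothetical half-lives: interest in hospice care fades within months,
# while musical taste is assumed to last about a decade.
HALF_LIFE_DAYS = {"hospice treatment": 90, "Bob Dylan songs": 3650}

three_years = 3 * 365
for topic, half_life in HALF_LIFE_DAYS.items():
    weight = recency_weight(three_years, half_life)
    print(f"{topic}: weight of a three-year-old search = {weight:.4f}")
```

With these made-up half-lives, the three-year-old hospice search counts for almost nothing while the Dylan search keeps most of its weight. The real work for the statisticians is estimating a sensible half-life for each sector from the data.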

This type of analysis is going to become ever more pervasive as we generate more time-stamped data with our smart phones. Of course, the trick then will be to warp for both time and place. The variations are endless.


Adrian Beltre

I would imagine that Nate Silver, the baseball and political statistician I interviewed last spring at South by Southwest, has sophisticated time-warping models for baseball players. Since the Phillies are in the market for a third baseman, I've been thinking recently about Adrian Beltre, who had one great year at the hot corner for the Dodgers. As a 25-year-old, he hit 48 home runs in 2004--but hasn't hit more than 26 in a season since then. I would think that time-warping would almost discount that one season as a near-meaningless blip. Now that I think about it, there's a chance it's not meaningless at all: After 2004, baseball started testing much more vigorously for steroids.

That raises another challenge for statisticians: Drug warp.


Spanking and low IQs--causation or correlation?  posted on September 24, 2009

statistics

What do you think about this article about the apparent correlation between spanking and low IQs? I'm not sold. The researcher says he adjusted for socioeconomic factors. So here's my question: How do the IQs of spanking parents compare to those of non-spanking parents? I wonder if they tested that...

I'm immediately skeptical of practically anything that has to do with IQ. The idea of measuring intelligence, which I see as a universe, on a linear scale seems kind of dumb to me. (I argue about it from time to time with my son, who's getting an advanced degree in psychology.)


Crowd wisdom--with a dash of sobriety  posted on September 17, 2009

statistics

Research from Carnegie Mellon, according to MIT's Tech Review, indicates that powerful minorities influence large crowds. (Anyone who's studied the Russian Revolution already knew as much.) But this research focuses on online recommendations at sites like Amazon. There, a handful of zealots can drive a review up or down.

How to manage this? The researcher, Vassilis Kostakos, suggests clipping off a few of the highs and lows. That's what they do in Olympic gymnastics, I think, to defend the average from outliers. But which books are you more likely to review: one you love or hate, or one that's in the middle? I suspect that editing out the passionate people will steer them toward other Web sites.
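Clipping the highs and lows is what statisticians call a trimmed mean. Here is a minimal sketch of the idea; the ratings are invented, and I am not claiming this is the exact procedure the researcher proposes or the one gymnastics judges follow.

```python
def trimmed_mean(ratings, trim=1):
    """Average the ratings after dropping the `trim` lowest and `trim` highest."""
    if trim < 0:
        raise ValueError("trim must be >= 0")
    if len(ratings) <= 2 * trim:
        raise ValueError("not enough ratings left after trimming")
    kept = sorted(ratings)[trim:len(ratings) - trim]
    return sum(kept) / len(kept)


# Hypothetical star ratings: two one-star reviews from a pair of zealots,
# plus eight reviews from everyone else.
ratings = [1, 1, 4, 4, 4, 5, 4, 4, 5, 5]

plain = sum(ratings) / len(ratings)
clipped = trimmed_mean(ratings, trim=2)
print(f"plain mean: {plain:.2f}, trimmed mean: {clipped:.2f}")
```

Note that the trim cuts both ways: along with the two one-star zealots, it throws out two of the five-star fans.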


Ichiro and the limits of the Numerati  posted on August 31, 2009

statistics


Ichiro Suzuki

For baseball fans, here's a very good piece on one of the greatest puzzles for statisticians in the sport: Ichiro Suzuki. By some counts, Ichiro is one of the greatest hitters in the history of the sport. But he's unique--which makes statistical comparisons to other players all but useless.

The Ichiro conundrum is a challenge for all of us. It would seem to make sense, as a career strategy, to angle for uniqueness. If you achieve it, no one can replace you. But at the same time, if statisticians can't compare you to a star, you might not be appreciated.

Now that I think about it, this has been a problem for odd ducks for eons. They're often unappreciated. This is true whether the analysis comes from a king, a manager, or a statistical chart. The challenge for all of us is to surround ourselves with people (and systems) that grasp our value. (Despite his issues with the statisticians, Ichiro seems to be pulling it off. His latest contract is for $90 million.)






©2010 Stephen Baker Media, All rights reserved.     Site by Infinet Design









