Posts tagged as Datamining

WSJ: Advertiser tracking on the rise posted on July 31, 2010

Datamining

The Wall Street Journal published a report today (behind a paywall) on cookies and the growth of consumer tracking on major Web sites. For the report, they analyzed big Web sites, including their own, and found that many dropped more than 100 cookies onto visitors' computers. (The Journal dumps 60 cookies, slightly below the 64-cookie average on the 50 largest sites.) The only big site that doesn't track visitors is Wikipedia.org.
As a reader (and former editor) I found the Journal story maddeningly vague. It says that cookies are on the rise, but doesn't give any historical context. It mentions data-analysis companies that are doing highly detailed work, but doesn't name them. And while it states what type of analysis they could do with this detailed data, it doesn't give examples of how it's being used. To wit:
"Some tracking files can record a person's keystrokes online and then transmit the text to a data-gathering company that analyzes it for content, tone, and clues to a person's social connections.... Data-gathering companies [can] build personal profiles that could include age, gender, race, zip code, income, marital status, and health concerns, along with recent purchases and favorite TV shows and movies."
Why not name a few of these companies, and, while they're at it, ask advertisers how such detailed profiles are being used? Also, note the use of the word "could" in the last sentence. Is there evidence that these unnamed companies are actually building these profiles? We don't know.
I dealt with these issues often while researching The Numerati. The problem here, as in much of the data economy, is the gap between the astonishingly rich trove of data and the undeveloped business model for it. Most companies simply don't know how to put the data to use. How do you deal with millions of detailed consumer profiles when you only have four or 10 or 20 different types of ad campaigns? You ignore most of the details and put the people into enormous buckets. (Credit-card companies are a notable exception. They can create thousands of different offers and test them against different groups. But they've been at this since long before the age of cookies.)
Eventually advertisers will learn to make use of this information, if a privacy uprising doesn't shut cookies down. But for now much of this detail we're communicating with our clicks and keystrokes is piling up in data centers, largely ignored.
|


Questions about analytics? posted on July 9, 2010

Datamining

If you're wondering how to harness the power of the numerati in your own enterprise, consider participating in the webinar my colleagues at SmartDataCollective are holding on July 15. It's called Trendspotting for Growth, and it features analytics experts from SAS, Warner Home Video, and Teradata, among others.
***
|



I thought I'd pay tribute to the upcoming Dutch-Spanish World Cup final with a look at the history between the two countries. This 1634 painting by Diego Velazquez, The Surrender of Breda (also known as Las Lanzas), depicts a scene from Holland's 80-year war to liberate the country from the Spaniards.
In the 1620s, the Spaniards laid siege to the town of Breda. The villagers resisted heroically, but in the end, in 1625, the town fathers surrendered the keys to the Spaniards. In this scene, the Spanish leader, a Genoese named Ambrosio Spinola, is trying to prevent the Dutchman from kneeling. No need to add humiliation to defeat. (We'll see if that same notion prevails, on one side or the other, in Sunday's game.)
What I find interesting is the contrast between the two sides. Velazquez depicts the Spaniards, on the right, as noble. It's not just their clothes. Their lances are long and elegant. The Dutch, on the left, are far more modest, and their pikes are stubby and strong. On the soccer field, the Spaniards, with their precision passing game, are more elegant than the Dutch. But the Dutch, with their ability to kick in rockets from 30 yards out, might be stronger.
One more point about the painting. To focus the viewers' attention on the drama in the middle, Velazquez closed the right flank with a horse's ass--a bold strategy. On Sunday, we'll see a Spaniard named Pedro dashing up and down the right flank. (He made an ass of himself in the semifinal by not passing to a wide open Fernando Torres in the closing minutes...)
|


Datamining competition: HIV progression posted on July 8, 2010

Datamining

Kaggle, a data-prediction company, is inviting numerati the world over to mine a small set of data and predict patterns in HIV. Offering $500, the competition has already drawn in 71 individuals and teams, and some of them have submitted dozens of predictions.
As we saw in the famous Netflix datamining contest, these competitions can generate a lot of activity and research. And while Netflix offered a million dollars, Kaggle demonstrates that big money is not necessary. Researchers are drawn by the challenge, the data, and the chance to mingle with (and show off before) their global peers. What's more, these competitions are wonderful for networking. As competitors scan the other studies, they see new approaches, meet new colleagues, and often team up. The winning Netflix team was a coalition of people who met on the competition site.
Kaggle founder Anthony Goldbloom argues that competitions can move scientific research far faster than the traditional process involving peer-reviewed papers. In a blog post, he writes:
Whereas scientific literature tends to evolve slowly (somebody writes a
paper, somebody else tweaks that paper and so on), a competition
inspires rapid innovation by introducing the problem to a wide
audience. There are an infinite number of approaches that can be
applied to any modeling task and it is impossible to know at the outset
which technique will be most effective. By exposing a problem to a
wide audience, competitions expose the problem to a range of different
techniques. This maximises the chances of finding a solution, and gets
the most out of any particular dataset – given its inherent noise and
richness.
|


LinkedIn mines data for future job paths posted on June 30, 2010

Datamining

Let's say you're a 26-year-old accountant. You're
wondering about your future, and it's easy to see the three most common
paths. You climb up in an accounting firm, look for accounting work on
your own or at another company, or you go back to school and get an MBA.
But how about outside those options? I was talking yesterday with Dipchand (Deep) Nishar, LinkedIn's vice president of products and user experience. He came to LinkedIn
from Google 18 months ago (the two headquarters are about a half mile
apart) with a mandate to develop new data-centric products and
services. And he's walking me through this hypothetical accountant's
career conundrum. If you think about everyone who was ever a
26-year-old accountant, hundreds of thousands of people have wrestled
with these same choices. And what paths did they follow?
That's where LinkedIn's data trove comes in. The company has some 70 million
members. That's data on 70 million careers. Conceivably, the company
could provide a service showing each one of us the paths that others
took when they were in the same position we're in now. It could diagram
where those choices led. "Maybe he ends up deciding to be a high school
math teacher," Nishar says. In that case, he could find current math
teachers who have followed that path and debrief them.
Nishar says that this type of service, now under development, will be
available by year end. Of course, to message the
accountants-turned-math teachers directly, our 26-year-old would have
to upgrade to a paid account at LinkedIn. That's part of the business
plan. But if Nishar and his team figure out ways to create these types
of services, more of us might be willing to pay for them.
I asked Nishar how much data LinkedIn had. He wouldn't say, and told me that
the quantity of data was irrelevant. "I could have exabytes," he said.
"If I don't do anything with it, it's useless. To know the five people
you should connect to, you might need only a kilobyte of data." I
suspect that LinkedIn has a relatively small trove of data compared to
other social networks, because most of the LinkedIn stash is in words
and numbers, not videos or jpegs. But those words and numbers could
spell gold.
Nishar also pooh-poohed one current theory in data, espoused by Wired's Chris Anderson,
that Big Data will turn the process of discovery on its head. According
to that school, which leans heavily on insights from Google:
Petabytes allow us to say: "Correlation is enough."
We can stop looking for models. We can analyze the data without
hypotheses about what it might show. We can throw the numbers into the
biggest computing clusters the world has ever seen and let statistical
algorithms find patterns where science cannot.
Nishar, who headed up product development for Google in Asia, disagrees. "There are
two types of consultants," he says. "The unsuccessful ones collect lots
of data. The successful ones start with a hypothesis." He also
maintains that gifted humans are far better than machines at picking
out patterns in data.
This debate will rage for years. But if you
learn about new job trends or other insights coming out of the data
trove at LinkedIn, chances are it started with a hypothesis from a
human on Nishar's team. (Reposted from SmartDataCollective)
|


Keeping count of people (and things) posted on June 15, 2010

Datamining

I learned while researching The Numerati that the Chinese have 11 different spellings for Osama Bin Laden. (Maybe it's up to 12 or 13 by now.) So if the quants at the National Security Agency are attempting to monitor Chinese Web traffic about the Al Qaeda leader, their computers have to recognize all of these different spellings, and group them.
At the same time, I share a name with a prominent author who wrote best-selling books such as How to Live with a Neurotic Dog. Smart systems have to figure out that we're not the same person. (This, of course, is a huge issue for thousands of people whose names condemn them to no-fly lists.)
It sounds easy, but one of the toughest challenges in digging through unstructured data is to come up with accurate counts of people and entities. Jeff Jonas has a very thoughtful blog post and article on this. He writes:
it is essential
to understand the difference between three transactions carried out by three
people versus one person who carried out all three transactions. Without the ability to determine when
entities are the same, it quickly becomes clear that sensemaking is all but
impossible....I find most organizations have
underestimated this principle: If a system cannot count, it cannot
predict.
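To make the counting problem concrete, here is a minimal Python sketch. It is not Jonas's method, just an illustration: the `normalize` rule, the function names, and the sample names and purchases are all made up for this example. It folds spelling variants of a name into one key and then counts entities rather than transactions.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Crude canonical key: lowercase, drop punctuation, sort the name tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def group_by_entity(transactions):
    """Group (name, item) pairs under the normalized name."""
    groups = defaultdict(list)
    for name, item in transactions:
        groups[normalize(name)].append(item)
    return dict(groups)


# Hypothetical data: three transactions, three spellings of one name.
transactions = [
    ("Doe, Jane", "book"),
    ("jane doe", "flight"),
    ("Jane  Doe.", "hotel"),
]

entities = group_by_entity(transactions)
print(len(transactions), "transactions,", len(entities), "entity")
```

A rule this simple only solves the easy half of the problem. It merges variant spellings, but it would also merge two different people who happen to share a name, which is exactly the no-fly-list trap described above. Telling those apart takes more evidence than the name alone, such as addresses or birth dates.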
|


Amazon's foreign publishing push: A customer data play posted on May 20, 2010

Datamining

I'm trying to imagine everything Amazon knows about me. The company's computers know the books and music I buy, the ones I click on, the ones I send as gifts. On my Kindle app, they can monitor the ones I read, and even gauge my enthusiasm. (As they well know, I'm struggling with The Golden Bowl.)
Given all that data, which foreign books would I be most likely to buy if they were translated? As I read on Mashable, that's what Amazon wants to figure out. The new imprint, AmazonCrossing, will buy rights to non-English language books, translate them, and market them to readers statistically most likely to buy.
My question: Since the market for translated books here is small, how can they draw statistical correlations between North American readers and foreign writers they don't know? I look at the French Amazon page for the first book Amazon is translating, Tierno Monénembo's King of Kahel. Readers of that book appear to be interested largely in other non-English writers. I don't see any Tom Clancy or Ian McEwan overlap.
Still, this is an interesting challenge: What are the most telling statistical correlations between people of different languages and cultures? In global markets, it's an important question, and Amazon's just starting its research.
One detail. Translators are paid between $6,000 and $8,000 for a 60,000-word novel. (That's a small book, 5,000 fewer words than The Numerati.)
|


A month of this blog, in words posted on May 13, 2010

Datamining

Following Jeremy Wagstaff's lead, I created a word cloud by pasting this
blog's URL into Wordle.net. Based on the words in the cloud,
I think the service only looked at the most recent posts. And I was a
little puzzled by the big "quot." I think it's some kind of formatting
word in the software.
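One plausible explanation for the stray "quot," offered as a guess rather than a diagnosis of Wordle: in HTML, a double quotation mark is often written as the entity `&quot;`, and a word counter that splits on anything but letters would see "quot" as a word. The Python sketch below uses a made-up sentence and a hypothetical `word_counts` helper to show the effect, and how decoding entities first makes it disappear.

```python
import html
import re
from collections import Counter


def word_counts(text: str, decode_entities: bool = True) -> Counter:
    """Count lowercase alphabetic words, optionally decoding HTML entities first."""
    if decode_entities:
        text = html.unescape(text)  # turns &quot; back into "
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical snippet of blog HTML with quotation marks stored as entities.
raw = "He called it &quot;data&quot; and &quot;more data&quot;."

print(word_counts(raw, decode_entities=False).most_common(2))
print(word_counts(raw).most_common(2))
```

Without decoding, "quot" outnumbers every real word in the snippet, which is how it could end up as the biggest word in a cloud.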
A growing number of Web sites provide these types of data-imaging
services. IBM's ManyEyes is a
good place to browse, if you feel like losing a few hours. When I was
still working at BusinessWeek (and occasionally feeling chained to a
desk), I created a word
cloud there of my (still) unpublished novel, Donkey Show. It
looks like ManyEyes now carries out a structural parse of the sentences
in a text, linking common subjects to verbs and objects.
(This is the kind of analysis routinely carried out by ManyEyes'
corporate cousin, Watson, the
Jeopardy-playing computer. Just like us, it has to figure out the
structure of a question in order to understand it. And it has to
understand the words in different contexts. What is the meaning of the
verb "to sink" if you're talking about playing pool? And how about the
same verb if you're playing in a pool? Teaching computers to make
these distinctions is a titanic challenge.)
***
I got back late last night to New Jersey from Los Angeles. I spent an
extra $50 to move up my flight yesterday, and avoid the red-eye. But
before leaving, I rented a bike on the beach at Santa Monica and rode on
a gorgeous spring morning. If any of you are on a business trip in LA,
and feeling overwhelmed by traffic and other urban grief, a bike ride in
Santa Monica is a nice antidote. It's only 15 minutes from the
airport.
|


AT&T studies user data to cope with iPhone crunch posted on March 31, 2010

Datamining

An excellent piece in the WSJ (behind subscriber wall) about AT&T's push to cope with the exploding data traffic of iPhone users. The company has to get this under control, because by early next year unhappy iPhone subscribers, especially in overloaded NY and SF markets, will likely have the chance to switch to Verizon.
To fine-tune its network, AT&T is studying ever more user data:
Before the iPhone, it used to be able to accurately forecast to the
minute the type of phone usage each new customer would add to its
network based on basic demographics such as age and income levels. The
forecast always held true across cities and towns.
But with the iPhone, such bets are off, AT&T executives painfully
learned. It now looks at a broader set of customer profiles to forecast
behaviors. For example, in a metro area with a large proportion of
students, the phone operator schedules network upgrades to occur outside
of colleges' nine-month academic terms.
"I'm as interested now in what you're doing when you're not on the
network," said John Stankey, head of AT&T's operations arm.
One interesting note from the article is that while AT&T has taken its bruises in this data-intensive market, at least it's learning. Some beleaguered users may jump to Verizon and other carriers just as those companies start to struggle with the same issues. It might be smart for Verizon to pay top-dollar for an AT&T engineer or two, just to get the know-how.
For example, AT&T said when iPhone customers started checking their
email and surfing the Web from their high-rise offices, AT&T
repositioned its cellular antennas to point up, instead of down.
There must be scores of similar lessons they've learned.
***
I'm in Seattle for a couple of days on book research. I've been interviewing folks at Vulcan Inc. about artificial intelligence. Now I'm in a coffee shop (surprise, surprise). I think I'll head across the street to the art museum before meeting with Ed Lazowska, head of the computer science dept at U. Washington (the U-Dub, as they call it around here).
|


Audience Science: Companies already know who they want to target posted on March 25, 2010

Datamining

I was in a hotel in Abu Dhabi when I saw a familiar face. It was Basem Nayfeh, chief technology officer at Audience Science, a leading behavioral targeting company. When I was writing The Numerati, Audience Science was the leading competitor to Tacoda. Both companies tracked the behavior of Web surfers and delivered ads linked to their perceived needs and interests. I was going to profile one of the companies in the book. So which was it going to be, the one in New York whose CEO (Dave Morgan) I knew, or the one 3,000 miles away, in Bellevue, WA?
I did Tacoda, which later was sold to AOL for about a quarter billion dollars. So now, if I want to talk about behavioral targeting (and don't feel like wading through press departments at Google, Yahoo or AOL), Audience Science is the place to go.
Nayfeh told me about a new trend. Lots of big companies, he says, already know the people they want to target. They have them in their database, or have tracked them on their own Web site. So instead of starting a campaign to hunt for people "likely" to be interested in their product or service, many of them are now simply saying: "Reach these people for me."
It's a shift in advertising, and it seems to me that it further weakens media sites. (Nayfeh, however, points out that these targeted people still need content. "They won't look at an empty screen.") In any case, we talked about the state of behavioral targeting for 15 minutes, and we'll put the interview on Smart Data Collective. (I'll link to it when it's up.)
Oh, by the way, since Behavioral Targeting has become a bugaboo for privacy advocates, and finds itself in the crosshairs of Congressional reformers, the industry now calls itself "audience" targeting.
|


Canadian hockey fans flush in unison posted on March 11, 2010

Datamining

The NY Times hockey blog discusses the pattern of water consumption in Canada as their team played the gold-medal hockey game against the U.S. It seems that water consumption spiked during commercial breaks, presumably as fans flushed toilets.
This type of behavior analysis is going to become much more common as we move toward smarter grids. Privacy advocates note, for example, that utilities will be able to spot the return of children from school in the afternoon, as the appliances switch on. And pinpointing basement marijuana operations will be a relative no-brainer.
I'm wondering if brewers will be able to focus TV advertising on households that flush more during sports time-outs. Of course, soda drinkers have to flush, too. To isolate the beer drinkers, they may have to aggregate more data--assuming it's worth it (which at this point, just to be clear, it's not).
***
I never really got used to the nine-hour time difference here in Abu Dhabi. I woke up this morning at 3:30, and am flying back to the U.S. today. I'll spend some of the 14 hours aloft reading Chess Metaphors, by Diego Rasskin-Gutman.
Incidentally, I stayed in the ultramodern Yas Hotel in Abu Dhabi. It's covered, as you can see, with an illuminated shell. The strangest thing is that it overlooks a Formula One speedway. So outside the window, by day, you hear practice below on the track: race cars gunning by and the screeching of their tires.