Website: http://beta.simplifierlab.com/
Original page: http://beta.simplifierlab.com/2007/12/googles-data-as.html
Hi Brad,
Great post. This really resonates with what we are doing at our own startup.
I believe that Google's data gathering abilities can be matched by interpreting a user's imprints while they use services across the web.
For example, how can the related tags in Delicious or conversations on Twitter be correlated to find something meaningful? This would be more than a mashup that puts two services together; it would mean actually correlating the data to do something completely different.
Another example: why can't people crawl, index, ingest, and analyse the web's podcasts to study voice accents? The data is already out there on the open web.
I think the data across all these web services is richer and deeper than just what Google has. So, if we take a site-by-site approach and then correlate the results, maybe Google's data advantage can be nullified.
FriendFeed has taken a site-by-site approach to interpret data from these services and deliver useful information to users.
Why can't other services also utilize the data to make their own service richer?
Nik
All this seems to proceed from the idea that out of an abundance of data comes more useful information. As late as the 70's it was still thought that if enough sensors could be distributed throughout the atmosphere, and if a powerful enough computer were used to process all that information, we would be able to predict the weather accurately for months in advance. Wrong.
There's another shift about to take place that will change everything about the nature of files and that will make data mining impossible. It is being driven by the desire for privacy, but it is far above and beyond and infinitely more powerful than encryption. Yet it is simpler, and when it begins, Google won't know what to do.
There is no way data has "increasing marginal utility." Clever idea but simply false. Most of what Google is trying to do is make inferences about specific parameters (e.g. your tastes). That's basically a Bayesian inference problem, which has the property that the first bit of information is a lot more valuable than the next bit. Google figured me out years ago; my search history barely matters anymore.
Now, you are on to something. It just doesn't have anything to do with marginal utility. It's more like a natural monopoly.
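To make the "first bit is worth more than the next bit" point concrete, here is a minimal sketch. It is not anything Google is known to do: it assumes a toy model in which a single taste parameter is a Bernoulli rate with a uniform Beta(1, 1) prior, and it measures the value of an observation as the reduction in posterior variance. The function names and the fixed success rate are hypothetical choices made for illustration.

```python
def posterior_variance(successes: float, failures: float) -> float:
    """Variance of a Beta(1 + successes, 1 + failures) posterior (uniform prior)."""
    a = 1.0 + successes
    b = 1.0 + failures
    return (a * b) / ((a + b) ** 2 * (a + b + 1.0))


def marginal_gains(max_observations: int, rate: float = 0.5) -> list[float]:
    """Reduction in posterior variance contributed by each successive observation.

    Assumption for illustration: a fixed fraction `rate` of the observations
    are "successes" (fractional counts are fine in the Beta variance formula).
    """
    variances = [
        posterior_variance(rate * n, (1.0 - rate) * n)
        for n in range(max_observations + 1)
    ]
    return [variances[n] - variances[n + 1] for n in range(max_observations)]


if __name__ == "__main__":
    for i, gain in enumerate(marginal_gains(5), start=1):
        print(f"observation {i}: variance reduced by {gain:.5f}")
```

Under these assumptions every observation still helps, but each one shrinks the uncertainty by less than the one before it, which is the diminishing-returns pattern the comment describes.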
Extending Googster's analysis, the relationship between data and prediction is pretty thoroughly described in mathematics, though in cases where all resources are strictly bounded in some fashion (i.e. reality) the relationship is a lot more complex and has interesting ramifications.
Generally speaking, prediction is computationally intractable in its purest form so we use all manner of approximations to make predictions. While it is true that early bits are worth more than later bits, slight differences in algorithm design and quality can generate significant differences in average predictive quality down the road. (There are a number of interesting theoretical caveats to this, but mentioning them would lead too far afield.) The datasets of most types are often large enough that the differentiator is the algorithm, and the optimal algorithm for the general case is known to be non-computable. It is worth pointing out that a lot of the mathematics behind this is relatively recent, and it is not an area many people are deeply familiar with.
So what does this mean? It means that at any moment, some company could materialize with a substantially better representation algorithm and eat Google's lunch, at least as far as prediction goes. While Google's algorithms are theoretically far from optimal in many regards, few companies have been able to exploit this by making significant improvements in the necessary areas of computer science. The article is correct in that you can't beat Google nibbling around the edges, you have to use computer science they don't have. The good news is that improvements in the areas of computer science required are eminently doable in the sense that we know the current state-of-the-art is substantially improvable for practical systems.
And lest it sound hypothetical, I know of one venture doing a Series A now with the requisite computer science IP to do the job, and possibly another early one that is apparently attacking another one of the computer science avenues to displacing Google at least in theory. Of course, if the data all gets locked up inside Google or whoever, *then* those companies start to gain significant monopoly leverage. On the other hand, Google et al have barely scratched the surface of what can be done -- at least in theory -- with the vast public body of data so that is less of a roadblock than it may sound at first.
Google is in a strong position but it is not quite as invulnerable as it is sometimes portrayed, at least not with regard to prediction and data mining, and they definitely do not have a lock on the people who are generating most of the computer science that will inevitably make Google's current prediction and mining methods obsolete. Still plenty of churn left in that market. At least in theory.
The platform shift you mention is already underway: it's the mobile Internet, where none of the big players has secured a major slice of the pie yet.
Google's mobile strategy has just started; you can see it in the new Google Maps and in the potential game changer that Android can be.
Given that "prediction" can mean different things in different professional contexts, forgive me if I don't see how that relates to Google's public products with embedded data collection to date.
As long as Google can successfully resist agreeing to standards by offering their own unique open platforms to enthusiastic participants while continually creating data gathering systems fueled by enthusiastic users, they've got a great shot at global dominance.
As we continue our flow from PC to web to mobile devices to physical/natural environment to human body, Google can just keep moving with us.
Malcolm Gladwell wrote a great New Yorker piece on this topic:
http://gladwell.com/2007/2007_01_08_a_secrets.html
It is subtitled "Enron, intelligence, and the perils of too much information."
In financial markets, there's a spike in returns for knowing the next piece of information that's going to impact the stock that's not yet factored in the price.
(Insiders make better decisions on average, but plenty of insiders have gone down with a sinking ship, and a fast way to go broke is with inside information. There's no value if it's already in the price, and there's no value if it's not going to get priced in before something more important happens)
Similarly, that one piece of information no one else has that puts someone in a specific psychographic, or shopping for a particular product in a particular location, is worth a (big) spike in returns.
If Google can keep their monopoly on that information, they will keep generating superior returns. That's different from saying there are continuously increasing returns.
(Also, while there might be increasing returns to knowing your location, income, and amount spent on CDs and each can be stored as a number, the complexity of acquiring each succeeding number rises exponentially, and in that sense they contain more information in the same number of bits)
I 100% agree that data has increasing marginal utility... and I think much of that utility has yet to be tapped.
There's significant inherent value in the metadata buried within the data itself that we are just beginning to see exploited. There are two kinds of metadata: explicit, which we're starting to do things with, and implicit, which almost no one has done much with as of yet.
I think much of what will be interesting in the next 2-4 years will be driven by the exploitation of implicit data. Interesting times are ahead!
I believe what we're looking at is that "Data is a product" or "Data's value as an intangible asset is non-linear." The behavior we're seeing around data stores in the market appears to be that of sigmoid curves, such as those associated with product lifecycles. Please see the diagram here: http://sphericalmusings.blogspot.com/2006/11/sigmoid-curve.html
Google is doing a good job in hopping from one data product lifecycle to the next, in a way that (among large enterprises) only 3M, Intel, and Microsoft have done in the past. It requires avoiding the innovator's dilemma by practicing self-cannibalization. Self-cannibalization requires consistently sub-optimizing short term cash flow which has been difficult for Yahoo, et al.
Google seems to be making this bargain: I'll give you free software/services (Search, Docs, 411, etc.) if you give me free data. Why else would they be building massive data centers?
The sniff test seems to say that Google wants to lead the data-driven future, so to them, on a macro level, more data is more valuable.
Great post! Information is often a key asset in an organization. Back in the old days, the retailers usually ended up with the customer knowledge, and the supplier had to rely on the retailer for this information.
When Google starts incorporating the 23andMe data (http://www.wired.com/medtech/genetics/magazine/...) - then we'll have something to worry about.
Only half joking, as the 23andMe founder is married to Sergey Brin.
The applicable concept from economics is that of a supermodular production function. Let x and y be two different types of data and let f(x, y) stand for the value that Google can derive from the data.
As a previous commenter pointed out, because of the characteristics of Bayesian inference (which lies at the heart of what Google does) it will be the case that the marginal benefit from an increase in x is declining, i.e. f(x' + e, y) - f(x', y) < f(x + e, y) - f(x, y) for x' > x. Suppose you want to serve targeted ads and you have access to a stream of searches by an individual, then the first observation of the term "bmw" has more benefit in identifying that individual as a potential car buyer (and serving a car ad) than the second observation of a similar term, say "mercedes".
Now suppose that you also have access to a second type of data on the individual, say geography, then a supermodular production function says that having more of that second type of data makes every additional bit of the first data more valuable, i.e. f(x + e, y') - f(x, y') > f(x + e, y) - f(x, y) for y' > y. In the example, the marginal benefit from seeing the second search term ("mercedes") increases if you happen to also know that the individual is located in Greenwich, CT, because now you can serve up a localized car ad (for which the dealer is sure to be willing to pay more).
The fact that Google's production function is almost certainly supermodular poses a big challenge for potential competitors. It means that even if you succeed in gathering more data of one type, the incremental value of that data for you may be a lot less than for Google which can combine it with lots of other types of data.
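The two properties in this comment (declining marginal benefit in x, and supermodularity between x and y) can be checked numerically. The sketch below uses a hypothetical production function, f(x, y) = log(1 + x) * (1 + y), chosen only because it has both properties; it is not a claim about Google's actual value function, and the names `value` and `marginal_x` are invented for the illustration.

```python
import math


def value(x: float, y: float) -> float:
    """Hypothetical production function: concave in x, complementary with y."""
    return math.log1p(x) * (1.0 + y)


def marginal_x(x: float, y: float, e: float = 1.0) -> float:
    """Benefit of adding e more units of type-x data, holding y fixed."""
    return value(x + e, y) - value(x, y)


if __name__ == "__main__":
    # Diminishing returns: the second search term adds less than the first.
    print("first term, no geography: ", round(marginal_x(0.0, 0.0), 3))
    print("second term, no geography:", round(marginal_x(1.0, 0.0), 3))
    # Supermodularity: the same second term is worth more once y is larger.
    print("second term, with geography:", round(marginal_x(1.0, 1.0), 3))
```

In the ad example, x would count search observations and y the amount of geographic data. The point of the sketch is only that both inequalities can hold at once, so diminishing returns within one data type do not rule out a compounding advantage across types.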
I analyzed this for my economics class today.
As marketing data increases, total utility increases, not MARGINAL utility. The more data collected, the more total utility. However, as more and more marketing data is collected, each category begins to add less and less value to one's marketing research, and marginal utility decreases.
This follows the first basic assumption of consumer behavior: more is better! The more data collected, the greater the utility.
The fact that marginal utility decreases follows the law of diminishing marginal utility. The more categories of marketing data collected, the less value the information adds.
That is what we are doing with our start-up. Using contextual data, we are able to collect valuable data on the web and index it with our contextual algorithm, giving users a powerful front end for finding their content on the web.
I don't think the marginal utility increases. I just think the "fat head" of the demand curve is going to get fatter and fatter.