On Thursday, I interviewed Rishab Aiyer Ghosh, Open Source Initiative board member, founding editor of First Monday, and co-founder of search startup Topsy. I published an article based on the interview on Tuesday, but there was a lot more to our conversation than could be captured in a single piece. Over the course of the interview, we talked in detail about the impact of the social web on everything from journalism to popular uprisings. Ghosh also gave some details of how Topsy works, and what makes it different from other search engines.
Google built the old authority model…
RG: The whole way the web works has changed over the past few years. When people think of search, they think it’s really very simple. Most people think that search is a database lookup. If you look at web search engines, that’s very far from the truth, because what they do is extremely complicated. Twenty years ago, when the first web search engines started, what they did was basically database lookup—finding all the documents that had occurrences of the word, say, “tiger.”
But once you had millions of documents with occurrences of the word “tiger,” it was very difficult to make sense of the results and you needed ranking. That’s where Google’s key innovation of ranking things by the authority of the website came into being. They created graphs of websites and domains, and gave each domain authority based on how much it was cited by other websites. That’s still the technology that all web searches use. It’s still one of the key relevance criteria, and it’s based on a structure of authority on the web that has changed.
…and the primitive social web broke it
Around 2005, people used to do this thing called “Google bombing,” where they would put links.. [description of Google bombing]. One of the responses from Google was to require that all websites put a “nofollow” tag on links that are not created by the website itself.
So if you had a link that was posted in the comments, or posted by a user—which includes things like Wikipedia or all social media—which has not been created by the website [then you had to add a "nofollow" tag]. So the authority model—where, when a website links to something else, it gives its authority to that thing—that model breaks down because the website is no longer controlling who puts that link on its pages. So for all links of those types, they were forced to add this nofollow tag so that [the links] could be ignored for the purpose of computing authority. What that means, though, is that, while it was breaking the earlier authority model of Google, [Google] did not change their authority model in response to the way the web was changing.
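The effect of nofollow on a citation-based authority score can be illustrated with a small sketch. This is not Google's actual algorithm: it simply approximates a domain's authority as the number of distinct other domains linking to it, and skips any link flagged as nofollow. The `Link` type, the `domain_authority` function, and the example domains are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str             # domain hosting the link
    target: str             # domain being linked to
    nofollow: bool = False  # True for links posted by users, not the site


def domain_authority(links):
    """Count distinct citing domains per target, ignoring nofollow links."""
    citing = defaultdict(set)
    for link in links:
        if link.nofollow or link.source == link.target:
            continue  # user-posted links and self-links confer no authority
        citing[link.target].add(link.source)
    return {target: len(sources) for target, sources in citing.items()}


links = [
    Link("news.example", "tigers.example"),
    Link("zoo.example", "tigers.example"),
    Link("forum.example", "spam.example", nofollow=True),  # comment link
    Link("blog.example", "spam.example", nofollow=True),   # comment link
]
authority = domain_authority(links)
print(authority)
```

In this toy data the user-posted links contribute nothing, which is the point Ghosh makes: anything people post on someone else's site is invisible to a site-based authority score.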
And the web changed so that the authority model of the new web is that people are the sources of authority. This was always really the authority model, but 10 or 15 years ago, a website and a person were pretty much the same thing.
JS: A website was a useful proxy for a person or a collection of people (an institution, say).
RG: Yes. And that changed when you had different people posting on the same website, or the same people posting on different websites—that proxy didn’t work anymore. But Google didn’t change their authority model.
Topsy’s model: graphs of people
What we do is, we build graphs of people. So we don’t build a graph of websites—we build a graph of people. And we compute the authority of people—what we call “influence”—based on the likelihood that a single post from a single individual is going to get attention. So that has the weight—it’s basically like PageRank for Twitter. And then we apply that [weight] to everything that that person says.
What we do when you search on Topsy, is we go and match all citations, just as a Google search will match documents. But then in our index, we don’t just store the documents, we store who it came from and what it points to. And then we will collate all the things that it points to and sort those based on the influence of the people pointing to it.
So if you search for Occupy Wall Street, we’ll find all the tweets for Occupy Wall Street, we’ll find all the links and targets of those citations, and we’ll compute scores for all of those based on many criteria including the influence of the source of each tweet. And then we’ll rank them according to those scores.
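The flow Ghosh describes (match posts, collate the links they point to, score each link by who cited it) might be sketched as follows. This is an illustration only, assuming influence is already known per author and using summed influence as the sole scoring criterion, whereas Topsy reportedly uses many. The `Post` type, `rank_targets`, the account names, and the URLs are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    author: str
    text: str
    links: tuple  # URLs cited by the post


def rank_targets(posts, influence, query):
    """Match posts on the query, then score each cited link by the summed
    influence of the distinct authors who cited it."""
    citers = defaultdict(set)
    needle = query.lower()
    for post in posts:
        if needle in post.text.lower():
            for url in post.links:
                citers[url].add(post.author)
    scores = {
        url: sum(influence.get(author, 0.0) for author in authors)
        for url, authors in citers.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


influence = {"alice": 0.9, "bob": 0.2, "carol": 0.1}  # made-up values
posts = [
    Post("alice", "Live from Occupy Wall Street", ("https://a.example/photo",)),
    Post("bob", "occupy wall street schedule", ("https://b.example/schedule",)),
    Post("carol", "Occupy Wall Street schedule again", ("https://b.example/schedule",)),
    Post("bob", "lunch", ("https://c.example/menu",)),
]
ranked = rank_targets(posts, influence, "occupy wall street")
for url, score in ranked:
    print(f"{score:.2f}  {url}")
```

Here a single citation from a high-influence author outranks two citations from low-influence authors, and the unrelated post never enters the ranking.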
Basically, Topsy searches are ranked based on the number of tweets to them, and the influence of the people who did those tweets. Now, this is a model that’s not limited to Twitter. It’s for any kind of social content, including blogs, reviews, and Google+ postings. That’s what we launched last week: Google+. Google’s own Google+ search does not rank based on relevance and authority—they just use a version of subjective search, so if you’re logged in you’ll see some stuff, and if you’re not logged in you’ll see different stuff. We use the same ranking model I’ve described here for Google+. We build a graph of Google+ users based on how much attention each post of theirs gets, and then we rank search results based on citations from Google+ users.
What’s interesting is that we can actually do searches for experts. So we don’t just have to collate on the target, we can collate on the source. So we find the important people for any given word or phrase, based on how much they’ve been talking about it vs. what else they talk about, and their influence on that topic. Which means that if you search on Google+ for people for Linux, you’ll get some pretty random crap. If you search on Topsy you’ll get Linus Torvalds as number one.
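Collating on the source rather than the target could look something like the sketch below. It assumes one simple scoring rule that is not confirmed by the interview: an author's score is their influence multiplied by the share of their posts that mention the query. The function name `rank_experts`, the account names, and the influence values are hypothetical.

```python
from collections import Counter


def rank_experts(posts, influence, query):
    """posts: iterable of (author, text). Score = influence * topic share."""
    needle = query.lower()
    total = Counter()
    on_topic = Counter()
    for author, text in posts:
        total[author] += 1
        if needle in text.lower():
            on_topic[author] += 1
    scores = {
        author: influence.get(author, 0.0) * on_topic[author] / total[author]
        for author in on_topic
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


influence = {"kernel_dev": 0.8, "generalist": 0.9}  # made-up values
posts = [
    ("kernel_dev", "Linux 3.1 released"),
    ("kernel_dev", "Linux merge window open"),
    ("generalist", "Linux is neat"),
    ("generalist", "coffee"),
    ("generalist", "weather"),
    ("generalist", "traffic"),
]
experts = rank_experts(posts, influence, "linux")
print(experts)
```

The account that talks mostly about the topic ranks first even though the other account has higher overall influence, which matches the "how much they talk about it vs. what else they talk about" criterion.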
Tracking the Arab Spring
RG: If you search Google for Egypt images, you’ll get pyramids and camels, and that’s ridiculous. If you search on Topsy you’ll get pictures of Tahrir Square from people’s camera phones, taken on the spot. If you look at Twitter, it’s real-time but it’s not ranked, so it’s just real-time signal passing you by. What we’ve built at Topsy is a way to get real-time, fresh, relevant results.
I started following this a lot during the Arab Spring, because I had been to all these places during my previous open source life. I spoke at the Wikipedia conference in Alexandria, Egypt, and I know some bloggers in Egypt. Some friends of mine were in Tahrir Square. So I started using Topsy to see what was going on.
I did an expert search on Egypt and I found Alaa right on top. He’s the blogger who was jailed for blogging, and was one of the Linux distributors, and a big blogger; he tweets a lot and he was in Tahrir Square. He became really influential and got lots and lots of followers as this was happening. Then after a few months, NPR was the most influential on Egypt because, what happens is when a topic becomes big, then the mainstream takes over and people who were on the ground fall into more specialized niches.
During the protests in the Arab Spring, a lot of the protests were being distributed as a date. What we were able to do after the fact is analyze a lot of the data on the protest and do this share of voice analysis. So you would actually see how fast a particular protest was taking off—how much traction it got, and whether it was sustained or whether it disappeared. So you would be able to predict whether something was going to happen or not, or at least whether something was gaining traction in the social media space. Whether they’re actually going to turn out on the streets is a different matter.
Because we do this through pretty complex search technology, we’re able to do things like, say, track Bahrain protests excluding Reuters and CNN and Al Jazeera, which gives you a sense of how much it’s taken off among ordinary people. And then it takes off when the mainstream media gets to it. So you have that gap between when ordinary people are talking about it and when the mainstream media gets to it.
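The kind of filtered tracking described above can be approximated with a per-day mention count that optionally excludes a set of accounts. This is a simplified sketch, not Topsy's query engine; the `daily_mentions` function, the account names, and the sample posts are invented for illustration.

```python
from collections import defaultdict


def daily_mentions(posts, query, exclude=frozenset()):
    """posts: iterable of (day, author, text). Returns {day: mention count}."""
    needle = query.lower()
    counts = defaultdict(int)
    for day, author, text in posts:
        if author in exclude:
            continue  # drop mainstream outlets to isolate ordinary users
        if needle in text.lower():
            counts[day] += 1
    return dict(counts)


posts = [
    (1, "resident_a", "Bahrain protest tomorrow"),
    (1, "resident_b", "bahrain protest at the roundabout"),
    (2, "resident_a", "Bahrain protest growing"),
    (2, "newswire", "Bahrain protest enters second day"),
    (2, "tv_network", "Bahrain protest coverage"),
]
everyone = daily_mentions(posts, "bahrain protest")
grassroots = daily_mentions(
    posts, "bahrain protest", exclude={"newswire", "tv_network"}
)
print(everyone, grassroots)
```

Comparing the two series day by day shows the gap Ghosh mentions between when ordinary people start talking about an event and when the mainstream media picks it up.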
The iPhone 4S launch, and a trading opportunity
Ghosh and I discussed a bit about the use of Topsy as an input to certain algo-driven trading platforms, and the company is exploring this as a potential revenue model. Here’s an anecdote about the iPhone 4S launch from the end of that part of the conversation.
RG: We’ve got a bunch of metrics, including things like sentiment and influence filter counts, so that you can identify things that are going up or down; we’re able to identify related keywords. If you’re interested in Apple, we can tell you what is the aggregate sentiment for all the things around Apple.
For example, when the iPhone 4S launched, the tech media and bloggers were generally unhappy with it because they were expecting the iPhone 5, so they were disappointed. That led to the analysts being disappointed, which led to the stock price falling. Mashable asked us to do a Twitter sentiment study around it, and what we found was that the Twitter response was by and large neutral, but more positive than negative. So it was like 21 percent positive, 17 percent negative.
JS: So then there was a buy on the dip moment.
RG: Exactly, so if you looked at the Twitter sentiment, then you could be like, “all of these guys, yeah sure they’re big bloggers, but they’re not the customers. They’re the people who are disappointed that it’s not the new shiny device. But are there other people like that?” And it turns out that there weren’t. So if you had bought $AAPL based on the social media sentiment of consumers who are actually going to buy [the iPhone 4S], then you would have bought the stock [when it was selling off] and you would have made a killing in the next two weeks.
So that’s a very tangible example of how the sentiment worked in this case, and how it worked in a way that we were able to track related keywords, e.g. iPhone, iPhone 4, iPad.
A shift in the information flow: from publication to conversation
RG: What our work shows is that there’s a real shift in how information has been flowing. In the days when Google started, information was disseminated through publication. If it wasn’t print publication, it was web publication. So you had a curated process. The process of creating content was a little harder, and you had a source of authority, because this came out on a website and that domain had authority over it.
Now, however, information is disseminated through conversations. These are public conversations. This was always how information got disseminated before, but you couldn’t capture it. It was just people talking in some way that wasn’t captured—we might be talking in a cafe or protesting. It’s not meant to be private—it is public, but it’s not being captured, so it’s only accessible to the people who are physically present.
Whereas what has happened over the past few years is that most new information is coming out in the form of conversations that are spreading beyond a physical presence. So it’s wider conversations that then get recorded for posterity, but that also allows you to get a lot more signal. Previously, the only source of information was what professionals had edited together and published officially, and that’s kind of the Google model. That’s why when you search for any proper noun, the Google top results are going to be Wikipedia results.
JS: They’re sort of “establishment” results—your institutions and your publications, your New Yorkers and New York Times’s. So you’re saying that Google gives you establishment results, and the way that it incorporates individuals into the process is as news consumers—by tailoring those results to you.
RG: Google’s News search is total… there’s no ranking there. They’ll show you a Reuters story and tell you that it’s from some newspaper in Ohio, and you go to that newspaper and you see that it’s a Reuters story, and you’ll wonder why the Reuters link isn’t on it. And that’s because they don’t actually use their web-based ranking for that because they can’t, because it’s new so there are no [earlier] links to it. When the news about the iPhone 4S [first] comes out, there are no links to it because it’s brand new. How is Google going to show that with the way they model? They can’t, but we can, because we can say, “we don’t know this term, and we don’t have any links to it previously, but people who have influence are talking about this term in connection with these links, so those are the important links for this term.” So that’s why we’re able to provide ranking and relevance in real-time.
JS: So is it fair to summarize by saying that Google treats the web like a curated archive that you want to search, and Topsy treats it like a free-form conversation that you want to filter for specific bits?
RG: Yes, exactly. You want to filter, and you want to get what’s most important.
You don’t need to index everything—only what’s important
RG: Nobody in search cares about indexing everything. That’s what they say, but nobody gives a shit about everything. Users don’t care. They only care about the right results, so users don’t generally look beyond the first few pages. So if you were able to provide only the first few pages of results for every query through some form of magic, you wouldn’t need to index everything.
The problem is, with the model of ranking that web search engines traditionally have, which is the links between websites, they actually have to index everything before they know what is important. We don’t have to do that, because what is important to us is what people are talking about, and if we index all the things that people are talking about—if we go and index only those pages—then depending on how much money we have and how much capital we have, we can index the top 1 percent of the web, the top 20 percent of the web, or anywhere in between.
But it’ll always be the top. We don’t have to index garbage. So it’s a very efficient way of building out a search index.
So we had to build a completely different architecture. We built most of our technology in-house and ground-up, so our search engine is all in C++. We’ve got an in-memory graph of 100 billion edges that’s real-time, live, and constantly updating at the rate of tens of thousands of inserts per second. And that is completely different from the typical Big Data infrastructure of MapReduce and large batch jobs. Because batch jobs just don’t work when you’re trying to do real-time ranking.
So we built this whole pipeline architecture, where you have data coming in—whether it’s from the Twitter firehose or Google+ or something else—and you have interpreters for each of those things that convert it into our standardized format.
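The interpreter stage of such a pipeline can be pictured as one small adapter per source, each emitting the same record type. The sketch below is in Python for brevity (Ghosh says the real engine is C++), and everything in it is an assumption: the `Citation` record, the function names, and especially the raw field names, which are illustrative and not the actual Twitter or Google+ schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source: str   # which network the post came from
    author: str
    text: str
    links: tuple  # URLs the post points to


def interpret_twitter(raw):
    # Field names are illustrative, not the real firehose schema.
    return Citation("twitter", raw["user"], raw["text"],
                    tuple(raw.get("urls", [])))


def interpret_google_plus(raw):
    # Field names are illustrative, not the real Google+ schema.
    return Citation("google+", raw["actor"], raw["content"],
                    tuple(raw.get("attachments", [])))


INTERPRETERS = {
    "twitter": interpret_twitter,
    "google+": interpret_google_plus,
}


def normalize(stream):
    """stream: iterable of (source_name, raw_dict) pairs."""
    for source, raw in stream:
        yield INTERPRETERS[source](raw)


incoming = [
    ("twitter", {"user": "alice", "text": "photo from the square",
                 "urls": ["https://a.example/photo"]}),
    ("google+", {"actor": "bob", "content": "long post on search"}),
]
citations = list(normalize(incoming))
print(citations)
```

Because every downstream stage sees only `Citation` records, adding a new source in this design means writing one more interpreter rather than changing the ranking code.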
The role (or not) of social media in the uprisings
RG: You know the whole debate about whether the Arab Spring was social media driven. I’m kind of skeptical about it, and I was skeptical about it then because it’s only a small minority of people who were using Facebook and Twitter, and the countries blocked them. I think what was key was that Facebook and Twitter provided… Well, it’s not that people weren’t frustrated, it’s not that they weren’t ready to brave bullets. It’s that their frustration and braving bullets is an information problem.
You might be ready to go and protest your dictator, but you don’t want to be the only one doing it. You want to go out if there are a million people going out. But how do you know if there are a million people going out? You don’t. If you had to rely on pieces of paper and photocopies… when you rely on an older medium, things happen more slowly. Maybe you feel like going out today, but by the time you know that hundreds of thousands of people feel like going out, maybe you don’t feel like going out because a week has passed and you’re nervous. Whereas, with social media, if you feel like going out today, within a few hours you know if everyone is going out today or is going out next week. So that’s the information thing—the knowledge that other people share your perspective and are going to do this spreads faster with social media.
So we can’t tell whether [the Arab Spring] would have happened or not without social media, but we can tell that social media did play a role in it.
But in the case of Occupy Wall Street, that’s probably the first social media driven protest. Because without social media, it’s very unlikely that that would’ve happened. Because the level of frustration, and the lack of other outlets for protest, is not as big of an issue here. In Cairo, there’s nothing you can do other than go out on the street; but here, there are lots of other things you could do, and the level of frustration is much much less. So the need to know that there were other people who were going to go out was much higher, and therefore social media really helped that much more here.
Also, there’s the fact that [at #OWS] everybody is on Twitter. [In the Mideast] that’s not the case. Most of the [Arab Spring] protesters weren’t on Twitter, but here everybody is. So [#OWS] was probably the first true social-media driven protest… Well, actually the Tea Party was probably the first. They used Twitter a whole helluva lot. We did a whole thing on political hashtags, and the right wing political hashtags are really big. After Obama, Michele Bachmann is one of the most tweeted about politicians, with strong negative and positive sentiment.
We did a whole interesting analysis of intense sentiment hashtags, and it was interesting that the most intense negative sentiment hashtags were things about politics. And the most intense positive sentiment hashtags were, firstly, not that intense, and secondly pretty boring sounding—travel, photography.
JS: So people don’t like things with the fervency with which they hate them.
RG: Certainly not when they converse. And that’s kind of normal. If you’re in a crowd, you’re not going to be yelling in the street, “oh I love travel!” You’re going to be like, “I hate Obama,” or “I hate Perry,” or whoever. Which means that actually you have to keep that in mind. You have to be able to normalize levels of sentiment for everything.
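One possible reading of "normalize levels of sentiment" is to compare each topic against the typical intensity for its own polarity, so that mildly expressed positive sentiment is not drowned out by loudly expressed negative sentiment. The sketch below assumes that reading; Ghosh does not say how Topsy normalizes. The `normalize_by_polarity` function, the hashtags, and the intensity values are all made up, and the code assumes no polarity has an average intensity of zero.

```python
from statistics import mean


def normalize_by_polarity(intensity):
    """intensity: {hashtag: (polarity, raw_intensity)}.
    Divides each raw intensity by the mean intensity of its polarity."""
    polarities = {polarity for polarity, _ in intensity.values()}
    baseline = {
        polarity: mean(v for p, v in intensity.values() if p == polarity)
        for polarity in polarities
    }
    return {tag: v / baseline[p] for tag, (p, v) in intensity.items()}


raw = {
    "#politics": ("neg", 0.9),
    "#taxes": ("neg", 0.7),
    "#travel": ("pos", 0.3),
    "#photography": ("pos", 0.2),
}
normalized = normalize_by_polarity(raw)
print(normalized)
```

After normalization, a value above 1 means a hashtag is more intense than is typical for its polarity, which lets positive and negative topics be compared on the same scale.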