Keeping an Eye on Search Engine Diversity

Wednesday, December 24th, 2008

Last week, I came across two fascinating stories that show some of the problems relating to search engine diversity. Chris Soghoian writes how Google’s restrictive Adwords policy stands in the way of informing searchers about public data about political donations. Paul Alan Levy comments on a story about a company called Lifestyle Lift that tries to lead users away from a critical website by optimizing its presence on the Google search result page.

Research shows that the majority of search engine users stay on the first page of results and usually click on the first few results only. This increases the importance of being at the top of the organic results. In addition to offering a competitive way to be found in the organic results, search engines allow organizations to bid on search terms, which leads to sponsored results. This is the dominant search engine business model. Sponsored links typically show up on the right-hand side of the search result page, or at the top of the page if they perform above a certain threshold set by the search company. For practical purposes I will restrict my comments to Google.

What is clear is that the two sets of results represent two different information flows. The flow of organic results is the result of the crawling and ranking of sources on the Web. The sponsored results are the result of the Adwords auctioning system, or other forms of sponsoring, such as Google Grants or promotion by Google itself. Research shows that about half of the Internet users might be unaware of a difference between organic and sponsored results.

Although it seems to make sense to require transparency between organic results and sponsored results, in reality the difference is complex and problematic. On a superficial level, the difference reflects ideas about transparency of advertising in the media. Listeners, viewers or internet users should be able to distinguish between editorial content and advertising (in the broad sense). But, in the context of search results, the difference between a sponsored and an organic result does not tell much about the amount of money that has been paid for the result to show up. It is also quite common for a website to show up in both the organic results and the sponsored results. The different processes that govern the ranking in the two sets of results are equally opaque for users. And finally, the sponsored results are the results that are in some ways more editorial in nature. My tentative conclusion is that the difference is not well understood and of limited value. What is more important is that we evaluate the diversity and quality of search result pages in their totality.

Difference does not reflect payments

Both information flows of results, i.e. organic and sponsored, involve the spending of money to show up in the results as prominently as possible. The difference is who receives the payments, but it is clear that the user is not in a position to evaluate them. Whereas an organic result might show up for certain search terms without someone having invested any money into that, many organizations and companies use search engine optimization services to optimize their organic ranking and presence in search results. The example of Lifestyle Lift shows that sometimes complete websites are being set up to enhance presence and ranking.

It would be interesting to see how much is typically paid for organic optimization in comparison to sponsored optimization. Organic results might be seen as more trustworthy by internet users who think that they understand the difference between the sets of results. This increases the incentives to invest in organic optimization. If the ability to place sponsored links on search engine property were restricted, one could expect increased pressure on organic results. There are no editorial restrictions on advertising in organic results.

Duplication of results and sources is common

Especially for searches related to products, companies and services, it is common that some of the sponsored links are duplicates of the organic links. Duplication blurs the line between organic and sponsored results. An argument in favor of duplication is that it decreases noise for users who were looking for that source anyway. Duplication is problematic because it decreases the diversity of the results and the possibility that users are confronted with information that adds another perspective to a search. The fact that duplication is fairly common suggests that it is worthwhile for advertisers to push organic results off the first page and out of the attention of most searchers. This is especially the case if the duplication involves a sponsored result at the top of the page on the left, and the same organic result right below it. Duplication might be costly, so this is a typical case where the rich get richer because they manage to capture the attention of internet users.
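To make the duplication concern a little more concrete, here is a rough sketch of how one could count duplicated sources on a single result page. Everything in it is an assumption for illustration: the function names, the idea of comparing results by host name, and the example URLs are mine, and none of this reflects how any search engine measures diversity.

```python
from urllib.parse import urlparse


def domain(url: str) -> str:
    """Return the host name of a URL, without a leading 'www.' (a crude normalization)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def duplication_report(organic: list[str], sponsored: list[str]) -> dict:
    """Compare the sources behind the organic and sponsored results of one page."""
    organic_domains = {domain(u) for u in organic}
    sponsored_domains = {domain(u) for u in sponsored}
    return {
        # Sources that occupy both an organic and a sponsored slot.
        "duplicated_domains": sorted(organic_domains & sponsored_domains),
        # Number of different sources a user actually gets to see.
        "distinct_sources": len(organic_domains | sponsored_domains),
        # Number of result slots on the page.
        "total_results": len(organic) + len(sponsored),
    }


# Hypothetical result page: one advertiser also ranks organically.
organic = [
    "http://www.example.com/product",
    "http://reviews.example.org/critique",
    "http://example.net/forum",
]
sponsored = [
    "http://example.com/buy",
    "http://shop.example.edu/",
]
report = duplication_report(organic, sponsored)
print(report)
```

On this made-up page, five result slots expose only four distinct sources. A measure of this kind looks at the page in its totality and does not care whether a slot is organic or sponsored.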

Ranking mechanisms of both sets of results are equally opaque

The ranking of organic search results by Google is notoriously opaque. The PageRank algorithm, which uses links to establish a global measure of relevance for websites, is only one of the many (200+) factors that Google uses to establish the ranking of organic results. If one wants to know about ranking algorithms, the place to look is the search engine optimization industry. These search engine experts have a pretty good idea of how search engines work and walk the fine line between accepted optimization and being punished. Ironically, this industry is also seen as the major reason for search engines not to be more transparent about their ranking algorithms.

The ranking of sponsored results is equally opaque. I have heard some experts explain that the ranking of sponsored links reflects the advertisers’ bids, but in reality it is more complex. Google places advertisements on the basis of its keyword auction and a quality score. The quality score is determined by Google. It includes the historical click-through rate and other, partly undisclosed relevance factors. Thus, the highest-placed sponsored link is not necessarily the one that involves the highest willingness to pay. In addition, some sponsored links are placed (and can be clicked on) without any payment being made. They are sponsored by Google.
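A small sketch can illustrate why the highest bidder does not necessarily end up on top. It assumes that the placement score is simply the bid multiplied by the quality score. That multiplicative form is a simplification I use for illustration only; the actual formula and the inputs to the quality score are not disclosed in full, and the advertisers, bids and scores below are invented.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str       # hypothetical advertiser name
    bid: float            # maximum bid per click (invented amounts)
    quality_score: float  # assumed to be assigned by the search engine


def rank_ads(ads: list[Ad]) -> list[Ad]:
    """Order ads by bid * quality_score, highest first (assumed scoring rule)."""
    return sorted(ads, key=lambda ad: ad.bid * ad.quality_score, reverse=True)


ads = [
    Ad("A", bid=2.00, quality_score=3.0),  # highest bid, low quality score
    Ad("B", bid=1.20, quality_score=8.0),  # lowest bid, high quality score
    Ad("C", bid=1.50, quality_score=5.0),
]
ranked = rank_ads(ads)
for ad in ranked:
    print(ad.advertiser, ad.bid * ad.quality_score)
```

Under this assumed rule, advertiser B takes the top position with the lowest bid, and advertiser A comes last despite the highest willingness to pay. Since the quality score is set by the search engine, the bids alone do not let an outsider reconstruct the order.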

Quality control on sponsored links is stronger than for organic results

Interestingly, the sponsored results are subject to a detailed editorial policy that guarantees the quality of sponsored results. The policy of Google Adwords also includes the suppression of unauthorized usage of trademarks, as Chris Soghoian found out. There are several legal and non-legal reasons for this, e.g. the fact that search engines receive money for the placement of these results. Thus, sponsored links might end up being of higher (average) quality and more relevant for some keywords. The relative balance between the quality of organic and sponsored results is of major importance to search engines. If search engines can keep the overall quality of the results in their totality the same, while shifting the relevance of results towards the sponsored results, they make more money.

Keeping an eye on search engine diversity

My conclusion from these ideas and remarks is that we need to focus more on quality metrics of actual search result pages in their totality than on transparency between organic and sponsored results. This difference is of limited value to users. I have the impression that Google is committed to diversity of search results and have heard one Google engineer in Europe say this publicly, but more independent empirical research (Benkler 2008, p. 285) is needed in this direction. Search engines could also be more explicit about this commitment, for instance by being transparent about their commitments to diversity and implementing policies that prevent duplication of results.

Yahoo Promises to Anonymize Log Data After 90 Days

Wednesday, December 17th, 2008

Yahoo has announced that, after reviewing its retention policies for user data, it has decided to start anonymizing data after 90 days. The last block of the IP address and the cookie ID will be erased.

It’s important to note that even after this measure the data will remain very sensitive, and the process can probably be reversed because of the richness of the data that remains, but it is a big step for the industry.
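As a minimal sketch of what such a measure amounts to, the snippet below zeroes the last block of an IPv4 address and drops the cookie ID from a log record. The record layout, the field names and the example values are my own assumptions, not Yahoo’s actual log format or procedure.

```python
import ipaddress


def truncate_ipv4(address: str) -> str:
    """Zero the last block (octet) of an IPv4 address."""
    ip = ipaddress.IPv4Address(address)
    masked = int(ip) & 0xFFFFFF00  # keep the first three octets only
    return str(ipaddress.IPv4Address(masked))


def anonymize_record(record: dict) -> dict:
    """Return a copy of a log record with a truncated IP and no cookie ID."""
    cleaned = dict(record)
    cleaned["ip"] = truncate_ipv4(record["ip"])
    cleaned.pop("cookie_id", None)
    return cleaned


# Hypothetical log record; the field names are illustrative only.
record = {
    "ip": "192.0.2.123",
    "cookie_id": "abc123",
    "query": "clinic near my street",
    "timestamp": "2008-12-17T10:15:00",
}
print(anonymize_record(record))
```

The truncated address still narrows the user down to a block of 256 addresses, and the query text and timestamp are untouched, which is why the remaining data stays sensitive.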

The New York Times reports:

Privacy advocates said that the new policy was a step in the right direction and credited the change to pressure from European regulators.

“As much as the U.S. search firms talk about how they are improving their practices, I think they are really afraid that the Europeans are going to bring an enforcement action under European privacy laws,” said Marc Rotenberg, executive director of the Electronic Privacy Information Center. “That’s where the push is really coming from.”

Mosquito (almost) Illegal in the Netherlands

Tuesday, December 16th, 2008

The Dutch government has written to parliament that the use of the Mosquito will not be backed by the government. In its legal assessment, the government writes that the deployment of the Mosquito, a device that produces a high-pitched tone that annoys young people, infringes fundamental rights such as the right to free movement, the right to the integrity of the human body, and the right to equal treatment. The Mosquito was on Bits of Freedom’s list of candidates for the Big Brother Awards 2007.

K.U. v. Finland: No Data Retention Obligation

Monday, December 15th, 2008

The European Court of Human Rights has issued its judgment in the case K.U. v. Finland. The Court concludes that Article 8 of the Convention puts member states under a positive obligation to protect people against grave interferences with their private life by others on the Internet. This obligation includes that the member state has to criminalize grave interferences with the right to private life and provide for a legal framework that allows for the identification and effective prosecution of offenders.

The Court mentions that this framework has to respect the right to freedom of expression and private life of internet users. The Court does not say that the member state has to make sure that data to identify individuals are available. In fact, it says that the right to private life and freedom of expression of internet users can be interfered with legitimately only on occasion. The Court makes very clear that if identifying data of an alleged offender (the offense being a grave interference with the right to private life) are available, the law must provide for access to those data to allow effective prosecution. Here are a few of the key conclusions and considerations:

The Court concludes that grave interferences with the right to private life must be criminalized:

While the choice of the means to secure compliance with Article 8 in the sphere of protection against acts of individuals is, in principle, within the State’s margin of appreciation, effective deterrence against grave acts, where fundamental values and essential aspects of private life are at stake, requires efficient criminal-law provisions

The State’s positive obligation under Article 8 ECHR to prosecute grave interferences with Article 8 ECHR may extend to questions of criminal procedural law:

the State’s positive obligations under Article 8 to safeguard the individual’s physical or moral integrity may extend to questions relating to the effectiveness of a criminal investigation even where the criminal liability of agents of the State is not at issue.

The Court concludes that Article 8 implies that there needs to be a way to identify offenders and bring them to justice:

It is plain that both the public interest and the protection of the interests of victims of crimes committed against their physical or psychological well-being require the availability of a remedy enabling the actual offender to be identified and brought to justice, in the instant case the person who placed the advertisement in the applicant’s name, and the victim to obtain financial reparation from him.

Obviously, this need runs into other fundamental rights of internet users. In the following excerpt, the Court notes that offenders (I would say alleged offenders) can also rely on the guarantees of the Convention, in particular the right to respect for private life and the right to freedom of expression:

Another relevant consideration is the need to ensure that powers to control, prevent and investigate crime are exercised in a manner which fully respects the due process and other guarantees which legitimately place restraints on crime investigation and bringing offenders to justice, including the guarantees contained in Articles 8 and 10 of the Convention, guarantees which offenders themselves can rely on.

The Court makes clear that the prevention of crime and disorder and the protection of the rights and freedoms of others make this consideration relative:

Although freedom of expression and confidentiality of communications are primary considerations and users of telecommunications and Internet services must have a guarantee that their own privacy and freedom of expression will be respected, such guarantee cannot be absolute and must yield on occasion to other legitimate imperatives, such as the prevention of disorder or crime or the protection of the rights and freedoms of others.

From the perspective of data retention, the words to note here are “on occasion”. That could reasonably be interpreted as standing in the way of blanket data retention of Internet traffic and location data.

As TJ McIntyre concludes, the judgment raises a lot of very difficult questions. The Court concludes it is primarily up to the member states to resolve them:

Without prejudice to the question whether the conduct of the person who placed the offending advertisement on the Internet can attract the protection of Articles 8 and 10, having regard to its reprehensible nature, it is nonetheless the task of the legislator to provide the framework for reconciling the various claims which compete for protection in this context.

Dutch Supreme Court Asks Adword Questions to ECJ

Sunday, December 14th, 2008

The Dutch Supreme Court has decided to ask preliminary questions to the ECJ in the Adwords case Portakabin v. Primakabin:

“considering the great importance, beyond the national boundaries, of the application of the Internet in question, and the fact that also other supreme courts in the member states have asked questions to the ECJ that have not been answered yet.”

Search Engine Society by Alex Halavais

Friday, December 12th, 2008

Alexander Halavais has done the Web search research community a great favor with his book Search Engine Society published in the Digital Media and Society Series of Polity. The book is not only a comprehensive overview of the relevant literature about search engines from the last decade. It is also well written, concise and pleasingly balanced.

To a large extent, the book is a literature review, and it does an excellent job at that. So I will not try to write a review of that review, but simply make a few remarks and recommend that everyone interested get the book.

It’s worth noting that Halavais has not restricted himself to the social sciences, but has looked at computer science and legal scholarship as well. Although his analysis of legal and policy issues is not as advanced as his discussion of the social aspects, it is good reading for legal scholars as well, exactly because it is embedded in his discussion of the societal aspects of search engines.

Halavais makes some very interesting points about privacy and social search. He makes a connection between what we understand about the impact of search engines on society and the private (unshared) nature of the use of search engines. If we keep our searches private, only companies and others that have access to the logs, by law or through deals, will get to know what we are searching for and how this might impact us in general. I have (amateurishly) thought about this aspect of search technology myself a few times. Asking other people questions is a deep social process if you think about it. It expresses trust, curiosity, willingness to bond and a range of other fundamental social values. The reference to the predominantly anti-social nature of current search technology is a welcome and recurrent theme in the book.

There were a few points in the book where I had to disagree. For instance, Halavais describes the crawler as a simple piece of technology. In my understanding crawling and technical crawling management has grown into one of the most complex parts of major search engines.

He also states that Google has started anonymizing the logs of users who are not logged in. Unfortunately, Google will do that only in two incomplete steps, one after 9 months and another after 18 months. Even after 18 months the logs can hardly be said to be anonymous if they remain organized with a unique identifier replacing the IP address and cookie. The logs themselves simply contain too much information, as was shown by the AOL data release.

Halavais also points to the danger that Germans, by using Google and Yahoo news services, will be exposed to English-language-centric or United States-oriented news. I am sure this will go down well in Europe, and he attributes this conclusion to German scholars Machill & Beiler, but it’s hard to subscribe to it since these providers have German-language news services, in which one finds German-language news from German newspapers.

Pointing Fingers in Search Log Retention Debate in the EU

Tuesday, December 9th, 2008

The New York Times reports that Microsoft, Google and Yahoo are unwilling to abide by the demands of the Article 29 Working Party. Microsoft is willing to lower its retention periods if the others will, or if there is a guarantee that enforcement will make them. Google and Yahoo are reported to refuse further concessions with regard to their retention policies. Regulators have postponed a decision until February 2009.

John Vassallo, a lawyer for Microsoft, said Microsoft was not willing to act alone because doing so would create a commercial disadvantage.

“We support the commissioners’ recommendations but are asking them to ensure these are uniformly observed,” said Mr. Vassallo, who is based in Brussels. “Otherwise, to do so unilaterally would put us at a disadvantage.”

IWF Removes Wikipedia Entry From its Blacklist

Tuesday, December 9th, 2008

The UK Internet Watch Foundation has removed the Wikipedia entry on a controversial Scorpions album cover (Virgin Killer) from its blacklist. The statement describes its action as the outcome of its appeal process. The reasons for the removal from the censorship list are “the length of time the image has existed and its wide availability”. Maybe the IWF board tried a simple Google Image Search to get to this conclusion.

The IWF has close relationships with search engines, and the Chilling Effects Clearinghouse shows several takedown notices sent to Google (for instance for searches for 4chan). I have been trying to understand why the Wikipedia entry was not reported to these members of the IWF, but I am still looking for an answer. Anyone?

The most important question, of course, is whether this remarkable episode will lead to any improvements at the IWF. UPDATE: Chris Soghoian proposes some significant improvements for the same issue in the US at his CNET blog.

Visit and Talk at UW-Milwaukee’s CIPR

Monday, December 8th, 2008

This Friday, I will travel to Milwaukee to give a talk at the UW-Milwaukee’s Center for Information Policy Research. I will speak about freedom of expression and search engine law and policy. I particularly look forward to the discussion and meeting again with Michael Zimmer. The announcement of the lunch talk is here and here.

IWF Censors Access to Parts of Wikipedia

Sunday, December 7th, 2008

Wikinews reports that major ISPs in the UK are blocking access to parts of Wikipedia because of a Scorpions cover and a screenshot from a 1938 movie.

The blame for what’s going on here goes to the Internet Watch Foundation. The ISPs block access to child pornography based on a list managed by the IWF. The IWF could have sent a takedown request for the material to Wikipedia. As far as I can tell, Wikipedia is a responsive intermediary. If Wikipedia does not want to take the material down without a court order (which I think is reasonable considering the status of the material), there is all the less reason to put the sources on a secret blocking list.

In the case of Wikipedia this was bound to be discussed in the media. For others, a block of their websites or the page they might have been looking for could stay unnoticed. Transparency is difficult to provide because of the nature of the material, but the complete lack of accountability is unnecessary.

