A new and exciting dataset is available. It contains the number of visitors, average visit time, tweets on Twitter, and likes on Facebook for a set of thousands of web pages. The data is aggregated in 5-minute windows over a period of 48 hours.

We are inviting researchers to participate in a competition: an ECML/PKDD Discovery Challenge that consists of predicting the total activity of a web page after 48 hours by observing only its first hour of life. This is an important task that has significant practical applications.
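To make the task setup concrete, here is a minimal sketch of the arithmetic involved, not the challenge's official format or baseline. It assumes each page is represented as a plain list of per-window visit counts; the names `split_series` and `fit_ratio_baseline` are hypothetical. With 5-minute windows, the first hour covers 12 windows and the full 48 hours cover 576.

```python
# Hypothetical layout: one list of per-window visit counts for each page.
WINDOW_MINUTES = 5
WINDOWS_PER_HOUR = 60 // WINDOW_MINUTES   # 12 windows in the observed hour
TOTAL_WINDOWS = 48 * WINDOWS_PER_HOUR     # 576 windows in the full period


def split_series(visits):
    """Return (first-hour windows that may be observed, 48-hour total to predict)."""
    if len(visits) != TOTAL_WINDOWS:
        raise ValueError(f"expected {TOTAL_WINDOWS} windows, got {len(visits)}")
    return visits[:WINDOWS_PER_HOUR], sum(visits)


def fit_ratio_baseline(series_list):
    """Naive baseline (an assumption, not the challenge's method):
    total visits ~ k * first-hour visits, with k pooled over all pages."""
    first_hour = 0
    total = 0
    for visits in series_list:
        observed, target = split_series(visits)
        first_hour += sum(observed)
        total += target
    return total / first_hour if first_hour else 0.0
```

A prediction for a new page would then be `k * sum(first_hour_windows)`. A pooled ratio like this ignores the social signals (tweets, likes) in the dataset, so it is only a reference point for comparing better models.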


Dataset available courtesy of Chartbeat Inc.

Carlos Castillo and Josh Schwartz
Predictive Web Analytics Challenge Co-Chairs

Presentation on November 14th, 2013 at the Tow Center, Columbia Journalism School (New York, USA).

I had the privilege of working with Wei Chen (Microsoft Research) and Laks V.S. Lakshmanan (University of British Columbia) on a book for the Synthesis Lectures on Data Management series, edited by M. Tamer Özsu and published by Morgan and Claypool.

This book starts with a detailed description of well-established diffusion models, including the independent cascade model and the linear threshold model, that have been successful at explaining propagation phenomena. We describe their properties as well as numerous extensions to them, introducing aspects such as competition, budget, and time-criticality, among many others. We delve deep into the key problem of influence maximization, which selects key individuals to activate in order to influence a large fraction of a network. Influence maximization in classic diffusion models, including both the independent cascade and the linear threshold models, is computationally intractable, more precisely #P-hard, and we describe several approximation algorithms and scalable heuristics that have been proposed in the literature. Finally, we also deal with key issues that need to be tackled in order to turn this research into practice, such as learning the strength with which individuals in a network influence each other, as well as the practical aspects of this research, including the availability of datasets and software tools for facilitating research. We conclude with a discussion of various research problems that remain open, both from a technical perspective and from the viewpoint of transferring the results of research into industry-strength applications.
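As an illustration of the independent cascade model and of greedy seed selection for influence maximization, here is a minimal sketch. It is not code from the book. The graph layout (a dict mapping each node to a list of `(neighbor, probability)` pairs) and the function names are assumptions made for this example, and the spread is estimated by plain Monte Carlo simulation, which is simple but far slower than the scalable heuristics the book covers.

```python
import random


def simulate_ic(graph, seeds, rng):
    """Run one independent cascade and return the set of activated nodes.

    graph maps node -> list of (neighbor, probability) pairs (assumed layout).
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor, prob in graph.get(node, []):
                # A newly active node gets a single chance to activate
                # each of its still-inactive neighbors.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active


def estimate_spread(graph, seeds, runs=200, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs)) / runs


def greedy_seeds(graph, k, runs=200):
    """Pick up to k seeds, each time adding the node with the best estimated spread."""
    nodes = set(graph)
    for edges in graph.values():
        nodes.update(neighbor for neighbor, _ in edges)
    chosen = []
    for _ in range(min(k, len(nodes))):
        candidates = sorted(nodes - set(chosen))  # sorted for reproducible ties
        best = max(candidates,
                   key=lambda n: estimate_spread(graph, chosen + [n], runs))
        chosen.append(best)
    return chosen
```

For example, `greedy_seeds(graph, 2)` returns the two nodes whose combined estimated spread was highest under this simulation. Each greedy step re-simulates every candidate, so the cost grows quickly with network size.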

The book is available for USD 20 or through many libraries:

Two chapters are available for free:

Wired UK, 30 September 2013.

Katie Collins covers part of our work in Social Computing and Social Innovation at QCRI:

On 24 September a 7.7-magnitude earthquake struck south-west Pakistan, killing at least 300 people. The following day Patrick Meier at the Qatar Computer Research Institute (QCRI) received a call from the UN Office for the Coordination of Humanitarian Affairs (OCHA) asking him to help deal with the digital fallout -- the thousands of tweets, photos and videos that were being posted on the web containing potentially valuable information about the disaster.

[...] AIDR (Artificial Intelligence for Disaster Response) was the second project tested for the first time during the Pakistan floods, and is due to be launched officially at the CrisisMappers conference in Nairobi in November. It's an open-source tool relying on both human and machine computing, allowing human users to train algorithms to automatically classify tweets and determine whether or not they are relevant to a particular disaster.

In Pakistan, SBTF volunteers tagged 1,000 tweets, out of which 130 were used to create a classifier and train an algorithm that could be used to recognise relevant tweets with up to 80 percent accuracy ...

Full article in Wired UK.

QCRI/AJE press release: QCRI and Al Jazeera launch predictive web analytics platform for news

New platform developed by QCRI and Al Jazeera can predict visits to news articles by taking cues from social media

News organisations have vast archives of information, as well as a number of web analytic tools that aid in allocating editorial resources to cover different news events, and capitalise on this information. These tools allow editors and media managers to react to shifts in their audience’s interest, but what is lacking is a tool to help predict such shifts.

Qatar Computing Research Institute (QCRI) and Al Jazeera are announcing the launch of FAST (Forecast and Analytics of Social Media and Traffic), a platform that analyses in real-time the life cycle of news stories on the web and social media, and provides predictive analytics that gauge audience interest.

“The explosion of big data in the media domain has provided QCRI an excellent research opportunity to develop an innovative way to derive value from the information,” said Dr Ahmed Elmagarmid, Executive Director of QCRI. “Together with our valued partner, Al Jazeera, the QCRI team has developed a platform that will help shift the way media does business.”

“Al Jazeera English’s website thrives on good original content in news and features, dynamic ways of creativity through interactive and crowd sourcing methods, and up-to-date social media tools. We welcome working with QCRI in developing FAST as it allows us to understand the consumption of news and what is expected to do well in driving traffic forward. Analytics in predicting the future trend of a web story is a crucial component in understanding web traffic, this initiative is a component we welcome,” said Imad Musa, Head of Online for Al Jazeera English.

You can test the platform at http://fast.qcri.org/ and read the full press release at the QCRI website. The system is based on research described in the following paper:

