Originally posted by Stefano Costantini ’15 on the Barcelona GSE Data Scientists blog. Stefano is on Twitter @stefanoc.
At the Renyi Hour on November 13th 2014, Frederic Udina gave a talk on big data and official statistics. Apart from being a professor at UPF and BGSE, Frederic is Director of IDESCAT, the statistical institute of Catalonia.
In his talk, Frederic compared “traditional” official statistics – slow to produce, with well-defined privacy limits and access rights – to “big data”, which is fast to produce, volatile and has fuzzy privacy limits. Frederic highlighted the tension between these two worlds, focusing particularly on the need for official statistics to become easier to collect, organise and customise to the needs of the end user. In particular, Frederic identified the opportunity for IDESCAT (and other statistical institutes) to integrate officially collected information with alternative information sources, such as:
- Administrative data
- Data freely available in society
- Data from private companies
Frederic outlined IDESCAT’s plan to move away from the current data generation system (the ‘stove pipe model’), which is slow, expensive and inefficient because it does not re-use information already collected, towards a fully integrated model (‘Plataforma Cerdà’) in which any new information must be integrated with existing data.
Frederic noted that data is becoming increasingly important in society, and that this is beginning to be recognised by official statistical institutions. In particular, Frederic discussed the Royal Statistical Society’s Data Manifesto, in which the RSS notes that data is:
- A key tool for better, informed policy-making
- A way to strengthen democracy and trust
- A driver of prosperity.
Frederic also stressed the importance of confidentiality and privacy with regard to data availability. While it is desirable for some data to be freely available to the public, confidentiality and privacy should always be protected. At the same time, it is important to strike the right balance between access and privacy, ensuring that sensitive personal data is protected without preventing important information from being used in ways that may ultimately benefit wider society. Personal health records are a classic example of this.
Frederic concluded his talk by providing some examples of national statistical authorities integrating official statistics with widely available information to carry out interesting new analyses. Examples include:
- Producing origin/destination matrices between territorial units (usually municipalities) for travel to work or study, using mobile phone trajectories (ISTAT, Statistics New Zealand)
- Using Google Trends to estimate and predict labour market indicators, including monthly forecasts and small-area estimation (ISTAT)
- Measuring the use of ICT in firms, using web scraping and text mining techniques
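
To make the first example more concrete, here is a minimal sketch of how origin/destination counts between municipalities might be derived from mobile phone trajectories. It is not the method used by any of the institutes mentioned in the talk. The function names, the night and daytime hour windows, and the toy data are all assumptions made for illustration: each phone's "origin" is taken to be the area where it is seen most often at night, and its "destination" the area where it is seen most often during working hours.

```python
from collections import Counter, defaultdict


def most_common_area(pings, hours):
    """Return the area seen most often during the given hours, or None."""
    counts = Counter(area for hour, area in pings if hour in hours)
    return counts.most_common(1)[0][0] if counts else None


def od_matrix(trajectories, night=range(0, 6), day=range(9, 17)):
    """Count (origin, destination) pairs across all phones.

    trajectories maps a phone id to a list of (hour, area) pings.
    The hour windows are illustrative assumptions, not an official rule.
    """
    matrix = defaultdict(int)
    for pings in trajectories.values():
        origin = most_common_area(pings, night)
        destination = most_common_area(pings, day)
        # Skip phones for which either location cannot be inferred.
        if origin is not None and destination is not None:
            matrix[(origin, destination)] += 1
    return dict(matrix)


# Hypothetical toy data: municipalities are labelled "A", "B" and "C".
trajectories = {
    "u1": [(2, "A"), (3, "A"), (10, "B"), (14, "B")],
    "u2": [(1, "A"), (11, "B"), (12, "B"), (15, "C")],
    "u3": [(4, "C"), (10, "C")],
    "u4": [(10, "B")],  # no night-time pings, so this phone is skipped
}

flows = od_matrix(trajectories)
print(flows)
```

In this toy data, two phones are assigned to the flow from A to B and one stays within C. A real pipeline would also have to deal with the confidentiality issues discussed above, for example by suppressing flows with very small counts.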