Thursday, May 22
 

10:00am CEST

Masterclass: Hack your way into a big dataset with R (Masterclass ticket needed)
Thursday May 22, 2025 10:00am - 12:00pm CEST

A separate ticket is required to attend this masterclass. If you would like to attend but haven't yet purchased a ticket, please contact us at info@dataharvest.eu

R is one of the most useful programming languages in data journalism. You may have heard of it, maybe even tried it a little and found the learning curve too steep. If so, this session is for you.

We are going to spend the day looking at the European Environment Agency’s EPRTR (European Pollutant Release and Transfer Register) data – it’s a lot of data, and some of it is quite messy. It contains dozens, probably hundreds, of potential lines of investigation to be explored – and that’s what we’re going to do.

By the end of the day, you will know how to import data in an R environment, filter it, reshape it, and interrogate it. You will be able to make some basic graphs. Above all, you will be on the way to finding stories in the day’s chosen data, and be able to take your script away and use it again, or adapt it to other datasets. And, we hope, you will have the beginnings of a story idea.

We will assume that you are familiar with spreadsheets, but that you have no knowledge of R. You will not need to install anything – everything will be run on cloud instances of R.

If you’re already advanced with R, it is still worth coming along to use and share what you know, to support others, and to learn something new.

(If you already have a dataset you want to work with – bring that too!)
Speakers

Leopold Salzenstein

Data Coordinator, Arena for Journalism in Europe
Investigative data journalist

Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist turned data journalist, trainer and consultant. Works with Arena as Masterclass Coordinator for the Climate Arena fellows.
Thursday May 22, 2025 10:00am - 12:00pm CEST
Z1.16

1:00pm CEST

Masterclass: Hack your way into a big dataset with R (Masterclass ticket needed) LIMITED
Thursday May 22, 2025 1:00pm - 3:00pm CEST
A separate ticket is required to attend this masterclass. If you would like to attend but haven't yet purchased a ticket, please contact us at info@dataharvest.eu

R is one of the most useful programming languages in data journalism. You may have heard of it, maybe even tried it a little and found the learning curve too steep. If so, this session is for you.

We are going to spend the day looking at the European Environment Agency’s EPRTR (European Pollutant Release and Transfer Register) data – it’s a lot of data, and some of it is quite messy. It contains dozens, probably hundreds, of potential lines of investigation to be explored – and that’s what we’re going to do.

By the end of the day, you will know how to import data in an R environment, filter it, reshape it, and interrogate it. You will be able to make some basic graphs. Above all, you will be on the way to finding stories in the day’s chosen data, and be able to take your script away and use it again, or adapt it to other datasets.

We will assume that you are familiar with spreadsheets, but that you have no knowledge of R. You will not need to install anything – everything will be run on cloud instances of R.

If you’re already advanced with R, it is still worth coming along to use and share what you know, to support others, and to learn something new.

(If you already have a dataset you want to work with – bring that too!)
Speakers

Leopold Salzenstein

Data Coordinator, Arena for Journalism in Europe
Investigative data journalist

Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist turned data journalist, trainer and consultant. Works with Arena as Masterclass Coordinator for the Climate Arena fellows.
Thursday May 22, 2025 1:00pm - 3:00pm CEST
Z1.16

3:30pm CEST

Masterclass: Hack your way into a big dataset with R (Masterclass ticket needed) LIMITED
Thursday May 22, 2025 3:30pm - 5:00pm CEST

A separate ticket is required to attend this masterclass. If you would like to attend but haven't yet purchased a ticket, please contact us at info@dataharvest.eu

R is one of the most useful programming languages in data journalism. You may have heard of it, maybe even tried it a little and found the learning curve too steep. If so, this session is for you.

We are going to spend the day looking at the European Environment Agency’s EPRTR (European Pollutant Release and Transfer Register) data – it’s a lot of data, and some of it is quite messy. It contains dozens, probably hundreds, of potential lines of investigation to be explored – and that’s what we’re going to do.

By the end of the day, you will know how to import data in an R environment, filter it, reshape it, and interrogate it. You will be able to make some basic graphs. Above all, you will be on the way to finding stories in the day’s chosen data, and be able to take your script away and use it again, or adapt it to other datasets.

We will assume that you are familiar with spreadsheets, but that you have no knowledge of R. You will not need to install anything – everything will be run on cloud instances of R.

If you’re already advanced with R, it is still worth coming along to use and share what you know, to support others, and to learn something new.

(If you already have a dataset you want to work with – bring that too!)
Speakers

Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist turned data journalist, trainer and consultant. Works with Arena as Masterclass Coordinator for the Climate Arena fellows.

Leopold Salzenstein

Data Coordinator, Arena for Journalism in Europe
Investigative data journalist
Thursday May 22, 2025 3:30pm - 5:00pm CEST
Z1.16
 
Friday, May 23
 

1:15pm CEST

Beyond the pixels: the power of raster data in QGIS
Friday May 23, 2025 1:15pm - 2:30pm CEST
Manipulating and analyzing raster data can be intimidating, as it often appears more complex than vector data. However, raster data, such as satellite imagery or forest loss information, is essential for environmental and geographic storytelling. Among other things, it enables journalists to assess vegetation health, visualize floods or droughts, and calculate deforested areas, even when true-colour satellite imagery is obscured by clouds.

In this hands-on session, participants will learn the key functions in QGIS needed to work with raster data. This includes loading raster layers, managing projections, setting band combinations (such as false color) for analysis, styling raster layers to enhance visibility, and performing raster calculations.

To attend this session, participants should have basic QGIS skills.

Before the session, please install QGIS on your laptops and make sure it is working properly. Download from: https://www.qgis.org/en/site/forusers/download.html

If you encounter any issues during installation, this guide may help: https://www.qgis.org/resources/installation-guide/
Speakers

Federico Acosta Rainis

Data Specialist, Pulitzer Center
Federico Acosta Rainis is a data specialist at the Pulitzer Center's Environment Investigations Unit. Previously an IT consultant, he transitioned into journalism a decade ago, working with La Nación in Argentina, where he has contributed to award-winning projects. Federico received... Read More →

Kuang Keng Kuek Ser

Data Editor, Pulitzer Center
Kuek Ser Kuang Keng is the data editor for the Pulitzer Center, where he supports investigative journalists of the Rainforest Investigations Network (RIN) and Ocean Reporting Network (ORN) to achieve their investigation goals. Based in Kuala Lumpur, Malaysia, Keng is a digital journalist... Read More →
Friday May 23, 2025 1:15pm - 2:30pm CEST
Z2.08

1:15pm CEST

Mapping independence: why and how newsrooms should host and style their own maps
Friday May 23, 2025 1:15pm - 2:30pm CEST
Tired of Google Maps' branding and tracking? In this session, we’ll explore how newsrooms can host their own map tiles using open-source tools such as Protomaps and OpenFreeMap. We’ll look at how to reduce costs, protect reader privacy, and gain full editorial control — alongside a hands-on workshop on customizing map styles with Maputnik. No prior knowledge is required for the workshop, though some experience working with maps will be a bonus, especially for the self-hosting part. The workshop takes place entirely in a browser. After this session, attendees will be familiar with alternatives to commercial map providers and know the foundations of map styling and self-hosting, empowering them to create maps that better serve their reporting and storytelling.
Speakers

João Antunes

Software designer
I'm a software designer, with a career that spans design and engineering, focused on human-computer interaction, data and visual journalism.

Rui Barros

data journalist, Público
Portuguese data journalist currently working at Público.
Friday May 23, 2025 1:15pm - 2:30pm CEST
Z2.09

1:15pm CEST

🤖 How LLMs can classify thousands of records in minutes
Friday May 23, 2025 1:15pm - 2:30pm CEST
You find yourself staring at a dataset with tens or hundreds of thousands of rows. Maybe you want to get up-to-date FOIA contact details for all government departments in your country, or to find out which political donors have links to the fossil fuels industry. What do you do?

Large Language Models (LLMs) can help journalists automate simple research and classification tasks that would take an unreasonably long time to do manually.

In this session, we'll outline how Global Witness has used LLMs, search engines and web scraping to help us identify fossil fuel lobbyists at COP29. These techniques can be applied to other investigations and research tasks.

The workshop will cover:

- An interactive classification demo
- Some basic tips on setting up a research/classification project
- The challenges of doing AI research at scale and how to address them
- Using more advanced tools

After attending this session, you will be able to take an existing dataset and automatically augment it with new data, opening up the potential for new stories and investigations.

If you want to follow along with the classification demo, you'll need to be able to run Jupyter Notebooks on your device or have a Google account. A basic understanding of Python would be useful, but we won't be writing any new code.
Speakers

Nicu Calcea

Senior Data Investigator, Global Witness
I’m a journalist with 14 years of experience in media, specialised in data reporting. Currently based in London.

Teodora Ćurčić

investigative journalist & multimedia producer, Center for Investigative Journalism of Serbia (CINS)
Friday May 23, 2025 1:15pm - 2:30pm CEST
Z2.10

3:00pm CEST

AI in code editors: save time when writing code
Friday May 23, 2025 3:00pm - 4:15pm CEST
In this session we will explore AI-assisted code editors and go over the obscure but necessary features that enable you to write web scrapers easily. We'll walk through examples of how to approach unfamiliar pages and web technologies, and show how these editors are already being used to speed up development substantially.

To get the most out of this session, you should have basic knowledge of web scraping in Python.
Speakers

Klil Eden

Director, Vandelay Analytics
I work with organizations to develop robust, data-intensive investigative software and tech capacity for strategic projects, including digital defense. I've worked in public benefit technology in many roles -- engineering, product leadership, infosec, and analytics. My work supports... Read More →

Max Harlow

Bloomberg News
Max Harlow is a data reporter at Bloomberg News in London. He uses data and documents to cover topics including money in politics, corporate sleaze and international trade. He formerly worked at the Financial Times. He also runs Journocoders, a community group for journalists to develop... Read More →
Friday May 23, 2025 3:00pm - 4:15pm CEST
Z2.08

3:00pm CEST

Browser puppetry: Playwright for dynamic website scraping
Friday May 23, 2025 3:00pm - 4:15pm CEST
Playwright is a next-generation browser automation tool that allows you to use Python or JavaScript to scrape almost any web page. It can assist in downloading pages of government documents, capturing tweets before they get deleted, or simply breaking past the cookie consent banner. Beyond the basics, it can also easily take screenshots, monitor and log network requests, and even fit right into your traditional BeautifulSoup scraping approach.

We'll look at:

- Installing Playwright
- Accessing elements on the page
- Interacting with web pages (clicking, navigating, filling out forms)
- Taking screenshots
- Sending pages to traditional scraping tools like BeautifulSoup
- Common patterns including pagination and CAPTCHA breaking

For those of you familiar with tackling similar problems using Selenium: Playwright is a similar tool with a better interface, better install/upgrade process, and ten times the usability. It might be time to upgrade!

Participants should have a basic knowledge of Python and HTML, but we'll also cover how to breeze past those basics with AI assistance. To fully participate, participants should have Jupyter installed. Additional software and installation tips will be available at https://github.com/jsoma/dataharvest25-playwright-scraping
Speakers

Nicu Calcea

Senior Data Investigator, Global Witness
I’m a journalist with 14 years of experience in media, specialised in data reporting. Currently based in London.

Jonathan Soma

Professor, Columbia University
Jonathan Soma is Knight Chair in Data Journalism at Columbia Journalism School, where he directs both the Data Journalism MS and the summer intensive Lede Program. His courses there cover everything from basic Python and analysis to ai2html and machine learning. Right now he's very... Read More →
Friday May 23, 2025 3:00pm - 4:15pm CEST
Z2.09

3:00pm CEST

Cracking the code: how to use RegEx in your investigations
Friday May 23, 2025 3:00pm - 4:15pm CEST
When you unlock the power of regular expressions (RegEx) you supercharge your spreadsheet!

Participants will learn how to extract hidden patterns from text, clean messy datasets, and automate repetitive tasks using the RegEx formulas within Google Sheets (but this session is also a good intro if you want to apply it in other code).

Through practical examples—like extracting donations, cleaning salutation-heavy lists, and extracting postcodes—you will leave with the confidence to apply RegEx in your day-to-day data work.

Attendees will receive a RegEx cheat sheet (customisable for their own use) and a practical demo spreadsheet to take their skills to the next level. No prior experience of RegEx is required, but you should be comfortable writing formulas in Google Sheets/Excel. The data will be shared in a Google Sheet, so you will need a Google account.
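
As a rough illustration of the kind of pattern matching described above, here is a minimal sketch in Python rather than Google Sheets formulas. It is not the session's own material: the sample rows, the simplified UK-style postcode pattern, the salutation list and the function names are all assumptions made for this example.

```python
import re

# Simplified UK-style postcode pattern (an assumption for illustration;
# real postcode validation has more edge cases).
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")

# A pound amount such as "£1,250" or "£300.50".
AMOUNT = re.compile(r"£(\d[\d,]*(?:\.\d+)?)")

# Leading salutations to strip from a name (hypothetical list).
SALUTATION = re.compile(r"^(?:Mr|Mrs|Ms|Dr|Sir|Dame|Lord)\.?\s+", re.IGNORECASE)


def extract_postcode(text):
    """Return the first postcode-like string in the text, or None."""
    match = POSTCODE.search(text)
    return match.group(0) if match else None


def extract_amount(text):
    """Return the first pound amount in the text as a float, or None."""
    match = AMOUNT.search(text)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))


def strip_salutation(name):
    """Remove a leading title such as 'Dr.' or 'Mrs' from a name."""
    return SALUTATION.sub("", name).strip()


# Made-up sample rows, not real donation data.
rows = [
    "Dr. Jane Example donated £1,250 from 10 High Street, London SW1A 1AA",
    "Mr John Sample gave £300.50, address: 4 Mill Lane, Leeds LS1 4AP",
]
for row in rows:
    print(extract_postcode(row), extract_amount(row))
```

The same patterns should carry over to spreadsheet regex functions with little change, though the exact syntax supported there may differ slightly from Python's.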
Speakers

Pamela Duncan

Data Projects Editor, The Guardian
Pamela Duncan is the acting editor of the Guardian’s Data Project team, producing high quality and exclusive data stories. She can usually be found at her desk poring over spreadsheets, translating non-machine readable PDFs into a usable format and using coding skills - most often... Read More →

Rob Gebeloff

Reporter, New York Times
Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects... Read More →
Friday May 23, 2025 3:00pm - 4:15pm CEST
Z2.10

4:45pm CEST

Data Magic made simple: three ways to crunch numbers in spreadsheets
Friday May 23, 2025 4:45pm - 6:00pm CEST
We know that thousands of lines in a dataset can be intimidating, especially if you’re not a programmer. Spreadsheets can do the heavy lifting — and mastering them is easier than you expect!

In this session, we will walk you through three different ways to dive into data using nothing but spreadsheet tools. Along the way, we’ll show you how to cross-check your calculations, ensuring your findings are accurate and reliable. Whether you’re a complete beginner or have already used spreadsheets in your work, you’ll leave with practical skills to handle data confidently without ever touching a line of code. Bring your laptop and join us to discover how easy and powerful data analysis can be!
Speakers

Pamela Duncan

Data Projects Editor, The Guardian
Pamela Duncan is the acting editor of the Guardian’s Data Project team, producing high quality and exclusive data stories. She can usually be found at her desk poring over spreadsheets, translating non-machine readable PDFs into a usable format and using coding skills - most often... Read More →

Alina Yanchur

Data and Investigative Journalist
Friday May 23, 2025 4:45pm - 6:00pm CEST
Z2.08

4:45pm CEST

Finding connections: transform your document collections into a graph visualisation
Friday May 23, 2025 4:45pm - 6:00pm CEST
A high-level overview of the GraphRAG ecosystem. We aim to show:
- What GraphRAG is, and how it works
- How to prepare your documents
- How to build your own graph
- How to interface with your graph using Python

These techniques can be used to gain a visual overview of the contents of document sets, and to find information in the documents without having to rely on keywords. Put simply, it enables you to make sense of large amounts of documents without having to read them all.

In the session you'll be making a graph from arbitrary documents, visualizing it, and using it to answer questions.

It will help if you've heard of RAG before, but this is not a prerequisite. To follow along, all you need is a browser and an e-mail address.
Speakers

Lasse Edfast

Documentary producer and data researcher, Freelance
Documentary producer, researcher and data journalist. I'm happy to help you with your Python project.

Johan Schuijt

Freelance Full Stack / Data developer
I help teams make sense of their data with small sharp custom software tools that save time.
Friday May 23, 2025 4:45pm - 6:00pm CEST
Z2.09

4:45pm CEST

Investigating British royals' wealth using geospatial tools 👑 : learn the basics of PostGIS
Friday May 23, 2025 4:45pm - 6:00pm CEST
Working with geospatial datasets is often critical to investigations dealing with where things are located, where an event occurred, land ownership, and the environment.

In this workshop, participants will learn the basics of using PostGIS to query and join geospatial datasets using the Guardian's Cost of the Crown series as a case study.

We’ll run through a brief overview of PostGIS, the spatial data types, indexes, and functions it adds to Postgres, and a few of its strengths and weaknesses as a tool.

Participants will connect to an existing database containing the data used in the investigation. They'll be guided through writing queries to reproduce the investigation's results.

To take part in this workshop, participants should feel comfortable writing basic SQL queries. Participants should bring their laptops with either DBeaver or their favorite SQL client tool installed.

Speakers

Zeke Hunter-Green

Software Developer, The Guardian
Zeke Hunter-Green is a software developer and computational journalist on the Guardian’s Digital Investigations team which develops investigative tools and contributes to journalism projects.

Kuang Keng Kuek Ser

Data Editor, Pulitzer Center
Kuek Ser Kuang Keng is the data editor for the Pulitzer Center, where he supports investigative journalists of the Rainforest Investigations Network (RIN) and Ocean Reporting Network (ORN) to achieve their investigation goals. Based in Kuala Lumpur, Malaysia, Keng is a digital journalist... Read More →
Friday May 23, 2025 4:45pm - 6:00pm CEST
Z2.10
 
Saturday, May 24
 

9:30am CEST

From source to chart 📈 : Using free tools to automate your data flows
Saturday May 24, 2025 9:30am - 10:45am CEST
For monitoring purposes and to inform our reporting, it’s helpful to keep an eye on trends over time in some datasets, for example when tracking the progress of an mpox outbreak, monitoring pre-election polls or investigating migration trends.

This is where automation comes in handy: the data is automatically grabbed from the source, reshaped and channeled into a chart, allowing data journalists and their colleagues to notice report-worthy trends early on.

To establish such a workflow, we have been relying on free tools: Datawrapper for the charts and GitHub Actions to run a Python script.

Participants will be guided through setting up such a workflow step-by-step.

🚀 In this session you will learn how to...
- collect data from a URL
- parse the data into the needed format using Python's `pandas` library
- use Python's `datawrapper` library to create a chart
- set up the script to run automatically on GitHub Actions

🔍 Prerequisites & Tools:

- required: Laptop
- required: Datawrapper API token
- required: GitHub account (if you want to automate the chart update)
- optional: Code text editor (if you prefer working with your own code for automation), such as Atom or Sublime Text
- optional: Distill browser plugin (if you want to update on click)
- optional: basic understanding of Python/coding helpful, but not required

Materials of this session will be available at: https://github.com/dw-data/dataharvest25-automate-source-to-chart
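
To give a feel for the "collect and parse" steps listed above, here is a minimal sketch. It is not the session's script: it uses only the Python standard library as a stand-in for `pandas` so that it runs anywhere, the sample data and function names are made up, and the Datawrapper upload and GitHub Actions scheduling are left out entirely.

```python
import csv
import io
from urllib.request import urlopen


def fetch_csv(url):
    """Download a CSV file and return its text (not called in the demo below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def long_to_wide(csv_text, index, column, value):
    """Pivot long rows into one row per `index` value and one column per `column` value."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = sorted({row[column] for row in rows})
    table = {}
    for row in rows:
        table.setdefault(row[index], {})[row[column]] = row[value]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([index] + columns)
    for key in sorted(table):
        # Missing combinations become empty cells.
        writer.writerow([key] + [table[key].get(c, "") for c in columns])
    return out.getvalue()


# Made-up sample in "long" format, standing in for a downloaded source file.
sample = """date,country,cases
2025-01-01,A,10
2025-01-01,B,3
2025-01-08,A,14
2025-01-08,B,5
"""
wide = long_to_wide(sample, "date", "country", "cases")
print(wide)
```

A wide table like this, with one column per series, is a common shape for line charts; the exact format a given chart type expects is something to check against the charting tool's own documentation.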
Speakers

Gianna-Carina Gruen

Head of Data-driven Journalism, DW
Gianna-Carina Grün is a data journalist and leads the data-driven journalism unit at DW that she founded in 2017. In her day-to-day she produces and edits visuals and data-driven stories and as a ddj trainer spreads enthusiasm for coding in journalism. She has a bachelor in Life Sciences... Read More →

Jonathan Soma

Professor, Columbia University
Jonathan Soma is Knight Chair in Data Journalism at Columbia Journalism School, where he directs both the Data Journalism MS and the summer intensive Lede Program. His courses there cover everything from basic Python and analysis to ai2html and machine learning. Right now he's very... Read More →
Saturday May 24, 2025 9:30am - 10:45am CEST
Z2.04

9:30am CEST

Python without the pain: write code with LLMs
Saturday May 24, 2025 9:30am - 10:45am CEST
In an age where data is an important part of impactful storytelling, journalists need tools that enable them to work with it effectively. Python is a powerful resource for analyzing and visualizing data, but it can be intimidating for those without a technical background. This workshop breaks down those barriers, showing how AI tools like ChatGPT can make coding basics approachable and accessible. By equipping journalists with these skills, the workshop aims to empower them to create richer, data-driven stories and visualisations without relying heavily on external technical support.

The session will start with an overview of Python and how AI-assisted coding works, showcasing how these tools can simplify technical challenges, followed by real-life examples. Afterward, participants will dive into a hands-on session using Jupyter Notebook to practice running and adapting Python scripts. By the end, they’ll feel more confident tackling technical problems independently.

Participants are encouraged to have Python (with Jupyter Notebook) installed on their devices, or a Google Colab environment ready. You will also need a ChatGPT account set up before attending. While familiarity with Python is helpful, it’s not required.
Speakers

Joseph Smith

Staff Software Engineer, The Guardian
Joseph Smith is the engineering lead of the Guardian's Generative AI team, which is exploring how this new technology might be safely used for journalism. Before that, he spent four years working on investigative and data journalism projects at the Guardian and building software to... Read More →

Teodora Ćurčić

investigative journalist & multimedia producer, Center for Investigative Journalism of Serbia (CINS)
Saturday May 24, 2025 9:30am - 10:45am CEST
Z2.09

9:30am CEST

Teaching LLMs to build your Machine Learning Models
Saturday May 24, 2025 9:30am - 10:45am CEST
In this practical session, participants will learn how LLMs like ChatGPT can assist in writing machine learning code for journalistic investigations.

We’ll start by prompting ChatGPT to generate code for analyzing a small dataset. Then, we’ll apply the code to a larger dataset locally. After attending this session, participants will be able to train and use this machine learning model on their own.

This method was used by Frontstory.pl to analyze thousands of messages on Telegram and reveal the scale of drug trafficking activity in Poland.

To follow along, participants should be comfortable using Python and Jupyter Notebook.
Speakers

Klil Eden

Director, Vandelay Analytics
I work with organizations to develop robust, data-intensive investigative software and tech capacity for strategic projects, including digital defense. I've worked in public benefit technology in many roles -- engineering, product leadership, infosec, and analytics. My work supports... Read More →

Anastasiia Morozova

Data and investigative journalist, Frontstory.pl/ Reporters Foundation
I’m a data and investigative journalist with a background in tracking Russian influence, disinformation operations and sanctions evasion in Europe. I’m especially interested in projects where I can combine data analysis and visual storytelling to expose hidden networks or financial... Read More →
Saturday May 24, 2025 9:30am - 10:45am CEST
Z2.10

11:15am CEST

Make your own investigative application with minimal code
Saturday May 24, 2025 11:15am - 12:30pm CEST
Getting the most out of your investigative data requires more than ad-hoc scripting. Deep research requires a persistent state, data model, user tagging, collaboration, access management, and automatic updates. In short, you need a research application. In this session, you'll learn to use an open source, low-code application to turn a simple ETL into a complex app.

To follow along comfortably, you should be familiar with basic Python scripting and REST APIs.
Speakers

Klil Eden

Director, Vandelay Analytics
I work with organizations to develop robust, data-intensive investigative software and tech capacity for strategic projects, including digital defense. I've worked in public benefit technology in many roles -- engineering, product leadership, infosec, and analytics. My work supports... Read More →

Johan Schuijt

Freelance Full Stack / Data developer
I help teams make sense of their data with small sharp custom software tools that save time.
Saturday May 24, 2025 11:15am - 12:30pm CEST
Z2.04

11:15am CEST

Making maps with code
Saturday May 24, 2025 11:15am - 12:30pm CEST
Data journalists have traditionally thought of maps and spatial calculations as a job for special mapping software, like QGIS. But it's often more efficient to do GIS work in the script in which you perform the rest of your analysis.

In this class, you will see how easy it is to work with maps within your code.

The class will be taught in R, so some familiarity is recommended, but the skills are generic to all languages.
Anyone who wishes to code along will just need to sign up for a free account on https://github.com/
Speakers

João Antunes

Software designer
I'm a software designer, with a career that spans design and engineering, focused on human-computer interaction, data and visual journalism.

Rob Gebeloff

Reporter, New York Times
Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects... Read More →
Saturday May 24, 2025 11:15am - 12:30pm CEST
Z2.09

11:15am CEST

More than just the Wayback Machine: how to investigate deleted and archived content
Saturday May 24, 2025 11:15am - 12:30pm CEST
Even among investigative journalists, web archives tend to be underrated – and undertaught. This hands-on session introduces journalists to powerful techniques for using web archives.
Participants will learn how to recover deleted or hidden content and archive key material from platforms like Instagram and X.
Using real-world examples, we’ll demonstrate how these skills can strengthen reporting across a wide range of stories, from everyday reporting to investigative longreads.
After this session, you will be able to retrieve archived content, recover deleted posts (not necessarily the same things!), and preserve online material using advanced web archiving tools and techniques. We will teach participants how to tweak the URL and use the asterisk, and we will demonstrate why the "Golden Hour" of archiving is so important in breaking news situations.
No prior experience is required—just an interest in digital sleuthing and a willingness to explore new tools.
Please bring a laptop with you, preferably with the Chrome browser installed.
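
The "tweak the URL and use the asterisk" idea mentioned above can be sketched roughly as follows. This is an illustration, not the session's material: it only builds Wayback Machine links following the publicly visible `web.archive.org/web/<timestamp>/<url>` address pattern, and the function names and the example.com page are hypothetical.

```python
WAYBACK = "https://web.archive.org"


def closest_capture(url, timestamp):
    """Link to the capture nearest to `timestamp` (YYYYMMDDhhmmss, or a prefix such as '2020')."""
    return f"{WAYBACK}/web/{timestamp}/{url}"


def capture_calendar(url):
    """An asterisk in place of the timestamp lists every capture of one URL."""
    return f"{WAYBACK}/web/*/{url}"


def captures_under(prefix):
    """A trailing asterisk lists every archived URL that starts with `prefix`."""
    return f"{WAYBACK}/web/*/{prefix.rstrip('/')}/*"


page = "example.com/press/statement.html"  # hypothetical page
print(closest_capture(page, "20240101"))
print(capture_calendar(page))
print(captures_under("example.com/press/"))
```

Opening the last link in a browser is one way to look for pages that have been removed from a site but were captured while they were still online.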
Speakers

Jasmine Jacot-Descombes

OSINT-Journalist (Video), Neue Zürcher Zeitung

Jan Ludwig

Team Lead OSINT, Neue Zürcher Zeitung
Saturday May 24, 2025 11:15am - 12:30pm CEST
Z2.10

1:45pm CEST

Extract all relations in Game of Thrones using AI and Python
Saturday May 24, 2025 1:45pm - 3:00pm CEST
AI models are great tools for structuring unstructured data, especially if you know some basic Python. These language models can be used to find and classify all arguments in a set of documents, compile statistics on political debates or hunt for greenwashing in corporate reports. In this session we will extract all relations in the Game of Thrones series and plot them as a network map. Come along if you want to know who slept with whom! (Here is an example of what it looks like: https://lasseedfast.se/got/ ). After this session, you will be able to use AI in combination with Python to systematically extract pieces of information from a large volume of data.

To follow along, participants need a basic understanding of Python.
Install Ollama and download one of the models to follow along on your own laptop.
Speakers

Lasse Edfast

Documentary producer and data researcher, Freelance
Documentary producer, researcher and data journalist. I'm happy to help you with your Python project.

Johan Schuijt

Freelance Full Stack / Data developer
I help teams make sense of their data with small sharp custom software tools that save time.
Saturday May 24, 2025 1:45pm - 3:00pm CEST
Z2.04

1:45pm CEST

Spreadsheets with superpowers: LLMs for data extraction and classification
Saturday May 24, 2025 1:45pm - 3:00pm CEST
Lots of data and investigative journalism takes place in spreadsheets. Frequently, we want to perform a task for every row in our spreadsheet. For instance, we may have cells containing:

- Quotes from a speech by a European politician that we want to classify into “Pro-EU”, “Anti-EU” or “Neutral”

- Company annual reports from which we want to extract the ultimate controlling party

- Political ads which we want to sift according to whether they mention immigration, directly or indirectly

In this session, participants will learn to write a custom Apps Script function in Google Sheets that will enable them to apply Large Language Models (LLMs) from OpenAI and Anthropic to their spreadsheet data.

By the end, attendees will be able to write a formula like =LLM(A1, "gpt-4o", "Is this text about immigration?"), then drag it down to apply it to hundreds of rows at once. This will enable us to apply the astonishing natural language capabilities of LLMs en masse to cells within our spreadsheet.

Attendees will acquire the following skills:

- Using Apps Script to write custom functions in Google Sheets

- Using LLMs via APIs

- Some basic LLM prompting techniques and tips

- Understanding when an LLM is likely to be reliable (when its output is based entirely on data within the spreadsheet) and when it is more likely to hallucinate (when its output draws on its own limited knowledge of the world)
Speakers

Joseph Smith

Staff Software Engineer, The Guardian
Joseph Smith is the engineering lead of the Guardian's Generative AI team, which is exploring how this new technology might be safely used for journalism. Before that, he spent four years working on investigative and data journalism projects at the Guardian and building software to...

Alina Yanchur

Data and Investigative Journalist
Saturday May 24, 2025 1:45pm - 3:00pm CEST
Z2.09

1:45pm CEST

Together at last: R and Python united in the Positron IDE
Saturday May 24, 2025 1:45pm - 3:00pm CEST
For years, data journalists have been forced to choose between learning R or Python in order to do data analysis with a scripting language. This meant the choice of IDE (integrated development environment – the app for writing and managing scripts and files) was always a defining decision.

R users mostly turned to RStudio to maintain R, run scripts, make plots, etc. Python users have had a variety of options (Google Colab, Jupyter, Anaconda, etc.) to manage their scripts and projects.

Now there’s a program built to handle both languages in parallel (but not quite simultaneously!) - it's called Positron.

In this session we will introduce you to the Positron program. We will show you the interface, and how to get started with your usual coding language, before working through some scenarios where being able to move quickly from one language to the other is desirable. (And if you have examples of times when you’ve needed this facility, please bring them to this session.)

You will ideally have some experience of R or Python, and some appetite for using the other language, perhaps even on deadline. If you want to follow along in the session, install Positron beforehand from https://positron.posit.co/
Speakers

Adrián Maqueda

Data Analysis, Dataviz & Front-end, Civio

Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist turned data journalist, trainer and consultant. Works with Arena as Masterclass Coordinator for the Climate Arena fellows.
Saturday May 24, 2025 1:45pm - 3:00pm CEST
Z2.10

3:30pm CEST

Protests, TikTok, and more 🎥: analyzing images and videos with AI
Saturday May 24, 2025 3:30pm - 4:45pm CEST
Learn how AI can sort through images and video to help you wrangle footage from protests and riots (NYT), analyze trends on TikTok (Washington Post), keep an eye on your local school board meetings (Hearst), measure the effects of congestion pricing (Bloomberg), and a hundred other tidbits for when the cameras might be rolling.

With a little Python and a dash of foundational knowledge, this session will cover downloading videos, building and evaluating transcripts, splitting scenes, categorizing images, and detecting/counting/tracking objects.

Participants will get the most out of this session if they have a working knowledge of Python. To follow along, you should have Jupyter installed on your computer or a Google account to use Google Colab. Additional materials and installation tips will be available at https://github.com/jsoma/dataharvest25-ai-images-video
Speakers

Luc Martinon

Freelance data journalist
Freelance data journalist, based in Berlin. I work on different cross-border projects, the last big one being the Forever Pollution Project. Proud member of the CORRECTIV.Europe project. Topics: corporate capture, conflicts of interest, national and EU lobbying, chemical pollution...

Jonathan Soma

Professor, Columbia University
Jonathan Soma is Knight Chair in Data Journalism at Columbia Journalism School, where he directs both the Data Journalism MS and the summer intensive Lede Program. His courses there cover everything from basic Python and analysis to ai2html and machine learning. Right now he's very...
Saturday May 24, 2025 3:30pm - 4:45pm CEST
Z2.04

3:30pm CEST

Start looking: finding patterns in data with your eyes 👀
Saturday May 24, 2025 3:30pm - 4:45pm CEST
You've just obtained a big dataset. Where do you begin? How do you find the story buried within the rows and columns?

In this session, you will learn how to quickly become familiar with your data by making a series of charts that will illustrate not just the contents of your data but unveil patterns that can help guide your reporting.

This class will be taught in R, so some familiarity is recommended, but the skills are common to all languages.
Anyone who wishes to code along will just need to sign up for a free account at https://github.com/
Speakers

Rob Gebeloff

Reporter, New York Times
Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects...
Saturday May 24, 2025 3:30pm - 4:45pm CEST
Z2.09

3:30pm CEST

💡 Streamlit for building tools and collaborating with non-coders
Saturday May 24, 2025 3:30pm - 4:45pm CEST
With Streamlit, you can set up a web page in just a few lines of Python code to share your findings with your team or your audience – or to collect information from them. Use it to swiftly try out an idea for publication before asking your IT department to develop it, or to let a colleague make use of a Python-scripted tool you've written. Or build yourself a chatbot to help navigate your own research, local and safe on your computer.
In this session, we’ll cover the basics of Streamlit and build a page where users can upload a PDF along with some information, send it to a Python function for processing, and display the results. More advanced users will learn how to build an LLM-powered chatbot.
Streamlit is a Python library, so you should have a basic understanding of Python. You also need to be the admin of your computer, or at least have permission to start a local web server on it. If you want to build a chatbot, you’ll need to install Ollama and download a model such as Gemma3 (ollama.com/library/gemma3) before the session starts.
Speakers

Lasse Edfast

Documentary producer and data researcher, Freelance
Documentary producer, researcher and data journalist. I'm happy to help you with your Python project.

Gianna-Carina Gruen

Head of Data-driven Journalism, DW
Gianna-Carina Grün is a data journalist and leads the data-driven journalism unit at DW that she founded in 2017. In her day-to-day she produces and edits visuals and data-driven stories and as a ddj trainer spreads enthusiasm for coding in journalism. She has a bachelor in Life Sciences...
Saturday May 24, 2025 3:30pm - 4:45pm CEST
Z2.10

5:15pm CEST

Advanced AI prompts for investigations
Saturday May 24, 2025 5:15pm - 6:30pm CEST
How can journalists use metaprompting to let AI build system instructions for assistants to empower investigations? Through examples of iterative metaprompting, Rune Ytreberg will show you how to master the art of efficient prompting, and use metaprompting to build AI assistants to unlock the black box that is AI.

After attending this session you will be able to use advanced prompting techniques (metaprompting) to have a large language model (LLM) write system instructions. These advanced prompts are detailed instructions that customize AI assistants supporting investigations. They are used, among other places, in OpenAI and Anthropic projects, and in RAG platforms like Kotaemon or AnythingLLM.

You need no special knowledge to attend this session, but if you would like to follow along on your computer you will need some basic knowledge of prompting tools such as ChatGPT, Claude or similar, and will need to have signed up for a paid version of your chosen AI platform before the session.
Speakers

Rune Ytreberg

Journalist, iTromsø
Head the data journalism lab at the local newspaper iTromsø. We use AI to make new editorial tools and products for 70 newspapers in the Polaris media group ASA. Interested in Agentic RAG, LLM, Nocode AI, Recommender systems. Member of the AI Journalism lab at J+ CUNY. Open for Ai...
Saturday May 24, 2025 5:15pm - 6:30pm CEST
Z2.04

5:15pm CEST

AI cookbook 🥧: 6 recipes for the modern journalist
Saturday May 24, 2025 5:15pm - 6:30pm CEST
What if you could harness AI to automate repetitive tasks, extract meaningful insights from complex datasets, or even assist in storytelling? In this session, you’ll learn how to create practical, customizable workflows—“AI recipes”—designed to tackle real newsroom challenges.

Drawing inspiration from cutting-edge techniques in AI agent design, we’ll guide you through building tools that can annotate maps, analyze documents, and much more. Whether you’re a data journalist, editor, or simply curious about the potential of AI, this session will provide hands-on insights to integrate AI agents into your work.
Speakers

Rui Barros

data journalist, Público
Portuguese data journalist currently working at Público.
Saturday May 24, 2025 5:15pm - 6:30pm CEST
Z2.09

5:15pm CEST

Scraping the unscrapable: advanced approaches to deal with complex sites and evade anti-scraping systems
Saturday May 24, 2025 5:15pm - 6:30pm CEST
Scraped data can often be the backbone of an investigation, but some websites are more difficult to scrape than others. This session will cover best practices for dealing with tricky sites, including coping with captchas, IP blocks, and browser fingerprinting.

This is an advanced session aimed at people who already have experience of writing code to scrape websites and want to move up to the next level: participants will leave with an understanding of how to approach hard-to-scrape websites, plus the tradeoffs and costs of these approaches.
Speakers

Max Harlow

Bloomberg News
Max Harlow is a data reporter at Bloomberg News in London. He uses data and documents to cover topics including money in politics, corporate sleaze and international trade. He formerly worked at the Financial Times. He also runs Journocoders, a community group for journalists to develop...
Saturday May 24, 2025 5:15pm - 6:30pm CEST
Z2.10
 
Sunday, May 25
 

9:30am CEST

Efficient Collaboration? Automate your Google Docs with Python!
Sunday May 25, 2025 9:30am - 10:45am CEST
The wide adoption of Google tools makes them a great choice in collaborative contexts like cross-border projects. Online documents and spreadsheets can also be used as input for any kind of automated process, allowing non-technical people to participate with tools they are already familiar with. In this hands-on session we will see how to automate Google Docs to do more with it. For example, we will look at how to create, edit, upload or export files, perform search and replace across multiple documents, and automate the publishing of a CSV as a spreadsheet with headers and filters.

I will briefly present the different options available (Google Apps Script, Python with gspread or the Google Drive API). Then you will have an opportunity to complete a hands-on task.

For the hands-on task, please set up one of the authentication methods described here, either OAuth or a service account.

If you have a particular use case in mind, please bring it and we will discuss it together / try to make it happen!
Speakers

Ada Homolova

ARENA / Follow The Money, Austria/ Slovakia
Adriana is a freelance data journalist, trainer and public spending nerd. She coordinates the data skills training track at the Dataharvest conference and investigates the European Union for Follow The Money Bureau Brussel.

Luc Martinon

Freelance data journalist
Freelance data journalist, based in Berlin. I work on different cross-border projects, the last big one being the Forever Pollution Project. Proud member of the CORRECTIV.Europe project. Topics: corporate capture, conflicts of interest, national and EU lobbying, chemical pollution...
Sunday May 25, 2025 9:30am - 10:45am CEST
Z2.04

11:15am CEST

RAG: Leveraging AI for document research
Sunday May 25, 2025 11:15am - 12:30pm CEST
Investigative journalism often involves analyzing vast amounts of documents. In this session, we will demonstrate how Retrieval-Augmented Generation (RAG) systems can help you get information out of document collections.

We’ll showcase both open-source solutions and custom-built tools, using real-world examples from recent collaborative investigations. Whether you're dealing with leaked documents, court records, or government reports, this session will equip you with techniques to get insights from your documents faster than using an intern.

After attending this session, you will be able to integrate these tools into your investigative workflow, and apply them to large document collections.

No prior knowledge of AI or programming is required.

To follow along, bring your laptop!
Speakers

Rune Ytreberg

Journalist, iTromsø
Head the data journalism lab at the local newspaper iTromsø. We use AI to make new editorial tools and products for 70 newspapers in the Polaris media group ASA. Interested in Agentic RAG, LLM, Nocode AI, Recommender systems. Member of the AI Journalism lab at J+ CUNY. Open for Ai...

Hendrik Lehmann

Head of Innovation Lab, Tagesspiegel
Hendrik Lehmann leads the Tagesspiegel Innovation Lab and is helping to build the European research network Urban Journalism Network. He works mainly on investigative data analyses, dataviz and interactive applications. And his team is working on AI for investigative research...

Johan Schuijt

Freelance Full Stack / Data developer
I help teams make sense of their data with small sharp custom software tools that save time.
Sunday May 25, 2025 11:15am - 12:30pm CEST
Z2.10

11:15am CEST

Reset before you repeat: How automation helps us save time in the data team
Sunday May 25, 2025 11:15am - 12:30pm CEST
Data processes are often repetitive. Download the dataset, clean it, analyze, visualize, publish. What if new data comes in? Start all over again. From Google Sheets to Python, automation can streamline this process — saving time, enabling live coverage, or helping you be first with recurring stories. But is it worth automating everything?

Through real examples — from election coverage to internal requests from other teams — we’ll share our ongoing reflection at El Confidencial’s data desk: how we decide what is worth automating, what has worked (and what hasn’t), and how maintaining a small library of reusable scripts has helped us speed up our workflows and focus more on storytelling. In this practical session, we will take a step back from the daily grind to rethink how automation can help us work smarter, not harder.

Some knowledge of HTML or spreadsheets is desirable, but not required. We will mention Google Sheets, data visualization platforms like Datawrapper or Flourish, and languages like SQL, R or Python, as well. It is recommended to have access to an AI (e.g. ChatGPT, Gemini, Mistral) in order to participate in some of the examples we will give in the session.
Speakers

Miguel Ángel Gavilanes

Data Journalist, El Confidencial
Sunday May 25, 2025 11:15am - 12:30pm CEST
Z2.04
 