Technology hype is a fickle resource. A few years ago, blockchain and crypto were the magic buzzwords that had venture capitalists reaching for chequebooks. Now, with the sector worth 40% of what it was at its peak and key players embroiled in a litigious quagmire, its power to fascinate has waned and artificial intelligence has firmly displaced it as the flavour of the month.
VCs are simply desperate to hand over money to AI developers. Mistral AI, which consists of three French AI engineers and little else, received €105m at a €240m valuation in Europe’s largest-ever seed round just a month after incorporating. Politicians are equally excited, pinning their hopes for transformative productivity growth on the technology and doing everything in their power to ensure their jurisdiction will play host to AI companies.
The hype may well be justified. The capabilities of the latest AI tools are awe-inspiring and it seems inevitable that their proliferation and development will have enormous economic consequences.
But just as the cryptocurrency industry is running into regulatory challenges, AI is beginning to face its own legal battles on a number of fronts. Where the crypto industry is struggling with securities laws and anti-money laundering restrictions, AI providers are facing the ire of privacy regulators and activists. One of the most problematic regimes for the AI sector is the General Data Protection Regulation (GDPR), the law that protects the privacy of all residents of the European Union and UK.
Pitfalls of publicly available data
Developing a large language model such as ChatGPT requires the gathering of vast bodies of text through a process called web scraping. The resulting datasets contain details scraped from open online sources such as social media profiles. The information is in the public domain so gathering it can’t be an issue, right? Wrong. Privacy lawyers are at pains to point out that AI developers engaging in web scraping and analytics remain on the hook.
Chris Elwell-Sutton is a partner in the data, privacy and cybersecurity team at the UK law firm TLT. ‘There’s a common belief that, if data is pulled from publicly available sources, it is out of the scope of GDPR and similar privacy regimes. This is a mistake – potentially a very costly one,’ he explains. ‘Once your personal data is stored within a filing system, you have the protection of the GDPR, regardless of the original source.’
GDPR places a variety of stringent obligations on any organisation storing, transmitting or running analytics on personal data. The most fundamental headache LLM developers will face from GDPR is identifying a legal basis for the wholesale scraping of the personal data of millions of people without their knowledge or consent. This question has been subject to heavy regulatory and judicial scrutiny across Europe and there’s no simple solution in sight.
Much remains unknown about how GDPR will apply to generative AI, but the first shots have been fired in what is certain to be a long and expensive battle. ChatGPT was temporarily banned by the Italian data protection authority over inaccurate results and a lack of lawful grounds for processing, as well as the mishandling of children’s data. Google then had to postpone the EU launch of its competitor Bard over similar privacy challenges. The tech giant did, however, see fit to launch the service in the UK – a nod, perhaps, to the business-positive approach to privacy enforcement promised by the UK’s Conservative government post-Brexit.
OpenAI has made improvements to its privacy position, providing age verification, allowing users to opt out of their data being used for training models and taking ‘measures to strip away personally identifiable information from the training dataset’. But proving this claim to the satisfaction of a regulator is likely to be difficult.
The exact techniques OpenAI has used to remove personal data from its dataset have not yet been disclosed in detail, but the company has mentioned ‘data filtering and anonymisation’.
Elwell-Sutton said: ‘While it is true that anonymisation of data can take it outside the scope of privacy regulation, true anonymisation is a very high bar to achieve under GDPR. Auditing and testing the effectiveness of anonymisation methods would typically involve assessing the outputs a tool generates. But, in the case of a chatbot producing multiple answers to the same questions in an unpredictable manner, auditing its output is not a reliable method of assessing the privacy credentials of the model. The whole thing is too opaque for any external party to be able to verify that true anonymisation has taken place.’
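To see why filtering alone falls short of that bar, consider a minimal sketch of pattern-based scrubbing in Python. This is not OpenAI’s method, which has not been disclosed; the patterns, the `scrub` function and the sample sentence are all assumptions made for illustration.

```python
import re

# Illustrative patterns only (an assumption); a real pipeline would need far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def scrub(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical scraped sentence, invented for this example.
sample = (
    "Contact Jane Doe at jane.doe@example.com or +44 20 7946 0000. "
    "She lives next to the old mill in Little Snoring."
)
cleaned = scrub(sample)
print(cleaned)
```

The obvious identifiers are replaced, but the name and the description of where the person lives pass straight through. Free text that still points to an individual is still personal data, which is part of why filtering of this kind is hard to equate with anonymisation.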
A high and rising bar
There is another potential consequence of AI proliferation for privacy, which will have implications for many organisations that have nothing to do with AI.
There is no exhaustive list of what constitutes personal data. Instead, it is defined as any data relating to an identified or identifiable natural person. Predictive AI, as distinct from generative AI, can use advanced analytics to extrapolate further personal information about individuals.
This is a particular risk for the anonymised data in AI training sets. For data to have been effectively anonymised, there needs to be, at most, a remote likelihood of identifying an individual by means reasonably likely to be used. ‘That definition is tech-dependent,’ said Elwell-Sutton. ‘Data that was once anonymised can morph into GDPR-regulated personal data over time as more powerful tools emerge that can make connections between previously unrelated or meaningless elements within datasets.’
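A rough way to picture this shift is to count how many records in a supposedly anonymised dataset become unique once several attributes are combined. The Python sketch below uses four invented records and a hypothetical `unique_share` helper. It is a toy illustration of the linkage risk, not a legal test for anonymisation.

```python
from collections import Counter

# Hypothetical "anonymised" records: names removed, other attributes kept.
records = [
    {"postcode": "BS1", "birth_year": 1980, "sex": "F"},
    {"postcode": "BS1", "birth_year": 1980, "sex": "F"},
    {"postcode": "BS1", "birth_year": 1975, "sex": "M"},
    {"postcode": "BS2", "birth_year": 1990, "sex": "F"},
]


def unique_share(rows, keys):
    """Return the fraction of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    singles = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return singles / len(rows)


print(unique_share(records, ["sex"]))
print(unique_share(records, ["postcode", "birth_year", "sex"]))
```

With one attribute, a quarter of these records are unique; with three, half are. A tool able to link more attributes, or to match them against another dataset, singles out more people, which is why the same data can drift back into GDPR’s scope as tools improve.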
AI tools are also increasingly capable of generating new personal data by inference. ‘A powerful analytics tool can take a few seemingly harmless data points such as your partner’s name, your location at a place of worship or a medical centre and quickly infer your religion, your political views, your sexual orientation or likelihood of suffering from a certain illness,’ said Elwell-Sutton. ‘This “inferred” personal data can be hugely sensitive and its creation and storage carries major legal implications – a point that the European Court of Justice and regulators across Europe have repeatedly confirmed.’
The risks associated with inferred data present regulatory challenges for all kinds of businesses, not least those impacted by last year’s ban on Google Analytics in several European countries. LLM developers may find these issues particularly painful as the size and complexity of their training datasets make it difficult to impose suitable guardrails and audit them. Sanitising training datasets is likely to remain a concern for them, since complying with data protection regulations may mean tackling the complex technical challenge of permanently removing references to particular individuals from training data.
One alternative is a regulatory carve-out that relaxes the rules around the inclusion of personal data in AI training datasets. This is not beyond the realms of possibility – the UK is relaxing the rules around the inclusion of copyrighted material for AI developers – but with so many jurisdictions scrutinising the privacy implications of AI, legal troubles are unlikely to go away any time soon.
AI developers are not ignoring this legal threat, but their approach to privacy leaves something to be desired, according to Elwell-Sutton. ‘OpenAI’s focus on privacy and data protection has been successful as far as it goes,’ he said. ‘The issue is that their efforts revolve around the manageable issue of user-provided data, rather than the vast lakes of personal data scraped from public sources and used to train their models. They’ve done a good job on the low-hanging fruit, but they have a lot more work to do.’
Lewis McLellan is Editor of the Digital Monetary Institute, OMFIF.