Data Privacy and Security for AI Platforms
The rapid emergence of Large Language Models (LLMs) has introduced powerful new capabilities to the academic landscape, from streamlining research workflows to enhancing administrative efficiency. However, as these tools become integrated into university life, they bring complex data security and privacy challenges. For faculty, researchers, and administrators at Columbia, understanding how data is handled by these systems is essential for protecting the university's intellectual property and the privacy of its community.
Understanding "Public" LLMs
In practical terms, "public LLMs" refers to consumer-facing AI services, such as the free or standard versions of ChatGPT, Gemini, and Claude, that operate outside of Columbia University's negotiated legal and security agreements. When a user submits a prompt to one of these public services, the data typically leaves the university's controlled environment.
A critical feature of public LLMs is their default data usage policy: most frontier AI developers use chat data, including prompts and uploaded files, to train and improve their future models. Some platforms offer opt-out mechanisms, but these are often not enabled by default, which places the burden on users to proactively protect their data.
Common Misunderstandings and Risks
Many users operate under misconceptions regarding how their interactions with AI are stored and managed (Stanford 2025).
- Deletion: Deleting a chat from your history does not necessarily mean the data is purged from the developer's servers. Many companies retain chat data for extended periods (or even indefinitely) for "trust and safety" reviews or model quality assessments.
- Isolation: It is often assumed that prompts are processed in isolation. In reality, modern chatbots are increasingly designed with long-term memory or personalization features that store details about a user's interests and goals across multiple sessions.
- Human Review: To improve model accuracy, developers frequently employ human contractors to review anonymized or de-linked chat transcripts. If a user includes identifiable information in a prompt, that data may be viewed by external reviewers.
A recent Stanford University study documented the extent of these data privacy practices among frontier public LLMs. In the summary table, each column represents a chatbot developer and each row a practice described in that developer's privacy policy.
How Columbia Classifies Data
To guide safe usage, Columbia University classifies data based on its level of sensitivity:
- Public: Information intended for public consumption (e.g., published research, course catalogs).
- Internal: Data intended for use within the Columbia community (e.g., internal memos, department-level communications).
- Confidential: Information that requires protection (e.g., student records protected by FERPA, unreleased research data, legal contracts).
- Sensitive: Data that requires the highest level of protection due to legal or regulatory requirements (e.g., Protected Health Information (PHI), Social Security numbers, or biometric data).
Read more about our classification policy here.
The Value of Columbia-Grounded LLMs
For work involving non-public data, Columbia IT maintains a set of enterprise-grade AI tools — including Columbia ChatGPT Education, CHAT, and Google for Education — that operate within a secured environment sometimes called a "walled garden."
Unlike public LLMs, these university-approved tools are governed by specific enterprise contracts. These agreements ensure that the data you input is not used by the provider (e.g., OpenAI or Google) to train its public models. Inputs and outputs are isolated to each individual's account, meaning that even Columbia IT's own system administrators cannot access them.
Current Allowable LLM Usage by Data Classification
This page has the current table of AI tools that are approved for use based on the classification of the data being processed, as well as a history of release notes.
For users at CUIMC, please refer to AI and Generative Technology Use at CUIMC for additional information regarding AI tools approved for use with Sensitive Data.
If you have any questions about AI data privacy and security, please contact [email protected].