Data Security in the Age of GenAI

Generative AI offers data analysts powerful tools to generate insights and explore data in creative new ways, making their work faster and more efficient. As team sizes shrink, work piles up, and pressure rises to produce meaningful results quickly, skipping tools like ChatGPT or Copilot can be a real competitive disadvantage. However, using publicly available models with proprietary data or PII in prompts can lead to breaches or loss of intellectual property. Every organization will need a policy around GenAI, so here are some tips for developing data handling guidelines to govern your organization's use of these tools.

Start by classifying data based on its sensitivity—such as public, internal, confidential, or highly sensitive—so that appropriate protection levels can be applied. Clearly define who has access to different data categories, implementing the Principle of Least Privilege to limit access to only those who need it. This will help guide your team in deciding what datasets can and can't be used in prompts.
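One way to operationalize this is to tag each dataset with a sensitivity tier and gate prompt usage on that tier. The sketch below is a minimal, hypothetical example; the dataset names, tiers, and the `MAX_PROMPT_TIER` cutoff are all illustrative assumptions, not a prescribed implementation.

```python
from enum import IntEnum

class Sensitivity(IntEnum):
    """Ordered tiers: higher values mean more sensitive data."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    HIGHLY_SENSITIVE = 3

# Illustrative catalog: dataset name -> classification tier
CATALOG = {
    "marketing_site_stats": Sensitivity.PUBLIC,
    "sales_pipeline": Sensitivity.CONFIDENTIAL,
    "employee_records": Sensitivity.HIGHLY_SENSITIVE,
}

# Least privilege for prompts: only datasets at or below this tier
# may be pasted into a GenAI tool (an assumed policy choice).
MAX_PROMPT_TIER = Sensitivity.INTERNAL

def allowed_in_prompt(dataset: str) -> bool:
    """Return True if the dataset's tier permits use in a GenAI prompt."""
    return CATALOG[dataset] <= MAX_PROMPT_TIER

print(allowed_in_prompt("marketing_site_stats"))  # True
print(allowed_in_prompt("employee_records"))      # False
```

A real deployment would pull classifications from a data catalog and check the requester's clearance as well, but even a simple lookup like this gives your team an unambiguous answer before anything reaches a prompt.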

Review the terms of service of any GenAI tool your team wants to use. Ensure that you retain ownership of your data and that the tool doesn’t claim rights to use or store it beyond what is needed to generate results. Be aware of how your data might be used for model training or future AI improvements.

Another important factor is compliance. Ensure that making use of a GenAI tool complies with relevant data privacy laws such as GDPR, HIPAA, or CCPA. Some regulations may prohibit sharing certain types of data with third-party tools without proper consent or data protection agreements in place.

Avoid feeding PII or confidential datasets directly into these systems. When a GenAI tool would genuinely help with a sensitive dataset, employ data masking techniques to protect the details. For example, replace names with generic identifiers (e.g., "person 1"), or redact certain fields entirely by excluding them from the dataset shared with the model. If you have sensitive unique identifiers, such as email addresses, use a deterministic one-way hash like SHA-256 so records can still be joined without exposing the underlying values.
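The three techniques above, generic aliases, full redaction, and deterministic hashing, can be combined in a small preprocessing step before anything is shared. This is a minimal sketch; the field names (`name`, `email`, `ssn`) and sample rows are made up for illustration.

```python
import hashlib

def mask_records(records):
    """Mask PII before sharing rows with a GenAI tool:
    - replace names with generic aliases ("person 1", "person 2", ...)
    - hash emails with deterministic SHA-256 (same input -> same digest,
      so joins across rows still work)
    - redact fields that should never leave the org (here: "ssn")
    """
    name_map = {}  # keeps the same alias for repeated names
    masked = []
    for row in records:
        if row["name"] not in name_map:
            name_map[row["name"]] = f"person {len(name_map) + 1}"
        masked.append({
            "name": name_map[row["name"]],
            "email": hashlib.sha256(row["email"].lower().encode()).hexdigest(),
            # "ssn" is intentionally omitted: fully redacted
        })
    return masked

rows = [
    {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"},
    {"name": "Ben", "email": "ben@example.com", "ssn": "987-65-4321"},
    {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"},
]
print(mask_records(rows))
```

Note that deterministic hashing preserves linkability on purpose; if an identifier must be unlinkable as well, drop it or add a secret salt stored outside the shared dataset.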

GenAI is an amazing technology that has the potential to empower data teams, but only if risks are mitigated through good data handling practices. For more information, check out Zinc’s guidelines on AI.
