Artificial intelligence — especially in the form of large language models (LLMs) — promises enormous productivity gains for enterprises.
But with great power comes great responsibility. Unchecked AI risks compromising security, enabling harmful outcomes, and eroding public trust.
That’s why secure, private, and responsible AI development and usage sit at the heart of Moveworks' mission. With our next-generation copilot, we refuse to sacrifice ethics for efficiency or safety for convenience's sake. We invested tremendous engineering resources to create guardrails and controls from day one, putting our copilot through rigorous testing.
Why such intense scrutiny? Our copilot handles employee and customer data in enterprise IT ecosystems, integrates with critical infrastructure, and influences millions of decisions. We aim to advance best practices industry-wide and spark a dialogue on earning public trust through better AI safety.
In this post, we’ll break down:
- The methodology guiding our security and privacy testing regime
- Key risks we identified and mitigations in developing our copilot
- Ongoing efforts to stay ahead of emerging attack techniques
Understanding critical AI security and privacy issues
Securing LLM-powered tools, like our enterprise copilot, requires methodically analyzing potential risks from every angle. We turned to the industry-standard OWASP Top 10 vulnerabilities for machine learning systems to ground our threat modeling. In this case, four of the top ten were the most relevant to our platform:
Prompt injection: Manipulating large language models (LLMs) via prompts can lead to unauthorized access, data breaches, and compromised decision-making.
Insecure output handling: Neglecting to validate outputs generated by LLMs may lead to downstream security risks, including code execution capable of compromising systems and exposing data.
Sensitive information disclosure: Failing to protect against disclosure of sensitive information in LLM-generated outputs can result in accidental exposures or a loss of competitive advantage.
Insecure plugin design: LLM plugins that process untrusted inputs and have insufficient access controls risk severe exploits like remote code execution.
This list is an excellent starting point, as it covers many areas of concern. However, as an enterprise copilot, there are other risks that we know we would want to protect ourselves against, such as hallucination, disinformation, etc.
Hallucination refers to a situation wherein an AI system, especially one dealing with natural language processing (NLP), generates outputs that may be irrelevant, nonsensical, or incorrect based on the input provided. Hallucination often occurs when the AI system is unsure of the context, relies too much on its training data, or lacks a proper understanding of the subject matter.
For example, an enterprise copilot could provide falsified employee names and phone numbers as if they were real. Without safeguards, it could easily manufacture fake workplace knowledge.
The subtlety of these inaccuracies can appear innocent but have much broader implications when dealing with more sensitive subjects. To the untrained eye, incorrect statements could very well seem true.
While this example may be more benign than some, there have been real-world consequences of this issue. Hallucination poses a real threat when LLMs are used for real-world applications — like responding to employee questions in a business setting or providing automated patient support in a healthcare setting.
Disinformation is a deliberate attempt to mislead or deceive, presenting false information as if it were true. Disinformation differs from hallucination, as hallucinations are often unintended and arise due to an AI model, such as an LLM, making things up based on patterns learned during training without malicious intent or purposeful acts of spreading false information.
Any tool with generative capabilities can create disinformation, whether by surfacing fabricated knowledge base articles or drafting emails attempting to persuade others that false claims are factual.
Prompt injection generally refers to deliberately creating an input in a conversation with an LLM with the goal of producing unintended or potentially harmful output. Since LLMs treat inputs as instructions to follow, prompt injection can be used to try to coerce or trick the system into generating problematic content that violates its safety constraints, for example.
Prompt Injection can be both direct and indirect. Both of these methods can be abused by a bad actor to make the LLM ignore its own instructions and do something else. DAN(Do Anything Now) is an excellent example of this! This prompt explicitly asks that the LLM to perform as normal, minus restrictions on harmful content. Prompt injection can also lead to unintended actions, such as directing the AI to file a ticket as another user. This could also be used to leak sensitive information, such as customer data if the proper prompt could be found.
Harmful content encompasses outputs including but not limited to racism, homophobia, sexism, or crude language. This content poses risks even when subtly implied rather than explicit. For example, defaulting to "he" when referring to a CEO perpetuates gender bias.
An LLM can be tricked into generating harmful content via tactics including:
- Encoding: Using character substitutions to “hide” content, such as the ASCII representations of the word (See Figure 4)
- Context manipulation: Guiding conversations to elicit problematic responses gradually rather than in one shot. An example of content manipulation would be asking the LLM over and over again benign or harmless questions, and then asking it to do something that would produce harmful content somewhat related to the benign requests.
- Exploiting generative capabilities: Prompting an LLM to produce dangerous instructions or toxic viewpoints through its own writing, such as by having an LLM write a gender-biased story about how male CEOs are better than female CEOs.
Even while not intended maliciously, the capability of an LLM to generate harmful content exists without mitigations in place. Allowing exposure to harmful content could create business risks for enterprises using AI. Proactively blocking and filtering toxic outputs is, therefore, essential.
Identity validation encompasses the authentication and authorization processes used to control access to systems and data, as well as validating the authenticity of requests. There are several risks that can undermine these controls:
- Authentication bypasses allow unauthorized users to access a system or data by evading its normal process of verifying who they are. If an authentication bypass is found, an attacker can appear to be another user with broader permissions or privileges. This access could be abused to access restricted customer information or financial documents, and/or gain access to other systems.
- Authorization bypasses are when a user is able to access data or perform actions beyond their permission levels. If an authorization bypass is found, it could be theoretically possible to use a different user’s account to reset passwords, access restricted content, or potentially perform an action on this other user’s behalf.
- Confused deputy attacks enable a user — with fewer privileges or fewer rights — to trick another user or program into misusing its authority on the system. In this type of attack, an LLM itself could be "confused" into believing a request is legitimate.
All of these risks apply not only to the AI itself but also to plugins that have been developed and will be developed in the future. If plugins do not properly perform identity validation, the functionality they extend to the user could be abused by an attacker. For example, if a Jenkins build plugin did not properly validate that the user had the appropriate permissions, an attacker could continuously build a resource-intensive Jenkins build, wasting computational resources and engineering time.
Red-teaming our copilot with real-world attack scenarios
To rigorously vet our copilot’s defenses, we simulated realistic security and privacy attacks. An LLM red-teaming exercise ensured that we could probe even subtle weaknesses that bad actors could exploit.
To perform these attacks, we rely on great external research, as well as, our in-house expertise based on how copilots get used. Below are some examples:
- Testing text sanitization through typoglycemia, scrambling inner letters while preserving the first and last to challenge sanitization methods.
- Stressing NLP capabilities via multilingual inputs and "false friend" words.
- Disguising malicious payloads using Base64 or URL manipulations for input encoding.
- Employing ciphers to attack LLMs.
- Identifying and resolving undesirable behaviors within pre-trained language models.
- Developing universal triggers exploiting vulnerabilities across NLP systems.
- Automatically generating jailbreaks to bypass system constraints.
- Using control characters for injecting commands into prompts, assessing input handling.
- Social engineering prompts to elicit harmful responses
- Sequences of inputs to gradually weaken restrictions
- Attempts to rewrite keywords through encoding
This approach enabled us to assess both publicly known and proprietary risks to our copilot, while new techniques are identified and we evolve further.
We mitigate hallucinations with fact and source verification, as presented in our fact-checking system, which is aided by our grounded knowledge system.
More specifically, we validate whether the available documentation supports the data the copilot presents to the user. If documentation does not support the copilot’s conclusion, it will refuse to respond with the incorrect information. This approach significantly reduces serving false or fabricated information to users. We know we can trust the data in the documentation we do find, as long as it’s part of our grounded knowledge system.
Balancing information security without limiting the copilot's workflow integration capabilities required a nuanced approach. If we removed all of the generative capabilities provided by LLMs, the copilot would be much less helpful.
With this challenge in mind, we use multi-layer defenses to reduce disinformation:
- Content truthfulness is checked through support documents as prescribed by our fact-checking system.
- Generative writing is generally allowed except in egregious cases
- Explicitly harmful disinformation is blocked by our content moderation system.
- Our grounded knowledge system allows us to ground our copilot with the most up-to-date and trustworthy documentation available to it.
Mitigating prompt injection
Our content moderation system and prompt protection aim to recognize when outputs recycle original user inputs, indicating potential manipulation attempts. Detected issues are blocked while still allowing access to conversation contexts for personalized recommendations.
Balancing security and usability required permitting some generative capabilities while restricting malicious inputs. Our approach focuses on mitigating the highest-risk injections without limiting functionality.
For example, a user might attempt to trick the copilot into printing an inappropriate or threatening message by embedding it in the text that he asked to spell check. However, our solution aims to recognize malicious attempts and simply provide the original text back to the user after checking it is grammatically correct (see Figure 7).
Suppressing harmful content
Our content moderation system aims to catch varieties of abusive content through comprehensive techniques covering threats, harassment, hate speech, and more.
Removing toxic outputs helps protect users and ensures copilot conversations uphold enterprise standards. This approach creates a safe and ethical environment for enterprises to adopt AI, such as LLMs.
Preventing identity validation risks
We found no viable authentication bypasses during testing, as our copilot requires users to authenticate through a trusted third-party chat platform, such as Slack or Microsoft Teams, before granting access. In the context of our Moveworks for Web platform, we rely on the SSO provided to us for user authentication. Users who cannot message the copilot directly cannot easily simulate legitimate identities.
This approach leverages our partners’ access controls rather than recreating specialized IP-based or biometrics controls. Users properly authenticated to enterprise chat environments intrinsically pass the copilot's front door checks, too.
While dependent on external platforms, we regularly monitor partners’ security programs to ensure adequate assurance. Presenting falsified identities to simultaneously bypass Slack or Microsoft’s protections and the copilot poses an extremely tall order for attackers.
Our copilot’s authorization controls are built upon enterprise single sign-on (SSO) systems. Access policies rely on the underlying identity provider accurately mapping employees to appropriate data and system permissions.
Our copilot cannot misrepresent an employee's permissions since we rely on the customer’s authorization and authentication systems for it (i.e. we do not use AI to determine the user identity or access).
For example, users couldn’t request other employees’ support tickets or restricted reports. Nor could they ask the copilot to perform unauthorized actions posing as managers or executives. Our copilot’s responses make clear authorization failures while preventing access.
Tying access policies to hardened modern identity platforms proactively blocks both spoofing and privilege escalation risks through the copilot.
Rather than handling access control itself, the copilot offloads identity and permissioning to trusted single sign-on platforms, which have deterministic access controls and identities instead of the LLM to determine those. This approach minimizes internal attack surfaces that could allow spoofing policy checks.
Ensuring AI copilot security amidst an evolving threat landscape
As you can see from the examples and details above, securing our copilot required implementing tailored safeguards:
- Content moderation system: Proactively prevents racist, harmful, dangerous or abusive content. This system significantly reduces social engineering risks and blocks threats like instructions for violence.
- Fact-checking system: Validates that copilot responses match supporting documentation. This system mitigates potential misleading information from hallucinations or disinformation by verifying accuracy and grounding the responses
- Prompt protection: Detects outputs recycling original user inputs, indicating potential manipulation attempts.
Together, our approaches form critical guardrails between users and the copilot. They operate continuously rather than just reacting to issues, preventing issues before they occur.
While no single solution is foolproof, layered defenses combining prevention and detection make threats significantly more difficult to abuse.
Developing a proactive approach to AI copilot security and privacy
As for any AI solution, security is not a one-and-done exercise. Maintaining responsible protections demands proactively planning for emerging threats. Below are some examples of areas that we are actively working on:
- Automated vulnerability discovery: "Hacker" user models try bypassing defenses, while "privacy" models uncover potential data leaks. Additionally, continuous red teaming surfaces and arising issues.
- Fuzz testing: Subtly modifying known attacks to discover variants of existing exploits.
- Query risk assessment framework: Our copilot uses data to identify abuse to improve our detection and response capabilities
We invest heavily in our copilot's security foundations, not just superficial protections. Scaling system guardrails and threat detection while hardening systemic weaknesses will sustain security as AI capabilities grow more advanced.
You probably noticed that we didn’t include all of the OWASP Machine Learning Top Ten vulnerabilities in our initial risk assessment. This exclusion does not mean we ignored these items. Rather, it acknowledges that we already had mitigating factors in place due to engineering and architecting decisions made earlier in the process.
For example, while direct model theft posed a low risk given the copilot’s separation from data access, we preemptively enacted strict output constraints. Confidential data identification patterns help filter any sensitivities before responses reach users. This "trust but verify" approach proactively reduces non-critical vectors like potential training data exposure through copilot outputs before customers deploy it.
The Moveworks Copilot has mitigations in place against key threats by design
Long before we began building this copilot, our engineering teams were architecting and designing systems to help us manage identities and all of the complexities that come with it.
The importance of security and privacy in operating large language models, especially in an enterprise context, cannot be understated. By sharing our methodology, we hope to advance best practices industry-wide.
This journey continues to evolve — innovation must continue balancing utility, security, and privacy to earn trust in AI. While it is not possible to eliminate all risks, we believe that we’ve taken the appropriate steps to significantly reduce these risks within our copilot.
We would like to thank the individuals listed below for their invaluable feedback.
Matthew Mistele, Tech Lead Manager, NLU
Emily Choi-Greene, Security & Privacy Engineer
Rahul Kukreja, Security Solutions Architect
Hanyang Li, Machine Learning Engineer
Learn more about what it takes to ensure conversational AI security for the enterprise.
Table of contents