Microsoft's 38TB Data Leak

Among the exposed data were secrets, code and AI training data


One misconfigured shared access signature (SAS) token exposed 38 terabytes of data linked from Microsoft's AI GitHub repository. The data included secrets, private keys, passwords to Microsoft services, private source code and more than 30,000 internal Microsoft Teams messages from 359 Microsoft employees. Let's see what we can learn about and from this leak.

Cause, description and threats of this massive leak

Microsoft researchers goofed up when publishing open-source AI training data through their GitHub repository named "robust-models-transfer," created to make open-source code and AI models for image recognition available. The issue lay in both the access scope and the permission level granted through the link used to share the data. The researchers created the link using Azure's SAS token feature, which allows sharing data from Azure storage accounts. That's just great when the token is configured to grant access only to the data intended to be shared. In this incident, unfortunately, the link exposed the entire storage account, containing both what was meant to be seen and what wasn't. Talk about oversharing! To add insult to injury, the link granted permission to overwrite and delete files, and its expiry date was set to October 6, 2051.
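
To make the misconfiguration concrete, here is a purely illustrative sketch in Python (not the actual leaked link; the account name, dates and signature are made up) of what an over-permissive account SAS URL looks like and which query parameters encode its scope, permissions and expiry:

```python
# Illustrative only: anatomy of an over-permissive account SAS link.
# The account name, dates and signature below are made up.
from urllib.parse import parse_qs, urlparse

shared_link = (
    "https://examplestorage.blob.core.windows.net/?"
    "sv=2021-08-06"             # signed storage service version
    "&ss=b"                     # services covered: blob
    "&srt=sco"                  # resource types: the service, containers AND objects
    "&sp=rwdlac"                # permissions: read, write, delete, list, add, create
    "&se=2051-10-06T00:00:00Z"  # expiry decades in the future
    "&sig=<signature>"          # HMAC computed with the account key
)

params = parse_qs(urlparse(shared_link).query)
print(params["sp"], params["se"])  # over-broad permissions, distant expiry
```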

Among the exposed private information was a disk backup of two employees' workstations, which included credentials and more than 30,000 internal Microsoft Teams messages. But what is truly more worrying about this leak is the possibility of malicious hackers tampering with the AI training data shared by Microsoft. The repository tells readers to follow a URL, download a model file and feed it into a script. Since that file uses the pickle format, a serialization format prone by design to arbitrary code execution (a pickle file can run Python code when it is loaded), an attacker who injected malicious code into the AI model could execute commands on the machines of unsuspecting users.
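
As a generic illustration of the danger (this is not code from Microsoft's repository), the following Python sketch shows how a crafted pickle file runs an operating system command the moment it is deserialized:

```python
# Generic illustration of pickle's arbitrary code execution; not the leaked files.
import pickle


class MaliciousModel:
    def __reduce__(self):
        # On loading, pickle calls the returned callable with these arguments,
        # so any OS command placed here runs on the victim's machine.
        import os
        return (os.system, ("echo code executed while 'loading a model'",))


payload = pickle.dumps(MaliciousModel())
pickle.loads(payload)  # the command runs here, before any "model" is even used
```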

Shortly after being tipped off about this leak, Microsoft invalidated the SAS token to prevent access to the storage account. Less than two weeks later, they replaced the token on GitHub. Reportedly, "no customer data was exposed" (thankfully) and "no customer action is required in response to this issue."

Possible security risks when using Azure SAS tokens

Undoubtedly, more and more organizations handle massive amounts of data (e.g., as they decide to implement AI), and the cloud provides the availability and scalability they need. So, the solutions these organizations use to manage that data need to allow secure configurations. This is not to negate, though, that security is a shared responsibility: Organizations should configure those solutions in such a way that cybercriminals' prying eyes cannot catch a glimpse of sensitive information.

That being said, let's acknowledge Azure's responsibility in an incident like this. The granularity of account SAS tokens (the kind involved in this leak) is such that, before generating them, it's possible to select the specific files to be shared, the permissions granted (out of 10 available) and the start and expiry dates and times. So, the security features are there. However, a couple of nuances should be taken into account, as they make this service not quite as safe as it seems.

Firstly, it may be problematic that generating an account SAS token is not an Azure event but something done on the client side: When generating the token, the client's browser downloads the account key from Azure and signs the generated token with it. The token, in turn, is not an Azure object. This prevents monitoring: A token can be issued, and an admin relying only on what Azure tells them would never learn of its creation. And even if they were made aware of its issuance, where the token circulates would remain unknown. There is a way, though, to learn of a token once it's used to access a storage account: The storage account must have logging enabled, which can be costly, as prices go up with each account's request volume and logging must be paid for per account.
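
A minimal sketch with the Azure Storage SDK for Python (the account name and key are placeholders) makes the point visible: an account SAS is just a query string signed locally with the account key, so its creation involves no call to Azure and leaves no trace there:

```python
# Minimal sketch: an account SAS is computed locally from the account key,
# so Azure never records that it was created or who holds it.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

sas_token = generate_account_sas(
    account_name="examplestorage",   # placeholder account
    account_key="<account-key>",     # downloaded to the client beforehand
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, write=True, delete=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=365 * 28),  # far-future expiry
)
print(sas_token)  # no Azure event was generated; the token only exists client-side
```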


Secondly, revoking an account SAS token is only possible by rotating the entire account key that signed it. Efficient management is thus impaired, as every token signed by that key is revoked at once.
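
As a sketch of what that revocation looks like in practice (resource names and IDs are placeholders, assuming the azure-mgmt-storage and azure-identity packages), "revoking" the token means regenerating the key that signed it:

```python
# Sketch only: revoking an account SAS boils down to rotating its signing key,
# which invalidates every token signed with that key, not just the leaked one.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.regenerate_key(
    resource_group_name="example-rg",
    account_name="examplestorage",
    regenerate_key={"key_name": "key1"},  # the key that signed the exposed token
)
```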

The researchers at Wiz, who discovered the data leak, thoroughly explained the previous issues, as well as the security risks of the service allowing links that can grant excessive permissions and can be given a practically infinite lifetime. It's true that Azure's tool makes dangerous combinations possible; but, as we said above, the client's responsibility for a secure configuration needs to be taken into account as well. Specifically, Microsoft's data leak sends a message of caution for organizations to review their data management governance in the cloud, lest they end up with their sensitive information up for grabs. So, let's look at what organizations can do from a preventive approach to cybersecurity.

How to prevent leaks like this

The following are some recommendations for using SAS tokens securely:

  • Take a good look at the data intended to be shared and identify the ways in which it can be misused.

  • Look into leveraging service SAS tokens, instead of account ones, and establishing a server-side stored access policy. This combination grants access at the resource level rather than at the level of the whole storage account and allows managing permissions and expiry time (see the sketch after this list).

  • Create SAS tokens only for storage accounts dedicated to external sharing.

  • Check how long data should be shared, as some portion of it might not need to be shared indefinitely.

  • Check that the permissions are just those necessary to fulfill the objective(s) of sharing the data.

  • If the cost is acceptable, enable logs that detail SAS token access, the signing key and the permissions assigned.

  • Scan repositories continuously with a cloud security posture management (CSPM) tool to identify SAS tokens and detect leakage and misconfigurations regarding scope and permissions.
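
As a sketch of the second recommendation (container, account and policy names are hypothetical, and we assume the azure-storage-blob package), a stored access policy on a dedicated container plus a service SAS bound to it keeps the scope narrow and leaves permissions and expiry manageable on the server side:

```python
# Sketch: a read-only, short-lived service SAS bound to a stored access policy
# on one dedicated container, instead of an account SAS over the whole account.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccessPolicy,
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)

service = BlobServiceClient(
    account_url="https://examplestorage.blob.core.windows.net",
    credential="<account-key>",
)
container = service.get_container_client("public-models")  # dedicated container

# Server-side stored access policy: read-only, short-lived, revocable by name.
policy = AccessPolicy(
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=7),
)
container.set_container_access_policy(signed_identifiers={"external-share": policy})

# The SAS references the policy, so editing or deleting the policy immediately
# affects every token issued under it.
sas = generate_container_sas(
    account_name="examplestorage",
    container_name="public-models",
    account_key="<account-key>",
    policy_id="external-share",
)
print(f"{container.url}?{sas}")
```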

However, if you wish to prevent the creation of account SAS tokens altogether, the recommendation has been to block access to the operation that lists storage account access keys. The generation of user delegation SAS tokens, which rely on a user key instead of an account key, is still possible, though.
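
For completeness, here is a minimal sketch (again with placeholder names) of a user delegation SAS, which is signed with a key obtained through an Azure AD identity instead of the storage account key:

```python
# Sketch: a user delegation SAS requires no account key at all; revoking the
# delegation key or the identity's role assignment invalidates the token.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobSasPermissions, BlobServiceClient, generate_blob_sas

service = BlobServiceClient(
    account_url="https://examplestorage.blob.core.windows.net",
    credential=DefaultAzureCredential(),  # Azure AD identity, no account key
)

expiry = datetime.now(timezone.utc) + timedelta(hours=1)
delegation_key = service.get_user_delegation_key(
    key_start_time=datetime.now(timezone.utc),
    key_expiry_time=expiry,
)

sas = generate_blob_sas(
    account_name="examplestorage",
    container_name="public-models",
    blob_name="model.pt",
    user_delegation_key=delegation_key,
    permission=BlobSasPermissions(read=True),
    expiry=expiry,
)
```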

Manage your cloud security posture with Fluid Attacks

We know that many organizations need to handle progressively greater amounts of data in the cloud. If they misconfigure the security features of the cloud services they use, the sensitive data and users they are supposed to secure are at risk. We've talked elsewhere about the importance of checking with CSPM that your cloud-based systems and infrastructures comply with security requirements, prioritizing detected issues and resolving them as soon as possible. Moreover, we've argued that such activities need to be done all the time, keeping pace with development and the evolution of cyber threats (in DevSecOps fashion). That is why we offer Continuous Hacking, which performs CSPM continuously, along with other techniques, and provides the means and guidance to fulfill further vulnerability management steps. If you would like a taste of how we can help you prevent data leaks now, start your free trial.
