RCE in Google Cloud Deployment Manager

[March 2021 update]: This write-up was chosen as the first place winner of the 2020 GCP VRP prize, and LiveOverflow made an amazing video explaining how the vulnerability was found.

TL;DR

By using an internal (dogfood) version of the Google Cloud Deployment Manager, I was able to issue requests to some Google internal endpoints through Google's Global Service Load Balancer, which could have led to RCE.

This could be achieved through a request to the Deployment Manager to create a Type Provider that included an undocumented field called googleOptions (Example).

This begins an async operation, in which the Deployment Manager attempts to retrieve a descriptor document from the specified descriptor URL.
If it fails, it might still provide information in the error message, such as the response from the internal server. If it succeeds, it would allow an attacker to issue complex internal requests.

Examples: App Engine Admin API (Internal test version), Issue Tracker Corp API.
Note that the issue is not limited to requests to APIs; it just works best on them. Example of a non-API endpoint: Google Accounts and ID Administration ("GAIA") backend, Test endpoint. The descriptorUrl does not matter there, since the request is expected to fail because the endpoint is not an API.

Google paid $31,337 as a reward for the bug report.

[March 2021 update]: Google paid an additional $133,337 prize as part of the 2020 GCP VRP prize, thus a total of $164,674 was paid for this report + write-up.

Intro

Deployment Manager is a Google Cloud service that provides a way to create, delete, and modify infrastructure resources programmatically (Infrastructure as code).

Relevant Deployment Manager concepts are:

Type: Describes the properties of a specific kind of infrastructure resource (For example: VMs, issue tickets, user permissions). There are several pre-defined Types available in Deployment Manager (Called base types)
Type Provider: Provides a service's RESTful API endpoint, with its descriptor document, for Deployment Manager to manage Types within that service (For example: An API to manage VM instances)
Resource: Represents an instance of a single infrastructure resource, provided by a Type (For example: A VM instance)
Templates: Reusable Python or Jinja2 files to programmatically configure Resources
Deployment: A collection of Resources that are deployed and managed together
Operation: Whenever a creation, modification, or deletion action is performed in the Deployment Manager, an Operation is returned, which can be polled to check for completion or error

The main way to interact with the Deployment Manager is through its REST APIs, of which there are two documented versions: v2 (Generally available) and v2beta (In public beta) (Read more about Google products' launch stages).
A key difference between the two versions is that Type Providers are only available in the v2beta version.

Note
It is a bit hard to understand Google Cloud Deployment Manager at first glance. If you are interested in it, I would recommend you play around with it, especially through the v2beta REST API. Read the docs, and try creating Deployments and Type Providers to get the hang of it.
I tried to link useful resources throughout this write-up, hoping to make it easier to understand.

Security research

My first approach to researching the Deployment Manager was to look for hidden or internal Types, since some Google services (Such as Google App Engine Flexible) use the Deployment Manager internally (You can see it in your project's logs when deploying an app), but I found none.

Then, I looked at the Jinja2 and Python templates of the Deployments.
Through some trial and error, I was able to create Deployments, with specially crafted templates, that would return data as a Python exception on their Operations.
This way, I was able to inspect the Python libraries, read the Python code, and list/read files, but the templates' interpreting script runs on an isolated container with zero privileges, not even network connectivity.

After those attempts, I tried creating Type Providers pointing to internal Google Corp APIs, such as issuetracker.corp.googleapis.com, but the Operations always failed with an error saying it did not receive a valid response for the descriptor document, and showing the HTML for the login portal to which issuetracker.corp.googleapis.com redirects when accessed externally.
And specifying any private IP address failed with an error saying it was not a valid address (Attempts to bypass it, with domains and redirections pointing to private IPs, gave the same result).

These failed attempts were quite demotivating, so I did not continue researching the Deployment Manager for a while (Remember that I did not do all this research at once on a single day; it was a very slow process).

One day, I decided to look into the Deployment Manager API methods, by enabling it on the Google Cloud Console, going to the metrics page, and looking at the Filters section, where there is a drop-down list titled Methods with all of them, including undocumented ones - Methods usually include the API version in their names.

I noticed there were two more API versions besides v2 and v2beta (The documented ones), called alpha and dogfood.
And I could call methods on those versions, just by replacing v2 or v2beta with either alpha or dogfood in every API call.
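
As a rough illustration of that version swap, here is a minimal sketch that builds the same request path for each version. The typeProviders suffix comes from the public v2beta REST API, the project name is a placeholder, and no request is actually sent.

```python
# Sketch only: builds request paths, it does not call the API.
# The four version names are the ones mentioned in this write-up;
# "my-project" is a placeholder project ID.
VERSIONS = ["v2", "v2beta", "alpha", "dogfood"]


def type_providers_path(version: str, project: str) -> str:
    """Return the list-Type-Providers path for the given API version."""
    return f"/deploymentmanager/{version}/projects/{project}/global/typeProviders"


for version in VERSIONS:
    print(type_providers_path(version, "my-project"))
```

Only the version segment changes between calls; everything else in the path stays the same.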

I played around a bit with the alpha version, but I did not find anything interesting in it.

The dogfood version was a bit more interesting though, especially because I have noticed the word dogfood being used for internal testing in Google services.

Dogfood product versions in Google are usually only intended for googlers, so they use a product and report bugs before the changes make their way to the customers.

Maybe this version had internal features, only intended for googlers!

When I listed the base types on that version, most of them returned an extra field in their definitions: googleOptions.

A couple examples of what this looked like

When I listed my own Type Providers, they also included this extra field, and by specifying the $outputDefaults system parameter in my query, I could see which fields the googleOptions field contained.

I played around with them, creating Type Providers with different values in those fields, and came up with an idea of what each one of them does and its expected values (Note that at this point I wasn't able to figure out what most of them did):

injectProject: Boolean. Regardless of what value I specified, the Deployment Manager API always set it to false on my Type Providers. Effect unknown.
deleteIntent: Enum. I was able to find a single valid value: CREATE_OR_ACQUIRE. Effect unknown.
isLocalProvider: Boolean. Whenever I set it to true, the Type Provider was always successfully created, regardless of values in any other field, but attempting to create Deployments using it always failed with an error saying the descriptor document could not be retrieved.
ownershipKind: Enum. The valid values were UNKNOWN, USER and GOOGLE. No effects were observed by setting it to any of these values, but I always set it to GOOGLE during my research.
transport: Enum. The valid values I found at first were: UNKNOWN_TRANSPORT_TYPE and HARPOON. No effects were observed by setting it to any of these values.
credentialType: Enum. The valid values I found at first were: UNKNOWN_CREDENTIAL_TYPE and OAUTH. No effects were observed by setting it to any of these values.
gslbTarget: String. Either empty or something like blade:<TARGET> or gslb:<TARGET>. No effects were observed by setting it to any value.
descriptorUrlServerSpec: String, either the same as gslbTarget or empty. No effects were observed by setting it to any value.
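
To make the shape of these fields easier to picture, here is a hedged sketch of what a Type Provider request body carrying them might look like. The name and descriptorUrl fields are documented in v2beta; placing googleOptions at the top level of the body is my assumption for this sketch, and every value is a placeholder.

```python
import json

# Hypothetical request body, for illustration only.
# "name" and "descriptorUrl" are documented v2beta fields; the googleOptions
# keys are the ones listed above. All values below are placeholders.
type_provider = {
    "name": "example-provider",
    "descriptorUrl": "https://example.com/descriptor.json",
    "googleOptions": {
        "injectProject": False,
        "deleteIntent": "CREATE_OR_ACQUIRE",
        "isLocalProvider": False,
        "ownershipKind": "GOOGLE",
        "transport": "UNKNOWN_TRANSPORT_TYPE",
        "credentialType": "OAUTH",
        "gslbTarget": "",
        "descriptorUrlServerSpec": "",
    },
}

print(json.dumps(type_provider, indent=2))
```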

This was very promising: GSLB is Google's Global Service Load Balancer, and it acts like a mix between an internal DNS server and a load balancer.

According to the SRE Book, when GSLB is provided a symbolic name (Kind of like a domain name), it will direct traffic to a linked BNS address (Borg Naming Service), which is the Google equivalent of an internal IP address.

It surely looks like this could be used to achieve SSRF to internal servers!
But whatever values I tried on gslbTarget and descriptorUrlServerSpec, they did not seem to have any effect.

I then tried to brute force valid credentialType values, and found a new one: GAIAMINT.

I had seen that name referenced before, for example, in this Google Git commit.

When testing Deployments with a Type Provider using that value, I also tested what happened if I set the Type Provider to use an OAuth 2.0 access token as its authentication mechanism.

Thanks to this, I noticed that a fake API I had set up no longer received an access token in the Authorization header; instead, the header was set to something like this: EndUserCreds 1 <URL-safe Base64 data> (Example).

I am not sure how to decode that, but it looks like it has some protobuf data inside some other binary format, and some strings can be retrieved: anonymous, 331656524293@cloudservices.gserviceaccount.com (The email of the service account Deployment Manager uses for tokens on my project), cloud-dm and cloudgaia::vjgv73:9898.
This looks like it is intended for internal use, and some googlers confirmed it is intended for authentication between internal Google systems; it is probably not possible to use it externally.
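
Pulling readable strings out of such a header value can be sketched roughly as follows. This is not a real decoder for the format, which I do not know; it only Base64-decodes the data and keeps runs of printable ASCII. The sample token is synthetic, built here just for the demonstration.

```python
import base64
import re


def printable_strings(token: str, min_len: int = 4) -> list[str]:
    """Decode URL-safe Base64 and return runs of printable ASCII characters."""
    padded = token + "=" * (-len(token) % 4)  # restore any stripped padding
    raw = base64.urlsafe_b64decode(padded)
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, raw)]


# Synthetic stand-in for the header data, NOT a real EndUserCreds value.
sample_token = base64.urlsafe_b64encode(
    b"\x0a\x09anonymous\x12\x08cloud-dm\x00\x01"
).decode().rstrip("=")

print(printable_strings(sample_token))
```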

But besides this oddity, I was unable to brute force any other valid values for credentialType, nor any value for transport.

At this point I also tried adding staging_ to the beginning of the API version, since I noticed the Google Compute Engine API does that for the Staging environment (Fact mentioned in a few places, like in this GitHub PR), and it worked!
But the Staging environment seemed to work exactly the same way as the Production one.

After several failed attempts to achieve anything significant, I stopped researching Deployment Manager for a couple weeks.

Breakthrough: Exploiting Proto over HTTP

One day, I got the idea of using protocol buffers (A Google-developed binary serialization format) to find out the missing values of the credentialType and transport Enums, since in protobuf, Enums are represented as numbers, not Strings, so I could just count up from 1 until I stopped finding new values.
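
The counting idea can be sketched with a few lines of protobuf wire-format encoding. This is a simplified illustration: the field number 5 is hypothetical, and in a real message the Enum would likely sit inside a nested message that also needs to be wrapped as a length-delimited field.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_enum_field(field_number: int, enum_value: int) -> bytes:
    """Encode one Enum field: key (field number + wire type 0), then the value."""
    key = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(key) + encode_varint(enum_value)


# Hypothetical field number; candidate Enum values 1 through 5.
candidates = [encode_enum_field(5, n) for n in range(1, 6)]
for payload in candidates:
    print(payload.hex())
```

Each candidate payload differs only in its last byte, so trying the next Enum value is just a matter of incrementing a number.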

Protobufs are used mainly for gRPC, a remote procedure call (RPC) system developed by Google, and supported by many Google APIs.
Unfortunately, the Deployment Manager API does not support gRPC, but it does support a relatively-unknown feature: Proto over HTTP.

Proto over HTTP is an experimental gRPC fallback feature available in some Google APIs. It is not very well documented, its availability varies per API, and different APIs might implement it a bit differently. Not every API that supports gRPC supports Proto over HTTP, and vice versa, so I had to check it on the Deployment Manager API. When I did so, I determined:

URL paths stay the same (/deploymentmanager/<VERSION>/projects/<PROJECT>/global/...)
The Content-Type header needs to be set to application/x-protobuf
In Production, it fails with the error message: Proto over HTTP is not allowed for service
It works in Staging!
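
Put together, a Proto over HTTP request could be built along these lines. This is only a sketch: the host is a placeholder, the Bearer token header is my assumption about how the call is authenticated, and the request object is constructed but never sent.

```python
import urllib.request
from typing import Optional


def build_proto_request(host: str, version: str, project: str,
                        resource_path: str, body: Optional[bytes] = None,
                        token: str = "<ACCESS_TOKEN>") -> urllib.request.Request:
    """Build (but do not send) a Proto over HTTP request."""
    url = (f"https://{host}/deploymentmanager/{version}"
           f"/projects/{project}/global/{resource_path}")
    headers = {
        "Content-Type": "application/x-protobuf",  # the header noted above
        "Authorization": f"Bearer {token}",        # assumed auth scheme
    }
    method = "POST" if body is not None else "GET"
    return urllib.request.Request(url, data=body, headers=headers, method=method)


# Placeholder host, project and Type Provider name.
request = build_proto_request("deploymentmanager.example.invalid",
                              "staging_dogfood", "my-project",
                              "typeProviders/my-provider")
print(request.get_method(), request.full_url)
```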

Knowing this, I called the get Type Provider method of the API, and decoded the response protobuf using a tool called protoc (Protocol Buffers compiler) and its --decode_raw option.
This gave me unnamed proto field numbers, and the values assigned to them.

Comparing the values from the retrieved proto and the values in the JSON API, I quickly matched each field number to its field name, and reverse engineered the Type Providers proto message definition.

Quick example of all of this:

I create a Type Provider through the JSON API:
I get that same Type Provider through the JSON API:
I get that same Type Provider through the Proto over HTTP API:
I decode the response with protoc:
I figure out which number corresponds to each field (For example, 1=name, 2=id, 3=insertTime,...)
I construct an approximation of the original proto message definition with that information
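
The decoding and matching steps can be approximated in a few lines of Python. This is a minimal sketch of what a raw decode does, not a replacement for protoc: it handles only the basic wire types, and the sample bytes are made up. The field numbers in FIELD_NAMES are the ones from the example above.

```python
def read_varint(buf: bytes, pos: int):
    """Read a base-128 varint from buf at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_raw(buf: bytes):
    """Return a list of (field_number, wire_type, value) without any schema."""
    fields, pos = [], 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire = key >> 3, key & 7
        if wire == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire}")
        fields.append((number, wire, value))
    return fields


# Field numbers matched by comparing against the JSON response.
FIELD_NAMES = {1: "name", 2: "id", 3: "insertTime"}

# Made-up sample: field 1 = "test" (length-delimited), field 2 = 42 (varint).
sample = b"\x0a\x04test\x10\x2a"
for number, wire, value in decode_raw(sample):
    print(FIELD_NAMES.get(number, f"unknown_{number}"), value)
```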

After some meddling with it, by creating Type Providers with different values in the proto fields through Proto over HTTP, and decoding the protobuf answers, I got a good enough approximation of the values I was missing:

transport

GSLB - It directs requests from the Deployment Manager to the internal Google endpoints specified in gslbTarget and descriptorUrlServerSpec

credentialType

ENDUSERCREDS, TYPE_CREDENTIAL - They seem to act the same way as OAUTH and UNKNOWN_CREDENTIAL_TYPE

Setting transport to GSLB was the key to issuing internal requests!

The bug

With the newly discovered GSLB value for transport, I can craft Type Providers such that the Deployment Manager directs requests to internal Google endpoints... As long as I know where to point gslbTarget to.

Here is an example for creating a Type Provider for Google App Engine Admin API - Test environment (Which since my 2018 GAE RCE, has been blocked externally by a 429 error).
I got blade:apphosting-admin by listing Types on the dogfood version, the appengine.v1.version Type had gslbTarget set to this value.
I added -nightly at the end because, before the GAE Test API got blocked externally in 2018, I had noticed the string nightly a lot in it.

This Type Provider worked flawlessly, and I successfully created a Deployment that used it to launch a new app into GAE Test to check if my 2018 bug was properly fixed (It was).

If I specified an invalid gslbTarget (I always set descriptorUrlServerSpec to the same value as gslbTarget), the Operation for creating the Type Provider would fail in one of three ways: with an error message saying it could not connect to the GSLB endpoint, with the error the internal endpoint returned (Often 404 Not Found), or with a message saying the response was not a valid descriptor document (For example, some endpoints returned plain HTML), along with the response data.
One endpoint even returned an error page with a Java stack trace and a message along the lines of: Debugging information, only visible to internal IPs!
Therefore, I could retrieve some internal information this way.

If I specified some valid gslbTarget, like blade:corp-issuetracker-api for issuetracker.corp.googleapis.com (I got the GSLB name from some of my past research), I would be able to perform calls to the API!
Even though I had no idea what the format of Issue Tracker's resources would look like, this could be easily overcome by calling listTypes on the new Type Provider.

These were interesting issues, but I was a bit doubtful of their impact, especially since requests were being made with the Deployment Manager service account's credentials for my project, which would probably be restricted in which endpoints it was allowed to talk to.

While researching this, I had told some googlers that I had found a way to perform requests to GSLB endpoints, and they told me to write it down on a VRP grant ticket, so that the SRE team could have a heads up of what I was up to, in case they detected my requests.

They also explained one potential issue with requests to GSLB endpoints:

If service A makes a request with service B on behalf of user C, the authorization of user C is checked. If there are no credentials for C, then the authorization of A is checked instead.
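
As a toy model of that rule (only my reading of the explanation, not how Google's systems actually implement it), the fallback can be sketched like this. The ACL contents and all principal names are made up.

```python
from typing import Optional

# Toy access list: which principals may call which backend. All names are made up.
ACL = {"internal-backend": {"service-a"}}


def principal_checked(end_user: Optional[str], calling_service: str) -> str:
    """Return whose authorization is checked: the user's if present, else the caller's."""
    return end_user if end_user is not None else calling_service


def is_allowed(backend: str, end_user: Optional[str], calling_service: str) -> bool:
    """Check the chosen principal against the toy access list."""
    return principal_checked(end_user, calling_service) in ACL.get(backend, set())


# With credentials for user C, user C is the one checked.
print(is_allowed("internal-backend", "user-c", "service-a"))
# Without them, service A's own authorization is checked instead.
print(is_allowed("internal-backend", None, "service-a"))
```

In this model, dropping the user's credentials is what makes the request fall back to the more privileged service identity.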

This was really interesting, since I had noticed that the service account credentials used by Deployment Manager were delegated by cloud-dm-staging@prod.google.com (I could see the delegator's ID in the Cloud Console logs), and I assume this means that this Google prod account has, at least, permissions to delegate tokens for some service accounts.
I would just have to find a way to do so, and remove the service account's credentials so the identity of the Deployment Manager would be used instead.

By this time, it was night in Uruguay, so I just wrote down my research in the grant ticket and stopped researching for the day.

Next morning, my dogs woke me up at about 6 AM, and I noticed notifications of updates on the grant ticket, including one that arrived just as I was reading them:

Eduardo then quickly submitted a VRP report for me, triaged it, escalated it to P0, and issued a Nice catch!
It took less than 5 minutes from report submission to Nice catch!, maybe the fastest RCE VRP report ever :).

Later that day, I asked Eduardo a few questions, and he told me this bug was now being treated as an incident, but just because RCE bugs are treated as such.
Because of this, they asked me to stop further research into it, and send them the details of my actions and findings.

I asked about the potential way this issue could have been exploited, and my understanding is:

Privilege escalation may be achieved through the identity of the Deployment Manager service (cloud-dm-staging@prod.google.com), so it might have access to internal services a normal service account would not have access to
It is not known if there are attack vectors that would allow an attacker to achieve a shell into Google's internal systems, but the privileges could be high enough

Because of this probable maximum impact, Google treated this as RCE, and issued a $31,337 reward (Their current standard amount for RCE).

Thanks so much to the Google VRP!
It was a very interesting bug to research, and I would love to see what other issues could be found in Google Cloud Deployment Manager.

Extra notes

The issue has now been fixed. The fix seems to just be that now gslbTarget and descriptorUrlServerSpec are ignored when specified on the create, patch, or update Type Provider operations.
The dogfood version might still be accessible for a while on the API, but that does not mean it is a security issue by itself. There could be some hidden security hole in it though ;).

Also, after reporting my findings to Google, and even after finishing the first few drafts of this write-up, I had the idea of checking whether the discovery document for the dogfood version of the Staging Deployment Manager API could be accessed publicly.
Lo and behold, it can: https://staging-deploymentmanager.sandbox.googleapis.com/$discovery/rest?version=dogfood (Copy on GitHub, just in case it stops working in the future).

The discovery document includes the googleOptions field, and provides a little bit more insight into what its fields do, but not nearly enough, so even if I had noticed the document before, I would have probably had to perform the same steps I performed during my research.

Timeline

May 7th, 2020: Issue found and mentioned in a VRP grant ticket
May 8th, 2020: Googler checks the issue, submits RCE report and quickly escalates it
May 19th, 2020: Reward of $31,337.00 issued
May 20th, 2020: Issue confirmed as fixed
March 2021: Prize of $133,337 issued for the vulnerability write-up

Ezequiel Pereira
