All Euler problems
Project Euler

Protein Folding

A protein of length n is a string over {H, P} (hydrophobic/polar). It folds as a self-avoiding walk (SAW) on the 2D square lattice. An H-H contact occurs when two H residues at non-consecutive sequ...

Source sync Apr 19, 2026
Problem #0300
Level Level 13
Solved By 1,301
Languages C++, Python
Answer 8.0540771484375
Length 587 words
graphsearchsequence

Problem Statement

This archive keeps the full statement, math, and original media on the page.

In a very simplified form, we can consider proteins as strings consisting of hydrophobic (H) and polar (P) elements, e.g. HHPPHHHPHHPH.

For this problem, the orientation of a protein is important; e.g. HPP is considered distinct from PPH. Thus, there are \(2^n\) distinct proteins consisting of \(n\) elements.

When one encounters these strings in nature, they are always folded in such a way that the number of H-H contact points is as large as possible, since this is energetically advantageous.

As a result, the H-elements tend to accumulate in the inner part, with the P-elements on the outside.

Natural proteins are folded in three dimensions of course, but we will only consider protein folding in two dimensions.

The figure below shows two possible ways that our example protein could be folded (H-H contact points are shown with red dots)

PIC

The folding on the left has only six H-H contact points, thus it would never occur naturally.

On the other hand, the folding on the right has nine H-H contact points, which is optimal for this string.

Assuming that H and P elements are equally likely to occur in any position along the string, the average number of H-H contact points in an optimal folding of a random protein string of length \(8\) turns out to be \(850 / 2^8 = 3.3203125\).

What is the average number of H-H contact points in an optimal folding of a random protein string of length \(15\)?

Give your answer using as many decimal places as necessary for an exact result.

Problem 300: Protein Folding

Mathematical Foundation

Definition. A self-avoiding walk (SAW) of length n1n-1 on Z2\mathbb{Z}^2 is an injective sequence of lattice points (v0,v1,,vn1)(v_0, v_1, \ldots, v_{n-1}) where each vi+1vi1=1|v_{i+1} - v_i|_1 = 1 (adjacent on the lattice).

Definition. The contact topology of a SAW is the set C={(i,j):ij>1,  vivj1=1}\mathcal{C} = \{(i, j) : |i - j| > 1,\; |v_i - v_j|_1 = 1\} of non-consecutive adjacent pairs.

Theorem 1 (Decomposition of Contacts). The total H-H contacts for protein p=(p0,,pn1)p = (p_0, \ldots, p_{n-1}) under walk ww equals

contacts(p,w)=consec(p)+non-consec(p,w)\text{contacts}(p, w) = \text{consec}(p) + \text{non-consec}(p, w)

where consec(p)={i:pi=H and pi+1=H}\text{consec}(p) = |\{i : p_i = H \text{ and } p_{i+1} = H\}| depends only on pp, and non-consec(p,w)={(i,j)C(w):pi=H and pj=H}\text{non-consec}(p, w) = |\{(i,j) \in \mathcal{C}(w) : p_i = H \text{ and } p_j = H\}| depends on both pp and ww.

Proof. Consecutive residues (ij=1|i - j| = 1) are always lattice-adjacent in any SAW, so their H-H contacts are determined solely by the protein string. Non-consecutive contacts depend on the walk geometry. \square

Theorem 2 (Optimality via Topology Enumeration). The optimal contact count for protein pp is

opt(p)=consec(p)+maxCT(i,j)C1[pi=Hpj=H]\text{opt}(p) = \text{consec}(p) + \max_{\mathcal{C} \in \mathcal{T}} \sum_{(i,j) \in \mathcal{C}} \mathbf{1}[p_i = H \wedge p_j = H]

where T\mathcal{T} is the set of all realizable contact topologies for SAWs of length n1n - 1.

Proof. Since consec(p)\text{consec}(p) is constant across all walks, maximizing total contacts is equivalent to maximizing non-consecutive contacts. The non-consecutive contacts depend only on the contact topology C(w)\mathcal{C}(w), so it suffices to enumerate all distinct topologies rather than all walks. \square

Lemma 1 (Bitmask Evaluation). Representing protein pp as a bitmask B{0,1}nB \in \{0, 1\}^n (where bit i=0i = 0 means pi=Hp_i = H and bit i=1i = 1 means pi=Pp_i = P), the contact pair (i,j)(i, j) contributes iff (B&Mij)=0(B \mathbin{\&} M_{ij}) = 0 where Mij=(1i)(1j)M_{ij} = (1 \ll i) \mid (1 \ll j).

Proof. We have pi=Hp_i = H iff bit ii of BB is 0, and pj=Hp_j = H iff bit jj is 0. Thus both are H iff BB has 0 in positions ii and jj, i.e., (B&Mij)=0(B \mathbin{\&} M_{ij}) = 0. \square

Lemma 2 (Symmetry Reduction for SAWs). By the 4-fold rotational symmetry and 2-fold reflective symmetry of the square lattice (dihedral group D4D_4), fixing the first step to go rightward reduces the SAW enumeration by a factor of 4. Mirror-image walks produce the same contact topology, further reducing by a factor of 2 in many cases.

Proof. Under rotation by 90°90°, a SAW ww maps to ww' with the same contact topology (since adjacency on Z2\mathbb{Z}^2 is rotation-invariant). Fixing the first step eliminates 4 equivalent rotations. Reflection about the xx-axis also preserves contact topology, though some walks are self-mirror-symmetric and must be handled carefully. \square

Theorem 3 (Exact Rationality). The average optimal contact count E[opt(p)]E[\text{opt}(p)] over all 2n2^n proteins is a rational number with denominator dividing 2n2^n.

Proof. Each protein is equally likely with probability 2n2^{-n}, and each opt(p)\text{opt}(p) is a non-negative integer. Thus E[opt(p)]=12npopt(p)12nZE[\text{opt}(p)] = \frac{1}{2^n} \sum_{p} \text{opt}(p) \in \frac{1}{2^n}\mathbb{Z}. \square

Editorial

For each of 2^15 protein strings of length 15 (H/P), find the OPTIMAL (maximum) number of H-H contacts over all self-avoiding walks on 2D lattice. Average the optima. H-H contact: two H residues at lattice-adjacent positions. This includes both consecutive (always adjacent in any walk) and non-consecutive pairs (walk-dependent). Algorithm: 1. Enumerate all SAWs of length 15 (14 steps). 2. For each SAW, find non-consecutive contact pairs. 3. Group by contact topology (deduplicate). 4. For each protein (bitmask), find max non-consecutive contacts across all topologies. 5. Add consecutive H-H contacts (walk-independent). 6. Average over all 2^15 proteins. We enumerate all SAWs of n-1 steps (first step fixed rightward). We then iterate over each protein bitmask, find max contacts over topologies. Finally, iterate over C in topologies.

Pseudocode

Enumerate all SAWs of n-1 steps (first step fixed rightward)
For each protein bitmask, find max contacts over topologies
for C in topologies

Complexity Analysis

  • Time: O(W+T2n)O(W + |\mathcal{T}| \cdot 2^n) where WW is the number of SAWs (approximately 4×1064 \times 10^6 for 14 steps before symmetry reduction, 106\sim 10^6 after), and T|\mathcal{T}| is the number of distinct topologies (approximately 5,0005{,}000). The evaluation loop costs T215=5000×327681.6×108|\mathcal{T}| \cdot 2^{15} = 5000 \times 32768 \approx 1.6 \times 10^8 operations.
  • Space: O(Tc)O(|\mathcal{T}| \cdot \overline{c}) for storing topologies, where c\overline{c} is the average number of contact pairs per topology. Plus O(n)O(n) for the backtracking stack.

Answer

8.0540771484375\boxed{8.0540771484375}

Code

Each problem page includes the exact C++ and Python source files from the local archive.

C++ project_euler/problem_300/solution.cpp
#include <bits/stdc++.h>
using namespace std;

// Problem 300: Protein Folding
//
// For each of 2^15 protein strings of length 15 (H=0, P=1 in bitmask),
// find the optimal (max) number of H-H contacts over all SAWs.
// H-H contact: two H residues at lattice-adjacent positions.
// This includes BOTH consecutive and non-consecutive pairs.
//
// Consecutive pairs (i, i+1) are ALWAYS lattice-adjacent in any SAW,
// so their contribution is walk-independent and can be precomputed.
//
// Non-consecutive contacts (|i-j| > 1) depend on the walk.
// We enumerate all SAWs, find the maximum non-consecutive contacts per protein,
// then add the consecutive contacts.

const int N = 15;
const int GRID = 31;
const int CENTER = 15;

int pos[N][2];
bool visited[GRID][GRID];
int best[1 << N]; // best non-consecutive contacts per protein

set<vector<int>> seen_topologies;

void generateWalks(int step) {
    if (step == N) {
        // Find non-consecutive contact pairs
        vector<int> contact_masks;
        for (int i = 0; i < N; i++) {
            for (int j = i + 2; j < N; j++) {
                int dx = abs(pos[i][0] - pos[j][0]);
                int dy = abs(pos[i][1] - pos[j][1]);
                if (dx + dy == 1) {
                    contact_masks.push_back((1 << i) | (1 << j));
                }
            }
        }
        if (contact_masks.empty()) return;

        sort(contact_masks.begin(), contact_masks.end());
        if (!seen_topologies.insert(contact_masks).second) return;

        int nc = contact_masks.size();
        for (int protein = 0; protein < (1 << N); protein++) {
            int c = 0;
            for (int k = 0; k < nc; k++) {
                if ((protein & contact_masks[k]) == 0) c++;
            }
            if (c > best[protein]) best[protein] = c;
        }
        return;
    }

    static const int dx[] = {1, -1, 0, 0};
    static const int dy[] = {0, 0, 1, -1};
    for (int dir = 0; dir < 4; dir++) {
        int nx = pos[step-1][0] + dx[dir];
        int ny = pos[step-1][1] + dy[dir];
        if (nx < 0 || nx >= GRID || ny < 0 || ny >= GRID) continue;
        if (visited[nx][ny]) continue;
        pos[step][0] = nx;
        pos[step][1] = ny;
        visited[nx][ny] = true;
        generateWalks(step + 1);
        visited[nx][ny] = false;
    }
}

int main(){
    memset(best, 0, sizeof(best));
    memset(visited, false, sizeof(visited));

    pos[0][0] = CENTER;
    pos[0][1] = CENTER;
    visited[CENTER][CENTER] = true;

    // Enumerate all SAWs (all 4 starting directions)
    generateWalks(1);

    // Compute total: best non-consecutive contacts + consecutive H-H contacts
    long long total = 0;
    for (int protein = 0; protein < (1 << N); protein++) {
        int consec = 0;
        for (int i = 0; i < N - 1; i++) {
            if (((protein >> i) & 1) == 0 && ((protein >> (i+1)) & 1) == 0) {
                consec++;
            }
        }
        total += best[protein] + consec;
    }

    double avg = (double)total / (1 << N);
    printf("%.13f\n", avg);

    return 0;
}