Competitive Programming

ICPC 2019 - D. Circular DNA

For each gene type i, ignore all markers except s_i and e_i. After choosing a cut position on the circle, we obtain a linear sequence of those markers. Gene type i counts as good if that linear subsequence is a proper...

Source sync Apr 19, 2026

Track ICPC

Year 2019

Files TeX, C++, statement assets

Folder competitive_programming/icpc/2019/D-circular-dna

ICPC2019TeXC++statement textstatement pdf

This page is built from the copied files in competitive_programming/icpc/2019/D-circular-dna. Edit competitive_programming/icpc/2019/D-circular-dna/solution.tex to update the written solution and competitive_programming/icpc/2019/D-circular-dna/solution.cpp to update the implementation.

The website does not replace those files with hand-maintained HTML. It reads the copied source tree during the build and exposes the exact files below.

Raw TeX Raw C++ Meta JSON

Problem statement

Problem Statement

Copied statement text kept beside the solution archive for direct reference.

Raw text Statement PDF

Problem D
                                         Circular DNA
                                    Time limit: 3 seconds
You have an internship with a bioinformatics research group studying DNA. A single strand of DNA
consists of many genes, which fall into different categories called gene types. Gene types are delimited
by specific nucleotide sequences known as gene markers. Each gene type i has a unique start marker si
and a unique end marker ei . After many dirty jobs (growing bacteria, cell extraction, protein engineering,
and so on), your research group can convert DNA into a form consisting of only the gene markers,
removing all the genetic material lying between the markers.
Your research group came up with the interesting hypothesis that gene interpretation depends on whether
the markers of some gene types form properly nested structures. To decide whether markers of gene type
i form a proper nesting in a given sequence of markers w, one needs to consider the subsequence of w
containing only the markers of gene type i (si and ei ), leaving none of them out. The following (and
only the following) are considered to be properly nested structures:

    • si ei
    • si N ei , where N is a properly nested structure
    • AB, where A and B are properly nested structures

Given your computing background, you were assigned to investigate this property, but there is one
further complication. Your group is studying a specific type of DNA called circular DNA, which is
DNA that forms a closed loop. To study nesting in circular DNA, it is necessary to cut the loop at some
location, which results in a unique sequence of markers (the direction of reading is fixed by molecular
properties). Whether a gene type i forms a proper nesting now also depends on where the circular
DNA is cut. Your task is to find the cutting location that maximizes the number of gene types that form a
properly nested structure. Figure D.1 shows an example corresponding to Sample Input 1. The indicated
cut results in the markers for gene type 1 being properly nested.

                                                             e1
                                               s1            1
                                                                          e1
                                                     9                2

                                          e1     8                        3    s1
                                                 7                        4
                                           e42           6        5
                                                                              e2
                                                     s2           s1

              Figure D.1: Illustration of Sample Input 1 with its optimal cutting location.

Input

The first line of input contains an integer n (1 ≤ n ≤ 106 ), the length of the DNA. The next line
contains the DNA sequence, that is, n markers. Each marker is a character c followed by an integer i,
where c ∈ {s, e} specifies whether it is a start or an end marker and i (1 ≤ i ≤ 106 ) is the gene type of
the marker. The given DNA sequence has been obtained from the circular DNA by cutting at an arbitrary
location.

Output

Output one line with two integers p and m, where p is the cutting position that maximizes the number
of different gene types that form a proper nesting, and m is this maximum number of gene types. The
DNA is cut just before the pth input marker (for instance, the cut shown in Figure D.1 has p = 3). If
more than one cutting position yields the same maximum value of m, output the smallest p that does so.

 Sample Input 1                                     Sample Output 1
 9                                                  3 1
 e1 e1 s1 e2 s1 s2 e42 e1 s1

 Sample Input 2                                     Sample Output 2
 8                                                  8 2
 s1 s1 e3 e1 s3 e1 e3 s3

Editorial

Rendered from the copied solution.tex file. The original TeX source remains available below.

Raw TeX

Key Observations

For one fixed gene type $i$, replace every $s_i$ by $+1$ and every $e_i$ by $-1$. All other markers are irrelevant.
A linear sequence is properly nested if and only if:
- the total sum is $0$, and
- every prefix sum is non-negative.
- Therefore, if the counts of $s_i$ and $e_i$ differ, gene type $i$ can never be good for any cut and may be ignored completely.
- Assume the total sum for type $i$ is $0$. Let $P_i(k)$ be the prefix sum after the first $k$ markers of the original input sequence, considering only type $i$. If we cut before position $p$, then the rotated prefix sums are exactly $P_i(k) - P_i(p-1)$. These are all non-negative if and only if $P_i(p-1)$ is a global minimum of the prefix sum array.
- So a cut position $p$ is good for type $i$ exactly when the prefix sum value just before marker $p$ is a minimum value for that type.
Algorithm
1. For each gene type, store its occurrences in original order as positions together with $+1/-1$ depending on whether the marker is $s_i$ or $e_i$.
2. Ignore every type whose numbers of starts and ends differ.
3. For one remaining type, scan its occurrences once to compute the minimum prefix sum value.
4. Scan its occurrences a second time. Between two consecutive occurrences, the prefix sum before the cut is constant, so this type contributes $+1$ to an entire interval of cut positions whenever that constant value equals the minimum.
5. Use one global difference array over cut positions $1 \dots n$ to add those interval contributions for every valid type.
6. Prefix-sum the difference array to obtain, for every cut position, how many gene types are good there. Output the smallest position with maximum value.
Correctness Proof

We prove that the algorithm outputs the smallest cut position with the maximum possible number of properly nested gene types.

Lemma 1.

For a fixed gene type $i$, if the numbers of $s_i$ and $e_i$ in the whole circle differ, then $i$ is not good for any cut.

Proof.

Any properly nested parenthesis sequence has the same number of opening and closing symbols. Rotating the circle does not change the multiset of markers of type $i$, so unequal counts make a valid nesting impossible for every cut. □

Lemma 2.

Assume gene type $i$ has equal counts of $s_i$ and $e_i$. Let $P_i(k)$ be its prefix sum in the original sequence. Then cutting before position $p$ makes type $i$ good if and only if $P_i(p-1)$ is a minimum value among all $P_i(k)$.

Proof.

After cutting before position $p$, the rotated sequence starts at marker $p$. The prefix sum after reading the first part of this rotated sequence is exactly the original prefix sum minus the value at the cut: \[ P_i(k) - P_i(p-1). \] Because the total sum of type $i$ is $0$, the rotated sequence is properly nested exactly when all these values are non-negative. That is equivalent to \[ P_i(k) \ge P_i(p-1) \quad \text{for all } k, \] which says precisely that $P_i(p-1)$ is a global minimum. □

Lemma 3.

For a fixed valid type $i$, the algorithm adds $+1$ exactly to those cut positions where type $i$ is good.

Proof.

Between consecutive occurrences of type $i$, the prefix sum for that type does not change, so every cut position in that interval sees the same value $P_i(p-1)$. By Lemma 2, the type is good exactly on the intervals where this constant value equals the global minimum. The algorithm adds $+1$ to exactly those intervals in the difference array. □

Theorem.

The algorithm outputs the smallest cut position that maximizes the number of good gene types.

Proof.

By Lemma 1, every ignored type is never good and contributes nothing anywhere. By Lemma 3, every remaining type contributes $1$ exactly at the cut positions where it is good. Therefore, after prefix-summing the global difference array, the value at each cut position is exactly the number of good gene types for that cut. Choosing the largest such value, and the smallest position among ties, produces the required answer. □

Complexity Analysis

Each marker is stored once and each valid type is scanned a constant number of times through its own occurrence list. Thus the running time is $O(n + d)$, where $d$ is the number of distinct gene types appearing in the input. The memory usage is $O(n + d)$.

Implementation Notes
- The implementation stores occurrences of each gene type as a linked list inside flat arrays, so the total memory stays linear in $n$.
- The interval for one prefix value is ``from the previous occurrence $+1$ up to the current occurrence'', because cutting before position $p$ uses the prefix sum after the first $p-1$ markers.
- Cut position $1$ corresponds to the original input order, so its prefix value is always the empty-prefix value $0$.

C++ competitive_programming/icpc/2019/D-circular-dna/solution.cpp

Exact copied implementation source.

Raw file

#include <bits/stdc++.h>
using namespace std;

namespace {

void solve() {
    int n;
    cin >> n;

    const int MAX_TYPE = 1000000;
    vector<int> head(MAX_TYPE + 1, -1);
    vector<int> tail(MAX_TYPE + 1, -1);
    vector<int> cnt_start(MAX_TYPE + 1, 0);
    vector<int> cnt_end(MAX_TYPE + 1, 0);
    vector<int> used_types;

    vector<int> pos(n), delta(n), next_idx(n, -1);
    for (int i = 0; i < n; ++i) {
        string token;
        cin >> token;
        int type = stoi(token.substr(1));
        pos[i] = i + 1;
        delta[i] = (token[0] == 's' ? 1 : -1);
        if (head[type] == -1) {
            used_types.push_back(type);
            head[type] = i;
        } else {
            next_idx[tail[type]] = i;
        }
        tail[type] = i;
        if (delta[i] == 1) {
            ++cnt_start[type];
        } else {
            ++cnt_end[type];
        }
    }

    vector<int> diff(n + 2, 0);
    for (int type : used_types) {
        if (cnt_start[type] != cnt_end[type]) {
            continue;
        }

        int cur = 0;
        int min_value = 0;
        for (int at = head[type]; at != -1; at = next_idx[at]) {
            cur += delta[at];
            min_value = min(min_value, cur);
        }

        cur = 0;
        int left = 1;
        for (int at = head[type]; at != -1; at = next_idx[at]) {
            int right = pos[at];
            if (cur == min_value) {
                ++diff[left];
                --diff[right + 1];
            }
            cur += delta[at];
            left = right + 1;
        }
        if (left <= n && cur == min_value) {
            ++diff[left];
            --diff[n + 1];
        }
    }

    int best_pos = 1;
    int best_score = -1;
    int cur = 0;
    for (int p = 1; p <= n; ++p) {
        cur += diff[p];
        if (cur > best_score) {
            best_score = cur;
            best_pos = p;
        }
    }

    cout << best_pos << ' ' << best_score << '\n';
}

}  // namespace

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    solve();
    return 0;
}

TeX write-up competitive_programming/icpc/2019/D-circular-dna/solution.tex

Exact copied write-up source.

Raw file

\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{enumitem}

\title{ICPC World Finals 2019\\D. Circular DNA}
\author{}
\date{}

\begin{document}
\maketitle

\section*{Problem Summary}

For each gene type $i$, ignore all markers except $s_i$ and $e_i$. After choosing a cut position on the circle, we obtain a linear sequence of those markers. Gene type $i$ counts as \emph{good} if that linear subsequence is a properly nested parenthesis sequence, with $s_i$ as ``open'' and $e_i$ as ``close''.

We must choose the cut position that maximizes the number of good gene types, breaking ties by the smallest cut position.

\section*{Key Observations}

\begin{itemize}[leftmargin=*]
    \item For one fixed gene type $i$, replace every $s_i$ by $+1$ and every $e_i$ by $-1$. All other markers are irrelevant.
    \item A linear sequence is properly nested if and only if:
    \begin{itemize}[leftmargin=*]
        \item the total sum is $0$, and
        \item every prefix sum is non-negative.
    \end{itemize}
    \item Therefore, if the counts of $s_i$ and $e_i$ differ, gene type $i$ can never be good for any cut and may be ignored completely.
    \item Assume the total sum for type $i$ is $0$. Let $P_i(k)$ be the prefix sum after the first $k$ markers of the \emph{original} input sequence, considering only type $i$. If we cut before position $p$, then the rotated prefix sums are exactly $P_i(k) - P_i(p-1)$. These are all non-negative if and only if $P_i(p-1)$ is a global minimum of the prefix sum array.
    \item So a cut position $p$ is good for type $i$ exactly when the prefix sum value \emph{just before} marker $p$ is a minimum value for that type.
\end{itemize}

\section*{Algorithm}

\begin{enumerate}[leftmargin=*]
    \item For each gene type, store its occurrences in original order as positions together with $+1/-1$ depending on whether the marker is $s_i$ or $e_i$.
    \item Ignore every type whose numbers of starts and ends differ.
    \item For one remaining type, scan its occurrences once to compute the minimum prefix sum value.
    \item Scan its occurrences a second time. Between two consecutive occurrences, the prefix sum before the cut is constant, so this type contributes $+1$ to an entire interval of cut positions whenever that constant value equals the minimum.
    \item Use one global difference array over cut positions $1 \dots n$ to add those interval contributions for every valid type.
    \item Prefix-sum the difference array to obtain, for every cut position, how many gene types are good there. Output the smallest position with maximum value.
\end{enumerate}

\section*{Correctness Proof}

We prove that the algorithm outputs the smallest cut position with the maximum possible number of properly nested gene types.

\paragraph{Lemma 1.}
For a fixed gene type $i$, if the numbers of $s_i$ and $e_i$ in the whole circle differ, then $i$ is not good for any cut.

\paragraph{Proof.}
Any properly nested parenthesis sequence has the same number of opening and closing symbols. Rotating the circle does not change the multiset of markers of type $i$, so unequal counts make a valid nesting impossible for every cut. \qed

\paragraph{Lemma 2.}
Assume gene type $i$ has equal counts of $s_i$ and $e_i$. Let $P_i(k)$ be its prefix sum in the original sequence. Then cutting before position $p$ makes type $i$ good if and only if $P_i(p-1)$ is a minimum value among all $P_i(k)$.

\paragraph{Proof.}
After cutting before position $p$, the rotated sequence starts at marker $p$. The prefix sum after reading the first part of this rotated sequence is exactly the original prefix sum minus the value at the cut:
\[
P_i(k) - P_i(p-1).
\]
Because the total sum of type $i$ is $0$, the rotated sequence is properly nested exactly when all these values are non-negative. That is equivalent to
\[
P_i(k) \ge P_i(p-1)
\quad \text{for all } k,
\]
which says precisely that $P_i(p-1)$ is a global minimum. \qed

\paragraph{Lemma 3.}
For a fixed valid type $i$, the algorithm adds $+1$ exactly to those cut positions where type $i$ is good.

\paragraph{Proof.}
Between consecutive occurrences of type $i$, the prefix sum for that type does not change, so every cut position in that interval sees the same value $P_i(p-1)$. By Lemma 2, the type is good exactly on the intervals where this constant value equals the global minimum. The algorithm adds $+1$ to exactly those intervals in the difference array. \qed

\paragraph{Theorem.}
The algorithm outputs the smallest cut position that maximizes the number of good gene types.

\paragraph{Proof.}
By Lemma 1, every ignored type is never good and contributes nothing anywhere. By Lemma 3, every remaining type contributes $1$ exactly at the cut positions where it is good. Therefore, after prefix-summing the global difference array, the value at each cut position is exactly the number of good gene types for that cut. Choosing the largest such value, and the smallest position among ties, produces the required answer. \qed

\section*{Complexity Analysis}

Each marker is stored once and each valid type is scanned a constant number of times through its own occurrence list. Thus the running time is $O(n + d)$, where $d$ is the number of distinct gene types appearing in the input. The memory usage is $O(n + d)$.

\section*{Implementation Notes}

\begin{itemize}[leftmargin=*]
    \item The implementation stores occurrences of each gene type as a linked list inside flat arrays, so the total memory stays linear in $n$.
    \item The interval for one prefix value is ``from the previous occurrence $+1$ up to the current occurrence'', because cutting before position $p$ uses the prefix sum after the first $p-1$ markers.
    \item Cut position $1$ corresponds to the original input order, so its prefix value is always the empty-prefix value $0$.
\end{itemize}

\end{document}

C++ implementation competitive_programming/icpc/2019/D-circular-dna/solution.cpp

Exact copied implementation source.

Raw file

#include <bits/stdc++.h>
using namespace std;

namespace {

void solve() {
    int n;
    cin >> n;

    const int MAX_TYPE = 1000000;
    vector<int> head(MAX_TYPE + 1, -1);
    vector<int> tail(MAX_TYPE + 1, -1);
    vector<int> cnt_start(MAX_TYPE + 1, 0);
    vector<int> cnt_end(MAX_TYPE + 1, 0);
    vector<int> used_types;

    vector<int> pos(n), delta(n), next_idx(n, -1);
    for (int i = 0; i < n; ++i) {
        string token;
        cin >> token;
        int type = stoi(token.substr(1));
        pos[i] = i + 1;
        delta[i] = (token[0] == 's' ? 1 : -1);
        if (head[type] == -1) {
            used_types.push_back(type);
            head[type] = i;
        } else {
            next_idx[tail[type]] = i;
        }
        tail[type] = i;
        if (delta[i] == 1) {
            ++cnt_start[type];
        } else {
            ++cnt_end[type];
        }
    }

    vector<int> diff(n + 2, 0);
    for (int type : used_types) {
        if (cnt_start[type] != cnt_end[type]) {
            continue;
        }

        int cur = 0;
        int min_value = 0;
        for (int at = head[type]; at != -1; at = next_idx[at]) {
            cur += delta[at];
            min_value = min(min_value, cur);
        }

        cur = 0;
        int left = 1;
        for (int at = head[type]; at != -1; at = next_idx[at]) {
            int right = pos[at];
            if (cur == min_value) {
                ++diff[left];
                --diff[right + 1];
            }
            cur += delta[at];
            left = right + 1;
        }
        if (left <= n && cur == min_value) {
            ++diff[left];
            --diff[n + 1];
        }
    }

    int best_pos = 1;
    int best_score = -1;
    int cur = 0;
    for (int p = 1; p <= n; ++p) {
        cur += diff[p];
        if (cur > best_score) {
            best_score = cur;
            best_pos = p;
        }
    }

    cout << best_pos << ' ' << best_score << '\n';
}

}  // namespace

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    solve();
    return 0;
}

ICPC 2019 - D. Circular DNA

Source-first archive entry

Problem Statement

Editorial

Key Observations

Algorithm

Correctness Proof

Complexity Analysis

Implementation Notes

Code

Source Files