All IOI entries
Competitive Programming

IOI 2010 - Language


Source sync Apr 19, 2026
Track IOI
Year 2010
Files TeX and C++
Folder competitive_programming/ioi/2010/language

Source-first archive entry

This page is built from the copied files in competitive_programming/ioi/2010/language. Edit competitive_programming/ioi/2010/language/solution.tex to update the written solution and competitive_programming/ioi/2010/language/solution.cpp to update the implementation.

The website does not replace those files with hand-maintained HTML. It reads the copied source tree during the build and exposes the exact files below.

Problem Statement

This archive entry does not include a standalone statement file, and solution.tex has no dedicated statement section. The statement is not mirrored as a separate asset in this archive; the editorial and source files remain available below.

Editorial

Rendered from the copied solution.tex file. The original TeX source remains available below.

Nature of the Task

This is not a classical exact-algorithm task. It is an online classification problem: for each excerpt, we must guess one of 56 languages, then the grader reveals the correct answer and we may update our model.

The official editorial explicitly notes that many approaches are possible: Rocchio-style classification is enough for the easier subtask, while stronger results come from using character bigrams and trigrams. Therefore the right goal here is not a proof of optimality, but a strong, clean heuristic model.

Chosen Model

We use a hashed character $n$-gram model with $n \in \{1,2,3\}$.

For each language we maintain frequency tables for:

  • unigrams,

  • bigrams,

  • trigrams.

The Unicode symbol IDs are randomized by the task setter, so hand-written language knowledge is useless. However, local symbol patterns still carry signal: neighboring characters in the same language version of Wikipedia create stable statistical fingerprints. Character $n$-grams capture exactly this.

To keep the model compact and fast, each $n$-gram is hashed into a fixed-size bucket table. This introduces a small number of collisions, but allows a much denser and faster implementation than storing every raw $n$-gram explicitly.

Scoring Rule

Given an excerpt, we extract its sparse unigram / bigram / trigram counts and score each language with a smoothed multinomial log-likelihood: \[ \text{score}(\ell) = w_1 \sum_{g \in G_1} c(g)\log \Pr(g \mid \ell) + w_2 \sum_{g \in G_2} c(g)\log \Pr(g \mid \ell) + w_3 \sum_{g \in G_3} c(g)\log \Pr(g \mid \ell) + w_0 \log(1 + \text{docs}_\ell). \]

Here:

  • $G_1, G_2, G_3$ are the unigram, bigram, and trigram feature sets,

  • $c(g)$ is the count of feature $g$ in the current excerpt,

  • $\Pr(g \mid \ell)$ is estimated from the online counts for language $\ell$ using additive smoothing,

  • the trigram term gets the highest weight, because it carries the most contextual information.

This is substantially stronger than the previous baseline, which only used reduced bigram overlap.

Online Update

After the grader reveals the true language, we simply add the excerpt's unigram, bigram, and trigram counts to that language model. Since the task is online, this continual update is the whole point: the classifier improves as more labeled excerpts arrive.

Complexity

If an excerpt has length $L$:

  • feature extraction takes $O(L \log L)$ because we sort the hashed $n$-gram lists to compress them into sparse counts;

  • scoring takes time proportional to the number of sparse features times the number of languages already observed;

  • memory usage is fixed by the hash-table dimensions for the 56 languages.

Remark

For this specific task, presenting a strong heuristic and describing it honestly is more rigorous than pretending there is an exact combinatorial solution. The implementation below does exactly that.

Code

Exact copied C++ implementation from solution.cpp.

C++ competitive_programming/ioi/2010/language/solution.cpp

Exact copied implementation source.

Raw file
#include <bits/stdc++.h>
using namespace std;

// IOI 2010 - Language
// This task is heuristic by nature. We use a hashed character n-gram model
// with online updates and smoothed log-likelihood scoring.

namespace {

constexpr int NLANG = 56;
constexpr int UNI_DIM = 1 << 12;
constexpr int BI_DIM = 1 << 15;
constexpr int TRI_DIM = 1 << 16;

constexpr double ALPHA = 0.25;
constexpr double UNI_W = 0.5;
constexpr double BI_W = 1.5;
constexpr double TRI_W = 3.0;
constexpr double PRIOR_W = 0.35;

int unigram[NLANG][UNI_DIM];
int bigram[NLANG][BI_DIM];
int trigram[NLANG][TRI_DIM];
int total_uni[NLANG];
int total_bi[NLANG];
int total_tri[NLANG];
int doc_count[NLANG];

struct SparseCounts {
    vector<pair<int, int>> uni;
    vector<pair<int, int>> bi;
    vector<pair<int, int>> tri;
};

uint32_t mix_bits(uint32_t x) {
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

int hash1(int a) {
    return int(mix_bits(uint32_t(a)) & (UNI_DIM - 1));
}

int hash2(int a, int b) {
    uint32_t x = uint32_t(a) * 1000003U ^ uint32_t(b) * 9176U ^ 0x9e3779b9U;
    return int(mix_bits(x) & (BI_DIM - 1));
}

int hash3(int a, int b, int c) {
    uint32_t x = uint32_t(a) * 73856093U ^
                 uint32_t(b) * 19349663U ^
                 uint32_t(c) * 83492791U;
    return int(mix_bits(x) & (TRI_DIM - 1));
}

vector<pair<int, int>> compress(vector<int>& ids) {
    sort(ids.begin(), ids.end());
    vector<pair<int, int>> out;
    for (int id : ids) {
        if (out.empty() || out.back().first != id) {
            out.push_back({id, 1});
        } else {
            out.back().second++;
        }
    }
    return out;
}

SparseCounts extract_features(const int* E, int len) {
    SparseCounts feats;

    vector<int> uni_ids;
    uni_ids.reserve(len);
    for (int i = 0; i < len; i++) {
        uni_ids.push_back(hash1(E[i]));
    }
    feats.uni = compress(uni_ids);

    if (len >= 2) {
        vector<int> bi_ids;
        bi_ids.reserve(len - 1);
        for (int i = 0; i + 1 < len; i++) {
            bi_ids.push_back(hash2(E[i], E[i + 1]));
        }
        feats.bi = compress(bi_ids);
    }

    if (len >= 3) {
        vector<int> tri_ids;
        tri_ids.reserve(len - 2);
        for (int i = 0; i + 2 < len; i++) {
            tri_ids.push_back(hash3(E[i], E[i + 1], E[i + 2]));
        }
        feats.tri = compress(tri_ids);
    }

    return feats;
}

double score_language(const SparseCounts& feats, int lang) {
    if (doc_count[lang] == 0) {
        return -1e100;
    }

    double score = PRIOR_W * log(1.0 + doc_count[lang]);

    double uni_denom = log(total_uni[lang] + ALPHA * UNI_DIM);
    for (const auto& [id, cnt] : feats.uni) {
        score += UNI_W * cnt * (log(unigram[lang][id] + ALPHA) - uni_denom);
    }

    if (!feats.bi.empty()) {
        double bi_denom = log(total_bi[lang] + ALPHA * BI_DIM);
        for (const auto& [id, cnt] : feats.bi) {
            score += BI_W * cnt * (log(bigram[lang][id] + ALPHA) - bi_denom);
        }
    }

    if (!feats.tri.empty()) {
        double tri_denom = log(total_tri[lang] + ALPHA * TRI_DIM);
        for (const auto& [id, cnt] : feats.tri) {
            score += TRI_W * cnt * (log(trigram[lang][id] + ALPHA) - tri_denom);
        }
    }

    return score;
}

}  // namespace

int classify(int* E, int len) {
    SparseCounts feats = extract_features(E, len);

    int best_lang = 0;
    double best_score = -1e100;
    bool have_model = false;

    for (int lang = 0; lang < NLANG; lang++) {
        if (doc_count[lang] == 0) {
            continue;
        }
        have_model = true;
        double cur = score_language(feats, lang);
        if (cur > best_score) {
            best_score = cur;
            best_lang = lang;
        }
    }

    return have_model ? best_lang : 0;
}

void learn(int* E, int len, int lang) {
    SparseCounts feats = extract_features(E, len);
    doc_count[lang]++;

    for (const auto& [id, cnt] : feats.uni) {
        unigram[lang][id] += cnt;
        total_uni[lang] += cnt;
    }
    for (const auto& [id, cnt] : feats.bi) {
        bigram[lang][id] += cnt;
        total_bi[lang] += cnt;
    }
    for (const auto& [id, cnt] : feats.tri) {
        trigram[lang][id] += cnt;
        total_tri[lang] += cnt;
    }
}

int main() {
    int T;
    cin >> T;
    while (T--) {
        int len;
        cin >> len;
        vector<int> E(len);
        for (int i = 0; i < len; i++) {
            cin >> E[i];
        }

        int guess = classify(E.data(), len);
        cout << guess << '\n';

        int correct;
        cin >> correct;
        learn(E.data(), len, correct);
    }
    return 0;
}

Source Files

Exact copied source-of-truth files. Edit solution.tex for the write-up and solution.cpp for the implementation.

TeX write-up competitive_programming/ioi/2010/language/solution.tex

Exact copied write-up source.

Raw file
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{hyperref}

\lstset{
  language=C++,
  basicstyle=\ttfamily\small,
  keywordstyle=\color{blue}\bfseries,
  commentstyle=\color{green!60!black},
  stringstyle=\color{red!70!black},
  numbers=left,
  numberstyle=\tiny\color{gray},
  breaklines=true,
  frame=single,
  tabsize=4,
  showstringspaces=false
}

\title{IOI 2010 -- Language}
\author{Solution}
\date{}

\begin{document}
\maketitle

\section{Nature of the Task}

This is \emph{not} a classical exact-algorithm task. It is an online
classification problem: for each excerpt, we must guess one of 56 languages,
then the grader reveals the correct answer and we may update our model.

The official editorial explicitly notes that many approaches are possible:
Rocchio-style classification is enough for the easier subtask, while stronger
results come from using character bigrams and trigrams. Therefore the right goal
here is not a proof of optimality, but a strong, clean heuristic model.

\section{Chosen Model}

We use a \textbf{hashed character $n$-gram model} with
\[
  n \in \{1,2,3\}.
\]

For each language we maintain frequency tables for:
\begin{itemize}
  \item unigrams,
  \item bigrams,
  \item trigrams.
\end{itemize}

The Unicode symbol IDs are randomized by the task setter, so hand-written
language knowledge is useless. However, \emph{local symbol patterns} still carry
signal: neighboring characters in the same language version of Wikipedia create
stable statistical fingerprints. Character $n$-grams capture exactly this.

To keep the model compact and fast, each $n$-gram is hashed into a fixed-size
bucket table. This introduces a small number of collisions, but allows a much
denser and faster implementation than storing every raw $n$-gram explicitly.

\section{Scoring Rule}

Given an excerpt, we extract its sparse unigram / bigram / trigram counts and
score each language with a smoothed multinomial log-likelihood:
\[
  \text{score}(\ell)
  =
  w_1 \sum_{g \in G_1} c(g)\log \Pr(g \mid \ell)
  +
  w_2 \sum_{g \in G_2} c(g)\log \Pr(g \mid \ell)
  +
  w_3 \sum_{g \in G_3} c(g)\log \Pr(g \mid \ell)
  +
  w_0 \log(1 + \text{docs}_\ell).
\]

Here:
\begin{itemize}
  \item $G_1, G_2, G_3$ are the unigram, bigram, and trigram feature sets,
  \item $c(g)$ is the count of feature $g$ in the current excerpt,
  \item $\Pr(g \mid \ell)$ is estimated from the online counts for language
        $\ell$ using additive smoothing,
  \item the trigram term gets the highest weight, because it carries the most
        contextual information.
\end{itemize}

This is substantially stronger than the previous baseline, which only used
reduced bigram overlap.

\section{Online Update}

After the grader reveals the true language, we simply add the excerpt's
unigram, bigram, and trigram counts to that language model. Since the task is
online, this continual update is the whole point: the classifier improves as
more labeled excerpts arrive.

\section{Complexity}

If an excerpt has length $L$:
\begin{itemize}
  \item feature extraction takes $O(L \log L)$ because we sort the hashed
        $n$-gram lists to compress them into sparse counts;
  \item scoring takes time proportional to the number of sparse features times
        the number of languages already observed;
  \item memory usage is fixed by the hash-table dimensions for the 56 languages.
\end{itemize}

\section{Remark}

For this specific task, presenting a strong heuristic and describing it honestly
is more rigorous than pretending there is an exact combinatorial solution. The
implementation below does exactly that.

\end{document}