Scraping Chinese government websites

How Chinese government documents are created

The following workflow for Chinese government documents was pieced together from evidence in the webpage source code itself, as well as from documents published by TRS, the main company involved in the operation.

The gov.cn open government information system (信息公开) operates as a structured content pipeline built on the TRS WCM (拓尔思内容协作平台) v7.x content management platform. Documents originate as rich-text files (typically WPS or Word) written by civil servants and government and Party officials, and are either edited directly in the content management system or, more likely, imported with formatting preserved. During ingestion, each document is stored as a record consisting of metadata fields (such as title, issuer, dates, and identifiers) alongside a rich-text content body, with minimal normalisation applied, so that paragraph order and inline formatting remain largely unchanged from the original source. This is an extremely important point for scholars wishing to scrape and analyse Chinese government documents via this route.
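
To make that record structure concrete, the sketch below models one such record in Python. The field names are my own illustration of the metadata categories described above, not the actual TRS WCM schema:

    from dataclasses import dataclass

    # Hypothetical shape of one document record; field names are
    # illustrative, since the real TRS WCM schema is not public.
    @dataclass
    class GovDocument:
        index_number: str   # 索引号 (index number)
        subject: str        # 主题分类 (subject classification)
        issuer: str         # 发文机关 (issuing organ)
        date_written: str   # 成文日期 (date of writing)
        title: str
        body_html: str      # rich-text body, injected verbatim into the template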

These records are stored in a relational database and (this is conjecture on my part) pass through a multi-step editorial workflow of draft, review, and approval, although this workflow is not visible in the final published output. Publication is handled by a server-side templating system, in which a single, consistent template injects record fields into fixed HTML structures; the document body is inserted verbatim into a designated content container, while metadata is rendered in multiple locations across the page. The final output consists of static HTML files generated at publish time, with no client-side rendering or content generation; JavaScript is included only for interface features such as accessibility, navigation, and analytics, and does not affect the document text. The metadata block from a typical gov.cn document page illustrates this template-driven structure (the fields are index number, subject classification, issuing organ, and date of writing):


          <div class="pchide abstract mxxgkabstract">
            <h2>索  引  号:</h2>
            <p>000014349/2008-00149</p>
            <h2>主题分类:</h2>
            <p class="zcwj_ztfl">财政、金融、审计\证券</p>
            <h2>发文机关:</h2>
            <p>国务院办公厅</p>
            <h2>成文日期:</h2>
            <p>2008年10月18日</p>
          </div>
          

As a result, the published HTML combines a stable, template-driven structure with raw rich-text content. Structural variation across documents therefore comes from the data itself rather than from changes in templates, which makes the corpus highly consistent and well suited to large-scale parsing and analysis.
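
Because the metadata block follows a fixed pattern of alternating <h2> labels and <p> values, it can be parsed generically. A minimal sketch using BeautifulSoup, assuming the class names shown in the excerpt above:

    from bs4 import BeautifulSoup

    def parse_metadata(html: str) -> dict:
        """Extract label/value pairs from the metadata block."""
        soup = BeautifulSoup(html, "html.parser")
        block = soup.find("div", class_="mxxgkabstract")
        metadata = {}
        if block is None:
            return metadata
        # Labels sit in <h2> elements; each value is the <p> that follows.
        for label in block.find_all("h2"):
            value = label.find_next_sibling("p")
            if value is not None:
                # Drop alignment spaces and the trailing colon from the label.
                key = label.get_text().replace("\u3000", "").replace(" ", "").rstrip(":：")
                metadata[key] = value.get_text(strip=True)
        return metadata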

Scraping process

The scraping process follows a simple two-stage pipeline. First, a backend search API (athena) is queried via HTTP POST requests to retrieve structured metadata about documents, including titles, publication dates, and URLs. This API returns paginated JSON data, which must be iterated over to collect the full set of document entries. Once all metadata is collected, each document's pub_url is used to perform a direct HTTP GET request. Unlike modern JavaScript-heavy sites, these pages are fully server-rendered and return complete HTML in a single response. This is the best-possible scenario for website scraping; if only every scraping process were this easy! The final HTML documents are static and require no JavaScript execution. All textual content is already present in the response body, typically organised as a linear sequence of paragraph elements. There is no client-side rendering, no lazy loading, and no additional API calls required to access the full text. Perfect!
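
A minimal sketch of that two-stage pipeline in Python follows. The endpoint URL, request parameters, and response keys are placeholders (only pub_url is taken from the description above), so treat this as a template to fill in from the site's real network traffic rather than a drop-in client:

    import requests

    # Placeholder endpoint; substitute the real athena search URL.
    SEARCH_API = "https://example.gov.cn/athena/search"

    def fetch_all_metadata(query: dict) -> list[dict]:
        """Stage 1: page through the search API, collecting document metadata."""
        entries, page = [], 1
        while True:
            # Parameter and key names here are assumptions.
            resp = requests.post(SEARCH_API, data={**query, "page": page})
            resp.raise_for_status()
            batch = resp.json().get("results", [])
            if not batch:
                break
            entries.extend(batch)
            page += 1
        return entries

    def fetch_document(entry: dict) -> str:
        """Stage 2: GET the server-rendered page; the full text arrives in one response."""
        resp = requests.get(entry["pub_url"])    # pub_url comes from stage 1
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding   # pages may declare GBK or UTF-8
        return resp.text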

The structure of the returned HTML is simple and consistent: a flat sequence of paragraph tags containing the full document text. This makes it well suited for large-scale corpus construction and computational analysis.
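
Extracting the text therefore reduces to collecting those paragraph elements in order. A minimal sketch, again with BeautifulSoup; the container class here is hypothetical, so substitute whatever class the template actually assigns to the content container:

    from bs4 import BeautifulSoup

    def extract_body_text(html: str, container_class: str = "content") -> str:
        """Join the document's paragraphs, in order, into plain text."""
        soup = BeautifulSoup(html, "html.parser")
        # Fall back to the whole page if the guessed container is absent.
        container = soup.find("div", class_=container_class) or soup
        paragraphs = (p.get_text(strip=True) for p in container.find_all("p"))
        return "\n".join(p for p in paragraphs if p)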