The New York Times recently celebrated its 20th year on the web. Of course, today’s digital platforms differ drastically from those of decades past, and this makes it imperative that we modernize the presentation of archival data.
In 2013, we launched a redesign of our entire digital platform that gave readers a more modern, fluid, and mobile-friendly experience through improvements such as faster performance, responsive layouts, and dynamic page rendering. While our new design upgraded reader experience for new articles, engineering and resource challenges prevented us from migrating previously published articles into this new design.
Today we are thrilled to announce that, thanks to a cross-team migration effort, nearly every article published since 2004 is available to our readers in the new and improved design.
As so often happens, the seemingly ordinary task of content migration quickly ballooned into a complex project involving a number of technical challenges. Turns out, converting the approximately 14 million articles published between 1851 and 2006 into a format compatible with our current CMS and reader experiences was not so straightforward.
At first, the problem seemed simple: we had an archive of XML, and we needed to convert it into a JSON format that our CMS could ingest. For most of our archive data, from 1851 to 1980, the XML files included sufficient data, and all we needed to do was parse the XML and rewrite it in the new format.
Stories from 1981 through 2006 were trickier. We compared the articles parsed from XML to a sample of articles currently served on the website and found that in 2004 alone there were more than 60,000 articles on our website that were not included in the XML archive. From 1981 onward, there were possibly hundreds of thousands of online-only articles missing from the archive, which reflected only what appeared in the print edition. This posed a problem because missing articles would show up as 404 Not Found pages, which would degrade the user experience and damage our ranking on search engines.
Creating the Definitive List of Articles
To successfully migrate our archive, we needed to create a “definitive” list of all articles appearing on the website. To construct this list we consulted several additional data sources including analytics, sitemaps and our database of book, film and restaurant reviews.
The Archive Migration Pipeline
With our definitive list of articles established, it became clear that we would need to derive structured data from raw HTML for items not present in our archive XML.
To achieve this, we implemented an archive migration pipeline with the following steps:
1. Given the definitive list of URLs and archive XML for a given year, determine which URLs are missing from the XML.
2. Obtain raw HTML of the missing articles.
3. Compare archive XML and raw HTML to find duplicate data and output the “matches” between XML and HTML content.
4. Re-process the archive XML and convert into JSON for the CMS, taking into account extra metadata from corresponding HTML found in step 3.
5. Scrape and process the HTML that did not correspond to any XML from step 3 and convert into JSON for the CMS.
6. Combine the output from steps 4 and 5 to remove any duplicate URLs.
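As a rough illustration, the first and last steps of the list above come down to set operations on URLs. The sketch below is not our production code: the function names, the record shape (a dict with a `url` key) and the rule that XML-derived records win over scraped ones are all assumptions made for the example.

```python
def find_missing_urls(definitive_urls, xml_urls):
    """First step: URLs on the definitive list that the archive XML does not cover."""
    return sorted(set(definitive_urls) - set(xml_urls))


def combine_outputs(xml_records, html_records):
    """Last step: merge both outputs, keeping exactly one record per URL.

    Assumption for this sketch: when both sources produce the same URL,
    the record converted from archive XML is the one we keep.
    """
    combined = {}
    for record in list(html_records) + list(xml_records):
        # XML records come later in the loop, so they overwrite HTML records.
        combined[record["url"]] = record
    return list(combined.values())
```

Calling `find_missing_urls` with a year's definitive list and the URLs found in that year's XML yields the list of pages whose raw HTML still has to be fetched.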
Our plan for the archive migration pipeline presented a few technical challenges.
Scraping Raw HTML and XML
Our CMS stores a lot of metadata about articles: publication date, section, headline, byline, dateline, summary and so on. We needed a way to extract this metadata, in addition to the article content itself, from raw HTML and XML. We used the ElementTree parser from Python’s standard library to process the XML and BeautifulSoup to process the HTML.
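To give a flavor of the XML side, here is a minimal sketch of pulling metadata out of an article with ElementTree and emitting JSON. The element names (`headline`, `byline`, `pubdate`, `section`, `body`) and the output field names are invented for the example; they are not the actual archive schema or the format our CMS expects.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical article markup, used only for this illustration.
SAMPLE_XML = """\
<article>
  <headline>Example Headline</headline>
  <byline>By A REPORTER</byline>
  <pubdate>2004-02-12</pubdate>
  <section>National</section>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</article>
"""

METADATA_FIELDS = ("headline", "byline", "pubdate", "section")


def xml_to_record(xml_text):
    """Parse one article and return a JSON-serializable dict."""
    root = ET.fromstring(xml_text)
    record = {}
    for field in METADATA_FIELDS:
        # findtext returns None when the element is absent.
        value = root.findtext(field)
        record[field] = value.strip() if value else None
    # Join the text of each paragraph, including text inside nested tags.
    record["body"] = [
        "".join(p.itertext()).strip() for p in root.findall("./body/p")
    ]
    return record


print(json.dumps(xml_to_record(SAMPLE_XML), indent=2))
```

Missing elements come back as `None` rather than raising, which matters for an archive whose older records do not carry every field.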
As part of our migration process, we are generating new, SEO-friendly URLs for old content so that readers can more easily find our historical data. SEO-friendly URLs typically include some keywords related to the content of the page, a practice that wasn’t standardized in our archive.
For example, on Feb. 12, 2004, the article “San Francisco City Officials Perform Gay Marriages” appeared under a URL ending with “12CND-FRIS.html.” Realizing we could provide a much more informative link, we derived a new URL from the headline. Now this article is referenced by a URL ending with “san-francisco-city-officials-perform-gay-marriages.html,” a far more intuitive scheme.
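Deriving the URL ending from a headline can be sketched in a few lines. The helper names `slugify` and `seo_filename` are hypothetical, and the exact rules we apply in production (length limits, collision handling and so on) are not shown here; this only reproduces the pattern in the example above.

```python
import re
import unicodedata


def slugify(headline):
    """Lowercase a headline and join its words with hyphens."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", headline)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def seo_filename(headline):
    """Build the final path segment for an article URL."""
    return slugify(headline) + ".html"


print(seo_filename("San Francisco City Officials Perform Gay Marriages"))
# prints: san-francisco-city-officials-perform-gay-marriages.html
```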
Handling Duplicate Content
Once we identified which URLs were missing from our archive, we realized we had a new problem: duplicate content. Some “missing” URLs pointed to HTML documents containing content already present in our XML archive. If we converted both the XML and HTML to JSON without identifying duplicated content, many articles would end up with more than one URL, which would cause duplicate pages to compete against each other for relevance ranking on search engines.
Clearly, we needed to find which XML articles correspond to which HTML articles. As an additional challenge, we had to use a method that didn’t rely on exact string matching, because there could be slight differences between archive XML and HTML, such as extra text, that would prevent the two sources from being exactly the same. To tackle these issues, we used an algorithm developed for another one of our projects, TimesMachine, which relies on a text “shingling” technique. Read more about the technique here.
This technique successfully matched a majority of “missing” HTML articles to existing XML articles. For example, in 2004, we initially had 60K missing articles, but this step successfully matched over 42K articles, reducing the number of potential duplicates by 70%. The remaining 30% of articles would be scraped using BeautifulSoup.
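For readers unfamiliar with shingling, the general idea is to break each text into overlapping runs of k words and compare the resulting sets. The sketch below is a generic word-shingle comparison using Jaccard similarity, not the TimesMachine algorithm itself; the function names, the shingle size and the 0.8 threshold are arbitrary choices for illustration.

```python
import re


def shingles(text, k=5):
    """Return the set of overlapping k-word sequences in a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_probable_duplicate(xml_text, html_text, k=5, threshold=0.8):
    """Treat two texts as the same article when their shingle sets overlap enough."""
    return jaccard(shingles(xml_text, k), shingles(html_text, k)) >= threshold
```

Because the comparison works on sets of word runs, a few extra words in the HTML version lower the score slightly instead of breaking the match outright, which is the property exact string comparison lacks.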
While our original goal was to modernize our digital archive, the migration project has led to opportunities for future projects to engage our readers in our treasure trove of historical news data.
For example, we recently expanded TimesMachine, our custom PDF reader, to include newspaper scans from 1981–2002. However, article text from 1851–1980 is still only available as scans in TimesMachine. Full digital text will take this experience a step further.
We’re currently collaborating with a transcription services company to bridge that gap, starting with 1960–1980, so that readers can more easily find, research, and experience content throughout history. Things are going well: we’ve just released the full digital text of every article written in the 1970s.
We will continue to update NYTimes.com with newly migrated and transcribed articles in the near future. Stay tuned! You can follow @NYTArchives on Twitter for more updates.