Dealing with PDFs – an SEO’s guide

There might be instances in which Google encounters non-HTML files such as PDF documents and determines they deserve to rank higher than HTML pages.  This behaviour may not always be desirable.

Why do PDFs rank higher than HTML pages?

The text within PDFs is readable by search engines.  Content presented in PDF format is often particularly in-depth, spanning multiple pages and rich in keywords, whereas the equivalent HTML pages may lack depth and keyword relevance.  Some report-style content is also cited externally, resulting in inbound links pointing to the PDF files.  As a result, search engines may see the PDF as the better result to return to users, and the PDF outranks the HTML webpage.

Why would you not want a PDF to rank?

Due to associated problems with PDF files, it is often preferable to rank an HTML page in their place because PDFs:

  • Provide visitors with a poor user experience, particularly on mobile.
  • Lack site navigation, which takes users away from the website and makes it harder for them to discover new or relevant pages that could otherwise increase overall traffic and conversions.
  • Are difficult to maintain and update, so visitors might be served outdated or unreliable content.
  • Do not allow for implementation of structured markup.  PDFs do not work the same way as HTML, which prevents structured markup from being applied to the content.
  • Are hard to track.  The data Google Analytics provides on how people use PDFs is limited, for example data on which links users followed to reach a PDF.

Approaches Used to Handle PDF Documents.

Various approaches can be taken to favour HTML pages over PDF documents in search results. These might include:

  1. Adding an “X-Robots-Tag: noindex” header to the HTTP response used to serve the document.
  2. Inserting a canonical link in the PDF’s HTTP header referencing the HTML page.
  3. Creating an HTML version of the published PDF.
  4. Linking to the HTML pages containing the PDF content instead of to the PDFs themselves.
  5. Disallowing the PDF in the robots.txt file, although this only stops it from being crawled and does not prevent it from being indexed in the search results.
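As a rough illustration of approaches 1 and 2, the sketch below builds the extra HTTP response headers a server could attach when serving a PDF. It is a minimal example rather than a recommended implementation: the function name and the example URL are hypothetical, and how you attach the headers depends on your web server or framework. The header names themselves, X-Robots-Tag and Link with rel="canonical", are the ones search engines document for non-HTML files.

```python
from typing import Dict, Optional


def pdf_response_headers(noindex: bool = False,
                         canonical_url: Optional[str] = None) -> Dict[str, str]:
    """Build extra HTTP response headers for a PDF (hypothetical helper)."""
    headers = {"Content-Type": "application/pdf"}
    if noindex:
        # Approach 1: ask search engines not to index the file.
        headers["X-Robots-Tag"] = "noindex"
    if canonical_url:
        # Approach 2: point search engines at the preferred HTML page.
        headers["Link"] = f'<{canonical_url}>; rel="canonical"'
    return headers


# Hypothetical landing page URL, for illustration only.
print(pdf_response_headers(canonical_url="https://www.example.com/guides/report/"))
```

In practice you would normally pick one of the two: a noindex asks for the PDF to be dropped, while a canonical asks for its value to be consolidated into the HTML page.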

We’ve compiled this table outlining the pros and cons associated with different approaches to deal with the occurrence of PDFs in search results.

| Approach | PDF action | Landing page action | Pros | Cons |
|---|---|---|---|---|
| 1 | Apply noindex to the file via the X-Robots-Tag HTTP header | No action | This will prevent the PDF file appearing in search results once it is recrawled. | All value obtained from any links to the PDF is lost. There is no guarantee that your HTML landing page will perform as well as the PDF file. |
| 2 | Insert a canonical link in the HTTP header referencing the HTML landing page | No action | If the canonical is respected, the PDF will stop ranking and most of the value will be transferred to your HTML landing page, boosting its ranking potential. | It is unlikely that the canonical reference will be respected if the landing page and PDF serve content which significantly differs. |
| 3 | Insert a canonical link in the HTTP header referencing the HTML landing page | Significantly expand the landing page so that it contains the bulk of the copy from the PDF | The canonical is likely to be respected, the PDF will drop from the index and most of the value will be transferred to your HTML landing page, boosting its ranking potential. | The landing page may no longer serve its initial purpose as it contains large amounts of additional content. |
| 4 | 301 redirect or canonical to new HTML page | Create a new HTML page with the bulk of the PDF copy | The existing HTML landing page is conserved, so it can focus on existing goals (e.g. conversion). The new HTML page can incorporate CTAs and can link through to the landing page. The PDF will no longer rank. | Search users are likely to land on the newly created HTML page rather than the desired landing page, so the situation is the same (minus the PDF!). |
| 5 | Block access to the PDF repository via robots.txt | No action | None | Blocking a file in robots.txt does not prevent it from appearing in search results, but the PDF will no longer transfer value to any linked landing pages. |
| 6 | 301 redirect to HTML landing page | No action | The PDF no longer ranks. The value of inbound links to the PDF is transferred to the HTML landing page. | The PDF is inaccessible and well-performing relevant content is lost. It is usually relevance that makes PDFs outrank HTML landing pages, so rankings are unlikely to transfer. |
| 7 | 301 redirect to HTML landing page | No action | The PDF no longer ranks. The value of any inbound links pointing to the old PDF file is transferred to the HTML landing page. The PDF can still be accessed via on-site CTAs, behind lead capture forms etc. | Relevance obtained from valuable PDF content is lost. There is no guarantee that your HTML landing page will perform as well as the PDF file. |
| 8 | Update to include more CTAs/links back to the HTML landing page | No action | Low risk. Links within PDF files are crawled, so some additional value will be transferred to the HTML landing page. | The PDF will still rank as it contains the most relevant content. |
| 9 | No action | No action | No additional risk to current rankings. | Unlikely to be any change in behaviour. |
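Approaches 4, 6 and 7 rely on a 301 redirect from the PDF URL to an HTML page. The sketch below shows the idea in its simplest form: a lookup from retired PDF paths to their HTML replacements, returning the status code and Location header a server would send. The paths and the function name are hypothetical, and most sites would configure this in their web server or CMS rather than in application code.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping of retired PDF paths to their HTML replacements.
PDF_REDIRECTS: Dict[str, str] = {
    "/downloads/annual-report.pdf": "/insights/annual-report/",
}


def resolve_redirect(path: str) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) for a redirected PDF path, or None to serve normally."""
    target = PDF_REDIRECTS.get(path)
    if target is None:
        return None
    # 301 signals a permanent move, so link value can be consolidated on the target.
    return 301, {"Location": target}


print(resolve_redirect("/downloads/annual-report.pdf"))
```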

Should Google crawl and index PDF files?

If you have PDFs on your website, you need to decide whether you would like them to rank in organic search.  If not, these files should not be indexed in search results.

The best approach would be to create an accessible HTML page containing the content from the PDF document. This does result in duplicate content, so a canonical link can be inserted in the HTTP header of the PDF referencing the HTML page.  Alternatively, you could add an X-Robots-Tag: noindex header to the PDF’s HTTP response, although this will result in some equity loss.

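If you do not want the PDF to be crawled by search engine bots, you can add an exclusion rule for the PDF document (or subfolder) to the robots.txt file.  Bear in mind that this stops crawling rather than indexing, and a crawler that cannot fetch the file will not see any noindex or canonical header served with it.  You want to ensure that the landing page, and not the PDF, is ranking for relevant keywords.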
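Before deploying an exclusion rule it can help to check which URLs it actually blocks. The sketch below does this with Python’s standard-library robots.txt parser. The rule set, folder name and URLs are hypothetical. Note also that this parser matches path prefixes only, so wildcard patterns may not be evaluated the way Google evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block every crawler from a PDF subfolder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/pdf/
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Check a URL against ROBOTS_TXT using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable("https://www.example.com/downloads/pdf/report.pdf"))
print(is_crawlable("https://www.example.com/guides/report/"))
```

With these rules the PDF should be reported as blocked and the landing page as crawlable.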

One reason why a PDF may rank higher than the HTML page is that pages across the website link directly to the PDF.  Avoid linking to the PDF across the site, and instead link to the page the PDF is on. This will allow value to be passed to the landing page and help it rank organically for the keyword it targets.
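To find internal links that point straight at PDFs, you could scan your pages’ HTML for anchors whose target ends in .pdf. The following is a minimal sketch using only the standard library. The class and function names are hypothetical, and a full audit would normally use a site crawler rather than a script like this.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlparse


class PdfLinkFinder(HTMLParser):
    """Collect href values of <a> tags whose URL path ends in .pdf."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Compare the path only, so query strings such as ?v=2 are ignored.
        if urlparse(href).path.lower().endswith(".pdf"):
            self.pdf_links.append(href)


def find_pdf_links(html: str) -> List[str]:
    """Return every link in the given HTML that points directly at a PDF."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return finder.pdf_links
```

Each link returned is a candidate for repointing at the landing page that hosts the PDF.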

Advantages of PDF file format.

Despite the disadvantages associated with using PDFs to display content, there are several reasons why people might prefer this format:

  • Quick and easy to create
  • Avoids design limitations
  • Easy to download and allows for offline use

PDF is the preferred format for archiving and offline use. The web, however, is about searchability, linking and responsiveness, especially on mobile, and users prefer the lighter, interactive elements of HTML.

Although certain kinds of content work well in PDF format, it is vital that your important content remains as HTML pages.

Using HTML to Display Content.

There are many advantages to using HTML to display content, for example:

  1. It is search engine and mobile friendly.
  2. It allows for the inclusion of interactive content.
  3. It is easier to update.
  4. It is easier to integrate into the rest of your site architecture and to link to related pages.

Tracking PDF Performance.

There are tracking issues involved with PDFs. Although tracking clicks to PDF files both from your site and search engines is possible, tracking within a PDF is not possible.  This makes it much more difficult to fully understand how a visitor is progressing through your content.

Clicks through to PDF files can be tracked using Google Analytics Event tracking and/or the use of virtual pageviews.

You might also want to establish how your PDFs are performing in search results based on their rankings. This can be done in Google Search Console’s Performance report by clicking “+ New”, choosing Page, and entering ‘pdf’ as the filter text. The export should include data on the number of clicks, impressions, and rankings (average position).
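If you export the full Pages report instead of filtering in the interface, the PDF rows can be pulled out afterwards. The sketch below assumes a CSV export whose URL column is headed “Top pages”. That header, the file name and the function name are assumptions, so check them against your own export.

```python
import csv
from typing import Dict, Iterable, List


def pdf_rows(lines: Iterable[str], url_column: str = "Top pages") -> List[Dict[str, str]]:
    """Return the export rows whose URL contains '.pdf' (column name is an assumption)."""
    reader = csv.DictReader(lines)
    return [row for row in reader if ".pdf" in row[url_column].lower()]


# Example usage with a hypothetical export file:
# with open("Pages.csv", newline="", encoding="utf-8-sig") as export:
#     for row in pdf_rows(export):
#         print(row)
```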

Server log files or the Crawl Stats report in Google Search Console can also be useful for seeing how search bots are crawling your site, and therefore how crawl budget is being spent.
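As a simple example of log file analysis, the sketch below counts requests for PDF files from clients identifying themselves as Googlebot. It assumes the widely used “combined” access log format, which your server may not use, and the function name is hypothetical. User agent strings can also be spoofed, so treat the counts as indicative only.

```python
import re
from collections import Counter
from typing import Iterable

# Assumes the "combined" access log format; adjust the pattern for your server.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_pdf_hits(lines: Iterable[str]) -> Counter:
    """Count requests for .pdf paths whose user agent mentions Googlebot."""
    hits: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # Skip lines that do not look like combined-format entries.
        path = match.group("path").split("?", 1)[0]
        if path.lower().endswith(".pdf") and "Googlebot" in match.group("agent"):
            hits[path] += 1
    return hits
```

Comparing these counts with crawl activity on your HTML landing pages gives a rough view of how much crawling is being spent on PDFs.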

In conclusion, PDFs are not the best choice for SEO. This does not mean that they are bad for SEO, simply that they do not offer the same advantages as HTML pages in terms of site navigation, user control, measurement and accessibility.

It is therefore recommended that if PDFs are ranking higher than your desired HTML page, an HTML version of the PDF content should be created, providing improved flexibility, visibility and tracking benefits.

Found is a London-based multi-award-winning digital growth, SEO, PPC, Social and Digital PR agency that harnesses the efficiencies of data and technology and future-thinking to help clients grow their businesses online.