PDF documents offer a great advantage over other document types: They appear exactly the same on any device.
Once you have created the PDF file, every element (title, image, text) remains in the same position regardless of the device or PDF viewer used. This article shows you how to make the best use of PDFs in your SEO strategy.
For highly competitive keywords, PDFs rarely appear in the top 10 search results. However, Google technically makes no distinction between an HTML page and a PDF document. The search engine only focuses on presenting the user with the best search results.
Texts: Google can index PDFs in any language or character encoding as long as the document is not password protected or encrypted. Texts that are implemented as images are partially processed and "read" using OCR algorithms. You can find out if Google is able to read your PDF text without complications using a simple test: If you can copy & paste text from the PDF, Google will have no problem reading and understanding the text.
Images: Images on PDF files are not well suited for the classic Google image search. If you want users to find you using the images on the PDF file, you should use a classical HTML page.
Links: Like HTML documents, PDFs can contain links that pass on link power. This was recently confirmed by Gary Illyes:
Figure 1: Links in PDFs pass on the link power
Note: When using PDFs, always keep in mind that PDF visits are not recorded by JavaScript-based tracking tools such as Google Analytics, because a PDF that is opened directly cannot run the tracking snippet. Your PDF could therefore have many visitors whose traffic never shows up in your reports.
To identify potential and weaknesses, it is advisable to perform a log file analysis that evaluates the visits to non-HTML files. Log file analyses are also well suited for evaluating crawler activity based on the User Agent.
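As a rough illustration of such a log file analysis, the following Python sketch counts PDF requests per URL and separates Googlebot from all other traffic. It assumes the server writes the common "combined" access log format; the function name count_pdf_hits and the sample lines are made up for this example. Keep in mind that a User Agent string can be spoofed, so the crawler count is only an approximation.

```python
import re
from collections import Counter

# Matches request line, status code, and User Agent in a "combined" log line.
# Assumption: the server logs in the combined format; adjust for other formats.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_pdf_hits(lines):
    """Count PDF requests per path, split into Googlebot and other traffic."""
    crawler_hits = Counter()
    other_hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not look like access log entries
        path = match.group("path").split("?", 1)[0]  # drop the query string
        if not path.lower().endswith(".pdf"):
            continue
        if "Googlebot" in match.group("agent"):
            crawler_hits[path] += 1
        else:
            other_hits[path] += 1
    return crawler_hits, other_hits


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/Aug/2016:10:00:00 +0000] "GET /plans/network.pdf HTTP/1.1" '
    '200 51234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Aug/2016:10:05:00 +0000] "GET /plans/network.pdf HTTP/1.1" '
    '200 51234 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"',
]

crawler, visitors = count_pdf_hits(SAMPLE)
print("Googlebot:", dict(crawler))
print("Other:", dict(visitors))
```

To run it against a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.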
From a search engine point of view, PDFs are a double-edged sword. On the one hand, PDFs can be listed in the search results just like other document types. On the other hand, they do not offer the user any navigation or interaction elements.
It is therefore important to define the actual role played by PDFs in your SEO strategy. The most important question to ask yourself is: "Can a PDF meet the expectations of a search engine visitor?"
Making sure PDFs that do not serve as landing pages are not indexed
If an indexable PDF is not able to fully meet the information requirement of the user, you should make sure that the PDF file is not indexed by search engines.
The easiest way to exclude PDFs from the index is through the HTTP response header, because a PDF has no HTML head in which to place meta tags. You can either send an X-Robots-Tag header with the value noindex or send a canonical as a Link header with rel="canonical". Whereas noindex only tells the search engine not to index the content, the canonical can be used to refer to the HTML version of the PDF.
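The two header variants could look roughly like this. The Python sketch below only builds the header values; how they are attached to the response depends on your web server or framework. It assumes that noindex is sent as an X-Robots-Tag header and the canonical as a Link header with rel="canonical". The function name pdf_response_headers and the example URL are hypothetical.

```python
def pdf_response_headers(canonical_url=None):
    """Return HTTP headers for a PDF that should stay out of the index.

    With canonical_url: a Link header that refers to the HTML landing page.
    Without it: an X-Robots-Tag header that asks search engines not to index.
    """
    headers = {"Content-Type": "application/pdf"}
    if canonical_url:
        headers["Link"] = f'<{canonical_url}>; rel="canonical"'
    else:
        headers["X-Robots-Tag"] = "noindex"
    return headers


# Hypothetical landing page URL, for illustration only.
print(pdf_response_headers("https://www.example.com/whitepaper.html"))
print(pdf_response_headers())
```

Returning a plain dictionary keeps the sketch independent of any particular server; the same values can also be set directly in the web server configuration.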
Use case: What suits me best?
If you use noindex in the HTTP header for PDFs that have incoming links, you waste valuable link power: only the URLs that are linked from the PDF documents would benefit from it. Using a canonical is much more practical, especially for PDFs that have already attracted many backlinks. The canonical passes the link power on to the corresponding landing page. The PDF would not appear in the search engine index, and the corresponding landing page would be displayed in the search results instead.
Figure 2: Example of a landing page instead of a PDF
Don’ts:
Blocking PDFs in the robots.txt file – The PDFs can still end up in the index if they are linked from elsewhere, and the incoming link power is wasted.
PDF version of a page – Some CMSs automatically provide a PDF version of every HTML page. Using a canonical would solve the indexing problem in this case, but search engines still have to crawl the PDFs, which wastes valuable crawl resources.
Identify indexable PDFs
You can quickly identify indexable PDFs using OnPage.org Zoom. Simply go to "Indexability" → "What is indexable?", activate the "Indexable" filter (1), and then click on the "PDF" Mime type (2).
Figure 3: Display only the indexable PDFs
Once you have activated the filters, all PDFs found in the crawl are listed in the table below.
Figure 4: List of all indexable PDFs
A list of all PDFs that are already in the Google index can be viewed by combining the "filetype:pdf" and "site:domain.tld" search operators:
Figure 5: List of all PDFs that are already in the Google index
Identify the indexability of PDFs that are relevant for indexing
In some cases, providing PDFs for the Google index could offer an added value for your users. This is particularly useful if the PDFs contain specific information that is important to the user and the user has no need to interact with the website.
A good example includes public transport network plans such as the Munich metro network plan. All users want is to get quick information, download the PDF, and save it on their mobile devices, without interacting with the website.
Figure 6: Example of a PDF that is well suited as a landing page in the search engine index
Figure 7: Network plan of Munich in PDF format
Indexability of the document is the most important prerequisite for a PDF to appear in the search engine index.
Criteria for indexability:
HTTP status code is 200 OK
No noindex directive is set (for PDFs, this is sent in the X-Robots-Tag HTTP header)
If a canonical is used, it should not point to another URL
The document will not be indexed if any one of these criteria is not met.
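The three criteria can be expressed as a small check. The following Python sketch is a simplified illustration, not a reproduction of how any search engine or crawler evaluates a document. It assumes the directives arrive as X-Robots-Tag and Link response headers, and it only recognizes a canonical whose rel parameter directly follows the URL. The function name is_indexable and the example URLs are hypothetical.

```python
import re

# Simplified pattern: <url>; rel="canonical" with rel as the first parameter.
CANONICAL = re.compile(r'<([^>]+)>\s*;\s*rel="?canonical"?', re.IGNORECASE)


def is_indexable(url, status_code, headers):
    """Apply the three indexability criteria to one fetched PDF."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}

    # Criterion 1: the HTTP status code is 200 OK.
    if status_code != 200:
        return False

    # Criterion 2: no noindex directive in the X-Robots-Tag header.
    if "noindex" in normalized.get("x-robots-tag", "").lower():
        return False

    # Criterion 3: a canonical, if present, must not point to another URL.
    match = CANONICAL.search(normalized.get("link", ""))
    if match is not None and match.group(1) != url:
        return False

    return True


# Hypothetical URLs, for illustration only.
pdf_url = "https://www.example.com/network-plan.pdf"
print(is_indexable(pdf_url, 200, {}))
print(is_indexable(pdf_url, 200, {"X-Robots-Tag": "noindex"}))
```

A check like this can be run over the status codes and headers collected during a crawl to list the PDFs that fail one of the criteria.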
OnPage.org Zoom enables you to easily identify non-indexable PDFs. Simply go to "Indexability" → "What is indexable" and select "PDF". This then allows you to view a list of the non-indexable PDFs in the graph as well as the corresponding reasons (e.g., all PDFs that have a "noindex" meta robots tag).
Figure 8: Identify non-indexable URLs
Tip: Indexable PDFs should always contain a link to the corresponding landing page. This enables users to quickly navigate to the website.
Like HTML pages, PDFs can be listed in search engine results. However, not all PDF documents are well suited as landing pages. You should therefore decide what role PDFs should play in your SEO strategy and find a way to make the most of them. PDFs that are not suitable as landing pages but that have a lot of incoming link power should send a canonical in the HTTP header (a Link header with rel="canonical") that points to the corresponding landing page. As for PDFs that are relevant for indexing, you should ensure they meet all the criteria that are necessary for indexing.
Published on Aug 10, 2016 by Stephan Walcher