For years, companies have been paying to have their PDF documents converted to HTML so web users can easily find their information. Is this time-consuming and sometimes expensive effort really necessary?
Press releases, forms, whitepapers and case studies are often produced in PDF format and linked to from business websites. For years, search engine optimization professionals have recommended that their clients convert these documents to pure HTML as a way to make them easier to discover. That was good advice at one time, but those efforts are no longer needed if a business produces well-crafted PDF files.
Since 2001 Google has been quietly indexing PDFs that are linked from websites. Yahoo! and Bing have since followed suit.
The PDF format has some advantages over traditional HTML, and those advantages should not be sacrificed without a good reason. PDFs give authors very granular control over layout and the resulting printed output, and font embedding ensures that everyone viewing the document sees it the way the author intended.
Not All PDFs are Created Equal
For search engines to accurately index a PDF, it needs to include actual text. A PDF is really just a container for information. That information can be in the form of text, images, or a combination of the two.
Data Conversion Laboratory has created a useful White Paper about the various types of PDFs in common use.
Google on PDFs and Search
Here is a collection of the PDF questions that Google is asked most frequently and the official answers:
Q: Can Google index any type of PDF file?
A: Generally we can index textual content (written in any language) from PDF files that use various kinds of character encodings, provided they’re not password protected or encrypted. The general rule of the thumb is that if you can copy and paste the text from a PDF document into a standard text document, we should be able to index that text.
Q: What happens with the images in PDF files?
A: Currently the images are not indexed. In order for us to index your images, you should create HTML pages for them.
Q: How are links treated in PDF documents?
A: Generally links in PDF files are treated similarly to links in HTML: they can pass PageRank and other indexing signals, and we may follow them after we have crawled the PDF file. It’s currently not possible to “nofollow” links within a PDF document.
Q: How can I prevent my PDF files from appearing in search results; or if they already do, how can I remove them?
A: The simplest way to prevent PDF documents from appearing in search results is to add an X-Robots-Tag: noindex in the HTTP header used to serve the file. If they’re already indexed, they’ll drop out over time if you use the X-Robots-Tag with the noindex directive. For faster removals, you can use the URL removal tool in Google Webmaster Tools.
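To make the header advice in the answer above concrete, here is a minimal sketch using Python's built-in HTTP server. It is only an illustration, not Google's recommended setup: on a production site you would normally set this header in your web server's configuration instead. The names `robots_headers` and `NoIndexPdfHandler`, and the port 8000, are assumptions made for this example.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def robots_headers(path: str) -> dict[str, str]:
    """Return extra response headers for a requested URL path."""
    # Ignore any query string or fragment when checking the file extension.
    clean_path = path.split("?", 1)[0].split("#", 1)[0]
    if clean_path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPdfHandler(SimpleHTTPRequestHandler):
    """Serve files from the current directory, marking PDFs as noindex."""

    def end_headers(self):
        # Add the extra headers just before the header block is closed.
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()


if __name__ == "__main__":
    # Hypothetical local address, for trying the sketch out by hand.
    server = ThreadingHTTPServer(("127.0.0.1", 8000), NoIndexPdfHandler)
    server.serve_forever()
```

With this running, a request for any path ending in `.pdf` should come back with `X-Robots-Tag: noindex` among its response headers, while other files are served unchanged.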
Q: Can PDF files rank highly in the search results?
A: Sure! They’ll generally rank similarly to other webpages. For example, at the time of this post, [mortgage market review], [irs form 2011] or [paracetamol expert report] all return PDF documents that manage to rank highly in our search results, thanks to their content and the way they’re embedded and linked from other webpages.
Q: Is it considered duplicate content if I have a copy of my pages in both HTML and PDF?
A: Whenever possible, we recommend serving a single copy of your content. If this isn’t possible, make sure you indicate your preferred version by, for example, including the preferred URL in your Sitemap or by specifying the canonical version in the HTML version.
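As a rough illustration of the Sitemap option mentioned in the answer above, the following sketch builds a small XML Sitemap that lists only the preferred URL for each document. The function name `build_sitemap` and the example.com URLs are hypothetical; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Serialize the Sitemap namespace as the default (unprefixed) namespace.
ET.register_namespace("", SITEMAP_NS)


def build_sitemap(preferred_urls: list[str]) -> str:
    """Return Sitemap XML listing one <url><loc> entry per preferred URL."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in preferred_urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical site: the PDF is the preferred copy of the whitepaper,
    # so only the PDF URL is listed, not its HTML duplicate.
    print(build_sitemap([
        "https://www.example.com/",
        "https://www.example.com/docs/whitepaper.pdf",
    ]))
```

Listing only one of the two copies is the design choice here: the Sitemap then points at a single version of the content, which is the version you want search engines to prefer.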
Q: How can I influence the title shown in search results for my PDF document?
A: We use two main elements to determine the title shown: the title metadata within the file, and the anchor text of links pointing to the PDF file. To give our algorithms a strong signal about the proper title to use, we recommend updating both.
PDFs in Search Results
When a Google search returns a PDF in the results, it is clearly marked as a PDF.
As noted above, you can exert some control over how your PDFs appear in Google’s search results by crafting quality metadata for your PDFs before putting them on the web.
While Google only makes use of the metadata’s title information, you should take the time to fully populate the metadata properties to future-proof your documents.
If you have concerns about the content on your business or organization’s website you can contact Iris L. Hanney of Unlimited Priorities and talk about getting a website analysis report and some clear and pragmatic advice.