Saturday, January 01, 2011

Multiple HTML files to Single PDF with wkhtmltopdf

Very useful (to me). It took a good deal of searching to find the right little application to do what I needed—convert locally saved HTML pages to PDF and collate them into a single file.

It sounds so simple, but actually finding a free / Open Source app to do this was problematic. I hoped HTMLDOC would do the trick, and perhaps with a little more experience I could have achieved the desired results with it. In the meantime I discovered an app for Linux (also available for other operating systems) called wkhtmltopdf that did just what I needed it to do.

It's a command-line utility, very easy to use and quite effective. It doesn't make a mess of the pages when converting/collating them, and it lets you customize page headers & footers, among other things. I have been using it under Windows XP¹ (which is kind of a pain due to shell limitations compared to bash under Ubuntu, but whatever, as long as it works). Entering the command wkhtmltopdf.exe --extended-help gives you a nice list of options to explore.

My goal was (or seemed) simple enough: Create a single PDF file from the multiple web pages that I had saved locally via the Firefox add-on DownThemAll!

Why do this? The print-version pages of the (lengthy) material I wanted to read seem to be intentionally organized on the website to make them difficult to work with for off-line viewing. Lame, but not uncommon by any means, of course. Here's the method I used to solve my dilemma:
  • Made a list of the printable-version URLs. (Not as tedious as it sounds, and mostly automated once the structure of the URLs was determined.)

  • Saved the pages locally with DownThemAll!

  • Used wkhtmltopdf to put the files together using a command similar to the following:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" --footer-center [page] -s Letter articles.htm articles_001.htm articles_002.htm articles_003.htm articles_004.htm articles_005.htm articles_006.htm CoolRead.pdf
--footer-center [page] gives me page numbering at the bottom, and -s Letter sets the page size to 8.5" x 11" (other page-size options are available). Several other options are available as well. Can you simply create a list of the pages you want to collate instead of listing them on the CLI? I haven't noticed that as an option in --help or --extended-help, but it's probably there and easy enough to do. (Can you tell I have only recently started using wkhtmltopdf?)

I wanted some kind of cover and TOC-type page for the articles, and it just so happens that the main web page with the links to the on-line articles fit the bill well enough. wkhtmltopdf can work directly with on-line pages, so away we went:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter -O Portrait "http://supercoolwebsite.org/docs/?docID=6" Cover.pdf
Update 1/3/2011: I decided I didn't like the "cover sheet" generated by the command above. I ended up creating my own cover page and "table of contents" in OpenOffice Draw and exporting them to PDF. To make life simple I put the wkhtmltopdf command text into a batch file and ran it in the same folder containing the html files. That made it easy to modify the options and re-run until I got the (more) precise results I wanted.

Example:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -g --disable-smart-shrinking --footer-center [page] -s Letter -T 14mm -B 14mm -L 14mm -R 14mm --disable-external-links --disable-internal-links articles.htm articles_001.htm articles_002.htm articles_003.htm articles_004.htm articles_005.htm articles_006.htm CoolRead.pdf
The additional options shown apply the following formatting to the generated PDF file:
  1. -g = Generate in Grayscale
  2. --disable-smart-shrinking = Without this option, the font size of the generated PDF files was too small. This option provided a larger font-size and made the documents easier to read.
  3. --footer-center [page] = Add page numbers to the bottom-center of each page in the file.
  4. -s Letter = Letter sized pages.
  5. -T 14mm -B 14mm -L 14mm -R 14mm = Increase the page margins. The default put the text too close to the edges of the page.
  6. --disable-external-links & --disable-internal-links = These options do just what you would think: they remove the hyperlinks that are in the html files from the generated PDF file.
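The batch file I mentioned is nothing fancy; here's roughly the same wrapper written for bash instead of cmd.exe, as a sketch with placeholder names, done as a dry run that only prints the command it would execute:

```shell
#!/bin/sh
# Sketch of a wrapper script: keep the options in one place, tweak, re-run.
set -f   # turn off globbing so the literal [page] isn't expanded

WK=wkhtmltopdf   # on Windows this would be the full quoted path to wkhtmltopdf.exe
OPTS="-g --disable-smart-shrinking --footer-center [page] -s Letter \
-T 14mm -B 14mm -L 14mm -R 14mm --disable-external-links --disable-internal-links"
OUT=CoolRead.pdf

# Dry run: print the command instead of executing it.
echo "$WK" $OPTS articles.htm articles_001.htm articles_002.htm "$OUT"
```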
Time to put the Cover Page/TOC & the collated HTML-turned-PDF files together for the finished product.

Enter PDFTK Builder from the PortableApps collection.
  1. Add the Cover page/TOC & the collated articles PDF files to PDFTK Builder's Collate option
  2. Click Save As... to combine them.
  3. Done!
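For what it's worth, PDFTK Builder is a front end for the pdftk command-line tool, so the same collation can be scripted with pdftk's cat operation (a dry-run sketch; FinalRead.pdf is my own placeholder name):

```shell
# pdftk's cat operation concatenates PDFs in the order given.
cmd="pdftk Cover.pdf CoolRead.pdf cat output FinalRead.pdf"
echo "$cmd"   # dry run only; prints the command rather than running it
```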
Does it sound like a lot of work? It's really not, and I get (almost) exactly what I want. Should there be, or is there, an all-in-one Open Source application that does ALL of this in one shot? Probably, but I haven't found it yet. The most irritating aspect of this approach? HTML files with bad character entities in them. Tracking those weird characters down in the files and figuring out what to replace them with is a real pain (depending on how large your final file turns out to be).
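One trick that helps with the entity hunt: grep the saved pages for anything outside printable ASCII, and for raw numeric character references. A small self-contained demo (sample.htm here is a stand-in for one of the saved pages):

```shell
# Create a stand-in page containing one suspicious line.
printf 'clean line\ncaf\303\251 and &#8217; on this one\n' > sample.htm

# Lines containing bytes outside printable ASCII (space through tilde):
LC_ALL=C grep -n '[^ -~]' sample.htm

# Lines containing raw numeric character references like &#8217;
grep -n '&#[0-9][0-9]*;' sample.htm
```

grep -n gives the line numbers, which makes the cleanup in Notepad++ a lot quicker than eyeballing each file.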

At the end of the day, using Firefox + DownThemAll!, wkhtmltopdf, and PDFTK Builder (with a generous nod to Notepad++, a great text / code editor for working with the batch files, link lists, etc.) the job gets done.

There are several HTML articles, etc. that I would like to collate into single PDFs. This combination of free tools is a valuable solution for me.

¹ I was not able to use the version of wkhtmltopdf in the Ubuntu repositories, as it throws an error message about not being built against the correct version of Qt 4, or something like that. There are articles on how to fix this, but I haven't bothered with it yet.

7 comments:

Pradeep said...

Very good post. Thanks for sharing.
I am also exploring this engine to generate PDFs from multiple HTML files (at present I use HTMLDOC), which will be passed through a script from my HTML site. I was wondering if you know how to show page numbers and links on the index page. Using --book one can make a TOC and index, but it doesn't give page numbers or links on the index page.

Anonymous said...

Thanks for your explanation! Very good information.

The part "collecting all needed URL's" seems very time consuming to me.
Do we have another possibility to get all the URL's from a given domain or page?

chronicon said...

Collecting the URLs is a pain. I've been thinking about a script to do that, but it would have to be flexible from site to site.

Maybe use wget or HTTrack to get the URLs. Not sure. Haven't looked at this in a little while.

Anonymous said...

It's me again. Still two little problems:
a) How can I tell wkhtmltopdf to wait some seconds until data on a page are fully loaded?
b) How can I print special characters like © in the footer? (Actually, I get -® printed instead)

P.S. If you create such a "URL-Catcher", please let us know :-)

chronicon said...

I just downloaded the latest version for Windows (wkhtmltox-0.11.0_rc1-installer.exe) and I'll check to see if the options you're looking for are in it.

[Note: the Windows install puts it in C:\Program Files (x86)\wkhtmltopdf under Windows 7. Don't use the modify path option during install with Vista or Win7, it says it will break your path. You'll have to add it manually it seems.]

There's a manual page for version 0.10.0 and a Wiki (but the Wiki seems to be geared towards developers instead of users).

Regarding the copyright symbol, I know the version I was using had difficulties with some characters. Maybe that's all been fixed in this Release Candidate.

David said...

I tried to run the command C:\Program Files (x86)\wkhtmltopdf\bin\wkhtmltopdf.exe *.html *.pdf and I get the error: C:\Program is not recognized as an internal or external command...

It doesn't seem to like the spaces in the Program Files (x86) folder.