Saturday, January 01, 2011
Multiple HTML files to Single PDF with wkhtmltopdf
Very useful (to me). It took a good deal of searching to find the right little application to do what I needed—convert locally saved HTML pages to PDF and collate them into a single file.
It sounds so simple, but actually finding a free / Open Source app to do this was problematic. I hoped HTMLDOC would do the trick and perhaps with a little more experience, I could achieve the desired results. In the meantime I discovered an app for Linux (also available for other operating systems) called wkhtmltopdf that did just what I needed it to do.
It's a command line utility, very easy to use and quite effective. It doesn't make a mess of the pages when converting/collating them. It also allows you to customize page headers & footers, among other things. I have been using it under Windows XP [1] (which is kind of a pain due to shell limitations compared to bash under Ubuntu, but whatever, as long as it works). Entering the command wkhtmltopdf.exe --extended-help gives you a nice list of options to explore.
My goal was (or seemed) simple enough: Create a single PDF file from the multiple web pages that I had saved locally via the Firefox add-on DownThemAll!
Why do this? The print version pages of the (lengthy) material I wanted to read seem to be intentionally organized on the website in a way that makes them difficult to work with for off-line viewing. Lame, but not uncommon by any means, of course. Here's the method I used to solve my dilemma:
- Made a list of the printable version URLs. (Not as tedious as it sounds, and mostly automated once the structure of the URLs was determined.)
- Saved the pages locally with DownThemAll!
- Used wkhtmltopdf to put the files together using a command similar to the following:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" --footer-center [page] -s Letter articles.htm articles_001.htm articles_002.htm articles_003.htm articles_004.htm articles_005.htm articles_006.htm CoolRead.pdf
Here --footer-center [page] gives me page numbering at the bottom and -s Letter sets it to 8.5" x 11" page size (other page sizes are available). Several other options are also available. Can you simply create a list of the pages you want to collate instead of listing them on the CLI? I haven't noticed that as an option in --help or --extended-help, but it's probably there and easy enough to do. (Can you tell I have only recently started using wkhtmltopdf?)
I wanted some kind of cover and TOC-type page for the articles, and it just so happens that the main webpage with the links for the on-line articles fit the bill well enough. wkhtmltopdf can work directly with on-line pages, so away we went:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter -O Portrait "http://supercoolwebsite.org/docs/?docID=6" Cover.pdf
Update 1/3/2010: I decided I didn't like the "cover sheet" generated with the command above. I ended up creating my own cover page and "table of contents" in OpenOffice Draw and exporting it to PDF. To make life simple I put the wkhtmltopdf command text into a batch file and ran it in the same folder containing the html files. Easy to modify the options & run until I got the (more) precise results I wanted.
Example:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -g --disable-smart-shrinking --footer-center [page] -s Letter -T 14mm -B 14mm -L 14mm -R 14mm --disable-external-links --disable-internal-links articles.htm articles_001.htm articles_002.htm articles_003.htm articles_004.htm articles_005.htm articles_006.htm CoolRead.pdf
The additional options shown apply the following formatting to the generated PDF file:
- -g = Generate in Grayscale
- --disable-smart-shrinking = Without this option, the font size of the generated PDF files was too small. This option provided a larger font-size and made the documents easier to read.
- --footer-center [page] = Add page numbers to the bottom-center of each page in the file.
- -s Letter = Letter sized pages.
- -T 14mm -B 14mm -L 14mm -R 14mm = Increase the page margins. The default put the text too close to the edges of the page.
- --disable-external-links & --disable-internal-links = These options do just what you would think: they remove the hyperlinks that are in the html files from the generated PDF file.
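For anyone who would rather keep the page names in a text file than type them on the command line, here is a minimal Python sketch that assembles the same command. It is not a feature of wkhtmltopdf itself, just a wrapper around it. The list file name (pages.txt), the helper names, and the keyword arguments are all made up for illustration; the install path and the options are the ones used above.

```python
from pathlib import Path

# Install path copied from the commands in this post; adjust for your machine.
WKHTMLTOPDF = r"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe"


def read_page_list(list_file):
    """Return page names from list_file (one per line), skipping blanks and # comments."""
    pages = []
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            pages.append(line)
    return pages


def formatting_options(margin="14mm", grayscale=True, keep_links=False):
    """Build the option list described above (hypothetical helper)."""
    opts = []
    if grayscale:
        opts.append("-g")  # generate in grayscale
    # Larger text, page numbers bottom-center, Letter-sized pages.
    opts += ["--disable-smart-shrinking", "--footer-center", "[page]", "-s", "Letter"]
    # Same margin on top, bottom, left and right.
    for flag in ("-T", "-B", "-L", "-R"):
        opts += [flag, margin]
    if not keep_links:
        opts += ["--disable-external-links", "--disable-internal-links"]
    return opts


def build_command(pages, output_pdf, options):
    """Options go first, then the input pages in order, then the output file."""
    if not pages:
        raise ValueError("at least one input page is required")
    return [WKHTMLTOPDF, *options, *pages, output_pdf]
```

To actually run it, something like subprocess.run(build_command(read_page_list("pages.txt"), "CoolRead.pdf", formatting_options()), check=True) from the folder containing the html files should do; passing the arguments as a list avoids the quoting headaches of the Windows shell.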
Enter PDFTK Builder from the PortableApps collection.
- Add the Cover page/TOC & the collated articles PDF files to PDFTK Builder's Collate option
- Click Save As... to combine them.
- Done!
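PDFTK Builder is a graphical front end for the pdftk command-line tool, so the same collate step can probably be scripted as well. The sketch below only builds the pdftk "cat" command; it assumes pdftk is installed and on the PATH, and the output name Combined.pdf and the helper name are made up for illustration.

```python
def pdftk_cat_command(inputs, output_pdf, pdftk="pdftk"):
    """Build a pdftk command that concatenates the input PDFs in the order given."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    # pdftk syntax: pdftk <inputs...> cat output <output file>
    return [pdftk, *inputs, "cat", "output", output_pdf]


# Cover page first, then the collated articles.
command = pdftk_cat_command(["Cover.pdf", "CoolRead.pdf"], "Combined.pdf")
```

Handing the resulting list to subprocess.run(command, check=True) would run it, assuming both PDF files are in the current folder.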
At the end of the day, using Firefox + DownThemAll!, wkhtmltopdf, and PDFTK Builder (with a generous nod to Notepad++, a great text / code editor for working with the batch files, link lists, etc.), the job gets done.
There are several HTML articles, etc. that I would like to collate into single PDFs. This combination of free tools is a valuable solution for me.
[1] I was not able to use the version of wkhtmltopdf in the Ubuntu repositories, as it throws an error message about not being built with the correct version of Qt4 or something like that. There are articles on how to fix this, but I haven't bothered with it yet.
Labels: HTML, PDF, PDFTK Builder, software, wkhtmltopdf
Comments:
Very good post. Thanks for sharing.
I am also exploring this engine to generate a PDF from multiple HTML files (at present I use HTMLDOC), which will be passed through a script from my HTML site. I was wondering if you know how to show page numbers and links on the index page. Using --book one can make a TOC and index, but it doesn't give page numbers and links on the index page.










