It sounds so simple, but actually finding a free / Open Source app to do this was problematic. I hoped HTMLDOC would do the trick, and perhaps with a little more experience I could get the results I wanted from it. In the meantime I discovered an app for Linux (also available for other operating systems) called wkhtmltopdf that did just what I needed it to do.
It's a command line utility, very easy to use and quite effective. It doesn't make a mess of the pages when converting/collating them. It also allows you to customize page headers & footers, among other things. I have been using it under Windows XP (see footnote 1), which is kind of a pain due to shell limitations compared to bash under Ubuntu, but whatever, as long as it works. Entering the command wkhtmltopdf.exe --extended-help gives you a nice list of options to explore.
My goal was (or seemed) simple enough: Create a single PDF file from the multiple web pages that I had saved locally via the Firefox add-on DownThemAll!
Why do this? The print-version pages of the (lengthy) material I wanted to read seem to be intentionally organized on the website to make them difficult to work with for off-line viewing. Lame, but not uncommon by any means of course. Here's the method I used to solve my dilemma:
- Made a list of the printable-version URLs. (Not as tedious as it sounds, and mostly automated once the structure of the URLs was determined.)
- Saved the pages locally with DownThemAll!
- Used wkhtmltopdf to put the files together using a command similar to the following:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" --footer-center [page] -s Letter articles.htm articles_001.htm articles_002.htm articles_003.htm articles_004.htm articles_005.htm articles_006.htm CoolRead.pdf
The --footer-center [page] option gives me page numbering at the bottom of each page, and -s Letter sets the page size to 8.5" x 11" (other page sizes are available). Several other options are also available. Can you simply create a list of the pages you want to collate instead of listing them on the command line? I haven't noticed that as an option in --help or --extended-help, but it's probably there and easy enough to do. (Can you tell I have only recently started using wkhtmltopdf?)
I wanted some kind of cover and TOC-type page for the articles, and it just so happens that the main web page with the links to the on-line articles fit the bill well enough. wkhtmltopdf can work directly with on-line pages, so away we went:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter -O Portrait "http://supercoolwebsite.org/docs/?docID=6" Cover.pdf
Update 1/3/2010: I decided I didn't like the "cover sheet" generated with the command above. I ended up creating my own cover page and "table of contents" in OpenOffice Draw and exporting it to PDF.
To make life simple, I put the wkhtmltopdf command text into a batch file and ran it in the same folder containing the html files. That made it easy to modify the options & re-run until I got the (more) precise results I wanted.
Example:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -g --disable-smart-shrinking --footer-center [page] -s Letter -T 14mm -B 14mm -L 14mm -R 14mm --disable-external-links --disable-internal-links articles.htm articles_001.htm articles_002.htm articles_003.htm articles_004.htm articles_005.htm articles_006.htm CoolRead.pdf
The additional options shown apply the following formatting to the generated PDF file:
- -g = Generate in Grayscale
- --disable-smart-shrinking = Without this option, the font size of the generated PDF files was too small. This option provided a larger font-size and made the documents easier to read.
- --footer-center [page] = Add page numbers to the bottom-center of each page in the file.
- -s Letter = Letter sized pages.
- -T 14mm -B 14mm -L 14mm -R 14mm = Increase the page margins. The default put the text too close to the edges of the page.
- --disable-external-links & --disable-internal-links = These options do just what you would think: they remove the hyperlinks that are in the html files from the generated PDF file.
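For anyone who would rather keep the page names in a text file than type them all on the command line, here is a minimal Python sketch of one way to do it. It only assembles the same options listed above; the file name pages.txt, the function names, and the comment-line convention are assumptions made up for this example, and the install path is simply the one used in this post.

```python
from pathlib import Path

# Install path taken from the commands in this post; adjust for your machine.
WKHTMLTOPDF = r"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe"


def read_page_list(list_file):
    """Return the page names in a text file, one per line, in file order.

    Blank lines and lines starting with '#' are skipped (a convention
    invented for this sketch, not something wkhtmltopdf knows about).
    """
    lines = Path(list_file).read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]


def build_command(pages, output, margin="14mm"):
    """Build the argument list: options first, then inputs, then the output PDF."""
    options = [
        "-g",                          # grayscale
        "--disable-smart-shrinking",   # keeps the font from coming out too small
        "--footer-center", "[page]",   # page number at the bottom center
        "-s", "Letter",                # Letter-sized pages
        "-T", margin, "-B", margin, "-L", margin, "-R", margin,
        "--disable-external-links",
        "--disable-internal-links",
    ]
    return [WKHTMLTOPDF, *options, *pages, output]


# Hypothetical usage, assuming pages.txt sits next to the saved html files:
#   import subprocess
#   pages = read_page_list("pages.txt")
#   subprocess.run(build_command(pages, "CoolRead.pdf"), check=True)
```

Passing the command as a list means Python takes care of quoting the path with the space in it, so no extra quote marks are needed.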
That left me with two PDF files to combine: my cover page/TOC and the collated articles. Enter PDFTK Builder from the PortableApps collection.
- Add the Cover page/TOC & the collated articles PDF files to PDFTK Builder's Collate option
- Click Save As... to combine them.
- Done!
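The same merge can probably be scripted. PDFTK Builder is, as far as I know, a graphical front end for the pdftk command line tool, and pdftk joins files with its cat operation. The sketch below is a minimal Python example that assumes pdftk is installed and on the PATH; the function name and the file name Final.pdf are made up for illustration.

```python
def build_pdftk_merge(inputs, output):
    """Build a pdftk command that concatenates the input PDFs in the given order."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    # pdftk syntax: pdftk <input files> cat output <output file>
    return ["pdftk", *inputs, "cat", "output", output]


# Hypothetical usage, assuming pdftk is on the PATH:
#   import subprocess
#   subprocess.run(build_pdftk_merge(["Cover.pdf", "CoolRead.pdf"], "Final.pdf"),
#                  check=True)
```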
At the end of the day, using Firefox + DownThemAll!, wkhtmltopdf, and PDFTK Builder (with a generous nod to Notepad++, a great text / code editor for working with the batch files, link lists, etc.), the job gets done.
There are several HTML articles, etc. that I would like to collate into single PDFs. This combination of free tools is a valuable solution for me.
1 I was not able to use the version of wkhtmltopdf in the Ubuntu repositories, as it throws an error message about not being built with the correct version of Qt4 or something like that. There are articles on how to fix this but I haven't bothered with it yet.
11 comments:
Very good post. Thanks for sharing.
I am also exploring this engine to generate PDFs from multiple HTML files (at present I use HTMLDOC), which will be passed through a script from my HTML site. I was wondering if you know how to show page numbers and links on the index page. Using --book one can make a TOC and index, but it doesn't give page numbers and links on the index page.
Thanks for your explanation! Very good information.
The part "collecting all needed URLs" seems very time consuming to me.
Is there another way to get all the URLs from a given domain or page?
Collecting the URLs is a pain. I've been thinking about a script to do that, but it would have to be flexible from site to site.
Maybe use WGET or HTTrack to get urls. Not sure. Haven't looked at this in a little while.
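A rough idea of what such a URL catcher could look like, as a minimal Python sketch using only the standard library. It pulls the link targets out of an index page that has already been downloaded and keeps the ones containing a given piece of text. The class and function names, the example.org address, and the "print/" filter are all made up for illustration; every site will need its own filter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url, must_contain=""):
    """Return unique absolute URLs from html that contain must_contain, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    seen = set()
    result = []
    for url in parser.links:
        if must_contain in url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical usage with a saved index page:
#   html = open("index.htm", encoding="utf-8").read()
#   for url in collect_links(html, "http://example.org/docs/index.htm", "print/"):
#       print(url)
```

The printed list could then be pasted into DownThemAll! or saved as a link list for later.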
It's me again. Still two little problems:
a) How can I tell wkhtmltopdf to wait some seconds until data on a page are fully loaded?
b) How can I print special characters like © in the footer? (Actually, I get -® printed instead)
P.S. If you create such an "URL-Catcher", please let us know .-)
I just downloaded the latest version for Windows (wkhtmltox-0.11.0_rc1-installer.exe) and I'll check to see if the options you're looking for are in it.
[Note: the Windows install puts it in C:\Program Files (x86)\wkhtmltopdf under Windows 7. Don't use the modify path option during install with Vista or Win7, it says it will break your path. You'll have to add it manually it seems.]
There's a manual page for version 0.10.0 and a Wiki (but the Wiki seems to be geared towards developers instead of users).
Regarding the copyright symbol, I know the version I was using had difficulties with some characters. Maybe that's all been fixed in this Release Candidate.
Thanks for your explanation! Very good information.
I tried to run the command C:\Program Files (x86)\wkhtmltopdf\bin\wkhtmltopdf.exe *.html *.pdf and I get the error: C:\Program is not recognized as an internal or external command...
It doesn't seem to like the spaces in the Program Files (x86) folder.
You need to go to the directory "C:\Program Files (x86)\wkhtmltopdf\bin" and then execute wkhtmltopdf.exe "your args here".
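Two things seem to be going on in the command above: the path contains spaces and is not wrapped in quotes, and the Windows shell does not expand *.html into a list of files by itself. One way around both is to drive wkhtmltopdf from a small script. The following is a minimal Python sketch, assuming the install path from the comment above and one PDF per HTML file; the function names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Install path from the comment above; adjust for your machine.
WKHTMLTOPDF = r"C:\Program Files (x86)\wkhtmltopdf\bin\wkhtmltopdf.exe"


def command_for(html_file):
    """Build the command that converts one HTML file to a PDF of the same name."""
    pdf_file = str(Path(html_file).with_suffix(".pdf"))
    return [WKHTMLTOPDF, str(html_file), pdf_file]


def preview(command):
    """Show the command roughly as it would be typed at a Windows prompt, quotes included."""
    return subprocess.list2cmdline(command)


# Hypothetical usage: convert every .html file in the current folder.
#   for html_file in sorted(Path(".").glob("*.html")):
#       subprocess.run(command_for(html_file), check=True)
```

Because the command is passed as a list, the path with spaces stays together as one argument. The preview helper shows the quoted form, which is also what you would type by hand at the prompt.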
Hi! Thank you for your post! It saved me a lot of time. I am not a programmer, so I use Excel VBA, as I use it for other stuff. Today I used it to generate 2218 pdf files from a list of 13308 html files. 5 minutes coding and done!