
Thursday, January 10, 2013

Download and Install pdf2htmlEX


Recently, I wanted to convert a large PDF document to HTML so that I could extract tables that spanned several pages. The tool I found for the job is pdf2htmlEX, available at http://coolwanglu.github.com/pdf2htmlEX/.




The website points to GitHub, which lists the commands needed to download and build it. To download the source, open a terminal window and enter the following command.

> git clone --depth 1 git://github.com/coolwanglu/pdf2htmlEX.git


Well, it turns out I did not have Git installed, so I downloaded and installed it first.

Once Git was installed, I re-issued the previous command, and this time the source was downloaded correctly.
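Since a missing Git stopped me at the very first step, it may be worth checking for the command line tools up front. Here is a minimal sketch; the function name missing_tools and the list of tools are my own choices for illustration, based on the commands used in this post.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The tool list is an assumption: these are the commands this post calls directly.
missing=$(missing_tools git cmake make tar svn)
if [ -n "$missing" ]; then
    echo "Install these first:" $missing
else
    echo "All build tools found."
fi
```

Anything it reports can usually be installed with apt-get before starting the build.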

Next we cd into the newly created directory and issue commands to build the software.

> cd pdf2htmlEX
> cmake . && make && sudo make install


It turned out that several dependencies were missing. One of the major ones was poppler: Ubuntu shipped an older version, and a version higher than 0.20.0 was needed. I downloaded a newer version from http://poppler.freedesktop.org/ and built it on the local machine.
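To find out whether the poppler already on the machine is new enough, a version comparison along these lines should work. This is only a sketch: the helper name version_at_least is made up for illustration, it relies on GNU sort -V, and treating 0.20.0 itself as acceptable is my assumption.

```shell
# Succeed when version $1 is at least version $2 (uses GNU sort -V ordering).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: 0.20.0 is used as the minimum, following the requirement above.
required=0.20.0
if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists poppler; then
    installed=$(pkg-config --modversion poppler)
    if version_at_least "$installed" "$required"; then
        echo "poppler $installed is new enough"
    else
        echo "poppler $installed is older than $required; build it from source"
    fi
else
    echo "poppler is not visible to pkg-config"
fi
```

If pkg-config cannot see poppler at all, the development package is probably not installed, and building from source is the way to go anyway.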

The latest version of poppler can be downloaded from http://poppler.freedesktop.org/releases.html.



Once downloaded, we need to extract and build the software.

> sudo apt-get install libopenjpeg-dev
> tar -xvf poppler-0.21.4.tar.gz
> cd poppler-0.21.4/
> ./configure --enable-xpdf-headers
> make
> sudo make install

libpoppler was installed successfully!

Note: The option to enable xpdf headers in the configure statement is very important. Without it, poppler will compile, but pdf2htmlEX may not compile.

Next, we need to install the latest version of libfontforge. It has certain dependencies that can be installed by issuing the following command in the terminal.

> sudo apt-get update && sudo apt-get install libpng12-dev zlibc zlib1g-dev libtiff-dev libungif4-dev libjpeg-dev libxml2-dev libuninameslist-dev xorg-dev subversion cvs gettext git libpango1.0-dev libcairo2-dev python-dev

Next we need to download the source code and install it. For this, we make a src folder and cd into it using the following commands.

> mkdir src
> cd src

We then download the source code from the website (https://github.com/fontforge/fontforge/downloads).


Once downloaded, we can issue the following command to decompress the file.

> bunzip2 fontforge_full-20120731-b.tar.bz2


The resulting tar archive can be extracted by issuing the following command.

> tar -xvf fontforge_full-20120731-b.tar
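The decompress and extract steps can also be done in one go, since GNU tar works out the compression by itself when reading an archive. A small sketch, assuming GNU tar (and bzip2 for .bz2 files) is installed; the function name extract_tarball is just for illustration.

```shell
# Extract a (possibly compressed) tar archive into a destination directory.
# Usage: extract_tarball ARCHIVE [DEST]   (DEST defaults to the current directory)
extract_tarball() {
    archive=$1
    dest=${2:-.}
    # GNU tar detects gzip/bzip2/xz compression automatically when extracting.
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}

# Example, using the file name from this post:
# extract_tarball fontforge_full-20120731-b.tar.bz2
```

This leaves the original .tar.bz2 in place, whereas bunzip2 replaces it with the .tar file.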

Next we need to download two other modules, freetype and libspiro.

> git clone git://git.sv.gnu.org/freetype/freetype2.git
> svn co http://libspiro.svn.sourceforge.net/svnroot/libspiro/

Now, we need to build all these together.

> cd ./libspiro
> ./configure
> make

Followed by
> sudo make install


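Next we need to install fontforge. Since we are still inside the libspiro directory, we first step back up into the fontforge source directory.

> cd ../fontforge-20120731-b/
> ./autogen.sh && ./configure && make && sudo make install && sudo ldconfig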


Then we cd into the directory and configure and make the program by issuing the following commands.

> ./configure
> make
> sudo make install

Finally, the software is built and installed.



Now, we can run the pdf2htmlEX command as follows (note the capital EX, since command names are case-sensitive):

> pdf2htmlEX

In fact, executing the following lists all the options available with the command.

> pdf2htmlEX --help
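For an actual conversion, the PDF file is passed as the argument. Here is a small wrapper as a sketch; the function name convert_pdf and the file name in the example are made up, and it only covers the basic invocation with no options.

```shell
# Convert one PDF with pdf2htmlEX, failing early with a clear message.
convert_pdf() {
    input=$1
    if ! command -v pdf2htmlEX >/dev/null 2>&1; then
        echo "pdf2htmlEX not found on PATH" >&2
        return 127
    fi
    if [ ! -f "$input" ]; then
        echo "no such file: $input" >&2
        return 1
    fi
    # Basic invocation: the input PDF is the only argument.
    pdf2htmlEX "$input"
}

# Example (hypothetical file name):
# convert_pdf report.pdf
```

The checks are there so that a typo in the file name or a failed install shows up as a readable message.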


There you go folks... a simple tool to convert PDF documents to HTML.

2 comments:

Anonymous said...

There is actually an Ubuntu PPA available.
Also in your title, 'EX' is missing :)

awachs said...

Thanks for the pointer on the ppa. I have also fixed the title.