Recently, I wanted to convert a large PDF document to HTML so that I could extract tables from it that spanned several pages. The tool that helped me do this is pdf2htmlEX, available at http://coolwanglu.github.com/pdf2htmlEX/.
The website points to GitHub, which lists the commands needed to get the software. To download the source, open a terminal window and enter the following command.
> git clone --depth 1 git://github.com/coolwanglu/pdf2htmlEX.git
Well, it turns out I did not have Git installed, so I downloaded and installed it first.
Once Git was installed, I re-issued the previous command, and this time the source was downloaded correctly.
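A quick check up front can save a failed command like the one above. The following is a minimal sketch, not part of the original steps: the helper name missing_tools is my own, and the list of tools (git, cmake, make) is simply taken from the commands used in this post.

```shell
#!/bin/sh
# Print the name of every requested tool that cannot be found on PATH.
# missing_tools is a hypothetical helper name, not part of any package.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools needed by the clone and build commands in this post.
missing=$(missing_tools git cmake make)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "git, cmake and make are all available"
fi
```

Anything it reports as missing can then be installed with apt-get before cloning.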
Next, we cd into the newly created directory and issue the commands to build the software.
> cd pdf2htmlEX
> cmake . && make && sudo make install
It turns out several dependencies were missing. One of the major ones was poppler: Ubuntu shipped an older version, and a version higher than 0.20.0 was needed. I downloaded a newer version from http://poppler.freedesktop.org/ and built it on the local machine.
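To see whether the poppler already on the system is new enough before building from source, a version comparison along these lines may help. This is a sketch under a few assumptions: it relies on GNU sort's -V option, it asks pkg-config for the poppler version (which only works if the poppler development files are installed), and it treats 0.20.0 as the minimum acceptable version. The function name version_at_least is my own.

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2.
# Uses GNU sort -V (version sort) so that 0.9.0 sorts before 0.20.0.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

required="0.20.0"   # threshold mentioned in the text above
# pkg-config prints the version of the poppler development files, if present.
installed=$(pkg-config --modversion poppler 2>/dev/null || true)

if [ -z "$installed" ]; then
    echo "poppler development files not found"
elif version_at_least "$installed" "$required"; then
    echo "poppler $installed is new enough"
else
    echo "poppler $installed is older than $required"
fi
```

A plain string comparison would get cases such as 0.9.0 versus 0.20.0 wrong, which is why the sketch leans on version sort.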
The latest version of the poppler software can be found at http://poppler.freedesktop.org/releases.html.
Once downloaded, we need to extract and build the software.
> sudo apt-get install libopenjpeg-dev
> tar -xvf poppler-0.21.4.tar.gz
> cd poppler-0.21.4/
> ./configure --enable-xpdf-headers
> make
> sudo make install
libpoppler was installed successfully!
Note: The option to enable xpdf headers
in the configure statement is very important. Without it, poppler
will compile, but pdf2htmlEX may not compile.
Next, we need to install the latest version of libfontforge. It has certain dependencies, which can be installed by issuing the following command in the terminal.
> sudo apt-get update; sudo apt-get install libpng12-dev zlibc zlib1g-dev libtiff-dev libungif4-dev libjpeg-dev libxml2-dev libuninameslist-dev xorg-dev subversion cvs gettext git libpango1.0-dev libcairo2-dev python-dev;
Next, we need to download the source code and install it. For this, we make a src folder and cd into it using the following commands.
> mkdir src
> cd src
We then download the source code from the website (https://github.com/fontforge/fontforge/downloads).
Once downloaded, we can issue the following command to decompress the file.
> bunzip2 fontforge_full-20120731-b.tar.bz2
The resulting tar archive can be extracted by issuing the following command.
> tar -xvf fontforge_full-20120731-b.tar
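The two steps above can also be done in one go, since tar can run the archive through bzip2 itself with the -j flag. Here is a small sketch; the function name extract_tar_bz2 and the optional destination argument are my own additions, and it assumes bzip2 is installed.

```shell
#!/bin/sh
# Extract a .tar.bz2 archive into a destination directory (default: current).
# tar's -j flag filters the archive through bzip2, so no separate
# bunzip2 step is needed, and the original .bz2 file is left in place.
extract_tar_bz2() {
    archive=$1
    dest=${2:-.}
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xjf "$archive" -C "$dest"
}

# Example call with the file name used in this post; skipped if it is absent.
if [ -f fontforge_full-20120731-b.tar.bz2 ]; then
    extract_tar_bz2 fontforge_full-20120731-b.tar.bz2
fi
```

Unlike bunzip2, this keeps the compressed download around, which is handy if the build has to be restarted from scratch.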
Next, we need to download two other modules, freetype and spiro.
> git clone git://git.sv.gnu.org/freetype/freetype2.git;
> svn co http://libspiro.svn.sourceforge.net/svnroot/libspiro/;
Now, we need to build all these together.
> cd ./libspiro; ./configure; make;
Followed by
> sudo make install
Next, we need to install fontforge.
> cd ./fontforge-20120731-b/; ./autogen.sh; ./configure; make; sudo make install; sudo ldconfig;
Then we cd into the directory and configure and make the program by issuing the following commands.
> ./configure
> make
> sudo make install
Finally, the software is built and installed.
Now, we can run the pdf2htmlEX command as follows:
> pdf2htmlEX
In fact, executing the following explains all the options available with the command.
> pdf2htmlEX --help
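For an actual conversion, the command takes the input PDF and, optionally, an output HTML file name. The wrapper below is a minimal sketch that fails early with a clear message. It assumes the installed binary is named pdf2htmlEX, matching the project name; the function name convert_pdf is my own.

```shell
#!/bin/sh
# Convert one PDF with pdf2htmlEX, failing early with a clear message.
# Usage: convert_pdf input.pdf [output.html]
# convert_pdf is a hypothetical helper, not something the tool provides.
convert_pdf() {
    input=$1
    output=${2:-}
    if [ ! -f "$input" ]; then
        echo "input file not found: $input" >&2
        return 1
    fi
    if ! command -v pdf2htmlEX >/dev/null 2>&1; then
        echo "pdf2htmlEX is not on PATH" >&2
        return 2
    fi
    if [ -n "$output" ]; then
        pdf2htmlEX "$input" "$output"
    else
        # With no output name, pdf2htmlEX picks one based on the input file.
        pdf2htmlEX "$input"
    fi
}
```

With a file called report.pdf in the current directory (a made-up name for illustration), running convert_pdf report.pdf report.html would write the HTML next to it.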
There you go folks... a simple tool to convert PDF documents to HTML.
2 comments:
There is actually an Ubuntu PPA available.
Also in your title, 'EX' is missing :)
Thanks for the pointer on the ppa. I have also fixed the title.