Problem: A web site hosts more than 500 PDF documents, and its domain name has changed. Many of the documents contain links to the old website. These links need to be converted to the new domain because the old domain is deprecated and will soon be unresolvable. Since a PDF is effectively a binary file, a mass change is difficult.
Solution: I was able to find a partial solution. After some testing I found that the visible rendering of a link in a PDF document uses some kind of graphic or PostScript-like language (i.e., not text). However, the actual link target that is passed to the browser when the user clicks the link is stored as text, which is editable (assuming the PDF is uncompressed).
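To make the point about the link target concrete, here is a minimal sketch. The annotation fragment is a hypothetical example of how a URI action typically appears in an uncompressed PDF, not a line taken from any of the real files, and `rewrite_domain` is a name made up for illustration.

```shell
# Hypothetical fragment of a link annotation as it might appear in an
# uncompressed PDF; the target URL is stored as a plain-text string.
fragment='<< /Type /Annot /Subtype /Link /A << /S /URI /URI (http://www.olddomainname.com/docs/a.html) >> >>'

# rewrite_domain: illustrative helper that swaps the old domain for the
# new one in whatever arrives on standard input.
rewrite_domain() {
    sed 's/olddomainname\.com/newdomainname.com/g'
}

printf '%s\n' "$fragment" | rewrite_domain
```

Because the URL is ordinary text inside the parentheses, a plain substitution is enough once the file has been uncompressed.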
I found an open source PDF Toolkit and installed it on a CentOS Linux machine:
poppler-utils-0.5.4-4.4.el5_5.14
poppler-0.5.4-4.4.el5_5.14.
I wrote a script (cnvt.ksh) that takes a list of files and, for each one, uncompresses the PDF, updates the link, and then recompresses the PDF:
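#!/bin/ksh
# Usage: cnvt.ksh LISTFILE   (LISTFILE holds one PDF path per line)
IFS="#"    # keep spaces in file names intact when reading the list
lst=$1
while read -r fn; do
    echo "working on: $fn"
    rm -f /tmp/a.pdf
    # Uncompress so the link targets become editable text
    pdftk "$fn" output /tmp/a.pdf uncompress
    # Replace the old domain with the new one and write the file back
    (echo '1,$s/olddomainname\.com/newdomainname.com/g'; echo 'w') | ed /tmp/a.pdf
    # Recompress over the original file
    pdftk /tmp/a.pdf output "$fn" compress
done < "$lst"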
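A slightly more defensive version of the loop body could look like the sketch below. It is not the script I ran: it assumes `mktemp` is available, it swaps `sed` in for `ed`, and `convert_one` is a hypothetical name. The idea is simply that the original PDF is only overwritten when every step has succeeded.

```shell
# convert_one FILE: rewrite old-domain links in one PDF, leaving FILE
# untouched if any step fails.
# Hypothetical helper; assumes pdftk and mktemp are installed.
convert_one() {
    src=$1
    work=$(mktemp -d /tmp/cnvt.XXXXXX) || return 1
    rc=1
    # LC_ALL=C asks sed to treat the file as raw bytes, not multibyte text.
    if pdftk "$src" output "$work/plain.pdf" uncompress &&
       LC_ALL=C sed 's/olddomainname\.com/newdomainname.com/g' \
           "$work/plain.pdf" > "$work/edited.pdf" &&
       pdftk "$work/edited.pdf" output "$work/final.pdf" compress
    then
        # Copy over the original only after every step succeeded.
        cp "$work/final.pdf" "$src" && rc=0
    else
        echo "conversion failed, left unchanged: $src" >&2
    fi
    rm -rf "$work"
    return $rc
}
```

It would be called once per line of the list, for example `while read -r fn; do convert_one "$fn"; done < "$lst"`.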
The process required collecting all of the affected PDF files into a TAR file. I did this by first creating a list with the find command on the web server:
cd /usr/local/www/www.docs/
find . -type f -name '*.pdf' -exec grep -l '\.olddomainname\.com' {} \; > /tmp/pdf_list.txt
Then, using the list, I created a TAR file:
tar cf /usr/pdfs_orig.tar -I /tmp/pdf_list.txt
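A note on portability: in Solaris tar, `-I` names an include file listing the members to archive. GNU tar spells that option `-T` (`--files-from`) and uses `-I` for something else. Assuming GNU tar, a rough equivalent might look like this (`make_archive` is a made-up name for illustration):

```shell
# make_archive LIST ARCHIVE: GNU tar variant of the command above.
# -T reads the member names, one per line, from LIST.
make_archive() {
    tar cf "$2" -T "$1"
}
```

Run from the document root, this would be `make_archive /tmp/pdf_list.txt /usr/pdfs_orig.tar`.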
On another machine, I copied over the TAR file using scp:
scp root@newmachine:/usr/pdfs_orig.tar .
I then extracted the files on that machine, created a list of files, and ran the script I wrote:
tar xf pdfs_orig.tar
find data -type f > /tmp/pdf_files.txt
./cnvt.ksh /tmp/pdf_files.txt
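After the run, it may be worth checking whether any file still mentions the old domain. The sketch below reuses the same grep test that built the original list; `check_remaining` is a hypothetical helper, not part of the script above.

```shell
# check_remaining LIST: print every file named in LIST that still contains
# the old domain, and return non-zero if any were found.
check_remaining() {
    found=0
    while IFS= read -r fn; do
        if [ -f "$fn" ] && grep -q 'olddomainname\.com' "$fn"; then
            echo "still references old domain: $fn"
            found=1
        fi
    done < "$1"
    return $found
}
```

For example, `check_remaining /tmp/pdf_files.txt` prints nothing and returns zero when every listed file is clean.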
Once the job finished, I created a TAR file with the updated PDF files and copied it back to the web server. On the web server I overlaid the existing PDF files with the updated ones using TAR:
tar cf pdfs_new.tar data
scp pdfs_new.tar root@webserver:/usr/
On the web server:
cd /usr/local/www/www.docs/
tar xf /usr/pdfs_new.tar
If any PDF is found to be broken, we can restore it from the original TAR file that was created.
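Restoring a single file might look like the sketch below. `restore_one` is a hypothetical helper, and the member path in the usage line is invented for illustration. The member name has to match the name stored in the archive exactly, including any leading "./", and the archive path should be absolute because the function changes directory first.

```shell
# restore_one ARCHIVE MEMBER DESTDIR: extract a single member from the
# original archive into DESTDIR. "tar tf ARCHIVE" lists the stored names.
restore_one() {
    (cd "$3" && tar xf "$1" "$2")
}
```

For example: `restore_one /usr/pdfs_orig.tar ./manuals/guide.pdf /usr/local/www/www.docs/`.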