Date: 2015-11-04
Install the following tools:
imagemagick
pdfsandwich
scantailor
You probably already have imagemagick
.
Say you have a scanned PDF file scan.pdf
.
Move it to a new directory and go to that directory.
Assume we have one or more PDFs in a directory /path/to/scan/
.
First we must get some images from the scanned PDF.
Option 1: pdfimages. Try running:
ls -U | grep -a -e '\.[Pp][Dd][Ff]$' | \
while read -r f
do
g=$(printf '%s' "$f" | sed -e 's/\.[^.]*$//')
pdfimages "$f" "$g"
done
The output images are in .pbm
or .ppm
format, depending on if they have color or not.
Take a look at these images, if they are not 1 image per page, then you’ll have to do option 2.
The tool we use in the next step does not recognize Netpbm formats, so convert the images to .png
files.
ls -U | grep -a -e '\.p[bp]m$' | \
while read -r f
do
g=$(printf '%s' "$f" | sed -e 's/\.[^.]*$//')
pnmtopng "$f" > "$g".png
done
Option 2: imagemagick. This option takes a longer time and will not correlate exactly with the original scan.
ls -U | grep -a -e '\.[Pp][Dd][Ff]$' | \
while read -r f
do
g=$(printf '%s' "$f" | sed -e 's/\.[^.]*$//')
convert -density 400 "$f" "$g".png
done
Run scantailor
and create a new project in the directory with all of the .png
files.
Let the output directory be out/
.
This tool lets you reorient the pages, split them, make text darker, etc.
It is very helpful for file size and OCR.
After fixing up the formatting and exporting, stitch together the output images into all.pdf
.
{ ls -v out/*.tif ; echo all.tif ; } | xargs -d '\n' tiffcp
tiff2pdf -o all.pdf -u m -p "letter" -F all.tif
You can also specify a title and author for the PDF using the -t
and -a
flags respectively.
Here we use the pdfsandwich
tool to do OCR.
This runs tesseract
in the background using multiple processes if available.
Just run the following command to make a (hopefully) text-searchable final.pdf
.
pdfsandwich -o final.pdf all.pdf