Extracting Images from PDFs

Post by **Dythiese** » Fri May 01, 2020 11:55 am

When I prep my games, I like to use maps and character illustrations from the PDF of the module I'm going to run.

Images in PDFs are usually stored in one of two ways: as a regular image (generally JPEG) on its own, or as a regular image paired with a separate mask. JPEG cannot store transparency information, so an image without its mask will have a solid white or black background, which makes it look a little ugly in a VTT. The mask supplies the transparency, but because the image and its mask are extracted as two separate files, they have to be combined into a PNG before use.

In order to extract the images and then combine them for transparency we're going to need to use the Linux command line.

You don't need a dedicated Linux box or partition.

Windows 10 has an official subsystem that allows you to use the command line without requiring a Virtual Machine.

If you already have access to a Linux command line, just install ImageMagick and Poppler-Utils and use the script below.

Setting up the Windows Subsystem for Linux

https://docs.microsoft.com/en-us/window ... tall-win10
https://wiki.ubuntu.com/WSL

NOTE: This can not be done on Windows 10 "S Mode".

1. Open Windows Powershell as Administrator
2. Enter the following:

Code: Select all

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

3. Reboot Windows.

4. Open the Windows Store (Microsoft Store) and search for Ubuntu.
5. Click 'Get'.
6. After it has downloaded, 'Get' will change to 'Launch'.
7. Click 'Launch'. It will take a few minutes to set up the first time. It will then prompt for a username and password. Make up whatever you want.
You should see something that looks like this:

[Attachment: ubuntu.png]

To open the command line in the future, you may want to Pin 'Ubuntu' to the Start menu.

First, you'll want to ensure all of your programs are up to date. To do that, type the following:

Code: Select all

sudo apt update
sudo apt upgrade

The 'sudo' command tells Linux to run the command that follows it as an administrator (root). You'll be prompted for your password.

This may take a little bit, and you may be prompted to reboot your system.

You can either reboot your entire PC, or restart the Linux subsystem from PowerShell (run as Administrator) with the two commands below. Wait about 3 seconds after 'net stop LxssManager' completes before running 'net start LxssManager'.

Code: Select all

net stop LxssManager
net start LxssManager

Now you need to install the utilities the script will call. To do that, in the Ubuntu prompt type the following:

Code: Select all

sudo apt install poppler-utils
sudo apt install imagemagick
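
Before moving on, it can help to confirm that both programs actually ended up on your PATH. The following is a minimal sketch and not part of the attached script; 'missing_tools' is a helper name made up for this example. It relies only on the shell's 'command -v' and prints the name of every program it cannot find. 'pdfimages' and 'composite' are the two programs the script calls.

```shell
# missing_tools is a hypothetical helper: it prints each named program
# that cannot be found on the PATH, and prints nothing if all are present.
missing_tools() {
	local tool
	for tool in "$@"; do
		command -v "$tool" >/dev/null 2>&1 || echo "$tool"
	done
}

# pdfimages is installed by poppler-utils, composite by imagemagick.
missing_tools pdfimages composite
```

No output means both are installed. If a name is printed, rerun the matching 'sudo apt install' line.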

Your Windows drive(s) will be mounted under '/mnt/'. If you don't know what that means, read "Navigating the Linux command line" below:

Navigating the Linux command line

Red Hat article on Navigation

NOTE: Linux commands use white space as separators. You need to enclose file and folder names with double quotation marks if they contain whitespace.

You can autocomplete by pressing TAB. If there is more than one possible auto completion you may hear a ding. Press TAB twice to show all possible completions.

'pwd' displays the directory that you're currently in.

Code: Select all

:~$ pwd
/home/username
:~$

'mkdir' creates a new directory. 'mkdir new folder' will create TWO folders, "new" and "folder". To create a folder with spaces (whitespace), type

Code: Select all

:~$ mkdir "new folder"
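
To see the whitespace rule in action without creating any folders, here is a small sketch. 'count_args' is a throwaway function made up for this example; all it does is report how many arguments it was given.

```shell
# count_args is a hypothetical helper that prints how many arguments it received.
count_args() {
	echo "$#"
}

count_args new folder      # two arguments: "new" and "folder"
count_args "new folder"    # one argument, because the quotes keep the space
count_args new\ folder     # one argument, because the backslash escapes the space
```

The same splitting happens with 'mkdir', 'ls', 'cd' and every other command, which is why names with spaces need quotes or a backslash.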

'ls' lists all files and folders in the current directory. You can give it the name of a folder to view the contents of that folder. Examples:

Code: Select all

:~$ ls
'new folder'
:~$  ls "new folder"
:~$

There is no output for the last command because the folder is empty.

Code: Select all

:~$ ls /
bin   dev  home  lib    lib64   media  opt   root  sbin  srv  tmp  var
boot  etc  init  lib32  libx32  mnt    proc  run   snap  sys  usr

'/' is the root directory. Everything else is mounted here.

'cd' changes your current directory. It takes the directory to move to as its argument. '..' means one directory up. A '/' at the beginning of a path means the path starts at the root directory. Running 'cd' without any arguments returns you to your home directory (the one you start in when you launch Ubuntu from Windows).

You can access files on your Windows computer by navigating to '/mnt/'

I have two drives, C: and S:
They are mounted under /mnt/ as '/mnt/c' and '/mnt/s'

Code: Select all

:~$ cd /mnt
:~$ ls
c  s
:~$
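
If you know a folder's Windows path, translating it to its '/mnt/' equivalent is mechanical: lowercase the drive letter, drop the colon, and turn the backslashes into forward slashes. Here is a rough sketch of that rule. 'win_to_wsl' is a name made up for this example, and it only handles plain drive-letter paths like the ones above. WSL also includes a 'wslpath' command for the same job.

```shell
# win_to_wsl is a hypothetical helper: it converts a Windows path such as
# C:\Users\me\Desktop into the matching path under /mnt/.
win_to_wsl() {
	local drive rest
	# The first character is the drive letter; it appears under /mnt/ in lowercase.
	drive=$(printf '%s' "$1" | cut -c1 | tr '[:upper:]' '[:lower:]')
	# Everything after "C:" is the rest of the path, with \ turned into /.
	rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
	echo "/mnt/$drive$rest"
}

# Single quotes stop the shell from treating the backslashes as escapes.
# "me" is a placeholder user name.
win_to_wsl 'C:\Users\me\Desktop'
```

The printed path is what you would hand to 'cd'.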

Once you're able to navigate to your Desktop, make a directory and enter it.
On your Windows Desktop, a folder should have appeared. Make a copy of your target pdf in that folder.

I tested this script with Paizo's free 'Skitter Shot' Starfinder module, available from their website at https://paizo.com/products/btpya1aa?Sta ... itter-Shot

After downloading the Skitter Shot PDF, rename it to "Skitter Shot.pdf". All of the extracted image files will be prefixed with the name of the PDF, so you don't want the PDF to have a long name. I would use AA1, AA2, and Core for Alien Archive 1, Alien Archive 2, and the Core Rulebook.

Download this file into the folder.

[Attachment: imageExtraction.txt (1.72 KiB)]

NOTE: You're about to run a file provided by some rando on the internet. Open it in notepad, and even if you don't understand most of it make sure it doesn't have anything sketchy in it, like IP address calls, or URLs.

Normally, use of the 'rm' and 'mv' commands would be suspect, because 'rm' permanently deletes a file and 'mv' can overwrite one. I use 'rm' in this script to clean up extraneous image files after transparencies have been composited, and 'mv' to overwrite a temporary log file the script creates.

Check the use of 'cd' commands and see if the script tries to escape the current directory to do unknown nasties.
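
As a starting point for that check, here is a rough sketch that prints the lines of a script worth a second look: anything containing a URL or something shaped like an IP address, and any use of 'rm', 'mv', 'cd', 'curl' or 'wget'. 'flag_suspicious' is a made-up name, and the pattern is only a guess at what counts as sketchy. It is no substitute for reading the file.

```shell
# flag_suspicious is a hypothetical helper: it prints, with line numbers,
# every line of the given file that contains a URL, an IPv4-looking number,
# or one of the commands rm, mv, cd, curl, wget as a whole word.
flag_suspicious() {
	grep -nE 'https?://|[0-9]{1,3}(\.[0-9]{1,3}){3}|(^|[^[:alnum:]_])(rm|mv|cd|curl|wget)([^[:alnum:]_]|$)' "$1"
}

# Usage: flag_suspicious imageExtraction.txt
```

Every line it prints should be one you can explain. In this script, that means the 'rm', 'mv' and 'cd' uses described above.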

Use the 'mv' command to rename "imageExtraction.txt" to "imageExtraction.sh".

Code: Select all

:~$ mv imageExtraction.txt imageExtraction.sh

Run the script. Use the filename of the pdf as an argument.

Code: Select all

:~$ ./imageExtraction.sh Skitter\ Shot.pdf

This will take a few minutes, and Skitter Shot is only 20 pages.

You may notice that the filename is written as "Skitter\ Shot.pdf", with no double quotes. That's another way to deal with whitespace: putting a "\" before the space. If you use TAB to autocomplete, it will add those in for you.

Cleanup

Assuming you used the Skitter Shot PDF, the script will have created a folder named "SkitterShot.tmp" in the current directory.
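
The folder name comes from the PDF's filename: the script drops the extension, removes any whitespace, and appends '.tmp'. A small sketch of the same rule follows, in case you want to predict the folder name for another PDF. 'output_folder' is a made-up name, and it uses 'tr' where the script uses bash's built-in substitution.

```shell
# output_folder is a hypothetical helper that mirrors how the script names
# its output folder: strip the extension, strip whitespace, append ".tmp".
output_folder() {
	local base
	base="${1%.*}"
	base=$(printf '%s' "$base" | tr -d '[:space:]')
	echo "$base.tmp"
}

output_folder "Skitter Shot.pdf"
```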

Open it up, and you should see the following:

[Attachment: imageExtraction Results.png]

There's a lot of stuff here.

First off, the naming structure for all of the files is

Code: Select all

PDF NAME-PAGE NUMBER-UNIQUE NUMBER.EXTENSION
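
Because the page number is always the second-to-last dash-separated field, it can be pulled out of a filename with plain parameter expansion. This is a minimal sketch, with 'page_of' as a made-up helper name:

```shell
# page_of is a hypothetical helper: given an extracted filename, it prints
# the page-number field from PDFNAME-PAGE-UNIQUE.EXTENSION.
page_of() {
	local name
	name="${1%.*}"      # drop the extension
	name="${name%-*}"   # drop the unique number
	echo "${name##*-}"  # keep what follows the last remaining dash
}

page_of "SkitterShot-003-031.png"
```

Working from the right-hand end means a PDF name that itself contains dashes does not confuse it.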

Next, you'll notice that there are a lot of duplicate images. If you scroll down, you'll find "SkitterShot-015-165.png", which is an uncropped version of "SkitterShot-003-031.png" at the top.

You'll also see the many, many, many red line images. Most of them are exact duplicates of one another, but some are different by a few pixels in width and/or height.

For exact duplicates (pixel for pixel exact matches), you can search for 'duplicate file cleaner'. I use fslint for Linux, and CCleaner for Windows.

But for cropped/rotated images, you'll need a dedicated image duplication cleaner.

AntiDupl is free, and available for Windows at https://sourceforge.net/projects/antidupl/

Image duplication cleaners aren't perfect, so they provide you with images they believe to be duplicates and require you to confirm the deletion. To make that take less time, you should run CCleaner or FSlint first to eliminate the exact, byte-for-byte duplicates.
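
If you would rather stay on the command line for the exact duplicates, the idea can be sketched with standard tools: hash every file, then list each file whose hash has already been seen. This is only an illustration, not what CCleaner or FSlint do internally. 'list_exact_duplicates' is a made-up name, and the sketch assumes filenames without newlines or backslashes.

```shell
# list_exact_duplicates is a hypothetical helper: for each group of
# byte-for-byte identical files directly inside the given folder, it prints
# every file except the first one (in sorted order).
list_exact_duplicates() {
	local dir="$1"
	# Hash every file, sort so identical hashes sit next to each other, then
	# print the filename of every line whose hash has been seen before.
	find "$dir" -maxdepth 1 -type f -exec sha256sum {} + \
		| sort \
		| awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); print }'
}

# Usage: list_exact_duplicates SkitterShot.tmp
```

It only prints names and deletes nothing, so you can read through the list before removing anything.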

CCleaner Instructions

1. After opening CCleaner, navigate to Tools -> Duplicate Finder.

[Attachment: ccleaner.png]

2. Ensure that all of your other drives in the Include pane are unchecked, then click Add, and add the SkitterShot.tmp folder.
3. Click Search

[Attachment: ccleaner1.png]

4. Right click on any of the files and click on "Select All". This will not select all. It will select all but one of each duplicate.
5. Click "Delete Selected" in the bottom right corner of the window.
6. Confirm.

AntiDupl Instructions

[Attachment: antidupl.png]

1. Click on the "Paths" icon
2. Select the only folder (It defaults to the folder that the program is located in)
3. Click "Change" in the lower right corner.
4. Select the SkitterShot.tmp folder.
5. Click "Start Search" (the green arrow, two buttons to the left of the "Paths" icon).

[Attachment: antidupl1.png]

The author's website has good info on how to use the interface: https://ermig1979.github.io/AntiDupl/da ... eview.html

The short of it is that the NumPad will be your friend here. Num1 will delete the first image, Num2 will delete the second image, and Num5 will not delete either image. Hover over the icons to see what they mean and what their shortcuts are.

The dimensions column will highlight the smaller image in red. Generally, you want to keep the largest image available and then scale that one image down as you need.

Finally, you'll notice a lot of weird misc images.

[Attachment: danglies.png]

That's just how they were saved in the PDF for whatever editor's reasons. There are no scripts or programs for these; just delete them.

From what's left, remember that the file nomenclature is "PDF-SOURCE PAGE-UNNECESSARY NUMBER", so when you start renaming the files, use the page number as a reference for what the NPC's name is, or if there's extra info you need about an image.

The Script

[Attachment: imageExtraction.txt (1.72 KiB)]

You'll need to rename the attachment to imageExtraction.sh. The forum doesn't allow the *.sh filetype to be uploaded.

I haven't coded non-gcode since I was a teenager, and I had to teach myself bash scripting to do this, so I would greatly appreciate any improvements from real programmers.

Code: Select all

#!/bin/bash

filename="$1"

#Exit with a non-zero status on errors so a caller can tell the script failed.
if [[ -z $filename ]]; then
	echo "Usage: imageExtraction <PDF-File>"
	exit 1
elif [[ ! -r $filename ]]; then
	echo "$filename is not a readable file."
	exit 1
fi

pdf="${filename%.*}"
pdf="${pdf//[[:space:]]/}"
folder="$pdf".tmp

if ! mkdir "$folder"; then
	echo "Unable to create directory $folder"
	exit 1
fi

log="$pdf".pdfimages.log
pdfimages -list "$filename" > "$folder"/"$log"
pdfimagesExit=$?

#Test the saved status. "$?" at this point would hold the status of the assignment above, which is always 0.
case "$pdfimagesExit" in
	0)
		;;
	1)
		echo "pdfimages is unable to open $filename"
		exit 1
		;;
	2)
		echo "pdfimages is unable to open an output file."
		exit 1
		;;
	3)
		echo "pdfimages does not have PDF permissions."
		exit 1
		;;
	*)
		echo "Unknown error in pdfimages."
		exit 1
		;;
esac

cd "$folder"
echo "Extracting images from "$filename"..."
pdfimages -all -p ../"$filename" "$pdf"

case $? in
	0)
		echo "Extraction complete."
		;;
	*)
		echo "Unknown error in pdfimages."
		cd ..
		exit 1
		;;
esac

tail -n +3 "$log" > "$log".tmp
mv "$log".tmp "$log"

lastNum=''
lastPage=''
lastType=''
curFile=''
lastFile=''

while read -r page num type remainder; do
	if [[ $type == smask ]] || [[ $type == mask ]]; then
		#The numbers in pdfimages -list output are not fixed width, while the actual file output is.
		#10# forces base 10, so an already-padded value such as 008 is not read as octal.
		printf -v num "%03d" "$((10#$num))"
		printf -v page "%03d" "$((10#$page))"
		printf -v lastNum "%03d" "$((10#${lastNum:-0}))"
		printf -v lastPage "%03d" "$((10#${lastPage:-0}))"
		
		curFile=$(echo "$pdf"-"$page"-"$num".*)
		lastFile=$(echo "$pdf"-"$lastPage"-"$lastNum".*)
		newFile="${lastFile%.*}".png

		echo "Masking $lastFile with $curFile"
		composite "$curFile" "$lastFile" -compose copy-opacity "$newFile"
		echo "Deleting $curFile"
		rm "$curFile"
		#If the base image was already a PNG, the composite overwrote it in place, so keep it.
		if [[ $lastFile != "$newFile" ]]; then
			echo "Deleting $lastFile"
			rm "$lastFile"
		fi
	fi

	lastNum=$num
	lastPage=$page
	lastType=$type
done < "$log"

NOTE: The script will only extract masking layers that are present in the PDF. It can not auto-generate masking layers for images in PDFs that don't have them (scans).
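
To find out ahead of time whether a PDF has any masks at all, you can count the 'smask' and 'mask' rows in the output of 'pdfimages -list', the same listing the script reads. This is a rough sketch. It assumes the listing has two header lines and the image type in the third column, as the script above expects. 'count_masks' is a made-up name.

```shell
# count_masks is a hypothetical helper: it reads `pdfimages -list` output on
# standard input and prints how many rows are masks (type "smask" or "mask").
count_masks() {
	# Skip the two header lines, then test the third column of each row.
	tail -n +3 | awk '$3 == "smask" || $3 == "mask" { n++ } END { print n + 0 }'
}

# Usage: pdfimages -list "Skitter Shot.pdf" | count_masks
```

A result of 0 means the script will extract the images but has nothing to composite.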

I have no problems with Paizo PDFs, but when I tried a Wizards of the Coast 5E PDF, the script threw errors because that PDF uses JPEG 2000 (*.jp2) images. The most current version of ImageMagick can decode .jp2 files, but the one in the Ubuntu repository cannot.

Post by **Dythiese** » Fri May 01, 2020 12:30 pm

After going through AntiDupl, this is what remains. The little green guy has 2 duplicates, but that's a lot easier to see now. You could also play with the settings in AntiDupl to increase its fuzziness and get more matches.

[Attachment: folder.png]

Going off the pdf, this is page 6 and you'll see that Voryna Kopali has more than just her image, so let's make something that looks like that for a player handout.

PDF Page

[Attachment: page.png]

Unfortunately, the quality of the "SkitterShot-006-072.jpg" mask is very low, and I only get blurry results when I play with it. Similarly, all of the numeral masks at the bottom are meant to be overlays for the maps, but I find them too blurry for that.

So to make a card like the one in the PDF, I zoom in on the PDF and take a screenshot (ALT+PRINT SCREEN), then paste that into paint.net.

After that, I crop it and save it as "Voryna_card.png"

Then I open "SkitterShot-006-069.png" (Voryna's image) and crop that, and save it as "Voryna.png"

You can stop here and import the two images into MapTool separately, that way you can reveal her name after she gives it, but you'll have to remember to select both the card and the image when moving them. Also, if you need to flip the image, the text will also be flipped if it's all one image.

The advantages to making it a single image are that you can use the combined card and portrait image as a handout, as seen below, and also it's less to deal with; if you have a few of these, remembering to select both the image and the card at the same time can be a pain.

Token Properties

[Attachment: properties.png]

As a Handout

[Attachment: handout.png]

Map

[Attachment: map.png]

The numeral 2 on the floor is on the Hidden layer, for me as a GM to be able to reference rooms from the pdf. I created it from the extracted 2 mask, but as you can see it's not good.

I would recommend making a stock set of numbers and reusing those where needed.
