Extracting Slide Images from a Pdf
Being a programmer, I'm often tempted to crack open a compiler when faced with a task that involves anything to do with computers.
In this case, I needed to create some screenshots of the slides from a presentation. Unfortunately, the presentation was only available as a PDF. This left me with a choice: craft a solution myself, find an existing tool, or coerce a set of tools into doing the work.
I had a vague recollection that Inkscape could import a PDF file. The command line reference confirmed that Inkscape can import a PDF and export an image from it:
Inkscape.exe slidedeck.pdf --export-png=slidedeck.png
Great. Except that it only works with the first page in the document.
So what I needed was a tool to take a multi-page PDF and split it into single pages, which is exactly what Coherent Pdf Tools can do, like this:
cpdf.exe -split slidedeck.pdf -o slidedeck%%%%.pdf
This generates a numbered sequence of single-page PDFs.
Once I had the syntax of the two command line building blocks, I used a bit of MSBuild batching magic to handle the conversion of each individual page into an image.
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003" ToolsVersion="4.0">
  <PropertyGroup>
    <PathToCpdf>"C:\bin\cpdf.exe"</PathToCpdf>
    <PathToInkscape>"C:\Program Files (x86)\Inkscape\inkscape.exe"</PathToInkscape>
    <PathToPdfToSvg>"C:\bin\pdf2svg.exe"</PathToPdfToSvg>
    <ExportViaSvg>true</ExportViaSvg>
  </PropertyGroup>
  <Target Name="GenerateSinglePagePdfs">
    <ItemGroup>
      <Slides Include="$(MSBuildThisFileDirectory)*.pdf" />
    </ItemGroup>
    <Exec Command='$(PathToCpdf) -split %(Slides.Identity) -o %(Slides.Filename)%%%%.pdf' />
  </Target>
  <Target Name="GenerateImagesFromPdfs">
    <ItemGroup>
      <SingleSlides Include="$(MSBuildThisFileDirectory)*.pdf" />
    </ItemGroup>
    <Exec Condition="$(ExportViaSvg)" Command='$(PathToPdfToSvg) %(SingleSlides.Identity) "%(SingleSlides.Filename).svg" 1' ContinueOnError="true" />
    <Exec Condition="$(ExportViaSvg)" Command='$(PathToInkscape) -f "%(SingleSlides.Filename).svg" -e "%(SingleSlides.Filename).png" ' ContinueOnError="true" />
    <Exec Condition="!$(ExportViaSvg)" Command='$(PathToInkscape) %(SingleSlides.Identity) --export-png="%(SingleSlides.Filename).png"' ContinueOnError="true" />
  </Target>
  <Target Name="Build" DependsOnTargets="GenerateSinglePagePdfs;GenerateImagesFromPdfs" />
</Project>
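If the batching part looks like magic, it may help to see it as a plain loop. Referencing %(SingleSlides.Identity) or %(SingleSlides.Filename) in a task makes MSBuild run that task once per item. The following is only a rough Python sketch of that expansion, not how MSBuild is implemented: the function name expand_batched_command and the C:\decks paths are made up for illustration, and only the Identity and Filename metadata are modelled.

```python
import re
from pathlib import PureWindowsPath


def expand_batched_command(template, item_name, item_paths):
    """Return one command string per item, mimicking per-item task batching.

    Hypothetical helper: only the Identity and Filename metadata are modelled.
    """
    # Matches references such as %(SingleSlides.Filename)
    pattern = re.compile(r"%\(" + re.escape(item_name) + r"\.(\w+)\)")
    commands = []
    for path in item_paths:
        metadata = {
            "Identity": path,                        # the item as it was included
            "Filename": PureWindowsPath(path).stem,  # file name without extension
        }
        # A function replacement inserts the value literally, so backslashes
        # in Windows paths are left untouched.
        commands.append(pattern.sub(lambda m: metadata[m.group(1)], template))
    return commands


if __name__ == "__main__":
    template = 'pdf2svg.exe %(SingleSlides.Identity) "%(SingleSlides.Filename).svg" 1'
    pages = [r"C:\decks\slide01.pdf", r"C:\decks\slide02.pdf"]  # assumed paths
    for command in expand_batched_command(template, "SingleSlides", pages):
        print(command)
```

Each page produces its own command line, which is why a single Exec element in the script is enough to convert every page.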
Update
Now that I've been using this technique on a number of different slide decks, I've discovered that any PDF with embedded fonts doesn't export correctly. Inkscape replaces the font with a weird "not understood" icon.
Clearly another approach was needed. After some googling, I discovered another tool, pdf2svg, which can export PDFs to SVG and handles embedded fonts.
pdf2svg.exe slide01.pdf slide01.svg 1
Fortunately, but unsurprisingly, Inkscape is so awesome it can handle conversion from svg to png.
Inkscape.exe -f slide01.svg -e slide01.png
So the script has had to change a little to include the extra tool in the chain. It now goes:
Multi-page-pdf -> batch-single-page-pdfs -> batch-single-svgs -> batch-single-pngs
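For anyone who would rather script the per-page part of that chain outside MSBuild, here is a minimal Python sketch. It only builds the command lines, using the same flags shown in this post. The function name page_commands is hypothetical, the export_via_svg switch simply mirrors the ExportViaSvg property, and the tools are assumed to be on the PATH.

```python
from pathlib import PureWindowsPath


def page_commands(page_pdf, export_via_svg=True):
    """Build the command lines that turn one single-page PDF into a PNG.

    Hypothetical helper; it does not run anything itself.
    """
    stem = PureWindowsPath(page_pdf).stem
    svg, png = f"{stem}.svg", f"{stem}.png"
    if export_via_svg:
        return [
            # Step 1: PDF -> SVG with pdf2svg (page 1 of the single-page file)
            ["pdf2svg.exe", page_pdf, svg, "1"],
            # Step 2: SVG -> PNG with Inkscape
            ["inkscape.exe", "-f", svg, "-e", png],
        ]
    # Direct route: let Inkscape import the PDF itself
    return [["inkscape.exe", page_pdf, f"--export-png={png}"]]


if __name__ == "__main__":
    for command in page_commands("slide01.pdf"):
        print(" ".join(command))
```

Each inner list could be handed to subprocess.run to actually execute the step.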
I have refreshed the gist above to reflect this new information.