[svlug] ePub processing: was Some pretty serious parsing

Akkana Peck akkana at shallowsky.com
Sun Nov 15 12:05:35 PST 2015

Steve Litt writes:
> Or perhaps there are too many good standards, each contradicting each
> other, [ ... ]

Yep -- you know what they say about standards.

> > I've been updating my epub python module recently to
> > handle and correct cover images 
> Just so we're on the same page, do you mean you've written a Python
> program to take ePubs you've acquired and tweak them for better
> reading? Or are you referring to something else?

Examine and tweak existing epubs.

I use it for displaying titles and tags of books in my library,
fixing tags (getting rid of all those excessive Gutenberg tags
and replacing them with tags I can use for indexing like science
fiction, mystery, astronomy, etc.), importing those tags into
my Kobo's shelves database, and lately, fixing titles and replacing
book covers. I should probably rename it since it handles a lot more
than just tags now.
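
For anyone curious what the "displaying titles and tags" part involves,
here is a minimal sketch using only the Python standard library. It is
not the code from my module; the function names (find_opf_path,
read_title_and_tags) are made up for illustration. It assumes the epub
stores its title and tags as Dublin Core dc:title and dc:subject
elements in the OPF file named by META-INF/container.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the epub container file and by Dublin Core metadata.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC_NS = "http://purl.org/dc/elements/1.1/"


def find_opf_path(zf):
    """Return the path of the OPF file named in META-INF/container.xml."""
    root = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return rootfile.get("full-path")


def read_title_and_tags(epub_path):
    """Return (title, [tags]) from an epub's dc:title and dc:subject."""
    with zipfile.ZipFile(epub_path) as zf:
        opf = ET.fromstring(zf.read(find_opf_path(zf)))
    title = opf.findtext(".//{%s}title" % DC_NS)
    tags = [el.text for el in opf.iter("{%s}subject" % DC_NS) if el.text]
    return title, tags
```

Calling read_title_and_tags("somebook.epub") (a hypothetical filename)
returns the title and the list of subject tags; rewriting the tags
means editing those dc:subject elements and writing a new zip.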

Once Stylz gets past the parsing part, of course you're welcome to
use any or all of my code for the epub parts of Stylz, and I'd
love to collaborate with you on that and add what's needed (which
will be a lot, since I haven't yet had the need to assemble epubs
from scratch, only examine and modify existing epubs).

> If the ePub manufacturer was doing things at all right, he/she put a
> reference to the cover in the Guide section of the OPF file. This
> assures that the device's native location services can find the cover.
> HOWEVER, some covers are just an image, and some covers are a whole
> (X)html file containing the cover image. IIRC, different "you must do
> it this way" Kindle documents straight from Amazon tell you to do it
> different ways. I typically use the method of containing the cover
> image in the Xhtml file pointed to by the OPF->Guide->Cover entry.
> And, of course, some boze could simply put a graphic at the start of
> the book, with no metadata indicating it's the cover. After all, people
> read from front to back, right? What could possibly go wrong? Such
> bozes are all over the Internet.

Yep, seen (and had to code for) all of these variants. My module
ignores XHTML covers at this point because the immediate problem
I'm trying to solve is cover images with no text. In the books I've
tested so far, if there's XHTML at all, it's just a wrapper that
shows the image, so it's still the image that needs to be changed.
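
To make the variants concrete, here is a rough sketch of looking up
the cover reference in an OPF file. Again, this is an illustration
rather than my module's code, and find_cover is a hypothetical name.
It checks the two metadata-based variants: a <meta name="cover">
element pointing at a manifest item, and a <guide> reference with
type="cover" (which may point at an image or at an XHTML wrapper).
It does not handle the "graphic at the start of the book with no
metadata" case.

```python
import xml.etree.ElementTree as ET

# The OPF package namespace, in ElementTree's {uri}tag form.
OPF = "{http://www.idpf.org/2007/opf}"


def find_cover(opf_xml):
    """Return (href, media_type) for the declared cover, or None.

    href is relative to the directory containing the OPF file.
    media_type is None if the href isn't listed in the manifest.
    """
    root = ET.fromstring(opf_xml)
    manifest = {item.get("id"): item for item in root.iter(OPF + "item")}

    # Variant 1: <meta name="cover" content="ID"> naming a manifest item.
    for meta in root.iter(OPF + "meta"):
        if meta.get("name") == "cover":
            item = manifest.get(meta.get("content"))
            if item is not None:
                return item.get("href"), item.get("media-type")

    # Variant 2: <guide><reference type="cover" href="...">.
    for ref in root.iter(OPF + "reference"):
        if ref.get("type") == "cover":
            href = ref.get("href")
            for item in manifest.values():
                if item.get("href") == href:
                    return href, item.get("media-type")
            return href, None

    return None
```

If the media type comes back as application/xhtml+xml, the cover is
one of the XHTML wrappers, and the image it shows still has to be dug
out of that file.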

> You really can't blame Gutenberg for what they do. They're converting
> thousands and thousands of books from, basically, plain text, with no
> available metadata other than (maybe) chapter starts and ends. In the

I don't really blame Gutenberg: most of these books were probably
converted a decade ago, with who knows what tools, back when the epub
spec was just getting started. Though if they're *still* generating
new epubs with covers like this, I'd blame them for that, since it's
so easy now to make a better cover with automated tools.

> case of Gutenberg type book cover pages, you might get away with simply
> replacing the first graphic in the first HTML page in the book with an
> SVG created by a little Python substitution program that fills in

That's basically what I'm doing (the script named fixbookcover in
the same directory I linked above). I extract the current cover
image, show it, and then (after user confirmation) use ImageMagick
to add some text with title and author to the existing image.
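
The ImageMagick step can be sketched roughly like this. This is not
the actual fixbookcover code; the function names, the font size, and
the colors are all assumptions chosen for illustration. It assumes
ImageMagick's "convert" command is on the PATH (on ImageMagick 7 the
command may be "magick" instead).

```python
import subprocess


def build_annotate_command(src, dest, title, author, pointsize=36):
    """Build an ImageMagick command that overlays title and author text."""
    text = "%s\n%s" % (title, author)
    return [
        "convert", src,
        "-gravity", "center",         # center the text on the image
        "-fill", "white",             # text color
        "-undercolor", "#00000080",   # translucent box behind the text
        "-pointsize", str(pointsize),
        "-annotate", "+0+0", text,    # draw text at the gravity point
        dest,
    ]


def add_cover_text(src, dest, title, author):
    """Run ImageMagick to write an annotated copy of src to dest."""
    subprocess.check_call(build_annotate_command(src, dest, title, author))
```

Passing the arguments as a list rather than a shell string means
titles with quotes or other odd characters don't need escaping.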

