[svlug] ePub processing: was Some pretty serious parsing

Steve Litt slitt at troubleshooters.com
Sun Nov 15 10:11:48 PST 2015


On Sun, 15 Nov 2015 09:29:23 -0700
Akkana Peck <akkana at shallowsky.com> wrote:

 
> Steve Litt writes:
> > The big publishers just throw their old manuscripts into a meat
> > grinder: Let the buyer beware. Some small publishers and
> > self-publishers do an excellent job, while others are even more
> > atrocious than the big publishers.
> 
> So true. Epub books vary tremendously and there doesn't seem to be a
> good standard. 

Or perhaps there are too many good standards, each contradicting the
others, so people say "screw it, I'll just do it the simplest possible
way."

> I've been updating my epub python module recently to
> handle and correct cover images 

Just so we're on the same page, do you mean you've written a Python
program to take ePubs you've acquired and tweak them for better
reading? Or are you referring to something else?

> (like all those Project Gutenberg
> epubs that use a picture of a Palm Pilot as a cover image, with no
> text to tell you what the book is). Finding the cover image is
> really a matter of guesswork in most epubs: does it have "cover" in
> the filename somewhere and does the extension imply it's an image?

If the ePub's producer was doing things at all right, he/she put a
reference to the cover in the Guide section of the OPF file. This
ensures that the device's native location services can find the cover.
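
Here's a minimal sketch of that lookup in Python, assuming you've
already pulled the OPF text out of the ePub. The function name and the
sample OPF are made up for illustration; the namespace is the usual OPF
one. Note the href it returns is relative to wherever the OPF file
lives inside the ePub.

```python
import xml.etree.ElementTree as ET

# Standard OPF package namespace.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_guide_cover(opf_xml):
    """Return the href of the guide reference with type="cover", or None.

    Hypothetical helper: accepts the OPF document as str or bytes.
    """
    root = ET.fromstring(opf_xml)
    for ref in root.findall("opf:guide/opf:reference", OPF_NS):
        if ref.get("type", "").lower() == "cover":
            return ref.get("href")
    return None


# Made-up OPF fragment, just enough to exercise the lookup.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine><itemref idref="cover"/></spine>
  <guide>
    <reference type="cover" title="Cover" href="cover.xhtml"/>
  </guide>
</package>"""

if __name__ == "__main__":
    print(find_guide_cover(SAMPLE_OPF))
```

If it returns None, the ePub has no Guide cover entry and you're back
to guesswork.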

HOWEVER, some covers are just an image, and some covers are a whole
(X)HTML file containing the cover image. IIRC, different "you must do
it this way" Kindle documents straight from Amazon tell you to do it
different ways. I typically use the method of containing the cover
image in the XHTML file pointed to by the OPF->Guide->Cover entry.
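
When the cover entry points at an XHTML page rather than an image, you
still have to dig the image out of that page. A rough sketch, assuming
the page is well-formed XML with no named HTML entities like &nbsp;
(the stdlib parser chokes on those), and that the image sits in either
an img tag or an SVG image tag. The function name and sample page are
hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def cover_image_from_xhtml(xhtml):
    """Return the first image reference found in a cover page, or None.

    Hypothetical helper: looks at <img src="..."> and at SVG
    <image xlink:href="..."> (or plain href), in document order.
    """
    root = ET.fromstring(xhtml)
    for el in root.iter():
        tag = el.tag.rsplit("}", 1)[-1]  # drop the namespace prefix
        if tag == "img" and el.get("src"):
            return el.get("src")
        if tag == "image":
            href = el.get(XLINK_HREF) or el.get("href")
            if href:
                return href
    return None


# Made-up cover page for illustration.
SAMPLE_COVER = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Cover</title></head>
  <body><div><img src="images/cover.jpg" alt="Cover"/></div></body>
</html>"""

if __name__ == "__main__":
    print(cover_image_from_xhtml(SAMPLE_COVER))
```

The path that comes back is relative to the cover page itself, so it
still needs resolving against that page's location in the ePub.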

And, of course, some boze could simply put a graphic at the start of
the book, with no metadata indicating it's the cover. After all, people
read from front to back, right? What could possibly go wrong? Such
bozes are all over the Internet.

You really can't blame Gutenberg for what they do. They're converting
thousands and thousands of books from, basically, plain text, with no
available metadata other than (maybe) chapter starts and ends. In the
case of Gutenberg-type cover pages, you might get away with simply
replacing the first graphic in the first HTML page in the book with an
SVG created by a little Python substitution program that fills in
author, title, copyright date, etc. in a template with tokens. That kind
of thing works great: I use it in my course diploma-maker. Because you
didn't change a filename, you needn't change any metadata.
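
To give the flavor of the token-substitution idea, here's a bare-bones
sketch using Python's string.Template. The template layout, sizes, and
function name are invented for this example and aren't taken from my
diploma-maker; it only produces the SVG text and leaves writing it back
into the ePub to you.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical cover template; $title, $author and $date are the tokens.
SVG_TEMPLATE = Template("""<svg xmlns="http://www.w3.org/2000/svg" width="600" height="800" viewBox="0 0 600 800">
  <rect width="600" height="800" fill="#f4f0e6"/>
  <text x="300" y="300" font-size="40" text-anchor="middle">$title</text>
  <text x="300" y="380" font-size="28" text-anchor="middle">$author</text>
  <text x="300" y="740" font-size="18" text-anchor="middle">$date</text>
</svg>
""")


def make_cover_svg(title, author, date):
    """Fill the tokens, escaping &, < and > so the SVG stays well-formed."""
    return SVG_TEMPLATE.substitute(
        title=escape(title),
        author=escape(author),
        date=escape(date),
    )


if __name__ == "__main__":
    print(make_cover_svg("Moby Dick", "Herman Melville", "1851"))
```

Long titles will run off the edge with a template this simple, so a
real one would need some line wrapping.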

Last but not least, to get a feel for what's inside an ePub, see this:

http://www.troubleshooters.com/ebooktech/epub_demystify.htm

Thanks,

SteveT

Steve Litt 
November 2015 featured book: Troubleshooting Techniques
     of the Successful Technologist
http://www.troubleshooters.com/techniques


