[svlug] Some pretty serious parsing

Steve Litt slitt at troubleshooters.com
Sat Nov 14 16:02:04 PST 2015

On Sat, 14 Nov 2015 23:54:30 +0100
Ivan Sergio Borgonovo <mail at webthatworks.it> wrote:

> On 11/14/2015 02:42 PM, Steve Litt wrote:
> > Hi all,
> >
> > I need a fast, easy book authoring system to write books destined
> > for both PDF/paper and ePub. It does not currently exist in the free
> > software world.
> > I've used LyX to write books (to PDF/paper) since 2001, and would
> > continue to use it if it could write to both PDF and ePub. But it
> > can't: The (X)html LyX outputs is pidgin HTML, rendering pidgin
> > ePubs with serious read-time deficiencies and an inability to pass
> > standards with eBook vendors.
> [snip]
> > The Stylz cheatsheet is at
> > http://troubleshooters.com/projects/stylz/cheatsheet.htm . It
> > evolves every few days: It's still in a state of flux.
> [snip]
> > So what do you all think? What's a good way to parse a fairly
> > complex non-XML grammar to convert it to Xhtml?
> I didn't get what's really wrong with the generated html.

Did you take more than a 5 minute look at (X)html exported from LyX?
Styles solely for the sake of first paragraph non-indentation.
Antiquated anchor tags (expressly forbidden by Kindle). Two or three
times the HTML you need to convey the text and styles.
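
One rough way to put a number on the "two or three times" claim is to
compare the total size of an exported file with the amount of
character data in it. The sketch below is only an illustration using
Python's standard html.parser module; the names TextCounter and
markup_ratio are made up for this example, and it counts whitespace
and any <style> content as text, so treat the result as a ballpark
figure rather than a measurement.

```python
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Accumulate the length of character data found outside of tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_chars = 0

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.text_chars += len(data)


def markup_ratio(html_source):
    """Return total characters divided by character-data characters."""
    counter = TextCounter()
    counter.feed(html_source)
    counter.close()  # flush any text still buffered by the parser
    if counter.text_chars == 0:
        raise ValueError("document contains no text")
    return len(html_source) / counter.text_chars
```

A ratio near 1.0 means almost no markup overhead; running it on a LyX
export and on a hand-written version of the same chapter would show
how far apart they are.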

> I really never found a nice epub and after a while I stopped looking
> for nice ones, so no surprise I may miss what you're expecting.

Like you said, no surprise. Most publishers just put their existing
LaTeX or MSWord through a meat grinder that prematurely translates
styles into appearance and then runs several more conversion stages,
and they end up with hard-to-read garbage.

For instance, because my eyes are so bad, I decided to order "The Goal"
by Goldratt in Kindle format. The styling for character dialog (which
is a huge part of the book) was so bad that person A and person B would
speak on the same line. And even if there were a line break, it was no
bigger than intra-paragraph line spacing. You couldn't tell who was
saying what.

I ordered a Stephen King book via Kindle. In the front matter, some
fonts were so huge they walked off the page, while others were so tiny
they couldn't be deciphered.

> What are ebooks vendors standards?

The big publishers just throw their old manuscripts into a meat
grinder: Let the buyer beware. Some small publishers and
self-publishers do an excellent job, while others are even more
atrocious than the big publishers.

> You'd better investigate what other publishers use to produce pdf and
> epub.

PDF's not a problem. LaTeX does a spectacular job. As far as ePub, I
need to set my sights higher than what the publishers are doing,
because I swear, their formatting mistakes cut your reading speed in

> If you've never written a parser and you're not very very comfortable 
> with C I'd start with a python or Java parser (I don't like Java but 
> there are some really nice parsers written in Java).
> I've found PLY the one with the best balance between power and ease
> of use. PLY seems just a little bit more maintained than fetchmail
> but as Rick may say it could be just that it reached perfection.

I like Python: I'll check out PLY. Thanks for the tip.
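
Before reaching for a parser generator, the general shape of a
markup-to-XHTML converter can be sketched in plain Python. The toy
markup below is hypothetical and is NOT the Stylz grammar: a line
starting with "= " is a chapter heading, a blank line ends a
paragraph, and *word* marks emphasis. The function names are likewise
invented for the example. A real grammar with nesting would likely
need a proper lexer and parser such as PLY, which this sketch does not
use.

```python
import re
from html import escape

# Hypothetical inline rule: *text* becomes <em>text</em>.
EMPHASIS_RE = re.compile(r"\*([^*\n]+)\*")


def inline(text):
    """Escape &, < and >, then convert *emphasis* to <em> elements."""
    return EMPHASIS_RE.sub(r"<em>\1</em>", escape(text, quote=False))


def to_xhtml(source):
    """Convert the toy markup into a list of XHTML block elements."""
    blocks = []
    para = []  # lines of the paragraph currently being collected

    def flush():
        # Emit the pending paragraph, if any, as a single <p> element.
        if para:
            blocks.append("<p>" + inline(" ".join(para)) + "</p>")
            para.clear()

    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            flush()  # blank line ends the paragraph
        elif stripped.startswith("= "):
            flush()
            blocks.append("<h1>" + inline(stripped[2:]) + "</h1>")
        else:
            para.append(stripped)
    flush()
    return blocks
```

For example, to_xhtml("= Intro\nHello *world*\nagain") returns
["<h1>Intro</h1>", "<p>Hello <em>world</em> again</p>"]. The point of
the design is that styles stay as semantic elements and appearance is
left to a stylesheet.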

> It could be still easier to automatically add css to your lyx
> generated HTML.

Been there, done that. My first Kindle book was done that
way. Almost no features, and it still took days to code a converter to
go from LyX's pidgin HTML output to ePub, and the majority was just
remedial work with the HTML. Never again.



Steve Litt 
November 2015 featured book: Troubleshooting Techniques
     of the Successful Technologist
