From Pages ’09 to the web: An XML-based workflow (continued)

Posted by Pierre Igot in: Blogging, Macintosh
April 22nd, 2011 • 11:02 am

It’s the frustrated developer in me. Once I start fiddling with code, I cannot stop. (When I was a teenager, I spent hours and hours writing programs, first in Basic, then in compiled Basic, then in assembly code, first on a Sharp MZ80K and then on a Commodore 64. I could have become a software developer.)

I wrote earlier today about a workflow that I have developed for writing posts for another web site that I maintain. Since posting that earlier article, I have fixed and improved the AppleScript script for the Find/Replace actions in BBEdit using grep, so that it works better in various situations and also offers new features.

Here’s the revised script:

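-- Each rule leaves the original Pages XML tags in place and adds the HTML tags as escaped text (&#x3C; for <, &#x3E; for >).
-- In the replacement strings, \\& yields a literal ampersand; a bare & would insert the entire match.
tell application "BBEdit"
	activate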
	replace "<sf:link href=\"(.*?)\"><sf:span sf:style=\"SFWPCharacterStyle-23\">(.*?)</sf:span></sf:link>" using "<sf:link href=\"\\1\"><sf:span sf:style=\"SFWPCharacterStyle-23\">\\&#x3C;a href=\"\\1\"\\&#x3E;\\2\\&#x3C;/a\\&#x3E;</sf:span></sf:link>" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "<sf:lnbr/>" using "\\&#x3C;br /\\&#x3E;" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "(<sf:p sf:style=\"paragraph-style-32\">)(.*?)(<sf:crbr/>|<sf:br/>)" using "\\1\\&#x3C;p\\&#x3E;\\2\\&#x3C;/p\\&#x3E;\\3" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "(<sf:p sf:style=\"paragraph-style-47\">)(.*?)(<sf:crbr/>|<sf:br/>)" using "\\1\\&#x3C;div class=\"exEN\"\\&#x3E;\\2\\&#x3C;/div\\&#x3E;\\3" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "(<sf:p sf:style=\"paragraph-style-50\">)(.*?)(<sf:crbr/>|<sf:br/>)" using "\\1\\&#x3C;div class=\"exFRbad\"\\&#x3E;\\2\\&#x3C;/div\\&#x3E;\\3" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "(<sf:p sf:style=\"paragraph-style-52\">)(.*?)(<sf:crbr/>|<sf:br/>)" using "\\1\\&#x3C;div class=\"exFR\"\\&#x3E;\\2\\&#x3C;/div\\&#x3E;\\3" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "(<sf:p sf:style=\"SFWPParagraphStyle-110\">)(.*?)(<sf:crbr/>|<sf:br/>)" using "\\1\\&#x3C;li\\&#x3E;\\2\\&#x3C;/li\\&#x3E;\\3" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "<sf:span sf:style=\"character-style-15\">(.*?)</sf:span>" using "<sf:span sf:style=\"character-style-15\">\\&#x3C;span class=\"wordEN\"\\&#x3E;\\1\\&#x3C;/span\\&#x3E;</sf:span>" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "<sf:span sf:style=\"character-style-14\">(.*?)</sf:span>" using "<sf:span sf:style=\"character-style-14\">\\&#x3C;span class=\"wordFR\"\\&#x3E;\\1\\&#x3C;/span\\&#x3E;</sf:span>" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "<sf:span sf:style=\"character-style-0\">(.*?)</sf:span>" using "<sf:span sf:style=\"character-style-0\">\\&#x3C;em\\&#x3E;\\1\\&#x3C;/em\\&#x3E;</sf:span>" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "<sf:span sf:style=\"character-style-5\">(.*?)</sf:span>" using "<sf:span sf:style=\"character-style-5\">\\&#x3C;strong\\&#x3E;\\1\\&#x3C;/strong\\&#x3E;</sf:span>" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "<sf:span sf:style=\"character-style-12\">(.*?)</sf:span>" using "<sf:span sf:style=\"character-style-12\">\\&#x3C;span class=\"mysmallcaps\"\\&#x3E;\\1\\&#x3C;/span\\&#x3E;</sf:span>" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
	replace "<sf:span sf:style=\"character-style-13\">(.*?)</sf:span>" using "<sf:span sf:style=\"character-style-13\">\\&#x3C;span class=\"mysup\"\\&#x3E;\\1\\&#x3C;/span\\&#x3E;</sf:span>" searching in text 1 of selection options {search mode:grep, starting at top:true, wrap around:false, backwards:false, case sensitive:true, match words:false, extend selection:false}
end tell

And here is what I did.

First, I’ve added a feature that lets me use the WYSIWYG tool in Pages ’09 for hyperlinks (in the “Link” inspector) to add hypertext links to my articles. I wrote earlier today that I couldn’t do that because of a bug in my script that caused & (ampersand) characters to be stripped from URLs, and URLs contain them quite often.

I decided that I was not going to accept the existence of this bug in my script, and so I set out to eliminate it. It turned out that the problem was due to my grep patterns for adding p tags and div tags, which were too destructive because they only attempted to match part of the closing tag for a paragraph (br/>). I originally did that because, during my testing for the initial script, I found that Pages ’09 was somewhat unpredictable regarding the closing tags it would use for paragraphs. Sometimes it would close with <sf:br/><sf:p> and sometimes with <sf:crbr/><sf:p>. I couldn’t figure out exactly why it was doing that or whether it was following a specific pattern.

So I just used a partial match (br/>), but then of course I didn’t put it back properly in the replacement string. As a result, the closing bits of XML code for each paragraph were broken, and that was what, indirectly, caused Pages ’09 to lose the & characters in my URLs: they were processed as part of unclosed tags.

The unclosed tags didn’t cause any problems with rendering the tagged text in Pages ’09, but they were the reason I said that I trashed the tagged file after copying and pasting its contents into WordPress: I knew that some of the tags were broken.

Anyway, I eventually figured out that the proper way to match these two possible endings was to use alternation in grep, i.e. the | (vertical bar) operator. Now for each paragraph that needs either a p or a div tag, I look for “(<sf:crbr/>|<sf:br/>)” at the end of the paragraph, and I include the same captured group in the replacement string, so that no tag is broken.
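To see the alternation at work outside BBEdit, here is a minimal sketch in Python. It assumes that Python's re module behaves like BBEdit's grep engine for this particular pattern, and the two sample lines are made up for illustration; only the pattern and the replacement mirror the paragraph-style-32 rule in the script.

```python
import re

# Pattern modeled on the paragraph-style-32 rule in the script above.
# Python's re module is only a stand-in for BBEdit's grep engine here.
PARAGRAPH = re.compile(
    r'(<sf:p sf:style="paragraph-style-32">)(.*?)(<sf:crbr/>|<sf:br/>)'
)


def wrap_paragraph(xml: str) -> str:
    """Wrap the paragraph text in escaped <p> tags, keeping the closing tag whole."""
    # Group 3 is re-emitted unchanged, so whichever ending was matched survives.
    return PARAGRAPH.sub(r"\1&#x3C;p&#x3E;\2&#x3C;/p&#x3E;\3", xml)


# Hypothetical sample lines, one for each possible ending.
with_crbr = '<sf:p sf:style="paragraph-style-32">Hello<sf:crbr/>'
with_br = '<sf:p sf:style="paragraph-style-32">Hello<sf:br/>'

print(wrap_paragraph(with_crbr))
print(wrap_paragraph(with_br))
```

Because the ending is captured as a group and written back as \3, the same rule handles both variants without needing to know which one Pages ’09 chose.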

And so the elimination of this bug now means that my URLs are no longer accidentally altered by my script. I am therefore able to add a line in the script (at the very top) that looks for WYSIWYG hyperlinks in the Pages ’09 document and inserts the proper a href tag around each of them, preserving both the link itself and the link text.
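The hyperlink rule can be tried the same way. The sketch below is again in Python rather than BBEdit, with an invented URL and link text, and it assumes that the ampersand is stored as &amp; in the XML. One difference worth noting: in BBEdit's replacement strings a bare & stands for the whole match, which is why the script writes \&, whereas Python's replacement syntax gives & no special meaning.

```python
import re

# Pattern modeled on the first rule in the script above.
LINK = re.compile(
    r'<sf:link href="(.*?)">'
    r'<sf:span sf:style="SFWPCharacterStyle-23">(.*?)</sf:span></sf:link>'
)


def add_anchor(xml: str) -> str:
    """Insert an escaped <a href> tag around the link text, leaving the XML intact."""
    return LINK.sub(
        r'<sf:link href="\1"><sf:span sf:style="SFWPCharacterStyle-23">'
        r'&#x3C;a href="\1"&#x3E;\2&#x3C;/a&#x3E;</sf:span></sf:link>',
        xml,
    )


# Hypothetical link whose URL contains an ampersand.
sample = (
    '<sf:link href="http://example.com/?a=1&amp;b=2">'
    '<sf:span sf:style="SFWPCharacterStyle-23">my link</sf:span></sf:link>'
)

result = add_anchor(sample)
print(result)
```

The URL ends up twice in the output, once in the original sf:link tag and once in the escaped a href tag, with its ampersand untouched in both places.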

I also added two more lines in the script for CSS character styles that I called mysup and mysmallcaps. The default behaviour for the sup tag in web pages is often problematic, so I use a custom style that does not create line spacing problems:

.mysup
{
    font-size: smaller;
    vertical-align: baseline;
    position: relative;
    bottom: 0.33em;
}

The mysmallcaps style simply applies the small-caps font variant to the enclosed text.
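The two script lines for these styles follow the same shape as the other character-style rules, so they can be pictured as entries in a small table. This is only a sketch in Python, not how the AppleScript is organized; the style numbers and class names come from the script, while the function name and the sample text are made up.

```python
import re

# Style IDs and CSS class names taken from the script above.
SPAN_CLASSES = {
    "character-style-12": "mysmallcaps",
    "character-style-13": "mysup",
}


def wrap_spans(xml: str) -> str:
    """Add an escaped <span class="..."> inside each matching sf:span."""
    for style, css_class in SPAN_CLASSES.items():
        pattern = re.compile(
            r'<sf:span sf:style="' + re.escape(style) + r'">(.*?)</sf:span>'
        )
        replacement = (
            f'<sf:span sf:style="{style}">'
            f'&#x3C;span class="{css_class}"&#x3E;\\1&#x3C;/span&#x3E;</sf:span>'
        )
        xml = pattern.sub(replacement, xml)
    return xml


# Hypothetical sample: a superscript "er" after a digit.
sample = '1<sf:span sf:style="character-style-13">er</sf:span>'
wrapped = wrap_spans(sample)
print(wrapped)
```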

The only drawback to this approach is that, when editing my articles in WYSIWYG in Pages ’09, I have to apply specific user-defined character styles for superscript and small caps rather than the default manual formatting options because, as indicated in my other post earlier, the built-in manual formatting options use a character style numbering scheme that is unpredictable.

What I lose here is the ability to use keyboard shortcuts, because I already have keyboard shortcuts with Keyboard Maestro for the superscript and small caps manual formatting options in Pages ’09 and I am not about to change all my other documents to use specific user-defined character styles for these two simple things instead of the default manual formatting options. I could start doing this from now on for new documents, but I have thousands of existing documents that use the manual formatting options and such a switch would create lots of undesirable additional overhead.

So I will have to apply small caps and superscript with the mouse using my specific user-defined character styles in the Styles drawer when editing my articles for this particular site. It’s not really a big deal, because I don’t need small caps and superscript in those articles all that often anyway.

I also improved the script by eliminating the line for applying the strong tag to manual bold formatting. I decided that it was too problematic to rely on Pages ’09’s unpredictable numbering scheme for built-in styles and that it would be better if I just made sure that I always applied my own user-defined Strong Emphasis character style instead of simply applying bold. That is not really a problem because I have an easy shortcut with Keyboard Maestro for this particular style (command-shift-E), which I already use all the time in other contexts anyway.

So there we are. This revised script now deals with the unpredictability of XML paragraph endings in Pages ’09 documents in a proper, elegant way, which incidentally allows me to also support WYSIWYG editing of hyperlinks — admittedly a pretty significant improvement in editing articles for the web.

The script also now covers my user-defined character styles for superscript and small caps.

I will probably make further improvements in the future if I have other WYSIWYG editing needs. But I am already pretty pleased with this workflow as it exists now.

