Bulletin board (you post, we answer)

Welcome to a new column that will, we hope, become a regular feature of bulletin. Simply write in to [e-mail address] with details of any professional problem and we'll attempt to publish a concise yet practical answer in the months to come. Kicking the series off, Michael Benis takes a fresh look at a perennial problem that he addressed during the translation memory workshop at ITI's September conference.

CONVERTING PDF FILES

First things first

More and more companies are using PDF files to publish information on the Internet, which means that - whether we like it or not - an increasing proportion of our work arrives in that format. The first bit of advice is to get in touch with your client immediately. The reason for this is that a PDF document is always prepared from another format and you will almost never be expected to deliver your translation as a PDF file, but rather in a format that can be imported easily into your client's word processing or desktop publishing system. If they can send you the text for translation in this format then life will be easier for everyone concerned. We're going to assume that this isn't possible, although it often is and you would be best advised to always check with your client before spending any of your time and their money on converting the PDF file to another format.

Slow and dirty

The cheapest solution for files that haven't been locked is simply to copy and paste the text content of the PDF file to whichever application you prefer to work in. I'll assume that's Word, since it's a program with which most bulletin readers are likely to be familiar. You can do this using the standard Windows keystroke shortcuts:

Press the Control key (generally in the bottom left-hand corner of the keyboard and labelled Ctrl) and the "A" key at the same time to select all the text. You should see that this highlights all the text in the PDF file's current page. If this is not what happens, get in touch with your client immediately and arrange for them to send you a PDF file that has not been locked. If they are unable to provide such a file, you will have no option but to use an Optical Character Recognition program such as those mentioned below. Assuming the above went smoothly, however, press the Control key and "C" key at the same time to copy the selected text. You can now open Word and press the Control key and "V" key at the same time to paste the text you have copied into Word.

You will, unfortunately, need to do this one page at a time but, as you can see from the screenshots alongside, that's not the only time-consuming problem that occurs. The order in which information is presented has also changed and there is a hard break at the end of every line.

Getting rid of the latter is the least of your problems providing you adopt a little strategy to ensure that doing this automatically doesn't interfere with the paragraph structure of your document. Using Word's Find and Replace function, position the cursor in the "Find what" box and press the full-stop key "." then select the advanced options display by clicking on "More". The dialogue box will expand. Click the button marked "Special" and select "Paragraph Mark". The "Find what" box should consequently contain the following characters ".^p". Next enter the following characters in the "Replace with" box: ".£!$" (actually, it doesn't have to be this: you can in fact enter any daft combination that is unlikely to ever occur in your document). Now click on "Replace All". This marks all the points in the document where you wish to retain a change of paragraph. You will need to do exactly the same thing but with a space between the full stop and paragraph mark (". ^p") since many paragraphs may also end in that way. You can now do yet another Find and Replace operation with the paragraph mark "^p" in the "Find what" box (but only once this time) and a space in the "Replace with" box. Clicking "Replace All" again will get rid of all the hard returns, replacing them with spaces and thereby allowing your text to flow properly. All you need to do now is to restore the paragraph breaks by replacing ".£!$" with ".^p^p" following the same procedure.
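For readers who are comfortable with a little scripting, the same three-pass strategy can be sketched in Python. This is only an illustration of the procedure described above, not a replacement for it: the function name reflow is made up for this example, and it assumes that the pasted text has been saved as plain text, so that each of Word's paragraph marks ("^p") appears as an ordinary line break.

```python
# A minimal sketch of the Find and Replace strategy described above.
# Assumptions (not from the article): the text is plain text in which
# Word's paragraph mark (^p) appears as "\n"; reflow is a hypothetical name.

MARKER = "£!$"  # any daft combination unlikely to occur in the document


def reflow(text: str, marker: str = MARKER) -> str:
    """Remove hard line breaks but keep breaks that follow a full stop."""
    # Treat Windows line endings the same way as plain line breaks.
    text = text.replace("\r\n", "\n")

    # Pass 1: mark the breaks we want to keep (".^p" and ". ^p").
    text = text.replace(".\n", "." + marker)
    text = text.replace(". \n", "." + marker)

    # Pass 2: turn every remaining hard break into a space.
    text = text.replace("\n", " ")

    # Pass 3: restore the marked paragraph breaks (".^p^p" in Word).
    return text.replace("." + marker, ".\n\n")


if __name__ == "__main__":
    pasted = "This is a line\nthat wraps.\nNew paragraph starts\nhere. \nEnd."
    print(reflow(pasted))
```

As with the manual method, lines that do not end in a full stop (headers or table of contents entries, for example) will be run on and still need checking by hand.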

Unfortunately that's not the end of it. You will need to take a careful look through the document to clean up the occasional lines that didn't end with full stops and have now been run on, such as headers and the table of contents. If you have a large table of contents you could simply copy and paste this in unchanged from the PDF document. Lastly, you will also need to do some work in documents that originally had a complex layout with, for example, two columns or lots of tables and charts, since the order in which the text is displayed may have changed. So, although this solution doesn't actually require you to buy any software (you can use Adobe's free Acrobat Reader), it can be very time-consuming.

Heavyweight solutions

No one will, I am sure, be stunned to read that quicker and easier solutions are available to those who are prepared to pay for them. Surprisingly, the most effective and cost-effective solutions are not actually provided by Adobe, or even those supplying plug-ins for Adobe products. Instead, they were until recently to be found in the popular optical character recognition (OCR) products Abbyy FineReader 7 and OmniPage Pro 12. Neither of these products is perfect, but they can both be set up to flow the text in so that it doesn't need too much editing to be usable. What's more, they can of course also be used with a scanner both for OCR and to make photocopies, archive documents, send faxes and so on. At around £80, both are very much better value than the specialist products available on the market and work just as well. FineReader has the edge in terms of keeping the text flow in Word. If you already have a scanner with OCR software you can often save even more money by buying the lower-priced competitive upgrade versions. The only drawback to this solution for PDFs is that if you generally find software programs a little confusing, you may find it difficult to set them up as required to get the best results. Bulletin will be reporting on OCR programs in more detail in the near future.

Light relief

The good news is that ScanSoft, the manufacturers of OmniPage Pro, have been quick to spot a growing niche in the market and have launched a simple plug-in for Microsoft Word that manages to make the whole process much easier while simultaneously producing pretty good results and costing an awful lot less (the recommended retail price is £39.99 though it will probably be available for even less online). This new product allows you to convert a PDF file to a Word document either from within Word or from Outlook or Windows Explorer. It isn't at all configurable (meaning there aren't any settings to confuse anyone) and it works very well, as you can see from the screenshots accompanying this article. Its only disadvantage is that it will not perform any OCR operations whatsoever, which can mean that some of the information in charts and diagrams is only displayed as an image that cannot be edited in the Word document. Otherwise, ScanSoft's PDF Converter is as practical as its unpretentious name. Even more impressively, it in many ways does a better job than the more expensive OCR packages, especially when it comes to using Word's formatting features. Two-column documents, for example, are formatted perfectly, making life much easier both for you - especially if you use translation memory - and for your client after delivery. Tables are also handled well. All in all, it's one of the cheapest and best solutions to the PDF problem, although it does, of course, rely on having Word installed on your computer. That is, however, unlikely to be a problem for practically anyone reading this article.

I currently have a range of solutions for converting PDF to RTF files, but this is the one I'll be using in the months to come.
[Captions]

<Bitmap: C&PRaw.jpg>

This is the result after copying and pasting the text from our PDF document into a word processor. As you can see, it will take some cleaning up. Even the order of text and tables has changed.

<Bitmap: PDF Orig.jpg>

Here's what our document looked like in the PDF version created from the Quark XPress original.

<Bitmap: PDF Conv.jpg>

This is the result using ScanSoft's new PDF Converter. Images have been imported, but the text can be used pretty much unchanged.

<Bitmap: PDF Conv Table.jpg>

PDF Converter also does a good job of handling two columns, tables and bullet points compared to....

Abbyy FineReader <afrbullet.jpg> and....

OmniPage Pro <omnibullet.jpg>

First published in ITI Bulletin, 2003.