Hack 80 Decipher and Navigate PDF at the Text Level
Turn obfuscated PDF code into transparent data so you can work with it directly.

PDF uses an element framework for organizing data. When editing PDFs at the text level, it helps to know how to navigate these nodes. The data itself usually is compressed and unreadable. pdftk [Hack #79] can uncompress these streams, making the PDF more interesting to read and much more hackable.

First, uncompress your PDF document using pdftk:

    pdftk mydoc.pdf output mydoc.uncompressed.pdf uncompress

Next, fire up your text editor. A good text editor enables you to inspect any document at its lowest level by reading its bytes right off of the disk. Not all text editors can handle the mix of human-readable text and machine-readable binary data that PDF contains. Other editors can read and display this data, but they can't write it properly. I recommend using gVim [Hack #82].
Open a PDF in your text editor and you will find some plain-text data and some unreadable binary data. All of this data is organized using a few basic objects. The PDF Reference 1.5 section 3.2 describes these in detail. Here is a quick key to get you started.
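
As a rough illustration of what these basic objects look like in an uncompressed file, the sketch below stores a small hand-written page dictionary in a Python bytes literal and picks out its names and indirect references with regular expressions. The sample object and the helper names (`SAMPLE`, `names`, `references`) are hypothetical and exist only for this example. The patterns are deliberately simple and are not a full PDF tokenizer.

```python
import re

# Hand-written sample object for illustration only (not taken from a real PDF).
# "4 0 obj ... endobj" wraps an indirect object; "<< ... >>" is a dictionary.
SAMPLE = b"""4 0 obj
<<
/Type /Page
/Parent 2 0 R
/MediaBox [0 0 612 792]
/Contents 5 0 R
>>
endobj
"""

# Names start with a slash, for example /Type or /Page.
NAME = re.compile(rb"/[A-Za-z0-9_.#+-]+")

# Indirect references look like "5 0 R": object number, generation number, R.
REF = re.compile(rb"\b(\d+)\s+(\d+)\s+R\b")


def names(data):
    """Return every name token found in data, in order of appearance."""
    return [m.group(0).decode("ascii") for m in NAME.finditer(data)]


def references(data):
    """Return (object number, generation) pairs for each indirect reference."""
    return [(int(m.group(1)), int(m.group(2))) for m in REF.finditer(data)]


if __name__ == "__main__":
    print(names(SAMPLE))
    print(references(SAMPLE))
```

In this sample, the array `[0 0 612 792]` holds plain numbers, while `2 0 R` and `5 0 R` point at other indirect objects elsewhere in the file.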
Dictionaries tend to be the most interesting objects. They represent things such as pages and annotations. You can tell what a dictionary describes by checking its /Type and /Subtype keys. Conversely, you can find something in a PDF by searching on its type. For example, you can find each page in a PDF by searching for the text /Page. For annotations, search for /Annot, and for images, /Image.

At the end of the PDF file is the XREF lookup table. It gives the byte offset for every indirect object in the PDF file. This allows rapid random access to PDF pages and other data. Text-level PDF editing can corrupt the XREF table, which breaks the PDF. [Hack #81] solves this problem.
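
The search-by-type idea and the XREF offset can be sketched in a few lines of Python. This is a minimal sketch, not a PDF parser: it assumes an uncompressed file with a classic `xref` table (not a cross-reference stream), and it scans raw bytes, so text inside a stream that happens to look like `/Type /Page` would also be counted. The function names are hypothetical.

```python
import re

# Matches "/Type /Page" but not "/Type /Pages" (the page-tree node),
# since a plain text search for /Page would hit both.
PAGE_TYPE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")

# The number after the "startxref" keyword is the byte offset of the XREF table.
STARTXREF = re.compile(rb"startxref\s+(\d+)")


def count_pages(data):
    """Count dictionaries whose /Type is /Page in the raw PDF bytes."""
    return len(PAGE_TYPE.findall(data))


def xref_offset(data):
    """Return the offset recorded after the last startxref keyword, or None."""
    matches = STARTXREF.findall(data)
    return int(matches[-1]) if matches else None


def xref_offset_is_consistent(data):
    """Check that the recorded offset really points at the 'xref' keyword.

    Assumes a classic XREF table. After a careless text-level edit, the
    bytes shift and this check is likely to fail.
    """
    offset = xref_offset(data)
    return offset is not None and data[offset:offset + 4] == b"xref"
```

To try it on the uncompressed file from earlier, read the file in binary mode, for example `data = open("mydoc.uncompressed.pdf", "rb").read()`, and pass `data` to each function. Reading in binary mode matters because the offsets count bytes, not characters.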