PDF Export glyph-character mapping is odd

Bug #1353873 reported by Darin Nelson
This bug affects 1 person
Affects: Inkscape
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

It appears that there is something strange in the CIDMap used with PDF export in Inkscape 0.48.4 r9939. I've seen hints of this bug in older reports (that led to the PDF+LaTeX feature), but nothing used the keywords that seemed relevant to me, so I'm opening a new one.

The initially observed behavior was that some PDF exports containing text had one or more character substitutions, so that even though the PDF displayed properly, copy-paste of text gave incorrect results. The glyph/character mismatches changed for different files, or the same file, if edited. Sometimes there was no mismatch at all.

I don't know the Inkscape code/libraries well enough to be sure where the blame for the bug should lie.
I would be tempted to call this a _reader_ error, except that the mismatches reproduce both when reading with Adobe Acrobat and when using the Python text-extraction module pdfminer. I am not bold or knowledgeable enough to call it a spec error. So I'm reporting here.

At least I can offer a workaround.

So:

The font I'm using is plain old Arial, *without* conversion of text to paths.

After an initial inquiry (using pdfminer to examine the full file contents), the behavior can be explained as a double table lookup.

If you have a CIDMap entry (in a PDF file made by Inkscape) of:

<01> <0020>

then it should map glyph 1 to a space (0x20). If there is no entry <20>, it does. However, if there is, later down the list, an entry

<20> <0021>

then it appears as though the 0x20 is remapped into the table, and the resulting mapping of glyph-to-character produces 0x21 -- an exclamation point. The same behavior reproduces at least for other ASCII/Unicode values below 127; I don't have any tables long enough to test other values.

The "remapping" theory is based on behavior, and not any analysis or understanding of the underlying mechanics.
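To make the theory concrete, here is a minimal Python sketch of the two behaviors: a single lookup (what I expected) versus a lookup that is applied a second time when the result is itself a key in the table (what I seem to observe). The function names are made up for illustration, and this only models the symptom; it is not taken from cairo, Acrobat, or pdfminer.

```python
def decode_once(cmap, glyph_ids):
    """Expected behavior: each glyph ID is looked up exactly once."""
    return "".join(chr(cmap[g]) for g in glyph_ids)


def decode_chained(cmap, glyph_ids):
    """Observed behavior (hypothesis): if the mapped value is itself
    a key in the table, it is looked up one more time."""
    out = []
    for g in glyph_ids:
        code = cmap[g]
        if code in cmap:
            code = cmap[code]  # the suspected second lookup
        out.append(chr(code))
    return "".join(out)


# The two entries from the example above: <01> <0020> and <20> <0021>
cmap = {0x01: 0x0020, 0x20: 0x0021}

print(repr(decode_once(cmap, [0x01])))     # single lookup
print(repr(decode_chained(cmap, [0x01])))  # chained lookup
```

With only the `<01> <0020>` entry present, both functions agree; they diverge only once an entry keyed on 0x20 exists.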

And I note that if the CIDMap table is actually _supposed_ to work like that, then I am very surprised. Nevertheless, it interferes with the use of Inkscape-produced PDF files.

WORKAROUND:

The reason for the file-contents dependent behavior appears to be that Inkscape sends characters to be mapped in the Z-order they are encountered. So editing text can change the mapping, and moving text objects up and down in the Z-order does the same thing. This suggests a workaround: put in an appropriate hidden text item to control the CIDMap order. Hidden means camouflaged against the background, or under another object; making it transparent or off-page will cause the item to be dropped. The hidden item needs to be bottom-most in the Z-order of the objects exported to the PDF.

A string that works in my hands (for the characters it contains) is:

`abcdefghijklmnopqrstuvwxyz{|}~ !&quot;#$%&amp;'()*+,-./01234567890:;&lt;=&gt;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

Basically, the space gets pushed out to position 0x32 (position 0x01 is first), and all characters after it are likewise positioned ordinally at their own code points. The table is short enough that the small letters don't get corresponding table entries to overwrite them unless some further character is added to the table by another text element. In my experiments, other characters got added to a different CIDMap, so this workaround should cover many situations where higher-code point characters are used.
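A rough way to check whether a given text order is "safe" is sketched below in Python. It assumes glyph positions are handed out in order of first appearance, starting at 1, and that the suspected second lookup is the only thing going wrong; both are assumptions drawn from the observed behavior, and the helper names are hypothetical. The workaround string is rebuilt from code point ranges, which gives the same first-appearance order as the string above without any escaping trouble.

```python
def build_cmap(text):
    """Assign positions 1, 2, 3, ... to characters in order of first
    appearance (assumed to mirror how the exported table is filled)."""
    cmap = {}
    seen = set()
    for ch in text:
        if ch not in seen:
            seen.add(ch)
            cmap[len(seen)] = ord(ch)
    return cmap


def mismatches(cmap):
    """Return {intended_char: extracted_char} for every entry whose value
    is also a key that maps somewhere else (the suspected double lookup)."""
    bad = {}
    for code in cmap.values():
        if code in cmap and cmap[code] != code:
            bad[chr(code)] = chr(cmap[code])
    return bad


# Workaround string: 0x60..0x7E first, then 0x20..0x5F.
workaround = "".join(chr(c) for c in range(0x60, 0x7F)) + \
             "".join(chr(c) for c in range(0x20, 0x60))

# A text that starts with a space and has more than 32 distinct characters.
risky = " abcdefghijklmnopqrstuvwxyzABCDEF"

print(mismatches(build_cmap(workaround)))
print(mismatches(build_cmap(risky)))
```

In the workaround case the space lands at decimal position 32 and every later character sits at its own code point, so a second lookup changes nothing. In the risky case the space at position 1 maps to 0x20, and position 0x20 is occupied by a different character.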

But this is All So Wrong.

Any ideas?

Darin Nelson (darin-nelson) wrote :

EDIT: I meant: "the space gets pushed out to decimal position 32." Hex is 0x20.

su_v (suv-lp) wrote :

Please add information about OS/platform to the bug description (if linux, which versions of cairo do you have installed?), as well as steps to reproduce from scratch (or attach a reduced test case (Inkscape SVG file, exported PDF file)) to ease further investigations.

Darin Nelson (darin-nelson) wrote :

The OS is Windows 7.

You specified Cairo version only for Linux, but here they are anyway, to avoid hunting:
    libcairo-2.dll, dated 23-Jun-2011.
    libcairomm-1.0-1.dll, dated 20-Jun-2012.

Cases attached in a single .zip file ("mismatch" and "ok").

The .svg files contain the same two text objects, but in different z-order. One order ("ok") gives good text extraction from a resulting PDF, the other ("mismatch") does not.

The .pdf files were exported from Inkscape 0.48 under Windows 7 using Save a Copy. Screenshot of export settings attached (PDF export settings.png).

The .pdf.txt files give auto-extracted text from pdfminer (copy-paste from Adobe Acrobat X gives the same results). The .pdf.xml files give auto-extracted text from pdfminer with additional information (probably not needed for this).

The .pdf.dump files give full PDF content from pdfminer in pseudo-XML format. Because of character escaping, it's hard to read the cmap data, so I also provide cmaps.txt files dumped directly from the uncompressed stream strings.
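For anyone who wants to inspect the cmaps.txt files programmatically, a small Python sketch follows. It assumes the dumped text contains standard `beginbfchar` ... `endbfchar` sections with one `<source> <destination>` hex pair per entry, and that each destination is a single code point; the function name and the sample data are illustrative only, not copied from the attachments.

```python
import re

# Only entries inside beginbfchar/endbfchar are wanted; codespacerange
# lines also hold two hex strings and must not be picked up.
BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {source_code: destination_code} for all bfchar entries."""
    table = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in ENTRY.findall(block):
            table[int(src, 16)] = int(dst, 16)
    return table


# Illustrative sample, shaped like the entries discussed in this report.
sample = """
1 begincodespacerange
<00> <ff>
endcodespacerange
2 beginbfchar
<01> <0020>
<20> <0021>
endbfchar
"""

table = parse_bfchar(sample)
for src, dst in sorted(table.items()):
    print(f"<{src:02x}> -> U+{dst:04X} {chr(dst)!r}")
```

The resulting dictionary can then be scanned for values that also appear as keys, which is the pattern that coincides with the mismatches.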

Darin Nelson (darin-nelson) wrote :

ANOTHER EDIT:
  The workaround string in the original post got XML-escaped and has a carriage return in it. Unescape and remove the <CR> in order to use it. Also note that the posted demo .svg files use the same string in one of the two text objects.

su_v (suv-lp) wrote :

On 2014-08-07 11:48 , Darin Nelson wrote:
> The OS is Windows 7.
>
> You specified Cairo version only for Linux, but here they are anyway, to avoid hunting:
> libcairo-2.dll, dated 23-Jun-2011.
> libcairomm-1.0-1.dll dated 20-Jun-2012.

The cairo version for the Windows packages of Inkscape is known (cairo is part of the Windows devlibs used to create the official installer); on platforms like Linux it depends on which distro (and which version of the distro) is used, because Inkscape uses the cairo library provided by the system.

Current Inkscape packages for Windows include a rather dated development snapshot of cairo [1]:
cairo 1.11.2

--
[1] <http://bazaar.launchpad.net/~inkscape.dev/inkscape-devlibs/trunk/view/head:/readme.txt#L25>

su_v (suv-lp) wrote :

Thanks for the test cases - attaching samples produced on OS X 10.7.5 with various inkscape & cairo versions.
Notes:
- SVG file had to be changed to use Arial explicitly
  ('Sans' on other platforms defaults to other fonts than Arial)
- Arial font version used (system-provided):
  Version: Version 5.01.2x
  Unique Name: Monotype:Arial Regular:Version 5.01 (Microsoft)
- Quick tests had been done with 'pdftotext' from poppler 0.26.3
  (see the generated *.txt files included in the archive)

AFAICT none of the generated PDFs has the reported problem - at least 'pdftotext' here produces the same output for all of them as for your 'glyph-character ok.pdf'.

Darin Nelson (darin-nelson) wrote :

Confirmed that the PDF files you attached do not have the reported problem on my setup either, with either Acrobat X or pdfminer.

I _do_ get the problem if I build a PDF file from your .svg file, so differences due to your edit are irrelevant.

I note also that the extracted cmaps are different in your posting from those in mine.

For example, 485-cairo-1_12_16 and _2 begin with entry <20> <0020>, and preserve 1:1 matching of glyph and character throughout, while what I posted begins with <01> <20>, and goes up from there.

0482-cairo-1_10_2 begins with entry <0001> <0020>, so it's also different. Possibly relates to the codespacerange difference, which specifies the 2-byte range, as far as I understand.

Does that isolate the problem to the cairo version?

Nathan Lee (nathan.lee) wrote :

Testing on Windows 10, no longer occurs on versions 0.91-1, 0.92.0, 0.92.3, or 1.0.1, but did occur on 0.48.5-1.

Also couldn't replicate with 1.0.1 on Gentoo, 0.91 r13725 (Jul 1 2016) on a Gentoo live disk, or 0.91 r13725 on Ubuntu 14.04.7.

We probably can close this issue as fixed then.
