UnicodeDecodeError in posixpath for non-ascii filename

Bug #437295 reported by Martin von Gagern
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
CVS to Bazaar importer | Triaged | Medium | Unassigned |
Mukti Bangla Open Type font | New | Undecided | Unassigned |

Bug Description

cvsps-import fails for me, probably due to a non-ascii character in some file name.

$ bzr cvsps-import --encoding=latin1 $PWD/CVSROOT ModuleName .
Creating cvsps dump file: ./staging/ModuleName.dump
Read 5718 patchsets (string cache hits: 0, total: 14181)
Failed while processing: Patchset(871, HEAD, materlik, 2002/02/11 18:01:04)
Processed 870 patches (870 new, 0 existing) on 14 branches (6 tags) in 2264.3s (0.38 patch/s)
bzr: ERROR: exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 20: ordinal not in range(128)

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bzrlib/commands.py", line 842, in exception_to_return_code
    return the_callable(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/bzrlib/commands.py", line 1037, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/python2.6/site-packages/bzrlib/commands.py", line 654, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/home/mvg/.bazaar/plugins/cvsps_import/__init__.py", line 95, in run
    importer.process()
  File "/home/mvg/.bazaar/plugins/cvsps_import/cvsps/importer.py", line 1275, in process
    self._process_patchsets(cvs_to_bzr, patchsets, pb=pb)
  File "/home/mvg/.bazaar/plugins/cvsps_import/cvsps/importer.py", line 1214, in _process_patchsets
    rev_id, action = cvs_to_bzr.handle_patchset(patchset)
  File "/home/mvg/.bazaar/plugins/cvsps_import/cvsps/importer.py", line 755, in handle_patchset
    revision_id = self._extract_changes(patchset)
  File "/home/mvg/.bazaar/plugins/cvsps_import/cvsps/importer.py", line 992, in _extract_changes
    txt, executable = self._cvs_updater.get_text(member, revision)
  File "/home/mvg/.bazaar/plugins/cvsps_import/cvsps/importer.py", line 672, in get_text
    rcs_file = self._get_rcs_filename(filename)
  File "/home/mvg/.bazaar/plugins/cvsps_import/cvsps/importer.py", line 607, in _get_rcs_filename
    filename + ',v')
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 20: ordinal not in range(128)

bzr 2.0.0 on python 2.6.2 (Linux-2.6.30-gentoo-r5-i686-Intel-R-_Pentium-R-_4_CPU_3.00GHz-with-gentoo-2.0.1)
arguments: ['/usr/bin/bzr', 'cvsps-import', '--encoding=latin1', '$PWD/Cinderella/CVSROOT', 'ModuleName', '.']
encoding: 'UTF-8', fsenc: 'UTF-8', lang: 'de_DE.utf8'
plugins:
  bzrtools /usr/lib/python2.6/site-packages/bzrlib/plugins/bzrtools [2.0.0]
  cvsps_import /home/mvg/.bazaar/plugins/cvsps_import [unknown]
  launchpad /usr/lib/python2.6/site-packages/bzrlib/plugins/launchpad [2.0.0]
  netrc_credential_store /usr/lib/python2.6/site-packages/bzrlib/plugins/netrc_credential_store [2.0.0]
  qbzr /usr/lib/python2.6/site-packages/bzrlib/plugins/qbzr [0.14.0]
  svn /home/mvg/.bazaar/plugins/svn [0.6.4dev]

*** Bazaar has encountered an internal error. This probably indicates a
    bug in Bazaar. You can help us fix it by filing a bug report at
        https://bugs.launchpad.net/bzr/+filebug
    including this traceback and a description of the problem.

Python tries to interpret the string as ASCII even though I configured the CVS encoding as latin1 and my filesystem uses UTF-8. That seems a clear indication that this is not a configuration issue.


Martin von Gagern (gagern) wrote :

I added some debug output. The problem lies in these lines in _get_rcs_filename:
            rcs_file = osutils.pathjoin(self._cvs_root, self._cvs_module,
                                        filename + ',v')
The first two arguments are unicode strings, while the third one is a byte string, so Python implicitly decodes the byte string as ASCII when joining them and fails on the first non-ASCII byte. As far as I know, CVS doesn't particularly care about encodings, and there might well be some legacy files in some Attic whose names are illegal according to the current filesystem encoding. So paths inside the repository should be treated as binary, and the fact that a clean conversion using the current filesystem character set might be impossible should be taken into account as well. I'm thinking about a patch, but have no good solution yet.
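For illustration, here is a minimal sketch of that failure mode and of one way around it. It is written for Python 3, where mixing text and bytes is rejected outright, so the implicit ASCII decode that Python 2 performs is spelled out explicitly. The root, module and file name are hypothetical stand-ins, not the values from the failing repository, and the helper names are made up for this example.

```python
import posixpath

# Hypothetical values; the real paths and file name are not in this report.
cvs_root = u"/home/user/CVSROOT"
cvs_module = u"ModuleName"
raw_name = b"Gr\xf6sse.java"  # latin1-encoded name as read from CVS


def implicit_ascii_join(root, module, name_bytes):
    """Mimic Python 2's coercion: the byte string is decoded as ASCII."""
    return posixpath.join(root, module, name_bytes.decode("ascii") + u",v")


def explicit_join(root, module, name_bytes, encoding):
    """Decode the byte name with a known encoding before joining."""
    return posixpath.join(root, module, name_bytes.decode(encoding) + u",v")


try:
    implicit_ascii_join(cvs_root, cvs_module, raw_name)
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    # exc.start is the offset of the first byte that could not be decoded.
    failing_byte = raw_name[exc.start:exc.start + 1]

# Decoding with the encoding passed via --encoding avoids the error.
rcs_file = explicit_join(cvs_root, cvs_module, raw_name, "latin1")
```

This only covers the case where the name really is in the stated encoding; names that are not valid in any single encoding would still need byte-level handling.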

Martin von Gagern (gagern) wrote :

OK, importing these files is a real problem: bzr itself is unicode-oriented, not byte-oriented, so one has to somehow determine the unicode name of such a file. I guess the real trouble will start if a file was renamed, e.g. from its latin1-encoded form to its UTF-8-encoded form. That might well happen during the development of a project. In that case there would be two different byte strings representing the same unicode string, and the encoding would change over time. A really tough situation. I guess for the moment I'll simply assume that the file name encoding corresponds to the message encoding.
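To make the rename scenario concrete, the sketch below shows one hypothetical name encoded both ways, plus a fallback decoder that tries UTF-8 first and latin1 second. The fallback is a heuristic chosen for this example, not the approach described above (which assumes the message encoding), and `decode_filename` is a made-up helper name.

```python
# Hypothetical file name, not taken from the failing repository.
name = u"Gr\xf6sse.java"

as_latin1 = name.encode("latin1")  # one byte for the umlaut
as_utf8 = name.encode("utf-8")     # two bytes for the same character


def decode_filename(raw, encodings=("utf-8", "latin1")):
    """Return the first successful decoding of raw, trying encodings in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode %r with any of %r" % (raw, encodings))


# Two different byte strings map back to the same unicode name.
same_name = decode_filename(as_latin1) == decode_filename(as_utf8)
```

The order matters: latin1 accepts any byte sequence, so it has to come last, and a latin1 name whose bytes happen to form valid UTF-8 would be misread by this heuristic.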

Martin von Gagern (gagern) wrote :

I've pushed a branch containing a proposed fix for this:
https://code.launchpad.net/~gagern/bzr-cvsps-import/lp437295

The fix was written with Linux in mind, and I don't know whether it would work as well on platforms using unicode file names, like Windows or OS X. But as I believe that non-ascii filename support restricted to Linux is preferable to no such support at all, I'd like to see these changes merged in any case, unless you have serious objections.

Jelmer Vernooij (jelmer)
Changed in bzr-cvsps-import:
status: New → Triaged
importance: Undecided → Medium
Dr Anirban Mitra (mitradranirban) wrote :

My CVS import from savannah.nongnu.org fails with a similar error message.

Dr Anirban Mitra (mitradranirban) wrote :

Attached log file
