Language selection installs too many packages

Bug #1797860 reported by Didier Roche-Tolomelli
This bug affects 1 person
Affects                     Status  Importance  Assigned to  Milestone
language-selector (Ubuntu)  New     Undecided   Unassigned

Bug Description

Multiple issues arise when installing any language in Ubuntu:
1. if you select en_GB, en_US is selected instead
2. if you select fr_FR, fr_FR + en_US is selected
3. As soon as en_US is selected (which is always the case right now), en is then selected, which in turn requests installing all en_* languages.
4. ubiquity, if en_US is selected, only installs the en_US + en packages, but then check-language-support wants to bring back all en_* variants (hunspell-en-au hunspell-en-ca hunspell-en-gb hunspell-en-za hyphen-en-ca hyphen-en-gb libreoffice-help-en-gb libreoffice-l10n-en-gb libreoffice-l10n-en-za mythes-en-au thunderbird-locale-en-gb in cosmic for instance) which were discarded by ubiquity.

The last point is due to /usr/share/language-tools/language-options reporting needing (in the fr_FR default installation for instance):
en_US
fr_FR
en
en_AU
fr
en_GB
en_CA

A big rework/revamp would be needed in language support, account-services and ubiquity, backed up with tests.
Ideally, the seed and check-language-support would always be in sync, the list of packages to install would be strictly regulated by check-language-support (which is supposed to be the case already, but we see in 4. that it's not), and we would limit the number of components disagreeing on which languages are installed/supported.
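
A minimal sketch, in Python, of the kind of consistency check this implies: compare what the installer left on the system with what a support checker asks for. The function name `unsynced_packages` and the sample `installed` set are assumptions made for illustration; the extra packages are the ones listed in point 4 above.

```python
def unsynced_packages(installed, wanted):
    """Return the packages the checker wants but the installer skipped."""
    return sorted(set(wanted) - set(installed))


# Hypothetical set of packages left on the system after an en_US install.
installed = {"hunspell-en-us", "hyphen-en-us", "mythes-en-us"}

# The packages reported as missing in point 4 (cosmic), on top of the
# ones assumed to be installed already.
wanted = installed | {
    "hunspell-en-au", "hunspell-en-ca", "hunspell-en-gb", "hunspell-en-za",
    "hyphen-en-ca", "hyphen-en-gb", "libreoffice-help-en-gb",
    "libreoffice-l10n-en-gb", "libreoffice-l10n-en-za", "mythes-en-au",
    "thunderbird-locale-en-gb",
}

missing = unsynced_packages(installed, wanted)
print(len(missing), "packages out of sync:", " ".join(missing))
```

A test built on such a comparison would fail whenever the installer and the checker disagree, which is the situation described in point 4.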

Of course, we need to take into account the Debian singularity regarding generated locales.

Gunnar Hjalmarsson (gunnarhj) wrote :

Hey Didier,

On 2018-10-15 11:13, Didier Roche wrote:
> 1. if you select en_GB, en_US is selected instead

Yes, if you do a British install, the installer keeps the en_US language support instead. It's bug #1732222.

> 2. if you select fr_FR, fr_FR + en_US is selected

Well, yes. English is always present. It has been that way ever since I started to use Ubuntu in 2010.

> 3. As soon as en_US is selected (which is always right now), en is
> then selected, which in turn requests installing all en_* languages.

I think this is related to the fact that the English language packs include all the English dialects. So if we don't want all the dialects installed always, one way to deal with it would be to split the English language packs into dialect specific ditto.

> 4. ubiquity, if en_US is selected, only install en_US + en packages, but
> then, check-language-support wants to bring back all en_* variants
> (hunspell-en-au hunspell-en-ca hunspell-en-gb hunspell-en-za
> hyphen-en-ca hyphen-en-gb libreoffice-help-en-gb libreoffice-l10n-en-gb
> libreoffice-l10n-en-za mythes-en-au thunderbird-locale-en-gb in cosmic
> for instance) which were discarded by ubiquity.

Yes, that's an obvious inconsistency. My idea for a solution is to make Ubiquity install them all. It's bug #1294858 (please see comment #3).

> The last point is due to /usr/share/language-tools/language-options
> reporting needing (in the fr_FR default installation for instance):
> en_US
> fr_FR
> en
> en_AU
> fr
> en_GB
> en_CA

The idea with that script is to provide a list of options representing available translations (rather than a list with all available locales). Originally it was created as a fix of bug #693337. Personally I like that idea, possibly because I brought it up. :)

I'm surprised to see both fr_FR and fr in that list, though. Only one of them should be there.

But besides that, what's the problem you see with the script?

> A big rework/revamp would be needed in language support,
> account-services and ubiquity, backed up with tests.

I agree there are some loose ends with respect to this area. You point at some of them above; there are a couple of other Ubiquity bugs in my mind as well.

> Ideally, the seed and check-language-support will always be in sync,
> the list of package to install is strictly regulated by
> check-language-support (which is supposed to be the case below, but we
> see in 4. that it's not), and we limit the number of components
> disagreeing in which languages are installed/supported.

I'd be happy to help with achieving a better consistency. I think we need to talk about the approach, though. For instance: Would it be worth it to split the English language packs?

How do we continue?

Didier Roche-Tolomelli (didrocks) wrote :

I think we should answer some questions first in term of fundamentals.

1.
For instance, if I install fr_FR, I don't expect to have fr_CA installed (and it's not the case here), nor fr.
Why would that be different for English? I think en_US should only install en_US + the common (shared) packages. However, that doesn't count as "en" being installed, and doesn't pull en_GB for instance.

^ this is the first inconsistency we should fix IMHO.

1.5 -> fr shouldn't be installed; from what you said, that's just a bug in the Perl script.

2.
I don't understand why we install en_US when selecting !en. If a translation is missing, it will fall back to C, which is the program string, which is most of the time in English. Or, if we really want a fallback, shouldn't that be "en" (always installed then), which doesn't pull any en_*?
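
The fallback to the program string can be illustrated with Python's standard `gettext` module. This is only a sketch of the general gettext behavior, not of how any Ubuntu component is implemented; the domain name `demo-app` and the empty locale directory are placeholders.

```python
import gettext

# Ask for a French catalog in a directory that holds no translations.
# "demo-app" and the locale directory are placeholders for this sketch.
catalog = gettext.translation(
    "demo-app",
    localedir="/nonexistent/locale",
    languages=["fr_FR", "fr"],
    fallback=True,  # return a NullTranslations object instead of raising
)

# With no catalog found, the untranslated program string comes back.
print(catalog.gettext("Open file"))
```

No English language pack is involved in this lookup: the string shown is the one embedded in the program itself.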

3.
Keep in sync the list with the seeds + installer.

I don't know yet whether there will be a new desktop installer (that will be decided soon), but it's something to take into account. At least, we can try solving those first two and get a good use/case direction, what do you think? (There are certainly some points I'm missing by not being an expert though; I'm aware of this and will happily listen to your knowledge there ;))

Gunnar Hjalmarsson (gunnarhj) wrote :

I think we need to distinguish between what's installed and what the language-options script outputs.

Installing a language means that a set of language pack packages are installed. (Let's disregard the language support packages for now.)

So when installing French, the French langpacks are installed with the fr translations. It's the same langpacks whether you install from France, Canada, or somewhere else. There is only one French language option in the installer as well as in language-selector, so you don't really install fr_FR - you install French.

All the French locales are generated as well:

$ locale -a | grep ^fr
fr_BE.utf8
fr_CA.utf8
fr_CH.utf8
fr_FR.utf8
fr_LU.utf8

The reason why the language-options script does not only list fr is that there exist fr_CA translations in the /usr/share/locale folder. So in order to make those dialect translations selectable for those users who prefer them, the language-options script outputs both fr_CA and fr_FR. (It's still fewer options compared to listing all the five locales.)
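
A rough sketch of that selection logic, in Python. It is not the actual language-options script: the function name, the `main_dialect` table and the rule that a language without dialect translations is listed by its bare code are assumptions of this sketch, and special cases (such as English, or names with modifiers like `@latin`) are not modeled.

```python
from collections import defaultdict


def language_options(translation_dirs, main_dialect):
    """List display-language options from translation directory names.

    translation_dirs: names such as "fr", "fr_CA", "de", as they might
        appear in a locale folder.
    main_dialect: hypothetical map from a base language to the locale
        standing for its undialected translations, e.g. {"fr": "fr_FR"}.
    """
    dialects = defaultdict(set)
    bases = set()
    for name in translation_dirs:
        lang, _, country = name.partition("_")
        if country:
            dialects[lang].add(name)
        else:
            bases.add(lang)

    options = set()
    for lang in bases | set(dialects):
        if not dialects[lang]:
            # No dialect translations: offer the bare language code.
            options.add(lang)
        else:
            # Offer each dialect, plus one entry for the base translations.
            options |= dialects[lang]
            if lang in bases:
                options.add(main_dialect.get(lang, lang))
    return sorted(options)


print(language_options(["fr", "fr_CA", "de"], {"fr": "fr_FR"}))
```

With the sample input, French yields two options because a dialect translation exists, while German yields one.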

Most language packs contain only one language dialect. There are a few exceptions:

English
Portuguese
Catalan
(maybe some more which I don't recall right now)

For instance, if you install Portuguese (as spoken in Portugal) or Brazilian Portuguese, you get both, since they are both shipped with the language-pack-pt-base etc. packages. This way of organizing the language packs isn't a natural law, of course. I don't know if there is a rationale behind it. One thing I can think of is that if you want Portuguese, it's not unlikely that you want to use Brazilian Portuguese as fallback language, and this makes a natural fallback language readily available. It's also worth mentioning that installing the langpacks for a language (ll) triggers the generation of all available ll_* UTF-8 locales.

Same with the English dialects, as you already pointed out.
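
The "all available ll_* UTF-8 locales" rule can be sketched as a filter over a list of supported locales. This Python sketch assumes input lines in a `<locale> <charset>` form; the function name and the sample entries are illustrative, and how the langpack packages actually trigger the generation is not shown in this thread.

```python
def utf8_locales(lang, supported_lines):
    """Return the ll_* UTF-8 locale names for a base language code."""
    result = []
    for line in supported_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, charset = parts
        # The trailing "_" keeps "fr" from matching e.g. "fur_IT".
        if charset == "UTF-8" and name.startswith(lang + "_"):
            result.append(name)
    return result


# Illustrative sample, not a complete list of supported locales.
sample = [
    "fr_BE.UTF-8 UTF-8",
    "fr_BE ISO-8859-1",
    "fr_CA.UTF-8 UTF-8",
    "fr_FR.UTF-8 UTF-8",
    "fr_FR@euro ISO-8859-15",
    "fur_IT UTF-8",
    "de_DE.UTF-8 UTF-8",
]

print(utf8_locales("fr", sample))
```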

Given this way to organize the language packs, you could argue that French (Canada) should be included in the French language packs. I'm not sure why it's not, but it may be because too few strings are translated into French (Canada) - a threshold is applied.

Now over to your items in comment #2:

1.
It sounds like you'd like to split the English language packs into dialect specific langpacks. Might make sense; I have no firm opinion yet. One thing which must be cleared in that case is how to deal with locale generation - installing the English language packs triggers the creation of all the en_* UTF-8 locales.

Subscribed Łukasz, since he is currently responsible for the langpack handling.

1.5.
I can't reproduce that fr item. In Cosmic I get:

$ /usr/share/language-tools/language-options | grep ^fr
fr_CA
fr_FR

which is the intended output (see above).

2.
I think we can consider en and en_US to be equivalent in practice.

I can't tell why the English langpacks (and locales) are always kept, when the user picks a non-English language in the installer. Probably it's not necessary, and it sounds like it would make sense to make that change in the installer.

3.
Yes, indeed (assuming you are talking about check-language-support when used by language-selector). I mentioned my idea...

Didier Roche-Tolomelli (didrocks) wrote :

When you say that you select French, or English, and so all dialects should be installed for them, why do we have that separation into different packages then? Shouldn't we only have one "English", "French", … package?

I don't understand the difference between:
$ locale -a | grep ^fr
fr_BE.utf8
fr_CA.utf8
fr_CH.utf8
fr_FR.utf8
fr_LU.utf8

and the script outputting only fr/fr_FR/fr_CA. Is there no translation available in /usr/share/locale for fr_BE, for instance? How does fr_BE differ from fr_FR, given that it has the same currency and time format, no specific strings (as there is no separate langpack) and so on? It seems that a little later in the comment, you reach the same conclusion.
I really think that if we decide to always ship all variants (as the script requires right now), it should be one single package: easier maintenance, list and logic.

On 1.: on the contrary, if we go for installing all dialects when selecting a given language, I would rather pack them all in a single package (well, single as in "per type", keeping dict, libreoffice and such separated). As we require them to be installed on the system anyway, this doesn't make any difference for the user, but makes things easier on our side.

1.5: we can debug that later on, but it seems we want to have "fr" anyway as we have "en", correct? Or we want people to specifically select, like fr_BE (to have the specifics for this locale), but still install all langpacks without having "fr" listed.
So my question in that case would be: what is "en", then? People would rather select a specific one, like en_US to have $ as currency and a weird date format, rather than en_GB, which would use £ and another date format :)

2. -> agreed, we can work towards that. I don't understand though why we would have "en" and "en_US", but not "fr" and "fr_FR", as mentioned in my previous paragraph.

3. Hum, I still don't understand the "if we decide to split the English langpacks", they are already split. The original issue which triggered this discussion is that on a fresh install, check-language-support complains about missing:
"hunspell-en-au hunspell-en-ca hunspell-en-gb hunspell-en-za hyphen-en-ca hyphen-en-gb libreoffice-help-en-gb libreoffice-l10n-en-gb libreoffice-l10n-en-za mythes-en-au thunderbird-locale-en-gb"

(I'm not only talking about the main langpacks, but for everything we split in langpacks: dictionaries, libreoffice, main applications…)

Gunnar Hjalmarsson (gunnarhj) wrote :

On 2018-10-17 08:14, Didier Roche wrote:
> When you are telling that you select French, or English, and so, all
> dialects should be installed for them, why do we have that separation
> in different packages though? Shouldn't we only have one "English",
> "French", … package?

I now understand that we have been talking past each other with respect to the definition of "language packs".

With language packs I usually refer to the Ubuntu specific language packs which provide translations from LP. So for English, for instance, there is one set of language packs:

language-pack-en-base
language-pack-en
language-pack-gnome-en-base
language-pack-gnome-en

Such a set includes all the dialects of a language if there are more than one.

Also, the "Installed Languages" window in Language Support checks for those very packages when determining whether a language is installed or not.

The other language related packages (spell checking, firefox translations, input methods...) I usually call "language support". Some of those are indeed split into dialects, which mostly is the result of how Debian organizes it in e.g. libreoffice-dictionaries.

> I don't understand the difference between:
> $ locale -a | grep ^fr
> fr_BE.utf8
> fr_CA.utf8
> fr_CH.utf8
> fr_FR.utf8
> fr_LU.utf8
>
> and the script outputting only fr/fr_FR/fr_CA. There is no
> translation available in /usr/share/locale for fr_BE for instance?

For me there isn't. If some fr_BE translation would end up in /usr/share/locale due to some universe package, the script would include fr_BE too in the output. The output is dynamically generated.

> How this does differ fr_BE, from fr_FR, as it's the same currency,
> time format and no specific string (as no separate langpack) and so
> on?

They probably don't differ much. When you install French, all the French UTF-8 locales provided by glibc are installed. There is no mechanism in place to determine their usefulness.

Please note that the language-options script serves the purpose of providing a list of languages only, i.e. it lets the user select the display language. That should be distinguished from selecting the locale for regional formats. For the latter purpose the user is offered an option list consisting of all the generated UTF-8 locales.

> I really think that if we decide to always ship all variants (as the
> script requires right now),

That script isn't the center of it. Its only purpose is to provide a list of options for selecting the display language which consists of the installed translations.

So again, it's the "language packs" (see "my" definition above) which make it sensible (IMHO) to ship all language support variants.

It may be worth mentioning that many cycles ago, the Language Support GUI had the ability to distinguish between translations, spell checking, writing aids etc., but that was dropped. I think you'd need to install Lucid to check out what that looked like. :)

> it should be one single package: easier
> maintenance, list and logic.

Are you talking about creating meta packages to pull the language support instead of what's currently in the seed and in language-selector's pkg_depends? If so, I can see the advantage with...

Didier Roche-Tolomelli (didrocks) wrote :

Yeah, sorry that we had different terms for the same things :) Ok, I understand now as well. I'm trying to look at this whole thing globally and not from the technical split (which is a little bit artificial) that we did.

> Please note that the language-options script serves the purpose of providing a list of languages only, i.e. it lets the user select the display language. That should be distinguished from selecting the locale for regional formats. For the latter purpose the user is offered an option list consisting of all the generated UTF-8 locales.

I'm not sure I understand this split, but that's probably due to my inexperience. Let's take it from the user's point of view:
- they have a list of languages, let's say (randomly :p) that I select French here
- then I have a list of countries, I select "France", then I have Euro and a particular date format configured.

The second part is pre-configured based on the country I selected, but this is basically fr_FR default, correct?

If I select Portuguese (Brazil), I would have Portuguese (+ Brazil dialect installed) and BRL currency + their date format.

Then, of course, I can mix and match in some other UI to change the currency and have Portuguese (Brazil + USD + european date format) if I want.

If I understand you correctly, we need to have locales generated on disk to know about the available variants, like _foo?

I'm trying to see how we can rationalize things under the hypothesis of a new installer, ensuring that both it and GNOME Control Center (which isn't in a very good state regarding displaying locales) can be enhanced, rather than focusing on "there is this script or that script". Does that make sense?
One of the issues in Control Center is that it expects to have all locales generated in order to display them, IIRC, for the currency, language and date options. That's correct, isn't it?

> Are you talking about creating meta packages to pull the language support instead of what's currently in the seed and in language-selector's pkg_depends?
I don't really know, it's an open question. I wonder how we can minimize things and have the best layout for what we want to achieve.
(Not quoting your whole sentence, but ack on not diverging from Debian for the dictionaries, for instance.)

>> Or we want people to specifically
>> select, like fr_BE (to have the specifics for this locale), but still
>> install all langpacks without having "fr" listed.
> That's how it currently works. I think it makes sense. (Well, for languages without alternative dialects present, like German or Swedish, the list only shows "de" and "sv" respectively.)

Ok, so we always include the "base language", and even if some packages (like libreoffice-dictionaries, thunderbird) are split by regional settings (due to Debian), we install them all, considering that the impact on installed size is minimal.
We would thus change "en" to apply the same semantics, and have it included as soon as en or any en_* option is selected, and thus install all en_* (which is what check-language-support wants to do already, but not ubiquity…). That would prevent bugs like #1732222 from existing. For the "sync with ubiquity", the all_langpacks sounds like a good solution, do you mind doing a M...

Gunnar Hjalmarsson (gunnarhj) wrote :

On 2018-10-22 11:19, Didier Roche wrote:
> Gunnar Hjalmarsson wrote:
>> Please note that the language-options script serves the purpose of
>> providing a list of languages only, i.e. it lets the user select
>> the display language. That should be distinguished from selecting
>> the locale for regional formats. For the latter purpose the user is
>> offered an option list consisting of all the generated UTF-8
>> locales.
>
> I'm unsure to understand this split; but that's probably due to my
> inexperience.

There are far more locales than translations. The most extreme languages are English, Spanish and Arabic which are all represented by a large number of locales. The script simply restricts the options shown to the users to the available translations instead of showing a long list of locales. Thus, when selecting "language", the user is shown the installed translations, and when selecting "regional format", the user is shown the generated locales.

> Let's take from the user's point of view:
> - they have a list of language, let's say (randomly :p) that I select
> French here
> - then I have a list of countries, I select "France", then I have
> Euro and a particular date format configured.
>
> The second part is pre-configured based on the country I selected,
> but this is basically fr_FR default, correct?

Well, if you are talking about the installer now, the user is currently not offered the option to explicitly select a country. The installer uses the time zone location to 'guess' the user's preferences with respect to currency, date formats etc. and picks a locale for regional formats accordingly.

> If I select Portuguese (Brazil), I would have Portuguese (+Brazil
> dialect installed) and BRL currency + their date format.

If you install from France (i.e. select a French time zone location) you'd still have the fr_FR locale for currency, date format etc.
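
The behavior described here can be sketched as a small lookup in Python. This is a guess at the shape of the logic, not the installer's code: the function name, the `known_locales` set and the `country_default` table are all hypothetical.

```python
def regional_locale(language, tz_country, known_locales, country_default):
    """Pick a locale for regional formats from language and time zone country.

    Prefer the selected language as used in that country; otherwise fall
    back to a default locale for the country (hypothetical table).
    """
    candidate = f"{language}_{tz_country}"
    if candidate in known_locales:
        return candidate
    return country_default.get(tz_country)


# Illustrative data only.
known_locales = {"fr_FR", "fr_BE", "fr_CA", "pt_BR", "pt_PT"}
country_default = {"FR": "fr_FR", "BR": "pt_BR"}

print(regional_locale("fr", "BE", known_locales, country_default))
print(regional_locale("pt", "FR", known_locales, country_default))
```

The second call mirrors the example above: Portuguese as display language combined with a French time zone location ends up with fr_FR for regional formats.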

> Then, of course, I can mix and match in some other UI to change the
> currency and have Portuguese (Brazil + USD + european date format) if
> I want.

Yes. Please note, though, that the UIs only allow you to distinguish between display language and regional formats. For a more fine tuned use of the available locale categories, for instance USD for currency and ISO 8601 like date format, you need to use the terminal. (Kubuntu is an exception in this respect; its UI can be used to specify each locale category.)
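
As a sketch of what such fine tuning amounts to, the Python snippet below builds the environment settings for a display language with per-category overrides. The helper name is an assumption, `en_DK.UTF-8` is assumed here to give ISO 8601-like dates, and every locale named must already be generated on the system for the settings to take effect.

```python
def locale_settings(display, **categories):
    """Build locale environment settings with per-category overrides."""
    settings = {"LANG": display}
    for category, value in categories.items():
        if not category.startswith("LC_"):
            raise ValueError(f"not a locale category: {category}")
        settings[category] = value
    return settings


settings = locale_settings(
    "pt_BR.UTF-8",
    LC_MONETARY="en_US.UTF-8",  # USD for currency
    LC_TIME="en_DK.UTF-8",      # assumed to give ISO 8601-like dates
)
for name, value in settings.items():
    print(f"{name}={value}")
```

The printed `NAME=value` lines are in the form that a shell or a locale configuration file accepts.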

> If I understand you correctly, we need to have locales generated on
> disk to know about the available variants, like _foo?

Yes, that's how it currently works in Ubuntu. Locales are generated in two ways:

- At installation of Ubuntu's language packs
- Separately by the installer if needed to pick a locale for regional formats if your time zone location does not match the selected language.

I think that g-c-c in vanilla GNOME shows all locales, whether generated or not, and creates the one you select if needed. But OTOH GNOME does not have our language packs; all translations on GNOME are provided by the respective application packages. So our use of language packs (the ones with LP translations) explains this difference in approach.

> I'm trying to see how we can rationaliz...

Didier Roche-Tolomelli (didrocks) wrote :

Hey Gunnar,

After reorganizing the seeds for Ubuntu desktop and seeing which strategy we are going to take for the new installer (negative layered langpacks, I can expand on this a little bit later), I think we have a robust story to avoid the mismatch we have between language-selector and the installer/live image default list.

In disco (note: not available in the daily image which isn't including the langpacks until our livecd-rootfs MP is in production: https://code.launchpad.net/~jibel/livecd-rootfs/add_multi_layered_squashfses_support/+merge/358490)
* Basically, the default language selections now live in the Ubuntu desktop flavor seed:
https://git.launchpad.net/~ubuntu-core-dev/ubuntu-seeds/+git/ubuntu/tree/languages. There are 2 seeds for each language:
- the minimal one, corresponding to the new ubuntu-desktop-minimal package: https://git.launchpad.net/~ubuntu-core-dev/ubuntu-seeds/+git/ubuntu/tree/languages/desktop-minimal-de
- the "full" one, which, in addition to the desktop-minimal one has the dictionaries, libreoffice, thunderbird and mozilla translations: https://git.launchpad.net/~ubuntu-core-dev/ubuntu-seeds/+git/ubuntu/tree/languages/desktop-de

You can see that input methods are included in some, like the Chinese one: https://git.launchpad.net/~ubuntu-core-dev/ubuntu-seeds/+git/ubuntu/tree/languages/desktop-minimal-zh (we may have some wrong assignments; if you spot anything, feel free to point it out!)

So, the idea would be:
- have, for the default languages, language-selector on ubuntu-desktop using those seeds (probably picking the list at package build time?). It needs to adapt depending on the installed metapackage (ubuntu-desktop-minimal or ubuntu-desktop). Note that ubuntu-desktop installs ubuntu-desktop-minimal.
- expand the heuristic for other languages for ubuntu-desktop-minimal/ubuntu-desktop. We can maybe include the regexp in the seed if that helps? That way, we have a single place where we define all languages support.
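
A minimal sketch, in Python, of how a package list could be derived from the two seeds per language depending on the installed metapackage. The seed names follow the `desktop-minimal-<lang>` / `desktop-<lang>` pattern of the links above; the function names and the seed contents are hypothetical, and the real lists live in the seed files.

```python
def seed_names(lang, metapackage):
    """Return the language seeds that apply to an installed metapackage."""
    seeds = [f"desktop-minimal-{lang}"]
    if metapackage == "ubuntu-desktop":
        # ubuntu-desktop installs ubuntu-desktop-minimal, so the full seed
        # is applied on top of the minimal one.
        seeds.append(f"desktop-{lang}")
    elif metapackage != "ubuntu-desktop-minimal":
        raise ValueError(f"unknown metapackage: {metapackage}")
    return seeds


def support_packages(lang, metapackage, seed_contents):
    """Collect the packages from every seed that applies."""
    packages = []
    for seed in seed_names(lang, metapackage):
        packages.extend(seed_contents.get(seed, []))
    return packages


# Hypothetical seed contents, for illustration only.
seed_contents = {
    "desktop-minimal-de": ["language-pack-de", "language-pack-gnome-de"],
    "desktop-de": ["hunspell-de-de", "libreoffice-l10n-de"],
}

print(support_packages("de", "ubuntu-desktop-minimal", seed_contents))
print(support_packages("de", "ubuntu-desktop", seed_contents))
```

Keeping the lookup keyed on the seed names is one way to have a single place that defines language support for both the installer and language-selector.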

In addition to this:
- we remove the magic for "en" (which forces installing them too often). It isn't required when installing another language
- installing one language should, as it does today, install all "deviations", like fr installs fr_FR, fr_CA, fr_BE, even if those are part of other binary packages. The default languages seed should already cover that.

The installer would then only rely on language-support to ensure everything installed is correctly there.

What do you think? Do you have some time to work on this for this cycle?
