want a way to mark transport operations as cacheable

Bug #120697 reported by Reinhard Tartler
This bug affects 2 people
Affects       Status        Importance  Assigned to  Milestone
Bazaar        Confirmed     Medium      Unassigned
bzr (Debian)  Fix Released  Unknown

Bug Description

Transport.get (and readv, etc.) should take a flag indicating that the requested file is cacheable. This should pass through to the HTTP client, and it should be set when reading pack and index files.
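
For illustration only, a minimal sketch of how such a flag could be plumbed through an HTTP transport follows. All names below are hypothetical rather than the actual bzrlib API; the 'no-cache' directive is the one the discussion below says bzr sends unconditionally today.

    class Transport(object):
        """Abstract transport; only the relevant method is sketched."""

        def get(self, relpath, cacheable=False):
            """Return a file-like object for relpath.

            :param cacheable: if True, the caller asserts that the file is
                never modified once written (e.g. pack and index files), so
                intermediate HTTP caches may safely serve it.
            """
            raise NotImplementedError(self.get)


    class HttpTransport(Transport):

        def get(self, relpath, cacheable=False):
            headers = {}
            if not cacheable:
                # Today's unconditional behaviour: defeat possibly-buggy proxies.
                headers['Cache-Control'] = 'no-cache'
                headers['Pragma'] = 'no-cache'
            # _perform_request stands in for the real HTTP request machinery.
            return self._perform_request('GET', relpath, headers)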

----

This is actually Debian bug #396227:

From: Roland Mas <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: bzr: Very network-unfriendly
Date: Mon, 30 Oct 2006 18:20:02 +0100
Package: bzr
Version: 0.11-1
Severity: normal

bzr seems to deliberately eat my bandwidth. The following two
invocations of bzr have been run in a row, after the $http_proxy and
$HTTP_PROXY environment variables have been set up to point to a
(working) squid proxy/cache. As you can see, there's no gain in using
a cache:

,----
| roland@mirexpress:/tmp$ time bzr branch http://bazaar-ng.org/bzr/bzr.dev
| Bzrtools is not up to date with installed bzr version 0.11.0.
| There should be a newer version available, e.g. 0.11.
| Branched 2100 revision(s).
|
| real 23m28.351s
| user 0m31.538s
| sys 0m1.880s
| roland@mirexpress:/tmp$ mv bzr.dev/ bzr.dev.bak
| roland@mirexpress:/tmp$ time bzr branch http://bazaar-ng.org/bzr/bzr.dev
| Bzrtools is not up to date with installed bzr version 0.11.0.
| There should be a newer version available, e.g. 0.11.
| Branched 2100 revision(s).
|
| real 23m58.854s
| user 0m32.130s
| sys 0m1.740s
| roland@mirexpress:/tmp$
`----

The Squid logs lots of TCP_MISS and TCP_CLIENT_REFRESH_MISS entries.
According to the Squid documentation, this means that "The client
issued a "no-cache" pragma, or some analogous cache control command
along with the request. Thus, the cache has to refetch the object."

Bzr is slow enough already; bypassing proxy caches is not good. One
could, with lots of goodwill, argue the case for initial branching,
but re-fetching more than a megabyte afterwards for each bzr pull (one
in each branch) is stretching it.

Tags: http transport
Changed in bzr:
status: Unknown → Confirmed
Changed in bzr:
status: Confirmed → Fix Released
Martin Pool (mbp)
Changed in bzr:
importance: Undecided → Medium
status: New → Confirmed
Vincent Ladeuil (vila)
Changed in bzr:
assignee: nobody → v-ladeuil
Vincent Ladeuil (vila)
Changed in bzr:
status: Confirmed → Fix Committed
Revision history for this message
Vincent Ladeuil (vila) wrote : Re: bzr shouldn't bypass http caches

Buggy proxies (returning stale data) have been encountered in the past.
If a cache gives us older versions of some files and newer versions of others, we'll see the repository as broken.
Therefore, bzr explicitly bypasses caches by using a 'no-cache' directive, as Roland rightly deduced.

The attached patch disables that behavior. Use at your own risk.

Revision history for this message
Vincent Ladeuil (vila) wrote :

There is no reliable way to detect whether or not a proxy will behave correctly regarding caching, so bzr will continue to use 'no-cache' directives.

But we intend to make it more network-friendly by other means.

Changed in bzr:
status: Fix Committed → Won't Fix
Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 120697] Re: bzr shouldn't bypass http caches

On Sun, 2007-11-25 at 13:58 +0000, vila wrote:
> There are no reliable ways to detect whether or not a proxy will behave
> correctly regarding caching so bzr will continue to use 'no-cache'
> directives.
>
> But we intend to make it more network-friendly by other means.

I think we should keep this open a bit longer.

Specifically, the current code assumes all files are equally hard to
cache. This is bogus.

e.g.
foo.knit: not safe to cache
foo.kndx: not safe to cache
branch.conf: not safe to cache
3847868641027641243.pack: safe to cache
3847868641027641243.iix: safe to cache
branch-format: safe to cache
format: safe to cache
lock/*: not safe to cache.

etc.

-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Revision history for this message
Vincent Ladeuil (vila) wrote : Re: bzr shouldn't bypass http caches

Right, bug reopened then.

But the transport layer will not identify such files by itself.

Changed in bzr:
status: Won't Fix → Triaged
Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 120697] Re: bzr shouldn't bypass http caches

Right, so this should be a per-file option, defaulting to 'don't
cache'. It could be done as an option on each read, as an option on
the transport, or by introducing a new method to open a file for
reading and making this part of the state of that object.
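
As a rough illustration of the last option, a reader object that carries the decision might look like this (entirely hypothetical names, not part of the real Transport API):

    class CacheableFile(object):
        """File-like handle whose cacheability is fixed when it is opened."""

        def __init__(self, transport, relpath, cacheable):
            self._transport = transport
            self._relpath = relpath
            self.cacheable = cacheable

        def read(self):
            # The transport consults self.cacheable when it builds the
            # underlying HTTP request.
            return self._transport._get_bytes(self._relpath, self.cacheable)

    # Usage (hypothetical opener on Transport):
    #   f = transport.open_for_read('3847868641027641243.pack', cacheable=True)
    #   data = f.read()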

Revision history for this message
John A Meinel (jameinel) wrote : Re: bzr shouldn't bypass http caches

As an aside, is squid smart enough to cache readv() requests? Does it just request the whole file, and then return the pieces?

Otherwise, I'm guessing that Bazaar is still going to be unfriendly to caches, because it makes requests for pieces of the files. (which is sometimes the whole file, but the proxy wouldn't know whether 0-1111 is the whole file, or if there is a byte 1112.)

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 120697] Re: bzr shouldn't bypass http caches

On Mon, 2007-11-26 at 20:41 +0000, John A Meinel wrote:
> As an aside, is squid smart enough to cache readv() requests? Does it
> just request the whole file, and then return the pieces?
>
> Otherwise, I'm guessing that Bazaar is still going to be unfriendly to
> caches, because it makes requests for pieces of the files. (which is
> sometimes the whole file, but the proxy wouldn't know whether 0-1111 is
> the whole file, or if there is a byte 1112.)

A sufficiently complex range will cause squid to read the whole file
and serve later responses from cache (unless cache-busting headers are
used).

If cache-busting headers are used by the client, it does not attempt to
process the ranges at all and just forwards them upstream.

-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

Robert Collins wrote:
> I think we should keep this open a bit longer.
>
> Specifically, the current code assumes all files are equally hard to
> cache. This is bogus.
>
> e.g.
> foo.knit: not safe to cache
> foo.kndx: not safe to cache
> branch.conf: not safe to cache
> 3847868641027641243.pack: safe to cache
> 3847868641027641243.iix: safe to cache
> branch-format: safe to cache
> format: safe to cache
> lock/*: not safe to cache.
>
> etc.

branch-format and format aren't really safe to cache. What if someone
runs 'bzr upgrade'?
--

Revision history for this message
Robert Collins (lifeless) wrote :

On Wed, 2007-11-28 at 00:56 +0000, Matt Nordhoff wrote:
> Robert Collins wrote:
> > I think we should keep this open a bit longer.
> >
> > Specifically, the current code assumes all files are equally hard to
> > cache. This is bogus.
> >
> > e.g.
> > foo.knit: not safe to cache
> > foo.kndx: not safe to cache
> > branch.conf: not safe to cache
> > 3847868641027641243.pack: safe to cache
> > 3847868641027641243.iix: safe to cache
> > branch-format: safe to cache
> > format: safe to cache
> > lock/*: not safe to cache.
> >
> > etc.
>
> branch-format and format aren't really safe to cache. What if someone
> runs 'bzr upgrade'?

Well, there's safe and safe, I guess.

On the one hand, those files change very rarely.

On the other, they matter quite a bit :).

-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Revision history for this message
Vincent Ladeuil (vila) wrote : Re: bzr shouldn't bypass http caches

lifeless said on IRC:

  http://www.squid-cache.org/Versions/v2/2.6/changesets/11996.patch
  you need http://www.squid-cache.org/Versions/v2/2.6/changesets/SQUID_2_6_STABLE19.html

I wonder if detecting squid versions with the bug would be useful, so we could give a warning.

James Westby (james-w)
Changed in bzr:
status: Triaged → Fix Released
Vincent Ladeuil (vila)
Changed in bzr:
status: Fix Released → Triaged
Revision history for this message
Martin Pool (mbp) wrote :

Debian says this was fixed by the introduction of the packs format, which is a bit weird. Oh well.

It looks like the remaining part of this bug is that we want a way to say, on a particular Transport operation, that the file we're reading is potentially cacheable. This would be safe to set if the file is never modified once it has been written, which is true for pack and index contents.

I'm going to mark Vincent's old patch as "not a patch", as it's not actually a foundation for closing this bug.

summary: - bzr shouldn't bypass http caches
+ want a way to mark transport operations as cacheable
Changed in bzr:
status: Triaged → Confirmed
description: updated
tags: added: http transport
Revision history for this message
Robert Collins (lifeless) wrote :

I propose the following:
 - we add a flag 'permit_cached' to transport.get, get_bytes, readv.
 - that flag gets passed down the stack to the header functions vila changed in his draft patch.
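
Sketched very roughly, with everything except the 'permit_cached' name made up and the header helper merely standing in for whatever the draft patch actually touched, the flag would travel from the public calls down to where the request headers are built:

    def _cache_headers(permit_cached):
        """Return the cache-related request headers for one operation."""
        if permit_cached:
            return {}  # no directive: let proxies such as squid do their job
        # Today's unconditional behaviour.
        return {'Cache-Control': 'no-cache', 'Pragma': 'no-cache'}


    class HttpTransport(object):
        # Only the cache plumbing is shown; the real methods take more
        # parameters and coalesce ranges before issuing the request.

        def get_bytes(self, relpath, permit_cached=False):
            return self._perform_request(  # hypothetical request helper
                'GET', relpath, _cache_headers(permit_cached))

        def readv(self, relpath, offsets, permit_cached=False):
            # offsets is a sequence of (start, length) pairs.
            headers = _cache_headers(permit_cached)
            headers['Range'] = 'bytes=' + ','.join(
                '%d-%d' % (start, start + length - 1)
                for (start, length) in offsets)
            return self._perform_request('GET', relpath, headers)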

Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 120697] Re: want a way to mark transport operations as cacheable

On 3 April 2010 08:28, Robert Collins <email address hidden> wrote:
> I propose the following:
>  - we add a flag 'permit_cached' to transport.get, get_bytes, readv.

I wonder if this ought to be richer than a boolean: must-revalidate vs
disable caches altogether, and perhaps the concept of
if-modified-since (though bzr itself may not be able to use that yet.)

--
Martin <http://launchpad.net/~mbp/>
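
If the flag did grow into a small policy, the mapping onto ordinary HTTP/1.1 request directives might look something like this (the policy constants are hypothetical; only the header values are standard). An If-Modified-Since scheme would additionally require bzr to remember dates or ETags from earlier responses, so it is left out of the sketch.

    # Hypothetical cache policies and the request headers they would imply.
    CACHE_FORBID = 'forbid'          # today's default for every request
    CACHE_REVALIDATE = 'revalidate'  # a cached copy is fine once revalidated
    CACHE_OK = 'ok'                  # immutable content such as pack files

    _POLICY_HEADERS = {
        # Pragma plus no-cache: most proxies go back to the origin entirely.
        CACHE_FORBID: {'Cache-Control': 'no-cache', 'Pragma': 'no-cache'},
        # The proxy may serve its copy after revalidating it upstream.
        CACHE_REVALIDATE: {'Cache-Control': 'max-age=0'},
        # No directive at all: the proxy applies its normal freshness rules.
        CACHE_OK: {},
    }

    def cache_headers(policy=CACHE_FORBID):
        return dict(_POLICY_HEADERS[policy])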

Revision history for this message
Robert Collins (lifeless) wrote :

We could make it richer yes; do we need that though?

Revision history for this message
Martin Pool (mbp) wrote :

On 6 April 2010 15:25, Robert Collins <email address hidden> wrote:
> We could make it richer yes; do we need that though?

I don't know, but it seems possible. Some files are just about
guaranteed to be immutable once written, like pack files. Some are
modified but otherwise cacheable, and in that case must-revalidate
would make sense.

At any rate, I wouldn't want a proliferation of options on every single call.

Perhaps it should be per-transport state, so that you can create
different transport objects. Or a dict of options per call.

--
Martin <http://launchpad.net/~mbp/>
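
The per-transport variant could then be as small as deriving a sibling transport with a different default. 'with_cache_policy' below is purely illustrative and reuses cache_headers from the policy sketch above:

    class HttpTransport(object):

        def __init__(self, base, cache_policy='forbid'):
            self.base = base
            self._cache_policy = cache_policy

        def with_cache_policy(self, policy):
            # Hypothetical: a sibling transport sharing the same connection
            # but applying a different default cache policy to every read.
            return HttpTransport(self.base, cache_policy=policy)

        def get_bytes(self, relpath):
            headers = cache_headers(self._cache_policy)  # from the sketch above
            return self._perform_request('GET', relpath, headers)  # hypothetical

    # Code that only ever reads pack and index files would then do e.g.:
    #   pack_transport = repo_transport.with_cache_policy('ok')
    #   data = pack_transport.get_bytes('3847868641027641243.pack')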

Vincent Ladeuil (vila)
Changed in bzr:
assignee: Vincent Ladeuil (vila) → nobody
Revision history for this message
Eric Seigne (eric-seigne) wrote :

Wow! Thanks to the patch Vincent Ladeuil (vila) posted on 2007-11-21 (Re: bzr shouldn't bypass http caches):

Before:

time bzr branch http://url -Dbytes
Branched 8 revisions.
Transferred: 6470kB (185.6kB/s r:6444kB w:26kB)

real 0m35.063s
user 0m0.688s
sys 0m0.168s

After:

time bzr branch http://url -Dbytes
Branched 8 revisions.
Transferred: 6469kB (431.2kB/s r:6445kB w:24kB)

real 0m15.226s
user 0m0.620s
sys 0m0.128s

bzr --version
Bazaar (bzr) 2.5.0
  Python interpreter: /usr/bin/python 2.6.5
  Python standard library: /usr/lib/python2.6
  Platform: Linux-3.0.0-17-generic-x86_64-with-Ubuntu-10.04-lucid
  bzrlib: /usr/lib/pymodules/python2.6/bzrlib
  Bazaar configuration: /home/utilisateurs/eric.seigne/.bazaar
  Bazaar log file: /home/utilisateurs/eric.seigne/.bzr.log

Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
Jelmer Vernooij (jelmer)
tags: removed: check-for-breezy