Bug #321181 “Attribute searches are inefficient” : Bugs : Hypy

Revision history for this message

Cory Dodt (corydodt) wrote on 2009-01-25:

#1

Jamie: could you add a simple timing benchmark to this case so I can reproduce it more easily?

Revision history for this message

Cory Dodt (corydodt) wrote on 2009-01-25:

#2

Yes, please. It's important to be able to reproduce the bug using the same conditions that you saw so I know when I'm done fixing it. :-)

> Well, I have a moderately large benchmark, involving the creation of a
> 100k random documents from a dictionary of words, but it takes more than 1
> day to populate the data. However, it's not automated, so it may not fit
> your needs; with a big enough data set, you can count the seconds by hand
> (i.e, without using timeit) (I saw times to search drop from fifteen
> seconds to seven seconds.)
>
> If you'd still like it, I'll upload it to launchpad! (sorry for not
> opening a bug for the original patch.)

Revision history for this message

Jamie (jamieson-becker) wrote on 2009-01-29:

#3

Just a simple set of functions (wordlist included, could be supplemented with /usr/share/dict/words) to create random entries and insert into hypy to quickly and easily test searches etc across a large dataset.

Revision history for this message

Jamie (jamieson-becker) wrote on 2009-01-29:

#4

Build quick test dataset Edit (5.1 MiB, application/x-tar)

Very ugly code that just inserts a lot of records into the database. See README

Revision history for this message

Jamie (jamieson-becker) wrote on 2009-01-29:

#5

.. and the second version actually works ;-)

Revision history for this message

Cory Dodt (corydodt) wrote on 2009-02-20:

#6

Hi Jamieson,

I guess I need more information here.

- the testing.py code doesn't express any actual tests, so I don't know what you're comparing to what. What exact call is slow with the code the way it is, and would be made faster by your patch?

- I tried your suggested patch and it seems to actually slow attribute searches down significantly. I added this to the bottom of testing.py:

if __name__ == '__main__':
loadup(10)

    t1 = time.time()
    search(u"CVE")
    t2 = time.time()
    search(attrs=[u"@title STRINC code"]) # **
    # search("*.**", attrs=[u"@title STRINC code"]) # **
    t3 = time.time()

print '%s seconds to search CVE, %s seconds to STRINC @title' % (t2-t1,
t3-t2)

I modified the ** line, and the code. Below, "unpatched" means original code, and "patched" means with your default-phrase modification. *.** means I used the search line with *.**, and "" means I used the other one.

I didn't test unpatched "" because it returns no results. Times shown are the best run out of about 3 for each, but the scale was consistent, each test relative to the others.

(unpatched *.**)
0.561604976654 seconds to search CVE, 0.153902053833 seconds to STRINC @title
(patched *.**)
0.429752826691 seconds to search CVE, 0.199048995972 seconds to STRINC @title
(patched "")
0.627918958664 seconds to search CVE, 0.226984024048 seconds to STRINC @title

The slowest one was the patched "" version.

Revision history for this message

Cory Dodt (corydodt) wrote on 2009-02-20:

#7

OH. It was getting progressively slower each time, a very unexpected result, and I realized it wasn't cleaning up /tmp between each run. Fixed that, now I get these results:

(unp *.**)
0.0435090065002 seconds to search CVE, 0.0187568664551 seconds to STRINC @title
(p *.**)
0.0455119609833 seconds to search CVE, 0.0256581306458 seconds to STRINC @title
(p "")
0.0446078777313 seconds to search CVE, 0.0117180347443 seconds to STRINC @title

Which is consistent with what you described. So I will make this change. Not sure whether I want timing tests in the unit tests, but I will need to add a test for both "" and "*.**" attribute searches.

Changed in hypy:
assignee:	nobody → corydodt
importance:	Undecided → Medium
milestone:	none → 0.9
status:	New → Confirmed

Revision history for this message

Jamie (jamieson-becker) wrote on 2009-02-20: Re: [Bug 321181] Re: Attribute searches are inefficient

#8

Download full text (4.7 KiB)

Thanks for the excellent questions and thank you for taking the time to
investigate this so thoroughly!

That two line patch makes it possible to search HyPy without searching the
body fulltext for, well, everything: *.**

This is behavior that was exposed in the original Hyper Estraier C
interface but it looks like it was removed in order to simplify the
interface. The C interface does not run a body search unless you specify
it. It will search attributes only if you only request those, which is
clearly going to be faster (and is, as you'll see in tests)

The test code doesn't have any tests at all -- it's a stopwatch test. You
could use the UNIX time command to automate this a bit more or the timing
as you configured below. (timing tests are hard to express in a unittest,
of course, because you'd have to run it against a large enough corpus to
be meaningful and try to predict a ratio where the test would succeed or
fail, or just hardcode an expected lapsed time. Both are hard to do, but
they're not needed for this change since it's a basic architectural
change, not a change that can or should be tested for.)

It actually makes sense if you think about it: if you had to look through
a book's chapter titles for instances of the word "CVE", it would be
faster to look in the table of contents than to flip through and read each
page.

Here's how I ran the test. I think there's a README there somewhere;

1. Use the package to generate a very large (multi-gigabyte) test set and
inject into Hyper Estraier for full-text search. This will probably take a
day or two, so you may want to nice -19 19 python before starting it (if
it's on your desktop) and maybe do it in a gnu screen session. loadup(10)
is too small of a sample to get meaningful results, I used 200000. (i.e.,
simulating 200k documents in the engine)

2. Then use the search() command on an unmodified HyPy to search for
attributes and full text.

3. Modify HyPy to remove the requirement to search the whole body (i.e.,
both phrase="" and if phrase:)

4. run the search again and compare the two results. The second results
will be approximately half the number of seconds that the first result
took, give or take. The larger the sample, the bigger the difference in
timing.

I agree, you can't pass "" without also modifying hypy by adding the if
phrase:, or Hyper Estraier will try to search the whole body for nothing
and always return no results.

> if phrase:

If you still don't see a difference, let me know

-Jamieson

> Hi Jamieson,
>
> I guess I need more information here.
>
> - the testing.py code doesn't express any actual tests, so I don't know
> what you're comparing to what. What exact call is slow with the code
> the way it is, and would be made faster by your patch?
>
> - I tried your suggested patch and it seems to actually slow attribute
> searches down significantly. I added this to the bottom of testing.py:
>
>
> if __name__ == '__main__':
> loadup(10)
>
> t1 = time.time()
> search(u"CVE")
> t2 = time.time()
> search(attrs=[u"@title STRINC code"]) # **
> # search("*.**", attrs=[u"@title STRINC code"]) # **
> t3 = time.time()
>
> print '%...