Add new aggregate API

Bug #670358 reported by Seif Lotfy
This bug affects 1 person
Affects: Zeitgeist Framework
Status: Invalid
Importance: Wishlist
Assigned to: Siegfried Gevatter
Milestone: none

Bug Description

The Zeitgeist API can give us individual events, but not statistics over them.
Use case:
- Give me counts of every subject_text from actor = Unity
Currently, to do that one needs to:
Request all events with Unity as the actor and count the subject_text values on the client side
This can be done much better IMHO
I am thinking of an Aggregation Extension.
But before we hack on that we need to agree on whether it's necessary and what the API will look like.
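
For illustration, the client-side workaround described above might look like the following sketch. Events are modelled as plain dicts, and `fetch_events_for_actor` is a hypothetical stand-in for a real Zeitgeist query, not part of any actual API.

```python
from collections import Counter

# Hypothetical in-memory event log; a real client would get these from Zeitgeist.
EVENTS = [
    {"actor": "Unity", "subject_text": "report.odt"},
    {"actor": "Unity", "subject_text": "notes.txt"},
    {"actor": "Unity", "subject_text": "report.odt"},
    {"actor": "Banshee", "subject_text": "song.ogg"},
]


def fetch_events_for_actor(actor):
    """Assumed helper standing in for 'request all events with this actor'."""
    return [event for event in EVENTS if event["actor"] == actor]


def count_subject_texts(actor):
    """Count subject_text values client-side, after fetching every event."""
    return Counter(e["subject_text"] for e in fetch_events_for_actor(actor))


counts = count_subject_texts("Unity")
```

Every matching event has to cross the API boundary just to be reduced to a handful of numbers, which is the cost an aggregate API would avoid.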

Revision history for this message
Seif Lotfy (seif) wrote :

My current suggestion based on Michal Hruby's requirements would be a method called

def get_events_count(timerange, event_templates):
    ...
    return dict

where dict = {event_template_1: count1,
              event_template_2: count2,
              ...
              }
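
A minimal in-memory sketch of this proposal, under stated assumptions: `Event` and `Template` are hypothetical simplified types (a `None` field in a template matches anything), and the extra `events` parameter replaces the engine's database, so the signature differs from the one proposed above.

```python
from collections import namedtuple

# Hypothetical simplified types; real Zeitgeist events carry many more fields.
Event = namedtuple("Event", "timestamp actor interpretation")
Template = namedtuple("Template", "actor interpretation")


def matches(event, template):
    """A template field of None is treated as a wildcard (assumption)."""
    return all(
        wanted is None or getattr(event, field) == wanted
        for field, wanted in template._asdict().items()
    )


def get_events_count(events, timerange, event_templates):
    """Return {template: number of events in timerange matching it}."""
    begin, end = timerange
    in_range = [e for e in events if begin <= e.timestamp <= end]
    return {t: sum(matches(e, t) for e in in_range) for t in event_templates}


events = [
    Event(100, "Unity", "ACCESS"),
    Event(200, "Unity", "MODIFY"),
    Event(300, "Banshee", "ACCESS"),
]
templates = [Template("Unity", None), Template(None, "ACCESS")]
result = get_events_count(events, (0, 250), templates)
```

Using the templates as dictionary keys requires them to be hashable, which is one practical consequence of returning them in the result.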

Revision history for this message
Michal Hruby (mhr3) wrote :

Why exactly are the event_templates duplicated to the result?

Revision history for this message
Markus Korn (thekorn) wrote :

I suggest something like:

def find_events_and_data(*find_event_arguments, datatype_const):
    ...
    return result

result = [events, data]

datatype_const:
    DATATYPE_COUNT -> result of COUNT() from within the sql-query
    DATATYPE_RELATIVE_COUNT -> result of COUNT() from within the sql-query relative to the overall result
    ...

same with ids

which is more flexible than seif's proposal, data is not necessarily a count...
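
To make the idea concrete, here is a rough sketch using Python's `sqlite3` module. The `event` table, its columns, the constant values, and the reading of "relative count" as each count divided by the total are all assumptions for illustration; the real Zeitgeist schema and constants may differ.

```python
import sqlite3

# Hypothetical constant values; only the names come from the proposal.
DATATYPE_COUNT = 0
DATATYPE_RELATIVE_COUNT = 1


def find_events_and_data(conn, actor, datatype_const):
    """Return [events, data], where data[i] belongs to events[i]."""
    rows = conn.execute(
        "SELECT subject, COUNT(*) FROM event WHERE actor = ? "
        "GROUP BY subject ORDER BY COUNT(*) DESC, subject",
        (actor,),
    ).fetchall()
    events = [subject for subject, _ in rows]
    counts = [count for _, count in rows]
    if datatype_const == DATATYPE_COUNT:
        data = counts
    elif datatype_const == DATATYPE_RELATIVE_COUNT:
        total = sum(counts)
        data = [count / total for count in counts] if total else []
    else:
        raise ValueError("unknown datatype_const")
    return [events, data]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (actor TEXT, subject TEXT)")
conn.executemany(
    "INSERT INTO event VALUES (?, ?)",
    [("Unity", "a.txt")] * 3 + [("Unity", "b.txt"), ("Banshee", "c.ogg")],
)
absolute = find_events_and_data(conn, "Unity", DATATYPE_COUNT)
relative = find_events_and_data(conn, "Unity", DATATYPE_RELATIVE_COUNT)
```

In this sketch each "event" is reduced to its subject for brevity; the point is only that the counting happens inside the SQL query rather than in the client.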

Changed in zeitgeist:
assignee: nobody → Siegfried Gevatter (rainct)
Revision history for this message
Seif Lotfy (seif) wrote : Re: [Bug 670358] Re: Add new aggregate API

I started working on a wiki with user stories and use cases to allow us to
triage the requirements better
http://wiki.zeitgeist-project.com/index.php?title=Aggregation_API

On Thu, Nov 4, 2010 at 3:39 PM, Markus Korn <email address hidden> wrote:

> I suggest something like:
>
> def find_events_and_data(*find_event_arguments, datatype_const):
> ...
> return result
>
> result = [events, data]
>
> datatype_const:
> DATATYPE_COUNT -> result of COUNT() from within the sql-query
> DATATYPE_RELATIVE_COUNT -> result of COUNT() from within the sql-query
> relative to the overall result
> ...
>
> same with ids
>
> which is more flexible than seif's proposal, data is not necessarily a
> count...
>


Seif Lotfy (seif)
Changed in zeitgeist:
importance: Undecided → Wishlist
status: New → Confirmed
milestone: none → 0.7
Revision history for this message
Seif Lotfy (seif) wrote :

OK I gave it more thought and I think Markus's idea makes a lot of sense.
This will even allow us to work with more "data_types" later, so if, for example, the geo-location extension becomes officially supported, we can return the locations as a result. Having an array of dicts in data will allow us to expand and experiment more with the fields.
Another example would be that data would look like
data = [
    {   # data_1
        "relative_count": "0.6",
        "absolute_count": "1234",
        "owner": "seif",
    },
]

I am giving +1 for Markus's idea

Revision history for this message
Seif Lotfy (seif) wrote :

The problem we have with both methods is that we can only handle one event_template,
and the result types MostRecentEvents and LeastRecentEvents cannot be used, since they will always return a count of 1.

Revision history for this message
Seif Lotfy (seif) wrote :

Let me elaborate on my last comment...

If we set result_type to MostRecentSubject, we will return the most recent event of each unique subject, plus a count of how many times that subject occurred.
However, if we ask for MostRecentEvent, we will return the most recent events and a count of how many times each event occurred, which is always 1.
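
The difference can be sketched in a few lines. The tuples below are hypothetical sample data, and the two functions only imitate the grouping behaviour described above; they are not the engine's implementation.

```python
from collections import Counter

# Hypothetical sample data: (event_id, timestamp, subject).
EVENTS = [(1, 100, "a.txt"), (2, 200, "b.txt"), (3, 300, "a.txt")]


def most_recent_subjects_with_counts(events):
    """Group by subject: newest event per subject, plus occurrences of the subject."""
    counts = Counter(subject for _, _, subject in events)
    latest = {}
    for event in sorted(events, key=lambda e: e[1]):
        latest[event[2]] = event  # later timestamps overwrite earlier ones
    ordered = sorted(latest.values(), key=lambda e: e[1], reverse=True)
    return [(event, counts[event[2]]) for event in ordered]


def most_recent_events_with_counts(events):
    """Group by event: every event is its own group, so each count is 1."""
    counts = Counter(event_id for event_id, _, _ in events)
    ordered = sorted(events, key=lambda e: e[1], reverse=True)
    return [(event, counts[event[0]]) for event in ordered]


by_subject = most_recent_subjects_with_counts(EVENTS)
by_event = most_recent_events_with_counts(EVENTS)
```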

Revision history for this message
Seif Lotfy (seif) wrote :

OK I think I need to spam this bug a bit.

While I agree that use cases and user stories are the way to go, there is no such thing as a simple use case for some features. Developers want to rank their stuff, so they need COUNTS. And currently we have no simple way of exposing COUNTS.
The use case is for developers (developer case?), where the developer wants to enrich the sorting/ranking of his results.

"I want to know how many times a set of subjects was used without having to ask for all the events and then iterate and count"

If the use case is going to be for actual users, it would sound more like

"I want my search results to be sorted better" The user cares nothing about how it's done.

It is as simple as that.

Revision history for this message
Seif Lotfy (seif) wrote :

Again the use case is not visible to the user. It is all about sorting and ranking.
I want to use it in Synapse. For that I need "starred subjects": all subjects that have been ACCESSED or MODIFIED more than 10 times within the last 24h. These are usually things that are stamped in the user's mind.
Now, querying for the most used subject does not tell me how many times it was used. So if I opened something 2 times in a day, making it the most used subject, that result is actually bullcrap.
Allowing me to query for the subjects along with a count would tell me how they compare to each other in relative terms as well as how they rank in absolute terms.
I know it sounds stupid, but there is no use case for it. It makes development easier.
Another one I can think of is Unity Places. I have the most used. Now let's say the most used person was contacted 10 times in the last 24 hours and all the others were contacted 9 times. This makes being the most used irrelevant, since 10 versus 9 is not a big enough difference to make it stand out. Thus I shouldn't consider it a starred item.
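
A sketch of how a client could use counts for this, under assumptions: events are simplified to `(timestamp, interpretation, subject)` tuples with timestamps in seconds, and the `min_ratio` heuristic in `stands_out` is invented here purely to illustrate the "10 versus 9" point.

```python
from collections import Counter

DAY = 24 * 60 * 60  # seconds


def starred_subjects(events, now, threshold=10, window=DAY):
    """Subjects ACCESSED or MODIFIED more than `threshold` times within `window`."""
    counts = Counter(
        subject
        for timestamp, interpretation, subject in events
        if now - window <= timestamp <= now
        and interpretation in ("ACCESS", "MODIFY")
    )
    return {subject: n for subject, n in counts.items() if n > threshold}


def stands_out(counts, min_ratio=1.5):
    """True if the top count is at least min_ratio times the runner-up (assumed heuristic)."""
    ranked = sorted(counts, reverse=True)
    if len(ranked) < 2:
        return bool(ranked)
    return ranked[0] >= min_ratio * ranked[1]


now = 1_000_000
events = (
    [(now - i, "ACCESS", "a.txt") for i in range(11)]            # 11 recent uses
    + [(now - i, "MODIFY", "b.txt") for i in range(2)]           # only 2 recent uses
    + [(now - DAY - 1 - i, "ACCESS", "c.txt") for i in range(12)]  # all too old
)
starred = starred_subjects(events, now)
```

Neither check is possible from a "most used" ordering alone; both need the actual counts.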

Revision history for this message
Manish Sinha (मनीष सिन्हा) (manishsinha) wrote :

I am really interested in this API, since it would help music players a lot, for example with the number of times a track has been played. Music players are used by everyone, so it is a good use case that might reach the largest number of users. I just need some explanation.

def find_events_and_data(*find_event_arguments, datatype_const):
    ...
    return result

result = [events, data]

datatype_const:
    DATATYPE_COUNT -> result of COUNT() from within the sql-query
    DATATYPE_RELATIVE_COUNT -> result of COUNT() from within the sql-query relative to the overall result

Can anyone give an example of what *find_event_arguments might look like? Is it an event template? A single one or a list?

I understand datatype_const (which is a sort of enumeration, though not in the strict sense).

About [events, data]: how are they contained? It would be much clearer if Markus gave a complete example, with sample input data and sample output data.

Revision history for this message
Manish Sinha (मनीष सिन्हा) (manishsinha) wrote :

At my day job I work with .NET, where LINQ is a boon to all the programmers stuck in Windows .NET programming.

Inspired by LINQ (which I use a lot), I would like to propose 5 basic aggregate operations for now:
* Sum
* Count
* Max
* Min
* Average

I know average/max/min sound stupid at first glance, since how can we have an average etc. of an event? Let us consider an example. I use Clementine, RB and Banshee to satisfy my music needs. I log the events from all three of these players. One fine day I would like to know:
* which track was played the most on Banshee
* the average number of plays per player
etc etc etc
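
As a rough illustration of what these five operations could mean for events, the sketch below applies them to per-player play counts. The play log is made-up sample data and the aggregates are computed client-side; nothing here reflects an actual Zeitgeist API.

```python
from collections import Counter
from statistics import mean

# Hypothetical play log: one (player, track) pair per playback event.
PLAYS = [
    ("Banshee", "track1"), ("Banshee", "track1"), ("Banshee", "track2"),
    ("Clementine", "track1"),
    ("RB", "track3"), ("RB", "track3"),
]

# Number of playback events per player, and per track on Banshee only.
plays_per_player = Counter(player for player, _ in PLAYS)
banshee_tracks = Counter(track for player, track in PLAYS if player == "Banshee")

per_player = list(plays_per_player.values())
aggregates = {
    "count": len(per_player),     # number of players
    "sum": sum(per_player),       # total plays across players
    "max": max(per_player),       # plays on the busiest player
    "min": min(per_player),       # plays on the least used player
    "average": mean(per_player),  # average number of plays per player
}
most_played_on_banshee = banshee_tracks.most_common(1)[0][0]
```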

Seif Lotfy (seif)
Changed in zeitgeist:
milestone: 0.7.0 → none
Revision history for this message
Markus Korn (thekorn) wrote :
Revision history for this message
Michal Hruby (mhr3) wrote :

@Markus: I basically like your proposal, but since I was told that event ids are not unique, isn't there a huge flaw in that API?

Revision history for this message
Siegfried Gevatter (rainct) wrote :

> FindEventIdsStats(..., ResultType.MostRecentEvents) --> ([1, 2], [500.0, 250.0])
> For `MostRecentEvents` the stats are returning the timestamps for each
> event.

That's redundant, the events already include the last timestamp.

Revision history for this message
Markus Korn (thekorn) wrote :

@Siegfried: yes, but it is only redundant for FindEventsStats, as FindEventIdsStats only returns ids and not events. It can't be avoided in this particular case, as the stats field has to return *something*.

@Michal: as we already clarified on IRC, the mapping is done based on index, so the first element in the stats array points to the first element in the result one, etc.
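
The index-based mapping can be sketched as follows, using the values from the quoted example. `pair_stats` is a hypothetical client-side helper, not part of the proposed API.

```python
def pair_stats(ids, stats):
    """Pair each result with its statistic by position, not by id value."""
    if len(ids) != len(stats):
        raise ValueError("result and stats arrays must have the same length")
    return list(zip(ids, stats))


# Values taken from the FindEventIdsStats example quoted in this thread.
pairs = pair_stats([1, 2], [500.0, 250.0])
```

Because the pairing is positional, it does not rely on the ids being usable as lookup keys.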

Revision history for this message
Michal Hruby (mhr3) wrote :

> @Markus: I basically like your proposal, but since I was told that event ids are not unique, isn't there a huge flaw in that API?

Never mind, I misread the example...

Revision history for this message
Seif Lotfy (seif) wrote :

I am very very pleased with the API proposal. GREAT GREAT WORK thekorn. You
rock big time :)
+1 from me

On Sat, Nov 27, 2010 at 1:48 PM, Michal Hruby <email address hidden> wrote:

> > @Markus: I basically like your proposal, but since I was told that
> event ids are not unique, isn't there a huge flaw in that API?
>
> Nevermind, I misread the example...
>


Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote : Re: [Zeitgeist] [Bug 670358] Re: Add new aggregate API

Sorry to ruin the party, but I really don't like any of the proposed
solutions. The use cases described in the wiki seem very academic and
more intent on doing some theoretical counting exercises than
solving actual user problems.

Unless we have some crystal clear use cases (e.g. a UI mockup someone
actually wants to develop and deploy) or a very clear idea of how we
can extend and adapt those cases to other situations, I don't think it
makes sense to add new API. It will just be technical debt.

Concerning the actual proposals (minding that I don't think we're at a
place where it makes sense to discuss it yet):

Seif: I think this is way too simple. We *also* need something where
you can also do a query and do the counting in one roundtrip -
preferably with one SQL call under the hood. I say also because there
are situations where you want to display the events fast, and can wait
a bit longer to display the counts - because the counting is often a
slower task.

Markus: Your proposal is more flexible, although I wonder why you use
a double. It would seem very awkward to have to convert everything
from double to int in most cases. Maybe a variant whose type is
determined by the ResultType you pass in? Regarding
ResultType.MostFrequentActor, this is identical to our current
ResultType.MostPopularActor, right? Like the initial use cases, I think
the examples you add are somewhat contrived.

Again, sorry if I come out as overly negative. I just feel like we're
taking stabs in the dark here.

Seif Lotfy (seif)
Changed in zeitgeist:
milestone: none → 0.8.0
Seif Lotfy (seif)
Changed in zeitgeist:
milestone: 0.8.0 → none
Revision history for this message
Seif Lotfy (seif) wrote :

OK, I am convinced we can do this in another way. An extension makes sense to me, and as an extension it does not belong here. Thus I will mark this bug as Invalid.

Changed in zeitgeist:
status: Confirmed → Invalid