metadata cache

Sun Feb 21 04:34:15 PST 2010

>> Would it be possible to 'split' the cache into two files, e.g.  
>> essential info and additional info. The essential info would  
>> basically be the current metadata cache but only holding what is  
>> absolutely necessary and the additional info would be looked up  
>> only when needed (like description, group and such like). The idea  
>> being that this would reduce the size of the main cache that is used.
>
>
> It would be possible to do a separate SQL index, that would map  
> "pkgKey" into "file offset".

I meant to write "pkgId", not "pkgKey" (the key is internal to the
sqlite database, not external):

CREATE TABLE packages (  pkgKey INTEGER PRIMARY KEY,  pkgId TEXT,   
name TEXT,  arch TEXT,  version TEXT,  epoch TEXT,  release TEXT,   
summary TEXT,  description TEXT,  url TEXT,  time_file INTEGER,   
time_build INTEGER,  rpm_license TEXT,  rpm_vendor TEXT,  rpm_group  
TEXT,  rpm_buildhost TEXT,  rpm_sourcerpm TEXT,  rpm_header_start  
INTEGER,  rpm_header_end INTEGER,  rpm_packager TEXT,  size_package  
INTEGER,  size_installed INTEGER,  size_archive INTEGER,   
location_href TEXT,  location_base TEXT,  checksum_type TEXT);

Adding an index to the already existing XML also saves having to
duplicate all the information.

> Just that ElementTree doesn't help much with this, so it would need  
> a separate indexing run.

Added a (pyexpat) index creator in
http://bazaar.launchpad.net/~afb/smart/metadata/revision/941
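
For illustration, here is a minimal sketch of how such an indexing run
could look with pyexpat. It is not the code from the revision above;
the function name build_index is made up, and it assumes the usual
repodata layout where primary.xml stores the id as the text of
<checksum pkgid="YES"> while filelists.xml and other.xml carry it as a
pkgid attribute on <package>.

```python
import xml.parsers.expat


def build_index(data):
    """Map each pkgId to the byte offset of its <package> tag.

    `data` is the uncompressed XML as bytes. (Hypothetical helper,
    not the actual index creator.)
    """
    index = {}
    state = {"offset": None, "in_checksum": False, "text": []}
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        if name == "package":
            # Inside a handler, CurrentByteIndex is the offset of the
            # '<' that opens the tag being reported.
            state["offset"] = parser.CurrentByteIndex
            # filelists.xml / other.xml: id is an attribute.
            if "pkgid" in attrs:
                index[attrs["pkgid"]] = state["offset"]
        elif name == "checksum" and attrs.get("pkgid") == "YES":
            # primary.xml: id is the text content of this element.
            state["in_checksum"] = True
            state["text"] = []

    def chars(text):
        # Character data may arrive in several pieces, so collect it.
        if state["in_checksum"]:
            state["text"].append(text)

    def end(name):
        if name == "checksum" and state["in_checksum"]:
            index["".join(state["text"]).strip()] = state["offset"]
            state["in_checksum"] = False

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(data, True)
    return index
```

Since expat streams through the input, the indexing run never has to
hold the whole document tree in memory.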

If you run it on each repodata file, the index looks like:

# tests/data/rpm/repodata/primary.xml.gz
781a4605a429eb27846f0234657f84f1a5831696        156
b70ad189a33ba47c50f368475458b0fc19630f5f        1981

# tests/data/rpm/repodata/filelists.xml.gz
781a4605a429eb27846f0234657f84f1a5831696        113
b70ad189a33ba47c50f368475458b0fc19630f5f        289

# tests/data/rpm/repodata/other.xml.gz
781a4605a429eb27846f0234657f84f1a5831696        109
b70ad189a33ba47c50f368475458b0fc19630f5f        259

Where the number is the byte offset of the <package> element.
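
Reading such an index back is simple. A rough sketch, assuming one
"pkgId offset" pair per line separated by whitespace, and treating the
"#" lines as comments (parse_index is a made-up name):

```python
def parse_index(text):
    """Parse 'pkgId  offset' lines into a dict of pkgId -> int offset."""
    index = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# filename' comment lines.
        if not line or line.startswith("#"):
            continue
        pkgid, offset = line.split()
        index[pkgid] = int(offset)
    return index
```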

So now it doesn't need to scan the entire file, but it can seek  
directly to the element start...
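
The lookup side could then be something like the sketch below. It
assumes the offsets refer to the uncompressed XML and that every
<package> has an explicit </package> end tag; read_package is a
made-up name.

```python
def read_package(fileobj, offset, chunk_size=4096):
    """Return the raw bytes of the <package> element at `offset`.

    `fileobj` is a seekable binary file holding the uncompressed XML.
    Assumes the element is closed by an explicit </package> tag.
    """
    end_tag = b"</package>"
    fileobj.seek(offset)
    buf = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            raise ValueError("no closing </package> found")
        buf += chunk
        # Search the accumulated buffer, so an end tag that is split
        # across two chunks is still found.
        pos = buf.find(end_tag)
        if pos != -1:
            return buf[:pos + len(end_tag)]
```

Two caveats: the returned fragment no longer has the namespace
declarations from the root element, so prefixes like rpm: need to be
declared again (for example by wrapping the fragment in a root element)
before feeding it to ElementTree. And seeking in a gzip.GzipFile works,
but it decompresses from the start of the file up to the offset, so the
real win comes when the uncompressed file is kept around.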

--anders