metadata cache

Anders F Björklund afb at algonet.se
Fri Feb 19 02:34:19 PST 2010


I was trying to explain why the Fedora "cache" is 2-4x bigger
than the Ubuntu "cache", so I figured I might as well post it...
It's all about the "base" (genbasedir) versus the "repodata",
not about the distros - they're about equal at ~20,000 pkgs.


The basic structure for packages in Smart is the Package,
stored in a Cache after being loaded from data by a Loader.
There are different types of Packages and Loaders, for the
different types of package managers and package channels.

The APT-* loaders use the file offset to load a package.
So a cached package looks something like (offset = 524782):
<smart.backends.deb.loader.DebTagFileLoader object at 0x289f410>: 524782
The cached Loader holds the path to the actual tagfile used.
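
To illustrate the offset idea, here is a minimal sketch in Python 3.
It is not the actual DebTagFileLoader code; the tagfile contents and
the helper names (index_offsets, load_info) are made up for the example.
The point is only that the cache can keep one integer per package and
re-read the stanza on demand:

```python
import io

# Hypothetical two-stanza tagfile, in the style of a Debian "Packages" file.
TAGFILE = (
    b"Package: foo\nVersion: 1.0\nDescription: first package\n"
    b"\n"
    b"Package: bar\nVersion: 2.1\nDescription: second package\n"
)


def index_offsets(f):
    """Return {package name: byte offset where its stanza starts}."""
    offsets = {}
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith(b"Package: "):
            offsets[line[len(b"Package: "):].strip().decode()] = pos
    return offsets


def load_info(f, offset):
    """Seek to a stanza and parse only that stanza into a dict."""
    f.seek(offset)
    info = {}
    for line in f:
        if not line.strip():  # a blank line ends the stanza
            break
        key, _, value = line.decode().partition(": ")
        info[key] = value.strip()
    return info


tagfile = io.BytesIO(TAGFILE)
offsets = index_offsets(tagfile)          # this is all the cache needs
bar = load_info(tagfile, offsets["bar"])  # details are read on demand
```

Only the offsets dict has to be cached; the description and the other
fields stay in the tagfile until someone asks for them.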

The RPM-MD loader (like slack/arch), however, uses "info".
That is, they dump the _entire info_ dict as loaded from data:
<smart.backends.rpm.metadata.RPMMetaDataLoader object at 0x1fc10350>:
{'installed_size': 22102446, 'description': 'Python is an interpreted,
interactive, object-oriented programming\nlanguage often compared to
Tcl, Perl, Scheme or Java. Python includes\nmodules, classes,
exceptions, very high level dynamic data types and\ndynamic typing.
Python supports interfaces to many system calls and\nlibraries, as
well as to various windowing systems (X11, Motif, Tk,\nMac and
MFC).\n\nProgrammers can write new built-in modules for Python in C or
C++.\nPython can be used as an extension language for applications
that need\na programmable interface. This package contains most of
the standard\nPython modules, as well as modules for interfacing to
the Tix widget\nset for Tk and RPM.\n\nNote that documentation for
Python is provided in the python-docs\npackage.',
'license': 'PSF - see LICENSE', 'url': 'http://www.python.org/',
'build_time': 1252006932, 'summary': 'An interpreted, interactive,
object-oriented programming language.',
'sha': '9bfee41ad19a336e614793bcd57e77a25d0029e3',
'location': 'CentOS/python-2.4.3-27.el5.x86_64.rpm',
'time': 1254357742, 'group': 'Development/Languages',
'sourcerpm': 'python-2.4.3-27.el5.src.rpm', 'size': 6237221}
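
A rough way to see the per-package difference is to pickle one entry
of each kind and compare the sizes. This is only a sketch: the dict is
a shortened copy of the dump above (the long description is left out),
and the real cache also stores other things, so this ratio is not the
2-4x figure for the whole cache:

```python
import pickle

# What an offset-based loader keeps per package: one integer.
offset_entry = 524782

# What an "info"-based loader keeps per package (description omitted here).
info_entry = {
    "installed_size": 22102446,
    "summary": "An interpreted, interactive, object-oriented programming language.",
    "license": "PSF - see LICENSE",
    "url": "http://www.python.org/",
    "sha": "9bfee41ad19a336e614793bcd57e77a25d0029e3",
    "location": "CentOS/python-2.4.3-27.el5.x86_64.rpm",
    "group": "Development/Languages",
    "sourcerpm": "python-2.4.3-27.el5.src.rpm",
    "size": 6237221,
}

offset_bytes = len(pickle.dumps(offset_entry, protocol=2))
info_bytes = len(pickle.dumps(info_entry, protocol=2))
print(offset_bytes, info_bytes)
```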

There are no file offsets available from the ElementTree (expat),
but one can use the repodata "pkgKey" package identifier instead:
<smart.backends.rpm.database.RPMMetaDataLoader object at 0x1d988410>: '9bfee41ad19a336e614793bcd57e77a25d0029e3'

The problem, when using the regular loader, is that it now needs
to scan the _entire file_ for each package, which takes ages:
something like 1-3 seconds per package - or 30 minutes to list all...
Test code is at: https://code.launchpad.net/~afb/smart/metadata
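
For illustration, a lookup by identifier against the XML has to look
something like the following sketch. This is not the code from the
branch above; the tiny sample document, the "aaaa" checksum and the
find_info helper are invented for the example, and only the namespace
and element names are meant to resemble primary.xml:

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://linux.duke.edu/metadata/common}"

# Tiny made-up stand-in for repodata/primary.xml.
PRIMARY_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
  <package type="rpm">
    <name>bash</name>
    <checksum type="sha" pkgid="YES">aaaa</checksum>
    <summary>The GNU Bourne Again shell</summary>
  </package>
  <package type="rpm">
    <name>python</name>
    <checksum type="sha" pkgid="YES">9bfee41ad19a336e614793bcd57e77a25d0029e3</checksum>
    <summary>An interpreted, interactive, object-oriented programming language.</summary>
  </package>
</metadata>
"""


def find_info(source, pkgid):
    """Parse from the top of the file until the wanted package shows up."""
    scanned = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "package":
            continue
        scanned += 1
        if elem.findtext(NS + "checksum") == pkgid:
            info = {
                "name": elem.findtext(NS + "name"),
                "summary": elem.findtext(NS + "summary"),
            }
            return info, scanned
        elem.clear()  # drop packages we are not interested in
    return None, scanned


info, scanned = find_info(
    io.BytesIO(PRIMARY_XML), "9bfee41ad19a336e614793bcd57e77a25d0029e3"
)
```

There is no way to jump straight to the right element, so every lookup
parses everything in front of it. With ~20,000 packages in the file
that cost is paid again for each package that needs its info.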

So this needs a different storage, and that's where the alternative
SQLite repodata comes in. Now it doesn't need to parse XML, but can
do a SQL query instead to retrieve the information for a "pkgKey".
That code is at: https://code.launchpad.net/~afb/smart/sqlite
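
As a sketch of what such a query can look like (again not the code from
the branch): the table below is a cut-down, in-memory stand-in for the
primary SQLite database. The column names are my assumption of the
createrepo layout, where as far as I know the checksum string lives in
"pkgId" and "pkgKey" is the integer row key, so check them against a
real database before relying on them:

```python
import sqlite3

# In-memory stand-in for the primary SQLite repodata (simplified, assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE packages ("
    " pkgKey INTEGER PRIMARY KEY, pkgId TEXT, name TEXT,"
    " summary TEXT, location_href TEXT)"
)
conn.execute("CREATE INDEX packageId ON packages (pkgId)")
conn.execute(
    "INSERT INTO packages (pkgId, name, summary, location_href)"
    " VALUES (?, ?, ?, ?)",
    (
        "9bfee41ad19a336e614793bcd57e77a25d0029e3",
        "python",
        "An interpreted, interactive, object-oriented programming language.",
        "CentOS/python-2.4.3-27.el5.x86_64.rpm",
    ),
)


def load_info(conn, pkgid):
    """Fetch the info for one package by identifier, via the index."""
    row = conn.execute(
        "SELECT name, summary, location_href FROM packages WHERE pkgId = ?",
        (pkgid,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("name", "summary", "location"), row))


info = load_info(conn, "9bfee41ad19a336e614793bcd57e77a25d0029e3")
```

The cache then only needs to hold the identifier, and the lookup goes
through the index instead of re-reading the whole file.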


Keeping a SQL connection open is more hassle than reading a file,
whether it's a flat tagfile or a pickled cache of the pkg info,
so thus far it hasn't been worth it. But if pathlist/changelog were
added, SQLite would definitely be required. Reading XML is just too slow...

I would be interested in some feedback from others using "RPM-MD",
but to me it seems like it should continue to use the bigger cache
until the time is right to switch to using SQLite repodata instead?
The SQLite repodata is created with the -d option to
createrepo/yum-createrepo/rpmrepo.

--anders


PS. The RPM-Database and RPM-HeaderList don't have this problem,
     it's specific to the RPM-MetaData as used by yum/zypp.



