metadata cache
Rehan
rehan.khan at dsl.pipex.com
Fri Feb 19 10:25:43 PST 2010
Thanks for explaining this issue. I now understand why smart seems so clunky on yum-based distros.
Would it be possible to 'split' the cache into two files, e.g. essential info and additional info? The essential info would basically be the current metadata cache, but holding only what is absolutely necessary, and the additional info (description, group and the like) would be looked up only when needed. The idea is that this would reduce the size of the main cache that is used.
However, it might be more work than it's worth, given that there is an SQLite alternative for yum.
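To make the idea concrete, here is a rough sketch in modern Python of what such a split could look like. It is not Smart code: the choice of "essential" keys, the file names and the helper functions are all made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical choice of which info keys count as "essential".
ESSENTIAL_KEYS = {"location", "size", "installed_size", "sha"}


def split_info(info):
    """Split one info dict into (essential, additional) parts."""
    essential = {k: v for k, v in info.items() if k in ESSENTIAL_KEYS}
    additional = {k: v for k, v in info.items() if k not in ESSENTIAL_KEYS}
    return essential, additional


def write_split_cache(infos, directory):
    """Write two pickle files; infos maps a package id to its info dict."""
    main, extra = {}, {}
    for pkgid, info in infos.items():
        main[pkgid], extra[pkgid] = split_info(info)
    # File names are invented for this sketch.
    main_path = os.path.join(directory, "cache-main.pickle")
    extra_path = os.path.join(directory, "cache-extra.pickle")
    with open(main_path, "wb") as f:
        pickle.dump(main, f, pickle.HIGHEST_PROTOCOL)
    with open(extra_path, "wb") as f:
        pickle.dump(extra, f, pickle.HIGHEST_PROTOCOL)
    return main_path, extra_path


def load_pickle(path):
    """Load one of the two cache files; the extra one only when needed."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Only the main file would be loaded at startup; the extra file would be read the first time a description or group is requested.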
Rehan
-----Original Message-----
From: smart-bounces at lists.labix.org [mailto:smart-bounces at lists.labix.org] On Behalf Of Anders F Björklund
Sent: 19 February 2010 10:46
To: smart at lists.labix.org
Subject: metadata cache
I was trying to explain why the Fedora "cache" is 2-4x bigger
than the Ubuntu "cache", so figured I might as well post it...
It's all about the "base" (genbasedir) versus the "repodata",
not about the distros - they're about equal at ~ 20,000 pkgs.
The basic structure for packages in Smart is the Package,
stored in a Cache after being loaded from data by a Loader.
There are different types of Packages and Loaders, for the
different types of package managers and package channels.
The APT-* loaders are using the file offset to load a package.
So a cached package looks something like (offset = 524782):
<smart.backends.deb.loader.DebTagFileLoader object at 0x289f410>: 524782
The cached Loader holds the path to the actual tagfile used.
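As an illustration of why an offset is enough, the following is a minimal sketch (modern Python, not the actual DebTagFileLoader code) of reading one package stanza from a tagfile given its byte offset. The function name and the simplified "Key: value" parsing are assumptions made for the example.

```python
import io


def read_stanza_at(tagfile, offset):
    """Seek to a byte offset and parse a single 'Key: value' stanza.

    tagfile is a binary file object. Parsing is simplified: a line that
    starts with whitespace is treated as a continuation of the previous
    field, and a blank line ends the stanza.
    """
    tagfile.seek(offset)
    info = {}
    key = None
    for raw in tagfile:
        line = raw.decode("utf-8").rstrip("\n")
        if not line:
            break  # blank line: end of this package's stanza
        if line[0] in " \t" and key is not None:
            info[key] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            info[key] = value.strip()
    return info
```

The cost of a lookup is one seek plus reading a few lines, regardless of how large the tagfile is, so the cache only has to remember the number.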
The RPM-MD loader (like the slack/arch ones), however, uses "info".
That is, it dumps the _entire info_ dict as loaded from data:
<smart.backends.rpm.metadata.RPMMetaDataLoader object at 0x1fc10350>:
{'installed_size': 22102446, 'description': 'Python is an
interpreted, interactive, object-oriented programming\nlanguage often
compared to Tcl, Perl, Scheme or Java. Python includes\nmodules,
classes, exceptions, very high level dynamic data types and\ndynamic
typing. Python supports interfaces to many system calls and
\nlibraries, as well as to various windowing systems (X11, Motif, Tk,
\nMac and MFC).\n\nProgrammers can write new built-in modules for
Python in C or C++.\nPython can be used as an extension language for
applications that need\na programmable interface. This package
contains most of the standard\nPython modules, as well as modules for
interfacing to the Tix widget\nset for Tk and RPM.\n\nNote that
documentation for Python is provided in the python-docs\npackage.',
'license': 'PSF - see LICENSE', 'url': 'http://www.python.org/',
'build_time': 1252006932, 'summary': 'An interpreted, interactive,
object-oriented programming language.', 'sha':
'9bfee41ad19a336e614793bcd57e77a25d0029e3', 'location': 'CentOS/
python-2.4.3-27.el5.x86_64.rpm', 'time': 1254357742, 'group':
'Development/Languages', 'sourcerpm': 'python-2.4.3-27.el5.src.rpm',
'size': 6237221}
There are no file offsets available from the ElementTree (expat),
but one can use the repodata "pkgKey" package identifier instead:
<smart.backends.rpm.database.RPMMetaDataLoader object at 0x1d988410>:
'9bfee41ad19a336e614793bcd57e77a25d0029e3'
The problem, when using the regular loader, is that it now needs
to scan the _entire file_ for each package. Which literally takes
ages, like 1-3 seconds per package - or 30 minutes to list all...
Test code is at: https://code.launchpad.net/~afb/smart/metadata
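For readers who want to see the cost without checking out the branch, here is a small standalone sketch of such a linear scan. It is not the code from the branch above. It assumes a simplified primary.xml layout (package elements with name, checksum and summary children in the "common" namespace), and find_package is a name invented for this example.

```python
import io
import xml.etree.ElementTree as ET

# Namespace assumed for the elements in primary.xml.
NS = "{http://linux.duke.edu/metadata/common}"


def find_package(xml_file, checksum):
    """Scan the XML until a package with the given checksum is found.

    Returns (info, scanned), where scanned is the number of package
    elements that had to be parsed before the search stopped.
    """
    scanned = 0
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag != NS + "package":
            continue
        scanned += 1
        if elem.findtext(NS + "checksum") == checksum:
            info = {"name": elem.findtext(NS + "name"),
                    "summary": elem.findtext(NS + "summary")}
            return info, scanned
        elem.clear()  # drop the children of packages we skip
    return None, scanned
```

Every lookup has to parse all the packages that come before the wanted one, so fetching the info for each package in turn is quadratic in the number of packages.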
So this needs a different storage, and that's where the alternative
SQLite repodata comes in. Now it doesn't need to parse XML, but can
do a SQL query instead to retrieve the information for a "pkgKey".
That code is at: https://code.launchpad.net/~afb/smart/sqlite
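The lookup itself can be sketched as follows. This is not the code from the branch above, and the table is a simplified stand-in created in memory for the example, not the real schema that createrepo writes; the column names and both function names are assumptions.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a simplified packages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages ("
        "pkgKey TEXT PRIMARY KEY, name TEXT, summary TEXT, description TEXT)"
    )
    conn.execute(
        "INSERT INTO packages VALUES (?, ?, ?, ?)",
        ("9bfee41ad19a336e614793bcd57e77a25d0029e3", "python",
         "An interpreted, interactive, object-oriented programming language.",
         "Python is an interpreted, interactive, object-oriented language."),
    )
    conn.commit()
    return conn


def lookup_info(conn, pkgkey):
    """Fetch the info for one package with a single indexed query."""
    row = conn.execute(
        "SELECT name, summary, description FROM packages WHERE pkgKey = ?",
        (pkgkey,),
    ).fetchone()
    if row is None:
        return None
    return {"name": row[0], "summary": row[1], "description": row[2]}
```

With the key indexed, the database finds the row directly instead of reading everything in front of it, which is what the XML scan cannot do.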
Keeping a SQL connection up is more hassle than reading a file,
whether it's a flat tagfile or a pickled cache of the pkg info,
so thus far it hasn't been worth it. If adding pathlist/changelog,
it would definitely be required. Reading XML is just too slow...
I would be interested in some feedback from others using "RPM-MD",
but to me it seems like it should continue to use the bigger cache
until the time is right to switch to using SQLite repodata instead?
The SQLite repodata is created with the -d option to createrepo/yum-createrepo/rpmrepo.
--anders
PS. The RPM-Database and RPM-HeaderList don't have this problem;
it's specific to the RPM-MetaData as used by yum/zypp.