You all know those small images that many people have in their signatures on the IPFire forums?
They come out of the fireinfo service, a service that we built to gather detailed information about the hardware that IPFire runs on. This information is not only fun to look at, it is also very valuable for the IPFire development process, and some design decisions might have been made differently if we did not have this data. Participation is of course completely voluntary. Please enable the fireinfo service on your systems, because the more profiles we have in our database, the more accurate the statistics that we run over them become.
All those profiles increase the precision of our predictions, but they also fill up our database. The latter turned out to be a real problem over the last year or two, and this blog post is supposed to talk a little bit about why…
Changing the database
We used to store all the profiles that we receive in a MongoDB cluster with three nodes: one primary node and two replicas, so that we could reboot the primary host and still have a copy of the data in case of disk failure. The client software sends a profile in JSON format, which was stored pretty much as it comes in, without much further ado. That is of course quick to do, but not smart. The result was a database that is easy to write to and fast to read from, but with almost a hundred gigabytes of data on the disks. We do have some hosting resources, but nothing to waste on inefficient data storage for fireinfo. I love this service, but it does not have that much of a priority.
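To illustrate what "stored as it comes" means: a fireinfo profile is one nested JSON document per submission. The field names below are illustrative assumptions, not the exact wire format the real client uses – the point is only the document shape that ends up in MongoDB as a single opaque blob.

```python
import json

# A hypothetical fireinfo-style profile. Field names are made up for
# illustration; the real client's format may differ. The whole nested
# structure is serialised and stored as one document.
profile = {
    "public_id": "0123456789abcdef",  # anonymous identifier, illustrative
    "profile": {
        "system": {"release": "IPFire 2.x", "memory": 1024 * 1024},
        "cpu": {"vendor": "GenuineIntel", "flags": ["sse", "aes", "rdrand"]},
        "devices": [
            {"subsystem": "pci", "vendor": "8086", "model": "100e"},
        ],
    },
}

# Stored verbatim: one blob per submission, no relational structure.
blob = json.dumps(profile)
```

Writing such a blob is cheap, but any query that is not covered by an index has to deserialise and inspect every stored document.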
I also do not think that MongoDB was a good option for us right from the beginning. This is just what sometimes happens when you have a great idea at a developer summit and bootstrap it within the next couple of days. It has some serious design issues which make it incredibly slow and make it consume tons of memory. On the plus side: it worked well enough for the last four years. Now that the project is growing bigger and bigger, it is time for a change.
Moving to PostgreSQL
I talked before about how we moved many of the services that run the IPFire project infrastructure from MySQL to PostgreSQL. So far we have not regretted doing this at all, and therefore PostgreSQL was an option worth considering. The main thing is not the database engine but the layout of the data. MongoDB is a document-based database which stores an entire document (i.e. the JSON file) as a whole; PostgreSQL is a relational database – the “normal” kind with tables and such. One of the advertised advantages of document-based databases is that the documents in a collection do not all need to have the same schema; they can be completely different. That was good for us, because we did not really know what to expect from the service when we had the idea. But searching for something requires looking at every document, which is fast when you have a couple of them. In our case we have over 20 million of them, and reading them all from disk alone takes ten minutes. The indexes we used to speed up some common searches helped, but did not get us anywhere near where we wanted to be.
So now we have a relational database. That means structure. That means joins. That means indexes. The old style. On a quick side note: I think these still work much better when you know your data. Maybe that is because they are way more researched and better implemented; maybe it is because people have more experience handling them. In fireinfo, the structure of the profile data is the key to storing it as efficiently as we want:
- We do not want to throw anything away. The profiles should be kept in the database more or less forever, so that we can go back in history and see developments. That is the only way we can make out trends in the hardware that people use: what is declining, and what is going to be huge in the future?
- The data must be searchable. Fast. The relational database lets us search only the data that interests us, instead of reading the entire profile over and over again. With a large number of profiles, that no longer works in the document-based database.
- Use as little space as possible. This is not just about the small disks that we have. The main cause of the slow performance of our web services lately is too many I/O operations per unit of time: the heads of the disks are constantly seeking and only reading the smallest chunks of data, and we cannot efficiently cache most of it. So if the fireinfo database is smaller, more of it can be kept in memory, which results in fewer seek operations. SSDs would help, but we do not have any.
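To make the idea of structure, joins and indexes concrete, here is a minimal sketch of a normalized layout. It uses SQLite so it is self-contained; the real service uses PostgreSQL, and the table and column names here are assumptions, not the actual fireinfo schema. The point is that a device lookup touches only a small index instead of every stored profile blob.

```python
import sqlite3

# Illustrative normalized layout: profiles in one table, devices split
# out into their own rows so they can be indexed and joined.
# Table/column names are assumptions, not the real fireinfo schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE profiles (
        id        INTEGER PRIMARY KEY,
        public_id TEXT NOT NULL,
        created   TEXT NOT NULL
    );
    CREATE TABLE devices (
        profile_id INTEGER REFERENCES profiles(id),
        subsystem  TEXT NOT NULL,   -- 'pci' or 'usb'
        vendor     TEXT NOT NULL,
        model      TEXT NOT NULL
    );
    -- An index on (vendor, model) turns "who has this device?" into a
    -- lookup instead of a scan over every stored document.
    CREATE INDEX devices_vendor_model ON devices (vendor, model);
""")
cur.execute("INSERT INTO profiles VALUES (1, 'abcdef', '2015-01-01')")
cur.execute("INSERT INTO devices VALUES (1, 'pci', '8086', '100e')")
conn.commit()

# Count installations with a given PCI device: only the devices index
# is read, the profiles themselves are never touched.
cur.execute("""
    SELECT COUNT(DISTINCT profile_id) FROM devices
    WHERE subsystem = ? AND vendor = ? AND model = ?
""", ("pci", "8086", "100e"))
count = cur.fetchone()[0]
```

The same query against the document store would have to open and parse every profile that is not covered by a hand-built index.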
Of course PostgreSQL does not only come with advantages. The structure of the data mentioned above has to be created: every time a profile is written to or read from the database, it has to be split into parts or joined back together. That takes some effort, which I originally thought would outweigh the benefit of the relational database system – which is also the reason why we went with MongoDB in the first place.
Importing the existing profiles
This turned out to take a really, really long time. Who’d have thunk? Over 20 million profiles and almost one hundred gigabytes of data on disk: the import took four weeks. The profiles had to be imported in order, so there was no feasible way of implementing an importer tool that processes more than one profile at a time. But after millions and millions of SQL queries we are done, and finally able to present the new fireinfo.
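A strictly sequential importer of that kind boils down to one loop with no parallel workers; committing in batches at least keeps the transaction overhead down. This is only a sketch of the shape of such a tool (the real one talks to MongoDB and PostgreSQL), demonstrated here with in-memory stand-ins for the database calls.

```python
# Sketch of a strictly sequential importer: profiles must be replayed
# in submission order, so there is one loop and no parallelism.
# Committing every batch_size rows limits per-row transaction overhead.
# Illustrative only -- the real tool reads MongoDB, writes PostgreSQL.

def import_profiles(profiles, insert, commit, batch_size=1000):
    """Insert profiles one by one, committing every batch_size rows."""
    for i, profile in enumerate(profiles, start=1):
        insert(profile)
        if i % batch_size == 0:
            commit()
    commit()  # flush the final partial batch

# Tiny demonstration with lists standing in for the database.
stored, commits = [], []
import_profiles(
    [{"id": n} for n in range(2500)],
    insert=stored.append,
    commit=lambda: commits.append(len(stored)),
)
```

With batches of 1000, the 2500 sample profiles above are flushed in three commits instead of 2500.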
Of course we also implemented some new features. Since the beginning of the entire fireinfo idea we have had so many possible improvements on our list and never found the time to realise many of them. If you have some spare time, please get in touch. This is such an interesting topic to do research on, and there is so much in the data that is worth plotting or running other computations on.
Updating the profile page
The profile page has been updated. The design is a bit cleaner now and the most important information is right at the top. The list of processor features has been extended by the AES-NI and RDRAND instructions.
If you have any ideas how to make this even better, please get in touch.
We have added access to the list of vendors who have devices in our database. You can click on “PCI” or “USB” to get a list of all devices of the respective class by that vendor.
Driver -> Devices
We have added the option to show a list of known devices that are supported by a certain device driver. On top of that, you will see how widely used each device is. This is a good indicator of what is readily available on the market, and of what is recommended to buy because it is proven to work.
The other statistics pages have been slightly redesigned as well. This can still be done better, since we focussed on getting the information into a better shape to present what is really important: the significant data. The geo location list, for example, used to be very long – IPFire is running in over 160 countries – and is now capped to show only entries with more than 1% of all IPFire installations.
Previously we only showed what percentage of processors supports 64 bit and PAE. This has now been extended to many more interesting CPU flags, like SSE, AES-NI, RDRAND, and more.
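The flag statistics themselves are a simple aggregation: over all stored CPU flag sets, count what share of processors advertises each capability. A sketch with made-up sample data:

```python
from collections import Counter

# Compute, over a set of CPU flag lists, the share of processors that
# advertises each flag. The sample data below is invented; the real
# statistics run over all profiles in the database.
cpus = [
    ["lm", "pae", "sse"],
    ["lm", "pae", "sse", "aes"],
    ["pae", "sse"],
    ["lm", "pae", "sse", "aes", "rdrand"],
]

counts = Counter(flag for flags in cpus for flag in flags)
share = {flag: n / len(cpus) for flag, n in counts.items()}

print(f"AES-NI: {share['aes']:.0%}")  # prints "AES-NI: 50%"
```

The same one-liner extends to any new flag we decide to track, which is why adding SSE, AES-NI and RDRAND to the page was cheap.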
The memory distribution graph has been improved and is now easier to understand as it shows cumulative values.
Going back in time
It is possible to append a when=DATETIME parameter to almost any page. By doing that, you can view the profile or other data as it was at that time. We should somehow add this to the GUI, but I have not figured out a good way to do that. By now you will all have figured out that I am not the best UI designer :) As an example, have a look at one of my development machines as of today.
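One way such a when parameter can be answered – and this is an assumption about the approach, not the actual fireinfo implementation – is to store, alongside each row, the period it was valid for; "the data at time T" is then the row whose interval contains T. A minimal sketch:

```python
from datetime import datetime

# Sketch of point-in-time lookups: each row carries the interval it was
# valid for, and a query for time T picks the row containing T.
# Field names and the approach itself are assumptions for illustration.
rows = [
    {"arch": "i586",   "since": datetime(2013, 1, 1), "until": datetime(2014, 6, 1)},
    {"arch": "x86_64", "since": datetime(2014, 6, 1), "until": None},  # still current
]

def row_at(rows, when):
    """Return the row that was valid at the given point in time."""
    for row in rows:
        if row["since"] <= when and (row["until"] is None or when < row["until"]):
            return row
    return None

print(row_at(rows, datetime(2014, 1, 1))["arch"])  # prints "i586"
```

Because nothing is ever thrown away, queries like this can reach arbitrarily far back into the history of a profile.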
Have a look around!
This is all I can say right now. There is so much information in there. Open up a page and have a look at it. Really look at it. There is so much to find. Do not forget to share it.
If you see some room for improvement (I am sure there is), we are interested in having someone work more on the service. The backend part is done now and is working really well, so there are lots of possibilities on the frontend that can be realised now.