Using network flow data for network operations, performance, and security management is a big-data problem: we're talking about collecting, processing, and storing a large amount of data. Modern relational database management system (RDBMS) technology works very well for processing and mining all that data. Some thought and engineering, however, are needed to get the most benefit.
When using RDBMS technology to support flow data auditing and analysis, the primary issue is database performance versus data arrival rate. Can the RDBMS keep up with inserting the flow data from the sensors while also serving queries against the data? For most companies and corporations, and for some universities and colleges, using an RDBMS like MySQL to manage and process all the primitive data from their principal argus sensors (or Netflow data sources) works well. Many database systems running on contemporary PC technology (dual/quad core, 2+GHz Intel, 8-16GB memory, 1+TB disk) can handle around 500-1000 argus record insertions per second, which is around 50-85M flows per day (around 70GB of flow data per day), and still have plenty of cycles to handle many standard forensic queries against the data, like "show me all the activity from this network address in the last month".
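That kind of forensic query can be run directly against the database with rasql(). This is only a sketch: the database name, table name, user, and address below are placeholders, and the relative time syntax assumes your argus-clients build supports it.

```shell
# Hypothetical setup: flow records were inserted into the MySQL
# database "argusdata", table "argusTable", readable by "user".
# Pull the last month's activity for one address and print it
# with ra's default formatting. The filter after "-" uses the
# standard argus/BPF filter syntax.
rasql -r mysql://user@localhost/argusdata/argusTable \
      -t -1m \
      - host 192.168.0.68
```

Because rasql() emits standard argus records, its output can also be written with -w and fed to any other ra* program for further aggregation or sorting.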
For larger sites, such as member universities of Internet2, where the flow record demand can get into the 20-50K flows per second range, there are databases that can keep up.
When reading from the network, argus clients are normally expecting Argus records, so we have to tell the ra* program that the data source and format are Netflow, what port, and optionally, what interface to listen on. This is currently done using the "-C [host:]port" option.
If the machine ra* is running on has multiple interfaces, you may need to provide the IP address of the interface you want to listen on. This address should be the same as that used by the Netflow exporter.
thoth:tmp carter$ ra -r /tmp/ra.netflow.out
   StartTime  Proto  SrcAddr  Sport   Dir  DstAddr  Dport  SrcPkt  DstPkt  SrcBytes  DstBytes
12:34:31.658 udp 192.168.0.67.61251 -> 192.168.0.1.snmp 1 0 74 0
12:34:31.718 udp 192.168.0.67.61252 -> 192.168.0.1.snmp 1 0 74 0
12:35:31.848 udp 192.168.0.67.61253 -> 192.168.0.1.snmp 10 0 796 0
12:35:31.938 udp 192.168.0.67.61254 -> 192.168.0.1.snmp 1 0 74 0
12:35:31.941 udp 192.168.0.1.snmp -> 192.168.0.67.61254 1 0 78 0
12:35:31.851 udp 192.168.0.1.snmp -> 192.168.0.67.61253 10 0 861 0
thoth:tmp carter$ racluster -r /tmp/ra.netflow.out
   StartTime  Proto  SrcAddr  Sport   Dir  DstAddr  Dport  SrcPkt  DstPkt  SrcBytes  DstBytes
12:34:31.658 udp 192.168.0.67.61251 -> 192.168.0.1.snmp 1 0 74 0
12:34:31.718 udp 192.168.0.67.61252 -> 192.168.0.1.snmp 1 0 74 0
12:35:31.848 udp 192.168.0.67.61253 -> 192.168.0.1.snmp 10 10 796 861
12:35:31.938 udp 192.168.0.67.61254 -> 192.168.0.1.snmp 1 1 74 78
Using a database for handling argus data provides some interesting solutions to some interesting problems. racluster() has been limited in how many unique flows it can process because of RAM limitations. rasqlinsert() can solve this problem: it can do the same aggregation as racluster(), but uses a MySQL table as the backing store rather than memory. Programs like rasort(), which read in all the argus data, use qsort() to sort the records, and then output the records as a stream, have a similar scaling issue, in that you need enough memory to hold all the binary records.
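As a sketch of the rasqlinsert() approach, the command below aggregates live Netflow data into a MySQL table using the same kind of aggregation keys racluster() uses. The port, database, table, and user names are all hypothetical; adjust them to your installation.

```shell
# Aggregate Netflow records arriving on UDP port 9996 into the
# MySQL table "flows" in database "argusdata". The -m option
# lists the aggregation keys (as with racluster), and the MySQL
# table, not process memory, holds the working set of flows.
rasqlinsert -C 9996 \
    -m saddr daddr proto sport dport \
    -w mysql://user@localhost/argusdata/flows
```

Because the flow cache lives in the database, the number of unique flows that can be tracked is bounded by disk rather than by RAM, which is exactly the limitation this approach removes.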