Aaron Davis

The government and big data: Use, problems and potential

March 21, 2012 12:56 PM EDT

When it comes to managing data, government agencies have long faced the same issue. From national intelligence to the IRS, from the U.S. Census to local municipalities, massive amounts of data sit in agency computer systems. Much of that information is unstructured, meaning it does not fit into a pre-defined data model.

To find the patterns hidden in that information, government agencies apply statistical models to large quantities of unstructured data. The result is a movement called big data, which seeks to capture and process vast amounts of unstructured data. Public agencies have not traditionally had the human capital or computational capacity to manage and analyze all of their data, and given the shifting nature and exponentially growing volume of that data, cloud-enabled big data tools are essential. Because government data is global in nature, big data must also account not only for different types of data but for the multi-lingual character of much of the data collected today. As a result, translation technologies are a key component in effectively managing, sorting and distributing the unstructured data encountered in the public sector.

One of the most common tools used by government organizations is MapReduce, a programming framework introduced by Google in 2004 that supports distributed computing on large data sets across clusters of computers. MapReduce makes it possible to generate statistics from large, unstructured data sets, in essence distilling manageable structured data from unstructured data. As a result, agencies today can process large data sets quickly on cloud-based distributed systems. Where traditional analysis of unstructured data was limited to a single computing node, the cloud model enables multiple computing nodes to each work on a portion of the data in parallel. MapReduce implementations like Hadoop bring the complexity of parallel computation on distributed systems within reach.
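The map-then-reduce pattern described above can be sketched in a few lines. This is a minimal single-machine illustration of the idea (word counting, the canonical example), not the Hadoop API itself: each "shard" stands in for the portion of data a cluster node would process in parallel.

```python
# Minimal sketch of the MapReduce pattern: word counting.
# Each shard stands in for the data assigned to one compute node.
from collections import Counter
from functools import reduce

def map_phase(shard):
    """Map: emit a partial word count for one shard of raw text."""
    return Counter(shard.lower().split())

def reduce_phase(partials):
    """Reduce: merge the partial counts from every node into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())

shards = [
    "census forms election forms",
    "census data tax forms",
]
partials = [map_phase(s) for s in shards]  # on a cluster, these run in parallel
totals = reduce_phase(partials)
print(totals["forms"])  # 3
```

Because each map call touches only its own shard, the work parallelizes naturally; the reduce step is the only point where results must be brought together, which is what lets frameworks like Hadoop scale the same logic across many machines.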

How the Government Mines Big Data

Many agencies use MapReduce clusters and analytics across the board to generate statistics that help them understand regional, local and global patterns and trends. I've run across a number of examples in my work with government clients. For one, local and state governments use big data to track and analyze usage patterns for their services so they can develop better, more effective services. Another example involves sorting citizen-oriented documents. Large numbers of census forms, IRS forms, election forms and many other official documents in different languages must be collected and managed. The government must not only ensure that people of diverse cultures can read important forms and information, but must also collect the incoming information from a diverse, multi-lingual citizenry. More and more, technology helps to sort and translate this information.

Another language-based example involves Iraq, where there is a huge undertaking to translate rule-of-law documents into Arabic so that Iraq can have a model for its own justice system. Additionally, Defense Advanced Research Projects Agency's (DARPA's) Broad Operational Language Translation (BOLT) program is a machine translation program that analyzes huge amounts of information in a variety of languages and formats, then determines which information is important for national security and military purposes. BOLT translates and sorts structured and unstructured data, including everything from foreign phrases to print documents and voice recordings to video.

Intelligence is also a huge global application of big data. On a foreign policy level, I've seen agencies use big data to gauge sentiment around overseas elections. Big data tools can analyze satellite images to find salient patterns that may be tactically significant. The advantage of big data here is its ability to handle such a broad swath of data types.

Problems Faced by the Government in its Usage of Big Data

Big data is still a relatively new concept, and although it has many advantages, its usage in the public sector faces challenges too. One is cost: many agencies pay a premium, both to internal teams and to external third parties, to manage their data. Data management can also be redundant if not properly set up. For example, an agency may translate documents and foreign social media feeds that, more often than not, are replicas of documents translated in the past. By coupling big data with compatible translation tools, the government can spend more efficiently.
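The redundancy problem above is commonly addressed with a translation memory: cache each translation under a hash of the document's content, so duplicates never trigger a second (expensive) translation. A hypothetical sketch, in which `translate()` is a stand-in for a real machine-translation call rather than any actual API:

```python
# Hypothetical sketch: skip re-translating duplicate documents by
# caching results under a content hash (a simple "translation memory").
import hashlib

cache = {}

def translate(text):
    # Stand-in for an expensive machine-translation call.
    return text.upper()

def translate_once(text):
    # Hash the content so identical documents share one cache entry.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:          # only pay for genuinely new content
        cache[key] = translate(text)
    return cache[key]

docs = ["form 1040", "form 1040", "census notice"]
results = [translate_once(d) for d in docs]
# translate() ran only twice; the duplicate hit the cache.
```

The same idea extends to sentence-level caching, which is how production translation memories recover cost even when whole documents are not exact replicas.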

Another challenge comes with the public sector's inherently regulatory nature. Regulations often don't account for the new, expanded capabilities that IT offers. While the commercial enterprise sector is still working to set up IT governance to manage its computing assets in a lean resource environment, the government already has all kinds of laws and regulations in place. Regulation can be so strict at times that it's hard to push forward with IT initiatives. Even with the enthusiasm and know-how in place, the approval process can bottleneck the speed at which new developments are deployed, sometimes leaving government agencies five to 10 years behind enterprise IT adoption.

Catching Up to the Open Cloud

With MapReduce clusters in place, the government is making a dent in its abundance of data. Looking to the near future, once approval processes catch up, government agencies plan to deploy open cloud environments, and they are creating positions dedicated to the science of intelligently managing their data. These steps will further speed up and simplify government IT systems and advance data management in the public sector. When it comes to big data, there is perhaps no bigger data source than the U.S. government itself.

Aaron Davis is Chief Technology Officer of Lingotek | The Translation Network and a leader in collaborative translation platforms. Aaron is at the forefront of the technology and theories enabling organizations to communicate and interact with a global audience.