Using Stream Computing Techniques to Process Big Quantities of Textual Information

Xabier Artola, Zuhaitz Beloki, Aitor Soroa


Computational power needs have grown dramatically in recent years. This is also the case in many language processing tasks, where very large quantities of documents must be processed within a reasonable time frame. This scenario has led to a paradigm shift in the computing architectures and large-scale text processing strategies used in the NLP field. In this paper we describe a series of experiments carried out in the context of the NewsReader project with the goal of analyzing the scaling capabilities of its language processing pipeline. We explore the use of Storm in a new approach to scalable distributed language processing across multiple machines and evaluate its effectiveness and efficiency for processing documents at medium and large scale. The experiments show that there is considerable room for improvement in processing performance when adopting parallel architectures, and that even better results might be expected from large clusters with many processing nodes.


This work is licensed under a Creative Commons Attribution 3.0 License.