The CrossLingMind Project

Automated analysis of opinions in a multilingual context

Now that Internet users can produce content and interact, a massive amount of subjective text is published on sites such as blogs, social networks, information channels and consumer sites. Opinion Mining (OM) is the task of analyzing the opinions, sentiments or emotions expressed towards entities such as products, services, organizations and issues, and towards the various aspects of these entities. The two main OM approaches are (i) a machine learning approach, in which the system learns the polarity (positive or negative) of an opinionated segment from annotated examples, and (ii) a lexicon-based approach, based on rules involving opinion-bearing words and phrases, opinion shifters, contrary clauses (introduced by “but”, for example), etc. Thus, most OM systems involve three main types of resources and text: 1) the resources, such as annotated examples or lexicons, used to train or build the system, 2) the opinions to be analyzed and 3) the analysis output by the system, which may be, for example, a set of opinionated terms about the same topic or conveying similar opinions, forming semantic spaces.
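To make the lexicon-based approach concrete, here is a minimal sketch in Python. It is not the pipeline used in the project: the lexicon entries, the shifter and contrary-marker lists, and the weighting of the clause after “but” are all assumptions chosen for illustration.

```python
# Toy lexicon and rule lists; every entry is an illustrative assumption.
LEXICON = {"good": 1, "great": 2, "clean": 1, "bad": -1, "dirty": -1, "terrible": -2}
SHIFTERS = {"not", "never", "no"}   # opinion shifters
CONTRARY = {"but", "however"}       # markers introducing a contrary clause


def clause_score(tokens):
    """Sum lexicon scores, flipping the first opinion word that follows a shifter."""
    score, flip = 0, False
    for tok in tokens:
        if tok in SHIFTERS:
            flip = True
        elif tok in LEXICON:
            score += -LEXICON[tok] if flip else LEXICON[tok]
            flip = False
    return score


def polarity(text, contrary_weight=2):
    """Return 'positive', 'negative' or 'neutral' for a short text."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    # Position of the last contrary marker, or None if there is none.
    split_at = max((i for i, t in enumerate(tokens) if t in CONTRARY), default=None)
    if split_at is None:
        total = clause_score(tokens)
    else:
        # Assumption: the clause after the marker weighs more than the one before.
        total = (clause_score(tokens[:split_at])
                 + contrary_weight * clause_score(tokens[split_at + 1:]))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```

With this toy lexicon, “The room was not clean” comes out negative because the shifter flips “clean”, and “The location was good, but the room was dirty” comes out negative because the clause after “but” is given double weight.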

The multilinguality of the Internet and the globalization of products and services create situations in which these three types of resources are not all in the same language. In these situations, a language transfer is needed at some point to perform the opinion analysis (or to understand its results); the task is then called cross-lingual opinion mining (CLOM). For example, we may want to analyze opinions in Spanish but have annotated examples only in English. Conversely, we may have all the resources necessary to build an OM system in Spanish, but need the analysis output in English.

The CrossLingMind project aimed to develop a CLOM system. The plan was to achieve this objective in several steps, including the language transfer of relevant terms with a focus on their context, the adaptation of sentiment lexicons (lexicons of opinion-bearing words), the mapping of semantic spaces across languages, and the study of the best architecture for the final CLOM system. The ultimate goal was to develop a system that is useful in real life and meets industrial interests and requirements.

To be useful in the most common settings, in which the opinions are generated by social media users, the monolingual opinion mining pipeline existing at the host institution was first adapted to this type of content and to languages of high interest for potential industrial partners. Along these lines, research on processing user-generated content in social media was conducted and published in the Natural Language Engineering journal (“Selection of correction candidates for the normalization of Spanish user-generated content”). Owing to the high commercial interest in the Portuguese language, an opinion mining pipeline in this language was also built (and presented at LREC 2014: “Adapting Freely Available Resources to Build an Opinion Mining Pipeline in Portuguese”). Finally, the monolingual opinion mining system was improved and ranked 5th in the constrained track of task 2A on polarity classification at SemEval 2013.

In parallel, a thorough study of the state of the art in cross-lingual opinion mining was performed. This study was situated in the broader context of cross-lingual algorithms and applications used in natural language processing. It gave rise to a tutorial on Cross-Language Text Mining given at the ESWC 2013 conference, a course on Cross-Language Algorithms and Applications at the ESSLLI 2014 summer school and a special track on Cross-Language Algorithms and Applications in the JAIR journal. This last publication takes an even broader view, encouraging research on deeper analogies between crossing language barriers and other modalities of perception, as well as efforts toward unifying themes and paths to develop the science of multilingualism.

A part of the project was devoted to the generation of sentiment lexicons in a new language. We generated a lexicon of sentiment words (the LIWC lexicon) in Catalan via triangulation from similar lexicons existing in other languages. This work was presented at AIRS 2013 in Singapore (“Generating New LIWC Dictionaries by Triangulation”).
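The idea of triangulation can be sketched as follows. This is only one plausible reading, not the published procedure: it assumes that each source lexicon maps words to category labels, that a bilingual dictionary into the target language is available for each source language, and that a target word keeps a category only when both source languages agree on it. All function names and data below are hypothetical.

```python
from collections import defaultdict


def project(lexicon, bilingual_dict):
    """Project word -> categories through a source -> target dictionary."""
    projected = defaultdict(set)
    for word, categories in lexicon.items():
        for target_word in bilingual_dict.get(word, ()):
            projected[target_word] |= set(categories)
    return projected


def triangulate(lex_a, dict_a, lex_b, dict_b):
    """Keep a (target word, category) pair only if both source languages support it."""
    proj_a, proj_b = project(lex_a, dict_a), project(lex_b, dict_b)
    result = {}
    for word in proj_a.keys() & proj_b.keys():
        shared = proj_a[word] & proj_b[word]
        if shared:
            result[word] = shared
    return result


# Toy data, invented for illustration only.
lex_en = {"happy": {"posemo"}, "sad": {"negemo"}, "light": {"posemo"}}
lex_es = {"feliz": {"posemo"}, "triste": {"negemo"}, "ligero": {"motion"}}
en_to_ca = {"happy": ["feliç"], "sad": ["trist"], "light": ["lleuger", "llum"]}
es_to_ca = {"feliz": ["feliç"], "triste": ["trist"], "ligero": ["lleuger"]}

catalan = triangulate(lex_en, en_to_ca, lex_es, es_to_ca)
```

Requiring agreement between two source languages filters out entries that arise from an ambiguous translation in only one of them: in the toy data, “lleuger” and “llum” are dropped because the two projections do not agree on any category.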

As for the development of the CLOM system, the first question was the granularity of the analysis. While most research on CLOM has considered the polarity of a whole sentence or document, in many cases this is not appropriate, since the same sentence or document can contain positive opinions towards some aspects and negative ones towards others. We thus developed an aspect-based CLOM system, that is, a system that performs the analysis at the level of the aspects of the entities about which opinions are expressed. To our knowledge, the CrossLingMind system is the first reported aspect-based CLOM system. Working at the aspect level creates an additional difficulty in a cross-lingual setting, since opinionated units may be reordered or even split and dispersed during translation. The system was trained on data from the OpeNER European project, annotated at the opinionated unit level in several languages (we used the Spanish and English data) in the hotel review domain. To perform the translation, we adapted the Moses statistical machine translation system to the hotel review domain via cross-entropy-based data selection methods. We designed a strategy to translate documents containing opinionated units while keeping track of these units in the translation output. This strategy consists of using the reordering restrictions available in Moses, which ensure that the words of an opinionated expression are not mixed with other words. In this way, the opinionated units are preserved across language transfer and can be recovered in the output. This method can also be used to map the semantic spaces resulting from the opinion analysis. We also built a context-based statistical machine translation system (Haque et al. 2011) to try to improve the translation of relevant words by taking their context into account. However, we have not yet achieved any improvement with this technique, which also has the drawback of being very memory-intensive, making it difficult to include in a practical system.
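As an illustration of how opinionated units might be protected before decoding, the sketch below wraps each annotated span in the `<zone>` markup that the Moses documentation describes for reordering constraints. This summary does not state which Moses mechanism the project used, so the choice of zones, the function name `wrap_zones` and the span format are assumptions.

```python
def wrap_zones(tokens, spans):
    """Wrap each opinionated span [start, end) in <zone> ... </zone> markup.

    `tokens` is a tokenized source sentence, assumed to be already escaped
    for the decoder. `spans` is a list of non-overlapping (start, end)
    token offsets, with `end` exclusive.
    """
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append("<zone>")
        out.append(tok)
        if i + 1 in ends:
            out.append("</zone>")
    return " ".join(out)


# Hypothetical hotel-review sentence with two annotated opinion expressions.
sentence = "la habitación era muy limpia pero el desayuno era pobre".split()
marked = wrap_zones(sentence, [(3, 5), (9, 10)])
```

Here `marked` is the string `la habitación era <zone> muy limpia </zone> pero el desayuno era <zone> pobre </zone>`. A zone asks the decoder to translate the enclosed words without interleaving them with outside material, which is the property needed to recover each unit in the output.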

Finally, we tested several CLOM architectures and found that the best architecture depends on the particular data set and on the language direction. We performed experiments in Spanish and English. Assuming that we want to analyze opinions in a language L but lack an opinion mining system in this language, we use resources in the other language to perform the analysis. Depending on the architecture and source language used, the opinion classification accuracy of the CLOM system was only 4% to 10% lower than that of the baseline (the monolingual opinion mining pipeline in language L), which supports its usefulness.

In conclusion, the CrossLingMind project has shown that aspect-based cross-lingual opinion mining is achievable and obtains competitive classification results. It should thus encourage researchers not to restrict their work to sentence- or document-level cross-lingual opinion mining, and to work at the aspect level as well. The CrossLingMind system was designed to be used in real-life settings and has the potential to be part of a technology transfer framework and eventually to be used by the general public. It will be useful for groups interested in knowing the opinions of clients or users who may speak different languages, such as companies interested in feedback about their products, political organizations wishing to better understand their potential supporters, or institutions wanting to know the opinion of the general public about them or about issues they are involved in. As such, the project will contribute to breaking down the language barriers that both hinder communication between citizens and fragment the exploitable data and the digital market, especially in Europe.

A demonstrator of the final system is available at the following website:

This project has received funding from the Seventh Framework Program of the European Commission through the Intra-European Fellowship (CrossLingMind-2011-300828) Marie Curie Action.