Jul 14, 2019

spaCy - a glimpse of the community’s first conference

Necessity is the mother of invention. Laziness is not, and it does not usually create an entire community. Yet in the case of Matthew Honnibal, a linguist by training, it was laziness about learning C++ that started things: he needed to build language models for his post-doctoral research and took a new path, using Cython instead. The first thing he pushed to GitHub in July 2014 was a routine for tokenisation, aptly called ‘spaCy’.

Five years on from that moment, a community of 200 from all corners of the world gathered in a historic theatre in Berlin to exchange knowledge on the spaCy library, which has become a staple in the field, and on the development of natural language processing at large. Michal and I, two data scientists from Signal, were lucky to be amongst this community for its first-ever conference: spaCy IRL.

Three sessions spanned all aspects of spaCy and natural language processing in general, from research and development to the practicalities of industry applications. Each speaker was honoured with a heartfelt introduction from Matthew Honnibal and his spaCy co-founder Ines Montani, a genuine tribute to the impact those people had on the development of NLP solutions and this remarkable open-source project.

Visionary Talks

“What are the missing elements in NLP right now?” asked Yoav Goldberg (NLP researcher at Bar-Ilan University, Israel, and Research Director at AI2) rather philosophically. Building on his long experience in the field, he gave an overview of how natural language processing as we know it today evolved from rule-based to corpus-based approaches up to the early 2000s, and from there to machine learning. There is a sharp contrast with industry reality, he pointed out: while researchers keep improving on the latest deep learning model, applied NLP tasks often don’t go beyond n-grams, TF-IDF and regex. The missing elements he discussed could close the gap between symbolic NLP and the power of deep learning, and he advocated tight human-machine interaction to improve the process of continuous learning.

A more recent star on the NLP stage, Sebastian Ruder, who made a huge impact on the field through his PhD work on transfer learning applied to NLP, provided both an unassuming introduction to transfer learning and a vision of where he sees this technique going in the near future. Data and code are already being shared, but he envisions a rise in sharing models via the model hubs of TensorFlow and PyTorch.

Added functionality in spaCy

Giannis Daras built on Sebastian Ruder’s talk by asking proactively, “So when will spaCy support BERT?” Despite being outperformed by newer models, BERT has made big waves in ML in recent months, for example by beating humans on a popular question-answering task, but it’s slow. spaCy has always been built with speed in mind, so that you can run your analysis even locally if need be. With very compelling graph visuals, Giannis explained how a simplification of the full attention matrix can make sparse transformer models faster while compromising little on performance.
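
To make the idea of simplifying the attention matrix concrete, here is a minimal sketch in plain Python. It assumes a simple local-window pattern, where each token only attends to its immediate neighbours; the sparsity pattern actually shown in the talk is not described here, so treat `local_mask` and its `window` parameter as illustrative names rather than anything from spaCy or BERT.

```python
def full_mask(n):
    """Dense attention: every token may attend to every token (n * n pairs)."""
    return [[True] * n for _ in range(n)]


def local_mask(n, window):
    """Assumed sparse pattern: token i attends to token j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def count_pairs(mask):
    """Number of (query, key) pairs for which attention scores are computed."""
    return sum(sum(row) for row in mask)


n = 8
dense = count_pairs(full_mask(n))
sparse = count_pairs(local_mask(n, window=1))
print(dense, sparse)  # 64 22
```

The dense count grows quadratically with sequence length, while the windowed count grows roughly linearly, which is where the speed-up of a sparse pattern would come from.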

Interestingly, Giannis joined spaCy through a ‘Google Summer of Code’ to adapt its algorithms to his native language, Greek. One advantage of a growing open-source community in the field of NLP is the potential for expert contributions from multiple language backgrounds. Like Giannis, Guadalupe Romero hit limits when applying the English models to her native Spanish, particularly in the area of lemmatisation. Lemmatisation can be done with CNNs, but it is faster with rules. In English a handful of lemmatisation rules are sufficient, she explained, but Spanish, with its conditional verb forms adapted to both gender and plurality, needs more than 1000. German brings similar complexity through its adaptation of nouns and adjectives to cases. Guadalupe presented her work on rule-based lemmatisation for these two languages, laying the groundwork to do the same for more languages to come.
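
As a rough illustration of what rule-based lemmatisation means, the sketch below applies a few suffix rules plus an exception table to Spanish nouns and verbs. The three rules and two exceptions are toy assumptions chosen for this example; they are not Guadalupe’s rule set and this is not spaCy’s lemmatiser API.

```python
# Toy suffix rules, tried in order (longest suffix first). Illustrative only.
SUFFIX_RULES = [
    ("ces", "z"),  # luces -> luz
    ("es", ""),    # ciudades -> ciudad
    ("s", ""),     # casas -> casa
]

# Irregular forms that no suffix rule can recover.
EXCEPTIONS = {"soy": "ser", "voy": "ir"}


def lemmatise(token):
    """Return a lemma using the exception table first, then the suffix rules."""
    word = token.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least two characters before stripping a suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


for token in ["luces", "ciudades", "casas", "soy"]:
    print(token, "->", lemmatise(token))
```

Even this tiny rule set already gets words like “lunes” wrong (it would return “lun”), which hints at why a realistic Spanish rule set runs to more than a thousand entries.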

Plurality in language isn’t the only challenge - “domain differences matter in NLP”, according to Mark Neumann, a researcher at Allen AI. He presented a spaCy pipeline for analysing scientific and biomedical text (ScispaCy). Biomedical text is full of non-standard vocabulary, compounds, molecules, chemical reactions and abbreviations, which makes general-vocabulary pipelines suboptimal. The good news is that the biomedical community has built a vast number of structured knowledge bases that link domain-specific words and phrases to their definitions. Beyond giving science a new NLP tool, he shared general advice on how to adapt spaCy for other specific domains.

Another recent extension to spaCy is its entity linking functionality. Sofie Van Landeghem, an NLP consultant from Belgium, presented her work in collaboration with spaCy on grounding textual mentions to knowledge-base concepts. This will not only improve NER through linking with Wikidata but also include the ability to train custom relationships with your own knowledge base. This problem space specifically rang a bell for us here at Signal Research, as it may offer an approach to resolving long-tail entities in news texts.

The Industry View

Although development, technology and application didn’t feel like separate topics at this conference, four talks focused more on applied NLP. McKenzie Marshall and Patrick Harrison spoke about their NLP workflows in asset management (Barings) and finance (S&P Global) respectively. In both cases the challenge is to identify relevant entities, mostly companies, from external sources to give the client a competitive advantage in predicting the market. News articles are a frequent source for asset managers - one of their challenges beyond NER is interpreting the journalistic voice to decide whether something or someone is being talked about positively or negatively: “In the world of traders, these things are often conditional…”, Marshall explained. At S&P Global things are rather absolute: We have a “100% precision, 100% recall rule”, Patrick Harrison said boldly about their data standards. It sounds nonsensical from a data science perspective, but he explained how a workflow with a human in the loop, including as a final check, allows them to get close to that target.
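
For readers less familiar with the two metrics, the following sketch shows why a human check can move a model’s output towards that target. The company names and the reviewer’s corrections are made up for illustration, and nothing here reflects S&P Global’s actual workflow.

```python
def precision_recall(predicted, gold):
    """Precision: share of predictions that are correct.
    Recall: share of gold entities that were found."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical data: entities that should be found vs. what a model returned.
gold = {"Acme Corp", "Globex", "Initech"}
model_output = {"Acme Corp", "Globex", "Springfield"}

# A hypothetical reviewer removes the false positive and adds the missed entity.
reviewed = (model_output - {"Springfield"}) | {"Initech"}

print(precision_recall(model_output, gold))  # both metrics are 2/3
print(precision_recall(reviewed, gold))      # (1.0, 1.0)
```

The model alone finds two of three entities and makes one wrong prediction; only after the human pass do both numbers reach 1.0.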

Industry and customer expectations often seem difficult to align with scientific reasoning - a theme hilariously illustrated by Peter Baumgartner with client personas such as “Show-off Sarah” or “Labeling Larry”. In “Applied NLP: Lessons from the Field” he talked about identifying the problem and successful product delivery. In what he calls ‘trail-blazing’, Baumgartner made a passionate case for sharing lessons from both research and applied NLP in blogs, with practical and open-source examples based on public data.

Conferences around a specific technology, such as a library like spaCy, are often either a closed nerdfest for developers or a business networking event for users without much technical depth. spaCy IRL was different. The nature of spaCy as an open-source library attracted a multifaceted crowd: users from industry and academia who are at the same time contributors to the project, its core or its extension libraries. The conference was accessible to all, from the casual NLP practitioner to anyone who wanted to discuss the latest performance improvement in a specific component. I came away impressed with the power of open-source development, which allowed a knowledge community to grow within five years from a tokeniser built by a researcher to a full NLP framework that is snowballing into many different domains and becoming increasingly multilingual!

The full list of talks at spaCy IRL is on Explosion’s YouTube channel; slides are linked in the comments.