25 Interviews for the FNP’s 25th Anniversary: Tomasz Szwelnik, author of an innovative speech recognition technology, talks to Aleksandra Stanisławska

The Foundation for Polish Science (FNP) celebrates its 25th anniversary this year. To mark the occasion, we have invited 25 beneficiaries of our programmes to tell us about how they “practise” science. What fascinates them? What is so exciting, compelling and important in their particular field that they have decided to devote a major part of their lives to it? How does one achieve success?

The interviewees are researchers representing many very different fields, at different stages of their scientific careers, with diverse experience. But they have one thing in common: they practise science of the highest world standard, they have impressive achievements to their credit, and their extensive CVs include different kinds of FNP support. We are now launching the series; further interviews will appear regularly on the FNP website.

Enjoy reading!

Innovative Ideas Need the Right Climate



Tomasz Szwelnik / private archive

ALEKSANDRA STANISŁAWSKA: Your field is speech recognition. Where did you get the idea?

TOMASZ SZWELNIK: I’d long been interested in speech processing and digital signals related to music. I graduated in sound engineering from the Gdańsk University of Technology and my thesis concerned, among other things, removing noise from old records using neural networks, which also means processing sound. It turned out that what had fascinated me years before was useful in developing the foundations of VoiceLab. Actually, my fascination with this subject continues.

Many have tried to develop speech recognition technologies, but few have been successful, especially as far as Polish is concerned. Why is it so difficult?

Polish is harder for speech recognition than English, for example. The reason is that it has more words, and they are inflected, so more data are needed to recognize a language of this kind. Many words are also harder to pronounce because of the language’s special sounds. Since we at VoiceLab have already coped with Polish, we are moving on to other languages: we have demo versions for German and English, and soon there will be more.

VoiceLab has strong competition from large corporations like Apple, Microsoft and Google which are developing their own speech recognition systems. Doesn’t this discourage you?

When we first started, some people said we’d bitten off more than we could chew. After conducting many experiments I decided that VoiceLab was able to take on such a challenge. This certainty on my part was also linked to technological progress: earlier speech recognition systems had come up against a barrier related to hardware and access to data. But technical infrastructure has improved with time. Servers have become faster, and access to larger amounts of memory and to speech recordings is easier. I have tested the speech recognition systems developed by the competition and found that not everything works as it should, and that we can achieve a great deal more, especially in continuous speech recognition.

What innovations have you introduced compared to the rival systems?

What distinguishes us from the competition is that we are better able to set up a speech recognition system dedicated to a given sector, e.g. medicine or finance, whether it runs on a computer or device not connected to the internet, in a mobile application or on a server. In addition, unlike Google’s system for example, which transmits its data to servers somewhere in California, our system can operate without being connected to a data cloud. This means, for instance, that sensitive banking data stay where they are, on the client’s local servers, which ensures greater security.

How does VoiceLab’s speech recognition system work from the technical side?

It’s quite a complicated process. First we turn speech, which is an analogue signal, into digital form. Next the system analyses specific parameters of the speech, its characteristic features. The software checks many thousands of hypotheses, searching the samples to find those that are the best fit for the sequence being processed, according to the acoustic model and the language model. The decoder selects the best-fitting hypothesis, turning the sound waves of speech into a series of letters. We are also testing a new approach to speech recognition in which we treat the representation of a speech signal as a picture. We don’t identify specific parameters of the sound but analyse a graph of its characteristics, i.e. a spectrogram. We treat it as a set of pixels from which we try to identify the typical features of a given language sound. For both types of speech processing we use deep neural networks, which learn speech recognition from a great many speech recordings together with their transcriptions, i.e. what was said.
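To make the idea of a spectrogram as a "picture" more concrete, here is a minimal sketch in Python. It is an illustration only, not VoiceLab's implementation: the function names, the frame length, the hop size and the synthetic test tone are all assumptions chosen for the example, and a naive Fourier transform is used so that nothing beyond the standard library is needed.

```python
import cmath
import math

# Illustrative values only; none of these come from the interview.
SAMPLE_RATE = 8000   # samples per second
FRAME_LEN = 64       # samples per analysis frame
HOP = 32             # step between the starts of consecutive frames
TONE_HZ = 1000       # frequency of the synthetic test tone


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Apply a Hann window, then return naive DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * k / (n - 1)))
                for k, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * k / n)
                    for k, x in enumerate(windowed)))
            for b in range(n // 2 + 1)]


def spectrogram(samples, frame_len=FRAME_LEN, hop=HOP):
    """Return a 2-D list: one row per time frame, one column per frequency bin."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# A pure sine wave stands in for a speech recording.
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(signal)

# The brightest "pixel" in the first row marks the dominant frequency.
peak_bin = max(range(len(spec[0])), key=lambda b: spec[0][b])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames x", len(spec[0]), "bins; peak at", peak_hz, "Hz")
```

Each row of `spec` is one slice of time and each column one frequency band, so the whole structure can be handed to an image-style neural network as a grid of pixel intensities. A production system would use a fast Fourier transform and real recordings rather than this naive transform and test tone.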

How are speech samples for such systems used and collected? To gather large amounts of voice data, you need an enormous database.

This process is called acoustic model building. It’s true that you need a lot of hardware. We use many servers and thousands of processors. The biggest processor resources are needed for acoustic model training. To speed this up, we apply a technology that uses graphics cards for processing and numerical calculation. This is one of the newest trends in the field, and gives us multi-fold acceleration compared to classical data processing methods. It enables us to conduct certain experiments and optimize the parameters of our system. In building speech recognition systems we have placed great emphasis on gathering speech sample recordings – we collected them from over 6,000 people for the Polish language alone. For the system to work efficiently, you need recordings from different sources, so we also gather data from sources like YouTube, court or parliamentary recordings, radio and television. The bigger the set of samples, the better the speech recognition system works.

Is recognition of continuous speech, which is what your company specializes in, harder than recognition of voice commands like those we know from smartphones with Android and iOS systems, for example?

With command recognition, you have to define a specified number of words, phrases and their different combinations that the decoder interprets. In the case of continuous speech recognition, it’s more complicated. First of all, you have to create a much larger dictionary accounting for more casual use of words. Secondly, such a stream of freely enunciated words is analysed with the help of a language model. It takes into account the probability of certain words appearing next to each other depending on the vocabulary being used. A different set of words will be used in legal language than in medical or banking language.
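The role of the language model can be sketched with a toy bigram model in Python. This is a simplified illustration under stated assumptions, not the model VoiceLab uses: the two tiny corpora, the function names and the add-one smoothing are all invented for the example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count context words and word pairs over whitespace-tokenised sentences."""
    contexts, pairs, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])            # every token that has a successor
        pairs.update(zip(tokens, tokens[1:]))   # adjacent word pairs
    return contexts, pairs, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log probability of a sentence under a bigram model."""
    contexts, pairs, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(math.log((pairs[(a, b)] + 1) / (contexts[a] + vocab_size))
               for a, b in zip(tokens, tokens[1:]))


# Hypothetical miniature corpora standing in for domain-specific text.
banking = train_bigram([
    "transfer twenty zlotys to my savings account",
    "check the balance of my account",
    "transfer money to jan kowalski",
])
medical = train_bigram([
    "the patient has a high fever",
    "check the blood pressure of the patient",
    "the patient needs a transfusion",
])

# Two acoustically similar hypotheses: the banking model prefers the first.
likely = log_prob("transfer money to my account", banking)
unlikely = log_prob("transfer honey to my account", banking)
print(likely > unlikely)

# The same sentence is far more probable under the matching domain model.
sentence = "check the blood pressure of the patient"
print(log_prob(sentence, medical) > log_prob(sentence, banking))
```

In a real recognizer these language model scores would be combined with acoustic scores for each hypothesis, and the model would be trained on far larger text collections, typically with longer word histories or neural networks rather than simple pair counts.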

In which field are speech recognition tools used the most today?

One such area is voice banking, that is, issuing voice commands to a banking application in naturally flowing speech and conducting a dialogue with it. Users can communicate with the interface in the same way they would with Siri, Cortana or Google Now. Thanks to this you can make a bank transfer with one phrase, without a lot of clicking, by saying “Transfer 20 zlotys to Jan Kowalski tomorrow”. In this case, voice biometrics is also useful, as it identifies the person making the transfer and verifies their right to access the bank account. We have implemented our VoiceBanking system at Meritum bank and Smart bank and are getting ready to implement it at further banks. We are also working on speech recognition using the VoiceLab Analyze system, which automates the work of people in call centres, analysing thousands of hours of recorded conversations to find key phrases, categorize topics and recognize emotions. One of our latest solutions is VoiceLab Dictate, a dictation programme that will soon be available in shops together with an Olympus voice recorder. This software received a Gold Medal at the INTARG 2016 innovation fair, which shows how advanced our Polish technology is compared to other solutions at home and abroad.

How does VoiceLab deal with voice recognition when someone is hoarse or has a cold? That must be a challenge for the software.

Our system is based on recognizing parameters characteristic of the vocal tract, which is unique for every person. It will identify the voice even when it is slightly changed due to a cold. But if the vocal tract is seriously changed, voice identification could fail.

I see that the range of your company’s operations is really wide. What were the beginnings of VoiceLab like?

They go back to 2009, when I became a beneficiary of the Foundation for Polish Science’s INNOVATOR programme. That was the first serious financial injection for VoiceLab at its early stage, helping us move forward with implementing our product. The award also included training support, which was invaluable at that early stage of development. Thanks to the funding, we managed to check various concepts and technical ideas, enabling our product’s quality to improve significantly. The support also started an avalanche of further actions to develop the business. We applied for a subsidy from the Innovative Economy Operational Programme and received more funding. Then we found a private investor – Jacek Kawalec, co-founder of Wirtualna Polska, who is with us to this day. Thanks to these financial injections we could finally stop thinking about the company’s daily upkeep and concentrate on developing the team and the product we were working on. Today our main topic of interest is the development of deep neural networks which are the foundation of the systems we design and implement.

Are the tools available in Poland to provide financial support to innovative entrepreneurs sufficient encouragement for a budding business?

The funds that VoiceLab received from the Foundation for Polish Science in the INNOVATOR competition and from the Innovative Economy Operational Programme really propelled its development at the initial stage. Things got worse at later stages of our project, when we ran into problems with financial liquidity. Banks weren’t willing to give us a loan, but luckily we were able to take advantage of a loan fund for start-ups. So more stable financing for research would be useful for companies like ours that spend a lot on innovation. I think there are still too few financial instruments for this in Poland. Heavy bureaucracy in the process of obtaining funding is also a problem, especially as regards EU funding. As a result, innovative entrepreneurs sometimes drown in paperwork instead of devoting most of their energy to research and product development. The length of the decision-making process in assigning funds to companies in our country also leaves a lot to be desired. If it takes six months, for instance, the world can move forward in that time, overtaking the technology outlined in the application.

The field your company works in requires unusual specialization. How did VoiceLab build its team?

Our emphasis was on setting up an employee-friendly company that would attract members of the scientific community, specialists in their fields. We collaborate closely with the Gdańsk University of Technology, whose students and graduates come to work for us and are eager to develop their own projects here, including as part of their theses. We attract young, talented people who aren’t interested in working for a big corporation. With us, they can develop new technologies virtually from scratch. We give our people substantial freedom and the chance to see the effects of their work in practice. The start-up climate at the company helps: we have a lot of meetings, we gladly share our knowledge and allow employees to develop the interests they came with from their universities.

Is it easier to develop innovative products with such a corporate philosophy?

Arriving at specific working solutions means testing many hypotheses, conducting many experiments and making many mistakes before you reach your goal. All this is part of the development of innovative businesses. I once visited Silicon Valley and saw many companies there that operate exactly like this. Moreover, I think the quality of our solutions is no different from what they produce over there. I can safely say that we develop technologically advanced solutions of the highest global standard. And credit for this is due to the people who work with us and the creative atmosphere we all build together. Without that, it’s impossible to be successful in the new technologies sector.

TOMASZ SZWELNIK, founder and CEO of the VoiceLab company, a beneficiary of the FNP’s INNOVATOR programme (2008).