Nibity Speech To Text - Audio & Video

Speech To Text – Audio & Video


“Super Fast Speech To Text Account Set-up. 99%+ Accuracy. Great Price Packages. Collect on-the-go.”


  • Easy: Simply upload, get collection date and pay.
  • Quick: Get your transcript back in 24 hours, or next in queue.
  • Accessible: Easily access your account to check status and collect transcripts.
  • Affordable: Select one of our discount packages for even cheaper options.


Nibity is used to transcribe English language audio or video recordings for a wide range of subject matter and recordings from many different settings.

Clients use Nibity transcription for medical research, legal cases, academic lectures and speeches, general conversations, media interviews, marketing presentations, film captions, financial seminars, sermons, as well as conference events and police investigations, amongst others.

We also transcribe recordings of meetings, focus group research, dictations, teleconferences, one-on-one interviews, and phone calls.

We accept all popular file types and formats, as well as links to YouTube and Dropbox.


Get your prices here, live, now. Try to find a service that offers the same quality anywhere near our rates. You save, we serve.


We offer a standard speech to text format that includes:

  • The file name at the top of the page.
  • Two free timestamps per page.
  • Speaker identification for up to 3 speakers (if provided).
  • For 4+ speakers, each speaker is identified as SP.
  • Delivered as a Microsoft Word document.

You can also view a sample transcript on our home page.



  • REGISTER an account.
  • Upload your audio or video file or drop a link.
  • Add information for files to be transcribed.
  • View transcript collection date and, if happy, check out.

Watch our short video for more about the process.



USA Nibity Speech to Text
CA Nibity Speech to Text
AU Nibity Speech to Text
UK Nibity Speech to Text
SA Nibity Speech to Text
GLOBAL Nibity Speech to Text

Nibity Speech To Text is available throughout the United States, Canada, Australia, United Kingdom, Singapore, South Africa, as well as worldwide.



We have a wide variety of audio and video to text users that include legal and medical professionals, management, researchers, care workers, consultants, law enforcement officers, marketing agents, broadcast media, clergy, as well as speakers.

Nibity speech to text is also popular with users from universities and colleges, legal services, marketing companies, brand research groups, government departments, technology groups (often start-ups or NLP oriented), as well as private users.


Transcribers who work with us apply from around the world. Successful candidates are trained and managed by us to provide a professional transcription service that ensures the delivery of accurate transcripts. Where required, transcribers are also supported by our workflow automation tools and software solutions, at no risk to the accuracy of the transcript.


What Is Speech To Text?

Speech recognition, the automatic conversion of spoken language into text, is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT).

Some speech recognition systems require “training” (also called “enrolment”) where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person’s specific voice and uses it to fine-tune the recognition of that person’s speech, resulting in increased accuracy. Systems that do not use training are called “speaker independent” systems.

Speech recognition applications include voice user interfaces such as voice dialling (e.g. “call home”), call routing (e.g. “I would like to make a collect call”), domestic appliance control, keyword search (e.g. finding a podcast where particular words were spoken), simple data entry (e.g. entering a credit card number), and preparation of structured documents.

The term voice recognition refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person’s voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.

Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.

Raj Reddy was the first person to take on continuous speech recognition as a graduate student at Stanford University in the late 1960s. Previous systems required users to pause after each word. Reddy’s system issued spoken commands for playing chess. Around this time Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary.

Although DTW would be superseded by later algorithms, the technique carried on. Achieving speaker independence remained unsolved during this period.

  • 1971 – DARPA funded five years of Speech Understanding Research, a speech recognition research program seeking a minimum vocabulary size of 1,000 words. BBN, IBM, Carnegie Mellon and Stanford Research Institute all participated in the program. This revived speech recognition research after John Pierce’s letter.
  • 1972 – The IEEE Acoustics, Speech, and Signal Processing group held a conference in Newton, Massachusetts.
  • 1976 – The first ICASSP was held in Philadelphia, which since then has been a major venue for the publication of research on speech recognition.

In the 1970s, at CMU, Raj Reddy’s students James Baker and Janet M. Baker began using the Hidden Markov Model (HMM) for speech recognition. James Baker had learned about HMMs from a summer job at the Institute for Defense Analyses during his undergraduate education. The use of HMMs allowed researchers to combine different sources of knowledge, such as acoustics, language, and syntax, in a unified probabilistic model.

Fred Jelinek’s group at IBM independently discovered the application of HMMs to speech. This was controversial with linguists since HMMs are too simplistic to account for many common features of human languages. However, the HMM proved to be a highly useful way of modeling speech and replaced dynamic time warping to become the dominant speech recognition algorithm in the 1980s.

Dragon Systems, founded by James and Janet M. Baker, was one of IBM’s few competitors. The 1980s also saw the introduction of the n-gram language model. In 1987, the back-off model allowed language models to use n-grams of multiple lengths, and CSELT used HMMs to recognize languages (both in software and in specialized hardware processors, e.g. RIPAC). Much of the progress in the field is owed to the rapidly increasing capabilities of computers.

On earlier hardware, it could take up to 100 minutes to decode just 30 seconds of speech. Two practical products were:

  • 1987 – a recognizer from Kurzweil Applied Intelligence.
  • 1990 – Dragon Dictate, a consumer product.

AT&T deployed the Voice Recognition Call Processing service in 1992 to route telephone calls without the use of a human operator.

By this point, the vocabulary of the typical commercial speech recognition system was larger than the average human vocabulary. Raj Reddy’s former student, Xuedong Huang, developed the Sphinx-II system at CMU. The Sphinx-II system was the first to do speaker-independent, large vocabulary, continuous speech recognition and it had the best performance in DARPA’s 1992 evaluation.

L&H was an industry leader until an accounting scandal brought an end to the company in 2001. The speech technology from L&H was bought by ScanSoft which became Nuance in 2005. Apple originally licensed software from Nuance to provide speech recognition capability to its digital assistant Siri. In the 2000s DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and Global Autonomous Language Exploitation (GALE).

EARS participants included the University of Pittsburgh, Cambridge University, and a team composed of ICSI, SRI and the University of Washington. EARS funded the collection of the Switchboard telephone speech corpus containing 260 hours of recorded conversations from over 500 speakers. The GALE program focused on Arabic and Mandarin broadcast news speech. Google’s first effort at speech recognition came in 2007 after hiring some researchers from Nuance.

The recordings from GOOG-411, Google’s telephone-based directory service, produced valuable data that helped Google improve its recognition systems. Google Voice Search is now supported in over 30 languages. In the United States, the National Security Agency has made use of a type of speech recognition for keyword spotting since at least 2006. This technology allows analysts to search through large volumes of recorded conversations and isolate mentions of keywords.

Some government research programs focused on intelligence applications of speech recognition, e.g. DARPA’s EARS program and IARPA’s Babel program. In the early 2000s, speech recognition was still dominated by traditional approaches such as Hidden Markov Models combined with feedforward artificial neural networks. Today, however, many aspects of speech recognition have been taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter & Jürgen Schmidhuber in 1997.

Around 2007, LSTM trained by Connectionist Temporal Classification (CTC) started to outperform traditional speech recognition in certain applications. In 2015, Google’s speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which is now available through Google Voice to all smartphone users. The use of deep feedforward (non-recurrent) networks for acoustic modeling was introduced during the later part of 2009 by Geoffrey Hinton and his students at the University of Toronto and by Li Deng and colleagues at Microsoft Research. It began as collaborative work between Microsoft and the University of Toronto and was subsequently expanded to include IBM and Google (hence the “The shared views of four research groups” subtitle in their 2012 review paper).

In contrast to the steady incremental improvements of the past few decades, the application of deep learning decreased word error rate by 30%. This innovation was quickly adopted across the field. Researchers have begun to use deep learning techniques for language modeling as well. In the long history of speech recognition, both shallow and deep forms of artificial neural networks had been explored for many years. But these methods never won over the non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively. A number of key difficulties had been methodologically analyzed in the 1990s, including gradient diminishing and weak temporal correlation structure in the neural predictive models. All these difficulties were in addition to the lack of big training data and big computing power in those early days.

Hinton et al. and Deng et al. reviewed part of this recent history, describing how their collaboration with each other and then with colleagues across four groups (University of Toronto, Microsoft, Google, and IBM) ignited a renaissance of applications of deep feedforward neural networks to speech recognition. By the early 2010s, speech recognition, also called voice recognition, was clearly differentiated from speaker recognition, and speaker independence was considered a major breakthrough.

A 1987 ad for a doll had carried the tagline “Finally, the doll that understands you.”, despite the fact that the doll was described as one “which children could train to respond to their voice”.

In 2017, Microsoft researchers reached a historic human parity milestone in transcribing conversational telephony speech on the widely benchmarked Switchboard task. The speech recognition word error rate was reported to be as low as that of four professional human transcribers working together on the same benchmark, which was funded by the IBM Watson speech team on the same task.

Both acoustic modeling and language modeling are important parts of modern statistically based speech recognition algorithms. Language modeling is also used in many other natural language processing applications such as document classification or statistical machine translation.

Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal.

The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems), each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
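As a rough illustration of the per-state likelihood described above, the following sketch scores a feature vector under a mixture of diagonal-covariance Gaussians. The function names, the two-component mixture and its parameters are hypothetical values chosen for illustration, not taken from any particular recognizer.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a mixture of diagonal-covariance Gaussians."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the computation stable when densities are tiny.
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical HMM state: two mixture components over 2-dimensional features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

score_near = log_gmm_likelihood([0.1, -0.2], weights, means, variances)
score_far = log_gmm_likelihood([10.0, 10.0], weights, means, variances)
```

A vector close to one of the component means receives a higher log-likelihood than one far from both, which is the quantity a decoder would combine with transition probabilities.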

Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation.
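One common form of the cepstral normalization mentioned above is cepstral mean normalization, which subtracts the per-coefficient average over an utterance from every frame. The sketch below assumes this variant; the function name and the tiny feature matrix are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    if not frames:
        return []
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]

# Hypothetical utterance: three frames of two cepstral coefficients each.
frames = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
normalized = cepstral_mean_normalize(frames)
# Each coefficient now averages to zero across the utterance.
```

Removing the mean cancels a constant offset in the cepstral domain, which is one way a fixed channel or microphone effect can be reduced.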

Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE). Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
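The best-path search mentioned above can be sketched with a small Viterbi implementation over a discrete HMM. The two-state model, its probabilities and the observation symbols are hypothetical; a real recognizer would search over far larger state spaces with pruning.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical toy model: two states that each prefer one symbol.
states = ["A", "B"]
log_start = logs({"A": 0.5, "B": 0.5})
log_trans = {"A": logs({"A": 0.9, "B": 0.1}), "B": logs({"A": 0.1, "B": 0.9})}
log_emit = {"A": logs({"x": 0.9, "y": 0.1}), "B": logs({"x": 0.1, "y": 0.9})}

score, path = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
```

Working in log-probabilities avoids numerical underflow on long utterances, since products of many small probabilities become sums.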

When decoding keeps a set of good candidates rather than only the single best one, the set can be kept either as a list (the N-best list approach) or as a subset of the models (a lattice). Rescoring is usually done by trying to minimize the Bayes risk (or an approximation thereof): instead of taking the source sentence with maximal probability, we try to take the sentence that minimizes the expectation of a given loss function with regard to all possible transcriptions (i.e., we take the sentence that minimizes the average distance to other possible sentences weighted by their estimated probability).
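A minimal sketch of this idea over an N-best list follows, assuming word-level edit distance as the loss function. The hypotheses and their probabilities are made up for illustration, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (wa != wb),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def mbr_select(nbest):
    """Pick the hypothesis with the lowest expected edit distance.

    nbest is a list of (sentence, probability) pairs.
    """
    total = sum(p for _, p in nbest)

    def expected_loss(candidate):
        return sum(
            (p / total) * edit_distance(candidate.split(), other.split())
            for other, p in nbest
        )

    return min((sentence for sentence, _ in nbest), key=expected_loss)

# Hypothetical N-best list with estimated probabilities.
nbest = [
    ("recognize speech", 0.40),
    ("wreck a nice beach", 0.35),
    ("wreck a nice peach", 0.25),
]
map_choice = max(nbest, key=lambda pair: pair[1])[0]
mbr_choice = mbr_select(nbest)
```

In this toy list the most probable single hypothesis is "recognize speech", but the minimum-risk choice is "wreck a nice beach", because most of the probability mass sits on hypotheses that are close to it.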

Efficient algorithms have been devised to rescore lattices represented as weighted finite state transducers, with the edit distances themselves represented as a finite state transducer verifying certain assumptions. Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach.

Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another he or she was walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics – indeed, any data that can be turned into a linear representation can be analyzed with DTW.

In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) with certain restrictions. That is, the sequences are “warped” non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
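The non-linear warping described above can be sketched with the classic dynamic programming recurrence. This is a minimal version for one-dimensional sequences with absolute difference as the local cost and no window constraint; the function name and sample sequences are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# The same shape produced at two different speeds aligns at zero cost.
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
fast = [0, 1, 2, 1, 0]
same_shape = dtw_distance(slow, fast)
different_peak = dtw_distance(fast, [0, 1, 3, 1, 0])
```

The slow and fast versions of the same pattern match exactly because repeated samples can be mapped onto a single sample in the other sequence, while a change in the peak value still shows up as a non-zero cost.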

Since the late 1980s, neural networks have been used in many aspects of speech recognition such as phoneme classification, phoneme classification through multi-objective evolutionary algorithms, isolated word recognition, audiovisual speech recognition, audiovisual speaker recognition and speaker adaptation. Neural networks make fewer explicit assumptions about feature statistical properties than HMMs and have several qualities making them attractive recognition models for speech recognition.

However, in spite of their effectiveness in classifying short-time units such as individual phonemes and isolated words, early neural networks were rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies. One approach to this limitation was to use neural networks as a pre-processing step (feature transformation or dimensionality reduction) prior to HMM-based recognition.

Deep Neural Networks and Denoising Autoencoders are also under investigation. A deep feedforward neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers. Similar to shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning capacity and thus the potential of modeling complex patterns of speech data.
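To make the layered structure concrete, here is a minimal forward pass through a feedforward network with two hidden layers. The ReLU activation, the softmax output and the hand-picked weights are assumptions for illustration; a trained acoustic model would learn its weights from data and be far larger.

```python
import math

def dense(inputs, weights, biases):
    """Affine layer: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, layers):
    """Run features through hidden layers (ReLU) and an output layer (softmax).

    layers is a list of (weights, biases) pairs; the last pair is the output layer.
    """
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))

# Hypothetical network: 2 inputs, two hidden layers of 2 units, 3 output classes.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
]
posteriors = forward([2.0, 1.0], layers)
```

Each hidden layer composes the features of the layer below it, and the softmax turns the final scores into class posteriors that sum to one.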

See comprehensive reviews of this development and of the state of the art as of October 2014 in the recent Springer book from Microsoft Research. See also the related background of automatic speech recognition and the impact of various machine learning paradigms, notably including deep learning, in recent overview articles.
