How to speak Spanish like a Colombian drug lord!

Tagged with: Python, Pandas, NLP

I've been living in Puerto Rico for four years, but two of those have been during COVID, so I haven't been able to practice Spanish as much as I'd like. To speed up my learning I've decided to watch a lot of Spanish-language television to start training my ears, but to do that I need a baseline of words I understand to even know what they're saying!

Learning through apps like Duolingo, Drops, etc. starts with odd topics like vegetables that don't build much of a baseline for actually understanding daily conversations, so I think consuming TV is a better use of my time.

Subtitles

I've decided the way to figure out which words are best to study is to download the subtitles for every episode of a show I want to watch and then count each word. The more often a word is spoken, the more important it is for me to know it, since I'll be hearing it a lot in the show.

I'm going to download subtitles from Netflix. Netflix subtitles are in the WebVTT format, which looks like this:

248
00:17:58.285 --> 00:18:01.163  position:50.00%,middle  align:middle size:80.00%  line:79.33% 
Yo de verdad espero que ustedes
me vean como una amiga, ¿mmm?

249
00:18:01.247 --> 00:18:02.539  position:50.00%,middle  align:middle size:80.00%  line:84.67% 
No como una madrastra.

250
00:18:04.250 --> 00:18:06.127  position:50.00%,middle  align:middle size:80.00%  line:84.67% 
Yo nunca te vi como una madrastra.

It gives you a start time, an end time, and the text on the screen. So my first step was parsing this format and turning it into a list of words using https://github.com/glut23/webvtt-py.
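
Reading a file with webvtt-py looks roughly like this (a minimal sketch; the file name is just an example):

import webvtt

for caption in webvtt.read('El marginal S02E02 WEBRip Netflix es[cc].vtt'):
    # Each caption carries a start time, an end time, and the on-screen text
    print(caption.start, caption.end, caption.text)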

Dummy parsing

What I basically did was text.split(" ") and started counting the words. This approach was quick and painless, but it had a few downfalls. Some words look the same when in reality they are not, which meant I'd have to study every meaning of a word even if some of those meanings were rare.
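
The counting itself was something along these lines (a simplified sketch of the approach, not the exact script):

import webvtt
from collections import Counter

counts = Counter()

for caption in webvtt.read('El marginal S02E02 WEBRip Netflix es[cc].vtt'):
    # Naive tokenization: lowercase and split on spaces, punctuation and all
    counts.update(word.lower() for word in caption.text.replace("\n", " ").split(" ") if word)

print(counts.most_common(20))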

An example of this is the word "como". You can say:

  • Haz como te digo: "Do as I say", where como means "as"
  • Como tacos todos los días: "I eat tacos every day", where como is a conjugated form of the verb comer, "to eat"

I need to know which version of a word is being used so I can count it properly.

Regular Expressions are always the answer

I couldn't figure out which meaning of a word was being used without seeing it in a complete sentence, but subtitles are fragments. They're split into timed cues for display on the screen and don't contain entire sentences. For example, it might look like this:

23
00:01:21.960 --> 00:01:23.520  position:50.00%,middle  align:middle size:80.00%  line:84.67% 
Solo las que luchan por ellos

24
00:01:23.680 --> 00:01:25.680  position:50.00%,middle  align:middle size:80.00%  line:84.67% 
consiguen sus sueños.

I want to detect the start and end of each sentence and then combine the fragments, so that I end up with "Solo las que luchan por ellos consiguen sus sueños.". My first thought was a regular expression on punctuation. This worked well most of the time, but there were enough exceptions to the rule that it broke often and generated a lot of broken sentences (a rough sketch of this approach follows the list):

  • Abbreviations like "EE. UU." for Estados Unidos (United States)
  • Ellipses
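
Here's roughly what that punctuation-based joining looked like; a simplified sketch, not the exact regex I used:

import re

fragments = [
    "Solo las que luchan por ellos",
    "consiguen sus sueños.",
]

text = " ".join(fragments)

# Split after sentence-ending punctuation, keeping the punctuation attached
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
print(sentences)  # ['Solo las que luchan por ellos consiguen sus sueños.']

# This is exactly where "EE. UU." and ellipses break things:
# the split happily fires after the "." in "EE." too.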

Splitting on spaces also didn't work for identifying the parts of speech since I needed the context around the word.

Natural Language Processing

So to solve my pain I decided to grab spaCy ( https://spacy.io/ ) and do some NLP on the subtitles, so that I could identify the proper parts of speech and get an accurate picture of the words I needed to learn.

The way spaCy works is you send it a sentence and it returns a sequence of tokens, each with attributes like its part of speech:

>>> import spacy
>>> nlp = spacy.load("es_core_news_sm")
>>> [x.pos_ for x in nlp("Hola, como estas?")]
['PROPN', 'PUNCT', 'SCONJ', 'PRON', 'PUNCT']
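
Each token also exposes its lemma (dictionary form). The word counts later in this post show infinitives like ser and tener rather than their conjugations, which suggests counting token.lemma_ rather than the raw text; for example:

>>> [(t.text, t.lemma_, t.pos_) for t in nlp("Yo nunca te vi como una madrastra.")]

Counting the lemma should group a conjugation like "vi" under its infinitive, "ver".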

So now I could identify the parts of speech and pull sentences together by looking for end-of-sentence punctuation. The first thing I did was generate a CSV of sentences that looked like this:

sentence | start | end | show | file
Si no, le voy a cortar todos los deditos | 00:00:20.605 | 00:00:24.125 | El marginal | El marginal S02E02 WEBRip Netflix es[cc].vtt
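
A rough sketch of how that CSV could be assembled from the parsed captions (the helper and file names here are assumptions, and gzipping the output is omitted):

import csv
import webvtt

SENTENCE_END = (".", "!", "?", "…")

def write_sentences(vtt_path, show, writer):
    # Accumulate caption fragments until one ends with sentence-final punctuation,
    # then write the joined sentence with the first fragment's start time and
    # the last fragment's end time.
    fragments = []
    sentence_start = None
    for caption in webvtt.read(vtt_path):
        if sentence_start is None:
            sentence_start = caption.start
        fragments.append(caption.text.replace("\n", " ").strip())
        if fragments[-1].endswith(SENTENCE_END):
            writer.writerow([" ".join(fragments), sentence_start, caption.end, show, vtt_path])
            fragments = []
            sentence_start = None

with open("sentences.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence", "start", "end", "show", "file"])
    write_sentences("El marginal S02E02 WEBRip Netflix es[cc].vtt", "El marginal", writer)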

Once I had a CSV of sentences, I could send those back through spaCy for NLP and start counting words to generate another CSV:

word | pos | show | file
a | ADP | El marginal | El marginal S02E02 WEBRip Netflix es[cc].vtt
cortar | VERB | El marginal | El marginal S02E02 WEBRip Netflix es[cc].vtt
todos | PRON | El marginal | El marginal S02E02 WEBRip Netflix es[cc].vtt
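
And a sketch of that second pass, running each sentence through spaCy and writing one row per word (column names roughly follow the table above; the rest is an assumption, not the exact script):

import csv
import spacy

nlp = spacy.load("es_core_news_sm")

with open("sentences.csv") as src, open("word_data.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["word", "pos", "show", "file"])
    for row in reader:
        for token in nlp(row["sentence"]):
            # Skip punctuation and whitespace, keep the lemma and its part of speech
            if token.pos_ in ("PUNCT", "SPACE"):
                continue
            writer.writerow([token.lemma_.lower(), token.pos_, row["show"], row["file"]])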

From there I had all the data I needed! So now it was time to start doing some data analysis!

Data analysis

Using a Jupyter notebook ( https://jupyter.org/ ) I grabbed pandas ( https://pandas.pydata.org/ ) and read in my CSVs to start analyzing the results.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Show more rows than the default when printing dataframes
pd.set_option('display.max_rows', 1000)

# One row per word occurrence: the word, its part of speech, and where it came from
words = pd.read_csv('word_data.csv.gz', compression='gzip', delimiter=',')

The words dataframe is built from the second table I showed above, with just the words and their parts of speech. I started off grouping the dataset by word so I could get a count of how many times each one was spoken across every series I parsed:

grouped_result = (words.groupby(words.word).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count')
   .drop_duplicates(subset='word'))

grouped_result.head(300)

Which returned a list of words and their count:

	word	count
0	que	94430
1	no	75931
2	a	70968
3	de	67982
4	ser	64226
5	la	52143
6	y	44390
7	estar	37819
8	el	35920

Now I wanted to identify where my diminishing returns would be. Is there a set of words that I must learn because they are spoken so often that I wouldn't understand a conversation if they weren't in my vocabulary?
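
The chart itself isn't reproduced here, but a plot along these lines shows the shape (a minimal sketch using the grouped_result frame from above):

# Plot counts by word rank to see where the curve flattens out
ax = grouped_result['count'].head(500).plot()
ax.set_xlabel('word rank')
ax.set_ylabel('times spoken')
plt.show()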

As the chart shows, word usage drops off at around the ~200 mark. So there are basically 150 words I absolutely must know, and after that the rest are roughly equally important. I wasn't quite happy with this, because some parts of speech are higher priority than others; for example, I think a strong grasp of the most popular verbs will go a long way. So I also wanted to identify the most important verbs to learn:

grouped_verbs = (words[words.pos == 'VERB'].groupby(['word', 'pos']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count')
   .drop_duplicates(subset='word'))

grouped_verbs.head(50)

Which got me this:

	word	pos	count
0	tener	VERB	22072
1	hacer	VERB	14946
2	ir	VERB	12570
3	decir	VERB	11314
4	querer	VERB	11083
5	ver	VERB	10269
6	estar	VERB	9780
7	saber	VERB	8704
8	ser	VERB	7674
9	dar	VERB	5722
10	pasar	VERB	5528
11	hablar	VERB	5355
12	venir	VERB	5145
13	creer	VERB	4895
14	salir 	VERB	3395

Verbs had a slightly different drop-off pattern when I targeted them directly:

I get a big bang for my buck by learning those top 40 verbs. Nouns, on the other hand, are much more spread out and more evenly distributed:

word	pos	count
0	gracias	NOUN	4676
1	favor	NOUN	4625
2	señor	NOUN	4116
3	verdad	NOUN	3566
4	vida	NOUN	2673
5	hombre	NOUN	2601
6	madre	NOUN	2597
7	vez	NOUN	2537
8	tiempo	NOUN	2492
9	hijo	NOUN	2215

So then I thought to myself... How much of a show would I understand if I just learned these most important words? I started by excluding some of the easy parts of speech and focusing on the most important ones:

find_important_words = (words[~words.pos.isin(['PRON', 'CONJ', 'ADP', 'ADV', 'SCONJ', 'AUX', 'INTJ'])].groupby(['word', 'pos']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count')
   .drop_duplicates(subset='word'))

find_important_words.head(50)

The top 20 were all verbs except for bueno and gracias. So now, with my list of what I considered "important words", I plotted it to figure out how many words I should aim to learn:

It looks like 200 learned words would give me a reasonable amount of understanding for a series, so I decided to calculate how much of a series I would understand if I learned just those first 200 words:

percentages = {}

for show_name in words['media'].drop_duplicates().values:
    # Word counts for this show only
    words_in_show = (words[words.media == show_name].groupby(words.word).size()
       .sort_values(ascending=False)
       .reset_index(name='count')
       .drop_duplicates(subset='word'))

    total_words_handled = 0

    # Add up how many of this show's spoken words are covered by the overall top 200
    for word in grouped_result['word'][:200]:
        values = words_in_show[words_in_show.word == word]['count'].values

        if values.size > 0:
            total_words_handled += values[0]

    # Covered words divided by all words spoken in the show
    percentages[show_name] = total_words_handled / words_in_show['count'].sum()

Now I had a table that would show me what percentage of the spoken words were covered by the first 200 words in my list:

p_df = pd.DataFrame(percentages.items(), columns=['show', 'percentage'])
p_df = p_df.sort_values(by='percentage')
p_df['percentage'] = p_df['percentage'] * 100
pd.options.display.float_format = '{:,.2f}%'.format
p_df
Show Percentage
Verónica 64.24%
El ciudadano ilustre 65.28%
El Chapo 66.68%
Neruda 66.89%
La casa de papel 67.56%
El Ministerio del Tiempo 68.03%
Club de Cuervos 68.19%
El marginal 68.47%
Ingobernable 68.59%
Pablo Escobar 70.20%
Fariña 70.95%
La Reina del Sur 71.52%
Gran Hotel 73.15%
Las chicas del cable 73.58%
Élite 73.78%
La Piloto 74.03%
El bar 74.07%
La casa de las flores 75.40%
Tarde para la ira 75.59%

But living in Puerto Rico, one thing I've realized is that speed of speech also matters. I have a much easier time speaking with people from Colombia and Mexico than I do with Puerto Ricans, because Puerto Ricans speak so much faster. So even though I could understand 75% of "Tarde para la ira" if I learned the 200 words, I want to make sure the show is spoken at a pace I can understand as well.

So I loaded up the other CSV file, the one with the full sentences, and added a "time per word" column:

sentences = pd.read_csv('sentences.csv.gz', compression='gzip', delimiter=',', parse_dates=['start', 'end'])
sentences['total_time'] = (sentences['end'] - sentences['start']).dt.total_seconds()
sentences['word_count'] = sentences['sentence'].str.split().str.len()
sentences['time_per_word'] = sentences['total_time'] / sentences['word_count']

Then I was able to have a speed rating for each show:

sentence_group = sentences.groupby([sentences.media])
sentence_group.time_per_word.mean().reset_index().sort_values('time_per_word')
media time_per_word (seconds)
Gran Hotel 0.58
El Chapo 0.59
Las chicas del cable 0.61
Élite 0.63
Ingobernable 0.64
El Ministerio del Tiempo 0.64
Fariña 0.65
El ciudadano ilustre 0.67
Neruda 0.68
La Piloto 0.69
La casa de papel 0.70
El bar 0.70
Verónica 0.72
La Reina del Sur 0.75
Club de Cuervos 0.76
El marginal 0.76
Pablo Escobar 0.77
Tarde para la ira 0.77
La casa de las flores 0.81

Luckily, the two series that require the least vocabulary are also among the slowest speakers! So those will be the series I start with. The final question I wanted to answer was "What are the top words I'm missing for a given series?" Since I'll understand about 75% of those series from the top 200 words, I'm hoping there are some frequent show-specific words I can also learn to push my understanding even higher.

First, find which words are in each show but not in the top 200:

missing_words_by_show = {}

for show_name in words['media'].drop_duplicates().values:
    words_in_show = (words[words.media == show_name].groupby(words.word).size() 
       .sort_values(ascending=False) 
       .reset_index(name='count')
       .drop_duplicates(subset='word'))
    
    frequency_words = grouped_result['word'][:200]

    missing_words = words_in_show[~words_in_show.word.isin(frequency_words.values)]
    missing_words_by_show[show_name] = missing_words

Then I was able to grab them per show:

missing_words_by_show['La casa de las flores'].head(50)

word	count
31	mamá	252
70	florería	87
98	perdón	56
102	sea	54
116	además	44
126	ahorita	40
132	cárcel	38
133	fiesta	38

So adding those few words to my vocabulary will also give me a better understanding of the series.

Conclusion

I believe a data-driven approach to language learning will be an effective way to get me speaking better Spanish. It was a ton of fun to play with spaCy, pandas, and Jupyter as well!

I'll keep improving the data analysis over time, but I do believe this is a pretty good starting point!