In [ ]:
import pandas as pd  
import numpy as np
import os  
import re
from string import digits
from gensim.models import word2vec
from gensim.models import phrases 
In [ ]:
current_directory = os.getcwd()
print("current_dictory", current_directory)
In [3]:
df = pd.read_csv('dreams.csv')  # Load a CSV file named 'dreams.csv' 

Open dreams.csv and take a quick look. We can see that the dream reports are formatted as follows:

dreams_text
0 001 Nightmare in Cambodia. In the dream we are...
1 002 The enemy is above, in the sky...
2 003 We are on a firebase. It is night time...
3 004 We are on an LZ; I am. saying ...

Therefore, we need to extract the text in the second column "dreams_text" as our corpus.

In [4]:
#Initialize the corpus
corpus = []
# Iterate through each line of text in the CSV file
for text in df['dreams_text']:
    # Ensure the text is of string type
    if not isinstance(text, str):
        text = str(text)
    corpus.append(text)
In [5]:
# Check the length of the corpus
print(f"Total documents: {len(corpus)}")
Total documents: 30799

Clean the corpus

In [6]:
# Initialize the cleaned corpus
corpus_clean = []

The following loop transforms the raw text corpus into a cleaned list of sentences, each composed of lowercase words with punctuation and numbers removed.

In [7]:
for document in corpus:
    doc = re.sub(';?:!"', '.', document)  # Replace semicolons, colons, exclamation marks, and quotation marks with dots, as they will be used to split sentences
    doc = re.sub(r'[^\w\s.]', '', doc)  # Remove all remaining punctuation marks except for dots
    translation_table = str.maketrans('', '', digits)  # Create a translation table to remove digits
    doc = doc.translate(translation_table)  # Use the translation table to remove digits from the document
    doc = doc.lower()  # Convert all letters in the document to lowercase
    doc = re.sub(r'\s+', ' ', doc)  # Remove any extra spaces
    doc = doc.split('.')  # Split the document into sentences using dots as delimiters
    doc2 = [j.strip().split(' ') for j in doc]  # Split each sentence into words and strip surrounding whitespace
    doc2 = [[w for w in sentence if w] for sentence in doc2]  # Drop empty strings produced by splitting
    corpus_clean.extend(doc2)  # Extend the corpus_clean list with the processed sentences

Thus, we have obtained a list named corpus_clean, structured as a nested list: each top-level element is a sentence, and each sentence is itself a list of individual words.

Index Value
0 ['nightmare', 'in', 'cambodia']
1 ['in', 'the', 'dream', 'we', 'are', 'being', 'overrun', 'by', 'sappers', 'who', 'have', 'got', 'past', 'the', 'night', 'defensive', 'perimeter', 'trips', 'and', 'claymores', 'and', 'now', 'crawl', 'forward']
2 ['i', 'wake', 'up', 'and', 'see', 'a', 'boot', 'tread', 'close', 'to', 'my', 'face']
3 ['i', 'slowly', 'withdraw', 'my']
4 ['from', 'its', 'holster', 'pull', 'the', 'hammer', 'back', 'then', 'aim', 'it', 'at', 'the', 'boot']
5 ['just', 'then', 'the', 'cloudobscured', 'moon', 'comes', 'out', 'and', 'i', 'realize', 'the', 'boot', 'is', 'american', 'and', 'that', 'it', 'is', 'jerry', 'biecks', 'foot']
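
A preview like the listing above can be generated with a quick loop over the first few cleaned sentences (a minimal sketch; the exact entries depend on the cleaning steps applied):

In [ ]:
for index, sentence in enumerate(corpus_clean[:6]):
    print(index, sentence)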

Word embedding inputs are typically structured as lists of words nested within lists of sentences because this format preserves contextual information while offering computational efficiency. It maintains sentence boundaries, facilitating the capture of semantic relationships and enabling easy processing of varying sentence lengths.

This structure also supports efficient implementation of sliding window techniques used in many embedding algorithms, retains document structure, and provides flexibility for both sentence-level and corpus-wide operations.

Next, we can start training the model.

Before formally training the model, we can use a bigram_transformer to create a tool for detecting bigram features. A bigram model is a statistical language model that considers the joint probability of two adjacent words in a text. This means the model takes the order of words into account, thereby capturing dependencies between them. By using bigrams, we can identify word combinations that frequently occur together, which helps the model better understand the structure and context of the language.

phrases.Phrases is a class used to identify and create bigrams (or, applied repeatedly, trigrams). When you apply it to a text corpus, it analyzes the corpus and identifies frequently co-occurring word pairs. This process helps us discover common phrases and fixed collocations, thereby improving the model's understanding of language usage patterns. It can significantly enhance the effectiveness of text analysis and natural language processing tasks, especially when dealing with specialized terminology or fixed expressions.

In [8]:
bigram_transformer = phrases.Phrases(corpus_clean) 
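
As a quick sanity check (a sketch, not executed here), we can pass a single cleaned sentence through the transformer; frequently co-occurring pairs are merged into single tokens joined by an underscore, like the young_man and young_woman tokens that appear in the similarity results further below:

In [ ]:
sample_sentence = corpus_clean[1]
print(sample_sentence)                      # original tokens
print(bigram_transformer[sample_sentence])  # same sentence with frequent adjacent pairs merged by "_"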
In [9]:
Model_100_10_1HS_samp1 = word2vec.Word2Vec(bigram_transformer[corpus_clean], workers=4, sg=1, hs=1, vector_size=100, window=10, sample=1e-3)

Explain Parameters:

workers=4: This parameter specifies the number of threads used during the training process. Set to 4 here, it means that the training process will use 4 threads in parallel to speed up the computation.

sg=1: This parameter is used to select the type of model for training. sg stands for "skip-gram," and setting it to 1 indicates the use of the skip-gram model. The skip-gram model is suitable for processing smaller corpora and can better capture the relationships of rare words.

hs=1: This parameter is used to activate the "hierarchical softmax" model. Set to 1, it means that hierarchical softmax is enabled. Hierarchical softmax is a technique used to accelerate the training process of word2vec, reducing the computational complexity of the model.

vector_size=100: This parameter specifies the dimension of the generated word vectors. Set to 100 here, it means that each word will be represented as a 100-dimensional vector.

window=10: This parameter specifies the maximum context window size used during training. Set to 10 here, it means that the context of each word includes up to 10 words before and after the target word.

sample=1e-3: This parameter is used to set the downsampling rate for training data. Set to 1e-3, it means that high-frequency words will be downsampled, appearing less frequently during the training process, reducing their impact on the model and allowing the model to focus more on low-frequency words with greater information content.
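
As a quick check (a sketch; it assumes the word "dream" made it into the model's vocabulary, which it should for this corpus), we can confirm the dimensionality of the trained vectors and the size of the learned vocabulary:

In [ ]:
print(Model_100_10_1HS_samp1.wv.vector_size)         # 100, matching vector_size
print(len(Model_100_10_1HS_samp1.wv['dream']))       # each word vector has 100 components
print(len(Model_100_10_1HS_samp1.wv.index_to_key))   # number of words (and bigrams) in the vocabulary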

The model is trained; we can now examine the relationships between some words within this model.

In [10]:
#Let's see if the model can correctly recognize semantics.
Model_100_10_1HS_samp1.wv.doesnt_match("man woman summer girl".split())
Out[10]:
'summer'
In [11]:
print(Model_100_10_1HS_samp1.wv.most_similar(positive=['girl', 'man'], negative=['boy'])) #boy:man as girl:_?___ WOMAN!
[('woman', 0.8032967448234558), ('person', 0.7134841680526733), ('lady', 0.6778110265731812), ('guy', 0.6606528759002686), ('young_man', 0.6053479313850403), ('stranger', 0.6048056483268738), ('lady_who', 0.6013890504837036), ('someone', 0.5913842916488647), ('young_woman', 0.5792549848556519), ('verbal', 0.5633144974708557)]

Through some simple examples, it can be seen that our model is capable of identifying semantically different words from "man woman summer girl" and can also calculate that "man - boy + girl ≈ woman".

In [12]:
# demonstrate the five words most similar to the word "terrifying" in the Word2Vec model, along with their similarity scores to "terrifying".
WORD = "terrifying"

similar_words = Model_100_10_1HS_samp1.wv.most_similar(WORD, topn=5)

print(f"similar to ", WORD)
for word, similarity in similar_words:
    print(f"{word}: {similarity}")
similar to  terrifying
frightening: 0.733578622341156
scarey: 0.6859025359153748
nocturnal: 0.6689965128898621
nonetheless: 0.6657446026802063
very_realistic: 0.6455093622207642
In [13]:
WORD = "disturbing"
similar_words = Model_100_10_1HS_samp1.wv.most_similar(WORD, topn=5)
print(f"similar to ", WORD)
for word, similarity in similar_words:
    print(f"{word}: {similarity}")
similar to  disturbing
nostalgic: 0.5985854268074036
embarrassing: 0.5805899500846863
ecstatic: 0.5769983530044556
deja_vu: 0.5763196349143982
creeped_out: 0.571252703666687
In [14]:
WORD = "sweet"
similar_words = Model_100_10_1HS_samp1.wv.most_similar(WORD, topn=5)
print(f"similar to ", WORD)
for word, similarity in similar_words:
    print(f"{word}: {similarity}")
similar to  sweet
tasty: 0.6262686252593994
frail: 0.6100146174430847
soft: 0.5955066084861755
honey: 0.5822334289550781
bashful: 0.5809429883956909
In [15]:
#Find words that are semantically similar to the concept represented by the word "terrifying" when contrasted with the concept represented by the word "sweet"
print(Model_100_10_1HS_samp1.wv.most_similar(positive=['terrifying'], negative=['sweet'], topn=10))
[('control', 0.4562285244464874), ('outcome', 0.43268337845802307), ('impending', 0.41403695940971375), ('explosion', 0.4125122129917145), ('several_hundred', 0.409423828125), ('fully_awake', 0.406212717294693), ('frightening', 0.39035752415657043), ('crab', 0.38695216178894043), ('source', 0.38543036580085754), ('aware', 0.38370537757873535)]

Exploring word relationships through word embeddings can lead to fascinating insights. These high-dimensional vector representations allow us to:

  1. Quantify semantic and syntactic properties of words in various dimensions
  2. Uncover latent patterns and associations in language
  3. Detect subtle nuances between related concepts
  4. Analyze how words shift meaning across different contexts
  5. Understand how complex ideas and emotions are encoded in language

Word embeddings enable us to perform arithmetic operations on words, revealing analogies and conceptual relationships. For instance, the example "man - boy + girl ≈ woman" demonstrates how these models capture gender relationships. Such capabilities not only enhance our understanding of linguistic structures but also power numerous natural language processing applications, from machine translation to sentiment analysis.

Furthermore, by examining the cosine similarity between word vectors, we can identify synonyms, antonyms, and words with similar usage patterns. This allows for a more nuanced exploration of language, going beyond simple dictionary definitions to understand the multifaceted nature of word meanings and their interconnections within the broader lexical landscape.
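
For example, we can query pairwise cosine similarities directly (a sketch, not executed here); we would expect terrifying and frightening to score much higher than terrifying and sweet:

In [ ]:
print(Model_100_10_1HS_samp1.wv.similarity('terrifying', 'frightening'))
print(Model_100_10_1HS_samp1.wv.similarity('terrifying', 'sweet'))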

Next, we can measure how closely a piece of text semantically aligns with a specific feature, such as 'terrifying,' by comparing the cosine similarity between the text and that feature. This approach allows us to estimate the degree of terror conveyed in a dream.

In [16]:
from gensim.utils import simple_preprocess
import numpy as np  # Import numpy

def measure_dreams(model, text, feature):
    # Preprocess the text
    words = simple_preprocess(text)
    
    # Get the vector for the feature you want to calculate
    feature_vector = model.wv[feature]  # Look up the embedding for the feature word
    
    # Calculate the average vector for the text
    text_vectors = [model.wv[word] for word in words if word in model.wv]
    if not text_vectors:
        return 0  # Return 0 if no words are in the vocabulary
    text_vector = np.mean(text_vectors, axis=0)
    
    # Calculate cosine similarity
    similarity = np.dot(feature_vector, text_vector) / (np.linalg.norm(feature_vector) * np.linalg.norm(text_vector))
    
    return similarity

# Example usage
text = " Nightmare in Cambodia. In the dream we are being overrun by sappers who have got past the Night Defensive Perimeter trips and claymores and now crawl forward. I wake up and see a boot tread close to my face. I slowly withdraw my .45 from its holster, pull the hammer back, then aim it at the boot. Just then the cloud-obscured moon comes out and I realize the boot is American and that it is Jerry Bieck's foot. In the pitch stillness I point the .45 straight up in the air. Pinching the hammer tightly I pull the trigger and settle the hammer back in place. I re-holster the pistol and go back to sleep. The next day, after a very difficult march, all the men are overjoyed to be out of Cambodia. I tell no one what almost happened."
feature_score = measure_dreams(Model_100_10_1HS_samp1, text, 'terrifying')  # The feature word is passed as a string, e.g. 'terrifying'
print(f"The score of the text is: {feature_score}")
The score of the text is: 0.42425134778022766
In [17]:
text = "I know my ex-boyfriend from college, Tracey was in the dream, although not sure I remember much about that. I was at the mall walking with someone and I ran into these two people. One of them was Teri this girl I went to college with. We were really good friends in college. I was like 'TERI!!' and we hugged, it was very nice to see her. I remember telling her that I had been to Virginia beach a few times already and would be coming back this year. She is from va beach There was much more to this dream, just cannot remember."
feature_score = measure_dreams(Model_100_10_1HS_samp1, text, 'terrifying')  # The feature word is passed as a string, e.g. 'terrifying'
print(f"The score of the text is: {feature_score}")
The score of the text is: 0.3600655198097229
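
The same function can be applied across the whole corpus, for example to attach a terror score to every dream in the dataframe (a sketch, not run here; the column name terror_score is illustrative):

In [ ]:
df['terror_score'] = [measure_dreams(Model_100_10_1HS_samp1, str(t), 'terrifying') for t in df['dreams_text']]
print(df['terror_score'].describe())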