This project combines GPT-2 text generation with a text preprocessing pipeline and TF-IDF (Term Frequency-Inverse Document Frequency) analysis of the generated text. It is built on common Natural Language Processing (NLP) libraries: transformers, NLTK, and scikit-learn.
- Text Generation: Uses GPT-2 model for generating text based on given prompts
- Text Preprocessing Pipeline:
  - Text Cleaning (removes special characters)
  - Text Normalization (converts to lowercase)
  - Tokenization (splits text into words)
  - Lemmatization (reduces words to their base form)
  - Stopword Removal
  - Unique Word Extraction
- TF-IDF Analysis: Computes TF-IDF weights for the terms in the generated text
- Python 3.x
- Required Python packages:
  - scikit-learn
  - nltk
  - transformers
  - numpy
  - pandas
- Clone the repository:
git clone https://github.com/Yossefmohammed/Generate-docs-and-calc-TF-IDF.git
cd Generate-docs-and-calc-TF-IDF
- Install the required packages:
pip install -r requirements.txt
- Download required NLTK data:
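import nltk
nltk.download('punkt')      # tokenizer models
nltk.download('punkt_tab')  # tokenizer tables that recent NLTK releases look for instead of 'punkt'
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # lexical database used by the WordNet lemmatizer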
The project is implemented as a Jupyter notebook (Generate docs and calc TF-IDF.ipynb). You can run it using Jupyter Notebook or Jupyter Lab.
The notebook demonstrates:
- Text generation from prompts
- Text preprocessing pipeline
- TF-IDF analysis of the generated text
- Generate docs and calc TF-IDF.ipynb: Main notebook containing the implementation
- README.md: Project documentation
- requirements.txt: List of required Python packages
- Uses the GPT-2 model from Hugging Face's transformers library
- Generates text based on user-provided prompts
- Configurable maximum length for generated text
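The generation step above can be sketched as follows. This is a minimal illustration that assumes the Hugging Face `pipeline` helper; the function name `generate_text`, the default `max_length=50`, and the fixed seed are hypothetical choices for the example and are not taken from the notebook.

```python
def generate_text(prompt: str, max_length: int = 50, seed: int = 42) -> str:
    """Generate a continuation of `prompt` with GPT-2 (illustrative sketch)."""
    # Imported lazily because loading transformers is slow and the
    # model weights are downloaded on first use.
    from transformers import pipeline, set_seed

    set_seed(seed)  # assumption: a fixed seed, only to make runs repeatable
    generator = pipeline("text-generation", model="gpt2")
    # max_length is measured in tokens and includes the prompt itself.
    outputs = generator(prompt, max_length=max_length, num_return_sequences=1)
    return outputs[0]["generated_text"]
```

Calling `generate_text("Natural language processing is", max_length=40)` returns the prompt followed by the model's continuation. Because `max_length` counts the prompt tokens too, a long prompt leaves less room for generated text.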
- Cleaning: Removes special characters and non-alphabetic content
- Normalization: Converts text to lowercase
- Tokenization: Splits text into individual words
- Lemmatization: Reduces words to their base form using WordNet
- Stopword Removal: Removes common English stopwords
- Unique Word Extraction: Identifies unique words in the processed text
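The pipeline steps listed above might be chained as in the sketch below. The function names `preprocess` and `unique_words` are hypothetical, and two simplifications are assumptions of this example rather than the notebook's method: tokenization is done by whitespace splitting (sufficient once non-alphabetic characters are removed) instead of an NLTK tokenizer, and the stopword list and lemmatizer can be passed in so the function also works without the NLTK data.

```python
import re
from typing import Callable, Iterable, List, Optional


def preprocess(
    text: str,
    stop_words: Optional[Iterable[str]] = None,
    lemmatize: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Clean, lowercase, tokenize, lemmatize, and filter stopwords."""
    if stop_words is None:
        # Needs nltk.download('stopwords') to have been run once.
        from nltk.corpus import stopwords
        stop_words = stopwords.words("english")
    if lemmatize is None:
        # Needs nltk.download('wordnet') to have been run once.
        from nltk.stem import WordNetLemmatizer
        lemmatize = WordNetLemmatizer().lemmatize
    stop_set = set(stop_words)

    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)      # cleaning
    normalized = cleaned.lower()                      # normalization
    tokens = normalized.split()                       # tokenization (whitespace)
    lemmas = [lemmatize(tok) for tok in tokens]       # lemmatization
    return [w for w in lemmas if w not in stop_set]   # stopword removal


def unique_words(tokens: Iterable[str]) -> List[str]:
    """Return each token once, in order of first appearance."""
    return list(dict.fromkeys(tokens))
```

With the defaults, `preprocess(generated_text)` uses NLTK's English stopword list and WordNet lemmatizer. Passing `stop_words` and `lemmatize` explicitly makes it easy to try the pipeline on a small example before downloading any data.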
- Calculates term frequencies
- Identifies important terms in the generated text
- Provides insights into the most significant words in the corpus
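To make the scoring concrete, here is a small standard-library sketch of the textbook formulation, where a term's weight in a document is its relative frequency times `ln(N / df)`, with `N` the number of documents and `df` the number of documents containing the term. The names `tf_idf` and `top_terms` are hypothetical. This is not necessarily how the notebook computes its scores: scikit-learn's `TfidfVectorizer`, for instance, smooths the IDF term and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k highest-scoring terms of one document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
```

A term that appears in every document gets a weight of zero under this formulation, which is why the highest-scoring terms tend to be the ones that distinguish one generated text from the others.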
Feel free to submit issues and enhancement requests!
This project is open source and available under the MIT License.
Youssef Mohammed
- Email: [email protected]
- Phone: 01126078938