Indonesian Skill Extractor v1.0
A production-ready, rule-based NER system for extracting and categorizing technical and soft skills from Indonesian job postings.
Model Description
This model specializes in identifying and categorizing skills from Indonesian job market texts. It uses a comprehensive skill taxonomy with 200+ predefined skills across 7 categories, combined with intelligent pattern matching and normalization.
Key Features
- ✅ Zero Dependencies: Pure Python, no ML frameworks required
- ✅ 200+ Skills: Comprehensive taxonomy across 7 categories
- ✅ Bilingual: Handles English and Indonesian (including code-switching)
- ✅ Skill Normalization: Maps aliases to canonical forms (js → javascript, etc.)
- ✅ Proficiency Detection: Identifies beginner/intermediate/expert levels
- ✅ Fast & Deterministic: 1000+ docs/sec, reproducible results
- ✅ Production Ready: Lightweight (~20 KB), easy integration
Skill Categories
| Category | Count | Examples |
|---|---|---|
| programming | 30+ | Python, Java, JavaScript, TypeScript, PHP, C++, Go, Rust |
| frontend | 40+ | React, Vue, Angular, Next.js, HTML, CSS, Tailwind, Webpack |
| backend | 30+ | Node.js, Django, Laravel, Spring Boot, Express, FastAPI |
| database | 25+ | MySQL, PostgreSQL, MongoDB, Redis, Elasticsearch, Oracle |
| cloud | 35+ | AWS, Azure, GCP, Docker, Kubernetes, Jenkins, Terraform |
| data_science | 30+ | Pandas, TensorFlow, PyTorch, Tableau, Power BI, Spark |
| soft_skills | 20+ | Communication, Leadership, Teamwork, Problem Solving |
Total: 200+ skills with 40+ aliases and variations
Quick Start
Installation
No installation required! Just download the single Python file:
# Download skill_extractor.py from this repository
# Place it in your project directory
from skill_extractor import IndonesianSkillExtractor
# Or use convenience function
from skill_extractor import extract_skills
Basic Usage
from skill_extractor import IndonesianSkillExtractor
# Initialize
extractor = IndonesianSkillExtractor()
# Extract skills from text
text = "Menguasai Python, React, MySQL, dan komunikasi yang baik"
result = extractor.extract(text)
print(result)
# Output:
# {
# 'skills': [
# {'original': 'Python', 'normalized': 'python', 'category': 'programming', 'proficiency': None},
# {'original': 'React', 'normalized': 'react', 'category': 'frontend', 'proficiency': None},
# {'original': 'MySQL', 'normalized': 'mysql', 'category': 'database', 'proficiency': None},
# {'original': 'komunikasi', 'normalized': 'komunikasi', 'category': 'soft_skills', 'proficiency': None}
# ],
# 'total_count': 4,
# 'unique_count': 4,
# 'by_category': {
# 'programming': [...],
# 'frontend': [...],
# 'database': [...],
# 'soft_skills': [...]
# }
# }
Simple Extraction
from skill_extractor import extract_skills
# Quick extraction (returns list of skill names)
skills = extract_skills("Python, React, MySQL, AWS")
print(skills)
# Output: ['python', 'react', 'mysql', 'aws']
Batch Processing
extractor = IndonesianSkillExtractor()
texts = [
"Python, Django, PostgreSQL",
"React, TypeScript, Node.js",
"AWS, Docker, Kubernetes"
]
results = extractor.batch_extract(texts)
for i, result in enumerate(results):
    print(f"Text {i+1}: {result['total_count']} skills, {len(result['by_category'])} categories")
Get Top Skills
extractor = IndonesianSkillExtractor()
job_descriptions = [
"Python, Django, React...",
"Java, Spring, MySQL...",
"Python, FastAPI, PostgreSQL..."
]
top_skills = extractor.get_top_skills(job_descriptions, top_n=5)
print(top_skills)
# Output: [('python', 2), ('react', 1), ('django', 1), ...]
Features
1. Skill Normalization
Handles variations and aliases:
from skill_extractor import extract_skills
# These all normalize to the same skill
texts = ["JS", "js", "JavaScript", "javascript"]
for text in texts:
    skills = extract_skills(text)
    print(skills)  # All output: ['javascript']
40+ Aliases Supported:
- js → javascript
- ts → typescript
- py → python
- reactjs, react.js → react
- nodejs → node.js
- pg, postgres → postgresql
- mongo → mongodb
- k8s → kubernetes
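The alias mapping above can be pictured as a plain dictionary lookup. The following is a minimal sketch, not the library's actual code: the `ALIASES` table and the `normalize_skill` helper are hypothetical names introduced here for illustration, and the real table inside `skill_extractor.py` may be structured differently.

```python
# Hypothetical alias table built from the aliases listed above; the real
# table inside skill_extractor.py may be organized differently.
ALIASES = {
    "js": "javascript",
    "ts": "typescript",
    "py": "python",
    "reactjs": "react",
    "react.js": "react",
    "nodejs": "node.js",
    "pg": "postgresql",
    "postgres": "postgresql",
    "mongo": "mongodb",
    "k8s": "kubernetes",
}


def normalize_skill(raw: str) -> str:
    """Trim and lowercase a skill name, then map known aliases to their canonical form."""
    key = raw.strip().lower()
    return ALIASES.get(key, key)


print([normalize_skill(s) for s in ["JS", " k8s ", "Postgres", "Python"]])
# ['javascript', 'kubernetes', 'postgresql', 'python']
```

Keeping aliases in one flat dictionary means a new variant can be supported by adding a single entry.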
2. Proficiency Detection
Extracts skill levels from text:
extractor = IndonesianSkillExtractor()
text = "Expert in Python, Advanced React, Basic MySQL"
result = extractor.extract(text)
for skill in result['skills']:
    print(f"{skill['normalized']}: {skill['proficiency']}")
# Output:
# python: expert
# react: expert (advanced maps to expert)
# mysql: beginner (basic maps to beginner)
Proficiency Keywords:
- Expert: expert, advanced, mahir, ahli, mastery
- Intermediate: intermediate, menengah, competent
- Beginner: beginner, basic, pemula, dasar
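Keyword-based proficiency detection along these lines could be sketched as follows. The keyword lists are copied from the list above; the `detect_proficiency` helper and its word-matching logic are assumptions made for illustration and are not taken from the library.

```python
import re
from typing import Optional

# Keyword lists copied from the section above; the lookup logic is an assumption.
PROFICIENCY_KEYWORDS = {
    "expert": ["expert", "advanced", "mahir", "ahli", "mastery"],
    "intermediate": ["intermediate", "menengah", "competent"],
    "beginner": ["beginner", "basic", "pemula", "dasar"],
}


def detect_proficiency(fragment: str) -> Optional[str]:
    """Return the first level whose keyword appears as a whole word, else None."""
    words = set(re.findall(r"[a-z]+", fragment.lower()))
    for level, keywords in PROFICIENCY_KEYWORDS.items():
        if words & set(keywords):
            return level
    return None


for fragment in ["Expert in Python", "Advanced React", "Basic MySQL", "Docker"]:
    print(fragment, "->", detect_proficiency(fragment))
# Expert in Python -> expert
# Advanced React -> expert
# Basic MySQL -> beginner
# Docker -> None
```

Matching on whole words rather than substrings avoids false hits such as "basic" inside "Visual Basic" being the only guard needed; that particular phrase would still match here, which illustrates why keyword-only detection is listed under Limitations.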
3. Indonesian Language Support
Handles Indonesian skill names and code-switching:
extractor = IndonesianSkillExtractor()
text = "Komunikasi yang baik, kerja sama tim, kepemimpinan, Python"
result = extractor.extract(text)
for skill in result['skills']:
    print(f"{skill['original']} → {skill['category']}")
# Output:
# Komunikasi → soft_skills
# kerja sama tim → soft_skills
# kepemimpinan → soft_skills (leadership)
# Python → programming
4. Comprehensive Parsing
Handles multiple formats:
# Comma-separated
extract_skills("Python, React, MySQL")
# Semicolon-separated
extract_skills("Python; React; MySQL")
# Bullet points
extract_skills("• Python • React • MySQL")
# Newline-separated
extract_skills("Python\nReact\nMySQL")
# Mixed with proficiency
extract_skills("Python (Expert), React (2 years), MySQL")
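One way to handle the formats shown above is to split on a shared set of separators and drop parenthesized notes before lookup. This is a sketch under that assumption; `split_skill_list` is a hypothetical helper, and the library's own tokenization may differ.

```python
import re
from typing import List

# Separators seen in the examples above: comma, semicolon, newline, bullet.
_SEPARATORS = re.compile(r"[,;\n•]+")
# Parenthesized notes such as "(Expert)" or "(2 years)".
_PAREN_NOTE = re.compile(r"\([^)]*\)")


def split_skill_list(text: str) -> List[str]:
    """Split a skill list into candidate names, dropping parenthesized notes."""
    fragments = _SEPARATORS.split(text)
    cleaned = (_PAREN_NOTE.sub("", fragment).strip() for fragment in fragments)
    return [candidate for candidate in cleaned if candidate]


print(split_skill_list("Python (Expert), React (2 years), MySQL"))
# ['Python', 'React', 'MySQL']
print(split_skill_list("• Python • React • MySQL"))
# ['Python', 'React', 'MySQL']
```

Note that this sketch splits before removing parentheses, so a note containing a comma would be cut in two; a fuller implementation would need to handle that case.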
Performance
| Metric | Value |
|---|---|
| Speed | 1000+ docs/second |
| Model Size | ~20 KB (pure Python) |
| Dependencies | None (stdlib only) |
| Skills Covered | 200+ |
| Categories | 7 |
| Aliases | 40+ |
| Languages | Indonesian + English |
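Throughput depends on hardware and document length, so it is worth measuring on your own data. The sketch below shows one way to do that with the standard library; `measure_throughput` and `fake_extract` are hypothetical names, and `fake_extract` is only a stand-in so the example runs on its own. Pass `extract_skills` instead to time the real extractor.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(extract: Callable[[str], object],
                       docs: Sequence[str],
                       repeats: int = 3) -> float:
    """Return the best observed documents-per-second over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            extract(doc)
        best = min(best, time.perf_counter() - start)
    return len(docs) / best if best > 0 else float("inf")


def fake_extract(text: str) -> List[str]:
    """Stand-in extractor (an assumption); replace with extract_skills."""
    return [part.strip().lower() for part in text.split(",")]


docs = ["Python, React, MySQL, AWS"] * 1000
rate = measure_throughput(fake_extract, docs)
print(f"{rate:.0f} docs/sec")
```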
Comparison with ML Models
| Feature | Skill Extractor | BERT-based NER |
|---|---|---|
| Training Data | Not required | Required (1000+ samples) |
| Model Size | 20 KB | 300+ MB |
| Speed | 1000+ docs/sec | 50 docs/sec |
| Deterministic | ✅ Yes | ❌ No |
| Explainable | ✅ Yes | ❌ No |
| Easy to Update | ✅ Just edit dict | ❌ Requires retraining |
Use Cases
1. Job-Candidate Matching
# Extract skills from job posting (as a set, so duplicates are not double-counted)
job_skills = set(extract_skills(job_description))
# Extract skills from resume
candidate_skills = set(extract_skills(resume_text))
# Calculate match percentage, guarding against postings with no recognized skills
matching_skills = job_skills & candidate_skills
match_score = len(matching_skills) / len(job_skills) * 100 if job_skills else 0.0
2. Skills Gap Analysis
# Get market demand
market_skills = extractor.get_top_skills(job_postings, top_n=20)
# Get candidate pool skills
candidate_skills = extractor.get_top_skills(resumes, top_n=20)
# Find gaps
in_demand = set(s[0] for s in market_skills)
available = set(s[0] for s in candidate_skills)
skill_gaps = in_demand - available
3. Trend Analysis
from collections import Counter
# Group by time period
skills_by_month = {}
for job in jobs:
    month = job['month']
    skills = extract_skills(job['requirements'])
    if month not in skills_by_month:
        skills_by_month[month] = []
    skills_by_month[month].extend(skills)
# Analyze trends
for month, skills in skills_by_month.items():
    top_5 = Counter(skills).most_common(5)
    print(f"{month}: {top_5}")
4. Resume Screening
required_skills = ['python', 'django', 'postgresql']
nice_to_have = ['react', 'docker', 'aws']
def score_resume(resume_text):
    candidate_skills = set(extract_skills(resume_text))
    # Required skills (2 points each)
    required_score = len(candidate_skills & set(required_skills)) * 2
    # Nice to have (1 point each)
    bonus_score = len(candidate_skills & set(nice_to_have)) * 1
    return required_score + bonus_score
# Rank candidates
candidates = [...]
ranked = sorted(candidates, key=lambda c: score_resume(c['resume']), reverse=True)
API Reference
IndonesianSkillExtractor
Main class for skill extraction.
Methods:
extract(text: str) -> Dict
- Full extraction with metadata
- Returns: skills, counts, categories, proficiency
extract_simple(text: str) -> List[str]
- Simple extraction returning skill names
- Returns: List of normalized skill strings
batch_extract(texts: List[str]) -> List[Dict]
- Process multiple texts
- Returns: List of extraction results
get_top_skills(texts: List[str], top_n: int) -> List[Tuple]
- Get most frequent skills across texts
- Returns: List of (skill, count) tuples
get_stats() -> Dict
- Get model statistics
- Returns: version, total_skills, categories, etc.
Convenience Functions
extract_skills(text: str) -> List[str]
- Quick one-line extraction
- Creates extractor instance automatically
License
This model is released under the MIT License.
Citation:
@software{indonesian_skill_extractor_2024,
author = {Herlambang Haryo Putro},
title = {Indonesian Skill Extractor v1.0},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/herlambangharyoputro/indonesian-skill-extractor-v1}
}
Contributions
Part of the Job Market Intelligence Platform project.
Repository: GitHub - job-market-intelligence-platform
Contributions welcome! If you:
- Find missing skills or categories
- Have suggestions for improvements
- Want to add more language support
- Build interesting projects using this model
Please open an issue or pull request on GitHub.
Contact
- Author: Herlambang Haryo Putro
- GitHub: @herlambangharyoputro
- Project: Job Market Intelligence Platform
Version History
- v1.0.0 (December 2024): Initial release
  - 200+ skills across 7 categories
  - 40+ aliases for normalization
  - Proficiency level detection
  - Indonesian language support
  - Zero dependencies
Limitations
Coverage
- Limited to predefined skill taxonomy (200+ skills)
- New/emerging skills may be categorized as 'other'
- Domain-specific skills may not be recognized
Language
- Primarily optimized for Indonesian job market
- May not capture all regional variations
- English technical terms preferred over Indonesian equivalents
Accuracy
- Rule-based approach may miss context-dependent skills
- Acronyms can be ambiguous (e.g., "AI" = Artificial Intelligence or Adobe Illustrator)
- Proficiency detection based on keywords only
Recommendations
- Best for structured skill lists (bullets, commas)
- Review 'other' category for domain-specific additions
- Combine with manual review for critical applications
- Consider ML-based approach for unstructured text
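Reviewing the 'other' category can be partly automated by counting which unrecognized names appear most often. The sketch below assumes results shaped like the `extract()` output in Basic Usage, with 'other' appearing as a `category` value; `other_candidates` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def other_candidates(results: Iterable[Dict], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count skills categorized as 'other' across results, most frequent first."""
    counts = Counter(
        skill["normalized"]
        for result in results
        for skill in result["skills"]
        if skill["category"] == "other"
    )
    return counts.most_common(top_n)


# Made-up sample results, shaped like the extract() output shown in Basic Usage.
sample_results = [
    {"skills": [
        {"normalized": "python", "category": "programming"},
        {"normalized": "sap", "category": "other"},
    ]},
    {"skills": [
        {"normalized": "sap", "category": "other"},
        {"normalized": "autocad", "category": "other"},
    ]},
]

print(other_candidates(sample_results))
# [('sap', 2), ('autocad', 1)]
```

Names that recur across many postings are the strongest candidates for adding to the taxonomy.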
Future Improvements
Planned features for v2.0:
- Expanded skill taxonomy (300+ skills)
- Industry-specific categories
- Skill clustering and relationships
- Confidence scoring
- Multi-language support (Javanese, Sundanese)
- Experience year extraction
- Certification detection
Last Updated: December 2024
Model Version: 1.0.0
Status: Production Ready
Type: Rule-based NER
For questions or collaboration, visit GitHub.