Indonesian Skill Extractor v1.0

A production-ready, rule-based NER system for extracting and categorizing technical and soft skills from Indonesian job postings.

🎯 Model Description

This model identifies and categorizes skills in Indonesian job market texts. It uses a skill taxonomy of 200+ predefined skills across 7 categories, combined with pattern matching and alias normalization.

Key Features

✅ Zero Dependencies: Pure Python, no ML frameworks required
✅ 200+ Skills: Comprehensive taxonomy across 7 categories
✅ Bilingual: Handles English and Indonesian (including code-switching)
✅ Skill Normalization: Maps aliases to canonical forms (js→javascript, etc.)
✅ Proficiency Detection: Identifies beginner/intermediate/expert levels
✅ Fast & Deterministic: 1000+ docs/sec, reproducible results
✅ Production Ready: Lightweight (~20 KB), easy integration

Skill Categories

Category	Count	Examples
programming	30+	Python, Java, JavaScript, TypeScript, PHP, C++, Go, Rust
frontend	40+	React, Vue, Angular, Next.js, HTML, CSS, Tailwind, Webpack
backend	30+	Node.js, Django, Laravel, Spring Boot, Express, FastAPI
database	25+	MySQL, PostgreSQL, MongoDB, Redis, Elasticsearch, Oracle
cloud	35+	AWS, Azure, GCP, Docker, Kubernetes, Jenkins, Terraform
data_science	30+	Pandas, TensorFlow, PyTorch, Tableau, Power BI, Spark
soft_skills	20+	Communication, Leadership, Teamwork, Problem Solving

Total: 200+ skills with 40+ aliases and variations
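
The taxonomy can be pictured as a mapping from category to a set of canonical skill names, inverted once for fast lookup. The sketch below is illustrative only: `TAXONOMY`, `SKILL_TO_CATEGORY`, and `categorize` are hypothetical names, and the entries are a small sample from the table above, not the actual contents of `skill_extractor.py`.

```python
# Hypothetical miniature taxonomy; the real one ships 200+ skills in 7 categories.
TAXONOMY = {
    "programming": {"python", "java", "javascript"},
    "frontend": {"react", "vue", "angular"},
    "database": {"mysql", "postgresql", "mongodb"},
    "soft_skills": {"communication", "leadership", "komunikasi"},
}

# Invert once so each lookup is a single dict access.
SKILL_TO_CATEGORY = {
    skill: category
    for category, skills in TAXONOMY.items()
    for skill in skills
}

def categorize(skill: str) -> str:
    """Return the category of a skill, or 'other' when it is not in the taxonomy."""
    return SKILL_TO_CATEGORY.get(skill.strip().lower(), "other")

print(categorize("Python"))  # programming
print(categorize("COBOL"))   # other
```

Under this layout, adding a skill would be a one-line change to the relevant set.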

🚀 Quick Start

Installation

No installation required! Just download the single Python file:

# Download skill_extractor.py from this repository
# Place it in your project directory
from skill_extractor import IndonesianSkillExtractor

# Or use convenience function
from skill_extractor import extract_skills

Basic Usage

from skill_extractor import IndonesianSkillExtractor

# Initialize
extractor = IndonesianSkillExtractor()

# Extract skills from text
text = "Menguasai Python, React, MySQL, dan komunikasi yang baik"
result = extractor.extract(text)

print(result)
# Output:
# {
#   'skills': [
#     {'original': 'Python', 'normalized': 'python', 'category': 'programming', 'proficiency': None},
#     {'original': 'React', 'normalized': 'react', 'category': 'frontend', 'proficiency': None},
#     {'original': 'MySQL', 'normalized': 'mysql', 'category': 'database', 'proficiency': None},
#     {'original': 'komunikasi', 'normalized': 'komunikasi', 'category': 'soft_skills', 'proficiency': None}
#   ],
#   'total_count': 4,
#   'unique_count': 4,
#   'by_category': {
#     'programming': [...],
#     'frontend': [...],
#     'database': [...],
#     'soft_skills': [...]
#   }
# }

Simple Extraction

from skill_extractor import extract_skills

# Quick extraction (returns list of skill names)
skills = extract_skills("Python, React, MySQL, AWS")
print(skills)
# Output: ['python', 'react', 'mysql', 'aws']

Batch Processing

from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

texts = [
    "Python, Django, PostgreSQL",
    "React, TypeScript, Node.js",
    "AWS, Docker, Kubernetes"
]

results = extractor.batch_extract(texts)

for i, result in enumerate(results):
    print(f"Text {i+1}: {result['total_count']} skills, {len(result['by_category'])} categories")

Get Top Skills

from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

job_descriptions = [
    "Python, Django, React...",
    "Java, Spring, MySQL...",
    "Python, FastAPI, PostgreSQL..."
]

top_skills = extractor.get_top_skills(job_descriptions, top_n=5)
print(top_skills)
# Example output (order among equal counts may differ): [('python', 2), ('react', 1), ('django', 1), ...]

📊 Features

1. Skill Normalization

Handles variations and aliases:

from skill_extractor import extract_skills

# These all normalize to the same skill
texts = ["JS", "js", "JavaScript", "javascript"]
for text in texts:
    skills = extract_skills(text)
    print(skills)  # All output: ['javascript']

40+ Aliases Supported:

js → javascript
ts → typescript
py → python
reactjs, react.js → react
nodejs → node.js
pg, postgres → postgresql
mongo → mongodb
k8s → kubernetes
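
Alias normalization of this kind usually comes down to a dictionary lookup on a lowercased, trimmed string. A minimal sketch, assuming a hypothetical `ALIASES` table and `normalize_skill` helper (neither name comes from `skill_extractor.py`), using only the aliases listed above:

```python
# Hypothetical alias table built from the aliases listed above.
ALIASES = {
    "js": "javascript",
    "ts": "typescript",
    "py": "python",
    "reactjs": "react",
    "react.js": "react",
    "nodejs": "node.js",
    "pg": "postgresql",
    "postgres": "postgresql",
    "mongo": "mongodb",
    "k8s": "kubernetes",
}

def normalize_skill(raw: str) -> str:
    """Lowercase and trim a mention, then map a known alias to its canonical form."""
    key = raw.strip().lower()
    return ALIASES.get(key, key)

print([normalize_skill(s) for s in ["JS", "React.js", "K8s", "Python"]])
# ['javascript', 'react', 'kubernetes', 'python']
```

Names that are not aliases pass through unchanged apart from lowercasing, which matches the lowercase `normalized` values shown in the examples.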

2. Proficiency Detection

Extracts skill levels from text:

from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

text = "Expert in Python, Advanced React, Basic MySQL"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['normalized']}: {skill['proficiency']}")

# Output:
# python: expert
# react: expert (advanced maps to expert)
# mysql: beginner (basic maps to beginner)

Proficiency Keywords:

Expert: expert, advanced, mahir, ahli, mastery
Intermediate: intermediate, menengah, competent
Beginner: beginner, basic, pemula, dasar
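
Keyword-based proficiency detection can be sketched as a whole-word search over the text fragment that surrounds a skill. The table below copies the keywords listed above; `PROFICIENCY_KEYWORDS` and `detect_proficiency` are hypothetical names, and how the extractor itself scopes a keyword to a skill is not documented here, so this sketch simply works one comma-separated fragment at a time.

```python
import re
from typing import Optional

# Keyword-to-level table taken from the list above.
PROFICIENCY_KEYWORDS = {
    "expert": ["expert", "advanced", "mahir", "ahli", "mastery"],
    "intermediate": ["intermediate", "menengah", "competent"],
    "beginner": ["beginner", "basic", "pemula", "dasar"],
}

def detect_proficiency(fragment: str) -> Optional[str]:
    """Return the first level whose keyword appears as a whole word, else None."""
    lowered = fragment.lower()
    for level, keywords in PROFICIENCY_KEYWORDS.items():
        for keyword in keywords:
            if re.search(rf"\b{re.escape(keyword)}\b", lowered):
                return level
    return None

for fragment in "Expert in Python, Advanced React, Basic MySQL, Docker".split(","):
    print(fragment.strip(), "->", detect_proficiency(fragment))
```

The word-boundary match keeps "dasar" from firing inside "berdasarkan", but a keyword-only rule will still misread names such as "Visual Basic".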

3. Indonesian Language Support

Handles Indonesian skill names and code-switching:

from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

text = "Komunikasi yang baik, kerja sama tim, kepemimpinan, Python"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['original']} → {skill['category']}")

# Output:
# Komunikasi → soft_skills
# kerja sama tim → soft_skills
# kepemimpinan → soft_skills (leadership)
# Python → programming
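
Multi-word Indonesian skills such as "kerja sama tim" need phrase-level matching rather than token-level matching. One common approach, shown here as a sketch and not as the extractor's actual method, is a regular expression whose alternatives are sorted longest first. `PHRASES` and `find_phrases` are hypothetical names, and the phrase list is a small sample.

```python
import re
from typing import List

# Hypothetical sample of soft-skill phrases.
PHRASES = ["komunikasi", "kerja sama", "kerja sama tim", "kepemimpinan"]

# Longest alternatives first, so "kerja sama tim" wins over "kerja sama".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in sorted(PHRASES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def find_phrases(text: str) -> List[str]:
    """Return matched phrases, lowercased, in order of appearance."""
    return [match.group(1).lower() for match in _PATTERN.finditer(text)]

print(find_phrases("Komunikasi yang baik, kerja sama tim, kepemimpinan, Python"))
# ['komunikasi', 'kerja sama tim', 'kepemimpinan']
```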

4. Comprehensive Parsing

Handles multiple formats:

# Comma-separated
extract_skills("Python, React, MySQL")

# Semicolon-separated
extract_skills("Python; React; MySQL")

# Bullet points
extract_skills("• Python • React • MySQL")

# Newline-separated
extract_skills("Python\nReact\nMySQL")

# Mixed with parenthetical annotations
extract_skills("Python (Expert), React (2 years), MySQL")
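
The formats above can all be reduced to a list of candidate strings with one delimiter split. The following is a rough sketch under the assumption that parenthetical notes are dropped before matching; `split_candidates` and both patterns are hypothetical and may differ from what `skill_extractor.py` does internally.

```python
import re
from typing import List

_DELIMITERS = re.compile(r"[,;•\n]+")       # commas, semicolons, bullets, newlines
_PARENTHETICAL = re.compile(r"\([^)]*\)")   # e.g. "(Expert)" or "(2 years)"

def split_candidates(text: str) -> List[str]:
    """Split a skill list into trimmed candidates, dropping parenthetical notes."""
    without_notes = _PARENTHETICAL.sub("", text)
    return [part.strip() for part in _DELIMITERS.split(without_notes) if part.strip()]

print(split_candidates("• Python • React • MySQL"))
# ['Python', 'React', 'MySQL']
print(split_candidates("Python (Expert), React (2 years); MySQL\nDocker"))
# ['Python', 'React', 'MySQL', 'Docker']
```

A real implementation would likely read the parenthetical for proficiency keywords before discarding it.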

📈 Performance

Metric	Value
Speed	1000+ docs/second
Model Size	~20 KB (pure Python)
Dependencies	None (stdlib only)
Skills Covered	200+
Categories	7
Aliases	40+
Languages	Indonesian + English
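
Throughput depends on hardware and document length, so it is worth measuring on your own data. A minimal timing sketch, assuming a hypothetical `docs_per_second` helper and a stand-in `fake_extract` function so that the block runs without `skill_extractor.py`:

```python
import time
from typing import Callable, Iterable, List

def docs_per_second(extract: Callable[[str], object], docs: Iterable[str]) -> float:
    """Run `extract` over every document and return the observed throughput."""
    doc_list = list(docs)
    start = time.perf_counter()
    for doc in doc_list:
        extract(doc)
    elapsed = time.perf_counter() - start
    return len(doc_list) / elapsed if elapsed > 0 else float("inf")

# Stand-in extractor (an assumption); swap in extractor.extract to time the real one.
def fake_extract(text: str) -> List[str]:
    return [token.strip().lower() for token in text.split(",") if token.strip()]

sample = ["Python, Django, PostgreSQL"] * 1000
print(f"{docs_per_second(fake_extract, sample):.0f} docs/sec")
```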

Comparison with ML Models

Feature	Skill Extractor	BERT-based NER
Training Data	Not required	Required (1000+ samples)
Model Size	20 KB	300+ MB
Speed	1000+ docs/sec	50 docs/sec
Deterministic	✅ Yes	⚠️ Depends on hardware and settings
Explainable	✅ Yes	❌ No
Easy to Update	✅ Just edit dict	❌ Requires retraining

🎯 Use Cases

1. Job-Candidate Matching

from skill_extractor import extract_skills

# Extract skills from the job posting and the resume (deduplicated)
job_skills = set(extract_skills(job_description))
candidate_skills = set(extract_skills(resume_text))

# Share of the job's skills that the candidate covers, as a percentage
matching_skills = job_skills & candidate_skills
match_score = len(matching_skills) / len(job_skills) * 100 if job_skills else 0.0

2. Skills Gap Analysis

# Get market demand
market_skills = extractor.get_top_skills(job_postings, top_n=20)

# Get candidate pool skills
candidate_skills = extractor.get_top_skills(resumes, top_n=20)

# Find gaps
in_demand = set(s[0] for s in market_skills)
available = set(s[0] for s in candidate_skills)
skill_gaps = in_demand - available

3. Trend Analysis

from collections import Counter, defaultdict

# Group by time period
skills_by_month = defaultdict(list)
for job in jobs:
    skills = extract_skills(job['requirements'])
    skills_by_month[job['month']].extend(skills)

# Analyze trends
for month, skills in skills_by_month.items():
    top_5 = Counter(skills).most_common(5)
    print(f"{month}: {top_5}")

4. Resume Screening

required_skills = ['python', 'django', 'postgresql']
nice_to_have = ['react', 'docker', 'aws']

def score_resume(resume_text):
    candidate_skills = set(extract_skills(resume_text))
    
    # Required skills (2 points each)
    required_score = len(candidate_skills & set(required_skills)) * 2
    
    # Nice to have (1 point each)
    bonus_score = len(candidate_skills & set(nice_to_have)) * 1
    
    return required_score + bonus_score

# Rank candidates
candidates = [...]
ranked = sorted(candidates, key=lambda c: score_resume(c['resume']), reverse=True)

🔧 API Reference

`IndonesianSkillExtractor`

Main class for skill extraction.

Methods:

extract(text: str) -> Dict

Full extraction with metadata
Returns: skills, counts, categories, proficiency

extract_simple(text: str) -> List[str]

Simple extraction returning skill names
Returns: List of normalized skill strings

batch_extract(texts: List[str]) -> List[Dict]

Process multiple texts
Returns: List of extraction results

get_top_skills(texts: List[str], top_n: int) -> List[Tuple]

Get most frequent skills across texts
Returns: List of (skill, count) tuples

get_stats() -> Dict

Get model statistics
Returns: version, total_skills, categories, etc.
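
The documented behaviour of `get_top_skills` (most frequent skills across texts, returned as `(skill, count)` tuples) can be approximated with `collections.Counter`. This is a sketch, not the library's implementation: `top_skills`, its `extract` parameter, and `naive_extract` are hypothetical. This card does not say whether the real method counts a skill once per text or once per mention; the sketch assumes once per text.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

def top_skills(
    texts: Iterable[str],
    extract: Callable[[str], List[str]],
    top_n: int = 10,
) -> List[Tuple[str, int]]:
    """Count each skill once per text and return the top_n (skill, count) pairs."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(set(extract(text)))
    return counts.most_common(top_n)

# Stand-in for extract_skills so the sketch runs without skill_extractor.py.
def naive_extract(text: str) -> List[str]:
    return [part.strip().lower() for part in text.split(",") if part.strip()]

docs = ["Python, Django, React", "Java, Spring, MySQL", "Python, FastAPI, PostgreSQL"]
print(top_skills(docs, naive_extract, top_n=1))
# [('python', 2)]
```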

Convenience Functions

extract_skills(text: str) -> List[str]

Quick one-line extraction
Creates extractor instance automatically

📄 License

This model is released under the MIT License.

Citation:

@software{indonesian_skill_extractor_2024,
  author = {Herlambang Haryo Putro},
  title = {Indonesian Skill Extractor v1.0},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/herlambangharyoputro/indonesian-skill-extractor-v1}
}

🤝 Contributions

Part of the Job Market Intelligence Platform project.

Repository: GitHub - job-market-intelligence-platform

Contributions welcome! If you:

Find missing skills or categories
Have suggestions for improvements
Want to add more language support
Build interesting projects using this model

Please open an issue or pull request on GitHub.

📧 Contact

Author: Herlambang Haryo Putro
GitHub: @herlambangharyoputro
Project: Job Market Intelligence Platform

🔄 Version History

v1.0.0 (December 2024): Initial release
- 200+ skills across 7 categories
- 40+ aliases for normalization
- Proficiency level detection
- Indonesian language support
- Zero dependencies

⚠️ Limitations

Coverage

Limited to predefined skill taxonomy (200+ skills)
New/emerging skills may be categorized as 'other'
Domain-specific skills may not be recognized

Language

Primarily optimized for Indonesian job market
May not capture all regional variations
English technical terms preferred over Indonesian equivalents

Accuracy

Rule-based approach may miss context-dependent skills
Acronyms can be ambiguous (e.g., "AI" = Artificial Intelligence or Adobe Illustrator)
Proficiency detection based on keywords only
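
One way to reduce acronym ambiguity in a rule-based setup is to look at the other words in the same text. The sketch below is purely illustrative and is not a feature of v1.0: `CONTEXT_CUES`, the cue words, and `disambiguate_ai` are all assumptions made for the example.

```python
import re

# Hypothetical context cues; a real deployment would tune these lists.
CONTEXT_CUES = {
    "adobe illustrator": {"photoshop", "figma", "desain", "design", "coreldraw"},
    "artificial intelligence": {"machine", "learning", "python", "tensorflow", "data"},
}

def disambiguate_ai(text: str) -> str:
    """Pick the reading of 'AI' whose cue words overlap the text most; ties stay ambiguous."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & words) for sense, cues in CONTEXT_CUES.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else "ambiguous"

print(disambiguate_ai("Menguasai Photoshop, AI, dan CorelDRAW"))  # adobe illustrator
print(disambiguate_ai("AI, machine learning, TensorFlow"))         # artificial intelligence
print(disambiguate_ai("AI"))                                        # ambiguous
```

Returning "ambiguous" instead of guessing leaves such cases for manual review.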

Recommendations

Best for structured skill lists (bullets, commas)
Review 'other' category for domain-specific additions
Combine with manual review for critical applications
Consider ML-based approach for unstructured text

🎯 Future Improvements

Planned features for v2.0:

Expanded skill taxonomy (300+ skills)
Industry-specific categories
Skill clustering and relationships
Confidence scoring
Multi-language support (Javanese, Sundanese)
Experience year extraction
Certification detection

Last Updated: December 2024
Model Version: 1.0.0
Status: ✅ Production Ready
Type: Rule-based NER

For questions or collaboration, visit GitHub.
