Indonesian Skill Extractor v1.0

A production-ready, rule-based NER system for extracting and categorizing technical and soft skills from Indonesian job postings.

🎯 Model Description

This model identifies and categorizes skills in Indonesian job market texts. It uses a skill taxonomy of 200+ predefined skills across 7 categories, combined with rule-based pattern matching and alias normalization.

Key Features

  • ✅ Zero Dependencies: Pure Python, no ML frameworks required
  • ✅ 200+ Skills: Comprehensive taxonomy across 7 categories
  • ✅ Bilingual: Handles English and Indonesian (including code-switching)
  • ✅ Skill Normalization: Maps aliases to canonical forms (js → javascript, etc.)
  • ✅ Proficiency Detection: Identifies beginner/intermediate/expert levels
  • ✅ Fast & Deterministic: 1000+ docs/sec, reproducible results
  • ✅ Production Ready: Lightweight (~20 KB), easy integration

Skill Categories

| Category | Count | Examples |
|---|---|---|
| programming | 30+ | Python, Java, JavaScript, TypeScript, PHP, C++, Go, Rust |
| frontend | 40+ | React, Vue, Angular, Next.js, HTML, CSS, Tailwind, Webpack |
| backend | 30+ | Node.js, Django, Laravel, Spring Boot, Express, FastAPI |
| database | 25+ | MySQL, PostgreSQL, MongoDB, Redis, Elasticsearch, Oracle |
| cloud | 35+ | AWS, Azure, GCP, Docker, Kubernetes, Jenkins, Terraform |
| data_science | 30+ | Pandas, TensorFlow, PyTorch, Tableau, Power BI, Spark |
| soft_skills | 20+ | Communication, Leadership, Teamwork, Problem Solving |

Total: 200+ skills with 40+ aliases and variations

🚀 Quick Start

Installation

No installation required! Just download the single Python file:

# Download skill_extractor.py from this repository
# Place it in your project directory
from skill_extractor import IndonesianSkillExtractor

# Or use convenience function
from skill_extractor import extract_skills

Basic Usage

from skill_extractor import IndonesianSkillExtractor

# Initialize
extractor = IndonesianSkillExtractor()

# Extract skills from text
text = "Menguasai Python, React, MySQL, dan komunikasi yang baik"
result = extractor.extract(text)

print(result)
# Output:
# {
#   'skills': [
#     {'original': 'Python', 'normalized': 'python', 'category': 'programming', 'proficiency': None},
#     {'original': 'React', 'normalized': 'react', 'category': 'frontend', 'proficiency': None},
#     {'original': 'MySQL', 'normalized': 'mysql', 'category': 'database', 'proficiency': None},
#     {'original': 'komunikasi', 'normalized': 'komunikasi', 'category': 'soft_skills', 'proficiency': None}
#   ],
#   'total_count': 4,
#   'unique_count': 4,
#   'by_category': {
#     'programming': [...],
#     'frontend': [...],
#     'database': [...],
#     'soft_skills': [...]
#   }
# }

Simple Extraction

from skill_extractor import extract_skills

# Quick extraction (returns list of skill names)
skills = extract_skills("Python, React, MySQL, AWS")
print(skills)
# Output: ['python', 'react', 'mysql', 'aws']

Batch Processing

extractor = IndonesianSkillExtractor()

texts = [
    "Python, Django, PostgreSQL",
    "React, TypeScript, Node.js",
    "AWS, Docker, Kubernetes"
]

results = extractor.batch_extract(texts)

for i, result in enumerate(results):
    print(f"Text {i+1}: {result['total_count']} skills, {len(result['by_category'])} categories")

Get Top Skills

extractor = IndonesianSkillExtractor()

job_descriptions = [
    "Python, Django, React...",
    "Java, Spring, MySQL...",
    "Python, FastAPI, PostgreSQL..."
]

top_skills = extractor.get_top_skills(job_descriptions, top_n=5)
print(top_skills)
# Output: [('python', 2), ('react', 1), ('django', 1), ...]
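
For intuition, a cross-document frequency count of this kind could be built on `collections.Counter`. The sketch below is not the library's implementation: `top_skills_sketch` is a hypothetical helper that takes already-extracted skill lists, and both the once-per-document counting and the alphabetical tie-break are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_skills_sketch(skill_lists: Iterable[List[str]], top_n: int = 5) -> List[Tuple[str, int]]:
    """Count how many documents mention each skill (hypothetical helper)."""
    counts: Counter = Counter()
    for skills in skill_lists:
        # Assumption: a skill repeated inside one document counts once.
        counts.update(set(skills))
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


extracted = [
    ["python", "django", "react"],
    ["java", "spring", "mysql"],
    ["python", "fastapi", "postgresql"],
]
print(top_skills_sketch(extracted, top_n=3))
# [('python', 2), ('django', 1), ('fastapi', 1)]
```

An explicit tie-break keeps the ranking reproducible regardless of the order in which documents are processed.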

📊 Features

1. Skill Normalization

Handles variations and aliases:

extractor = IndonesianSkillExtractor()

# These all normalize to the same skill
texts = ["JS", "js", "JavaScript", "javascript"]
for text in texts:
    skills = extractor.extract_simple(text)
    print(skills)  # All output: ['javascript']

40+ Aliases Supported:

  • js → javascript
  • ts → typescript
  • py → python
  • reactjs, react.js → react
  • nodejs → node.js
  • pg, postgres → postgresql
  • mongo → mongodb
  • k8s → kubernetes
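
A dictionary lookup is one simple way such alias mapping can work. The following is a minimal sketch that uses only the pairs listed above; `ALIASES` and `normalize_skill` are hypothetical names, and the extractor's internal table may be organized differently.

```python
# Hypothetical alias table built from the pairs listed above.
ALIASES = {
    "js": "javascript",
    "ts": "typescript",
    "py": "python",
    "reactjs": "react",
    "react.js": "react",
    "nodejs": "node.js",
    "pg": "postgresql",
    "postgres": "postgresql",
    "mongo": "mongodb",
    "k8s": "kubernetes",
}


def normalize_skill(raw: str) -> str:
    """Lowercase and trim the token, then map known aliases to canonical names."""
    key = raw.strip().lower()
    # Tokens without an alias entry fall through unchanged (already canonical).
    return ALIASES.get(key, key)


for token in ["JS", " ReactJS ", "Postgres", "K8s", "Python"]:
    print(f"{token!r} -> {normalize_skill(token)}")
```

Lowercasing before the lookup means the table needs only one entry per alias, whatever casing appears in the posting.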

2. Proficiency Detection

Extracts skill levels from text:

text = "Expert in Python, Advanced React, Basic MySQL"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['normalized']}: {skill['proficiency']}")

# Output:
# python: expert
# react: expert (advanced maps to expert)
# mysql: beginner (basic maps to beginner)

Proficiency Keywords:

  • Expert: expert, advanced, mahir, ahli, mastery
  • Intermediate: intermediate, menengah, competent
  • Beginner: beginner, basic, pemula, dasar
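
Keyword-based detection along these lines can be sketched with whole-word regular expression matching. The keyword lists below are copied from this section; the function name `detect_proficiency` and the matching strategy (first level with a hit wins) are assumptions, not the extractor's actual code.

```python
import re
from typing import Optional

# Keyword lists copied from the section above; the matching logic is an assumption.
PROFICIENCY_KEYWORDS = {
    "expert": ["expert", "advanced", "mahir", "ahli", "mastery"],
    "intermediate": ["intermediate", "menengah", "competent"],
    "beginner": ["beginner", "basic", "pemula", "dasar"],
}


def detect_proficiency(fragment: str) -> Optional[str]:
    """Return the first level whose keyword appears as a whole word, else None."""
    lowered = fragment.lower()
    for level, keywords in PROFICIENCY_KEYWORDS.items():
        for keyword in keywords:
            if re.search(rf"\b{re.escape(keyword)}\b", lowered):
                return level
    return None


for fragment in ["Expert in Python", "Advanced React", "Basic MySQL", "MySQL"]:
    print(f"{fragment}: {detect_proficiency(fragment)}")
# Expert in Python: expert
# Advanced React: expert
# Basic MySQL: beginner
# MySQL: None
```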

3. Indonesian Language Support

Handles Indonesian skill names and code-switching:

text = "Komunikasi yang baik, kerja sama tim, kepemimpinan, Python"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['original']} → {skill['category']}")

# Output:
# Komunikasi → soft_skills
# kerja sama tim → soft_skills
# kepemimpinan → soft_skills (leadership)
# Python → programming
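
Multi-word Indonesian phrases such as "kerja sama tim" need to be matched as a unit. One way to do that, sketched here under assumptions, is a single case-insensitive regular expression that tries longer lexicon entries first. `LEXICON` is a hypothetical four-entry stand-in for the real taxonomy, and `find_skills` is an illustrative name.

```python
import re
from typing import List, Tuple

# Hypothetical mini-lexicon; the real taxonomy is larger and may differ.
LEXICON = {
    "komunikasi": "soft_skills",
    "kerja sama tim": "soft_skills",
    "kepemimpinan": "soft_skills",
    "python": "programming",
}

# Try longer phrases first so a multi-word entry wins over any shorter one.
_PATTERN = re.compile(
    "|".join(
        rf"(?<!\w){re.escape(term)}(?!\w)"
        for term in sorted(LEXICON, key=len, reverse=True)
    ),
    flags=re.IGNORECASE,
)


def find_skills(text: str) -> List[Tuple[str, str]]:
    """Return (matched text, category) pairs in order of appearance."""
    return [(m.group(0), LEXICON[m.group(0).lower()]) for m in _PATTERN.finditer(text)]


text = "Komunikasi yang baik, kerja sama tim, kepemimpinan, Python"
for original, category in find_skills(text):
    print(f"{original} -> {category}")
```

The lookarounds act as word boundaries, so a term is not matched inside a longer word.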

4. Comprehensive Parsing

Handles multiple formats:

# Comma-separated
extract_skills("Python, React, MySQL")

# Semicolon-separated
extract_skills("Python; React; MySQL")

# Bullet points
extract_skills("• Python • React • MySQL")

# Newline-separated
extract_skills("Python\nReact\nMySQL")

# Mixed with proficiency
extract_skills("Python (Expert), React (2 years), MySQL")
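
The formats above can all be reduced to a list of candidate tokens before any lookup happens. The sketch below shows one possible tokenization step with the standard `re` module; `split_skill_candidates` is a hypothetical helper, and dropping parenthesised notes at this stage is an assumption (a full pipeline would presumably read the proficiency note first).

```python
import re
from typing import List

# Separators seen in the examples: comma, semicolon, newline, bullet (U+2022).
_SEPARATORS = re.compile(r"[,;\n\u2022]+")
# Parenthesised notes such as "(Expert)" or "(2 years)".
_PAREN_NOTE = re.compile(r"\([^)]*\)")


def split_skill_candidates(text: str) -> List[str]:
    """Split raw text into trimmed candidate tokens, dropping parenthesised notes."""
    without_notes = _PAREN_NOTE.sub("", text)
    parts = _SEPARATORS.split(without_notes)
    return [part.strip() for part in parts if part.strip()]


print(split_skill_candidates("Python; React; MySQL"))
print(split_skill_candidates("Python (Expert), React (2 years), MySQL"))
```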

📈 Performance

| Metric | Value |
|---|---|
| Speed | 1000+ docs/second |
| Model Size | ~20 KB (pure Python) |
| Dependencies | None (stdlib only) |
| Skills Covered | 200+ |
| Categories | 7 |
| Aliases | 40+ |
| Languages | Indonesian + English |
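
Throughput depends on hardware and on document length, so it is worth measuring on your own data. The harness below is a minimal sketch: `measure_docs_per_second` and `toy_extract` are hypothetical names, and `toy_extract` is only a stand-in so the example runs without the library. Pass `extract_skills` instead to time the real extractor.

```python
import time
from typing import Callable, List, Sequence


def measure_docs_per_second(
    extract_fn: Callable[[str], List[str]],
    docs: Sequence[str],
    repeats: int = 5,
) -> float:
    """Return the best observed throughput (documents per second) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            extract_fn(doc)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very coarse timer.
    return len(docs) / best if best > 0 else float("inf")


def toy_extract(text: str) -> List[str]:
    # Stand-in extractor (an assumption); swap in extract_skills to test the real one.
    return [token.strip().lower() for token in text.split(",") if token.strip()]


docs = ["Python, Django, PostgreSQL", "React, TypeScript, Node.js"] * 500
print(f"{measure_docs_per_second(toy_extract, docs):,.0f} docs/sec")
```

Taking the best of several runs reduces noise from other processes on the machine.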

Comparison with ML Models

| Feature | Skill Extractor | BERT-based NER |
|---|---|---|
| Training Data | Not required | Required (1000+ samples) |
| Model Size | 20 KB | 300+ MB |
| Speed | 1000+ docs/sec | 50 docs/sec |
| Deterministic | ✅ Yes | ❌ No |
| Explainable | ✅ Yes | ❌ No |
| Easy to Update | ✅ Just edit dict | ❌ Requires retraining |

🎯 Use Cases

1. Job-Candidate Matching

# Extract skills from job posting
job_skills = extract_skills(job_description)

# Extract skills from resume
candidate_skills = extract_skills(resume_text)

# Calculate match percentage over the distinct skills in the job posting
required = set(job_skills)
matching_skills = required & set(candidate_skills)
match_score = len(matching_skills) / len(required) * 100 if required else 0.0

2. Skills Gap Analysis

# Get market demand
market_skills = extractor.get_top_skills(job_postings, top_n=20)

# Get candidate pool skills
candidate_skills = extractor.get_top_skills(resumes, top_n=20)

# Find gaps
in_demand = set(s[0] for s in market_skills)
available = set(s[0] for s in candidate_skills)
skill_gaps = in_demand - available

3. Trend Analysis

from collections import Counter

# Group by time period
skills_by_month = {}
for job in jobs:
    month = job['month']
    skills = extract_skills(job['requirements'])
    
    if month not in skills_by_month:
        skills_by_month[month] = []
    skills_by_month[month].extend(skills)

# Analyze trends
for month, skills in skills_by_month.items():
    top_5 = Counter(skills).most_common(5)
    print(f"{month}: {top_5}")

4. Resume Screening

required_skills = ['python', 'django', 'postgresql']
nice_to_have = ['react', 'docker', 'aws']

def score_resume(resume_text):
    candidate_skills = set(extract_skills(resume_text))
    
    # Required skills (2 points each)
    required_score = len(candidate_skills & set(required_skills)) * 2
    
    # Nice to have (1 point each)
    bonus_score = len(candidate_skills & set(nice_to_have)) * 1
    
    return required_score + bonus_score

# Rank candidates
candidates = [...]
ranked = sorted(candidates, key=lambda c: score_resume(c['resume']), reverse=True)

🔧 API Reference

IndonesianSkillExtractor

Main class for skill extraction.

Methods:

extract(text: str) -> Dict

  • Full extraction with metadata
  • Returns: skills, counts, categories, proficiency

extract_simple(text: str) -> List[str]

  • Simple extraction returning skill names
  • Returns: List of normalized skill strings

batch_extract(texts: List[str]) -> List[Dict]

  • Process multiple texts
  • Returns: List of extraction results

get_top_skills(texts: List[str], top_n: int) -> List[Tuple]

  • Get most frequent skills across texts
  • Returns: List of (skill, count) tuples

get_stats() -> Dict

  • Get model statistics
  • Returns: version, total_skills, categories, etc.
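
To make the shape of the `extract()` result concrete, the sketch below assembles a dictionary with the same top-level keys shown in the Basic Usage output. `summarize_skills` is a hypothetical helper, not part of the library; the contents of `by_category` are elided in that output, so storing normalized names there is an assumption, as is judging uniqueness on the normalized name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_skills(skills: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble a result dict shaped like the documented extract() output."""
    by_category: Dict[str, List[str]] = defaultdict(list)
    for skill in skills:
        # Assumption: each category lists the normalized skill names.
        by_category[skill["category"]].append(skill["normalized"])
    return {
        "skills": skills,
        "total_count": len(skills),
        # Assumption: uniqueness is judged on the normalized name.
        "unique_count": len({skill["normalized"] for skill in skills}),
        "by_category": dict(by_category),
    }


sample = [
    {"original": "JS", "normalized": "javascript", "category": "programming", "proficiency": None},
    {"original": "JavaScript", "normalized": "javascript", "category": "programming", "proficiency": None},
    {"original": "MySQL", "normalized": "mysql", "category": "database", "proficiency": None},
]
result = summarize_skills(sample)
print(result["total_count"], result["unique_count"], sorted(result["by_category"]))
# 3 2 ['database', 'programming']
```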

Convenience Functions

extract_skills(text: str) -> List[str]

  • Quick one-line extraction
  • Creates extractor instance automatically

📄 License

This model is released under the MIT License.

Citation:

@software{indonesian_skill_extractor_2024,
  author = {Herlambang Haryo Putro},
  title = {Indonesian Skill Extractor v1.0},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/herlambangharyoputro/indonesian-skill-extractor-v1}
}

🤝 Contributions

Part of the Job Market Intelligence Platform project.

Repository: GitHub - job-market-intelligence-platform

Contributions welcome! If you:

  • Find missing skills or categories
  • Have suggestions for improvements
  • Want to add more language support
  • Build interesting projects using this model

Please open an issue or pull request on GitHub.

🔄 Version History

  • v1.0.0 (December 2024): Initial release
    • 200+ skills across 7 categories
    • 40+ aliases for normalization
    • Proficiency level detection
    • Indonesian language support
    • Zero dependencies

⚠️ Limitations

Coverage

  • Limited to predefined skill taxonomy (200+ skills)
  • New/emerging skills may be categorized as 'other'
  • Domain-specific skills may not be recognized

Language

  • Primarily optimized for the Indonesian job market
  • May not capture all regional variations
  • English technical terms preferred over Indonesian equivalents

Accuracy

  • Rule-based approach may miss context-dependent skills
  • Acronyms can be ambiguous (e.g., "AI" = Artificial Intelligence or Adobe Illustrator)
  • Proficiency detection based on keywords only

Recommendations

  • Best for structured skill lists (bullets, commas)
  • Review 'other' category for domain-specific additions
  • Combine with manual review for critical applications
  • Consider ML-based approach for unstructured text

🎯 Future Improvements

Planned features for v2.0:

  • Expanded skill taxonomy (300+ skills)
  • Industry-specific categories
  • Skill clustering and relationships
  • Confidence scoring
  • Multi-language support (Javanese, Sundanese)
  • Experience year extraction
  • Certification detection

Last Updated: December 2024
Model Version: 1.0.0
Status: ✅ Production Ready
Type: Rule-based NER

For questions or collaboration, visit GitHub.
