【JAPAN AI】AI Quality Scientist / English | 株式会社ジーニー

About JAPAN AI

JAPAN AI, Inc. was established in April 2023 as a group company of Geniee, Inc. (TSE Growth Market) with the mission of dramatically expanding human potential through AI technology. We drive cutting-edge AI R&D both domestically and internationally.

Why We're Hiring

The output quality of AI agents is directly tied to enterprise operations. "Sort of working" is not acceptable.

In a world where JAPAN AI STUDIO functions as "the brain of the enterprise" — autonomously executing tasks such as approval workflows, resource allocation, and prospect discovery — a wrong AI output means approvals that should have been rejected go through, incorrect staffing decisions are made, and inappropriate customers are approached. For "the brain of the enterprise" to be trusted, a system that scientifically evaluates and guarantees the accuracy, safety, and consistency of generated responses is essential.

Traditional QA engineering has centered on test case design and execution. However, quality assurance for LLM agents demands ML/DS expertise — research and development of evaluation metrics themselves, LLM-as-Judge calibration theory, reward modeling, statistical experimental design, and benchmark design.

JAPAN AI is hiring an AI Quality Scientist to establish "AI Evaluation Science" — the discipline that Apple, Anthropic, Scale AI, and Google DeepMind are pioneering — within the context of Japanese enterprise AI.

Mission

"Science the quality of AI — prove agent reliability through evaluation research and development."

Quantitatively evaluate and improve LLM / AI agent output quality using methods from machine learning, statistics, and psychometrics. Establish "AI Evaluation Science" as a new research discipline within the company — from evaluation metric R&D to production deployment of automated evaluation pipelines — and scientifically guarantee the quality of products used in production by approximately 200 companies.

Role & Expectations

As an AI Quality Scientist, you will lead both the research and implementation aspects of AI agent quality evaluation.

  • Research and develop evaluation metrics — scientifically define "what constitutes quality" through LLM-as-Judge calibration, reward modeling, and benchmark design
  • Design and build automated evaluation pipelines — integrate research outcomes into production CI/CD to deliver scalable quality gates
  • Red teaming and safety verification — automate adversarial testing and build policy compliance verification frameworks
  • Drive quality improvement through statistical experimental design — quantitatively verify the effectiveness of prompt strategies and model changes through A/B tests and significance testing
  • Feed evaluation signals back to research and development teams — build a compound-interest loop for model improvement
  • Ensure the quality of products used in production by ~200 companies through a "science of quality" approach
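
The quality-gate idea above can be made concrete as a small release check: an evaluation run produces aggregate metrics, and the gate fails the build when any metric crosses a threshold. This is only an illustrative sketch; the metric names and thresholds here are hypothetical, not JAPAN AI's actual configuration.

```python
# Hypothetical thresholds; in practice these would live in versioned config.
THRESHOLDS = {"task_success": 0.90, "factuality": 0.95, "harm_rate_max": 0.01}

def quality_gate(metrics: dict) -> list:
    """Return human-readable violations; an empty list means the gate passes."""
    violations = []
    for name in ("task_success", "factuality"):
        if metrics[name] < THRESHOLDS[name]:
            violations.append(f"{name} {metrics[name]:.2f} < {THRESHOLDS[name]}")
    if metrics["harm_rate"] > THRESHOLDS["harm_rate_max"]:
        violations.append(f"harm_rate {metrics['harm_rate']:.3f} > {THRESHOLDS['harm_rate_max']}")
    return violations

# Example eval run: factuality regressed, so the release should be blocked.
run = {"task_success": 0.93, "factuality": 0.91, "harm_rate": 0.004}
for violation in quality_gate(run):
    print("GATE FAIL:", violation)  # a CI wrapper would exit non-zero here
```

In a CI/CD integration, a wrapper script would exit with a non-zero status when the violation list is non-empty, which blocks the release step.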

Why You'll Love This Role

  • Evaluation science in practice: Apply "AI Evaluation Science" — the discipline that Apple, Anthropic, Scale AI, and others are investing in — within the context of Japanese enterprise AI. This is a globally rare position where evaluation methodology itself is the research subject.
  • A new application of ML/DS skills: Apply your machine learning and statistics expertise not to "building models" but to "evaluating models." Intellectual challenges span both research and implementation — reward modeling, LLM-as-Judge calibration theory, and benchmark design.
  • Quality determines product trust: In a production environment used by ~200 companies, the evaluation infrastructure you build becomes the last line of defense for release quality. You will see the direct business impact of quality assurance.
  • Greenfield position: Design and build the entirely new specialized domain of AI agent evaluation science from scratch. You will have significant autonomy — from evaluation metric R&D to production deployment of automated evaluation pipelines.
  • Frontline of AI safety: Engage in Responsible AI practices including automated red teaming, adversarial testing, and policy compliance verification. You will play a key role in scientifically guaranteeing safety in a world where AI agents autonomously execute business operations as "the brain of the enterprise."
  • Rapid-growth environment: In a startup that has grown to 200+ people and 9 products in just 3 years, you will have broad discretion in technical decision-making. You will work closely with Research Engineers and Agent Harness Engineers, influencing quality across the entire product suite.

Job Description

  • Evaluation Metric Research & Development
    • Research and implement LLM-as-Judge calibration methods (rubric design, bias detection, proper scoring rules)
    • Design, build, and validate evaluation benchmarks (construct validity, contamination detection)
    • Research the application of reward modeling / preference learning to evaluation
    • Select and design evaluation metrics (win rate, task success, factuality, harm detection)
    • Design, build, and maintain evaluation sets (synthetic data + real logs)
  • Automated Evaluation Pipeline Design & Development
    • Design and implement scalable automated evaluation pipelines
    • Integrate evaluation pipelines into CI/CD and build quality gates
    • Design agent evaluation harnesses (multi-turn, tool use, long-context support)
    • Ensure reproducibility and reliability of evaluation pipelines
  • Safety & Quality Verification
    • Research and implement automated red teaming (automated adversarial testing)
    • Build safety and policy compliance verification frameworks
    • Research and implement hallucination detection and calibration methods
    • Design and execute prompt / tool regression tests
  • Statistical Analysis & Experimental Design
    • Design and analyze statistical experiments (A/B tests, significance testing)
    • Visualize quality trends and automate regression detection
    • Create quality reports and improvement proposals
    • Feed evaluation signals back to research and development teams
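
As a flavor of the statistical work above, comparing the task-success rates of two prompt variants can be framed as a two-proportion z-test. A minimal pure-Python sketch using the pooled-variance normal approximation; the success counts are invented for illustration:

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)       # pooled success rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))           # two-sided p-value
    return z, p_value

# Hypothetical eval results: prompt A succeeds on 412/500 tasks, prompt B on 376/500.
z, p = two_proportion_z_test(412, 500, 376, 500)
print(f"z = {z:.2f}, p = {p:.4f}")  # reject H0 at alpha = 0.05 if p < 0.05
```

In practice a library routine (e.g. from SciPy or statsmodels) with power analysis and multiple-comparison correction would replace this hand-rolled version.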

Key Results (KR/Metrics)

  • Evaluation coverage rate (test case coverage)
  • Regression detection rate (pre-release quality degradation detection ≥ 95%)
  • Evaluation pipeline execution time (completed within CI/CD)
  • LLM-as-Judge and human evaluation agreement rate
  • False positive / false negative rate
  • Safety incident rate (post-release)
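
The judge/human agreement metric above is usually reported as chance-corrected agreement rather than raw match rate. A minimal sketch of Cohen's kappa over binary pass/fail verdicts; the label arrays are invented for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail verdicts from an LLM judge vs. a human annotator.
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(judge, human):.2f}")  # → kappa = 0.50
```

Raw agreement here is 6/8 = 0.75, but because both raters pass about half the items, chance agreement is 0.50, so kappa credits only the agreement beyond chance.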

Team Structure

The development organization comprises approximately 120 members.
The AI Quality Scientist operates as a dedicated quality assurance function, collaborating closely with:

  • Agentic Product Engineer — Agent feature development
  • Research Engineer — Research and development, model improvement
  • Agent Harness Engineer / Software Engineer (AI Platform) — AI execution infrastructure development
  • Product Manager — Product design and quality requirements definition

You May Be a Good Fit If You

  • Education & Experience
    • Master's degree or higher (or equivalent practical experience) in Computer Science, Machine Learning, Statistics, Mathematics, Physics, Psychometrics, or related fields
    • 3+ years of practical experience as an ML Engineer, Data Scientist, Research Engineer, or in ML/AI evaluation-related roles
  • Technical Skills
    • Deep knowledge of LLM / generative AI evaluation methods (benchmark design, LLM-as-Judge, quantitative output quality measurement, hallucination detection, etc.)
    • Practical knowledge of statistics and experimental design (hypothesis testing, A/B testing, confidence intervals, effect sizes, etc.)
    • Experience building ML / evaluation pipelines in Python
    • Practical experience with machine learning frameworks (PyTorch, JAX, TensorFlow, etc.)
    • Experience designing and implementing evaluation metrics (task-specific metric design beyond precision/recall)
  • Language requirement (at least one of the following):
    • Japanese: Fluent — able to discuss product development without friction
    • English: Business level

Strong Candidates May Also Have

  • Publication experience at top ML/NLP conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
  • Research or implementation experience with reward modeling / preference learning (RLHF, DPO, etc.)
  • Experience with LLM-as-Judge calibration and rubric design
  • Knowledge or experience in AI safety, Responsible AI, and red teaming
  • Experience with benchmark design and validity verification (IRT, construct validity)
  • Experience evaluating multi-agent workflows, tool use, and long-context scenarios
  • Large-scale data processing experience (Spark / BigQuery, etc.)
  • Experience integrating ML / evaluation pipelines into CI/CD
  • Ability to read, comprehend, and reproduce research papers
  • Technical communication ability in English

Tech Stack

  • Languages: Python (evaluation pipelines & analysis), TypeScript / React / Next.js (frontend), NX
  • Evaluation/QA: pytest, LangSmith, Weights & Biases, custom eval frameworks
  • Data: BigQuery, Spark, Pandas
  • Infrastructure: GCP (containers / K8s), Docker, Terraform
  • CI/CD: GitHub Actions
  • Tools: Slack, Confluence, Linear, Google Workspace, GitHub, Notion
  • AI dev support: Claude Code MAX Plan, Cursor, ChatGPT, Devin
  • Work environment: Mac (Apple Silicon), dual monitors available
Position: 【JAPAN AI】AI Quality Scientist / English
Employment type: Full-time (正社員)
Salary
Monthly: ¥571,429–¥1,142,857 (incl. 45h fixed overtime)
Stock options available
Reviews & bonuses: twice/year
OT beyond 45h paid separately
Negotiable based on experience and skills
Work location
  • Sumitomo Fudosan Shinjuku Oak Tower 5F/6F, 6-8-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo 163-6006
 
Learning & Development Support
[AI Tool Usage Support]
Company covers the cost of AI tools such as JAPAN AI SaaS services, Cursor, ChatGPT, Claude, etc.

[Development Tool Support]
If a desired development tool is paid, the cost is covered (up to ¥30,000 per year)

[Book Purchase Assistance]
Company covers the cost of books for learning, such as technical books (up to ¥30,000 per half-year)

[Language Learning / Qualification Support]
Company covers the cost of Japanese or English learning programs and qualification acquisition

[Refresh Allowance]
Company covers the cost of services used for personal refreshment (up to ¥5,000 per month)
e.g., gym, yoga, chiropractic, aquarium, movies, theme park tickets, etc.

[Housing Allowance]
Housing allowance provided for those living in designated areas (up to ¥30,000 per month)
Work Style
Hybrid work: 3 days in office, 2 days remote
Flexible working hours: core time is negotiable
Flexibility: more flexible arrangements may be considered in the future
Hiring Process
1. Application Review
2. Coding Assessment
3. Interviews (4–5 rounds)
4. Offer

A reference check will be conducted prior to the final interview.
Company Information
Company name: 株式会社ジーニー (Geniee, Inc.)
Business lines
・Advertising platform business
・Marketing SaaS business
・Digital PR business
Founded
April 14, 2010
Representative
工藤 智昭, President & CEO
Capital
¥100 million (consolidated, as of end of March 2025)
Employees
877 (consolidated, as of end of March 2025)
Head office
Sumitomo Fudosan Shinjuku Oak Tower 5F/6F, 6-8-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo
Working hours
10:00–19:00
* Saturdays, Sundays, and national holidays are non-working days
* For secondments, the rules of the host company apply
Benefits
[Compensation & Benefits]
<Full-time employees>
・Book purchase subsidy (up to ¥30,000 per half-year)
・Refresh allowance (up to ¥5,000 per month)
・Club activity allowance (up to ¥5,000 per month)
・Housing allowance (up to ¥30,000 per month, for residences near company-designated stations)
・Shuffle lunch/dinner (once per quarter; lunch up to ¥1,000, dinner up to ¥5,000)
・Qualification acquisition support and English learning support (only when required for work)
・Refresh leave (2 days of special leave granted annually to employees with 3 years of continuous service)
・Regular health checkup (once a year)
・Employee stock ownership plan

<Contract employees>
・Book purchase subsidy (up to ¥30,000 per half-year)
・Refresh allowance (up to ¥5,000 per month)
・Club activity allowance (up to ¥5,000 per month)
・Shuffle lunch/dinner (once per quarter; lunch up to ¥1,000, dinner up to ¥5,000)
・Refresh leave (2 days of special leave granted annually to employees with 3 years of continuous service)
・Regular health checkup (once a year)

[Insurance]
・Full social insurance coverage

[Allowances]
・Commuting expenses fully covered
CEO Profile
After completing graduate school at Waseda University, he joined Recruit Co., Ltd. (now Recruit Holdings Co., Ltd.). In April 2010 he founded 株式会社ジーニー (Geniee, Inc.) and became its President & CEO. In April 2023 he established the strategic AI company JAPAN AI株式会社 (JAPAN AI, Inc.) and concurrently serves as its President & CEO.
Company Growth Rankings
■ Recognized in the Financial Times' Asia-Pacific High-Growth Companies 2020 ranking
In a survey conducted by the Financial Times and Statista covering more than 50 million companies across 12 Asia-Pacific countries, Geniee was selected among the 500 companies showing remarkable growth.
Holidays & Leave
Full two-day weekend system
Scheduled days off: Saturdays, Sundays, and national holidays
Leave: annual paid leave, summer leave (3 days), year-end/New Year holidays (December 31 – January 3), congratulatory/condolence leave
Group Companies
CATS株式会社 (Japan)
JAPAN AI株式会社 (Japan)
ソーシャルワイヤー株式会社 (Japan)
Geniee International Pte., Ltd. (Singapore)
Geniee Vietnam Co., Ltd. (Vietnam)
PT. Geniee Technology Indonesia (Indonesia)
PT. Adstars Media Pariwara (Indonesia)
Geniee US Inc. (USA)
Geniee Software India Pvt. Ltd. (India)
GENIEE ADTECH – FZCO (UAE)
Notes
・Probation period
 Full-time / contract employees: 1 month

・Passive smoking measures
 No smoking on premises (outdoor smoking area provided)

・Scope of potential changes to assigned duties
 Duties designated by the company

・Scope of potential changes to work location
 Locations designated by the company

・Criteria for renewal of fixed-term employment contracts (including limits on total contract period or number of renewals)
 No limit on renewals