【JAPAN AI】AI Quality Scientist / English | 株式会社ジーニー

About JAPAN AI

JAPAN AI, Inc. was established in April 2023 as a group company of Geniee, Inc. (TSE Growth Market) with the mission of dramatically expanding human potential through AI technology. We drive cutting-edge AI R&D both domestically and internationally.

Why We're Hiring

The output quality of AI agents is directly tied to enterprise operations. "Sort of working" is not acceptable.

In a world where JAPAN AI STUDIO functions as "the brain of the enterprise" — autonomously executing tasks such as approval workflows, resource allocation, and prospect discovery — a wrong AI output means approvals that should have been rejected go through, incorrect staffing decisions are made, and inappropriate customers are approached. For "the brain of the enterprise" to be trusted, a system that scientifically evaluates and guarantees the accuracy, safety, and consistency of generated responses is essential.

Traditional QA engineering has centered on test case design and execution. However, quality assurance for LLM agents demands ML/DS expertise — research and development of evaluation metrics themselves, LLM-as-Judge calibration theory, reward modeling, statistical experimental design, and benchmark design.

JAPAN AI is hiring an AI Quality Scientist to establish "AI Evaluation Science" — the discipline that Apple, Anthropic, Scale AI, and Google DeepMind are pioneering — within the context of Japanese enterprise AI.

Mission

"Science the quality of AI — prove agent reliability through evaluation research and development."

Quantitatively evaluate and improve LLM / AI agent output quality using methods from machine learning, statistics, and psychometrics. Establish "AI Evaluation Science" as a new research discipline within the company — from evaluation metric R&D to production deployment of automated evaluation pipelines — and scientifically guarantee the quality of products used in production by approximately 200 companies.

Role & Expectations

As an AI Quality Scientist, you will lead both the research and implementation aspects of AI agent quality evaluation.

  • Research and develop evaluation metrics — scientifically define "what constitutes quality" through LLM-as-Judge calibration, reward modeling, and benchmark design
  • Design and build automated evaluation pipelines — integrate research outcomes into production CI/CD to deliver scalable quality gates
  • Red teaming and safety verification — automate adversarial testing and build policy compliance verification frameworks
  • Drive quality improvement through statistical experimental design — quantitatively verify the effectiveness of prompt strategies and model changes through A/B tests and significance testing
  • Feed evaluation signals back to research and development teams — build a compound-interest loop for model improvement
  • Ensure the quality of products used in production by ~200 companies through a "science of quality" approach
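
The quality-gate idea above can be made concrete as a small release check: an evaluation run produces aggregate metrics, and the gate fails the build when any metric crosses a threshold. This is only an illustrative sketch; the metric names and thresholds here are hypothetical, not JAPAN AI's actual configuration.

```python
# Hypothetical thresholds; in practice these would live in versioned config.
THRESHOLDS = {"task_success": 0.90, "factuality": 0.95, "harm_rate_max": 0.01}

def quality_gate(metrics: dict) -> list:
    """Return human-readable violations; an empty list means the gate passes."""
    violations = []
    for name in ("task_success", "factuality"):
        if metrics[name] < THRESHOLDS[name]:
            violations.append(f"{name} {metrics[name]:.2f} < {THRESHOLDS[name]}")
    if metrics["harm_rate"] > THRESHOLDS["harm_rate_max"]:
        violations.append(f"harm_rate {metrics['harm_rate']:.3f} > {THRESHOLDS['harm_rate_max']}")
    return violations

# Example eval run: factuality regressed, so the release should be blocked.
run = {"task_success": 0.93, "factuality": 0.91, "harm_rate": 0.004}
for violation in quality_gate(run):
    print("GATE FAIL:", violation)  # a CI wrapper would exit non-zero here
```

In a CI/CD integration, a wrapper script would exit with a non-zero status when the violation list is non-empty, which blocks the release step.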

Why You'll Love This Role

  • Evaluation science in practice: Apply "AI Evaluation Science" — the discipline that Apple, Anthropic, Scale AI, and others are investing in — within the context of Japanese enterprise AI. This is a globally rare position where evaluation methodology itself is the research subject.
  • A new application of ML/DS skills: Apply your machine learning and statistics expertise not to "building models" but to "evaluating models." Intellectual challenges span both research and implementation — reward modeling, LLM-as-Judge calibration theory, and benchmark design.
  • Quality determines product trust: In a production environment used by ~200 companies, the evaluation infrastructure you build becomes the last line of defense for release quality. You will see the direct business impact of quality assurance.
  • Greenfield position: Design and build the entirely new specialized domain of AI agent evaluation science from scratch. You will have significant autonomy — from evaluation metric R&D to production deployment of automated evaluation pipelines.
  • Frontline of AI safety: Engage in Responsible AI practices including automated red teaming, adversarial testing, and policy compliance verification. You will play a key role in scientifically guaranteeing safety in a world where AI agents autonomously execute business operations as "the brain of the enterprise."
  • Rapid-growth environment: In a startup that has grown to 200+ people and 9 products in just 3 years, you will have broad discretion in technical decision-making. You will work closely with Research Engineers and Agent Harness Engineers, influencing quality across the entire product suite.

Job Description

  • Evaluation Metric Research & Development
    • Research and implement LLM-as-Judge calibration methods (rubric design, bias detection, proper scoring rules)
    • Design, build, and validate evaluation benchmarks (construct validity, contamination detection)
    • Research the application of reward modeling / preference learning to evaluation
    • Select and design evaluation metrics (win rate, task success, factuality, harm detection)
    • Design, build, and maintain evaluation sets (synthetic data + real logs)
  • Automated Evaluation Pipeline Design & Development
    • Design and implement scalable automated evaluation pipelines
    • Integrate evaluation pipelines into CI/CD and build quality gates
    • Design agent evaluation harnesses (multi-turn, tool use, long-context support)
    • Ensure reproducibility and reliability of evaluation pipelines
  • Safety & Quality Verification
    • Research and implement automated red teaming (automated adversarial testing)
    • Build safety and policy compliance verification frameworks
    • Research and implement hallucination detection and calibration methods
    • Design and execute prompt / tool regression tests
  • Statistical Analysis & Experimental Design
    • Design and analyze statistical experiments (A/B tests, significance testing)
    • Visualize quality trends and automate regression detection
    • Create quality reports and improvement proposals
    • Feed evaluation signals back to research and development teams
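
As a flavor of the statistical work above, comparing the task-success rates of two prompt variants can be framed as a two-proportion z-test. A minimal pure-Python sketch using the pooled-variance normal approximation; the success counts are invented for illustration:

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)       # pooled success rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))           # two-sided p-value
    return z, p_value

# Hypothetical eval results: prompt A succeeds on 412/500 tasks, prompt B on 376/500.
z, p = two_proportion_z_test(412, 500, 376, 500)
print(f"z = {z:.2f}, p = {p:.4f}")  # reject H0 at alpha = 0.05 if p < 0.05
```

In practice a library routine (e.g. from SciPy or statsmodels) with power analysis and multiple-comparison correction would replace this hand-rolled version.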

Key Results (KR/Metrics)

  • Evaluation coverage rate (test case coverage)
  • Regression detection rate (pre-release quality degradation detection ≥ 95%)
  • Evaluation pipeline execution time (completed within CI/CD)
  • LLM-as-Judge and human evaluation agreement rate
  • False positive / false negative rate
  • Safety incident rate (post-release)
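
The judge/human agreement metric above is usually reported as chance-corrected agreement rather than raw match rate. A minimal sketch of Cohen's kappa over binary pass/fail verdicts; the label arrays are invented for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail verdicts from an LLM judge vs. a human annotator.
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(judge, human):.2f}")  # → kappa = 0.50
```

Raw agreement here is 6/8 = 0.75, but because both raters pass about half the items, chance agreement is 0.50, so kappa credits only the agreement beyond chance.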

Team Structure

The development organization comprises approximately 120 members.
The AI Quality Scientist operates as a dedicated quality assurance function, collaborating closely with:

  • Agentic Product Engineer — Agent feature development
  • Research Engineer — Research and development, model improvement
  • Agent Harness Engineer / Software Engineer (AI Platform) — AI execution infrastructure development
  • Product Manager — Product design and quality requirements definition

You May Be a Good Fit If You

  • Education & Experience
    • Master's degree or higher (or equivalent practical experience) in Computer Science, Machine Learning, Statistics, Mathematics, Physics, Psychometrics, or related fields
    • 3+ years of practical experience as an ML Engineer, Data Scientist, Research Engineer, or in ML/AI evaluation-related roles
  • Technical Skills
    • Deep knowledge of LLM / generative AI evaluation methods (benchmark design, LLM-as-Judge, quantitative output quality measurement, hallucination detection, etc.)
    • Practical knowledge of statistics and experimental design (hypothesis testing, A/B testing, confidence intervals, effect sizes, etc.)
    • Experience building ML / evaluation pipelines in Python
    • Practical experience with machine learning frameworks (PyTorch, JAX, TensorFlow, etc.)
    • Experience designing and implementing evaluation metrics (task-specific metric design beyond precision/recall)
  • Language requirement (at least one of the following):
    • Japanese: Fluent — able to discuss product development without friction
    • English: Business level

Strong Candidates May Also Have

  • Publication experience at top ML/NLP conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
  • Research or implementation experience with reward modeling / preference learning (RLHF, DPO, etc.)
  • Experience with LLM-as-Judge calibration and rubric design
  • Knowledge or experience in AI safety, Responsible AI, and red teaming
  • Experience with benchmark design and validity verification (IRT, construct validity)
  • Experience evaluating multi-agent workflows, tool use, and long-context scenarios
  • Large-scale data processing experience (Spark / BigQuery, etc.)
  • Experience integrating ML / evaluation pipelines into CI/CD
  • Ability to read, comprehend, and reproduce research papers
  • Technical communication ability in English

Tech Stack

  • Languages: Python (evaluation pipelines & analysis), TypeScript / React / Next.js (frontend), NX
  • Evaluation/QA: pytest, LangSmith, Weights & Biases, custom eval frameworks
  • Data: BigQuery, Spark, Pandas
  • Infrastructure: GCP (containers / K8s), Docker, Terraform
  • CI/CD: GitHub Actions
  • Tools: Slack, Confluence, Linear, Google Workspace, GitHub, Notion
  • AI dev support: Claude Code MAX Plan, Cursor, ChatGPT, Devin
  • Work environment: Mac (Apple Silicon), dual monitors available
Position: 【JAPAN AI】AI Quality Scientist / English
Employment type: Full-time (正社員)
Salary
Monthly: ¥571,429–¥1,142,857 (incl. 45h fixed overtime)
Stock options available
Reviews & bonuses: twice/year
OT beyond 45h paid separately
Negotiable based on experience and skills
Work location
  • Sumitomo Fudosan Shinjuku Oak Tower 5F/6F, 6-8-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo 163-6006
 
Learning & Development Support
[AI Tool Usage Support]
Company covers the cost of AI tools such as JAPAN AI SaaS services, Cursor, ChatGPT, Claude, etc.

[Development Tool Support]
If a desired development tool is paid, the cost is covered (up to ¥30,000 per year)

[Book Purchase Assistance]
Company covers the cost of books for learning, such as technical books (up to ¥30,000 per half-year)

[Language Learning / Qualification Support]
Company covers the cost of Japanese or English learning programs and qualification acquisition

[Refresh Allowance]
Company covers the cost of services used for personal refreshment (up to ¥5,000 per month)
e.g., gym, yoga, chiropractic, aquarium, movies, theme park tickets, etc.

[Housing Allowance]
Housing allowance provided for those living in designated areas (up to ¥30,000 per month)
Work Style
Hybrid work: 3 days in office, 2 days remote
Flexible working hours: core time is negotiable
Flexibility: more flexible arrangements may be considered in the future
Hiring Process
1. Application Review
2. Coding Assessment
3. Interviews (4–5 rounds)
4. Offer

A reference check will be conducted prior to the final interview.
Company Information
Company name: 株式会社ジーニー (Geniee, Inc.)
Business lines
・Advertising platform business
・Marketing SaaS business
・Digital PR business
Founded
April 14, 2010
Representative
工藤 智昭, President & CEO
Capital
¥100 million (consolidated, as of end of March 2025)
Employees
877 (consolidated, as of end of March 2025)
Head office
Sumitomo Fudosan Shinjuku Oak Tower 5F/6F, 6-8-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo
Working hours
10:00–19:00
* Saturdays, Sundays, and national holidays are non-working days
* For secondments, the rules of the host company apply
Benefits
[Compensation & Benefits]
<Full-time employees>
・Book purchase subsidy (up to ¥30,000 per half-year)
・Refresh allowance (up to ¥5,000 per month)
・Club activity allowance (up to ¥5,000 per month)
・Housing allowance (up to ¥30,000 per month, for residences near company-designated stations)
・Shuffle lunch/dinner (once per quarter; lunch up to ¥1,000, dinner up to ¥5,000)
・Qualification acquisition support and English learning support (only when required for work)
・Refresh leave (2 days of special leave granted annually to employees with 3 years of continuous service)
・Regular health checkup (once a year)
・Employee stock ownership plan

<Contract employees>
・Book purchase subsidy (up to ¥30,000 per half-year)
・Refresh allowance (up to ¥5,000 per month)
・Club activity allowance (up to ¥5,000 per month)
・Shuffle lunch/dinner (once per quarter; lunch up to ¥1,000, dinner up to ¥5,000)
・Refresh leave (2 days of special leave granted annually to employees with 3 years of continuous service)
・Regular health checkup (once a year)

[Insurance]
・Full social insurance coverage

[Allowances]
・Commuting expenses fully covered
CEO Profile
After completing graduate school at Waseda University, he joined Recruit Co., Ltd. (now Recruit Holdings Co., Ltd.). In April 2010 he founded 株式会社ジーニー (Geniee, Inc.) and became its President & CEO. In April 2023 he established the strategic AI company JAPAN AI株式会社 (JAPAN AI, Inc.) and concurrently serves as its President & CEO.
Company Growth Rankings
■ Recognized in the Financial Times' Asia-Pacific High-Growth Companies 2020 ranking
In a survey conducted by the Financial Times and Statista covering more than 50 million companies across 12 Asia-Pacific countries, Geniee was selected among the 500 companies showing remarkable growth.
Holidays & Leave
Full two-day weekend system
Scheduled days off: Saturdays, Sundays, and national holidays
Leave: annual paid leave, summer leave (3 days), year-end/New Year holidays (December 31 – January 3), congratulatory/condolence leave
Group Companies
CATS株式会社 (Japan)
JAPAN AI株式会社 (Japan)
ソーシャルワイヤー株式会社 (Japan)
Geniee International Pte., Ltd. (Singapore)
Geniee Vietnam Co., Ltd. (Vietnam)
PT. Geniee Technology Indonesia (Indonesia)
PT. Adstars Media Pariwara (Indonesia)
Geniee US Inc. (USA)
Geniee Software India Pvt. Ltd. (India)
GENIEE ADTECH – FZCO (UAE)
Notes
・Probation period
 Full-time / contract employees: 1 month

・Passive smoking measures
 No smoking on premises (outdoor smoking area provided)

・Scope of potential changes to assigned duties
 Duties designated by the company

・Scope of potential changes to work location
 Locations designated by the company

・Criteria for renewal of fixed-term employment contracts (including limits on total contract period or number of renewals)
 No limit on renewals