
[{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/tags/agent/","section":"tags","summary":"","title":"agent","type":"tags"},{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/tags/ai/","section":"tags","summary":"","title":"ai","type":"tags"},{"content":"I pitted Claude Opus 4.7 via Bedrock API against GPT 5.5 — not as chatbots, but as agentic supervisors. Each was given the same task: write a delegation prompt for a coding agent, dispatch it, review the result, and iterate to correctness. Same codebase. Same delegate model. Same constraints. Only the parent agent changed.\nJudge: DeepSeek V4 Pro (open source, no affiliation with either competitor)\nWhat We Tested # The task was non-trivial: implement Phases 1–3 of a walk-forward backtesting system in my quantlab repository — that\u0026rsquo;s type definitions, Pydantic configuration models, ORM schema changes, and an Alembic database migration. Roughly 800 lines of new code across 15+ files touching a real PostgreSQL database.\nBoth agents ran on opencode, using the same base system prompt. The delegate model in both cases was kimi-for-coding/k2p6.\nThe Harness # The experiment ran inside opencode, an agentic harness that provides the tool-execution environment, safety layers, and session management for both agents. On top of this, I used the delegate-to-oc skill — a delegation workflow that creates isolated git worktrees for subagents, monitors their progress via tmux sessions, and surfaces results for human review.\nBoth agents had access to the same delegate-to-oc skill files containing guardrail instructions (anti-loop rules, credential-discovery checks, retry budgets). The difference was whether they chose to use them.\nThe Results at a Glance # Metric Claude GPT Prompt size 896 lines (split across 2 delegations) 254 lines (single delegation) Review rounds needed 1 (PASS on first try) 4 (FAIL → FAIL → FAIL → PASS) Delegate fix iterations 0 3 Bugs found 1 minor (parent-side fix, 30s) 5 distinct blocking issues Final code quality Clean, correct, tight scope Slightly more complete, better tests, more robust migration Both produced correct, functional code. But the paths they took were wildly different.\nHow Claude Approached It # Claude\u0026rsquo;s approach was meticulous from the start. Before writing a single line of the delegation prompt, it:\nVerified the codebase state — checked that parent_run_id and lookback_days actually existed where the spec claimed they did (they did). Read every reference document — proposal, design doc, spec, tasks — and found a bug in the spec itself: the phase boundary between Phases 3 and 5 was leaky (renaming a column in Phase 3 would break the writer in Phase 5 unless the rename was swept across all files simultaneously). Stood up a local PostgreSQL via docker-compose and applied the current Alembic head, so the delegate would have a real DB for autogenerate. Split the work into two delegations: types+config first, schema+migration second — each independently reviewable. It then wrote a 296-line prompt for Delegation 1 and a 600-line prompt for Delegation 2. These weren\u0026rsquo;t wordy — they were dense. Exact code snippets. Exact operation ordering for the Alembic migration. A Python script to verify DB state before autogenerate. A compound bash verification command with Python assertions checking every column post-upgrade and post-downgrade. Retry budgets (\u0026ldquo;1 bash call, retry once, then BLOCKED\u0026rdquo;). 
A machine-parsable output format.\nBefore dispatching, Claude presented the human with 7 specific design decisions to review — \u0026ldquo;here are the tradeoffs, what do you want me to do?\u0026rdquo; — including whether to rename a function parameter, how to handle a cross-phase dependency, which constraint names to keep, and whether to use a single branch or stacked branches.\nThe delegate returned correct code on the first try. 9 tests passed, 83 existing tests untouched. Claude re-ran everything independently — not trusting the delegate\u0026rsquo;s status report — and verified the migration up/down/up against a real database. It found one minor issue (a dev seed script missed in the rename sweep), fixed it in 30 seconds in the parent session, and amended the commit. Done.\nTotal wall clock: ~20 minutes for both delegations. Zero delegate fix iterations.\nHow GPT Approached It # GPT\u0026rsquo;s approach was faster to launch but costlier at review time. It:\nRead the spec artifacts and checked the codebase.\nBundled all three phases into a single 254-line prompt — types, config, schema, and migration in one shot.\nWrote descriptive instructions (\u0026ldquo;Implement X with Y\u0026rdquo;) rather than exact code templates.\nListed existing files and stale references — useful context that Claude\u0026rsquo;s prompt didn\u0026rsquo;t include.\nAsked the human one design question: whether to slice the work into multiple delegations.\nWhat the prompt didn\u0026rsquo;t include: retry budgets, compound verification commands, DB-state pre-checks, exact operation ordering for the migration, or exact enum values (it said \u0026ldquo;kind VARCHAR(32)\u0026rdquo; instead of showing the exact SAEnum with values_callable).\nThe delegate wrote 15 files of code and committed. Then the review began:\nRound 1 (FAIL): 4 blocking issues. VerdictMetric had wrong enum values (the prompt didn\u0026rsquo;t specify them). BacktestRun.kind used String(32) instead of SAEnum (the prompt said \u0026ldquo;VARCHAR\u0026rdquo;). The migration downgrade was broken (handwritten without a DB). Fold primitives were in the wrong module.\nRound 2 (FAIL): 4 issues fixed, 1 new one. BacktestRun.kind was now SAEnum but storing enum NAMES ('SINGLE') instead of VALUES ('single'). Subtle bug — the prompt never mentioned values_callable.\nRound 3 (FAIL): Fixed the enum binding. Migration up/down/up cycle: the repeated-upgrade path failed because downgrade re-added parent_run_id without FK, and upgrade unconditionally tried to drop the FK. Another real edge case — Claude\u0026rsquo;s prompt prevented this by specifying operation ordering.\nRound 4 (PASS): All 5 issues resolved. Migration clean. Tests green. Enum binding correct.\nTotal wall clock: ~45 minutes across 4 review rounds, 3 fix delegations, and 3 new worktrees.\nWhere They Diverged # 1. Prompt philosophy: blueprint vs sketch # Claude treated the prompt as a blueprint — exact measurements, exact materials, exact verification steps. The delegate was expected to execute precisely and report back in a machine-parsable format.\nGPT treated the prompt as a sketch — here\u0026rsquo;s what we\u0026rsquo;re building, here are the dimensions, fill in the details. Good for a human developer. Risky for an agent with no context.\nThe difference showed in the delegate\u0026rsquo;s output. The enum values were wrong because the prompt didn\u0026rsquo;t specify them. The column type was wrong because the prompt gave a DDL description instead of SQLAlchemy code.
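For readers who haven\u0026rsquo;t hit this particular SQLAlchemy trap, the three outcomes roughly correspond to the sketch below. This is illustrative, not the project\u0026rsquo;s actual model code, and the enum member set is assumed from the review notes:\nimport enum\nfrom sqlalchemy import Column, Enum as SAEnum, String\n\nclass BacktestKind(enum.Enum):\n    SINGLE = \u0026#34;single\u0026#34;  # assumed member set\n    WALK_FORWARD = \u0026#34;walk_forward\u0026#34;\n\nkind = Column(String(32))  # round 1: what \u0026ldquo;kind VARCHAR(32)\u0026rdquo; produced, no enum enforcement\nkind = Column(SAEnum(BacktestKind))  # round 2: enforces the enum but stores member NAMES (\u0026#39;SINGLE\u0026#39;)\nkind = Column(SAEnum(BacktestKind, values_callable=lambda e: [m.value for m in e]))  # round 3 fix: stores VALUES (\u0026#39;single\u0026#39;)\nA single values_callable line in the original prompt would likely have prevented two of those review rounds.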
Every issue in GPT\u0026rsquo;s review #1 was directly traceable to something the prompt didn\u0026rsquo;t say.\n2. Guardrail discipline # Claude pulled anti-loop rules, retry budgets, and compound verification from the delegate-to-oc skill files. GPT wrote a prompt that had guardrails (don\u0026rsquo;t invent credentials, don\u0026rsquo;t touch files outside scope) but missed the structural guardrails that prevent loops and force clean verification.\nThe result: Claude\u0026rsquo;s delegate was locked into \u0026ldquo;1 bash call, retry once, then BLOCKED.\u0026rdquo; GPT\u0026rsquo;s delegate could have looped indefinitely — it didn\u0026rsquo;t, but nothing in the prompt would have stopped it.\n3. Review ownership # When Claude found the seed script bug (4 lines in a file outside the prompt\u0026rsquo;s grep scope), it said: \u0026ldquo;my oversight, not the delegate\u0026rsquo;s\u0026rdquo; — fixed it in the parent session, amended the commit, moved on.\nWhen GPT found 5 blocking issues that all traced back to the prompt being too loose, it never once acknowledged the prompt was the root cause. Every issue was framed as \u0026ldquo;the delegate did X wrong.\u0026rdquo;\nThis isn\u0026rsquo;t about politeness — it\u0026rsquo;s about efficiency. If the parent agent owns its mistakes, it can fix them directly (30 seconds vs another delegation round). If it always blames the delegate, every issue becomes a round-trip.\n4. Communication style # Claude communicated like a collaborator: \u0026ldquo;Here\u0026rsquo;s what I found, here are the decisions I need from you, here\u0026rsquo;s what I suggest.\u0026rdquo; It volunteered the branch name and surfaced 7 design decisions for review before dispatching.\nGPT communicated like a tool: \u0026ldquo;Task received. Executing.\u0026rdquo; The human had to ask for the branch name. Design decisions were implied rather than surfaced.\nBoth styles are valid. But for human trust in an autonomous delegation system, the collaborative style builds confidence that the agent is thinking critically rather than mechanically.\nThe Counterintuitive Finding # Here\u0026rsquo;s the plot twist: despite Claude\u0026rsquo;s objectively cleaner process, GPT\u0026rsquo;s final PR was slightly better.\nGPT\u0026rsquo;s migration handles repeated upgrade→downgrade→upgrade cycles (using DROP CONSTRAINT IF EXISTS — a real edge case). Its constraint names are consistent with the renamed column. It has a dedicated test for verifying that SAEnum stores values, not member names. Its fold primitives are already implemented, making the config layer end-to-end testable.\nHow did this happen? The messy process forced extra scrutiny. Each failed review round surfaced an edge case that Claude\u0026rsquo;s clean process missed — because Claude\u0026rsquo;s delegate got everything right on the first try, the review was faster and less adversarial. GPT\u0026rsquo;s 4-round ordeal created a de facto adversarial review.\nThis suggests an ideal hybrid: precise prompts (Claude\u0026rsquo;s strength) + adversarial review (GPT\u0026rsquo;s accidental strength). Write the prompt as if the delegate will get it right the first time. Review as if you expect it to fail.\nQ\u0026amp;A with the Judge # After reading the results, I sat down with DeepSeek V4 Pro — the independent reviewer — to dig into what this all means.\nMe: The delegate model here was kimi k2.6, which — let\u0026rsquo;s be honest — is not the strongest instruction-follower. 
It\u0026rsquo;s great for the price, but it drifts when things aren\u0026rsquo;t spelled out. Given that, which supervisor style is actually better?\nDeepSeek: For a weaker delegate, Claude\u0026rsquo;s blueprint approach is the clear winner, and the evidence is right in the data. Kimi got everything right on the first try with Claude\u0026rsquo;s prompt — because there was nothing to interpret. When GPT gave it room to fill in blanks, it filled them in wrong: wrong enum values, wrong column type (String instead of SAEnum), wrong module for fold primitives, a broken migration downgrade.\nThe pattern is: a weaker delegate amplifies the gap between the two prompt styles. With a stronger delegate (say, a full Opus or GPT itself), GPT 5.5\u0026rsquo;s descriptive style might work fine — the delegate would infer the right SQLAlchemy pattern from \u0026ldquo;kind VARCHAR(32)\u0026rdquo; because it knows the framework. Kimi doesn\u0026rsquo;t have that depth, so \u0026ldquo;VARCHAR(32)\u0026rdquo; became Column(String(32)) — technically correct by the prompt, semantically wrong for the codebase.\nIf you\u0026rsquo;re using cost-efficient coding models as delegates, you must write like Claude. There is no safe \u0026ldquo;trust the delegate to figure it out\u0026rdquo; mode.\nMe: But I don\u0026rsquo;t want to just pick the winner. Both had strengths. What\u0026rsquo;s the real way to combine their power?\nDeepSeek: They\u0026rsquo;re complementary on different axes, and neither is redundant.\nClaude excels at planning and precision: auditing specs, catching phase-boundary leaks, decomposing work into safe slices, writing guardrail-heavy prompts. It\u0026rsquo;s the architect you want before a delegate ever sees the code.\nGPT 5.5 excels at context inventory and adversarial review: listing existing files and stale references upfront, and — ironically because of its messy process — catching edge cases the clean process misses. Its review rounds #2 and #3 found the SAEnum values vs names distinction and the migration-cycle bug that DROP CONSTRAINT IF EXISTS ultimately fixed. Those are genuine discoveries.\nThe ideal pairing would be something like: Claude plans and writes the delegation prompt. GPT 5.5 reviews the result adversarially. Or, more practically: if you\u0026rsquo;re using Claude as your supervisor, be more adversarial in your own review — don\u0026rsquo;t let a clean first pass make you complacent. If you\u0026rsquo;re using GPT 5.5, front-load more precision into your prompts — don\u0026rsquo;t rely on the review rounds to catch what the prompt should have prevented.\nYou could also split by task type: use Claude for schema/migration work (high stakes, irreversible, needs exact operation ordering), use GPT 5.5 for feature implementation (more room for iteration, benefits from broader context inventory).\nMe: What if the complexity grows? I only tested Phases 1-3 of an 11-phase change. The migration was 197 lines. What happens when it\u0026rsquo;s a 500-line migration touching 10 tables with production data?\nDeepSeek: That\u0026rsquo;s where Claude\u0026rsquo;s approach scales and GPT 5.5\u0026rsquo;s starts to get expensive — fast.\nWith Claude\u0026rsquo;s method, complexity growth translates to a longer prompt, not more review rounds. For a 10-table migration, Claude would write a 2000-line prompt: exact operation ordering, pre-migration state verification per table, conditional rollback paths, a multi-stage compound verification script. The delegate runs it once, passes or BLOCKs.
One round.\nWith GPT 5.5\u0026rsquo;s method, every unit of complexity you don\u0026rsquo;t specify upfront becomes a potential review round. A 10-table migration with ambiguous instructions could produce 10+ bugs discovered one at a time across 5+ rounds. Each round is another delegation, another worktree, another human review. The overhead compounds.\nThe scaling math is: Claude\u0026rsquo;s approach is O(prompt_size) in complexity. GPT 5.5\u0026rsquo;s is O(bug_count) — and bug_count grows with ambiguity × complexity.\nFor Phases 1-3, the gap was 1 round vs 4 rounds. For Phases 1-11, with an under-specified prompt, you could be looking at 15-20 review rounds — that\u0026rsquo;s hours of human review time, not minutes. The blueprint approach doesn\u0026rsquo;t just save time now; it prevents a death march later.\nThat said, I\u0026rsquo;ll offer one nuance: GPT 5.5\u0026rsquo;s continuation behavior — the fact that it doesn\u0026rsquo;t hard-stop at limits — means it can handle a death march if you\u0026rsquo;re willing to pay for it. Claude would force a clean checkpoint. GPT 5.5 would power through. Different failure modes: Claude\u0026rsquo;s is \u0026ldquo;stop and replan,\u0026rdquo; GPT 5.5\u0026rsquo;s is \u0026ldquo;keep going and keep billing.\u0026rdquo;\nMe: So what\u0026rsquo;s your actual recommendation? If I\u0026rsquo;m building something real — not an experiment — which way do I go?\nDeepSeek: If you\u0026rsquo;re building something real and you have to pick one: use Claude\u0026rsquo;s blueprint approach for the prompt. It\u0026rsquo;s the only one that scales safely with complexity. But when you review, channel GPT 5.5\u0026rsquo;s paranoia. Don\u0026rsquo;t let a clean first pass make you trust the output. Re-run every verification yourself. Check for the edge cases the prompt didn\u0026rsquo;t think to prevent.\nAnd if you have both available: use Claude to plan and delegate, use GPT 5.5 as a second reviewer before merge. That\u0026rsquo;s the hybrid this experiment accidentally validated.\nWhat This Means for Agent Delegation # If you\u0026rsquo;re building delegation systems:\nSpend tokens on guardrails, not spec duplication. Every line of exact instruction translates directly into one fewer bug the delegate creates. Your future self (or your reviewer) will thank you.\nAudit the spec, don\u0026rsquo;t just transcribe it. Claude found a leaky phase boundary in the original plan that would have broken the build. GPT didn\u0026rsquo;t. The parent agent\u0026rsquo;s value isn\u0026rsquo;t in writing prompts — it\u0026rsquo;s in catching problems before the delegate ever sees them.\nMachine-parsable output is underrated. Claude\u0026rsquo;s STATUS=DONE\\nPYTEST=92 passed is grep-able and scriptable. GPT\u0026rsquo;s free-form status reports required manual reading every round. At scale, this matters.\nFix in the parent if it\u0026rsquo;s faster. Don\u0026rsquo;t bounce trivial issues back to the delegate just because the process says so. A 30-second parent-side patch is always better than a full round-trip delegation.\nThe process and the artifact are different axes. Clean process usually produces clean artifacts — but not always. 
Be willing to learn from the messy process too.\nTechnical Context # Both agents were asked to implement the following scope from the walk-forward-runner OpenSpec change in the quantlab repository:\nPhase 1 — Types: StrEnum definitions for Verdict, BacktestKind, TrainingType, GridType, Objective, VerdictMetric.\nPhase 2 — Config: Pydantic models for GridPoint, GridSpec (with product/zip expansion), RollingFolds, AnchoredFolds (discriminated union), and WalkForwardConfig.\nPhase 3 — Schema \u0026amp; Migration: Pydantic BacktestMetrics blob, ORM changes (drop parent_run_id, add kind/walk_forward_run_id/metrics_json, rename lookback_days → period_days), new WalkForwardRun and WalkForwardFold ORM models, and a single Alembic migration with clean up/down cycle.\nParent agents: Claude Opus 4.7 (Bedrock, extended thinking) vs GPT 5.5 (high thinking budget)\nDelegate model: kimi-for-coding/k2p6 — identical for both\nJudge: DeepSeek V4 Pro\nPlatform: opencode with identical base system prompt, using the delegate-to-oc skill\nA note on token economics # GPT 5.5\u0026rsquo;s session continued well past normal usage limits without cutting off — it burned through all 4 review rounds without hitting a hard stop. Combined with the round-trip overhead, GPT 5.5\u0026rsquo;s approach consumed roughly 3x the wall clock time, and notably more tokens, to reach a comparable result. For anyone building delegation pipelines at scale, this matters: the \u0026ldquo;continue after limits\u0026rdquo; behavior is generous, but it means you\u0026rsquo;re paying for every extra round of speculation your prompts invite.\nThanks to DeepSeek V4 Pro for serving as the independent code reviewer in this experiment. No models were harmed, though GPT\u0026rsquo;s delegate did write 3 fix commits it probably didn\u0026rsquo;t enjoy.\n","date":"1 May 2026","externalUrl":null,"permalink":"/2026/05/ai-log-1-claude-vs-gpt-a-head-to-head-agent-delegation-experiment/","section":"posts","summary":"I pitted Claude Opus 4.7 via Bedrock API against GPT 5.5 — not as chatbots, but as agentic supervisors. Each was given the same task: write a delegation prompt for a coding agent, dispatch it, review the result, and iterate to correctness. Same codebase. Same delegate model. Same constraints. Only the parent agent changed.\n","title":"AI Log 1: Claude vs GPT — A Head-to-Head Agent Delegation Experiment","type":"post"},{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/categories/ai-log/","section":"categories","summary":"","title":"ai-log","type":"categories"},{"content":"Engineering leader with 9+ years building data platforms, GenAI systems, and the teams that run them. Currently Director of GenAI Software Engineering at Standard Chartered, leading a squad delivering RAG pipelines and agentic workflows inside one of Asia\u0026rsquo;s most regulated institutions. I tend to care as much about how things get shipped as what gets shipped — the process, the culture, the constraints that nobody warned you about.\nWhat I Do # I sit somewhere at the intersection of data engineering, GenAI systems, and engineering leadership. Over the years that\u0026rsquo;s meant writing a lot of Python, designing a lot of pipelines, and more recently spending a meaningful chunk of time on things like how to get an agentic workflow past a compliance review, or how to grow an engineer who\u0026rsquo;s never had a 1:1 before.\nMy technical home is data — lakehouses, pipelines, transformations, the full stack from raw ingestion to analytical surface.
The last couple of years have pulled me toward GenAI infrastructure: RAG, vector databases, LLM orchestration, and the newer question of how you build reliable agentic systems that actually hold up in production.\nOn the leadership side I\u0026rsquo;ve built teams from scratch (3 to 11 engineers at ADI), run pre-sales architecture discussions, designed delivery SOPs, and spent a lot of time just trying to make sure people have what they need to do their best work.\nExperience # Director, GenAI Software Engineering · Standard Chartered Singapore Dec 2025 – Present\nLeading Team Vectors — a 9-person squad building GenAI-powered data solutions inside a global bank. The technical work covers RAG pipelines, vector-database preparation, and durable agentic workflows via Temporal.io, deployed across both on-prem and Databricks cloud environments.\nWhat makes this interesting (and occasionally difficult) is the gap between what\u0026rsquo;s architecturally elegant and what can realistically reach production given data security boundaries, governance approval paths, and shared infrastructure at varying maturity levels. A lot of my job is navigating that gap without losing momentum — or relationships.\nAlso mandated and operationalised AI-assisted coding across the team in a compliance-compatible way. The team has fully transitioned to spec-driven development and agentic engineering practices. Built and open-sourced agentic-beacon as part of that culture work.\nData Engineering Lead (Consultant) · Asian Development Bank Jun 2025 – Nov 2025\nTechnical lead on ADB\u0026rsquo;s GenAI data platform, with a team of 4. The flagship deliverable was a RAG application over a 200,000+ document knowledge base covering multilateral development research and policy content, with daily ingestion growth.\nAdapted an architecture previously battle-tested in commercial engagements (Airflow, Databricks, MLFlow, Terraform) to ADB\u0026rsquo;s requirements as a multilateral institution. Also contributed to architectural design of ADB\u0026rsquo;s new credit analysis platform before the engagement concluded.\nHead, Data Engineering · Aboitiz Data Innovation (ADI) Apr 2024 – Jun 2025\nExpanded scope to pre-sales technical leadership alongside running the engineering function. Attended pre-sales meetings as the senior technical voice, providing architecture estimates and the kind of credibility that helps close contracts.\nGrew the function to 11 engineers across three teams: MLE delivery, DE delivery, and an internal platform/infra/products team.\nLead, Data Engineering · Aboitiz Data Innovation (ADI) Feb 2023 – Apr 2024\nScaled a 3-person team to 10 in 12 months, while simultaneously delivering commercial data projects and building reusable internal products. The headline product was a white-label Data Platform — fully Terraformed, reproducible across AWS and Azure, with configurable security and data boundaries per client.\nStandardised delivery lifecycles for both DE and MLE engagements: SOPs, scoping templates, handoff workflows. The kind of work that\u0026rsquo;s invisible when it\u0026rsquo;s done well.\nPod Lead, Data Platform · Xendit Sep 2021 – Jan 2023\nLed the self-serve data platform pod at a high-growth fintech. 
Maintained a 1.5 PB lakehouse serving ~300 internal consumers, oversaw 1,500+ batch and 30+ streaming pipelines, and drove the Delta Lake migration from a Parquet-based lake.\nDesigned a YAML-based centralised pipeline configuration repo integrated with Airflow — making pipeline additions and removals fully self-serviceable. Introduced semantic versioning, end-to-end test frameworks, poetry, and Airflow 2 in distributed setup.\nData Team Lead · Real Estate Analytics (8prop) May 2020 – Sep 2021\nBuilt the full data stack from scratch for a proptech startup across SG, MY, and HK — Airflow + dbt scheduler, a distributed web crawler ingesting 50+ data sources, and 60+ entity models in Postgres. Team of 5 covering all data engineering, ops, and analytics.\nData Engineer · Healint Aug 2018 – May 2020\nFull-stack data engineering in health-tech. Migrated a legacy Postgres warehouse to Redshift, built and maintained ~40 Airflow batch DAGs and 2 streaming pipelines, and shipped 4 microservices end-to-end including 2 data science products. First real taste of owning the full lifecycle from infra to production.\nData Engineer · Aly Pte. Ltd. (Spiking) Jul 2017 – Apr 2018\nFirst proper engineering job. ETL pipelines, database optimisation, and a self-learning NLP keyword extractor for stock news. Learned a lot about what good (and bad) data infrastructure looks like by inheriting some of each.\nTech I\u0026rsquo;ve Worked With # AI / GenAI — RAG architecture, vector databases, LLM orchestration, MLFlow, Temporal.io, embedding pipelines, agentic engineering, spec-driven development\nData Engineering — Apache Spark, Airflow, dbt, Kafka, Delta Lake, Trino, Hive, Databricks, Redshift, Postgres\nCloud \u0026amp; Infra — AWS (Fargate, Redshift, S3, Kinesis, EC2, ELB, VPC), Azure, Terraform, Kubernetes, Docker, CI/CD\nBackend — Python, SQL, microservices, REST APIs, FastAPI, SQLAlchemy, event-driven architecture\nEducation # Bachelor of Science, Business Analytics · National University of Singapore, School of Computing 2014 – 2018 · Science \u0026amp; Technology Scholarship\nElsewhere # You should only notice me when something is broken — but if you want to reach out, GitHub is the best place to find me.\n","date":"1 May 2026","externalUrl":null,"permalink":"/career/","section":"Shadowsong's Personal Website","summary":"","title":"Career","type":"page"},{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/categories/","section":"categories","summary":"","title":"categories","type":"categories"},{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/tags/llm/","section":"tags","summary":"","title":"llm","type":"tags"},{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/post/","section":"posts","summary":"","title":"posts","type":"post"},{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/","section":"Shadowsong's Personal Website","summary":"","title":"Shadowsong's Personal Website","type":"page"},{"content":"","date":"1 May 2026","externalUrl":null,"permalink":"/tags/","section":"tags","summary":"","title":"tags","type":"tags"},{"content":"This article talks about several techniques I used when I code. It is not meant for direct adoption by readers because we might be writing different types of code. I was focusing mostly on Data Engineering and Software engineering (backend) code. 
If you are a data scientist, take this with a pinch of salt.\nSome context on my stack:\nTool - Cline\nModels - Gemini 2.5 Pro for Planning, Claude 3.7 for Act\nMCP - sequentialthinking, context7, mcp-clickhouse, because I have set up my homelab warehouse with ClickHouse. There might be other choices out there, but ClickHouse is a pretty popular open-sourced data warehouse\nCustom instructions: Cline memory bank with modifications for my usual tasks. E.g. to run dbt tasks, I have a dedicated section in the custom instructions to set up the dbt execution environment when opening a new shell:\n# DBT Execution\nWhen I am executing DBT commands, I will always prepare the new bash shell session by running:\n1. `source .venv/bin/activate` to activate the virtualenv\n2. cd into the `/Users/shadowsong/Code/asp/asp-core/dbt` folder. This is my working directory for DBT commands. I will always execute DBT commands under this directory.\n3. run `export DBT_TARGET=dev`\n4. run `export DBT_PROJECT_DIR=\u0026#34;/Users/shadowsong/Code/asp/asp-core/dbt/asp\u0026#34;`\nOnce finished with all these, run `dbt debug` to ensure everything is set up correctly. This prepares the environment for executing DBT commands correctly.\nI also use .clinerules to encode several coding practices.\n#1: Be super explicit in custom instructions, and normally explicit in standard prompts # I think we kinda expect the model nowadays to understand our vague instructions. However, when it comes to custom instructions, it’s better to be super explicit. Take the dbt custom instruction above: I originally started with just “cd into the asp folder”, which the model might execute several times and end up in a completely different directory. In fact, it has become my habit (though probably unnecessary as models become more intelligent) when writing prompts to always tell the model where the file is and where I want to change it, as much as possible. I don\u0026rsquo;t think it\u0026rsquo;s always needed for standard prompts.\nFor standard prompts, I will always try to balance my laziness with explicit context, especially when reviewing an update.\nPoint-form replies help shape my thoughts just as if I were writing a PR review, without the need to add euphemisms or provide reasons.\ne.g.\n1. rename the classname to `xxx` (here you usually don\u0026#39;t have to specify which class to rename from)\n2. abstract the recursion logic into a new method called `_apply_something` within the same class\n#2: Think critically, do not happily accept everything # One thing we need to be clear about: AI is here to help us deliver, but we are ultimately held responsible if we produce code that smells. So be sure to review every analysis the AI produces while you are waiting; it will help you discover potential issues with the current execution direction before it’s too late, and sometimes even new knowledge (but obviously, please fact-check this with other sources).\nIf the updates are sometimes too massive to review, you should instruct the tool / model to produce fewer updates per run (I am used to big PRs, sadly).\nGoing to check some Instagram or Xiaohongshu (小红书) while you wait is theoretically not good, but if you are having a stressful day, hey, why not? Mental health matters.\n#3: Avoid multiple tasks per task, unless super small # I notice the performance of the model (or at least Claude 3.7) degrades drastically (and sometimes it starts duplicating responses) if the context is too big.
This may seem like an obvious rule, but sometimes we get too carried away while chatting happily with our AI friend and watching it spit out shiny garbage into our project.\n#4: Actively shape how you want the model to behave # If you do not like the generic try-except code that our AI friend generates, or the massive argparse blocks when in fact you want it to use click or typer, state it and write it down in the rule files. You need to do this mindfully and proactively, as if you are teaching a student. The model will not know this otherwise, and this is another important thing I noticed during my usage. If I am lazy about it, I will get the same crap response again some time in the future, so when you notice it, fix it right away.\n#5: Know exactly when to think, and when to delegate # The first four points are mostly related to providing context (”Context Engineering” is indeed an appropriate term). The last note I want to mention is more about how to keep ourselves afloat in today’s vibe coding environment and grow.\nTo me, it’s such a good time to increase my daily coding time, because AI helps me automate things I am not fond of - time-consuming yet crucially important work: unit tests, e2e tests, generating mock data, remembering syntax, algorithms that I studied 7 years ago. I used to spend a huge chunk of time on all of these out of professionalism, but they are definitely not my favourite code to write. I have done them manually a few times, and I know they DO NOT contribute much to my personal growth anymore; I was only doing them out of necessity. Thus, I knowingly delegate that portion of the code to AI.\nHowever, if you have never written any recursion in a real work context (like parsing the struct schema in Spark), I would recommend doing it at least once to understand it, and then delegating to AI. This is something I cannot stress enough: AI is not a magical box; you determine its upper limit. There are certain things you should experience and grasp yourself.\nFrom a different perspective, sometimes AI is just not smart enough yet to perfectly execute your requests. It might very well take 2 - 5 more carefully constructed prompts for it to understand the exact needs you have. In this case, please do it manually if you can finish it within 20s, instead of chatting with and instructing the AI for 5 mins; focus on your objective, which is to deliver. I am sure the people who spend millions poaching AI talent from each other can do a better job of upskilling our AI friend than we can in those 5 mins.\nConclusion # These are not meant to be prescriptive but rather what worked for me. As AI continues to evolve (and probably at a faster rate than we can imagine), we need to evolve our approaches too. The balance between delegating and thinking is fluid - what works today might be obsolete tomorrow. But I think the essence stays the same - be mindful, be critical, and most importantly, focus on growing your skills while leveraging AI to handle the stuff you\u0026rsquo;ve mastered but don\u0026rsquo;t particularly enjoy. After all, we\u0026rsquo;re still the ones responsible for the code that goes into production, so let\u0026rsquo;s make sure it doesn\u0026rsquo;t smell too bad.\n","date":"9 July 2025","externalUrl":null,"permalink":"/2025/07/rt-log-8-5-vibe-coding-techniques/","section":"posts","summary":"This article talks about several techniques I used when I code. It is not meant for direct adoption by readers because we might be writing different types of code.
I was focusing mostly on Data Engineering and Software engineering (backend) code. If you are a data scientist, take this with a pinch of salt.\n","title":"RT Log 8: 5 Vibe Coding techniques","type":"post"},{"content":"","date":"9 July 2025","externalUrl":null,"permalink":"/categories/rt-log/","section":"categories","summary":"","title":"rt-log","type":"categories"},{"content":"","date":"28 December 2024","externalUrl":null,"permalink":"/categories/data/","section":"categories","summary":"","title":"data","type":"categories"},{"content":"","date":"28 December 2024","externalUrl":null,"permalink":"/categories/data-engineering/","section":"categories","summary":"","title":"data-engineering","type":"categories"},{"content":"Yet another take on poor man’s lakehouse / or how to NOT contribute to Databricks\u0026rsquo; IPO success even though it seems inevitable.\nOverview # Recently, I asked a friend to help me acquire a NAS server with the requirement that it could run extra services. That was about three months ago; in the past few months, I’ve been learning the fundamentals of navigating the TrueNAS Scale system. It wasn’t until I discovered jailmaker that I could start building something concrete. Staying true to typical human behavior, I chose to build a lakehouse since, well, I’m a data engineer, and I wanted to use this opportunity to play around with all the popular tools I’ve heard about in the past few years but have yet to try during my day job.\nIn this blog, I’ll cover the setup of a poor man’s lakehouse, a solution that is slightly more mature than a POC setup but definitely requires further polishing if you wish to run it as the main lakehouse for your startup. This won’t be a strict step-by-step tutorial, but I’ll outline as many technical details as possible to make it reproducible. My immediate goal is to set up a decent lakehouse to host the financial data I collect from the internet, which will help with my options trading.\nWhat Does Poor Man’s Mean? # As the name suggests, it should be cheap upfront (not so much in human effort, but that’s beyond the scope of this discussion). You literally pay nothing for software licenses by using open-sourced projects as much as possible. If you already own hardware to host it, you’re in luck! However, you’ll need to invest a lot of time making things work. This is why I wouldn’t recommend this unless you’re really into the topic and doing it as a hobby (like me) or just looking for a challenge.\nIdeally you have an unused computer collecting dust somewhere in your house, with a decent amount of RAM and a solid CPU (you need to perform data transformations, after all). In my case I have the NAS built with 128GB of RAM, so I could use it for more than just a NAS server.\nWhy Extend it Slightly Beyond Just a POC? # A POC alone would work for my homelab use case: there are countless articles that demo the integration of DuckDB and DBT, or DuckDB and Delta. However, as a practitioner who actually has a day job as a data engineer, I need to try things myself to make a sound judgment. Blindly following online articles is like putting faith in AWS\u0026rsquo;s Redshift benchmarks against Databricks — optimistic at best. Additionally, I wanted to see if DuckDB could function as a pure compute engine, similar to Spark. If successful, it might even replace Spark in future projects — saving tons of money on paper (and earning some of that all-important visibility that everyone seems to chase, for better or worse 🙂).\nWhy a Lakehouse?
# The trend is clear: with the decoupling of compute and storage and the rise of Open Table Formats, I personally don\u0026rsquo;t see a general need for a traditional, batch-focused data warehouse anymore — except maybe for legacy compatibility and for ultra-low latency queries. If I were to start a new company, a Lakehouse would be the safer choice in most cases. This might sound a bit extreme and narrow-minded, but my work primarily involves building new data platforms for teams or transforming/migrating existing ones. So, when I decided to build an analytical platform for personal use, I chose the lakehouse approach with a different stack to see if I could build something similar to my profession-grade stack entirely with open-source tools.\nFor paid work, though, I’d still go for Databricks — gotta keep the team sane and weekends free.\nEnough BS, let’s dive into the details.\nDesign # High Level Architecture # First let’s talk about the high-level design, and to repeat our goal: to set up a lakehouse architecture that decouples compute from storage, using open-sourced tools to keep costs low and flexibility high, so that if we want to migrate to something like a cloud setup, it would be less painful.\nI picked the following tools:\nDBT for data transformation\nDuckDB as a pure compute engine (think of it as your \u0026ldquo;query engine\u0026rdquo;)\nNAS file system and MinIO as the storage layer\nThis setup will allow me to work with large datasets without needing a traditional data warehouse or the complexity of Spark clusters (unless your dataset is truly biggy big, then perhaps you need Spark after all).\nIn the next sections, I’ll dive into how each of these components fits into the architecture.\nDecoupled Compute and Storage # One of the key design goals of this architecture is to decouple compute and storage. This means we want to use DuckDB purely as a compute engine, while the actual data storage will be handled separately through MinIO (or potentially other object stores). This is achieved via External Tables. Though frankly speaking, it’s a bit hard to get DuckDB to work with external tables via dbt the way we are used to on Databricks + Spark.\nWhy External Tables? # The main reason here is to allow concurrent reads of the actual data from different systems, thereby bypassing DuckDB\u0026rsquo;s single-player constraint. With external tables, data can be accessed from various platforms (like Trino) without being tied to a specific storage format or location. It also allows for avoiding certain constraints, like file locks, which can happen when reading from a DuckDB file directly (e.g., when using tools like DataGrip, which will hold the file lock even if you specify the mode to be read-only, thus preventing any writing from the dbt side).\nHow it Works (data flow) # Let’s assume we are using the standard medallion architecture. You have a landing layer before the bronze layer, followed by silver and gold.\nNote: there is a limitation in that dbt-duckdb does not yet support writing Delta Tables, or any Open Table Format. Hence, writes from dbt are all done as Parquet external tables; this also means that incremental writes via the MERGE command are not available either.
There are interesting workarounds online that talk about persisting models at the partition level to achieve incremental loads, but for my use case I chose to FULL RELOAD everything every day.\nYour Python scripts land the data into the landing layer in Delta Table format (pandas / polars → Delta)\nLanding → Bronze: Read: dbt will instruct DuckDB to read data from the external Delta table with the dbt-duckdb-delta plugin. Write: dbt will instruct DuckDB to write into the bronze layer as an external Parquet table\nBronze onwards: Read + Write: done via standard external Parquet tables\nI am using Delta because I am more familiar with it, but I am eager to migrate to other formats once they are supported by DuckDB.\nDistributed Read: DuckDB File as a Catalog # If you use dbt-duckdb as intended / in single-player mode, you will have to hold on to that single connection during your write process, which prevents you from reading using standard GUI tools (I have tried it via DataGrip and Metabase; both fail).\nSince we are using DuckDB external tables, the data is actually not saved in the .duckdb file, so we can easily duplicate the DuckDB file and treat it as a data catalog. This is inspired by this article. The idea is: if I duplicate the catalog file after each write on the dbt side, and DataGrip (or whatever reading client you are using) reads from this duplicated .duckdb file, we will have no problem accessing the data while further dbt transformations execute concurrently. The actual .duckdb file is used only for registering the table namespaces, and the actual reading of data happens between the compute engine and the storage layer, so the file is super small and portable.\nObviously this is a pretty simplistic way of expanding the default concurrency support of native DuckDB. Enabling distributed writes (writing from multiple containers / hosts to the same storage, and to the same DuckDB catalog) requires more work and is much more complicated; so far I don’t think it’s possible without writing a custom application / service layer to manage a pool of DuckDB files / locks. Personally, I think DuckDB has the potential to become the de facto production-ready Small Data compute engine. I have not really checked MotherDuck in detail, but probably that’s what they are doing.\nRegarding catalogs, there are indeed the newly open-sourced UC and Iceberg catalogs; however, none seems to work with DuckDB perfectly yet. And given my use case, I am more than OK with a crippled distributed-read setup and no distributed write at all (meaning all Airflow tasks will need to be executed sequentially).\nMinIO for distributed write (Optional) # I have mentioned above that I am more than happy to accept sequential writes, but I still want the setup to be semi-distributed (sequential writes from a few hosts). For that, instead of relying on the native Linux file system I have on my NAS server, I turned to MinIO. Reasons being:\nyou have an easier codebase to maintain - you do not have to manage directories with mkdirs all the time, and permissions are far easier than SMB / NFS shares\nable to share across different hosts if you ever have to. In my case, even though it’s a homelab project, I have set aside one jail for development (for when I’m away from home and on mobile data), and one jail for production.
Enabling write to the same storage layer from both hosts will be extremely convenient\neasier to migrate to cloud if you ever have to later\nyou actually get a UI for managing your files, like what you get in S3\nMinIO setup is somewhat simple if you are using docker-compose, thus it’s worth the effort to use it\nHowever, I did have some issues with MinIO SSL when integrating with DuckDB and dbt. This indeed presents a blocker for real production usage, and thus I have disabled all SSL in both MinIO and the Delta table reads. I hope to revisit this, and that it will be solved by others in the future.\nSome other thoughts # Some other features I have not tested but think are possible:\nIncremental loading / merge into: despite being a little too far-fetched for my current use case, I would definitely revisit this when DuckDB enables real external writing capabilities with any one of the existing Open Table Formats. One potential workaround is to generate dbt models at the partition level, essentially going back towards the old Hive-style data management for Delta. It’s a bit backward, but I think it will work just fine with proper orchestrator logic\nFile compaction and other Delta-specific features like liquid clustering - a bit too early for my use case, but I think I will need them eventually, and they should not be very hard to implement\nIf I really, really need a fully working, production-ready lakehouse that enables distributed write and read, I can still turn to self-hosting a Spark cluster, which sounds really annoying, but I think it’s not that hard anymore in today’s landscape.\nIntegration snippets # In this section, I will walk through some integration configuration to provide a starting point for anyone wishing to replicate my setup.\nWrite from Python memory to Delta Tables in MinIO # Use the write_deltalake method from the deltalake package together with the storage options.\nThe documentation for these key-value pairs is very hard to locate online; I spent some time making MinIO work with the setup below.\ne.g.\nfrom deltalake import write_deltalake\n\nstorage_options = {\n    \u0026#34;AWS_ENDPOINT_URL\u0026#34;: config.remote_storage.STORAGE_ENDPOINT_URL,\n    \u0026#34;AWS_ACCESS_KEY_ID\u0026#34;: config.remote_storage.STORAGE_ACCESS_KEY_ID,\n    \u0026#34;AWS_SECRET_ACCESS_KEY\u0026#34;: config.remote_storage.STORAGE_SECRET_ACCESS_KEY,\n    \u0026#34;AWS_ALLOW_HTTP\u0026#34;: \u0026#34;true\u0026#34;, # Required\n    \u0026#34;AWS_S3_ALLOW_UNSAFE_RENAME\u0026#34;: \u0026#34;true\u0026#34;,\n}\nwrite_deltalake(\n    table_or_uri=\u0026lt;delta_table_dir\u0026gt;,\n    data=\u0026lt;your_pyarrow_table_obj\u0026gt;,\n    mode=\u0026#34;overwrite\u0026#34;,\n    schema_mode=\u0026#34;overwrite\u0026#34;,\n    storage_options=storage_options\n)\nDBT setup # There are mainly 3 things to consider: the profile.yaml, which ensures successful authentication to MinIO; and the dbt_project.yaml plus the macro for external_location, which are crucial for the automatic creation of external tables.\nIn your profile.yaml, you should declare plugins with delta and specify the secrets; these correspond to the s3 secrets in the DuckDB setup. If you choose to use Iceberg, find the corresponding extension in the dbt plugins.\nAnother thing to note here is the path: your .duckdb file has to be on your local file system. This is essentially the catalog of your current dbt session.
I don’t see an obvious way to make this accessible / writable in a remote sense.\nNow for sequential writes, you can have a single host pointing towards a single .duckdb file, and then duplicate it later for distributed reads.\nproject_name:\n  target: \u0026#34;{{ env_var(\u0026#39;DBT_TARGET\u0026#39;) }}\u0026#34;\n  outputs:\n    dev:\n      type: duckdb\n      path: \u0026#34;{{ env_var(\u0026#39;PROJECT_ROOT\u0026#39;) }}/dbt.duckdb\u0026#34;\n      plugins:\n        - module: delta\n      threads: 2\n      extensions:\n        - httpfs\n        - parquet\n      secrets:\n        - type: s3\n          region: ap-southeast-1\n          key_id: \u0026#39;\u0026lt;enter your key id here, pass in with env var\u0026gt;\u0026#39;\n          secret: \u0026#39;\u0026lt;enter your secret here, pass in with env var\u0026gt;\u0026#39;\n          endpoint: \u0026#34;xxx.xxx.xxx.xxx:port\u0026#34;\n          url_style: \u0026#39;path\u0026#39;\n          use_ssl: \u0026#39;false\u0026#39; # adjust for your case\n      schema: dev\nTypically you will need to declare the external_location manually for each model to ensure your model is written as an external table. However, this can be quite tedious, and we can very well utilise dbt jinja syntax to make it less painful.\nIn your dbt_project.yaml, declare the general model config as the following:\nmodels:\n  finget:\n    +external_root: \u0026#34;{{ env_var(\u0026#39;EXTERNAL_ROOT\u0026#39;) }}\u0026#34;\n    silver:\n      +materialized: external\n      +mode: \u0026#34;overwrite\u0026#34;\n      +schema: \u0026#34;silver\u0026#34;\n    bronze:\n      +materialized: external\n      +mode: \u0026#34;overwrite\u0026#34;\n      +schema: \u0026#34;bronze\u0026#34;\nAnd in your macro folder, override the original duckdb external location macro to be:\n{% macro external_location(relation, config) %}\n{%- set default_schema = target.schema -%}\n{%- set location_root = config.get(\u0026#39;external_root\u0026#39;, validator=validation.any[basestring]) -%}\n{%- set identifier = model[\u0026#39;alias\u0026#39;] -%}\n{%- set fqn = model[\u0026#39;fqn\u0026#39;] -%}\n{%- set pkg_name = model[\u0026#39;package_name\u0026#39;] -%}\n{%- set format = config.get(\u0026#39;format\u0026#39;, \u0026#39;parquet\u0026#39;) -%}\n{%- set model_layer = fqn[1] -%}\n{%- if location_root is not none %}\n  {%- if config.get(\u0026#39;options\u0026#39;, {}).get(\u0026#39;partition_by\u0026#39;) is none -%}\n    {%- set identifier_ending = identifier ~ \u0026#39;.\u0026#39; ~ format -%}\n  {%- else -%}\n    {%- set identifier_ending = identifier -%}\n  {%- endif -%}\n  {%- if default_schema == \u0026#39;prod\u0026#39; -%}\n    {%- set location = location_root ~ \u0026#39;main\u0026#39; ~ \u0026#39;/\u0026#39; ~ model_layer ~ \u0026#39;/\u0026#39; ~ identifier_ending -%}\n  {%- else -%}\n    {%- set location = location_root ~ \u0026#39;main\u0026#39; ~ \u0026#39;/\u0026#39; ~ model_layer ~ \u0026#39;_\u0026#39; ~ default_schema ~ \u0026#39;/\u0026#39; ~ identifier_ending -%}\n  {%- endif -%}\n{%- endif -%}\n{{ log(\u0026#39;Writing the data files to \u0026#39; ~ location, True) }}\n{{ return(location) }}\n{%- endmacro -%}\nEssentially, once we have set the location_root / external_root to something like s3://bucket-name/ or a local path like /mnt/data/, then for each of your models declared in the external materialization mode, the actual path will be generated automatically based on the environment and the model name, thus saving you the hassle of declaring it inside every model.
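To see the macro in action, consider a hypothetical bronze model; the file name is made up for illustration, and it reads from the staging source declared just below:\n-- models/bronze/prices_daily.sql\nselect * from {{ source(\u0026#39;staging\u0026#39;, \u0026#39;table_1\u0026#39;) }}\nWith EXTERNAL_ROOT set to s3://bucket-name/ and the dev target from the profile above, the macro resolves the write location to s3://bucket-name/main/bronze_dev/prices_daily.parquet; under the prod schema it becomes s3://bucket-name/main/bronze/prices_daily.parquet, with no per-model config needed.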
At this point, all transformations (dbt build commands) will function properly.\nAnd lastly, for constructing the dbt sources, it took me a while to realise you need to include the actual scan function in the declaration, like below (note the usage of delta_scan, which directly invokes the duckdb-delta extension):\nsources:\n  - name: staging\n    tables:\n      - name: table_1\n  - name: staging_2\n    tables:\n      - name: table_2\n        meta:\n          external_location: \u0026#34;delta_scan(\u0026#39;s3://bucket/landing/xxx/table_2\u0026#39;)\u0026#34;\nA couple of limitations to note here:\nthis only supports writing standard data formats such as Parquet, CSV etc.; the code example uses Parquet. This is mentioned above as part of the writing limitation of DuckDB\nI have not tried writing real partitions with DuckDB. It is supported, but I have no idea whether it\u0026rsquo;s possible to integrate with this setup.\nConclusion # This blog provides a potential \u0026ldquo;low effort\u0026rdquo; way of extending DuckDB read capabilities beyond single-player mode; reads are typically a higher priority than distributed writes in a homelab setup.\nThere are several workarounds presented online for distributed writes; most designs have already been highlighted in the official doc here.\nHope you find this helpful and informative.\n","date":"28 December 2024","externalUrl":null,"permalink":"/2024/12/de-log-9-extend-duckdb-beyond-the-single-player-mode-by-just-a-little-bit/","section":"posts","summary":"Yet another take on poor man’s lakehouse / or how to NOT contribute to Databricks’ IPO success even though it seems inevitable.\n","title":"DE Log 9: Extend duckdb beyond the single player mode, by just a little bit","type":"post"},{"content":"I\u0026rsquo;ve been using ChatGPT more and more frequently lately, and it has given me a lot to think about. Here I\u0026rsquo;ll briefly share some recent reflections, especially on how it assists self-learning; if this gives readers some inspiration, all the better.\nSome personal background # Before starting, I want to roughly describe how I learn. From my second year of university, my way of learning began to shift from top-down and systematic to on-demand learning and on-demand expansion. The former is probably the most common way to learn: you go to school, the school has a syllabus that has been optimised and iterated over and over, you study step by step, and after completing one milestone after another you naturally master the material. The latter, as I describe it, is more utilitarian, and it is entirely the result of my first internship. The positive feedback of getting something done, or of writing a complete solution in code, strongly drives me to keep digging into that piece of knowledge.\nHaving been a reasonably \u0026ldquo;good\u0026rdquo; student since childhood, I realised one thing quite early: the positive feedback from learning affects me enormously. In other words, my learning is very much benefit-driven. A few examples: as a kid doing chemistry competition papers, I did more than eighty sets in a month or two, yet nowadays my fondness for chemistry is not that strong. Don\u0026rsquo;t get me wrong, it\u0026rsquo;s not that there is none, just not that strong. Likewise, after working for this long I have kept procrastinating on learning frontend or low-level architecture topics, because the payoff they bring to my work is too low. Come to think of it, luckily I didn\u0026rsquo;t switch from BA to CS back then, or I might have had trouble graduating.\nHaving said all this, what does it have to do with ChatGPT?\nRapid positive feedback # As I said, my way of learning relies heavily on positive feedback, so my channels for acquiring knowledge, from most hated to most loved, are:\nwatching videos / attending lectures (hence I often skipped classes in university)\nreading books\nreading documentation and blogs (faster and more direct than books)\nI have a question, you answer it, and you give me examples\nYou can probably tell by now that, haha, number 4 is exactly ChatGPT. What I want to stress here is that ChatGPT doesn\u0026rsquo;t necessarily make my learning more efficient (it actually probably does, but that\u0026rsquo;s not the point here); it makes the step-by-step positive feedback in my learning process infinitely large. For example, when reading documentation before, I could only follow its getting-started guide to build something fixed and preset. With ChatGPT, I can easily ask it to do the thing I actually want to do, and in the process keep experimenting to discover its mistakes, consolidating my own knowledge along the way.\nEven if this positive feedback is illusory, or doesn\u0026rsquo;t actually improve my learning efficiency, I think the benefits still outweigh the costs. At the very least it finally pushed me to learn Terraform today (definitely not because someone on my team is leaving).\nA perfect fit for my learning mode # Because my learning mode is utilitarian, each little piece of knowledge is acquired in isolation, as dots. In practice, though, I try to link those dots back together, find the connections between them, and assemble my own knowledge web. For example, if you have used Docker, you have probably needed to put a DB or a web app inside it, say a Postgres in Docker, and wondered how to connect to it with DataGrip. That immediately brings out the Docker network / VM network topics (trust me, if you are not a CS undergrad, networking will be painful). All I wanted was to spin up a container whenever I use Postgres and connect to it normally from my IDE, but because of that problem I was \u0026ldquo;forced\u0026rdquo; to learn some networking, and roughly understood how DNS resolution works between two different networks; that knowledge later helped me understand VPC NAT gateway topics. One benefit of ChatGPT is that I can acquire that kind of knowledge very quickly, because I can simply ask \u0026ldquo;is data block in terraform like constant in programming language\u0026rdquo; (which it is not). It helps me generalise knowledge, so the speed of reaching one knowledge point from another becomes incredibly fast.\nIt happens that this learning mode is familiar to me, and at this stage my job doesn\u0026rsquo;t require learning these things in great depth; I only need to know how they work and how they combine with other parts of the stack to achieve a goal. It really is a great fit.\nSummary # ChatGPT\u0026rsquo;s usefulness is not limited to assisting learning, and of course ChatGPT is not omnipotent. It has created quite a bit of trouble in my learning too, such as wrong information and outright fabrication. But in programming you always have ways to test things, so it\u0026rsquo;s not a big problem; overall the benefits outweigh the costs. If I were learning a brand-new subject with zero prior knowledge, or digging deep into one domain, I don\u0026rsquo;t know whether ChatGPT\u0026rsquo;s role in learning would be somewhat different. Perhaps it would become more of a source of ideas than an answerer of questions; I\u0026rsquo;d probably need to pay for the subscription to find out.\n","date":"30 April
2023","externalUrl":null,"permalink":"/2023/04/rt-log-7-%E5%85%B3%E4%BA%8Echatgpt%E5%92%8C%E5%AD%A6%E4%B9%A0/","section":"posts","summary":"最近用chatgpt越来越频繁，感触良多。这里简单聊一下最近的一些感想，尤其是在自我学习上的一些辅助作用，如果能给予读者一些启示那再好不过了。\n","title":"RT Log 7: 关于Chatgpt和学习","type":"post"},{"content":"In this article, I am sharing my first impression on the open-sourced almighty data ingestion tool - Airbyte.\nI am currently on a week long leave after shipping our streaming pipeline foundation in Q3, this break finally allows me to pause a bit so I could fully focus on watching Dota2 The International thinking about the data engineering scene strategically.\nData ingestion has always been an interesting component to me, collecting data automatically into a database/warehouse/lake is a very convenient remedy for my mild (yes, just mild) hoarding disorder. Therefore it is no doubt that Airbyte has been one of the many data products that I wish to try, so I did.\nDisclaimer: I have not used this tool to its fullest capability, I have left out certain features like CDC syncs, or performed a very robust load test, what I will do instead is to highlight some of the features that I really like (and provide reasons for it), and some of the concerns I have observed so far based on my unique and non-universal sets of needs when it comes to data ingestions.\nThe Good # Overall I think this tool is perfect for small teams or companies with a very small data engineering team, commonly observed at their initial stages. It without a doubt speeds up the data movements from the operational product environment into the offline (online/live is out of the question here since Airbyte is not for streaming) analytical environment with minimal efforts.\nI will highlight some of the features that make this possible.\nConfiguration-based pipelines, with GUI # Over the years of practices I have come to realisation that building data pipelines is fun only when you are building something new constantly, it quickly becomes boring once the repetitive pattern starts, even with the help of a few for loops. I think it has very much established in the industry that we would want to dynamically generate data pipelines, or automatically generate data pipelines from scripts as much as possible.\nWhat I have done recently in my day job is basically generating Airflow DAGs and Tasks based on a YAML git repository that got synced in a S3 bucket via CI pipeline, this makes adding/removing a pipeline extremely convenient, not mentioned the additional benefits comes with CI, such as validation logic and versioning.\nAirbyte took a very similar approach, it does not only provide the YAML configuration option via their cli offering, but the application itself comes with a very nice GUI.\nThe GUI can be very useful under two conditions:\nData replication is done on a requested basis - we only request what we needs, instead by default syncing everything from the operational databases or data systems The organisation is embracing a self-service culture, or at least in the data usage aspect - I would say GUI will be the most important pre-requisite to drive this culture. It’s hard to convince even more technical data analysts or data scientists to write YAMLs, it would be easier if the company is started with this culture. It also shortens new hires onboarding time and opens up more room for hiring when certain skills are not required. This miniatures the platform strategy many big techs are taking to a certain extend. 
Handles long tail problem with SaaS APIs # We often need to ingest data from a third-party SaaS product, via their raw APIs or SDKs. The SDKs vary in usability and are limited by the programming languages they support; plus, it is mentally draining for our engineers to build a custom hook to a system just to bring in 30 rows of data from Zendesk.\nAirbyte solves this with finesse: as an open-sourced tool it has a lot of community-contributed connectors, saving you an extensive amount of time spent on integrations. I think this is pretty obvious, so no further elaboration will be penned down here.\nThe Bad # However, I do have some concerns; some of them can be found in the Github Issues, and some of them are very specific to my current usage. I do want to stress again that I have not used the tool extensively in production, so this is purely based on speculation and other discussions online. I would not say these are critical flaws that make this product bad; they are more like nuances one might need to overcome to fit this great product into their stack.\nPerformance # Performance, on either read or write, is one of my initial concerns, in particular against those often heavily used operational databases, when you have billions of records in a single Postgres table.\nPer this discussion, it looks like you will have to create multiple connections to achieve parallel syncs within a single database. Would this also mean that within a connection, syncs are done sequentially? Of course this could still be overcome by dynamic YAML generation, so we can still tinker with the configs a bit and generate the correct set of tables in each connection; it just needs extra work.\nBased on the official docs, the workers are horizontally scalable globally, but it seems impossible at the connection level.\nSync jobs use two workers. One worker reads from the source; the other worker writes to the destination.\nThis most likely means that one connection takes at most two workers, but not more than that, if I understand it correctly. You cannot devote a cluster of workers to either the read or the write process here, which is possible in other data systems. In Spark you might do something like controlled parallelism, allowing each worker to read a partition of a table by providing a list of predicates; and on the write side, I believe a destination like the Databricks lakehouse will allow you to tap into the scalability of managed Spark.\nEssentially, I am not 100% sure this is going to be a critical problem without trying it on production tables: if incremental syncs can be set up and the incremental load is manageable, the gain from boosting read and write performance beyond the one-worker-per-side setup might be marginal. Nonetheless, I think it would be nice to have similar options.\nAbstraction Level # In Airbyte, each pipeline or connection can sync one or more tables from the source, so each pipeline can be used to handle a set of tables that are similar in configuration; however, this means that the configuration is not set at the table level. Let's say we have 100 tables: if we want to run everything concurrently, we need to create 100 connections (a sketch of this fan-out follows below). 
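A toy illustration of that fan-out, in the spirit of the dynamic YAML generation mentioned above. The generated schema is made up for illustration and will not match Airbyte's real CLI config format:
```python
# Toy sketch: fan a table list out into one connection config per table,
# so every table can sync concurrently. All field names are invented and
# do not match Airbyte's actual config schema.
from pathlib import Path

import yaml

TABLES = ["users", "orders", "payments"]  # imagine 100 of these
OUT_DIR = Path("airbyte-connections")
OUT_DIR.mkdir(exist_ok=True)

for table in TABLES:
    connection = {
        "name": f"pg-to-lake-{table}",
        "source": {"type": "postgres", "table": table},
        "destination": {"type": "s3", "path": f"s3://lake/landing/{table}"},
        "schedule": "0 2 * * *",
    }
    (OUT_DIR / f"{table}.yaml").write_text(yaml.safe_dump(connection))
```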
This would require us to create a different layer of abstraction over the configs so we can manage this dynamically.\nI would like to have destination-table-level pipelines, because in my current setup we are using a data lake(house); essentially we would want to be:
- Able to run (or have the option to run) everything concurrently
- Able to execute other optional post-ingestion actions, such as registering the lake table in Presto/Trino
The Solution # I will just propose a very high-level solution for the issue above, briefly. Note that this might not be an issue at all for you based on your needs, and I would say with confidence that it will not be a problem for most teams. The solution is based on what I proposed and built at Xendit; it essentially revolves around a centralised configuration repository, which I call the Registra.\nOverview # The Registra is a central store of table-level configurations, for the tables in your main analytical storage, be it a data warehouse, a data lake, or a lakehouse.\nA rough flow would be the following:
1. Configurations are parsed and used to generate the Airbyte YAML files. We could easily introduce new config fields that help group connections logically together, something like cron_groups or connection_groups at the table level, so that eventually we know which tables can be grouped into a single Airbyte connection, and which (normally the heavy tables) require their own slot in Airbyte
2. Apply the generated YAMLs so they get recognised by Airbyte
3. Trigger the connections so they are online, and let Airbyte take over the data ingestion
4. In a separate process, most likely in a separate orchestrator as well, dynamically generate a list of tasks based on the same Registra YAML repo and use that to handle any post-ingestion operations
The above flow could easily be converted into a proper CI flow too: any change pushed into the Registra repo is the trigger, and the rest just follows.\nOther directions # Besides the YAML Github repo based solution, the centralised config Git repo could also be replaced with a proper relational database, which would give you better control over concurrent changes to the configurations. This would easily get you a decent internal self-service tool if a GUI is attached in front of that database.\nWait a second - Airbyte already comes with a self-service GUI that allows people to self-serve their data ingestion needs; would this not be a bit duplicated?\nI would say the Registra is not designed just for data ingestion: it should be the central config and the single source of truth for tables in your analytical storage, covering things that are not yet possible in Airbyte, and other things around the tables such as metadata, lineage, access control or archiving.\nHowever, if we are looking at the data ingestion step alone, there is a little overlap, because Airbyte has covered a lot of angles in the data ingestion aspect, and any wrapping layer around it will not be very useful unless you have a very specific real use case. To reconcile this, we need to recognise that there are two sets of configurations here: one from Airbyte, in the form of YAML and eventually values inside the Airbyte DB, and one from the Registra, in the form of YAMLs or DB values directly. The Registra YAMLs will be used to handle the specific things that cannot be done with Airbyte, while the Airbyte configs stay the same. 
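For illustration, a single Registra entry might look like the following. This is a sketch only: every field name here, including the post_ingestion hooks, is invented, and in the repo this would live as one YAML file per table.
```python
# Hypothetical Registra entry for one table; all field names are invented.
REGISTRA_ENTRY = {
    "table": "payments",
    "source": {"type": "postgres", "database": "core"},
    "destination": {"format": "delta", "location": "s3://lake/landing/core/payments"},
    # Tables sharing a cron_group share a schedule; a heavy table gets a
    # connection_group of its own, i.e. a dedicated Airbyte connection.
    "cron_group": "nightly-2am",
    "connection_group": "heavy",
    # Handled by a separate orchestrator, not by Airbyte.
    "post_ingestion": ["register_trino_table", "run_quality_checks"],
}
```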
In this regard we are opening up two entry points for configuration changes: one for data ingestion, done in the Airbyte GUI, and one for the analytical tables, done in the Registra.\nA typical user flow would be:
1. I want to replicate this table, so I go to the Registra GUI and set up the source, the destination and the destination configs
2. I push the changes
3. CI is triggered; Airbyte YAMLs are generated and connections are registered automatically after a while
4. If needed, I can go into the Airbyte GUI to further configure the ingestion; if not, the default settings should be enough to get my data into the destination
Yes, I would consider this a bit confusing, but with this setup it is possible to integrate many other open-source tools into the data stack, centrally controlled from the central configs. We have successfully integrated dbt into this setup, where the Registra configs sit side by side with the entire dbt YAML configs, connected using the paths of the dbt models.\nConcluding Remarks # At a high level I feel Airbyte is a great tool for companies / orgs that:
- are just starting to build up their analytical platform to extract data insights
- do not have a huge amount of data in their operational databases
- are not financially stable enough to support a relatively complete data engineering team
With Airbyte, one experienced data engineer, or even a generic software engineer, could probably handle the team's data ingestion needs and unlock more potential in downstream data usage. However, I would still caution every team to examine their specific data ingestion needs and evaluate whether Airbyte is worth trying. There is little gain in converting existing working pipelines, but it does help save the time of reading different API docs when integrating a new data source like Salesforce or Zendesk.\nEven though such tools are making data engineers (my job included) more replaceable over time, I still feel this is good for the world overall. 
It\u0026rsquo;s just the wheel of history, you do not counter that popular and cliche term these days, democratize data engineering, you just embrace it, but that discussion demands a separate blog post.\nHope this article is helpful for you.\n","date":"21 October 2022","externalUrl":null,"permalink":"/2022/10/det-log-1-a-first-byte-of-airbyte/","section":"posts","summary":"In this article, I am sharing my first impression on the open-sourced almighty data ingestion tool - Airbyte.\n","title":"DET Log 1: A first “byte” of Airbyte","type":"post"},{"content":"刚在看冬奥女子花滑决赛，特鲁索娃太过分了这么多四周跳，不过妹子长大了之后感觉没有原来轻盈了。结果看到有人在询问第三份总结的事情，想了想自从上次跳槽过了快半年了，也差不多是时候写一下新的总结。\n这篇文章会分为两个部分，包含上一份工作和这份工作的总结。\n上一份工作 # 这份工作主要是带一个团队在初创公司搭数据平台，做下来最大的感受就是累，也因此觉得回去做IC不失为一个不错的选择。管理团队人员的成长同时也要赶各种deadline，实际上对整个团队的成长都不好。成员没有得到足够的学习时间，没办法积累，而我也差不多被掏空了，无论是耐心还是精力。\n收获1：技术上的自我实现 # 当然收获是不小的，主要是证明了自己能够带着一群人从零搭起来一整套东西，而且是scalable和extensible的（at least to some extent）。当然从各个方面来说一个创业公司的数据架构不需要太复杂，所以也不能很说明问题。尤其是这几年data stack这一块变化实在太快了，当我还在为前移到Airflow 2暗自窃喜的时候 （可以看下我这篇文章DE Log 7: Migrating from Airflow 1 to 2），过了一年已经有人开始说其实我们本不需要Airflow (https://blog.fal.ai/the-unbundling-of-airflow-2/)。其他方面就是愈发认可开源的理念，虽然我并不主动贡献开源 - 主要是因为懒， 但是我开始积极提issue和讨论bug。因为用了诸如dbt之类的开源工具之后发现，自己吭哧写不如跟社区一起进步。\n这一点上，我觉得无论大厂小厂都应该花时间去看看，尤其是数据这一块。小厂造轮子浪费时间质量不一定比开源好（先承认自己是平庸的），大厂造轮子你作为个体会比较难看到全局，从学习角度来说开源的工具，哪怕玩一玩还是有很大帮助的。多看看博客发现发现新工具啊啥的，比如\nhttps://github.com/monosidev/monosi\nhttps://github.com/re-data/re-data\n这些最终可能最终不一定能做起来，但是能看一下他们到底在解决什么问题，哪些可以被我们工作中取其精华加以利用。\n收获2：大概率排除了一种职业规划 # 除了技术上的自我实现，也正儿八经在管理团队，也曾经想过要不要写一点这方面的总结。最终还是放弃了，第一觉得也没啥好说的，第二是最终意识到自己不喜欢做people manager。不是说我不关心成员的个人成长，我其实很关心，但是我觉得培养一个人的成就感还是不如设计系统本身的成就感强烈。可能这也促使我在现在的工作中更多的专注于代码和产出而不是人，虽然现在要升职还是得高一些有的没的。\n关于数据工程的职业细分，我也写过一篇文章（英语）：DE Log 8: Thoughts on Data Engineering Specialisations\n收获3：公司文化的认同 # 现在说文化认同这件事有点老调重弹，但是实际上随着年龄工龄的增长，这个因素在我选择offer的时候变的愈发重要，这次跳槽面试的trigger也是对前公司文化的不认同，在疫情期间这一点显得尤为突出。面试过程中，包括出结果之后最后的选择也是基于文化而决定的。拒了腾讯选了现在的公司那当然是因为文化，没人敢说去腾讯学习不到东西对吧，或者发展前景不好。当然不得不说因为数据工程本身细分出来起步晚，所以市场还是很小，求职者的bargaining power还是很强的。\n这一份工作 # 我干啥 # 目前的工作方向是让做一点工具让我们的数据湖仓（data lakehouse?) 能够支持自助操作，比如让其他团队的人，无论你是不是技术人员，选择要往湖里加什么新数据，要schedule什么batch job 或者real time streaming job，这个设计方向也跟我这几年来的感受很吻合。例如在上一份工作中，我个人总是先做出一个框架，然后让团队成员去填充具体的业务内容，本质上跟我现在的工作是很类似的，在职责上也感觉是同时在做开发和数据产品经理的感觉。未来我相信没有太多的人会需要手动setup pipeline了 ，可能都是configuration as a service。\n好的方面 # 选这份工作的原因基本上也是为啥我喜欢目前这份工作的原因吧，所以面试时传达的承诺基本都实现了：\n这份工作基本上就是IC目前，闷头写代码，开始啥会不开，有会你不去好像也没事，只跟美国的lead async 交流，简直爽歪歪。我个人是很讨厌开会的，async的交流我觉得能有效避免废话，能让你有条理的想清楚自己要说啥 - 当然能随时有条理的说话这也很重要，加上进来过试用期就要当什么pod lead 带一个小弟于是又开始了开一点会的节奏 工作模式当然一直在家，也很自由。早上三个小时集中，下午三小时，晚上看心情和累不累可以酌情工作，兴致来了也会多干一点，所以本质上已经没有加班的概念了，但是不会累，因为这样工作本身是不断带来正反馈的 文化很好，基本的workload management和友好交流模式就不提了，基本硅谷风格吧。也不是水，毕竟还是（独角兽）创业公司，虽然团队水平不是顶尖，但都很有学习的氛围和动力 技术对我是新的，spark + hive + presto/trino 属于我知道存在但是没咋用过的东西，不过知识上的迁移成本其实不大。因为我本身只是一个使用者，很多东西看看docs 看看别人讲的也就理解了。比如partition keys 跟redshift distkey 的使用和注意事项其实是很相似的。除此之外，还做了delta lake的迁移。虽然我不确认这一定是一个好技术，但是能在一个公司里大规模的真实体验一下这种使用场景，帮助也还是挺大的。 工作的时长和分配 # 由于工作量的合理分配，我有更多的时间去思考一整个季度的工作规划。虽然我们团队不讲什么okr kpi模式，但是就总体设计上应该是更偏向okr。 比如我需要保持pipeline health在99.5%，那我需要做什么？我需要哪些checks，哪些自愈功能。\n同时因为写代码比原来更能集中，写高质量代码的熟练度也是有上升的（废话）。在python下严格执行type hinting和dataclass passing，然后实现了一些strategy pattern之类的东西，总之接手原先的代码之后发现自己在实现一些东西前总能想一想能不能做一些小的重构，慢慢把整个代码库搞得还行。也同时践行了那句，\nWriting clean code is what you must do in order to call yourself a professional. 
There is no reasonable excuse for doing anything less than your best.\n确实在写一些懒代码之前会问自己几遍，这真的没时间好好写吗？\n最后想说一下写文档的熟练度，感谢我至少在毕业之后依然偶尔在写博客。由于一直是远程工作和异步交流，所以文档很重要。很多时候我写个分享文档可能就是要写个半天，但是幸好这个公司把文档当做产出（我觉得这也是正确的）。\n总体来说，因为整个公司的人员和团队结构就是分布式且分散的，当然这是ceo本来的设计。他想牺牲一部分效率来让整个公司的文化更团结，也更多元，就不会出现什么哪里是总部哪里是分部的结构，虽然决定过程长一些，但是最终结果看起来是不错的。不过这也导致了一些工程上的冗余，虽然我对目前公司的总体数据架构不是很满意 - 但是就我目前操作的这一块地盘来说我还是有比较大的操作空间，最终做出来的impact也是比较大的，也就是说我还比较能在目前的工作中找到意义，故比较满意。\n升职规划 # 好巧不巧明天要写outcomes docs，就是一个说明自己干了啥为什么值得一个升职加薪的文档。所以短期内可以争取一下，感觉自己进来之后主要成就还是比较多的。升职不重要，钱比较重要说实话。如果还可以的话也可以过几年指望这个公司上市毕竟是独角兽哈。\n好了睡觉了 - 现在由于早上8点要起床跟美国开会，睡得也比原来早了，挺好的。\n","date":"17 February 2022","externalUrl":null,"permalink":"/2022/02/rt-log-6-career-summary-v3/","section":"posts","summary":"刚在看冬奥女子花滑决赛，特鲁索娃太过分了这么多四周跳，不过妹子长大了之后感觉没有原来轻盈了。结果看到有人在询问第三份总结的事情，想了想自从上次跳槽过了快半年了，也差不多是时候写一下新的总结。\n","title":"RT Log 6: Career summary v3","type":"post"},{"content":"Happy New Year everyone! And I am too lazy to change the thumbnail image.\nAs the title of data engineer starts to gain more recognition, its responsible domains are becoming more specialised. Data engineers have been asked to cover everything from the lowest end of a pipeline - the more infra-involved, DevOps side of the fun when designing a data-intensive application - to the highest end, the more analytically involved side: the data modelling and engineering support of your fellow business analysts.\nA person who is capable of developing a data product end-to-end, including the data pipelines that support the data aspect of the product, can be described by a prevalently used phrase these days: full stack. Yes, besides data alchemists and data magicians on LinkedIn, we are getting full stack data engineers as well.\nRemember: the core responsibility of a data engineer is, in vague terms, to help your company / teammates have easy and reliable access to data, regardless of what technologies you have decided to adopt. Hence, as the data engineering industry grows, the old craft of setting up your own distributed Airflow and Spark clusters sadly becomes less important, all thanks to commercially available services like Databricks and AWS MWAA. Unless you are developing some new customised in-house features, you do not and should not build everything from scratch.\nIf you are like me, circulating from startup to startup in the last few years, most likely you have been working as this so-called full stack data engineer. Some may argue that you are not learning anything in depth this way - that you are just an Airflow monkey busy fixing pipelines that break every day. That tends to be true if you do not have a great mentor, but with proper guidance it can become a good learning package, and an effective way for you to nurture the bigger picture. Eventually, if you want to dive deeper into the craft (pursuing the IC track or something), you will start to lean towards one of the two ends of the spectrum mentioned above.\nThere are different titles coined for the specialised data engineers. Categorising by tech stack, you will have data engineers and big data engineers (this is a joke); categorising by functional role, you will have a variety of ETL Engineers, Data Warehouse Engineers (in Chinese companies) / Analytical Engineers, Big Data Infrastructure Engineers, and (Big) Data Platform Engineers.\nDifferent Skillsets, Reusable Fundamentals # After all, these variants are still software engineers; that means they are supposed to solve problems in automated / programmatic ways. 
Although the skillsets and areas of focus involved can be drastically different for each title, there are a few fundamental skills that all variants of data engineers should possess (the discussion here excludes the common skillsets that should be the norm for all engineers: a programming language, git, etc.).\nUnderstand those generic data concepts # There are several data concepts that apply across the entire life cycle of a data pipeline. If your variant of data engineer directly or indirectly works on a pipeline, batch or streaming, you will most likely need to understand concepts such as Slowly Changing Dimensions (SCD), Change Data Capture (CDC), ACID properties, data partitioning, and the common challenges a distributed data system faces. By understanding these concepts in depth, you will be able to make technical decisions with confidence and avoid saying "I saw this on StackOverflow but I'm not sure why" when your pipeline blows up.\nCultivate data quality sense # Data quality, data health, whatever you call it, is in my opinion a set of dynamic, subjective metrics you derive to measure the quality of your data. I cannot stress this enough: as long as data is flowing through your manned position, find a way to check it - it improves the general quality of life for everyone, especially yourself. Otherwise you will be running the same select count(*) statement over and over and over again.\nAside from the basic checks, you might need to derive some customised / business-logic-specific checks to monitor certain aspects of the data source. Average incoming data per day? Seasonal reporting deviation percentage? Now it is time to apply some middle school mathematics (I have hated math since young, even as a Chinese kid, but I do admit it is tremendously useful).\nUnderstand a few various data tools # There are tons of tools and technologies out there for each component in the data tech stack. I would say one should have a firm grasp of at least one tech for each component, and preferably of both Hadoop and non-Hadoop based solutions where applicable. For a complete list I would suggest taking a look at the awesome data engineering list here: https://github.com/igorbarinov/awesome-data-engineering\nNow the why part: I would argue that technologies within the same component are similar, hence the knowledge is transferable to some extent. If you understand Postgres, you should have no major problem understanding MySQL, but this barrier gets higher as the underlying implementations diverge. A filesystem-based batch SQL engine like Presto is further from Postgres than MySQL is.\nLet's say you want to prepare a table with partitions. You know that partitioning the data would give you better read performance in some cases, but not always. Different data warehouse / processing tools partition data differently, but it is all about having dictionary/index-like access to shards of data, so that less data is scanned and reads are faster. You would then realise you need to read up on how the new data tool implements this mechanism, after which you should be able to make a technical decision and confirm it with some hands-on work.\nStay informed with the latest data design trends and paradigms # The data engineering scene, like any other field in the software industry, is constantly changing. I still remember being introduced to Hadoop & MapReduce in the summer of 2014, during my first university course, before I could even code in Python. 
Fast forward to today: Dataspark is gaining presence, cloud computing is a cliché, and making products is easier than ever. It is of utmost importance to stay with the trend. What are the latest open-source / proprietary technologies that make existing solutions obsolete? What is the new tool that makes your original in-house design look like a student project? Constantly absorbing this knowledge, if not information, helps you keep the bigger picture.\nStart with Big Data or Small Data? # Short answer: it is better to have experience with Hadoop-based tech, and if you are a CS graduate it is going to help you land a decent job. However, if you are like me, coming from a non-CS background, it really does not matter: you should start small and learn as much as you can on the job, in which case the Small Data stack might be more suitable.\nBack then there was always this constant fear that I would not be able to catch up with the latest technology, especially the Hadoop-based stack, since I mostly operated in companies whose computational load and storage could be handled by a single Redshift cluster. And I stand by this design today, where computation and storage are both handled by the data warehouse. It is the most economical solution for a small startup (well, if I were really, really familiar with ClickHouse I could probably bring the cost down even more, I guess).\nNow I am working in a company running Spark, Hive and Presto/Trino. There was a lot of new info to digest at first, but I soon realised many concepts are transferable, especially those closer to the fundamentals.\nThere are, after all, several differences between a managed Redshift and a Hive+Spark setup, but ultimately one should understand both sides of things. It does not make sense to fire up a Spark cluster for every little job.\nWhats next? # Keep reading blogs and other online materials. I like reading technical blogs because they are concise, really up-to-date, and fun to read. It is as if a tech leader in the industry is chatting with you face to face - much more fun than reading a book.\nLike this: https://www.dataengineeringpodcast.com/\nThat being said, I do sometimes need to gain knowledge more systematically, or dig deeper into the craft; that is when I start to read a book. Shamelessly, I have never completely finished a technical book - I read selectively, more like an information lookup. It has something to do with my learning habits, which are needs-based (I learn stuff that earns money, or brings joy).\nOn a concluding note, I will list a few things that I believe are worth checking out. I am no master in any one of the following fields, and I am not going to be. A generalist seems to suit me better than a specialist so far.\nP.S. 
the jargons are extremely “useful” in interviews\nConcepts: # Snowflake vs Star Schema design\nData Warehouse paradigm shift: data warehouse → data lake → data lakehouse\nData layers in the data warehouse (and their technical names)\nData modeling\nData Quality\nData Governance (Security, Access, Privacy and Compliance)\nPipeline \u0026amp; Query optimization\nBooks: # Read selectively as some of the concepts are outdated, but I do believe as the main contributors to the modern data warehouse, one should at least read them.\nThe Data Warehouse Toolkit - Ralph Kimball\nBuilding the Data Warehouse - Bill Inmon\nDesigning Data-Intensive Applications - Martin Kleppmann\nBuilding the data lakehouse - Bill Inmon (Waiting to get this, I personnally still have not formed a good view on data lakehouse)\nAnd a few other things transferred from software engineering # Design patterns\nData Mesh\nDomain Driven Design in Data Warehouse\nThe general testing mindset of software engineering and how it can be applied to data engineering\n","date":"1 January 2022","externalUrl":null,"permalink":"/2022/01/de-log-8-thoughts-on-data-engineering-specialisations/","section":"posts","summary":"Happy New Year everyone! And I am too lazy to change the thumbnail image.\n","title":"DE Log 8: Thoughts on Data Engineering Specialisations","type":"post"},{"content":"这本书是我偶然看一本历史穿越小说得知的，通过这本小说我知道了谁是苏颂，谁是韩公廉，沈括除了写《梦溪笔谈》还具体干了啥，也知道了在几百一千年前的中国历史中一些神奇的发明。作为完成了九年义务教育的我来说，谈及开封只悉晓清明上河图，古代科技只认四大发明，这在现在看来无疑是可笑的。\n《文明的滴定》是一个很有趣的标题，学过化学的大体知道什么是滴定（titration）- 也就是通过一个已知浓度的溶液与另一个未知浓度溶液发生反应，通过指示剂颜色的瞬间变化观察到滴定终点，从而由所用溶液的体积和浓度测算出结果。而这里则是指通过用一个文明的已知条件，时间和历史事件去滴定另一个文明的未知。此书则主要关注和讨论科学，包括古代和现代科学，在中西方的发展差异，以及它们与经济，地理，尤其是社会等因素之间的关系。这本书，亦或是说一系列演讲以及论文的合集，更多的并不是在辩论而是呈现观点，或提出疑问以及讨论的必要性，我想无论是作为消遣读物，还是科学史的入门书籍都是十分合适的，故很推荐大家去看。\n第一遍读下来我更多的体会倒不是记住了多少观点，多少史实，而是作为一个从小接受了中国和新加坡教育的人来说，即使我对中华文明的自豪感已不止于喊口号（上下五千年？），我依旧震惊于自己对西方科学的一种些许的，潜在的盲目崇拜。我一直尊崇理性和逻辑，也不免陷入科学发展自发论的谬误。同时，前些年过多片面的阅读西哲也让我忽视了璀璨的东方哲学。没错就是那个“诸子百家”的时代，这种意识的好处自然是激发了我阅读的兴趣。\n在这里我想要试图以几句话精炼出我的个人科学史观，我摘取了几个对我认知冲击比较大，或是值得特别提出的观点。我想我的认知水平放在当代年轻人身上应该还是能达到平均水平的，故希望观点能对大家有所帮助。\n这本书最核心的部分就是对现代科学起源的讨论。我们如今所学的科学，涵盖了从物理化学生物以及这些学科下的许多分支，这些学科和知识共同交织，系统化地构成了现代科学的大框架。当我们意识到了物理学种原子和电子的关系可用电子云模型更好的对原子建模之后，化学学科中的对于电子能级迁跃的解释也相应发生了改变。大多数时候我们是不太去考虑他们的起源的，往往只是为了获得好成绩而学习一些学科和知识。由于这些知识本身对大多数人不一定实用，而且也不简单。我们的精力在学习的过程中也不可避免的被消磨殆尽。\n实际上在我看来，对于具体科学知识的学习以及科学史的学习是同等重要的。中国作为一个历史观十分厚重的文明，我觉得（至少在我接受过的教育中），在科学史尤其是中国科学史的探究上还是很不足的，这无论是为了提升民族荣誉感还是建立正确的历史观都很重要。\n当然也有一部分人怀着对知识的热情吸收着一切可触及的知识，没有什么特别的理由，仅仅是为了更好的认知这个世界。我个人很欣赏这种求知欲，也在尽我所能践行着这种求知欲。所以当读到李约瑟（本书作者）驳斥现代科学仅起源于欧洲的相关论调，我开始感到好奇。印象中我对伽利略还是在望远镜和比萨斜塔自由落体实验（据说是传闻）上，但伽利略实际上第一次贯通了数学和科学实验，可以说由此科学实验变得富有逻辑和严谨，而不再是如亚里士多德时代那样注重主管思考和推理。对数学的使用放到今天似乎已经是一个下意识的举动，尤其是对于我这种数据从业者来说，一个东西的好与坏，可行性我都想要量化，也就是用数学对其建模，这放在五百年前确实是一个巨大的突破。\n回归话题，李约瑟认为中国古代科学的发展是平稳而包容的，虽然未能出现伽利略式的突破，从而引导出爆炸式的科学革命，但是由于社会和地理因素，所孕育的发明创造直接或间接的影响了现代科学的开端。我倾向于这么认为，也就是我的个人科学史观之一：若以伽利略为分界点，即以现代（也有一说近代）科学之父为坐标轴原点，狭义来说现代科学确实起源于欧洲，但是所有在此之前的发明创造都为现代科学架构的蓝图做出了贡献，这其中不乏一些无比关键的发明。所以广义来说现代科学起源于欧洲是一种十分自大的说法，对历史的学习在于总结经验指导行为而不是争夺往日荣光（瞧瞧庞大的罗马帝国）。\n在这里我想举两个例子，分别直接和间接的影响了西方现代科学的开始。\n北宋的苏颂（与韩公廉及其他能工巧匠）重现了水运仪象台，最直接的影响就是提高了人类对时间测量的精度（这里还并没有讨论对天文的贡献）。而时间的测量在今天又是一个常人不太会过多思考的概念，但可以说是现代科学一块不可或缺的基石，尤其是对运动的描述中十分重要。作为首个将擒纵器应用在机械钟的高精度装置，在传入欧洲之后，大大加速了欧洲机械表的精度提升，所以我觉得使用“直接推开了现代科学的大门”这样的比喻也是不过分的。\n火药的发明是另一个例子。当然作为四大发明之一，这种发明本身的就是举足轻重的，不过我在此想要加以拓展的是它对中西方社会的影响。在我们的想象中可能火药是热兵器（在这里我只想对火药的军事应用加以讨论）的开端，热兵器对与冷兵器的优势在于为了达成同样的杀伤力需要更少的单兵训练。意味着如果一个15岁的乡村青年有了鸟铳枪，假设他搞清楚了怎么用，且运气好火药没有炸膛，那他应该能够越过护卫干掉欺负村民的某扒皮。但是中国的体制在很长一段时间并没有发生过巨变，虽然火药以及其军事应用很早就被发明了，但男耕女织，士农工商的社会基层和分工并没有过多的改变。反而当火炮传入西方，
贵族式的封建社会开始衰落 - 因为城堡不再是近乎无敌的存在。由此西方开始了一系列的社会变革，从工业革命中孕育了资本主义，再孕育出了现代科学（资本主义与现代科学的发展也是密不可分的，但是这其中的关系颇为复杂，书中也有阐述，在此就不展开了）。\n书中一些其他的讨论，尤其是对中哲与科学关系的讨论也是十分有趣的。例如我们知道光（电磁波）的波粒二象性，但是中国古代哲学天生是倾向于波的，场论也是如此。所以对于超距作用的理解，例如潮汐作用对于聪慧的古代劳动人民就不是很难接受。而反观西方可能是因为亚里士多德对物理学的阐述：他认为物体只有在一个不断作用的推动者直接接触下才能够保持运动。当然几千后的我们看来这是非常不准确的，但是作为几千年前的贤人能够根据周围观察试图总结世界运转的规律这其实非常了不起，不过这可能也阻碍了场论和超距作用观念在西方的出现时间（直到牛顿的出现）。\n另一个我想提及的是中西方对于几何代数观念上的差异。中国的数算其实起源很早，我们听过《九章算术》和鸡兔同笼问题。而在西方因为欧几里得的几何学的突破，使得数学以及其他学科对几何学的应用得以蓬勃发展。这其中对于天文学的贡献是巨大的，但这不代表中国古代在天文学方面所取得的成就就不如西方，实际上正因为天文学属于国家批准科研计划（因为历法在农耕文化太重要了，古代夷人效忠的标志之一就是使用中国的历法），天文学家往往都是宫廷任职，极少数部分也是极为杰出的政治家。\n中西方由于哲学和神学的不同孕育出了对世界不同的解释框架，而也正是这种不同，结合许许多多其他社会，地理和经济因素等，使得在慢慢历史长河中只有欧洲大陆上出现了伽利略。如今的中国，毋庸置疑在科技水平上总体依然是落后于西方的，但在国家执政者眼里科学的发展不只是指导生产提升民用的辅助者，而是国与国之间的战略重点。这种改变是令人欣喜的，但是四五千年的封建官僚体制，固有其优越性，但是正如王安石变法时苏颂试图分经科举一般，都是在触动利益。娱乐圈最近的风波或许是一个好的现象，但希望不只是昙花一现。\n另，对于讽刺古代科技发明只是发明而不是科学（实际上只是不曾融入我们常识认知下的西方科学体系的知识）的人来说，要知道学与术在当代确实是高度分工的。但由于古代人才的稀缺和在中国政治科举取仕的选择下，花时间精力写《本草纲目》或者《新仪象法要》并不会给你带来过多的财富或名声，但古代科学家依旧笔耕不辍，他们大概率图的不是死后的封号或者被群众立庙瞻仰，而是对自然科学的渴求以及文明的传承。\n","date":"13 September 2021","externalUrl":null,"permalink":"/2021/09/book-log-3-%E6%96%87%E6%98%8E%E7%9A%84%E6%BB%B4%E5%AE%9A%E4%B8%8E%E4%B8%AA%E4%BA%BA%E7%A7%91%E5%AD%A6%E5%8F%B2%E8%A7%82/","section":"posts","summary":"这本书是我偶然看一本历史穿越小说得知的，通过这本小说我知道了谁是苏颂，谁是韩公廉，沈括除了写《梦溪笔谈》还具体干了啥，也知道了在几百一千年前的中国历史中一些神奇的发明。作为完成了九年义务教育的我来说，谈及开封只悉晓清明上河图，古代科技只认四大发明，这在现在看来无疑是可笑的。\n","title":"Book Log 3: 《文明的滴定》与个人科学史观","type":"post"},{"content":"","date":"13 September 2021","externalUrl":null,"permalink":"/categories/book-log/","section":"categories","summary":"","title":"book-log","type":"categories"},{"content":"I have been using SAXO for three years and Interactive Brokers for more than one year, and recently I have decided to switch from SAXO to IB completely, but I could not find a proper instruction on the Internet.\nNote: I am in Singapore and I believe my account was under SAXO Singapore, but I think it does not really matter when it comes to security transfer. However, obviously you can only transfer assets that are available in IB. e.g., you can\u0026rsquo;t transfer SGX stocks into IB.\nDisclaimer: This article is just a recollection of me doing something a few days ago, please always check for the detail input of the table. 
Use at Your Own Risk.\nThis article is going to cover all the steps involved and address some of the confusion I had during the process of transferring securities, but before that …\nMajor reasons for switching #
- SAXO has a way higher minimum commission per trade, which is pretty costly if you have developed the bad habit of trading regularly
- SAXO forces you to convert between SGD and USD when trading US stocks, so your trading routine is essentially deprived of control over the currency conversion timing
- SAXO has a pretty bad conversion rate compared with other platforms
Major advantages of switching #
- You avoid paying the closing cost (conversion cost plus commission) on SAXO, especially when you do not wish to completely exit the US market
- You automatically get a multi-currency account
- You can easily carry out margin trading (don't do it if you do not know what that is)
Major disadvantages of switching #
- Bad UI (both mobile and desktop)
- Bad server connection during peak periods
How # IB #
1. Just like the funding process in IB, you will need to initiate a transfer request: search for it in the search bar. It can also be found in some of the dropdown menus, but I noticed a UI bug that can somehow make the dropdown menu disappear, so just stick to the search bar and click Transfer Positions.
2. Read / Google all available methods for transferring IN to IB - that is, you need to select the Incoming tab. I used the Basic FOP Transfer type, because SAXO can only be found there.
3. Fill in the form accordingly. The confusing part is the contact info, so I am going to write it down in detail in the Note column - I think this is going to work, since my transfer request was completed.
4. Once the form is filled, proceed to add in the incoming assets, and you are done with the work on the IB side, I believe. I cannot remember clearly now, but I think IB will give you a message containing the information on their side, such as the contact email and name. This will take a while, like one or two days, so I think they are handling the securities transfer manually.
Saxo #
1. Once you have received the information from IB, follow this partially useful and outdated documentation provided by Saxo to find the securities transfer window: https://www.help.saxo/hc/en-sg/articles/360001270546-How-do-I-request-a-securities-transfer-
2. It is now under the hamburger button → portfolio transfer.
3. Fill in the form details.
4. Assuming you are going to close your SAXO account, be prepared to have at least 160 EUR in cash in your bank - that is the total cost this process will incur. Once this is done, you can just sit back and relax.
Final note #
- Most of the fields in both forms are poorly designed, and key information is hard to retrieve or locate online. This is particularly triggering for users who are trying to move a huge amount of money across banks - it should be designed with more accuracy and precision.
- That being said, I found the entire process is most likely done manually by both parties, so you should be fine even if you make some typos or enter a wrong value (they will probably email you to clarify).
- The original open prices on SAXO will NOT be transferred into IB, so that kinda messes up your performance calculation a bit. 
You will have to do it yourself. Happy trading everyone! ","date":"31 August 2021","externalUrl":null,"permalink":"/2021/08/a-guide-to-transfer-securities-from-saxo-to-interactive-brokers/","section":"posts","summary":"I have been using SAXO for three years and Interactive Brokers for more than one year, and recently I have decided to switch from SAXO to IB completely, but I could not find a proper instruction on the Internet.\n","title":"A guide to transfer securities from Saxo To Interactive Brokers","type":"post"},{"content":"","date":"31 August 2021","externalUrl":null,"permalink":"/categories/toy-tip/","section":"categories","summary":"","title":"toy-tip","type":"categories"},{"content":"这个问题说实话有点烦。\n本来是在写一个随想的，结果在notion里写中文打着打着就感觉卡顿十分明显，之前特别卡的时候已经把touchbar联想关闭了，所以我想如果没关应该会更糟糕。但是基本来说，我发现主要的问题应该是有好几层。\nnotion本身输入中文的时候是有一定卡顿的。这个主要是通过对比sublime text输入的时候， 很明显作为一个轻量级的文字编辑器，sublime对中文输入的反应更快一些 filevault加密也会影响到输入卡顿。一个显著的特征就是当我把电源拔掉之后，似乎就不那么卡了，虽然backspace依旧会有一丢丢卡顿，但是比连着电源的效果好一些。按照网上的说法是连着电源会自动开启filevault，用电池会自动关 最后我觉得就是系统级别的问题了，我工作的时候用microsoft teams，那个中文输入法加上microsoft teams本身就足够垃圾，我经常卡的动都动不了。所以哪怕是跟中国同事感觉用英文交流还舒畅一些，尤其是那种需要中英切换的时候，microsoft teams能直接给你卡死 （当然这个锅主要是在MS而不是在Mac） 解决方案我感觉目前没有，只能等系统update了。回滚Catalina可能也有用但是我不太确认提升有多大，总之这个事情确实还是蛮影响心情和效率。想想用英文在Terminal里健步如飞的酸爽，再看看现在中文输入的步履蹒跚，哎。\n我这个电脑还是32Gb ram，我都觉得有时候开多了pycharm windows会卡，可能是时候转战vscode了。\n","date":"13 August 2021","externalUrl":null,"permalink":"/2021/08/rt-log-5-%E5%85%B3%E4%BA%8Emac%E7%B3%BB%E7%BB%9F%E4%B8%AD%E6%96%87%E6%89%93%E5%AD%97%E5%8D%A1%E9%A1%BF%E9%97%AE%E9%A2%98/","section":"posts","summary":"这个问题说实话有点烦。\n","title":"RT Log 5: 关于Mac系统中文打字卡顿问题","type":"post"},{"content":"","date":"29 April 2021","externalUrl":null,"permalink":"/tags/airflow/","section":"tags","summary":"","title":"airflow","type":"tags"},{"content":"Today I have successfully migrated my current Airflow setup from v1.10.14 to v2.0.2. This article will not be a very detailed step-by-step upgrade guide; instead I will outline the general migration steps that worked very specifically for my setup, share some of the problems I encountered during the process, and finally give some general impressions of Airflow 2.\nDeployment Setup # My current setup in production is a mini Airflow cluster hosted on AWS: one master node that contains the webserver, scheduler and flower, plus an autoscaling worker cluster, deployed on AWS Fargate. Redis and the Postgres metadata database are deployed separately into existing resources.\nLocally, it is very similar, except every resource is grouped in a single docker-compose file, and the decoupled DAGs folder is mounted directly into those containers. Thus I can update the pipelines without restarting the docker-compose stack.\nDeployment of DAGs is also slightly adjusted from the common standalone setup. Instead of deploying the scheduler, plugins and DAGs in the same repo, which would inevitably slow down the iteration of common pipeline DAGs, I decoupled the DAGs into a separate repo, synced into the master and worker node containers using a simple bash script. I kept the plugins inside the repo because, you know, they are not going to reload anyway, and they change so infrequently.\nMy goal here is to first migrate this setup completely to Airflow 2, without changing much of the setup itself. Then I want to implement the HA and smart sensor setup, highlighted by the documentation as the star improvements in Airflow 2. 
Along the way I might start to move out of Fargate, as I have started to feel its limitations - but definitely not to ECS with EC2, as that is quite bulky too. So the only sensible choice in the end is k8s (sounds exciting).\nI created a new repo to host the scheduler-level code such as the Dockerfile, deployment scripts, etc.\nSteps # Extend the official Airflow docker image and start a local Airflow instance # Originally I built my own version of the image based on puckel's; with the new official image, I needed to dive into the Dockerfile and extend it, most importantly with the sync script I mentioned previously.\nTrouble migrating plugins # In Airflow 1, plugins are imported at run time (if my understanding of the architecture is sound), and they are imported from Airflow's own modules (operators, hooks, etc.). In Airflow 2 they changed it so a plugin behaves just like a normal Python module. However, I had some problems because my DAGs are separated from my plugins: I tried to bake them into the same container even though logically they come from different repos. It did not work for me, so I said OK, let's go with the other approach: installing those plugins as Python packages into the scheduler environment, which worked pretty well until …\nTrouble installing additional Python packages into the Airflow environment # This took me a while. During my last extension of puckel's image, I had switched the user from airflow to root when building the environment inside the container - I know, I know, it is not the safest, but still I did it, and then I forgot all about it until the webserver told me I had missing dependencies. Naturally I bashed into the container and ran pip list and grep-ed the whole thing. The package was there, so I was pretty confused; after 10 minutes or so my mind started to think along the lines of multiple Python environments existing, hence multiple Python binaries. I went to check the Dockerfile - nah, nothing related to virtual environments, which I kind of expected. Then it clicked: it is the system user, and thus a different Python binary. By using root to build system-level dependencies and switching back to airflow when installing Python packages, it worked pretty well.\nTrouble deploying using git # This is a bit embarrassing, but the last time I did an Airflow setup was one year ago, and it cost me 10 minutes to realise I had forgotten to enable the deployment key in Gitlab.\nTrouble with SQLAlchemy # SQLAlchemy recently upgraded to 1.4; I did not follow it, since I do not really have the time to upgrade my current projects. So I stuck to the 1.3.0 version, until I hit this bug:\nSomething bad has happened. Please consider letting us know by creating a bug report using GitHub. 
Python version: 3.6.13
Airflow version: 2.0.2
Node: 057953b46ef2
-------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/www/auth.py", line 34, in decorated
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/www/views.py", line 497, in index
    filter_dag_ids = current_app.appbuilder.sm.get_accessible_dag_ids(g.user)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/www/security.py", line 271, in get_accessible_dag_ids
    [permissions.ACTION_CAN_EDIT, permissions.ACTION_CAN_READ], user
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/www/security.py", line 285, in get_accessible_dags
    .subqueryload(Role.permissions)
AttributeError: '_UnboundLoad' object has no attribute 'options'
In one of my customised Python dependencies I had pinned v1.3.0, but Airflow 2.0.2 uses v1.3.14; the mismatch caused this bug.\nPreparing the new metadata database # Once the DAGs all load fine in the local environment, it is time to deploy. However, we have DAGs running throughout the day, so to minimise downtime I decided to do a mirror deployment, with every resource created separately for Airflow 2, including a new metadata database (not a new RDS instance, just a separate DB; the same applies to Redis).\nI also wanted to keep the logs, because you know how things get out of control if you have an old start date and there are no execution records (DAG run records). So I needed to replicate the metadata database.\nI created a new branch in my DAG repo, so the actual production branch, which still used the Airflow 1 import syntax, would not be affected by the deployment.\nAnd finally, once the DAGs are running on the new Fargate service with the new branch, I can turn the Airflow 1 setup off.\nSounds like a good plan - here we go.
1. Create TWO replicas of the existing metadata database in your production database, grant the necessary DB access rights, and connect to the new prod meta DB LOCALLY. The second replica is a backup, so that you won't need to cut off every connection in the current production in order to redo the replication.
2. PAUSE ALL DAGS!
Otherwise the active state will be replicated into your new replica, which will inevitably auto-start some of the DAGs in your local Airflow instance - we do not want that, because at this point we are just testing the database setup, not the DAGs.
-- revoke access so we can successfully terminate all connections in the prod
REVOKE CONNECT ON DATABASE airflow FROM airflow;
REVOKE CONNECT ON DATABASE airflow FROM public;

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE
    -- don't kill my own connection!
    pid <> pg_backend_pid()
    -- don't kill the connections to other databases
    AND datname = 'airflow'
;

CREATE DATABASE airflow_db_v2 TEMPLATE airflow;
CREATE DATABASE airflow_db_v3 TEMPLATE airflow;

-- grant it back because you want your current prod Airflow to continue
-- to work; you might have already received some alerts from the current
-- Airflow, let it auto-heal / redeploy if needed
GRANT CONNECT ON DATABASE airflow TO airflow;
GRANT CONNECT ON DATABASE airflow TO public;
GRANT ALL ON DATABASE airflow_db_v2 TO airflow;
GRANT ALL ON DATABASE airflow_db_v3 TO airflow;
3. Modify airflow.cfg in the new repo to point celery_result_backend and sql_alchemy_conn to the newly created DB.
4. Temporarily export AIRFLOW_HOME to the new scheduler repo - this is small but important.
5. In the new local Airflow virtual env, or in a new container, execute the DB upgrade command: airflow db upgrade. I did run into another problem while upgrading:
sqlalchemy.exc.IntegrityError: (psycopg2.errors.NotNullViolation) column "conn_type" contains null values
Well, it is not an alien error message: you navigate the metadata database and realise that one of the existing Airflow Connections has a NULL conn_type, which made the alembic migration fail and throw this. This is probably due to some legacy issue, since I believe I started my cluster from v1.10.10.\nAt this point, we have tested that the new pipelines more or less work with the new scheduler and the new database; we can now actually deploy.\nMirror Deployment #
1. Update the deployment script, in my case the Fargate Task Definitions. Switch the credentials etc. so the new deployment points to the new DB. Note that the executing CMD has changed: worker → celery worker and flower → celery flower.
2. In local Airflow, switch to the actual git-synced non-production branch you created earlier, instead of using the mounted DAGs.
3. Deploy the image and create the necessary AWS resources, e.g. task definitions, AWS log groups and streams. If you have an existing CloudFormation script / Terraform solution, that is great; mine is semi-automated, so I still needed to create some things by hand, which is error-prone.
4. Once deployment is successful, you can consider starting the DAGs all together, if there are not many discrepancies between your current state of logs and the current Airflow 1 instance. I did another full replication: wait until all pipelines are done, pause everything, switch the connection locally, apply the replication, and then run the alembic upgrade.
Another interesting issue I want to mention before I go to bed (this is getting really long …):\nWhile I was trying to help my colleagues set up the environment, one of them could not load the DAGs, despite the DAGs being present in the image and mounted successfully. The only error log was something like DagFileProcessing terminated after 53.8s. 
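(For reference while reading the debugging story that follows: these are the two knobs that turn out to be involved, as I understand the Airflow 2.0 configuration. Names and defaults are from memory, so verify against your version's config reference.)
```python
# Sketch: reading the two relevant settings via Airflow's own config API.
from airflow.configuration import conf

# A DAG file taking longer than this to parse gets killed; this is the
# source of the "DagFileProcessing terminated after 53.8s" message.
# Default is 50 seconds, as I understand it.
timeout = conf.getint("core", "dag_file_processor_timeout")

# How often (in seconds) the scheduler re-scans each DAG file.
interval = conf.getint("scheduler", "min_file_process_interval")
```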
I did not realise what this was suggesting initially, so I tried everything - built from a clean start, etc. - until I stopped and started to think:
- DAGs are not being read correctly, so the DagBag is not getting scanned successfully
- I remember there is a config controlling the scanning interval; maybe that would help
- While looking through that particular config, I noticed another config: the DAG loading timeout, which defaults to 50s
- HMMMMMMMM, OK, this is very similar to the error log, and the computer is pretty laggy - maybe it is the timeout?
After increasing that config to 200, the DAGs started loading successfully.
General Feelings # The docs are much, much, much, much better. They have become readable and reproducible, and some of the earlier gotchas are now documented too, e.g. the execution date, start date and time interval issue.\nThe quick-start experience has been improved a lot. It took me quite a while to fully understand how Airflow 1 works, both in a Python virtual environment and in containers, with quite some work devoted to maintaining the three different main processes on an EC2 machine. In fact, thanks to that experience, I can now fairly easily identify the source of a bug - whether it is a scheduler or worker problem, or something network-related.\nBack to the topic: it is now much easier for people to start using Airflow. It is not just the documentation - they also provide an official container image that works out of the box.\nRemember the good old puckel image for Airflow 1? The official image is very well built too. They have also considered those annoying details such as the wait scripts etc., which bothered me quite a lot in the early days.\nThey have even purposely shortened the docker-compose commands. I did not use them though - I am still using a make script, starting my local containerised environment with just make up.\nI would say the UI is a big change, user authentication too. Logs can now be opened in a new tab, and the experimental APIs are finally not experimental. I have not tried them yet, but I previously had quite a few automation ideas, and they are much easier to realise with a proper API exposed.\nI would not really say the UI style has become better, due to my personal taste - maybe it is a habit problem. There are some rough edges in the tree view, where the solid lines around those small squares look blurred.\nDAG files are almost fully compatible, except for the plugin and operator imports. Those were pretty easy to change for me too, since it is just a global bulk replace of import statements.\nConclusion # I did not spend too much time on this, thanks to a decent understanding of Airflow deployment. Overall it was a pretty rewarding experience, and hopefully I can write another post regarding its usage and the HA setup.\n","date":"29 April 2021","externalUrl":null,"permalink":"/2021/04/de-log-7-migrating-from-airflow-1-to-2/","section":"posts","summary":"Today I have successfully migrated my current Airflow setup from v1.10.14 to v2.0.2. 
This article will not be a very detailed step by step guide for upgrading, instead I will introduce the general migration step worked very specifically for my setup, and share some of the problems I encountered during the process, and finally some general feelings with Airflow 2.\n","title":"DE Log 7: Migrating from Airflow 1 to 2","type":"post"},{"content":"这一页还是用中文写吧，毕竟做饭的话还是以中餐居多。\n有空我挺喜欢做饭的，尽管不敢称自己很会做饭，但至少不会饿死，有机会退休了还是想去好好学一学。\n朋友们可自行通过此页面观察能吃的食物。\n请不要把这个页面当作是一个展览，更多这就是给你看下我会做啥。\n会不断加入新的食物。\nI like to prepare food when I am relatively free, though I would not say I am very good at it. My cooking is decent enough for me to live by, but I would love to receive formal culinary training after retirement (Yes I am planning retirement that early.)\nThis page will be used as a mini-menu for my home visitors. The photos / presentations are not in its best form, but I am not really a great camera guy. I would not really treat this page as a gallery.\n目录\n荤菜 鸡肉 猪肉 羊肉 牛肉 虾 鱼 素菜 鸡蛋 豆腐 茄瓜与番薯 时蔬 凉拌菜 汤 其他 主食 甜点 乱炖果腹类 西餐 荤菜 # 鸡肉 # 土豆烧鸡 # 最简单的烧菜了我觉得。发挥比较稳定，鸡腿肉和土豆的结合，腌制会放孜然粉。 可辣可不辣，可以加如豆泡炸豆皮等容易入味吸汁的配料。\n烤鸡翅 # 鸡中翅和各种调味料的混合。香料可自选。\n烧鸡翅 # 烧制版本，跟可乐鸡翅差不多但是我不喜欢甜的所以一般不做可乐鸡翅。\n手撕鸡 # 本来想要做口水鸡，但是没想到高压锅压过了\u0026hellip;\n双椒鸡丁 # 炸鸡丁 # 这个其实是半成品，可以加辣椒酱油，加椒盐，甚至咸蛋黄酱炒制。\n猪肉 # 红烧肉 （可以加土豆就成了土豆烧五花肉） # 粉蒸肉 # 辣椒炒五花肉 # 回锅肉的概念，因为我不咋吃大葱。\n辣椒炒肉 # 家传菜从小吃到大，顿顿都有。除了辣椒和肉，可以搭配豆干，土豆丝，胡萝卜丝等。\n水煮肉片 # 这里用的五花肉片，嫩里脊比较正宗但是我买不起里脊肉了\n红烧排骨 # 贼香\n红烧猪脚 # 猪脚啃起来!\n烤排骨 # 各种排烤起来都是吮指级别的好吃😋\n羊肉 # 土豆烧羊蝎子 # 萝卜烧羊蝎子 # 烤羊排 # 这里用的redmart羊排，最好不是冷冻的\n牛肉 # 牛肉的菜一般耗时都比较长。\n萝卜烧牛尾 # 土豆烧牛肉 # 鱼 # 棍子鱼 # 小时候经常吃的一种鱼，新加坡买不到原版棍子鱼但是有一种海鱼很像，试了一下新鲜的话还是很好吃。这个的 主要问题是如果不新鲜（因为海鱼都是死的），因为鱼太小了，内脏和腮会很难处理而保证鱼本身不破，出菜卖相不好\n香煎三文鱼 # 把握好时间基本不会出错，香。\n蒸三文鱼 # 很看三文鱼的品质只能说\n清蒸鲈鱼 # 烧个鱼 # 新鲜的鲳鱼\n冷冻多利鱼\n是个鱼就能这么烧，好吃的秘诀就是越新鲜的鱼越好。\n酸菜鱼 # 这个是用料包做的。但是理论上是可以做的。水煮鱼同理。\n虾 # 酸辣虾 # 用的两种韩式酱料\n泰式虾煲与粉丝 # 去了一趟泰国习得的菜品\n青椒酿虾滑 # 这个搭配真的很好吃。里面的滑可以各种换-鸡肉滑，鱼滑都不错。\n蒜蓉粉丝虾 # 家常蒸菜，蒜得用蛮多\n花荤 / 素菜 # 鸡蛋 # 西红柿炒鸡蛋 # 家常菜，不加醋，咸口。\n虎皮鸡蛋 # 这里的是红烧版本，也可以炒\n三文鱼蒸蛋 # 同理除了三文鱼也可以蒸其他东西\n青龙菜炒鸡蛋 # 口感跟韭菜很像的菜但没有韭菜的味道，很可。\n豆腐 # 砂锅豆腐 # 煎豆腐之后砂锅烧制而成，煎豆腐可选裹蛋液或者生粉蛋液，也可以直接煎炸\n麻婆豆腐 # (当然也可以做不辣🌶的)\n茄瓜与番薯 # 肉沫茄子 # 土豆泥 # 土豆泥为基础可混合各种味道。可以加肉末。\n土豆丝 # 我土豆丝也就是。。做的凑合\n时蔬 # 手撕包菜 # 为数不多吃的绿叶菜，可以做糖醋或者辣的版本。辣的更上手一些。\n蚝油生菜 # another 为数不多吃的绿叶菜\n蒜蓉小青菜 # 好了，我现在好像能吃挺多绿叶菜了🤭\n干煸四季豆 # 猪肉末和四季豆，小米辣。\n水煮四季豆 # 四季豆这次用煮的，在进行炒制，少油之后健康很多，口感出人意料的清脆，也不用担心中毒问题\n虾仁花椰菜 # 配色不错的\n凉拌菜 # 双椒皮蛋 # 就皮蛋。\n凉拌毛豆 # 凉拌的毛豆，佐料可以自选口味可以自选。\n菠菜拌粉丝 # 汤 # 鸡蛋肉饼汤 # 按照江西瓦罐汤的玩法来的。\n玉米排骨汤 # 玉米可以各种切换，什么山药莲藕土豆萝卜胡萝卜\n西红柿蛋汤 # (🍅+🥚汤，图可以意会， 同理紫菜蛋汤等)\n其他 # 主食 # Basil Fried Pork # 没啥难度就炒肉末，调味改成泰式的调味就OK\n馄饨 # 皮是买的（还不会白案），肉馅自己调的，现包的确实好吃一些，然后可以自己加料调味\n卤味 # 叉烧 # 忘了咋做的了，本来想做日式调味，没忍住加辣椒，反正出来味道很好\n卤牛肉 # 甜点 # 红枣小丸子 # 大颗红枣配上糯米小丸子，却不知中间的黄瓜是假借他人之手\n乱炖果腹类 # 家常香锅 # 说实话这个吃起来最舒服，配个气泡水绝妙。\n家常小龙虾 # 外面买点冷冻的小龙虾炒炒\n西餐 # Shrimp aglio e olio # 可以做有虾和无虾版本的。虾不是冷冻虾仁。\n中式三文鱼意面佐杂蔬 # 中式是因为调味和佐料以葱蒜辣椒为主，辅以酱油调味。且杂蔬做法雷同其实并不是特别搭这里，倒是不难吃。 应该用酸甜口的杂蔬会更好。\n","date":"30 January 2021","externalUrl":null,"permalink":"/food/","section":"Shadowsong's Personal Website","summary":"","title":"Food","type":"page"},{"content":"Data engineering jobs are really popular nowadays, mostly contributed by the rising demand of data insights and data driven decision making.\nData is essentially the go-to-crystal-ball these days whenever you want to check some random hypothesis just popped up in your mind while you are in the toilet. But it\u0026rsquo;s more than that, people start to realise they need a systematic way to approach any data work. 
You cannot leave random SQL snippets scattered around on your local computer, not version controlled, manually executing scripts every morning when you arrive at the office. It is also too inefficient to specifically hire some dedicated guys (SQL bois) to run SQL in your legacy Microsoft SQL Server for the entire organisation. Some engineering work and engineering practices have to be installed here; that is how data engineering started - because people got sick of emailing Excel files around.\nToday I want to talk about something more abstract. I call it Meta Data Engineering, because it sounds cool. Data engineering teams in most companies now, especially those that just hired their first data engineers, are busy creating pipelines, migrating data warehouses etc., which is the painful but necessary process when a company transitions to apply the data engineering concept. However, after being in practice for a few years, I have started to realise that scheduling pipelines is boring and troublesome - and the fear of boring, troublesome work is one of the best qualities you can find in software engineers. If you have read my previous articles, you will know I like automation a lot. I would like to replace any manually repeated work in my life as much as possible.\nThen you may be thinking: how can you automate the automation itself, when you have different pipelines, different types of tasks, and evidently different underlying data? There are actually commonalities you can abstract, so that you not only improve the efficiency and consistency of pipeline scheduling, but also hand the very responsibility of pipeline scheduling over to the users - data analysts and data scientists, who should be the best people to decide what kind and what shape of datasets they would like to have.\nThat is basically meta data engineering: you are supposed to build frameworks and services to support the scheduling of tasks and pipelines - the nowadays very popular Data Platform, or 数据中台 - not to do the actual scheduling itself. 
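To make that concrete, here is the kind of self-service contract I have in mind: an analyst declares a dataset, and a framework owns the scheduling. This is a made-up sketch; none of the field names or the register() entry point come from a real product.
```python
# A made-up example of "meta data engineering": the framework below is owned
# by data engineers, while the spec is all an analyst has to write.
from dataclasses import dataclass
from typing import List

@dataclass
class DatasetSpec:
    name: str          # target table, e.g. "mart.daily_orders"
    sql: str           # the transformation the analyst owns
    schedule: str      # cron expression
    checks: List[str]  # data quality checks to run after the load

spec = DatasetSpec(
    name="mart.daily_orders",
    sql="select order_date, count(*) as orders from core.orders group by 1",
    schedule="0 6 * * *",
    checks=["row_count > 0"],
)

def register(spec: DatasetSpec) -> None:
    """Hypothetical framework entry point: validate the spec, then emit a
    scheduled job (an Airflow DAG, a cron task, ...) that runs the SQL and
    the checks. The analyst never touches the scheduler directly."""
    raise NotImplementedError  # building this framework is the data engineer's job
```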
I think this is the future of data engineering, because how hard can scheduling an Airflow task get?\nSpeaking of the redistribution of responsibilities, I have thought about it briefly and here is the list that I think should stay with data engineers:\ninfrastructure design, provisioning (yes unless you have a DevOps team) schema level design of data lake, data warehouse table level design for base tables, the ground layer of the data warehouse (here we need the participation of contextual knowledge, when normalisation occurs) services, frameworks or more complex and engineering demanding processing pipelines to interact with data, analyse data or train ML models more efficiently Things that I think should be transferred to others (Data Analysts and Data Scientists)\ntable level design for upper layer data warehouse scheduling of pure SQL transformation transformation (very dependent on use cases) design and installation of data validation POC, provision and deployment of machine learning models For me personally I would like to have some tools that allow my teammates to build pipeline and schedule them in a GUI, like a back office, but unfortunately I am in and will probably always in a small team, so I might have to do it myself one day, or some bigger companies like Gojek could open source their framework.\nEssentially, the responsibilities of data engineers will start to converge with software engineers, but serve a different purpose, another sign is the existence of data product managers nowadays. You need somebody who understands how data can be used to reinforce the decision making process in the company, in the form of modern technology. Eventually, as the world inevitably becomes more digitalised, data will be just as important as any product itself, or even more for those companies that sell data analytics as a service.\nAnd for some time, a lot of data engineers, including me, will be responsible to write the pipelines that are not fun.\n","date":"19 August 2020","externalUrl":null,"permalink":"/2020/08/de-log-6-from-data-engineering-to-meta-data-engineering-the-future-of-data-engineering/","section":"posts","summary":"Data engineering jobs are really popular nowadays, mostly contributed by the rising demand of data insights and data driven decision making.\n","title":"DE Log 6: From Data Engineering to Meta Data Engineering - the future of Data Engineering","type":"post"},{"content":"I just reached Diamond 3 on American server in Starcraft 2. Yeah yeah I know NA server is easier compared to Asian but it is still a diamond, and I am pretty happy. For this particular season I have not played a single game of cheese, but I have, sadly to say, counter cheesed opponents. Most of my games are played with heavy reliance on macro.\nSo in this log I would like to record down some of the stuff I have learnt in the process, I believe the entire duration is about 2 months, with probably 200 Protoss games and 250 Zerg games (I reached diamond using Protoss).\nLearn to not panic # This is the most important lesson to me to be honest, and I believe it helps me a lot in decision making. In Platinum rank there are just so many cheese plays from Terran and my fellow Protoss players. Cannon rushes followed by cannon rushes, bunkers and proxy barracks, however the proud Zerg army seldom 12 pooled me (approval nodding). 
I used to play a lot of proxy Stargate when I was in Uni 1, that\u0026rsquo;s also the last time when I heavily invested my time in Starcraft 2, but then I realise it\u0026rsquo;s not the right way to play it. You need to know how to macro and how to win on the frontline.\nAnyway back to the topic, when people Cannon rush you, the least thing you want to do is panic. You make decisions, quickly get batteries and Stalkers and robos, do not build zealots if the cannon is up. If you are lucky and you have discovered it earlier, pull your probes immediately. All I am trying to say is, I used to have a shaky hand when I saw people cannon rush me, now I am completely numbed. Worst case? You lost, then learn to scout earlier the next game. Panic does not help at all.\nIn fact to avoid this cheese, I scout really really early, like immediately after I built the first pylon. So sometimes I can even fake a cannon rush to make the opponent pull away a bunch of workers from the mineral line (mind game is not really cheesy right?)\nKnowing what to do next # Despite I am practicing Beethoven Sonata Op.31 No.2 1st movement with insanely fast moving triplets as accompaniment on the right hand, my actual APM could not increase for a very long time because I realise I do not know what to do exactly at some given time in the game. Yes I know the build order for the first 3 mins, and then I am completely lost.\nSo I watched some videos online, and I watched with somewhat serious attitude, especially on the prism harassment. I then quickly realise how badly Platinum players could handle multiple battlefields. That\u0026rsquo;s how I get my MMR up in the last few days. You quickly expand to 4th base, 70+ probes, making a combo thats either colossi and gates unit, or archon and immortal, and start to build Stargates too. It becomes intuitive for me over time. Essentially I have a better sense knowing what to do next, is it to open my 4th base, or wrap in 8 units and move for an attack.\nAnother important thing about this is information gathering. You need to scout, you need to constantly scout your opponents\u0026rsquo; next play to anticipate the army combo, otherwise it\u0026rsquo;s like going to an exam without doing any revision.\nLearn to prioritise # So Starcraft is all about prioritise your resources - your pathetically limited APM, and turning them into a good army combo so that you just need to right click through the map. You just simply could not control everything in the game, just like life. So, make simple moves that frees up your APM, e.g.\nbuild batteries and cannons - you are not pro players, spend your money so that you could use your APM on more important things constantly use prism zealot harassments to throw away floating minerals - reduce the actual amount of APM your opponent could spend in the frontline optimize your hotkeys harass with shift actions, pick the most important battle and micro it - it\u0026rsquo;s just life, you do what matters most to you Practice is important # I have actually known this quite a while ago. Funny thing is I did not really take it seriously until I earned diamond in Overwatch a few years ago. I was playing 3 hours every night, just free standard multiplayer games, and I eventually got to diamond. I was thinking then, hmmmm, practices (even without systematic exercises) do help a lot, though I should have known this when I was doing my past year papers in Junior College. 
This kind of makes me feel better when practicing piano too, because I feel I am making progress every day, which is a good thing I guess.\nSo I have just conjured up some bullshit life lessons gained from playing this old dead game - Starcraft 2. With this fermented chicken soup, I bid you farewell and good night.\n","date":"13 August 2020","externalUrl":null,"permalink":"/2020/08/rt-log-4-about-starcraft-2/","section":"posts","summary":"I just reached Diamond 3 on the American server in Starcraft 2. Yeah yeah, I know the NA server is easier compared to the Asian one, but it is still a Diamond, and I am pretty happy. For this particular season I have not played a single game of cheese, though I have, sad to say, counter-cheesed opponents. Most of my games are played with heavy reliance on macro.\n","title":"RT Log 4: About Starcraft 2","type":"post"},{"content":"So I am trying out Notion; so far it looks ok. With all the fancy templates it looks like it can do much more than Bear, but I have also heard rumors about how it can get pretty messy once you have a lot of stuff in it. It\u0026rsquo;s free for personal use anyway, so I thought I would give it a try; at least the table structure is pretty useful for me.\nOn to something more related to the category topic: I am officially on my 8th consecutive day of piano practice today, which is probably my longest streak in maybe 16 years. I felt a slight resistance before starting today, but once I had put my hands on the keys it went away immediately. I realised I kinda enjoy practicing and perfecting a piece like solving a challenging problem:\ndivide and conquer (I have only progressed one page per piece in two weeks)\npractice slowly and then speed up\npractice with variation in tempo\npractice similar and basic techniques in order to be able to play what\u0026rsquo;s needed\nBack in the days when I was still playing in the orchestra, I imagined I was practicing for some PvE dungeon raids in FFXIV. You need to practice and fail so many times with 7 other strangers in order to clear the dungeon. Music practice, well, I guess just like other forms of practice, is the same. It\u0026rsquo;s about how close you can get to the perfect state in your mind.\nRecently I have also been asked to play with flatter fingers. This is a big change for me, since I have not had much systematic finger training ON PIANO, nor any form of finger practice, in a very long time. I am spending the first 10 to 15 mins every day warming up + practicing my fingers. I can\u0026rsquo;t really tell whether it has been helpful so far, or whether I am heading in the correct direction, but I think I am doing better.\nLast thing: my piano has started to make a weird metallic sound again. This is the sad thing about a second-hand piano; you bump into weird problems more frequently. I plan to ask for help when I move in a few months; hopefully I can get a better room with less reverb. My Beethoven Op 31 sounds horrible to my ears, but when I recorded it and played it back it actually sounded better. So I am pretty sure it is the reverb.\n","date":"10 August 2020","externalUrl":null,"permalink":"/2020/08/music-log-2-about-music-practices-and-flat-fingerings/","section":"posts","summary":"So I am trying out Notion; so far it looks ok. With all the fancy templates it looks like it can do much more than Bear, but I have also heard rumors about how it can get pretty messy once you have a lot of stuff in it. 
It’s free for personal use anyway, so I thought I would give it a try; at least the table structure is pretty useful for me.\n","title":"Music Log 2: About music practices and flat fingerings","type":"post"},{"content":"","date":"10 August 2020","externalUrl":null,"permalink":"/categories/music-log/","section":"categories","summary":"","title":"music-log","type":"categories"},{"content":"","date":"10 August 2020","externalUrl":null,"permalink":"/categories/piano-practice/","section":"categories","summary":"","title":"piano-practice","type":"categories"},{"content":"So this week I was pretty serious about my piano practice: I played for one hour every day except Friday, and it has become an important part of my daily routine.\nPreviously I was more casual about playing: pick a piece I like, practice it until it’s 60% - 70% fluent, try some Instagram recordings, and then continue to polish it until I lose the interest or the drive. Many of the pieces I liked and played are left in this awkward state: I can play through the piece, with mistakes here and there, but I have never been really satisfied with any of them.\nTo be fair, I do sometimes “feel” the music during certain passages, but I am not able to fully immerse myself in a piece while playing. I am not sure whether it is because I am bad at it (not able to control the music with finesse), or because of my over-reverbed environment, or simply my piano’s sound quality. I think most likely it’s a mixture of all three.\nStarting on Monday this week, I decided to try the lovely, small-hand-unfriendly waterfall etude by Chopin. It’s not the first Chopin etude I have attempted, but I feel I should at least complete this one, even at half speed. So I started, on Monday, at a speed of crotchet equals 70 for the first 10 bars of the etude, and kind of stabilized it at 95 by Thursday for the first 16 bars. It’s pretty slow, I know, both the tempo and the progress, but I did feel the joy of playing an etude, which I had never felt during my childhood while practicing Czerny.\nI will probably leave Schubert’s Impromptu 4 aside for a while and focus on this, because my hands get really tired after one hour of practice, and I kinda get fixated on it. I hope that by the end of this journey my arpeggios will be as smooth as a baby’s skin (I guess this will also serve as preparation for the 3rd movement of Beethoven’s Moonlight sonata).\nOn a side note, I do need to semi-polish my Moonlight 1st; I might need to play it on Saturday. I am not sure what else to play, honestly; I don\u0026rsquo;t want to butcher Mozart in front of other people though.\n","date":"25 July 2020","externalUrl":null,"permalink":"/2020/07/music-log-1-about-my-recent-obsession-with-chopin-etude-op.10-no.1/","section":"posts","summary":"So this week I was pretty serious about my piano practice: I played for one hour every day except Friday, and it has become an important part of my daily routine.\n","title":"Music Log 1: About my recent obsession with Chopin Etude Op.10 No.1","type":"post"},{"content":"I have been on some sort of movie spree for a few nights in a row, without dropping my piano practice of course, and I ran into this movie called Knives Out a few hours ago.\nI did not realise this was a mystery / crime-solving movie until I saw Harlan commit “suicide” (I thought it was a comedy, because I had been browsing a list of 2019 comedies a few days back).\nThe story is pretty twisted, and good. 
As a person who has some liking for deduction (to make myself feel smart), this story did trick me a bit. It has three layers of puzzles, or “donuts” to quote the movie, but I guess I SHALL NOT be a spoiler by talking about the story or the answers behind all those puzzles. I just want to talk about the movie itself.\nI am pretty surprised to see many familiar faces, even among the supporting roles. Besides Chris Evans and Daniel Craig, I even spotted Michael Shannon, who also played George Westinghouse in The Current War, which I happened to watch last night (I do not know why, but I am officially putting Katherine Waterston in my top 3 actresses after that movie). I guess I have watched enough movies to recognise some other famous actors. To be fair, I do not really like Daniel’s accent; it’s a bit unnatural in SOME parts of the movie, but it’s Daniel Craig so everything is great. I did not really recognise the female lead initially; she just looked familiar. It turns out she starred in Blade Runner 2049 too, which is another movie I find worth rewatching.\nA quick note on the story: besides the compelling plot and the mysteries that drive the characters, especially the female protagonist Marta, through their own development (I would say all of the supporting roles did a fantastic job depicting their not-so-lethal grudges against each other), I had to applaud the details in the movie. Well, I know it’s a crime-solving movie and they have to pay extra attention to details to win over the audience. Still, it does not feel forced, not even the cup at the end marked with My house, my rules, my coffee . Not to mention the word already in “Ransom, you came back again already”, and the word / name Ransom itself already kind of suggests something (I might be over-reading this, oops, that’s a spoiler).\nThis is a really good watch if you are up for some mind twisters; it’s light yet complex enough to be interesting. There is little to no thriller element, so it’s fine even if you are scared of bodies and blood.\n","date":"9 July 2020","externalUrl":null,"permalink":"/2020/07/movie-log-5-knives-out/","section":"posts","summary":"I have been on some sort of movie spree for a few nights in a row, without dropping my piano practice of course, and I ran into this movie called Knives Out a few hours ago.\n","title":"Movie Log 5: Knives Out","type":"post"},{"content":"","date":"9 July 2020","externalUrl":null,"permalink":"/categories/movie-log/","section":"categories","summary":"","title":"movie-log","type":"categories"},{"content":"This is a bold title, I know, and I am not even a data analyst.\nI am a data engineer, but I do wield my analytical skills frequently to solve both technical problems (reading a monitoring chart) and analytical problems (explaining why MAU drops).\nSo I think I should also write articles on analytics, just to make this skill less mysterious and to keep refreshing my memory - within my limited capabilities, of course.\nThis article will cover many high-level concepts in data analysis, instead of how to actually calculate the average of a column. In my opinion, the technical requirements in data analysis are not as high as in other data-related jobs; in fact, the value of data analysis lies in the analysis, not the technical difficulty. 
To extract useful, accurate insights and present them in an understandable way is far more meaningful than optimizing a query or finding a fancy way to calculate.\nData analysis can be generally categorised into descriptive data analysis (like getting the average or the 25th percentile) and inferential data analysis (like forecasting sales trends). In this article, I will ONLY cover the descriptive type.\nThis article covers some common approaches to data analysis in the industry. I wrote it because I feel data analytical skills will become an essential part of the future workforce’s skillset. Product managers need to know how to read data; engineers in chemical plants need to know how to read charts too.\nThis is purely based on my university courses and working experience. It might not be exhaustive. Feel free to comment and add on if you think something else is important too.\nOverview # Data analysis is actually quite intuitive if you apply your logic frequently. It is simply the logical conclusions that emerge naturally when you apply a series of templated logic to the data sets. This might sound confusing, but let’s get to it later on.\nA few very, very, very important things before we actually start to talk about data or data analysis, and what these actually mean in practice.\nCorrelation does not imply causation # This is a very commonly heard phrase, and it should act as a constant warning for you to keep doubting your conclusion until you think it is comprehensive and sound enough.\nFor (the famous) example, Amazon data shows that as diaper sales went up among 30-ish male adults, beer sales went up too. So the most direct conclusion you can draw is that diaper sales help beer sales, which is very weird. The trend does not really tell you the causation behind the correlation, and you certainly can’t blindly report to your boss “hey, I think they have a magic correlation and we should just trust it and make decisions based on it”. Well, worst-case scenario, if you have no other insights to offer and you can actually reproduce the trend in other similar user cohorts (e.g. same age and gender but a different region in the same country), it might be regionally generic and can be used as advice for decision making.\nThe main reason behind this weird trend is that 30-ish males can easily get bored when tending their babies, so they need beer.\nData is never perfectly clean # Real-world data can be extremely dirty; it contains noise that you will find both hilarious and annoying. You could have typos in string fields, a 105-year-old in the age column, or missing data too. After a while you will get very used to the existence of dirty data.\nThis can be attributed to a poorly designed collection process, or it can be an embedded attribute of your data. (How can you prevent people from entering their age as 105?)\nHowever, the handling is another big topic, as it depends on the nature of your data, the business context, and your analytical goal. For now let’s just remember that we need to be aware of this, since it can lead to disastrous conclusions.\nE.g. let’s say 70% of users in Asian countries like Japan and Korea did not fill in their gender, and you realise that the majority of users who did fill in their gender in these countries are female (90%), which severely deviates from the normal value (say 60%). 
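To make the trap concrete, here is a minimal SQL sketch (the users table and its columns are hypothetical names for illustration); the point is to surface the share of missing values right next to the distribution instead of silently ignoring the NULLs:\nSELECT country,\nCOUNT(*) AS total_users,\n-- share of users who left gender empty\nSUM(CASE WHEN gender IS NULL THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS missing_ratio,\n-- female share among those who actually answered (COUNT(gender) skips NULLs)\nSUM(CASE WHEN gender = \u0026#39;female\u0026#39; THEN 1 ELSE 0 END)::FLOAT / NULLIF(COUNT(gender), 0) AS female_ratio_among_filled\nFROM users\nGROUP BY country;\nA missing_ratio of 0.7 sitting next to a female_ratio_among_filled of 0.9 tells a very different story from a plain “90% of users are female”. 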
If you do not notice the missing 70%, you will wrongly conclude that most of the Asian users are female, when in fact it’s really just that Asian users tend to hide their gender.\nAsk the right question # The analytical goal should never be static, even if it is given by your boss. Yes, your boss might have given you a specific question like “what is the gender distribution of all users in Singapore, aged between 25 and 35”. What your boss was actually trying to answer is: “what is the best tag line to use in order to target the consumers?” He is trying to segment the user base to yield the largest segment, so the tagline can hit the majority of the users. Then you as an analyst should be able to understand your boss’ actual needs and start to perform your data magic to attend to them.\nLet’s say your product is a medium-priced lipstick: you would definitely first cut out all males among your users, and then start to categorise the remaining females by other attributes, like age. However, if your product is a high-end lipstick, you should consider including males with strong purchasing power, as it might be a very appropriate gift for their partners or friends. In this case, both products are similar in nature, but the strategies and the analytical goals are completely different.\nDeveloping a Framework First # Like I mentioned before, analytical tasks are not just about writing SQL and performing statistical calculations; in fact, those are just the last implementation step in the process.\nWhen I am doing analytical tasks, what is typically going on in my mind is the following:\nShape of the Data # What are the dimensions of the data, the shape of the data, and at which granularity is this data set?\nBy knowing this, you will have a rough grasp of the data sets, and of the potentially meaningful metrics you can eventually extract from them, especially once you have performed many similar tasks.\nIf you are constantly dealing with events-like data, meaning every record has a timestamp attached to it, the analytical metrics you can think of are commonly related to time windows, e.g. the number of unique occurrences within the first seven days after signup, or 30 days, depending on the nature of the product.\nMetrics Selection # What kinds of metrics / aggregates are interesting to the business context / stakeholders? The average number of sales over one year is not really useful if your product is seasonal.\nAt this step you will need to select your metrics based on your understanding of the domain, and even on the limitations of the data itself, to provide meaningful support for decision making.\nConclusion validation # Is the conclusion sound enough? How can I verify it from other aspects?\nThis is like solving mathematics problems in middle school: we solve the equation using the formula we memorised from the book, but we know there is more than one way to solve the question, so after finishing the entire paper, I would use another way to verify the answer.\nData analysis is the same: you will need to constantly verify your results even when they seem correct. 
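One mechanical way to do this is to recompute the same metric from an independent source and diff the two; a toy sketch (the events and logins tables are hypothetical):\nWITH dau_from_events AS (\nSELECT event_date, COUNT(DISTINCT user_id) AS dau FROM events GROUP BY event_date\n), dau_from_logins AS (\nSELECT login_date AS event_date, COUNT(DISTINCT user_id) AS dau FROM logins GROUP BY login_date\n)\n-- flag days where the two sources disagree by more than 5%\nSELECT e.event_date, e.dau AS dau_events, l.dau AS dau_logins\nFROM dau_from_events e JOIN dau_from_logins l ON e.event_date = l.event_date\nWHERE ABS(e.dau - l.dau) \u0026gt; 0.05 * e.dau;\nAny date that shows up here deserves a second look before the number goes into a report. 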
Intuition helps too: a conversion rate of 5% seems normal because that’s probably the industry average, but if your user retention is higher than the industry average, meaning your users are pretty loyal, the conversion rate should intuitively be more than 5%, as the willingness to pay is probably higher too.\nA lot of this is based on intuition and experience in the field; that’s why you need to constantly repeat this hypothesis-testing cycle until you are pretty convinced that the results / conclusions you draw from the metrics are sound.\nData Visualisation # If I am showing the result to somebody who is not really familiar with data, how should I present / explain it?\nThis is technically not a data analysis step, but it is too important to be left unmentioned. And the reason is simple too: if you can’t get anyone to understand your results, or you can’t explain them clearly in human words, either your results are wrong (overly complicated), or your results are wrong (just wrong).\nSo in fact none of this involves any technical details, e.g. how to write the SQL for a certain type of aggregation or which chart to use to visualise the result. This is because, whether it is SQL, Python or Excel, they are simply tools, and when you are approaching a data problem I would definitely recommend a top-down approach, meaning you should first get the overall analytical framework right.\nIf I really have to put weights on the skills, I would say technical skills determine whether you can perform the job at all, even if like a robot - so maybe 40%. What truly makes you a good data analyst and differentiates you from others is the remaining 60%, where you need to know how to ask the right question and tell an understandable story.\nConclusion # I think that’s enough said. I have not really talked about anything technical, meaning how to actually write a query, do a GROUP BY or anything like that. Here we are only discussing methodologies. 
I will write a more detailed article on how to actually execute your analysis with the previously mentioned framework, if someone really wants to learn it in practice using either Python or SQL.\n","date":"13 May 2020","externalUrl":null,"permalink":"/2020/05/da-log-1-a-quick-introduction-to-data-analysis/","section":"posts","summary":"This is a bold title, I know, and I am not even a data analyst.\n","title":"DA Log 1: A Quick Introduction to Data Analysis","type":"post"},{"content":"","date":"13 May 2020","externalUrl":null,"permalink":"/categories/data-analysis/","section":"categories","summary":"","title":"data-analysis","type":"categories"},{"content":"Less than two weeks to go before I change jobs, so after two years I am writing another summary.\nI am still working as a data engineer, and my understanding of the role has naturally deepened. The responsibilities differ from scenario to scenario, so no sweeping statement applies; this post therefore focuses on my personal gains and growth over these two years, and my outlook for the future.\nGains and Growth # I have to say the past two years brought a big technical improvement. One obvious sign: nowadays if you mention any open source project to me, my first reaction is whether I am interested in trying it, instead of fearing the potentially lengthy setup process like I did two years ago. I suppose my confidence in my self-learning and hands-on abilities has strengthened again.\nI once saw a claim that software engineers nowadays actually write very little code, and I agree. I think software engineers today are glue engineers. Many things you no longer need to write yourself; you use off-the-shelf services or open source projects, and what you do is glue them together. But as a software engineer you have to know what to glue and how to glue it, which is a bar of its own: how do you evaluate how good a service is, and how well it fits your needs? I feel that for most common architectural problems I can now find fairly good solutions, I am quite familiar with the common usages of AWS, and even doing a PoC of something new is not a hard thing anymore.\nIn Python I have developed a fairly stable code structure, and I have touched slightly deeper syntax such as async. Besides that, I can fairly easily understand and make common decisions about code structure, such as what goes where, how, and why. Efficiency-wise, haha, I think more and write more slowly, but I produce much less junk code than before.\nOn SQL and databases there is not much to say; not a lot of technical progress there. I am probably more fluent at writing complex analytical queries. Window functions have become fairly intuitive and I rarely need to stop and think, probably just because I have written so many.\nOn data warehousing I gained a lot, both in understanding the underlying principles (still not that strong, honestly) and in actual practice. I now have my own set of experience and designs for organising a data warehouse, which is decent enough. I believe it will slowly evolve and improve over time. All I can say is that I am quite happy with the current design, though it might not fit a different environment, which is why one still has to understand the essence.\nThe Job Market # With the pandemic and WFH, both interviews and calls with headhunters have been pretty convenient. Honestly, I wish all interviews could be like this in the future, since the information you get does not differ much from an on-site interview.\nOverall, data engineers are still in demand in Singapore lately. I think the reason is obvious: for the previous two years everyone was hiring data scientists, and after hiring so many they discovered that more than half of them were either building dashboards or building pipelines, and some data scientists do not come from an engineering background, so the pipelines they build can be further optimised. I am not at all surprised that demand for data engineers has risen. Especially when becoming data-driven is the direction for future enterprises, the infrastructure still depends on engineers. So I estimate data engineering will stay hot in Singapore for another year or two.\nSalary-wise, I have heard of fresh graduates getting 5k-6k or even 6.5k. Above that, the difference depends on personal ability and luck.\nRough Direction of Growth # In the mid term it is surely a Head of Data kind of position, because once you span DE, DA and DS, the natural next step is to lead a data team and build up a company’s entire data ecosystem. My new job should actually be quite similar to this, but it is still a small company, so there is no real notion of a “head”. Foreseeably, my weakness on the DS side will need strengthening in the near future; besides that, I need to catch up on modern data tooling. Kafka and Spark are two things I will definitely try to use as much as possible in the coming job, otherwise, without the chance to run them in production, I will never know what they are really about; you do not learn without stepping into enough pits.\nThoughts After Being Rejected by ByteDance # A while ago I interviewed with a few companies, including ByteDance. In the end ByteDance cut me at the algorithm round (at least that is what the interviewer said), so I have been pondering a question: what do I actually like, pure engineering, which is what ByteDance was hiring for, or its application in the data domain? I do not have an answer yet. I think I do have some passion for engineering, otherwise I probably would not enjoy management and construction games so much; the process of building something from nothing is exciting in itself. But data is also a very interesting thing. I even feel that so-called data thinking should be a life philosophy: once you believe in it, your rules of behaviour become consistent. The upside is that you save a lot of energy for other things and reduce randomness; the downside is that you reduce randomness and end up with a lot of time not knowing what to do (depending on how you see it). So when I confirm or falsify some ideas with data, it is still great fun.\nOf course, big companies split analytics and engineering into two separate hires, and I have not quite decided which direction to focus on in the long run. In a small company, covering both is certainly an advantage, but a big company does not really need that. From where I stand now, the analytics side should go further in the long run, which also indirectly confirms that data analysis skills may become as essential as Office skills in the future (I actually do not know Excel).\nSummary # I originally planned to discuss, from a macro perspective, why becoming data-driven is the future and hence data-related jobs will stay in demand for a while, but I think that is almost common consensus now, and I am too lazy to type, so never mind. That is it; see you at the next job hop.\n","date":"5 May 2020","externalUrl":null,"permalink":"/2020/05/rt-log-3-%E5%8F%88%E4%B8%80%E4%B8%AA%E5%B7%A5%E4%BD%9C%E6%80%BB%E7%BB%93/","section":"posts","summary":"Less than two weeks to go before I change jobs, so after two years I am writing another summary.\n","title":"RT Log 3: 
Yet Another Work Summary","type":"post"},{"content":"It was around my second year of junior high, which was 2008. As usual, I was planning to sneak onto the computer while my parents were not home, but by accident I started watching a movie instead. That was also the first time I watched a movie on a computer, and the movie was The Day After Tomorrow. I remember that before then, movies to me only meant the school occasionally organising a trip to the factory screening hall to watch some revolutionary films, and at that hyperactive age I naturally could not immerse myself in them.\nI have always remembered the storyline of The Day After Tomorrow. As a kid I watched it as a romance: the male and female leads overcome all sorts of difficulties, even natural disasters, and finally end up together. The male lead was still a teenager who had not yet played the villain in Spider-Man, and the female lead was still a girl who had not yet starred in Shameless. Tsk tsk tsk, this is a movie from 2004.\nNow, with a third of 2020 already gone, I watched it again. Even before watching, I knew I would want to say something. This time I was moved once more, though not by the romance between the leads, but by the policeman frozen stiff in the snow, the survivors emerging like ants onto the rooftops after the snowstorm, and the president’s speech about the third world.\nThe movie itself does not seem highly rated, and it was not very popular on Rotten Tomatoes either, but I like it a lot, because after watching it I felt moved, felt the joy that comes after resolution; anyway, I am not a film critic, so I do not need to take the movie apart in concise words. This might be the first movie I have watched this year? Oh, I think I watched Train to Busan around New Year. In any case, it has been a long while since I went to a cinema.\nThis is indeed an eventful year: the fires in Australia and the Amazon, the Antarctic ice shelves melting, not to mention the virus, which I suspect may not even end within this year. I do not think weather as extreme as in the movie will actually appear, but I can clearly feel that last year was hotter than the years before. Mind you, this is Singapore; when is it not hot?\nMany scenes are really similar to reality, and yet strikingly the exact opposite. The story is set in the US, and the vice president runs the same economy-versus-evacuation argument as Trump, although the fatality rate of the virus today is no match for the eye of the storm. But who knows whether such a day will come.\nA colleague told me today that some people in the US are apparently already protesting to get back to work. Standing at one tiny point of the whole picture, I am naturally in no position to make any judgement. I just feel that humans are as foolish (or as wise) as we were a hundred-odd thousand years ago; it is because we only have interests in our eyes that we stand at the top of the food chain. Perhaps a few hundred years from now we will have split into different races, isolated from each other, and perhaps future biologists will further subdivide Sapiens, with only one last lineage of the descendants of Sapiens remaining (that would actually make a nice premise for a novel).\nThe growth of the Internet has made life more convenient, but what do people do with the time saved? Has the building of spiritual civilisation borne fruit? Actually, somewhat. Have there been breakthroughs in science? Also yes. Can we free up more agricultural labour? Can we automate and industrialise farming?\nI do not know why I worry about these things. When I say I care about world affairs, I am really not joking, haha; “thought without learning is perilous” describes exactly this. I hope that within my lifetime I get the chance to contribute something to new energy, new agriculture and the medical industry.\nNo particular theme, pure stream-of-consciousness rambling, saying whatever came to mind. Time to sleep.\nI still have a lot of confidence in humanity. Despite many grievances about today’s universal values, I still think the development is healthy on the whole. At the end of the day, comrades, we must keep strengthening the building of spiritual civilisation.\n","date":"19 April 2020","externalUrl":null,"permalink":"/2020/04/movie-log-4-the-day-after-tomorrow/","section":"posts","summary":"It was around my second year of junior high, which was 2008. As usual, I was planning to sneak onto the computer while my parents were not home, but by accident I started watching a movie instead. That was also the first time I watched a movie on a computer, and the movie was The Day After Tomorrow.\n","title":"Movie Log 4: The Day After Tomorrow","type":"post"},{"content":"","date":"7 March 2020","externalUrl":null,"permalink":"/tags/aws/","section":"tags","summary":"","title":"AWS","type":"tags"},{"content":"","date":"7 March 2020","externalUrl":null,"permalink":"/categories/dev-ops/","section":"categories","summary":"","title":"dev-ops","type":"categories"},{"content":"This article describes the process of setting up Fargate tasks with only a private IP, using images stored in a private registry.\nDisclaimer: I am not a security expert; in fact, my computer networking knowledge was picked up during work. I realized that as the need to design your own architecture grows, I will inevitably fill in the holes in my computer science theory. I merely receive a goal, research the tools and the theory behind them, and execute to the best of my ability.\nBefore covering the main steps, there are some important concepts one needs to know in AWS, besides the Fargate-related concepts.\nVPC\nSubnet\nRoute table\nInternet Gateway\nNAT Gateway\nAll related concepts can be found in the official AWS documentation.\nOverview # For Fargate to pull images from a private registry, it needs Internet access. I initially thought that ECR, the private registry provided by AWS, would provide some sort of intranet access so we would not have to go through the Internet, but I did not find such a route.\nFor a task running in Fargate to access the Internet, it either needs a public IP, which can be enabled during service configuration, or, more securely, it must reside in a private subnet that routes outbound Internet requests to a network address translation (NAT) gateway, and then to the Internet Gateway (IGW).\nThis approach obviously provides better network control, monitoring and security protection.\nConfigure your network (Optional) # This step is optional if you have already done it. According to the AWS docs here
According to AWS docs here\nIf you are using Fargate tasks, in order for the task to pull the container image it must either use a public subnet and be assigned a public IP address or a private subnet that has a route to the internet or a NAT gateway that can route requests to the internet\nSo configure this setup, you need to do the following:\nThe following is just a steps paraphrasing of the AWS docs here Choose (or create) a VPC, everything after will be down within this VPC\nCreate a subnet, name it xxx-public to indicate this will be your public subnet\nCreate another subnet, name it xxx-private\nCreate a NAT gateway, attach it with an elastic IP\nCreate an Internet gateway\nCreate two route table, one routes internal requests to the NAT gateway (RT 1), one routes internal requests to the Internet gateway (RT2)\nAssociate (In Route Tables side tab, Subnets Association) xxx-private subnet to RT1 (Bottom subnet in the diagram above, where private servers reside in)\nAssociate xxx-public to RT2 (Top subnet in the diagram, where Internet facing servers reside)\nLaunch applications in Fargate # After this is successfully setup, you can start to configure your Fargate SERVICE, yes once again the network layer is configured at service level, not task level.\nThere are three things in particular that require some attention if we want to disable the public IP, the rest is going to be as usual.\nSelecting the appropriate subnets: Instead of randomly select your subnets or select across availability zones, you need to select the subnets that are attached to NAT, so you could\nDisable your Auto-assign Public IP option in the settings, this will make your Fargate Task inaccessible and receive calls from the Internet\nConfigure your ALB / NLB - This has to be in the public subnet and facing the IGW, so you load balancer could take receive HTTP requests and route them to the private Fargate instance\nConclusion # This is actually pretty easy if one understands how network works in AWS, and for those are not familiar with computing networks (like me), reading the AWS docs is enough to complete all these setups too.\n","date":"7 March 2020","externalUrl":null,"permalink":"/2020/03/devops-log-2-setup-fargate-service-without-public-ip/","section":"posts","summary":"This article will describe the process to setup Fargate tasks with only private IP, using images stored in a private registry.\n","title":"DevOps Log 2: Setup Fargate Service without Public IP","type":"post"},{"content":"","date":"7 March 2020","externalUrl":null,"permalink":"/tags/fargate/","section":"tags","summary":"","title":"Fargate","type":"tags"},{"content":"","date":"7 March 2020","externalUrl":null,"permalink":"/categories/software-engineering/","section":"categories","summary":"","title":"software-engineering","type":"categories"},{"content":"The battle against the fearful death himself is an everlasting one, and today we are, after seventeen years, effortless watching another heavy strike laid upon us. Well, to be more specific, mostly the people lived in China and who had been to China during this unfortunate period.\nIt saddens me to see that we had not learnt enough from the previous fights to contain it more effectively this time. 
And it saddens me even more to see doctors, who have fought valiantly on the front lines, who deserve the most protection, support and respect, being stabbed and murdered in broad daylight.\nI have always considered myself a selfish person; that means I do not normally resonate with tragedies that do not impact me directly. A nicer way to put this when I am actually talking with acquaintances: I am politically neutral and detached.\nA few weeks before the outbreak I went to a hospital for a health screening. There were a few things that made me feel strongly uncomfortable.\nThe light in the hallway was not working, and the paint on the walls was full of scratches\nWhile I was having my teeth polished, waiting patients and relatives were standing there blatantly, probably observing my cleaning process\nI don’t really care about my privacy that much in this particular case, nor about the interior design of the hospital. The first point shows that, IMHO, the budget for the (public) hospital is just not sufficient to even consider a renovation, and the second point scares me the most: patients, or even random people, could just walk into a room that is full of medical equipment. Yes, I know the stomatology department can probably afford to be less strict, which probably proves my first point regarding the limited medical resources in China.\nLet’s hope for the best, and let’s hope this serves as yet another warning about the rising tensions between patients and doctors.\n","date":"22 January 2020","externalUrl":null,"permalink":"/2020/01/rt-log-2-about-wuhan-virus/","section":"posts","summary":"The battle against the fearful Death himself is an everlasting one, and today we are, after seventeen years, helplessly watching another heavy strike laid upon us. Well, to be more specific, mostly upon the people living in China and those who had been to China during this unfortunate period.\n","title":"RT Log 2: About Wuhan Virus","type":"post"},{"content":"I took a database design course back in my university days, though I skipped almost all of the lectures, as I was self-learning the same matter during my first internship, in a much more practical manner. I did not even know the word OLAP back then. However, I am not dismissing the importance of data modelling in data engineering. On the contrary, data modelling is one of the important skills if you want to be a data engineer.\nThis article aims to address some common problems in the following types of columns found in analytical tables\nMetric Column\nFilters / Segmentations Column\nTime Column\nAnalytical Table Types # The word analytical suggests that this article will not talk about ER diagrams or transactional-style, normalised data modelling. In fact, there are in my opinion two broad types of analytical tables in general.\nSnapshot\nHistorical\nSnapshot Tables # Snapshot tables are refreshed during each pipeline run, making them up to date with the raw data. Any changes in the data made by users are safely included into such tables without much implementation work.\nHistorical Tables # Historical tables are rather nasty, but extremely important and common for monitoring and time-dependent reporting. The word historical suggests that they preserve certain attributes of the raw data.\nFor example, say you are producing a report for a client and you have computed some important numbers. In the next reporting period, your user base grows a little, and that important number inevitably changes if you are using a snapshot-type table. 
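To make the contrast concrete, here is a minimal sketch (table and column names are hypothetical): the snapshot table is rebuilt wholesale on every run, while the historical table freezes one row per reporting date:\n-- snapshot: rebuilt from the raw data on every pipeline run\nTRUNCATE report_snapshot;\nINSERT INTO report_snapshot\nSELECT client_id, COUNT(*) AS active_users FROM raw_users WHERE is_active GROUP BY client_id;\n-- historical: append-only, one frozen row per reporting date\nINSERT INTO report_history\nSELECT CURRENT_DATE AS report_date, client_id, COUNT(*) AS active_users\nFROM raw_users WHERE is_active GROUP BY client_id;\nNumbers in report_history never move once written, which is exactly what a periodic client report usually wants. 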
Of course this drift might be ok and understandable for some clients, especially when they were trained in engineering, but sometimes it’s just easier for everybody if we abstract the complex but unnecessary aspects of data preparation away from the valuable and respected clients.\nThey are happy, we are happy. Data engineers need to do work.\nThis is just one example; of course there are other use cases, which I don’t really want to spend my brain power on - you get the point.\nOther types # There are more than just these two types of tables in a typical data warehouse. You could have\nLookup Tables - static tables that do not get updated often, or are updated manually, e.g. hard-to-extract data, categorical hierarchies\nExploration Tables - temp tables created by your dear friends and valued colleagues\nTime related Columns # This is very straightforward: a date / timestamp column in the table will usually be converted into the x-axis for time-series visualisations, or act as an attribute waiting for further aggregation. Not much to talk about, but there are a few traps and tricks.\nHowever, depending on the purpose of the column, we should treat it differently.\nUsed as a metric - Localisation # It is a convention that all timestamps stored in a database, if not specified otherwise, should be treated as UTC. In fact, many of the columns in raw data will often be in the format of BIGINT, the epoch. Since time data comes in different formats, we need to standardise it for analytical use.\nAlso, depending on the analytical usage, one should consider whether to include the UTC timestamp or the localised timestamp.\nFor example, in the immediate downstream after the extraction of raw data, we would normally prepare some big analytical table that does not contain much aggregation, but mostly transformation and cleaning of columns. This table, which serves as a foundation for other analytical tables, should contain both time columns.\nHowever, if you are building analytical tables for monitoring purposes, ideally you should use the UTC time and omit the other. This is because monitoring is often associated with your infrastructure: to identify when a bug might have caused a drastic drop in MAU, you link the UTC timestamp in the table with your system log time, or convert once more to YOUR (not the users’) local time and link it with your git commit timestamps.\nWhen building dashboards to answer a time-dependent question, say when people normally sleep, you should obviously use local time. This is important but often neglected in practice.\nUsed as a label - Weekly and Monthly # Using time as a label is so common that I will not go deep into examples. Here it is more about convention within a company: some people in the world view the start of the week as Sunday, others as Monday. So I guess you should pick one and stick to it when creating weekly labels. Monday is more likely the default choice of database date_trunc functions.\nAs for the monthly label, it makes more sense to show the first date of each month, as on the chart you definitely do not want to see 30th Sept as your label. However, the interpretation of 2019-06-01 within the dashboard needs to be clarified: does it mean metrics generated from data before 2019-06-01, or before 2019-07-01?\nFilters / Segmentations # Filter / segmentation columns such as countries and languages play a crucial role in making the table more explorable. 
The difficult part of this is to anticipate the filters that make sense for your business, very much like setting the distkey and sortkey in AWS Redshift - because if you want to change them later, it won’t be so easy. You will need to recreate the table by changing a few column selections, and the downstream tables too if needed.\nIf the recreation part is not yet clear, think about an analytical table that has three columns - (DATE | COUNTRY | DAU). This allows you to explore DAU numbers by country. Now your boss would like to see the distribution by language simultaneously; you will need to recreate the table by adding a new column, which is troublesome.\nNote the word simultaneously: here we are actually talking about combinational usage of filters, and it has always been the case for me during work too. However, recently I have started to wonder: do we really need combinational filters in all cases? Does the information that users who are active in French (FR) also use the language Chinese (zh) matter for your business? Without such edge cases, which in some business contexts are insignificant, the engineers could save some work by transforming the table into something like - (DATE | FILTER NAME | FILTER VALUE | DAU)\nand do\nSELECT date , dau FROM table WHERE filter_name = \u0026#39;country\u0026#39; AND filter_value = \u0026#39;US\u0026#39; Less engineering work in the long run, and easier to scale.\nIt really depends on the business usage.\nMetric Column # This is your precious insight. By working very hard at gathering raw data, you finally have your processed / aggregated / transformed (whatever you would like to call them) insights, or the intermediates before the final insights.\nI would like to examine this from a few different angles.\nData Window size # I use the term data window because of window functions, frequently used in aggregating operations.\nCASE WHEN size = 0 # This is static data, simply inherited table by table, CTE by CTE, into the final data set. However, exactly because it is static, you should consider abstracting it into a lookup table if possible and logical.\nCASE WHEN size = all # This is computed based on the entire raw data set. Not much to talk about, except that if you want to be strict, truncate the head of your raw data during the initial selection, because replication of raw data does not all happen at UTC midnight.\n(You could argue that my extraction pipeline will make sure no trailing data is imported into the DW, but let’s just ignore tiny details, ok?)\nCASE WHEN size = other time interval # You could also compute metrics based solely on the previous complete day / week / month; for this you will need to specify the time range in the WHERE clause.\nPersonally, I prefer to use something like the following for daily. It’s more readable in my opinion, but other people use INTERVAL , DATEDIFF (which is an absolute value!) and others. Be careful with the inclusion and exclusion on each side of the time interval. I do not recommend BETWEEN because it is inclusive on both sides.\nWHERE creationtime \u0026gt;= \u0026#39;STRING_DATE\u0026#39;::DATE AND creationtime \u0026lt; \u0026#39;STRING_DATE\u0026#39;::DATE + 1 For weekly and monthly, date_trunc or a variation of it is good enough.\nCASE WHEN you need window functions # There are quite a few window functions, so I won’t try to categorise them and list them one by one (not really helpful). It’s just that these computations can be slow and memory-intensive. 
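As an illustration, a 7-day moving average of DAU - sketched here against a hypothetical daily_dau table - is exactly the kind of window computation that gets heavy on large tables:\nSELECT event_date,\n-- average over the current day plus the six preceding days\nAVG(dau) OVER (ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_7d_avg\nFROM daily_dau;\nCheap over a small daily aggregate, but the same pattern applied over billions of raw event rows is a different story. 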
So you might need to consider creating temp tables where needed, and dropping them later.\nMaturity # I would like to talk about the maturity of the computed data. I define maturity as how close it is to being used in the final presentation (BI dashboards). For example, say we would like to compute some ratio: probably we are going to compute a denominator and a numerator separately, then use a / to combine them within the same SQL clause.\nYes, you have finished your task, with filters, say the DAU / MAU ratio in the US.\nHow about now I ask you: can you tell me the global DAU / MAU ratio, and how does it differ from the one in the US?\nThat’s when I realised I should probably not aggregate the data all the way to full maturity. So instead of saving the ratio, I choose to save the DAU and MAU counts separately, which are integers and can be further aggregated if I want a global count (using a SUM() in the next CTE).\nThis step can even be done at the dashboard view level, depending on how powerful your data warehouse engine is. It is usually pretty fast, but if unfortunately it is slow, then I would say save both the ratio and the intermediate columns in the table.\nOrientation # Finally, about how to save those columns. (This post is really long and I got very tired mentally at this point.)\nThere are two obvious ways to save them:\nHorizontally - each metric takes a column\nPros: intuitive, makes the read query pretty readable, easier to access multiple features within a query\nCons: less scalable, as every additional feature requires a new column on the table\nSELECT metric_name FROM table WHERE … Vertically - each metric is stored row by row as a key-value pair, like a Python dictionary\nPros: easier to scale\nCons: one extra WHERE clause; computations among the stored metrics within a query are troublesome, needing either a CTE / subquery or another operation to transform them first.\nSELECT generic_value FROM table WHERE generic_name = ‘\u0026lt;METRIC_NAME\u0026gt;’ AND … In practice, I generally prefer the second approach, as I believe writing separate CTEs is much, much easier than regenerating the table, from an engineering point of view. However, if, let’s say, the metric requirements are really, really static, then by all means use the first approach.\nThe above discussion only considers the usage of pure SQL. If we want to go one step further and include another scripting language, we could dynamically generate the features in a columnar format. That would be another post, I guess.\nConclusion # I don\u0026rsquo;t want to conclude anything, see you next time.\n","date":"9 January 2020","externalUrl":null,"permalink":"/2020/01/de-log-5-thoughts-on-analytical-tables/","section":"posts","summary":"I took a database design course back in my university days, though I skipped almost all of the lectures, as I was self-learning the same matter during my first internship, in a much more practical manner. I did not even know the word OLAP back then. However, I am not dismissing the importance of data modelling in data engineering. On the contrary, data modelling is one of the important skills if you want to be a data engineer.\n","title":"DE Log 5: Thoughts on Analytical Tables","type":"post"},{"content":"We have all heard of the term ETL. 
If you are working in the data field, you might have been asked to do some sort of ETL work regardless of your actual job description.\nHowever, if we think about it from a slightly different perspective, we could “invent” a different methodology for harvesting analytical data sets - we load raw / intermediate data into the data warehouse first, and then transform it in place within the data warehouse.\nI believe both approaches are useful, and I would further argue that the ELT approach is more advantageous in certain business contexts. In this blog post, I am going to first describe the fundamental difference between the aforementioned approaches, and then present three advantages that the ELT approach may possess over the other.\nOverview # By examining and reducing the problem further, we can see that the fundamental difference between ETL and ELT is a result of the uneven distribution of provisioned computational (data transformation and aggregation) power. The traditional ETL approach places more emphasis on transformations outside of the data warehouse, such as Spark jobs or even simple Python scripts using various libraries. The ELT approach, on the other hand, invests much more resources in the data warehouse itself, and transformations of data sets often occur within the data warehouse in the form of SQL scripts. Transformations that are not executable solely within a database engine will not be discussed here, and should ideally follow a traditional ETL approach.\nIn my limited experience, I believe ELT tends to work better in a versatile, flexible, bootstrapped environment, where a break from workflow is commonly acceptable to achieve the goals with minimum cost, while ETL works better in more resourceful, stable business lines, where strict workflows are respected (for good reasons).\nWe must also recognize that eventually, as the business grows, more budget and hence more computational power will be provisioned on both ends, and the problem that was originally a choice between the two approaches will inevitably evolve into: for any particular transformation job, where should I put it? I do not want to dive too deep into that issue because it deviates too much from the original discussion. Let’s come back and focus again on the choice between ETL and ELT; here are my two (three) cents:\nELT has a lower development cost, but the marginal gain diminishes as the company grows # Transformation jobs tend to be just SQL scripts at the beginning. By relying on the strong computational power within the data warehouse, data analysts or data engineers can easily implement the pipelines by testing complex SQL scripts directly against the pre-loaded raw data in the data warehouse, provided that the employees know SQL, which is commonly required in the job description anyway. This allows new startups to distribute the pipeline-building workload across the entire data team.\nBesides development, debugging is also much easier, as you can directly query intermediate CTEs/tables to check the validity of the data. 
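For instance, a quick spot check on an intermediate table might look like this (table and column names are hypothetical):\n-- duplicates on what should be a unique (user_id, event_date) grain\nSELECT user_id, event_date, COUNT(*) AS dup_rows\nFROM stg_daily_activity\nGROUP BY user_id, event_date\nHAVING COUNT(*) \u0026gt; 1;\nAn empty result means the grain is intact; any rows returned point straight at the offending keys. 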
You can even automate data healthchecks using plain SQL, again achievable by analysts, which is great for smaller teams.\nAs the data size grows and the analytical data layers in your data warehouse become deeper, one will have to shift the stable (less susceptible to changes) transformations and the cold (less frequently accessed) intermediate data sets outside of the data warehouse, thus slowly rebalancing the distribution of transformation jobs and the usage of the ETL/ELT approaches.\nELT has a lower operational cost, but it requires a rather big refactoring once things start to get too slow # Let’s be honest, which one is easier: to hire a decent SQL guy/girl, or a Python/Java guy/girl? Ok, you may say that’s not a fair and meaningful comparison at all. Then how about the ops work involved in code deployments? In the initial phase of your data warehouse, an ELT approach can save a large amount of operational cost, which then translates into less salary expense and fewer debugging hours. Ask the tech lead or the CTO to set up an auto-deployed SQL repo for your scheduler. The steps for a person working on a pipeline should only involve:\nDeveloping the new SQL script (pipeline node)\nTesting the SQL in the DW\nCommitting to the repo and submitting a PR\nOnce the reviewer merges the PR, the pipeline should be automatically updated. Code written this way is much more isolated (SQL only) and, again, it is easier to involve people who were not originally trained in the software engineering field.\nHowever, once the size of the intermediate data becomes too big to store in the data warehouse, a big refactoring is needed to transform those SQL scripts into something like Spark jobs and store the data in S3.\nELT has a lower exploration cost # I guess this point is relatively weak, because it is of course easier to explore when you have less data and everything is queryable within the same database. But imagine we only loaded the final data sets required by internal reports / external data products into the DW and left the intermediate tables in S3: we would not be able to easily check some random idea, validate some random hypothesis, or do all kinds of experiments with just one SQL script. Sure, I guess one could build a view and query against all the Parquet files in S3, but let’s be honest, it’s not that realistic for a smaller team to set this up (other priorities always cut in line).\nAgain, we need to acknowledge one thing: the convenience ELT provides comes with a cost. Once intermediate data and pipelines are shifted outside of the DW, the visibility and the accessibility of raw data inevitably decrease - or, in a well-managed situation, not so much the visibility but mostly the accessibility. However, one could also argue that the visibility of such vast data sets may be overwhelming and not meaningful for everyone, a point which I would concede. Personally, I am trying to perfect a methodology that allows analytical data to be built layer by layer, so that access rights are not granted excessively.\nConclusion # I just came up with this post on the train, knowing myself well enough that I will most likely not revisit this article in the future even if my thoughts above have changed. What I can say is that I have practiced the ELT approach (thanks to my Boss) for more than a year in a very small team with just one data engineer, and I kind of feel this is the reason why I am still able to keep everything flowing even though I am the only data engineer in the team. 
It works pretty nicely with small teams, especially if you would like to save the money for your Spark cluster until you really need it.\n","date":"1 January 2020","externalUrl":null,"permalink":"/2020/01/de-log-4-etl-vs-elt/","section":"posts","summary":"We have all heard of the term ETL. If you are working in the data field, you might have been asked to do some sort of ETL work regardless of your actual job description.\n","title":"DE Log 4: ETL vs ELT","type":"post"},{"content":"This short article aims to reproduce my current workbench, using color palettes of my choice.\nFinal Look: # iTerm2 - terminal\nVS Code\nResources to configure:\niTerm2\nagnoster.zsh-theme\nVS Code\nVS Code Configuration # This requires installing the Menlo for Powerline font. It has to be a monospace version.\n{ \u0026#34;terminal.integrated.fontFamily\u0026#34;: \u0026#34;Menlo for Powerline\u0026#34;, \u0026#34;workbench.startupEditor\u0026#34;: \u0026#34;newUntitledFile\u0026#34;, \u0026#34;terminal.integrated.shell.osx\u0026#34;: \u0026#34;/bin/zsh\u0026#34;, \u0026#34;window.zoomLevel\u0026#34;: 1, \u0026#34;editor.fontFamily\u0026#34;: \u0026#34;Menlo for Powerline\u0026#34;, \u0026#34;workbench.colorTheme\u0026#34;: \u0026#34;Solarized Light\u0026#34;, \u0026#34;editor.fontSize\u0026#34;: 14, \u0026#34;workbench.colorCustomizations\u0026#34;: { \u0026#34;terminal.background\u0026#34;: \u0026#34;#594f4f\u0026#34;, \u0026#34;terminal.foreground\u0026#34;: \u0026#34;#e5fcc2\u0026#34; }, \u0026#34;terminal.integrated.lineHeight\u0026#34;: 1.2 } Agnoster Theme # Locate your agnoster.zsh-theme file by searching the oh-my-zsh home; there is a themes folder.\nReplace it with the following config.\nNote: the color numbers (256-color system) can be customized.\n# vim:ft=zsh ts=2 sw=2 sts=2\n#\n# agnoster\u0026#39;s Theme - https://gist.github.com/3712874\n# A Powerline-inspired theme for ZSH\n#\n#\n# README\n#\n# In order for this theme to render correctly, you will need a\n# [Powerline-patched font](https://gist.github.com/1595572).\n#\n# In addition, I recommend the\n# [Solarized theme](https://github.com/altercation/solarized/) and, if you\u0026#39;re\n# using it on Mac OS X, [iTerm 2](http://www.iterm2.com/) over Terminal.app -\n# it has significantly better color fidelity.\n#\n#\n# Goals\n#\n# The aim of this theme is to only show you *relevant* information. Like most\n# prompts, it will only show git information when in a git working directory.\n# However, it goes a step further: everything from the current user and\n# hostname to whether the last call exited with an error to whether background\n# jobs are running in this shell will all be displayed automatically when\n# appropriate.\n### Segments of the prompt, default order declaration\ntypeset -aHg AGNOSTER_PROMPT_SEGMENTS=(\nprompt_status\nprompt_context\nprompt_virtualenv\nprompt_dir\nprompt_git\nprompt_end\n)\n### Segment drawing\n# A few utility functions to make it easy and re-usable to draw segmented prompts\nCURRENT_BG=\u0026#39;NONE\u0026#39;\nif [[ -z \u0026#34;$PRIMARY_FG\u0026#34; ]]; then\nPRIMARY_FG=black\nfi\n# Characters\nSEGMENT_SEPARATOR=\u0026#34;\\ue0b0\u0026#34;\nPLUSMINUS=\u0026#34;\\u00b1\u0026#34;\nBRANCH=\u0026#34;\\ue0a0\u0026#34;\nDETACHED=\u0026#34;\\u27a6\u0026#34;\nCROSS=\u0026#34;\\u2718\u0026#34;\nLIGHTNING=\u0026#34;\\u26a1\u0026#34;\nGEAR=\u0026#34;\\u2699\u0026#34;\n# Begin a segment\n# Takes two arguments, background and foreground. Both can be omitted,\n# rendering default background/foreground. 
prompt_segment() {\nlocal bg fg\n[[ -n $1 ]] \u0026amp;\u0026amp; bg=\u0026#34;%K{$1}\u0026#34; || bg=\u0026#34;%k\u0026#34;\n[[ -n $2 ]] \u0026amp;\u0026amp; fg=\u0026#34;%F{$2}\u0026#34; || fg=\u0026#34;%f\u0026#34;\nif [[ $CURRENT_BG != \u0026#39;NONE\u0026#39; \u0026amp;\u0026amp; $1 != $CURRENT_BG ]]; then\nprint -n \u0026#34;%{$bg%F{$CURRENT_BG}%}$SEGMENT_SEPARATOR%{$fg%}\u0026#34;\nelse\nprint -n \u0026#34;%{$bg%}%{$fg%}\u0026#34;\nfi\nCURRENT_BG=$1\n[[ -n $3 ]] \u0026amp;\u0026amp; print -n $3\n}\n# End the prompt, closing any open segments\nprompt_end() {\nif [[ -n $CURRENT_BG ]]; then\nprint -n \u0026#34;%{%k%F{$CURRENT_BG}%}$SEGMENT_SEPARATOR\u0026#34;\nelse\nprint -n \u0026#34;%{%k%}\u0026#34;\nfi\nprint -n \u0026#34;%{%f%}\u0026#34;\nCURRENT_BG=\u0026#39;\u0026#39;\n}\n### Prompt components\n# Each component will draw itself, and hide itself if no information needs to be shown\n# Context: user@hostname (who am I and where am I)\nprompt_context() {\nlocal user=`whoami`\nif [[ \u0026#34;$user\u0026#34; != \u0026#34;$DEFAULT_USER\u0026#34; || -n \u0026#34;$SSH_CONNECTION\u0026#34; ]]; then\nprompt_segment $PRIMARY_FG default \u0026#34; %(!.%{%F{yellow}%}.)$user@%m \u0026#34;\nfi\n}\n# Git: branch/detached head, dirty status\nprompt_git() {\nlocal color ref\nis_dirty() {\ntest -n \u0026#34;$(git status --porcelain --ignore-submodules)\u0026#34;\n}\nref=\u0026#34;$vcs_info_msg_0_\u0026#34;\nif [[ -n \u0026#34;$ref\u0026#34; ]]; then\nif is_dirty; then\ncolor=220\nref=\u0026#34;${ref} $PLUSMINUS\u0026#34;\nelse\ncolor=167\nref=\u0026#34;${ref} \u0026#34;\nfi\nif [[ \u0026#34;${ref/.../}\u0026#34; == \u0026#34;$ref\u0026#34; ]]; then\nref=\u0026#34;$BRANCH $ref\u0026#34;\nelse\nref=\u0026#34;$DETACHED ${ref/.../}\u0026#34;\nfi\nprompt_segment $color $PRIMARY_FG\nprint -n \u0026#34; $ref\u0026#34;\nfi\n}\n# Dir: current working directory\nprompt_dir() {\nprompt_segment 30 $PRIMARY_FG \u0026#39; %~ \u0026#39;\n}\n# Status:\n# - was there an error\n# - am I root\n# - are there background jobs?\nprompt_status() {\nlocal symbols\nsymbols=()\n[[ $RETVAL -ne 0 ]] \u0026amp;\u0026amp; symbols+=\u0026#34;%{%F{red}%}$CROSS\u0026#34;\n[[ $UID -eq 0 ]] \u0026amp;\u0026amp; symbols+=\u0026#34;%{%F{yellow}%}$LIGHTNING\u0026#34;\n[[ $(jobs -l | wc -l) -gt 0 ]] \u0026amp;\u0026amp; symbols+=\u0026#34;%{%F{cyan}%}$GEAR\u0026#34;\n[[ -n \u0026#34;$symbols\u0026#34; ]] \u0026amp;\u0026amp; prompt_segment $PRIMARY_FG default \u0026#34; $symbols \u0026#34;\n}\n# Display current virtual environment\nprompt_virtualenv() {\nif [[ -n $VIRTUAL_ENV ]]; then\ncolor=cyan\nprompt_segment $color $PRIMARY_FG\nprint -Pn \u0026#34; $(basename $VIRTUAL_ENV) \u0026#34;\nfi\n}\n## Main prompt\nprompt_agnoster_main() {\nRETVAL=$? 
CURRENT_BG=\u0026#39;NONE\u0026#39;\nfor prompt_segment in \u0026#34;${AGNOSTER_PROMPT_SEGMENTS[@]}\u0026#34;; do\n[[ -n $prompt_segment ]] \u0026amp;\u0026amp; $prompt_segment\ndone\n}\nprompt_agnoster_precmd() {\nvcs_info\nPROMPT=\u0026#39;%{%f%b%k%}$(prompt_agnoster_main) \u0026#39;\n}\nprompt_agnoster_setup() {\nautoload -Uz add-zsh-hook\nautoload -Uz vcs_info\nprompt_opts=(cr subst percent)\nadd-zsh-hook precmd prompt_agnoster_precmd\nzstyle \u0026#39;:vcs_info:*\u0026#39; enable git\nzstyle \u0026#39;:vcs_info:*\u0026#39; check-for-changes false\nzstyle \u0026#39;:vcs_info:git*\u0026#39; formats \u0026#39;%b\u0026#39;\nzstyle \u0026#39;:vcs_info:git*\u0026#39; actionformats \u0026#39;%b (%a)\u0026#39;\n}\nprompt_agnoster_setup \u0026#34;$@\u0026#34;\niTerm2 Config # The rest of the color palette is controlled by iTerm2.\nUsing:\nForeground: #e5fcc2\nBackground: #594f4f\nANSI Red: #e96a67\nANSI Red was specifically changed because the original color looked too dark. The same rule should be applied to other ANSI colors if they are not obvious enough.\n","date":"14 December 2019","externalUrl":null,"permalink":"/2019/12/customize-terminal-and-ide/","section":"posts","summary":"This short article aims to reproduce my current workbench, using color palettes of my choice.\n","title":"Customize Terminal and IDE","type":"post"},{"content":"Having finished this book, I was immediately filled with respect for the author’s professional dedication. For a young writer, to re-live, day and night, a stretch of memory that carries her greatest pain must be agonising. The quality of the novel may still be immature in places, but I find it deeply worthy of respect, not because of what happened to her, but because even under such circumstances she could still follow her devotion to literature, or, in her own words, her “basest obsession with language”.\nI do not really want to discuss the novel itself, because both the plot outline and the background of its writing are fairly obvious. Most people learnt about this book through the news of the author’s suicide, so most people also already know who the villain is in the book, and who the victim is. While reading, I kept seeing reader comments like “this is so disgusting” and “I cannot keep reading”. I am certainly not blaming them, but when I saw the author herself say (roughly) that “readers can choose not to read, while I could not choose not to live through this experience”, I fell silent. For a writer who relied on writing as “physiological release”, I think her words deserve our serious thought.\nA side note here: the ensemble dinner scene in chapter three is, to me, the best scene in the whole book. Sadly my own words are not ornate enough to praise it properly; both in plot and in structure it is a deeply satisfying close.\nWhat I want to discuss next is only the author herself. The social questions have already been discussed at length, and the topic is too grand for me to grasp. I only want to talk, from a reader’s perspective, about what I saw of the author.\nBefore that, I suggest watching this video: 「這是關於《房思琪的初戀樂園》這部作品，我想對讀者說的事情。」──林奕含 | Readmoo電子書 - YouTube, as well as the afterword of the book.\nFirst, I want to argue that the author was addicted to literature, or to reading itself. Some may ask whether this even needs discussing, since it is plain in the text: Fang Si-qi’s home is full of books; of course she loves books (literature) like life itself. What I want to stress, though, is not that she loves literature, but that she loves literature and at the same time hates herself for loving it. That is what addiction is: you know it is bad for you, and you want to quit it.\nLet me list a few facts I can recall. Fang Si-qi senses that Lee Kuo-hua deliberately lets his daughter Xixi go unread and grow “dull”; the book also, through old Mrs. Qian, points out that nobody cares about the ink in Yi-wen’s belly, only about her bearing a child. At first glance these all read as the author’s attack on son-preference and gender inequality, and at first I read them that way too. Only later did I realise that for someone psychologically abused for so long, someone who somehow always ends up “in the wrong before men”, she would all the more feel that loving literature was a wrong she committed against herself, just as Yi-wen says: “if you fall, you can only blame yourself.”\n“Literature is the most futile thing, and comically futile at that. Writing all this, I cannot save anyone, not even myself. All these years, all this writing; I might as well have grabbed a knife, charged in and killed him. Truly.”\nThese are the author’s words to her psychiatrist, from the afterword.\nAs Lin Yi-han said in the interview about this book, she found Lee Kuo-hua’s sweet words undeniably beautiful (by her own aesthetic standards), which brings out her question: is all art merely glib, ingratiating words? Clearly she could not accept that the righteous spirit running through history behind literature could be so easily betrayed and exploited. This is what broke her the most; I think it was a collapse at the level of faith. And this is where I suddenly saw it: her awareness sprouted, she began to struggle free, and in the end she found herself already addicted. The text also touches on Lee Kuo-hua’s view of Dream of the Red Chamber; Fang Si-qi could not believe in, yet accepted, the existence of such betrayal of literature.\nIt is precisely this addiction that made it impossible for Fang Si-qi to escape Lee Kuo-hua, impossible to save herself. In the interview (or is it a self-account?), we can clearly see the author point out that Fang did love Lee; in the next sentence she says, “I even believe that at certain moments, Lee Kuo-hua loved.” Note her use of the words “believe” and “certain”: clearly the author herself had a subjective awareness of this love; she knew how deformed it was. Did she never try to pull away? In real life I cannot know, but in the novel she repeatedly wants to confide in Yi-wen, yet fears making her share the burden. This kind of self-blame, which should not exist but does in our culture, can easily drown a person, especially someone suffering from depression.\nSo the addiction here really carries two layers of meaning. The first layer is the addiction to Lee Kuo-hua: knowing the relationship is immoral, yet clinging to it and refusing to let go, because letting go would seem to mean never being able to love again; already defiled, she can only anaesthetise herself to obtain this love. Yi-wen’s miscarriage is the author’s mockery of this very thought: hoping the prodigal will return is a fool’s dream. She also often wears that “little lamb”, “lost” expression; she was misled, she was preyed upon. The second layer is the addiction to literature itself, and Lee Kuo-hua was precisely the person who, of everyone Fang Si-qi could reach at that age, most embodied literature, because for the author, or for the protagonist, literature itself meant faith, a lifelong pursuit, her “aspiration” and her “dream”.\nIt is as if the world’s top physicists spent their whole lives trying to understand the true rules by which the universe runs, and then one day discovered that Earth is merely a petri dish of some higher beings, that all of humanity merely lives under observation, and that nearly identical petri dishes exist elsewhere in the universe. People have an instinct for self-protection; they refuse to believe, and after refusing, they can only search the given information for a plausible explanation, such as “he loves me”. She is split: she knows he does not, but if he did, she would feel much better. This is the sickness of addiction.\nWriting up to here, I have more or less realised that my words (my Chinese, rusty from years of disuse) cannot carry this thesis; reading my last few paragraphs again, even my limited powers of logical analysis can tell that some of it does not hold together. I am suddenly reminded of a sculptress I read about a while ago, who out of talent fell in love with her teacher, was exploited by him, and in the end lived miserably and died, though some classic works of hers survive. I think people who can become addicted to the great way of the world all start out fragile, and all deserve extra protection, because they are so devoted to the way that they never consider how many people in this world have no affinity with it, people who are not content, who resent, who are selfish and want to gratify a self-regard that was never rightly theirs.\nFor such rare, shining pearls of humanity, outliers of evolution and even the key to the advancement of civilisation, spiritual and material alike, 
should perish merely because of some people's selfish desires and interests, is a genuine pity. And those people, it must be said bluntly, are the criminals of history.
","date":"16 November 2019","externalUrl":null,"permalink":"/2019/11/book-log-2%E6%88%BF%E6%80%9D%E7%90%AA%E7%9A%84%E5%88%9D%E6%81%8B%E4%B9%90%E5%9B%AD-%E6%96%87%E5%AD%A6%E6%88%90%E7%98%BE/","section":"posts","summary":"Having finished this book, I was struck with respect for the author's professional integrity. For a young writer, reliving day and night a memory that carries her greatest pain must have been torture. The novel itself may be immature in places, but I find it deeply worthy of respect; not because of what she suffered, but because even under such circumstances she could still follow her devotion to literature, or, in her own words, her basest obsession with language.\n","title":"Book Log 2:《房思琪的初恋乐园》- Literature as Addiction","type":"post"},{"content":"Recently Amazon Redshift launched a new console interface, which is pretty nice. It actually gives some valuable optimisation tips. A data warehouse is like a sword: you need to constantly sharpen it so it won't lose its edge.
I will cover a few disk-space-saving tips in this article. Though they are Redshift specific, I believe the ideas generalise to other data warehouse solutions.
To analyse the disk usage, you just need to run:
SELECT schema AS table_schema,
       "table" AS table_name,
       size AS used_mb,
       tbl_rows,
       pct_used
FROM svv_table_info d
ORDER BY size DESC;
It gives you a pretty nice overview of your disk usage. This should be the first thing you do, as it helps you pinpoint the priority of optimisation, and potentially the actual place to optimise. Yes, yes, it can be more than just the storage layer; sometimes a complex analytical query will eat up all your storage too.
Check the list of tables, and do some house cleaning # We create temp tables all the time. Some of them get committed as a pipeline step; some are just gathering dust in a corner of the data warehouse.
This tip is pretty straightforward: convince yourself that you really, really do not need the table, and then drop it, just as you would when house cleaning. Also, next time be a good citizen and commit the queries somewhere, with documentation please.
Identify heavy pipelines and optimise them # Certain transformations temporarily take up a large chunk of disk space and only release it once the process is done. It can be a problem if you have several heavy pipelines running simultaneously. They are simple to identify: go to the Redshift console and look at the disk-usage timeline. If you see it rise to 100% periodically, check the queries and optimise them by splitting the logic into several steps, persisting the transformation result of each step.
Note: personally, I find this is also a pretty useful trick for avoiding some cryptic errors in Redshift. This happens all the time to my dear friends and valued colleagues:
Write a query with 6 CTEs.
Execute it and get an error message that is not even readable.
Debug CTE by CTE and realise the bug does not make sense at all, because the error-throwing CTE is simple and normal.
Fix it by persisting the result of the first few correct CTEs and executing the dead-simple, error-throwing CTE separately.
Happily look at the final result while cursing Redshift.
Incremental Processing # I have talked about idempotence in data engineering in one of my previous blog posts; however, sometimes it is costly and unnecessary to transform data in an idempotent fashion. Incremental processing is a nice alternative: it is essentially a tradeoff between performance and data quality, roughly along the lines of the sketch below.
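As a minimal sketch of what such an incremental (append-only) load can look like, assuming a psycopg2 connection to Redshift and hypothetical raw.events / analytics.events tables (none of these names are from the actual pipelines):
import psycopg2

INCREMENTAL_LOAD = """
INSERT INTO analytics.events
SELECT *
FROM raw.events
WHERE event_time > (SELECT COALESCE(MAX(event_time), '1970-01-01')
                    FROM analytics.events);
"""

def run_incremental_load(dsn: str) -> None:
    # Only rows newer than the current high-water mark are appended, so the
    # job is cheap to run daily; the flip side is that late or corrected
    # rows behind the watermark need manual patching.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(INCREMENTAL_LOAD)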
It is again a pretty straightforward idea, but be careful: such pipelines carry higher maintenance, and they often require manual data patching when bad things happen.
However, it works well if your data is:
retrospective in nature, e.g. client reports, which we want to stay the same even if past data has changed slightly due to various technical limitations; or
events (time-series) data, since it follows a strict, non-reversible timeline.
For the second type I still recommend a periodic automatic data adjustment to counter the effect of delayed events and minor updates, e.g. a full replication every week.
Hot/Cold data strategy + Spectrum / Athena query # Following the incremental processing strategy, we can also adopt a hot/cold data pattern for storage. For the large analytical event data generated by the likes of Google Analytics or Amazon Pinpoint, we usually generate user-behaviour insights for a very recent period.
E.g.
a post-release analysis of a feature;
an analysis to determine the next batch of optimisations.
Thus we only need to keep a fixed range of data: truncate the tail and append the new events every day, like a moving window. The cold data we of course still store somewhere like Amazon S3, and access it in an ad-hoc way using tools like Redshift Spectrum or Athena.
Column Compression Encoding # This is something I found out about recently, so I only have some very naive comparison stats. The main idea is to compress your data by column, period. You can google the algorithms and implementations; I had a look and it is a good read.
To analyse:
ANALYZE COMPRESSION <table name>;
This tells you, per column, by what percentage you can compress.
Before compression:
Total disk space usage: 9.14%
Total time to generate: ~17 mins
Total time for the immediate downstream task: 1 m 14 s 513 ms
After compression:
Total disk space usage: 2.xx% (I do not remember the actual value)
Total time to generate: ~17 mins (pretty comparable)
Total time for the immediate downstream task: 1 m 18 s 276 ms
So yes, I definitely recommend this for large tables. The only, unsurprising, downside is that you will need to rebuild your tables, just as you would when re-arranging your distkey and sortkey.
Conclusion # Hope this helps.
","date":"4 November 2019","externalUrl":null,"permalink":"/2019/11/de-log-3-amazon-redshift-disk-space-saving-tips/","section":"posts","summary":"Recently Amazon Redshift launched a new console interface, which is pretty nice. It actually gives some valuable optimisation tips. A data warehouse is like a sword: you need to constantly sharpen it so it won't lose its edge.\n","title":"DE Log 3: Amazon Redshift disk space saving tips","type":"post"},{"content":"I will have some random thoughts here and there. This is a new series of logs, which will be short.
It has been more than one year since I returned to the position of a data engineer. After a detour, I ended up doing exactly what I initially loved to do - building automation tools and pipelines.
Automation has always been attractive to me because it grants some unfathomable sense of satisfaction to watch processes and tasks executed strictly according to your design. I probably first got this feeling when I was developing crawlers back in the day, watching data get cleaned, structured and stored in a database.
It is exactly like farming, but with much more timely feedback.
I tend to believe that video games like Factorio developed my liking for conquering manual repetition during my pre-uni years. I was not a fan of programming, even though I started at the age of 11. Solving algorithmic problems in Pascal on a blue screen was apparently not as fun as trying to score full marks in middle-school exams. Programming only started to become fun when it became useful to me; that is how I discovered I am definitely a utilitarian.
Don't get me wrong, I am not looking down on manual labour or anything alike. On the contrary, it is exactly repetitive labour and the endurance of boredom and hardship that built what humans now have, much of which is unfortunately taken for granted.
It is fun to build automation tools; however, recently I have not had much time or headspace to think about and design such tools. I realise that even though I love these practices - hobby projects and the like - my mental stamina gets drained during work (especially this week). I have to admit that age has gotten the best of me (speaking at the age of 24, which is old enough to understand what old is). You start to appreciate youth once it is truly gone, sadly.
What is the purpose of life, then, if not doing the things we love? To truly achieve that, we do things once and only once in life, and use automation to buy us more time in the world.
Hopefully I can still be this optimistic at the age of 25.
","date":"23 October 2019","externalUrl":null,"permalink":"/2019/10/rt-log-1-about-automation/","section":"posts","summary":"I will have some random thoughts here and there. This is a new series of logs, which will be short.\n","title":"RT Log 1: About Automation","type":"post"},{"content":"Recently I was working on the deployment of a predictive model built by my dear friend and valued ex-colleague. Here I will share some thoughts and challenges I encountered during its production deployment.
This topic will be spread over at least two blog posts, so that I do not write one super long article and then give up before finishing it (heck, I almost did not finish this one).
I will discuss the entire system, from the preparation of data and model artifacts to the deployment of the actual service.
Essentially, the service loads the artifacts and does real-time prediction based on information posted by the frontend, while the models and other datasets are updated at a scheduled interval.
Disclaimer: I work at a startup, which means the size of the data is nowhere near Facebook's or Twitter's. Some strategies adopted here may not be realistic in your context, and I am sure I have not yet encountered every problem in the field, so this is not a guide; it is just some shared thoughts.
Architecture # The development generally divides into two big parts:
Data and model preparation pipelines
Service development
The preparation step runs at a scheduled interval.
As illustrated by the diagram above, we are mainly doing the following:
At the start of each month (or week or day, depending on the business need and the marginal gain of incremental learning), prepare the data -> train the models -> save the model artifacts for future use.
At an independent, but often much shorter, interval, prepare the additional datasets needed for prediction from the data warehouse; they are then further replicated to the service backend for later use.
Model Preparation # It is pretty straightforward: politely ask your data scientist colleague to organise the commands that prepare and train the model. If your dear friend and valued colleague does not know how to do that, help them by wrapping the code in a CLI.
One very important thing to note: some parts of the data science project, such as the prediction function or other data-processing functions, will inevitably be reused in the backend service. While preparing the commands for scheduling, I package those necessary tools into a separate folder, decoupled from memory-intensive libs such as pandas (still in the same repo), to be installed later via pip install git+ssh.
root
|____ prediction_model
|     |____ stuff for model training
|     |____ __init__.py
|____ prediction_model_tools
|     |____ __init__.py
|     |____ stuff for production use
|____ README.md
|____ setup.py  (setup for the prediction_model_tools folder)
People improving the model might forget to update the production tools, so it is vital to set up tests for those tools, mainly on the shape of the input. An accuracy test of the model should also be installed, so that we do not deploy a worse model into production (if not using incremental learning). More details will be discussed in the deployment phase.
Data Preparation # Normally it is just data transformation within the data warehouse, nothing much to talk about; complexity varies with your usage and source data. However, we do need to think about the feasibility of TRANSACTIONAL queries in a production environment. Say we prepared a prejoined analytical table with 1 billion rows, which is easily handled by any modern data warehouse. When you want to serve it from a standard Postgres instance, you need to think about the tradeoffs. Will a normal select be slow? How should we index it (data replication does not usually handle indexing)? Should we prepare everything in one table and simplify the ORM layer / raw SQL layer, or should we do custom joins on an as-needed basis? Anyway, the core idea is to think about feasibility when replicating the data; a small post-replication sketch follows.
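Since replication does not carry indexes over, a minimal sketch of the step I mean, assuming psycopg2 and a hypothetical replicated table service.user_features queried by user_id (both names invented for illustration):
import psycopg2

POST_REPLICATION_DDL = [
    # The replicated table lands without indexes, so build the ones the
    # service's access pattern actually needs right after each load.
    "CREATE INDEX IF NOT EXISTS idx_user_features_user_id"
    " ON service.user_features (user_id);",
    # Refresh planner statistics so the new index is actually used.
    "ANALYZE service.user_features;",
]

def finalize_replication(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for statement in POST_REPLICATION_DDL:
                cur.execute(statement)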
It is easier if the data engineer has some background in transactional queries, or even in optimising service databases.
A Note on Data Replication # We do not serve data directly from the data warehouse, as it is brittle and not ACID (Redshift and many filesystem-based data warehouses). This should be common understanding, so I will not further explain the need to replicate data.
However, I would like to stress the importance of replicating the same dataset to both the TEST and PROD environments. It ensures the reproducibility of certain data bugs.
In the development environment, I would suggest a direct connection to the data warehouse, if your service is not crazy about opening connections.
Anyway, I now always prepare two separate books for similar projects: one for the actual application backend DB, one for the replicated data in a separate schema.
Data bugs are often considered hard to catch because we can only tell whether something is a bug by examining the final result - the prediction (or intermediate results, if optimisation is done correctly).
Service # In my case I did not spend much time on the development of the service itself, given the simplicity of our needs for the current iteration. Thus I will not, and cannot, go very deep into the subject. With such a simplistic setup, it does not feel much different from setting up a conventional web backend.
The architecture is simple.
Personally, I do not like ORMs in simple projects, so I have not yet tried integrating an ORM with this dual-schema database setup.
Request rate # One thing I had to agree on with the platform backend engineers (yes, they are my users) is the rate of requests, and consequently the responsibility for validating changes in the raw input. Ideally, we want to make a new prediction only when the raw input changes. In my case I managed to persuade my colleagues to verify the input updates before calling my service, but that might not be appropriate for every scenario.
Model (Pre)Loading # Model artifacts can be huge, so it is crucial (and common knowledge, I guess?) to preload the artifacts as globals accessible by the service workers.
In the specific case of gunicorn, use the --preload option.
Docker image preparation # This should probably be discussed in Part II, but personally I usually develop using docker directly. Some people prefer to develop outside the containers and dockerize later. There are pros and cons to both approaches, but they will not be discussed here.
Assuming approach one (develop with containers directly), we often need to rebuild the image several times. If not carefully planned, we waste a lot of time waiting for libraries to build, which breaks the continuity of thought as we switch to other projects. Thus, building a common machine-learning base image seems a sensible choice. Of course there are several popular pre-built images for machine learning, but I decided to build everything either from source or from pip; it was definitely a good learning experience, and a chaotic nightmare. I only recommend you do it once and only once.
Summary # This is my favourite project so far, as it was built entirely by me - of course you get attached to your own work, don't you? :). I can imagine the complexity and the size of the modules growing with more requirements and datasets, and I hope I get the chance to work on such projects in the future.
Ultimately, I feel quite a few of these problems can be successfully addressed using established approaches from conventional software engineering, with a touch of data sense to handle the potential variants.
More discussion of deployment in Part II, if I am not too lazy to write it. I only finished this post now because I was starting to forget what I did a week ago (feeling like an old man).
","date":"17 October 2019","externalUrl":null,"permalink":"/2019/10/de-log-2.1-develop-data-science-project-in-production/","section":"posts","summary":"Recently I was working on the deployment of a predictive model built by my dear friend and valued ex-colleague. Here I will share some thoughts and challenges I encountered during its production deployment.\n","title":"DE Log 2.1: Develop Data Science Project in Production","type":"post"},{"content":"Just got back from Ad Astra; better to write my thoughts down while they are fresh, otherwise in a few hours I will be too lazy to write them at all. (And in the end I still posted this a day late, ha.)
Viewing experience # First, the overall experience: I had not expected this to be an artsy, minimalist science-fiction film. From the synopsis I assumed it would be a family-bonds sci-fi, surely the old routine of elevating the human-nature theme. Not that I dislike that routine, I do love Interstellar, but... well.
One annoying thing during the screening, something to watch out for next time: I picked a seat in the right back corner, and the emergency-exit light kept shining at me. Next time I would rather sit up front.
Below, roughly, what I disliked and liked about this film.
What I disliked # Let's start with the criticism; whenever I watch a completely unfamiliar film (picked at random rather than a sequel), I arrogantly watch it with a critical eye. I will skip the small problems sci-fi films can hardly avoid: crossing an asteroid belt behind a steel plate you can detach by pulling out a single rod, then bodily crash-landing onto a ship's railing (honestly it did pull me out of the film a bit, but fine); or a bottle of poison gas casually left in the cabin wiping out the whole crew (ships shake all the time, who would store it like that...), clearly just a device to leave the protagonist alone with his solitude. All understandable, fine.
The exposition in the first half is rather muddled: the government lying to the protagonist about his father, the long close-ups of the mysterious Mars Operations Director (somehow that title feels odd to me in translation), and the protagonist, whose heart rate never passes eighty, suddenly passing eighty after accepting this mysterious director's invitation and seeing that his father killed some crew members (it stayed at 57 even when he was told his father was still alive). Watching all this felt messy; not exactly inconsistent, just deeply uncomfortable, and it took me a long time to figure out why.
At first I took it for some suspense thriller, full of conspiracies of the government persecuting an excellent astronaut, and rubbed my hands ready for a commercial blockbuster. The further I watched, the more I realised: wait, this is an art film. This BGM, this first-person narration; this is the Gravity playbook. From then on everything felt right, and by the time the father insisted that the protagonist let him go, I thought the film was actually quite good.
So this is what I really disliked: the suspense and exposition diluted the coldness and seriousness of the theme, and diluted the feeling the second half of the film wanted to deliver.
What I liked # Overall the theme is clear: humans are lonely, humans are tiny, the universe is vast. I rather like this theme; it is hardcore and minimalist, letting you feel all of it without stuffing in excess emotion. Not everyone will like it, but I think it is one of the most important themes of science fiction itself; mainstream sci-fi mostly wraps human nature in a sci-fi shell, with family or romance and so on.
Of course the film does have a touch of humanity: in the end the protagonist returns to the human and the near at hand; the intelligent life he sought was right in front of him. Honestly, I feel it would also have worked if, after releasing his father, he had sat in the ship watching the sun and gone up with the nuclear blast; very depressing, but I would have liked it (is something wrong with me?).
A forced summary # Overall it is not as good as my memory of Gravity, but if you like artsy space films it is still worth watching. If you go with company, ask first whether they like this kind of film; my girlfriend certainly did not, and we also sat through a painfully long stretch of ads.
Sleep, sleep!
","date":"27 September 2019","externalUrl":null,"permalink":"/2019/09/movie-log-3-ad-astra/","section":"posts","summary":"Just got back from Ad Astra; better to write my thoughts down while they are fresh, otherwise in a few hours I will be too lazy to write them at all. (And in the end I still posted this a day late, ha.)\n","title":"Movie Log 3 - Ad Astra","type":"post"},{"content":"In linear algebra, an idempotent matrix is a matrix which, when multiplied by itself, yields itself.
Overview # Today I would like to talk about idempotency in data engineering.
I first learnt this word in Year 1 of uni, in an intro linear algebra course (probably my best maths module of my entire uni life).
The term idempotent matrix means:
In linear algebra, an idempotent matrix is a matrix which, when multiplied by itself, yields itself.
Well, in data engineering it most likely means: no matter how many times you execute an ETL job, it will return the same transformation result, provided the source data is unchanged.
I have known this concept and its benefits since the early stage of my career, but only recently (after working on the data warehouse migration to Redshift) have I started to appreciate its importance.
Incremental vs Idempotent # Before describing the benefits, I would like to define the following two types of ETL jobs:
Incremental job: an ETL job that makes incremental changes to a table or data store, e.g.
running in append or update mode.
Idempotent job, or, as I often call it for naming convention, a refresh job (easier to spell): the kind I explained above.
Benefits # Basically, when you have an idempotent ETL job, you get:
Minimal maintenance # No backfill required, no need to worry about discontinuity in the data during execution, and you know exactly what data you are transforming during re-execution.
This is the most important point. It might take longer than an incremental task (not necessarily, and if it takes too long it means you should look for an alternative solution, e.g. move your Python operator to Spark), but the cost and trouble saved are much more crucial.
Cost here means debugging time, data validation and patching time, etc.
Easier to debug # It is easier to debug since everything can be refreshed as many times as you like, any time you want, as long as the dependencies are fulfilled. Data debugging is different from functionality debugging because the data is often dynamic, and that poses a challenge even for locating the bugs.
Less temporal dependency # One of the common incremental jobs is building a time-series table, e.g. DAU/MAU computation. These often require a certain level of data quality because they are used frequently in making business decisions.
However, sometimes data arrives later than the execution date (it can be a few days late), which forces you to recompute and update if you want a more accurate result, or if the portion of delayed data is too significant.
One approach is to set up another pipeline to adjust the numbers after the first execution, or simply delay the execution by a few days (which is fine, because honestly you cannot do anything before the data reaches the warehouse).
An idempotent job saves you the trouble of maintaining two separate pipelines, and the dependencies in between; it makes sure all newly arrived data is considered during the daily transformation.
Less bug-prone # A bit similar to point 2: there will be no confounding factors contributing to your problematic transformation result, leaving you only the transformation logic to debug.
Otherwise you end up thinking... hmmm, I remember two days ago some guy manually updated a static table; when was it, let me ask him / check the logs first (if you do not want to disturb others)... and ahhhhhh, it was right before the execution time, that's why! I have wasted two hours of my life again...
Works better with modern data warehouses # Back when we were still using Postgres as our data warehouse, we had a lot of update statements: fetch only the latest data chunks, update the necessary columns. Over time the data quality started to drop because of accidentally failed ETLs. It was apparently a tradeoff between performance and data quality.
After we shifted to Redshift, single inserts or updates became costly. I was forced to rethink and redesign many of our jobs, and I am so happy that I eventually got rid of all the update statements.
Final comment # I do not mean we should always build idempotent jobs (sometimes we have to use incremental jobs; they are great too, but they require much higher maintenance). We have deadlines, we have legacy code; it all depends on priorities. If the data quality is only slightly off, it might not have an impact at all. To make the refresh idea concrete, a minimal sketch follows.
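A minimal sketch of such a refresh job, assuming psycopg2 and a hypothetical analytics.daily_active_users table (the names are invented for illustration): the whole result is rebuilt in a staging table and swapped in, so re-running the job any number of times yields the same table.
import psycopg2

REFRESH_STEPS = [
    "DROP TABLE IF EXISTS analytics.daily_active_users_staging;",
    """
    CREATE TABLE analytics.daily_active_users_staging AS
    SELECT event_date, COUNT(DISTINCT user_id) AS dau
    FROM analytics.events
    GROUP BY event_date;
    """,
    "DROP TABLE IF EXISTS analytics.daily_active_users;",
    "ALTER TABLE analytics.daily_active_users_staging"
    " RENAME TO daily_active_users;",
]

def refresh_dau(dsn: str) -> None:
    # All steps run in one transaction: readers see either the old table
    # or the new one, never a half-built result.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for step in REFRESH_STEPS:
                cur.execute(step)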
So think twice before you actually start refactoring pipelines.
The ultimate goal is, of course, to walk into the office every morning without worrying that something / somebody is breaking the pipelines again, thank you very much.
","date":"31 August 2019","externalUrl":null,"permalink":"/2019/08/de-log-1-idempotency/","section":"posts","summary":"In linear algebra, an idempotent matrix is a matrix which, when multiplied by itself, yields itself.\n","title":"DE Log 1: Idempotency","type":"post"},{"content":"A list of TV series I would like to watch / have watched, with short comments on them, by category.
Disclaimer: these are just personal comments without much deliberation; I simply wrote them for fun. A lot of grammatical mistakes may be generated in the process, and probably some irresponsible comments too.
Pandora # I have always had a soft spot for the sci-fi genre: inter-celestial interactions, spaceships and questioning the existence of human beings, but most importantly, the world that is shaped in the series.
I have only watched the pilot of this show, and unfortunately I will not continue unless I am super slack. The story settings are OK, nothing new but nothing bad. I was actually excited for a while when I saw an academy teaching properly imagined courses, such as inter-celestial protocols and a history of Earth, in the form of holograms and VR. For a while I thought this was a teenage story baked with sci-fi elements - very good for a not-so-young man like me.
However, what annoys me is the narrative (ah yes, it is usually the narrative): out of nowhere the female protagonist has already spent her first three months at the academy. I understand the need to explain the world and its stories via narration, and to speed up the pace of unravelling the meaning of Pandora, but it still feels fractured.
I must confess that it did not impress me, but that does not mean it is not going to be a good show. I hated Person of Interest for the first two seasons before I really started to enjoy it.
","date":"24 August 2019","externalUrl":null,"permalink":"/2019/08/dramas-checklist/","section":"posts","summary":"A list of TV series I would like to watch / have watched, with short comments on them, by category.\n","title":"Dramas Checklist","type":"post"},{"content":"","date":"24 August 2019","externalUrl":null,"permalink":"/categories/tv-log/","section":"categories","summary":"","title":"tv-log","type":"categories"},{"content":"Recently I have been migrating some microservices and streaming pipelines to AWS Fargate. This article mainly covers the issues, pitfalls and other useful notes from deploying a web backend with DataDog.
Previously we set up DataDog on a standard EC2 machine, together with the web application.
This article omits anything clearly mentioned and explained in the official DataDog documentation.
It only annotates the issues I encountered during setup.
Resources/Terms Explanation # AWS Fargate: an AWS-managed container service; it lets you run your docker containers without worrying about the underlying infrastructure.
DataDog: a monitoring tool for applications; supports tracing of requests and calls across various resources.
Breakdown Steps # Preparation # Push your application container(s) to ECR or any container registry (if you have not done so). Generate an API key in DataDog. Understand the basics of docker and how containers communicate with each other in a network.
Create Task Definition # Set up the task definition normally, and configure your task execution role based on your application's needs. The tricky part is the container setup: since we want the DataDog agent and the application in different containers, we need to configure both so that they can talk to each other.
It is stated in the official docs that:
To monitor your ECS Fargate tasks with Datadog, run the Agent as a container in same task definition as your application. To collect metrics with Datadog, each task definition should include a Datadog Agent container in addition to the application containers.
Basically this means we need at least two containers in the task definition.
Configure DataDog Agent Container # This is technically similar to setting up the datadog-agent.yaml config file on an EC2 instance for DataDog, except that on EC2 everything is pre-filled with options and you just uncomment things on a need basis. Here we configure everything using environment variables and docker labels.
The integration setup tutorial can be found here: Integration Setup for ECS Fargate.
Environment variable configuration (details of each ENV used to configure the agent can be checked here: Docker Agent):
DD_API_KEY=YOUR_API_KEY
DD_APM_ENABLED=true
DD_APM_IGNORE_RESOURCES=YOUR_IGNORED_RESOURCES
DD_APM_NON_LOCAL_TRAFFIC=true
DD_APM_RECEIVER_PORT=8126
DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true
DD_DOGSTATSD_PORT=8125
ECS_FARGATE=true
You will also need to open the port configuration on the containers; by default the agent receives traces on port 8126. Since we are not running the agent and the application on the same machine, the NON_LOCAL_TRAFFIC related ENVs need to be set to true.
APM & Distributed Tracing
Open the following ports as stated in the docs:
8126 TCP
8125 UDP
Configure Application Container # The configuration of the application container is much simpler. It only needs to be able to send traces to the agent container, so we define two extra ENVs:
DD_AGENT_HOST=localhost
DD_TRACE_AGENT_PORT=8126
The localhost part is not written anywhere in the official docs. I originally used datadog-agent, the name of the agent container, and then saw this in the logs:
2019-07-16 07:57:03,286 : ERROR : Failed to send traces to Datadog Agent at datadog-agent:8126: [Errno -2] Name does not resolve
This tells us that the hostname in this case is not datadog-agent, which would have worked in docker-compose by just stating the service name.
Because of this I started to read more deeply into the Fargate docs and found this: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.html
When a task is started with the awsvpc network mode, … This means that all containers in a task are addressable by the IP addresses of the ENI, and they can communicate with each other over the localhost interface.
Well then, I tried localhost and it worked.
Going back to understand the whole thing, I associated it with my earlier effort of setting up a dockerised Postgres together with other applications: there, the database's address from the application's point of view is never localhost, but rather the IP address assigned by the docker host. In Fargate, because of the ENI, this problem seems to be solved.
Configure Application Code # I am using Python, so it is just configuring the tracer during application initialisation; nothing much to elaborate here. Something similar to the following will do:
import os
from ddtrace import tracer

tracer.configure(
    hostname=os.environ['DD_AGENT_HOST'],
    port=os.environ['DD_TRACE_AGENT_PORT'],
)
Conclusion # Enabling logs for the DataDog agent helps with debugging. It is advisable to leave them on for the first few runs, and disable them later if you find them unnecessary.
I would consider myself pretty weak when it comes to networks (never studied them in an academic sense), but with constant exposure to things like docker-compose and managed services, you focus much more on the connection of resources rather than the provisioning of them. This is pretty useful.
","date":"16 July 2019","externalUrl":null,"permalink":"/2019/07/devops-log-1-setup-datadog-and-apm-in-aws-fargate/","section":"posts","summary":"Recently I have been migrating some microservices and streaming pipelines to AWS Fargate. This article mainly covers the issues, pitfalls and other useful notes from deploying a web backend with DataDog.\n","title":"DevOps Log 1: Setup DataDog and APM in AWS Fargate","type":"post"},{"content":"","date":"16 July 2019","externalUrl":null,"permalink":"/tags/monitoring/","section":"tags","summary":"","title":"Monitoring","type":"tags"},{"content":"This short article aims to address issues I encountered while hosting a personal static blog on GitHub Pages.
Problem Description # GitHub Pages provides three different ways to host a static personal website, for free:
host in the /docs folder
host on a gh-pages branch
host on the master branch
However, in the case of personal blogs, i.e. using the repo <github_username>.github.io, the only option is to build the website on the master branch.
This created some minor issues in the publishing flow and setup when I switched from jekyll to hugo, and it is not that straightforward in the official docs (they are great docs otherwise), hence this article.
Solution # 1. Install Hugo and necessary notes # Click to view the official docs.
The docs are pretty good, and IMHO Hugo has the most UI-friendly and comprehensive docs among its competitors.
More docs here on hosting.
Note that in the second-last section, building and publishing on the master branch is omitted because:
Steps should be similar to that of the gh-pages branch, with the exception that you will create your GitHub repository with the public directory as the root.
Not cool! Especially inconvenient for people like me, who often enjoy blindly following instructions in a tutorial or documentation.
So I had to read the gh-pages section carefully, and made some tweaks.
2. Understand the structure # The website is built when we run the command hugo at the project root (the project root initiated from hugo new site <project_name>); all the static content is automatically stored in the public folder by default.
In the official instructions, they git worktree add the gh-pages branch to make it a linked worktree of the main git worktree, and then host the content of the public folder there. This is just like having a mini-repo inside a repo (not accurate at all, but you get the idea).
In order to have the static content of public at the project root on the master branch, we first switch to a separate branch, called builder.
Then we build the project by running hugo. By adapting the command in the docs, we can make the master branch become the mini-repo we want, with upstream set to master as well:
git worktree add -B master public origin/master
Check out the local master branch by simply cd public, then git push (use --force if necessary).
Remember to push your changes on the builder branch to the remote as well.
In summary: use the builder branch to write, build and publish the static content; use the master branch to host on GitHub Pages.
3. For people who are lazy # For those too lazy to read, or for lazy publication, I have the following deployment script. I copy-paste the entire thing here because I want to see how it looks when a large chunk of code is rendered using this Hugo theme.
#!/bin/sh
if [[ $(git status -s) ]]
then
  echo "The working directory is dirty. Please commit any pending changes."
  exit 1;
fi

echo "Deleting old publication"
rm -rf public
mkdir public
git worktree prune
rm -rf .git/worktrees/public/

echo "Checking out master branch into public"
git worktree add -B master public origin/master

echo "Removing existing files"
rm -rf public/*

echo "Generating site"
hugo

echo "Copy README.md"
cp README.md public/README.md

echo "Updating master branch"
cd public && git add --all && git commit -m "Publishing to github personal page master branch"
git push

echo "Updating builder branch"
cd ../
git push
Create a deploy.sh file at the project root and copy-paste the above code in.
Check the original script in my repo.
Then follow the steps below:
always work on the builder branch locally
preview posts with hugo serve
commit your changes on the local builder branch
run bash deploy.sh
Side Note # I originally wanted to write a detailed comparison of hugo and jekyll and their user experience, but similar articles are all over the Internet, so never mind.
In short, hugo is more architecturally pleasing to me, but requires some simple tweaks.
I also learned git worktree because of this; it is pretty cool.
jekyll is easier to set up, and to publish as well, but it is slow to build. That does not really matter to me at this point, because I only have one article, but I do not like to see a long list of red/green git status results (weirdo alert!).
","date":"11 June 2019","externalUrl":null,"permalink":"/2019/06/build-hugo-blog-on-github-pages-master-branch/","section":"posts","summary":"This short article aims to address issues I encountered while hosting a personal static blog on GitHub Pages.\n","title":"Build Hugo Blog on Github Pages Master Branch","type":"post"},{"content":"","date":"11 June 2019","externalUrl":null,"permalink":"/tags/hugo/","section":"tags","summary":"","title":"hugo","type":"tags"},{"content":"","date":"11 June 2019","externalUrl":null,"permalink":"/tags/notes/","section":"tags","summary":"","title":"notes","type":"tags"},{"content":"","date":"11 June 2019","externalUrl":null,"permalink":"/tags/static-website/","section":"tags","summary":"","title":"static website","type":"tags"},{"content":"","date":"28 March 2019","externalUrl":null,"permalink":"/tags/cloudwatch/","section":"tags","summary":"","title":"Cloudwatch","type":"tags"},{"content":"I wrote this initially on our internal documentation website, so this is a very short note on how to set up Cloudwatch on AWS EC2.
I have not touched EC2 deployment for a very long time since I switched to Fargate, but I guess it is good to keep a notebook of what I have tried and experienced.
Note: this was written more than one year ago; some of the things may be updated / outdated. Reference with care.
Download the agent package:
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
Install the package:
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
Configure using the wizard:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
Check that the EC2 instance is attached to an IAM role that has Cloudwatch rights.
Install collectd if not done so (most likely you have not):
sudo apt-get install collectd
Start the agent with the latest configuration:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
To check the status:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status
The response should be something like:
{
  "status": "running",
  "starttime": "2019-03-28T04:36:52+00:00",
  "version": "1.208036.0"
}
Note a few things:
We enable the usage of statsd and collectd.
The statsd port should be set to 8127 instead of the default 8125, which is commonly used by DataDog. (A modern note: if you do not use DataDog, the default is probably fine, but conflicting ports are one of the common problems during setup; it is crucial to check the agent logs.)
Naming conventions I use for log groups and log streams: the log group name should be the name of the service / EC2 instance; the log stream name should be the functionality of the instance (or the logger, if you have multiple loggers configured and running on the same instance).
Troubleshooting # If the agent is not running for some reason, first check the agent log at:
sudo tail /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
Usually it provides you with the necessary error message (this is not included in the AWS guide).
Potential
problems:
Port usage: another service might have taken the spot; check port usage carefully.
The IAM role is invalid or wrong.
Two log agents running: this guide covers the newer Cloudwatch agent, so stop any currently running awslogs service using
sudo service awslogs stop
","date":"28 March 2019","externalUrl":null,"permalink":"/2019/03/devops-log-0.1-setup-cloudwatch-on-ec2-instance/","section":"posts","summary":"I wrote this initially on our internal documentation website, so this is a very short note on how to set up Cloudwatch on AWS EC2.\n","title":"DevOps Log 0.1: Setup Cloudwatch on EC2 Instance","type":"post"},{"content":"","date":"28 March 2019","externalUrl":null,"permalink":"/tags/ec2/","section":"tags","summary":"","title":"EC2","type":"tags"},{"content":"I watched this movie some years ago in the cinema, together with my friends. We learned one phrase that day - "Old sport".
I remember enjoying the movie that day; perhaps I have always had an undiscovered soft spot for these classy movies: Pride and Prejudice, which I never thought I would love, and Anonymous, starring Rhys Ifans, who has lately become one of my favourite actors.
The phrase "The Great Something", mentioned in our telegram group chat, reminded me of this movie again today, so I started my second run, and I stopped halfway, after Gatsby told Nick that "the past could be fixed". It occurs to me now that, in the comfortable darkness back then, I was deeply drawn into the world of the prosperous American Dream of the early twenties as it was portrayed. Right at this moment, however, sitting in my bedroom with the bright white light on, I feel much less impressed.
Perhaps it has nothing to do with the atmosphere after all; perhaps it is really just that I have grown. I started to feel sad for Mr Gatsby halfway through the movie, desperately trying to make Daisy part of his wealthy life, all of it started by that kiss at her house. I don't think I had such feelings when I watched it five years ago; the only thing I seem to recall is jealousy of such love. Yes, I do believe in love, unlike many people in the world nowadays. In the movie, Nick and Gatsby have a conversation about the past after Daisy and Tom leave Gatsby's mansion. Nick, who has been the narrator of the story, shares his opinion: the past cannot be repeated. Gatsby, on the other hand, tries to mock him, but is probably instinctively constrained by his successful education as a gentleman. At the end of the conversation, he just corrects Nick by saying "you are wrong".
I feel sad because I think I can now understand the weight of that sentence. The past indeed cannot be repeated or fixed. After living on this planet for more than twenty years, I have started to learn a few things. Every desperate and yet constrained move by Gatsby reminded me of how I used to believe in things. An idealist indeed, but also a fool. I could not help but dive into the memories. I would love to stop writing these down, because living in a memory once again is dangerous yet attractive.
If Daisy and Gatsby could have been together from the beginning, we would have a mundane and great love story. What if reality stands in front of you and tells you to give up, or just accept what is given? An idealist would fight back; I would perhaps fight back as well.
I could actually feel the desperation Leonardo was trying to convey in the movie; it reminded me of years ago, running headless, seeking a love that did not exist. Everything at the party was just a facade, just like every tiny controlled motion of ours in daily life, yet more often than not the true self is too hideous to bear.
I stopped my second run of the movie when Gatsby called Nick again, saying that he had to help him get Daisy out of her current unhappiness. He mentioned the servants had been fired, and made a disrespectful comment during the desperate pleading. Gatsby looked back at the butler twice, as if uncomfortable with the butler's presence, until the butler finally left the room. That is when I said, OK, enough for tonight. It left me with just pure sadness, and some sympathy. It showed me how much of the gentleman is left inside when one is truly in despair. I am not blaming him; he cared too much, and thus it hurt him too much. It hurt the people around him consequently.
Interesting how your feelings about the same movie change over time, isn't it?
","date":"7 January 2019","externalUrl":null,"permalink":"/2019/01/movie-log-2-from-admiration-to-sympathy-the-great-gatsby/","section":"posts","summary":"I watched this movie some years ago in the cinema, together with my friends. We learned one phrase that day - \"Old sport\".\n","title":"Movie Log 2 - From Admiration to Sympathy - The Great Gatsby","type":"post"},{"content":"This article briefly records the steps to reschedule a DAG on Apache Airflow.
So this week one of my tasks was to reschedule some DAGs in our current data pipelines on Airflow.
It all started with a complaint made by my colleague: "Why is this query taking 30 minutes to complete, is there a DAG running?"
You get my point: scheduling a DAG, particularly a massive DAG that involves a lot of crucial data transformation and transportation, during office hours is not something desirable. It happened to us not long ago, when one of us was testing a complex query and it ended up jamming both Airflow tasks and other manual executions.
As a data engineer, at some point you will need to be very familiar with the DAGs. They are like your babies; please take good care of them.
Time Range # Let's say we work from 9 am to 10 pm SGT (because somebody likes to work or run queries at night); we will need to schedule DAGs outside of this period, which is 1 am to 2 pm UTC.
Types of DAG # We are dealing with time, therefore it is important to identify first whether the DAG is temporally related - whether you use execution_date in your query / Python Jinja templating / other stuff.
You probably also want to keep the historical DAG runs of this particular DAG, or perhaps you are using the backfill feature provided by Airflow, which is a great feature by the way. The following steps describe how to handle both cases.
Steps # Step 1. Identify the scheduled time and actual execution time of the DAGs # It is important to actually go to the Airflow metadata database (the name...) to check the actual execution time of the DAG. Sometimes, for some unknown reason, your DAG does not start at the designated time. So, to avoid further confusion, make sure to check the actual execution time; a small sketch of that check follows.
Now you need to decide whether you should reschedule the DAG based on the actual execution time. If the actual execution time is acceptable, all you really need to do is correct the start date; if not, you will need to make some other changes.
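A minimal sketch of what that check can look like, assuming direct psycopg2 access to the Airflow metadata DB; the dag_id used here is hypothetical:
import psycopg2

RECENT_RUNS = """
SELECT dag_id, execution_date, start_date
FROM dag_run
WHERE dag_id = %s
ORDER BY execution_date DESC
LIMIT 7;
"""

def show_recent_runs(dsn: str, dag_id: str = "daily_transform") -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(RECENT_RUNS, (dag_id,))
            for dag, execution_date, start_date in cur.fetchall():
                # start_date is when the scheduler actually kicked off the
                # run; a large gap from execution_date means the DAG is not
                # starting when you think it is.
                print(dag, execution_date, start_date,
                      start_date - execution_date)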
Step 2. Identify temporally relevant DAGs # Decide whether the DAG you want to move depends on time, as mentioned in the previous section.
Step 3.1 If not # Rename it (usually with a _v* suffix). Change the start date. Deploy and unpause your DAG. If everything is fine, delete your old DAG in the metadata database (optional; some people find the disabled DAG disturbing).
Step 3.2 If unfortunately yes # Change the last execution time in the metadata DB to one scheduled_interval before the desired starting time (I might remember the wrong column name):
UPDATE dag_run SET execution_date = ... WHERE dag_id = ...
e.g. you have a daily DAG at 2 am, and now it should run every day at 1 am, so you change your latest DAG run to 1 am. Then change your start date.
And that is it. You can relax and enjoy every successful Slack notification when you get up in the morning, knowing everything ran fine at night.
","date":"22 September 2018","externalUrl":null,"permalink":"/2018/09/de-log-0.1-rescheduling-of-airflow-dags/","section":"posts","summary":"This article briefly records the steps to reschedule a DAG on Apache Airflow.\n","title":"DE Log 0.1: Rescheduling of Airflow DAGs","type":"post"},{"content":"These days my reading time has been irregular, my mood restless, and life busy. The thinness of Fang Hua (《芳华》) looked easy enough to bully, so in the end I dived head-first into an imagination of that era. I had only meant to post a short note on WeChat Moments, but since I have this idle stretch anyway, writing a bit more is good for me; at least it moves the brain.
On the book # Reading it, I kept feeling an inexplicable sorrow. Over the years I have tried to insulate myself from certain feelings, apparently with little success, especially as I cry at movies more and more often; perhaps the older, the more sentimental. The book tells quite a few stories: Liu Feng's "touching incident", the exclusion and discrimination He Xiaoman suffered, and the cold observer's attitude the author presents through Xiao Suizi's eyes. Even when Hao Shuwen finally reveals to "me", Xiao Suizi, that it was she who betrayed the love-letter incident, there is only a "poker face".
I have not read Yan Geling's other works, but this one made me interested in the author. First, because of her peculiar way of writing feeling: her metaphors are so unliterary, so full of sweat. I kept picturing the film's shots, the rehearsal hall in sunlight, the dust visible in the air through the Tyndall effect; that is the flavour, the flavour of the era. It made me think I should write a bit of fiction some day; on reflection, I had better honestly read more books and raise my level first. Second, through Xiao Suizi's voice the author draws a two-layer psychological portrait of the group that once "struggled against" Liu Feng, herself included. I was rather convinced here. Earlier, when she quoted Freud's id/ego/superego, I thought, well, that works as an explanation too: Liu Feng's surface is indeed the extreme of the superego. But the theory has been so abused that I was unmoved, until Xiao Suizi's final inner monologue:
Why did we treat Liu Feng that way? ... From beginning to end we never truly believed in Liu Feng's goodness. Like me, everyone harboured that bit of darkness, wanting to see Liu Feng exposed ... When the "touching incident" happened, everyone in fact quietly breathed a sigh of relief.
This analyses why the comrades of those days, or more precisely Xiao Suizi and the other women soldiers, felt that a romance with Liu Feng would be disgusting. A saint must have no desires, must not do what ordinary mortals do; if you do, you are less than a mortal.
So even that idle, good-for-nothing army loafer looked down on us, found us incapable of curbing our own ugliness ... and once our ugliness appeared in Liu Feng, ah, he too contained our unbearable side ... We could not hit ourselves, but we could hit him.
In the book, Hao Shuwen's boyfriend is that army loafer; this monologue follows directly on the previous one, and both ultimately dissect that generation's betrayal of "Liu Feng". The second angle I had not expected at all; beyond the blow of its content, it made me feel that, as a novel, this is truly stream-of-consciousness.
Compared with the film # I cannot recall much of the film now; I only remember saying one thing as we left the cinema: the war got too much screen time, or rather, it came apart from the atmosphere of the opening. I can understand the intent, wanting to fold in more themes, or maybe I am overthinking it and it is plain narration. In the book, the war actually takes relatively little ink; it only sidelights He Xiaoman (He Xiaoping)'s mental breakdown and how Liu Feng lost his hand. I am no professional film critic; simply as a viewer, the film felt a little heavy-handed, and only after reading the book did I realise that this lighter touch is the right weight. Most of the book's scenes are restored in the film, though some details were cut.
I want to mention the question of how exactly Liu Feng touched her. In the film, through the later interrogation, a "hug" is rumoured into a "touch". The film's version makes the audience pity Liu Feng more, dislike Lin Dingding more, and resonate with Liu Feng's fate more easily. In the book, according to Xiao Suizi's "imagination", he really did touch her; the English title of the book is precisely "You Touched Me". I prefer the book's version, because it shatters Liu Feng's superego image further and lets the reader feel what those women soldiers felt. Only when readers begin to blame Liu Feng a little, to dislike him (I certainly did), does the final self-examination and self-criticism land with more force.
That one touch killed a twenty-something, 1.69m "big" Shandong man's every longing for love; it touched out that era's peculiar equation of betraying one's comrades with justice, and touched out Liu Feng's subconscious wish to die. Which brings up the other touch: Liu Feng's touching of He Xiaoman (the embrace, the lifts and so on). Opposite to the case of Lin Dingding, this touch brought out He Xiaoman's feelings for Liu Feng, feelings that arose even before the touching, before he offered to swap places to partner her in dance. The pity is that Lin Dingding came first; after that, nothing mattered to Liu Feng any more. Thinking about it, this is a deeply individualistic thing, Liu Feng's own choice; the external pressure was enormous, of course, but it was still his own choice.
Finally, a word on Huang Xuan's Liu Feng: the "Lei Feng image" is built almost flawlessly, which is also why I have regained interest in Chinese films in recent years; the quality of the actors has improved. That said, Huang Xuan does not play Liu Feng's inferiority complex. I admit it is genuinely hard (I honestly cannot picture Huang Xuan acting inferior), and it is not Liu Feng's main visible trait, though it remains his dominant trait in the book. This whole series of changes simplifies Liu Feng, perhaps because too much complexity would muddle the audience's feelings about him. At least after finishing the book, I cannot accurately describe my own feelings for Liu Feng: sympathy, yes; helplessness, yes; a sense of waste; many feelings kneaded together. In the end the book just leaves me with sorrow, and I cannot say
where the sorrow sits. Simplification makes things convenient: Liu Feng is bullied, his free expression of love is trampled by the era, he contributes on the battlefield and is abandoned by the state factory; we should pity him, hate that era, and (by contrast) feel the art-troupe days were beautiful.
Of course, the book and the film are two independent works in their own right; my comparison is purely for fun. This thing appears in the film, hm, interesting; that bit got changed, hm, also interesting; nothing more, no ranking of better or worse.
","date":"17 July 2018","externalUrl":null,"permalink":"/2018/07/book-log-1-%E9%87%8D%E6%B8%A9%E8%8A%B3%E5%8D%8E/","section":"posts","summary":"These days my reading time has been irregular, my mood restless, and life busy. The thinness of Fang Hua (《芳华》) looked easy enough to bully, so in the end I dived head-first into an imagination of that era. I had only meant to post a short note on WeChat Moments, but since I have this idle stretch anyway, writing a bit more is good for me; at least it moves the brain.\n","title":"Book Log 1: Revisiting 《芳华》","type":"post"},{"content":"I had planned to write this tomorrow morning, but by then I would probably have forgotten most of it, so better to write while it is hot. This cannot count as a film review, only impressions (I am probably not yet at review-writing level).
This is a science-fiction film, naturally one of the reasons I was drawn to see it. Unlike the sci-fi I have watched before, though, this one feels, how to put it, hazy. Not mind-bending: I guessed the human gene-mutation business the moment Lena pointed out flowers of clearly different species growing on one stem, and connecting that with the opening lecture on cancer gives you a rough, blurry guess at the whole film. What I want to express (and it is not that easy to express) is that the film gives you exactly this blur: it hands you a clue here and a clue there, but at the end everything remains deeply blurred.
Overall I prefer physics and space sci-fi to biological sci-fi, but this film is still well worth watching.
The title # My first reaction to the title was a slight horror, and indeed the film has some horror elements. The word first took me to physics: Annihilation. I still remember the terror of being ruled by JC H3 Modern Physics; annihilation is the word for matter meeting antimatter. I thought the word quite literary at the time, and I just looked the concept up again:
Annihilation, in physics, reaction in which a particle and its antiparticle collide and disappear, releasing energy. The most common annihilation on Earth occurs between an electron and its antiparticle, a positron.
That "releasing energy" is very like the so-called alien (creature?) emerging from Dr Ventress's body. I assumed the glittering stuff floating in the air was energy, but it also mimics the state of cell division, which would violate mass-energy conservation, so I suppose it is just cinematic technique.
After annihilation, truly nothing remains; that is the most frightening part of the word.
Three annihilations # With such a title, the film naturally carries plenty of imagery expressing the concept. I see roughly three kinds.
The annihilation of Earth
The film opens by explaining the concept of a tumour, mentioning the difference between benign and malignant: the latter spreads and invades nearby tissues. That maps directly onto the Shimmer, the zone veiled by the curtain of light in the film. It is like a malignant tumour of the Earth, spreading outward, invading other places. The effect inside the veil, DNA mutation, backs the "tumour of the Earth" image: the cells inside a tumour are not normal cells. What is hinted at is annihilation at the geological level. I read it as terraforming: look at the crystal-like trees at the end, basically not Earth species, and when the lighthouse burns the trees vanish into thin air; my feeling is that this is aliens terraforming.
The annihilation of the physical body
Of the five-woman expedition, several carry self-destruction of the body to different degrees: Dr Ventress has cancer, the physicist self-harms, Anya is an addict. In the film's words, they are all "damaged goods".
Lena also mentions, early in the film, that human ageing is not "natural" but genetically determined. I had heard this theory before; the film seems to fold the body's self-extinction into the concept of annihilation as well, as part of annihilation at the bodily level.
The annihilation of the mind
Anya and Sheppard undergo annihilation of the psyche or the mind. Anya first actively befriends Lena, a very warm person; as the film develops she visibly grows more and more volatile, until she ties up her three teammates for interrogation. Her reason has clearly annihilated.
Sheppard is the first to die, but before joining she had already lost her daughter to leukaemia. On the boat she says that in effect two people died: one was her daughter, the other her former self. That, naturally, is death at the psychological level, no longer much attached to this world. It too counts as a kind of annihilation of the mind.
Others # Beneath the layers above, I think the director also explores other annihilations.
Take Lena: I cannot easily summarise a separate category for her. She, too, genuinely practises self-destruction. The affair with her colleague is self-destructive behaviour, neither purely psychological nor purely physical, but the film very clearly marks it as a form of annihilation: it flashes back the moment she enters the veil, and keeps flashing back afterwards.
Or Kane's final string of questions, who am I, who are you: his own humanity essentially no longer exists, and his self-awareness is annihilating. We know a copy is fully synchronised with the original at first, like the two deer with flowering antlers, yet in the scene of Kane's suicide he converses with his copy. Then think of the bear that can carry over its prey's voice: the transfer of self-awareness is not impossible either. Which is to say, the original Kane had annihilated too.
Some shots # A few shots that struck me.
The bear
When the bear cries out with Sheppard's voice, it genuinely startled me, and I am not easily startled, especially at the movies; a rational voice in my head keeps telling me it is all fake. But that bear, well, hm.
The water glass
Two glass shots are worth mentioning. One: when Kane suddenly appears in the house, he and Lena sit across from each other, hands together; that shot is filmed through the water glass. Anyone who studied physics knows this phenomenon is refraction, and the word is mentioned many times in the film; it is precisely the physicist of the five who proposes that the Shimmer is refracting everything.
Personally I think refracting DNA does not make sense at all, but it is still a fairly obvious echo.
Two: during Lena's recollection, she drinks some water, and part of it drifts upward while part goes down. The shot lingers for quite a while; I first thought surface tension, probably over-reading. Then I drank some water myself and found it does not spread like that; in my glass a few drops simply cling to the wall. But then water is not alive, it can hardly mutate, so I do not really understand this shot either; still, it clearly wants to express something, so it had to be mentioned.
Lena fighting her copy
We have basically established that a copy initially mirrors every action. Lena runs for the door, so the copy runs for the door, and the two overlap, one pressing the other. When Lena is pinned against the door she tries to fight back, visibly with all her strength, and does not move an inch, truly not an inch. This inexplicably reminded me of Newton's third law: however much force you apply, that much force is applied back on you. So when she is nearly suffocated and out of strength, naturally both collapse.
The ending # I am sleepy, so, the ending. It is rather interesting, because it is yet another open ending.
What we can be sure of is that Kane is a copy: it is on screen, and the copy says so himself, so no question there. The main question is whether Lena is a copy.
My personal feeling is that she is not a copy, and not herself any more either.
The reason people think she might be a copy is a possible swap during the fight. But I cannot understand how the original Lena would go from human form to "metallic" form, back to human, and then burn into metal. The whole film is hardly scientific, but that explanation is far too unscientific. I refuse to believe a swap happened.
Still, both of their eyes change colour at the end, so both do carry mutated genes. But Lena stayed inside for so long that she must have mutated anyway. Never mind that everyone mutates differently, some grow flowers, some have moving fingerprints, some turn into fluid; at least this explanation hangs together better than the swap explanation, I feel.
The final picture, then, is two people carrying mutated genes returning to human society, as husband and wife no less. To me it is like a malignant tumour that will recur, merely lying dormant.
I think that is what sci-fi is about: the bits of physics, chemistry and biology you learned as a kid can still give you a flicker of recognition now. Pretty good.
","date":"11 June 2018","externalUrl":null,"permalink":"/2018/06/movie-log-1-annihilation/","section":"posts","summary":"I had planned to write this tomorrow morning, but by then I would probably have forgotten most of it, so better to write while it is hot. This cannot count as a film review, only impressions (I am probably not yet at review-writing level).\n","title":"Movie Log 1 - Annihilation","type":"post"},{"content":"Bull is an American drama television series starring Michael Weatherly. CBS ordered the program to series on May 13, 2016, and it premiered on September 20, 2016. The show is based on the early days of talk show host Dr. Phil McGraw's career, when he was a trial consultant. ——— Wikipedia
Yes, I am lazy, I know.
General Feedback # All in all, it is a satisfying experience to watch the series, because of the diverse topics and categories of the cases presented and the charming in-house counsel Benjamin Colón. I am strangely less attracted to the tech person, in this case the rather beautiful "Cable", probably due to my own knowledge of the related fields. Since I know really nothing about trials and law, I find the attorney super cool.
Comments on the last episode of Season 2 # It is certainly heavy and worthy enough to conclude the entire season with a death-penalty case, and as an audience member I am more than delighted to see a happy ending. However, I would like to point out some flaws in the narrative.
Another deus ex machina ending.
I can see the writers trying to create a drastic reversal in the plot: convicted guilty first, then the entire verdict overturned. However, it is not handled very delicately. Despite some preparation, such as the "cigarettes" and the "limping" planted as the case expands, it is not enough for me, as a viewer, to accept the existence of another suspect, let alone acquit the current one.
Mrs O'Neil's confusing testimony.
It is rather strange for her to testify with such confidence in the first scene, which later developments show to be wrong: by her own confession, she did not wear her glasses while witnessing the crime. Imagine her as a normal person (probably not so normal, since she was emotionally attached to the victims): I have always felt that identifying a person as a murderer requires a certain courage, not because of the potential vengeance and danger, but because of the moral burden a false identification would put on you. I would not dare make such a testimony with that much uncertainty.
A cheesy move on the defendant's girlfriend.
As an audience member, again, I offer no professional opinion on trials or legal matters, but I feel the plaintiff should have had a decent winning chance even without attempting to influence the defendant's girlfriend.
To me it looks like a sloppy move, and it should not have been made in the first place.
Conclusion # The show is generally satisfying and recommended if you want something to watch over dinner. It is definitely not hardcore legal drama, unlike The Practice, which I could not finish due to its less dramatic narrative (yes, I am this shallow); that show focuses more on small people, and consequently on the conscience of the audience, as compared to another popular legal drama, Suits.
The emotional storyline is rather underdeveloped, and I wonder why. It could be turned into a good relationship, and then stories could be spun off from conflicts of interest, character growth, and so on. Probably the writers have something to offer in the next season. It is a good show with many well-built characters so far; please do not waste them.
Looking forward to Season 3.
","date":"4 June 2018","externalUrl":null,"permalink":"/2018/06/drama-thoughts-bull/","section":"posts","summary":"Bull is an American drama television series starring Michael Weatherly. CBS ordered the program to series on May 13, 2016, and it premiered on September 20, 2016. The show is based on the early days of talk show host Dr. Phil McGraw's career, when he was a trial consultant. ——— Wikipedia\nYes, I am lazy, I know.\n","title":"Drama Thoughts - Bull","type":"post"},{"content":"","externalUrl":null,"permalink":"/archives/","section":"archives","summary":"","title":"archives","type":"archives"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"authors","summary":"","title":"authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/series/","section":"series","summary":"","title":"series","type":"series"}]