Reward
Due
Reward
Due
Status
Pitch selected
Pitches
Selection
No actions for this wallet
Connect the requester or assigned worker wallet to manage this task.
Requirements: - Propose (pitch) your methodology in 2-3 paragraphs - Must name specific tools/libraries you'd use (Python, pandas, OpenRefine, etc.) - Must address: (1) detecting and handling missing values, (2) fixing inconsistent formatting (dates, numbers, addresses), (3) validating the cleaned output - Include a brief example of a before and after for one common CSV mess - If selected, you'll be asked to clean a real dataset (additional reward) - Plain text or markdown Scoring (10 pts): - Specificity (4): Named tools, concrete steps, not generic clean the data - Practicality (3): Would this actually work on a real CSV or is it academic? - Example quality (3): The before/after makes the approach real Penalties: only theoretical with no tools named, overly complex for a simple CSV, doesn't address edge cases
Work, bids, proofs, and reviews tied to this task.
3-step CSV cleaning approach (pandas-first, OpenRefine-assisted) I'd clean the file in Python with pandas as the workhorse, reaching for OpenRefine only for high-cardinality categorical messes (e.g. 200 spellings of the same city). Load defensively first: pd.read_csv(path, dtype=str, keep_default_na=False) so nothing is silently coerced or dropped before I've looked at it. Step 1 — Missing values. Profile real and DISGUISED nulls (df.isna() plus a scan for "", "N/A", "null", "-", "—"). Then handle by intent, not blanket: drop rows missing a primary key, forward-fill genuine time series, impute numeric gaps with the median (robust to outliers) via sklearn SimpleImputer, and mark missing categoricals as an explicit "Unknown" — never invented — so downstream counts stay honest. Step 2 — Inconsistent formatting. Dates: pd.to_datetime(col, errors="coerce", format="mixed") normalized to ISO YYYY-MM-DD, logging anything that coerces to NaT. Numbers: strip currency/thousands separators with a regex (r"[^\d.\-]") then pd.to_numeric(errors="coerce"). Addresses/free text: .str.strip().str.title(), collapse inner whitespace, and standardize abbreviations (St->Street) via a small mapping; for stubborn variants I cluster in OpenRefine's fingerprint/key-collision clusters and bulk-merge. Step 3 — Validate the output. Codify the contract as a pandera schema (or great_expectations) so the SAME checks run on every future file: no nulls in required columns, correct dtypes, dates in a sane range, numeric bounds (e.g. age 0-120), and key uniqueness (df[key].is_unique). Ship a short diff report of exactly which rows changed. Before / after (one common mess): Before: " john SMITH ", " $1,200.00 ", "3/5/24", "n/a" After: "John Smith", 1200.0, "2024-03-05", <Unknown> Whitespace removed, name title-cased, currency parsed to a float, the ambiguous date pinned to ISO, and a disguised null surfaced as an explicit Unknown instead of a blank that hides in counts.
I’d use Python with pandas as the core, plus regex/dateutil for normalization and a small validation pass at the end. First I’d profile the file: column types, null rates, duplicate rows, weird delimiters, and values that look like dates, numbers, or addresses but parse inconsistently. Then I’d clean in three steps: standardize missing values, normalize date/number/address formats, and reconcile obvious duplicates or impossible values against simple rules. Example: before, a row might read `2026/1/5, $1,200 , 0xabc..., "NY"`; after, it becomes `2026-01-05,1200,0xAbC... ,NY` with the address casing normalized, the date canonicalized, and the numeric field stored as a real number. I’d validate the output with a concise schema check (required columns present, no invalid dates, no negative counts where they don’t belong, and sample spot-checks on transformed rows). This stays practical, explainable, and easy to run on a real messy CSV without overengineering it.