Big Data Analysis Engineer

[Big Data Analysis Engineer Practical Exam] Datamanim Data Preprocessing 100 Problems (Nos. 20-43)

프로그린 2024. 6. 16. 13:07

Filtering & Sorting

020. Load the data.

import pandas as pd

DataUrl = 'https://raw.githubusercontent.com/Datamanim/pandas/main/chipo.csv'
df = pd.read_csv(DataUrl)
type(df)
[output]
pandas.core.frame.DataFrame

 

021. Extract the rows where the quantity column equals 3 and print the first 5 rows.

df[df['quantity'] == 3].head(5)
[output]

 

022. Extract the rows where the quantity column equals 3, reset the index so that it starts from 0, and print the first 5 rows.

df[df['quantity'] == 3].reset_index(drop = True).head()
[output]
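
To see what `drop = True` changes, here is a minimal sketch on a small hypothetical DataFrame (`toy` is made up for illustration and is not part of chipo.csv). Without `drop`, `reset_index` keeps the old labels in a new `index` column.

```python
import pandas as pd

# Hypothetical toy data; the real exercise uses chipo.csv
toy = pd.DataFrame({'quantity': [1, 3, 2, 3]})

filtered = toy[toy['quantity'] == 3]        # rows keep their original labels 1 and 3
kept = filtered.reset_index()               # old labels move into an 'index' column
dropped = filtered.reset_index(drop=True)   # old labels are discarded

print(kept.columns.tolist())
print(dropped.index.tolist())
```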

 

023. Define a new DataFrame consisting of the two columns quantity and item_price.

df_new = df[['quantity', 'item_price']]
df_new
[output]

 

024. Remove the dollar sign from the item_price column, convert the values to float, and store them in a new_price column.

df['new_price'] = df['item_price'].apply(lambda x: x.replace('$', '')).astype(float)
df['new_price']
[output]
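
As an alternative to `apply` with a lambda, the same cleanup can probably be written with the vectorized string accessor. This is a sketch on hypothetical toy values that only mimic the `$`-prefixed format; `regex=False` is passed so that `$` is treated as a literal character rather than a regex anchor.

```python
import pandas as pd

# Hypothetical toy data mimicking the '$'-prefixed price strings
toy = pd.DataFrame({'item_price': ['$2.39', '$3.39', '$16.98']})

# Remove the literal '$' and convert the remaining text to float
toy['new_price'] = (
    toy['item_price']
    .str.replace('$', '', regex=False)
    .astype(float)
)

print(toy['new_price'].tolist())
```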

 

025. Extract the rows where new_price is 5 or less and count them.

df[df['new_price'] <= 5].shape[0]
[output]
1652
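
The count can also be read directly from the boolean mask, since `True` sums as 1. A minimal sketch on hypothetical toy prices (not the chipo.csv values):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'new_price': [2.39, 5.0, 8.75, 1.69]})

mask = toy['new_price'] <= 5          # boolean Series, one entry per row
count_by_shape = toy[mask].shape[0]   # filter first, then count rows
count_by_sum = int(mask.sum())        # count the True values directly

print(count_by_shape, count_by_sum)
```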

 

026. Extract the rows whose item_name is Chicken Salad Bowl and reset the index.

df[df['item_name'] == 'Chicken Salad Bowl'].reset_index(drop = True)
[output]

 

027. Extract the rows where new_price is 9 or less and item_name is Chicken Salad Bowl.

df[(df['new_price'] <= 9) & (df['item_name'] == 'Chicken Salad Bowl')].head()
[output]
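
Each condition needs its own parentheses because `&` binds more tightly than `<=` and `==`. If that feels error-prone, `DataFrame.query` is one possible alternative. The sketch below uses hypothetical toy rows to check that both forms select the same data.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'item_name': ['Chicken Salad Bowl', 'Chicken Salad Bowl', 'Izze'],
    'new_price': [8.75, 11.25, 3.39],
})

# Boolean-mask form: parentheses around each condition are required
by_mask = toy[(toy['new_price'] <= 9) & (toy['item_name'] == 'Chicken Salad Bowl')]

# Equivalent query-string form
by_query = toy.query("new_price <= 9 and item_name == 'Chicken Salad Bowl'")

print(by_mask.equals(by_query))
```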

 

028. Sort df by the new_price column in ascending order and reset the index.

df.sort_values(by = 'new_price').reset_index(drop = True)
[output]

 

029. Print the rows of df whose item_name contains Chips.

df[df['item_name'].str.contains('Chips')]
[output]

 

030. Print a DataFrame containing only the even-positioned columns of df (0-based positions 0, 2, 4, ...).

df.iloc[:,::2]
[output]
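
The slice `::2` starts at position 0 and steps by 2, so it keeps the 1st, 3rd, 5th, ... columns when counting from one. A short sketch with hypothetical column names shows the difference from `1::2`:

```python
import pandas as pd

# Hypothetical toy frame with five columns
toy = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['a', 'b', 'c', 'd', 'e'])

even_positions = toy.iloc[:, ::2]    # positions 0, 2, 4
odd_positions = toy.iloc[:, 1::2]    # positions 1, 3

print(even_positions.columns.tolist())
print(odd_positions.columns.tolist())
```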

 

031. Sort df by the new_price column in descending order and reset the index.

df.sort_values('new_price', ascending = False).reset_index(drop = True)
[output]
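
Assuming pandas 1.0 or later, `sort_values` also accepts `ignore_index=True`, which should give the same result as chaining `reset_index(drop = True)`. A sketch on hypothetical toy prices:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'new_price': [2.39, 16.98, 8.75]})

# Two ways to sort in descending order and renumber the rows from 0
chained = toy.sort_values('new_price', ascending=False).reset_index(drop=True)
one_call = toy.sort_values('new_price', ascending=False, ignore_index=True)

print(chained.equals(one_call))
```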

 

032. Index the rows of df whose item_name is Steak Salad or Bowl.

df[(df['item_name'] == 'Steak Salad') | (df['item_name'] == 'Bowl')]
[output]
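
When a column is compared against several candidate values, `isin` is a compact alternative to chaining `==` with `|`. The sketch below uses hypothetical toy names to check that both forms pick the same rows.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'item_name': ['Steak Salad', 'Izze', 'Bowl', 'Chicken Bowl']})

# Explicit OR of two equality tests
by_or = toy[(toy['item_name'] == 'Steak Salad') | (toy['item_name'] == 'Bowl')]

# Membership test against a list of allowed values
by_isin = toy[toy['item_name'].isin(['Steak Salad', 'Bowl'])]

print(by_or.equals(by_isin))
```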

 

033. Build a DataFrame from the rows of df whose item_name is Steak Salad or Bowl, then remove rows that are duplicates with respect to item_name, keeping only the first occurrence.

df_new = df[(df['item_name'] == 'Steak Salad') | (df['item_name'] == 'Bowl')]
df_new.drop_duplicates('item_name')
[output]

 

034. Build a DataFrame from the rows of df whose item_name is Steak Salad or Bowl, then remove rows that are duplicates with respect to item_name, keeping only the last occurrence.

df_new = df[(df['item_name'] == 'Steak Salad') | (df['item_name'] == 'Bowl')]
df_new.drop_duplicates('item_name', keep = 'last')
[output]
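
The effect of `keep` is easier to see on a few rows. In this sketch the `order_id` values are hypothetical and exist only to show which duplicate survives.

```python
import pandas as pd

# Hypothetical toy data: each item_name appears twice
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'item_name': ['Steak Salad', 'Bowl', 'Steak Salad', 'Bowl'],
})

first_kept = toy.drop_duplicates('item_name')               # default keep='first'
last_kept = toy.drop_duplicates('item_name', keep='last')   # keep the final occurrence

print(first_kept['order_id'].tolist())
print(last_kept['order_id'].tolist())
```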

 

035. Index the rows of df whose new_price is greater than or equal to the mean of new_price.

df[df['new_price'] >= df['new_price'].mean()]
[output]

 

036. Change the item_name of the rows of df where it is Izze to Fizzy Lizzy.

df.loc[df['item_name'] == 'Izze', 'item_name'] = 'Fizzy Lizzy'
df.head()
[output]
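
Assigning through `df[mask]['col'] = ...` writes to a temporary copy, so the original frame is left unchanged. A single `.loc` call with a row mask and a column label writes in place. The sketch below uses hypothetical toy rows and also shows `Series.replace`, which returns a new Series instead of modifying the frame.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'item_name': ['Izze', 'Chicken Bowl', 'Izze']})

# Non-mutating alternative: returns a new Series
replaced = toy['item_name'].replace('Izze', 'Fizzy Lizzy')

# In-place update: row mask and column label in one .loc call
toy.loc[toy['item_name'] == 'Izze', 'item_name'] = 'Fizzy Lizzy'

print(toy['item_name'].tolist())
```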

 

037. Count the rows of df whose choice_description is NaN.

df['choice_description'].isnull().sum()
[output]
1246

 

038. Replace the NaN values in choice_description with NoData (use loc).

df.loc[df['choice_description'].isnull(), 'choice_description'] = 'NoData'
df
[output]
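
The exercise asks for `loc`, but `fillna` is a common alternative for the same replacement. This sketch on hypothetical toy values compares the two approaches.

```python
import pandas as pd

# Hypothetical toy data with one missing description
toy = pd.DataFrame({'choice_description': ['[Clementine]', None, '[Black Beans]']})

# fillna returns a new Series and leaves toy unchanged
filled = toy['choice_description'].fillna('NoData')

# loc-based assignment on a copy, as in the exercise
by_loc = toy.copy()
by_loc.loc[by_loc['choice_description'].isnull(), 'choice_description'] = 'NoData'

print(filled.tolist())
```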

 

039. Index the rows of df whose choice_description contains Black.

df[df['choice_description'].str.contains('Black')]
[output]

 

040. Print the number of rows of df whose choice_description does not contain Vegetables.

len(df[~df.choice_description.str.contains('Vegetables')])
[output]
3900
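
`str.contains` returns a missing value for missing entries, so negating the mask with `~` is only safe once the NaN values have been filled (as in problem 038) or when `na=False` is passed. A sketch on hypothetical toy descriptions:

```python
import pandas as pd

# Hypothetical toy data with one missing description
toy = pd.DataFrame({
    'choice_description': ['[Fajita Vegetables, Rice]', None, '[Black Beans]'],
})

# na=False treats missing descriptions as "does not contain"
has_veg = toy['choice_description'].str.contains('Vegetables', na=False)
without_veg_count = int((~has_veg).sum())

print(without_veg_count)
```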

 

041. Extract all rows of df whose item_name starts with N.

df[df.item_name.str.startswith('N')].head()
[output]

 

042. Index the rows of df whose item_name value is 15 or more characters long.

df[df.item_name.str.len() >= 15]
[output]
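
Note that `str.len()` counts characters, not words. If a word count were wanted instead, one possible approach is to split on whitespace first. The sketch below contrasts the two on hypothetical toy names.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'item_name': ['Chips and Fresh Tomato Salsa', 'Izze', 'Chicken Bowl'],
})

char_count = toy['item_name'].str.len()               # characters per name
word_count = toy['item_name'].str.split().str.len()   # whitespace-separated words

long_names = toy[char_count >= 15]                    # same filter as the exercise

print(char_count.tolist())
print(word_count.tolist())
```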

 

043. Build the DataFrame of rows of df whose new_price is in lst and print the number of rows.

lst = [1.69, 2.39, 3.39, 4.45, 9.25, 10.98, 11.75, 16.98]
df[df.new_price.isin(lst)].shape[0]
[output]
1393

Source: https://www.datamanim.com/dataset/99_pandas/pandasMain.html