Thanks! That did it.
@cmacdonald
Okay I tried this:
df = df[['qid', 'query']]
(I used corpus_id from the scenario above as qid.)
Say I have 100 rows in this dataframe, which is now processed row by row by the following function:
def _my_func(indf):
    qRow = indf.iloc[0]
    return pd.DataFrame(
        [[qRow["qid"], qRow["query"], docid] for docid in range(len(colbert))],
        columns=['qid', 'query', 'docid'],
    )
Create and apply pipeline
pipe = pt.apply.by_query(_my_func) >> colbert.index_scorer(query_encoded=False)
pipe(df)
This gives me KeyError: 'score'
Complete error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'score'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pyterrier/apply_base.py:36, in ApplyForEachQuery.transform(self, res)
35 if self.add_ranks:
---> 36 dfs = [add_ranks(df, single_query=True) for df in dfs]
37 rtr = pd.concat(dfs)
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pyterrier/apply_base.py:36, in <listcomp>(.0)
35 if self.add_ranks:
---> 36 dfs = [add_ranks(df, single_query=True) for df in dfs]
37 rtr = pd.concat(dfs)
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pyterrier/model.py:33, in add_ranks(df, single_query)
31 if single_query:
32 # -1 assures that first rank will be FIRST_RANK
---> 33 df["rank"] = df["score"].rank(ascending=False, method="first").astype(int) -1 + FIRST_RANK
34 if STRICT_SORT:
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/core/frame.py:3505, in DataFrame.__getitem__(self, key)
3504 return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
3506 if is_integer(indexer):
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
3622 except KeyError as err:
-> 3623 raise KeyError(key) from err
3624 except TypeError:
3625 # If we have a listlike key, _check_indexing_error will raise
3626 # InvalidIndexError. Otherwise we fall through and re-raise
3627 # the TypeError.
KeyError: 'score'
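Reading the traceback, the failure happens in the ranking step (`add_ranks`) that runs on the output of `_my_func`, before the scorer is ever reached: that step reads `df["score"]`, but the frame returned by `_my_func` only has `qid`, `query` and `docid`. The sketch below reproduces the mechanism with pandas alone. `make_candidates`, `rank_by_score` and `NUM_DOCS` are hypothetical stand-ins, not PyTerrier code.

```python
import pandas as pd

NUM_DOCS = 3  # hypothetical corpus size, standing in for len(colbert)

def make_candidates(indf):
    """Pair the single query in indf with every docid (mirrors _my_func)."""
    q = indf.iloc[0]
    rows = [[q["qid"], q["query"], docid] for docid in range(NUM_DOCS)]
    return pd.DataFrame(rows, columns=["qid", "query", "docid"])

def rank_by_score(df):
    """Hypothetical stand-in for the ranking step seen in the traceback."""
    df = df.copy()
    df["rank"] = df["score"].rank(ascending=False, method="first").astype(int) - 1
    return df

queries = pd.DataFrame({"qid": ["q1"], "query": ["hello world"]})
candidates = make_candidates(queries)

# Ranking a frame that has no score column fails exactly like the report.
try:
    rank_by_score(candidates)
    failed = False
except KeyError as err:
    failed = True
    missing = err.args[0]

# Once a scorer has added a score column, the same ranking step works.
scored = candidates.assign(score=[0.2, 0.9, 0.5])
ranked = rank_by_score(scored)
```

Since the traceback shows the step guarded by `if self.add_ranks:`, passing `add_ranks=False` to `pt.apply.by_query` looks like the likely way to skip it, but that is an assumption worth checking against the installed PyTerrier version.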
Not yet, but I will try it out soon and let you know. I already have some questions about this snippet, though.
Assume my queries are in the same dataframe as my documents (with columns corpus_id, query, corpus). Here each row is a small collection of documents from different domains.
What is the input to _my_func()? Is it one row at a time or the entire dataframe?
What is colbert here? Is it the indexer or the ranking_factory?
Hi,
While using the dense retrieval pipeline via factory.end_to_end(), I notice that not all documents are ranked and returned by the search(query) method of the pipeline. I am struggling a bit to understand where to explicitly provide the number of results to return.
For example, my index has 5000 documents, but the search returns around 3000 documents even when I call search like this:
ranked_df = (end_to_end_factory % 5000).search('my query')
Can someone please guide me in this regard?
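One thing that may explain the behaviour: the `%` operator is a rank cutoff, so it can only trim the list a retriever has already produced and cannot ask the retriever for more. A minimal pandas sketch of that idea follows; `rank_cutoff` is a hypothetical stand-in for `% k`, and the three-row frame is made-up data.

```python
import pandas as pd

def rank_cutoff(results, k):
    """Keep at most k top-scored rows per qid (hypothetical stand-in for `% k`)."""
    ordered = results.sort_values(["qid", "score"], ascending=[True, False])
    return ordered.groupby("qid").head(k).reset_index(drop=True)

# Pretend the retriever only produced 3 candidates for this query.
retrieved = pd.DataFrame({
    "qid": ["q1"] * 3,
    "docno": ["d7", "d2", "d9"],
    "score": [0.4, 0.9, 0.1],
})

top2 = rank_cutoff(retrieved, 2)        # trimmed to 2 rows
top5000 = rank_cutoff(retrieved, 5000)  # still only the 3 rows that exist
```

If that reading is right, the number of returned documents is decided inside the retrieval stage, and a larger cutoff afterwards will not change it.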
Thanks for the quick response!
Apparently the maximum depth for a faiss GpuIndex search is 2048 items, which probably means I can't rank all the documents this way:
RuntimeError: Error in virtual void faiss::gpu::GpuIndex::search(faiss::Index::idx_t, const float*, faiss::Index::idx_t, float*, faiss::Index::idx_t*) const at /project/faiss/faiss/gpu/GpuIndex.cu:248: Error: 'k <= (Index::idx_t)getMaxKSelection()' failed: GPU index only supports k <= 2048 (requested 21825)
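To make the constraint concrete, here is a small sketch of the arithmetic. It assumes the 2048 limit quoted in the error applies to each GPU search call; `effective_k` and `covers_all_docs` are hypothetical helper names, not part of faiss or PyTerrier.

```python
GPU_MAX_K = 2048  # limit quoted in the error message above

def effective_k(requested_k, num_docs, max_k=GPU_MAX_K):
    """Largest depth that can be requested without tripping the k limit."""
    if requested_k < 1:
        raise ValueError("requested_k must be positive")
    return min(requested_k, num_docs, max_k)

def covers_all_docs(num_docs, max_k=GPU_MAX_K):
    """True when a single top-k search could return every document."""
    return num_docs <= max_k

print(effective_k(21825, 21825))  # depth capped by the GPU limit
print(covers_all_docs(5000))      # a 5000-document index exceeds the limit
```

Under that assumption, ranking every document needs a route that does not depend on one top-k search, such as scoring an explicit candidate list as in the index_scorer pipeline discussed elsewhere in this thread.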